Latent Space: The AI Engineer Podcast - RAG Is A Hack - with Jerry Liu from LlamaIndex
Episode Date: October 5, 2023Want to help define the AI Engineer stack? >800 folks have weighed in on the top tools, communities and builders for the first State of AI Engineering survey, which we will present for the first time ...at next week’s AI Engineer Summit. Join us online!This post had robust discussion on HN and Twitter.In October 2022, Robust Intelligence hosted an internal hackathon to play around with LLMs which led to the creation of two of the most important AI Engineering tools: LangChain 🦜⛓️ (our interview with Harrison here) and LlamaIndex 🦙 by Jerry Liu, which we’ll cover today. In less than a year, LlamaIndex has crossed 600,000 monthly downloads, raised $8.5M from Greylock, has a fast growing open source community that contributes to LlamaHub, and it doesn’t seem to be slowing down.LlamaIndex’s Origin (aka GPT Tree Index)Jerry struggled to make large amounts of data work with GPT-3 (which had a 4,096 tokens context window). Today LlamaIndex is at the forefront of the RAG wave (Retrieval Augmented Generation), but in the beginning Jerry wasn’t focused on embeddings and search, but rather on understanding how models could summarize, link, and reason about data. On November 5th, Jerry pushed the first version to Github under the name “GPT Tree Index”: The GPT Tree Index first takes in a large dataset of unprocessed text data as input. It then builds up a tree-index in a bottom-up fashion; each parent node is able to summarize the children nodes using a general summarization prompt; each intermediate node containing summary text summarizing the components below. Once the index is built, it can be saved to disk and loaded for future use.Then, say the user wants to use GPT-3 to answer a question. Using a query prompt template, GPT-3 will be able to recursively perform tree traversal in a top-down fashion in order to answer a question. For example, in the very beginning GPT-3 is tasked with selecting between *n* top-level nodes which best answers a provided query, by outputting a number as a multiple-choice problem. The GPT Tree Index then uses the number to select the corresponding node, and the process repeats recursively among the children nodes until a leaf node is reached.[…]How is this better than an embeddings-based approach / other state-of-the-art QA and retrieval methods?The intent is not to compete against existing methods. A simpler embedding-based technique could be to just encode each chunk as an embedding and do a simple question-document embedding look-up to retrieve the result. This project is a simple exercise to test how GPT can organize and lookup information.The project attracted a lot of attention early on (the announcement tweet has ~330 likes), but it wasn’t until ~February 2023 that the open source community really started to explode, which was around the same time that LlamaHub was released. LlamaHub made it easy for developers to import data from Google Drive, Discord, Slack, databases, and more into their LlamaIndex projects. What is LlamaIndex? As we mentioned, LlamaIndex is leading the charge in the development of the RAG stack. RAG boils down to two parts:* Indexing (i.e. how do you load and index the data in your knowledge base)* Querying (i.e. how do you surface the data and fit it in the model context) IndexingTo get your data from all your sources to your RAG knowledge base, you can leverage a few tools: * Documents / Nodes: A Document is a generic container around any data source - for instance, a PDF, an API output, or retrieved data from a database. A Node is the atomic unit of data in LlamaIndex and represents a “chunk” of a source Document (i.e. one Document has many Node) as well as its relationship to other Node objects.* Data Connectors: A data connector ingest data from different sources and turn them into Document representations (text and simple metadata). These connectors are offered through LlamaHub, and there are over 200 of them today.* Data Indexes: Once you’ve ingested your data, LlamaIndex will help you index the data into a format that’s easy to retrieve. There are many types of indexes (Summary, Tree, Vector, etc). Under the hood, LlamaIndex parses the raw documents into intermediate representations, calculates vector embeddings, and infers metadata. The most commonly used index is the VectorStoreIndex, which can then be paired with any of the vector stores out there (an example with Chroma).QueryingThe RAG pipeline, during the querying phase, sources the most pertinent context from a user's prompt, forwarding it along to the LLM. This equips the LLM with current / private knowledge beyond its foundational training data. LlamaIndex offers adaptable modules tailored for building RAG pathways for Q&A, chatbots, or agent use, since each of them has different requirements. For example, a chatbot should expect the user to interject with follow up questions, while an agent will try to carry out a whole task on its own without user intervention. Building Blocks* Retrievers: A retriever defines how to efficiently retrieve relevant context from a knowledge base (i.e. index) when given a query. Vector index is the most popular mode, but there are other options like Summary, Tree, Keyword Table, Knowledge Graph, and Document Summary. * Node Postprocessors: Once the retriever gets you Node objects back, you will need to do additional work like discarding low similarity ones. There are many options here as well, such as `SimilarityPostprocessor` (i.e. drop nodes below a certain similarity score) or `LongContextReorder` which helps avoid the issues raised in the “Lost in the Middle, U-shaped recollection curve” paper. * Response Synthesizers: Takes a user query and your retrieved chunks, and prompts and LLM with them. There are a few response modes here that balance thoroughness and compactness.Pipelines* Query Engines: A query engine is an end-to-end pipeline that allow you to ask question over your data. It takes in a natural language query, and returns a response, along with reference context retrieved and passed to the LLM. This makes it possible to do things like “Ask panda questions” by leveraging Panda dataframes as a data source. * Chat Engines: A chat engine is an end-to-end pipeline for having a conversation with your data (multiple back-and-forth instead of a single question & answer). This supports traditional OpenAI-style chat interfaces, as well as more advanced ones like ReAct.* Agents: An agent is an automated decision maker (powered by an LLM) that interacts with the world via a set of tools. Agent may be used in the same fashion as query engines or chat engines, but they have the power to both read and write data. For reasoning, you can use either OpenAI Functions or ReAct. Both can leverage the tools offered through LlamaHub for further analysis.RAG vs FinetuningNow that you have a full overview of what LlamaIndex does, the next question is “When should I use this and when should I fine tune?”. Jerry’s TLDR is that “RAG is just a hack”, but a powerful one. Each option has pros and cons:* Lower investment: RAG requires almost 0 upfront investment, unlike finetuning which requires data cleaning, model training, increased costs for finetuned inference, etc.* Stricter access control and higher visibility: when finetuning, the model learns everything. With RAG, you can decide what documents the index should have access to, making it more secure by default. You are also able to see everything that was passed into the context if a response doesn’t look right.* Context window limitation: you can only fit so many tokens into the prompt due to the way models work. Finetuning helps you circumvent that by compressing the knowledge into the model weights rather than putting it in the prompt. As Jerry says, the best way to know this inside out is to learn to build RAG from scratch (without LlamaIndex) - and they have plenty of tutorials on his Twitter and blog to learn this.The other issue is that the math for finetuning isn’t well known yet as we discussed with Quentin Anthony from Eleuther, so unless you have money and time to invest into exploring fine tuning, you’re better off starting with RAG. Full YouTube Discussion!Show Notes* LlamaIndex* LlamaHub* SEC Insights* Robust Intelligence* Quora’s Poe* Chroma* Vespa* Why should every AI engineer learn to build RAG from scratch?* LangChain* Gorilla* Lost in the Middle: How Language Models Use Long ContextsTimestamps* [00:00:00] Introductions and Jerry’s background* [00:04:30] Starting LlamaIndex as a side project* [00:05:11] Evolution from tree-index to current LlamaIndex and LlamaHub architecture* [00:11:39] Deciding to leave Robust to start the LlamaIndex company and raising funding* [00:20:06] Context window size and information capacity for LLMs* [00:21:34] Minimum viable context and maximum context for RAG* [00:22:52] Fine-tuning vs RAG - current limitations and future potential* [00:24:02] RAG as a hack but good hack for now* [00:26:19] RAG benefits - transparency and access control* [00:27:46] Potential for fine-tuning to take over some RAG capabilities* [00:30:04] Baking everything into an end-to-end trained LLM* [00:33:24] Similarities between iterating on ML models and LLM apps* [00:34:47] Modularity and customization options in LlamaIndex: data loading, retrieval, synthesis, reasoning* [00:40:16] Evaluating and optimizing each component of Lama Index system* [00:46:02] Building retrieval benchmarks to evaluate RAG* [00:47:24] SEC Insights - open source full stack LLM app using LlamaIndex* [00:49:48] Enterprise platform to complement LlamaIndex open source* [00:51:00] Community contributions for LlamaHub data loaders* [00:53:21] LLM engine usage - majority OpenAI but options expanding* [00:56:25] Vector store landscape* [00:59:46] Exploring relationships and graphs within data* [01:03:24] Additional complexity of evaluating agent loops* [01:04:01] Lightning RoundTranscriptAlessio: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO of Residence and Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI. [00:00:20]Swyx: And today we finally have Jerry Liu on the podcast. Hey Jerry. [00:00:24]Jerry: Hey guys. Hey Swyx and Alessio. Thanks for having me. [00:00:27]Swyx: It's kind of weird because we keep running into each other in San Francisco AI events, so it's kind of weird to finally just have a conversation recorded for everybody else. [00:00:34]Jerry: Yeah, I know. I'm really looking forward to this, aside from the questions. [00:00:38]Swyx: So I tend to introduce people on their formal background and then ask something on the more personal side. So you are part of the Princeton gang. [00:00:46]Jerry: I don't know if there is like official Princeton gang. [00:00:48]Swyx: No, small Princeton gang. Okay. I attended your meeting. There was like four of you with Prem and the others. And then you have a bachelor's in CS and a certificate in finance. That's also fun. I also did finance and I think I saw that you also interned at Two Sigma where I worked in New York. You were a machine learning engineer. [00:01:06]Jerry: You were at Two Sigma?Swyx: Yeah, very briefly.Jerry: Oh, cool. I didn't know that. [00:01:09]Swyx: That was my first like proper engineering job before I went into DevRel. [00:01:12]Jerry: Oh, okay. Nice. [00:01:14]Swyx: And then you were a machine learning engineer at Quora, AI research scientist at Uber for three years, and then two years machine learning engineer at Robust Intelligence before starting LlamaIndex. So that's your LinkedIn. It's not only LinkedIn that people should know about you. [00:01:27]Jerry: I think back during my Quora days, I had this like three-month phase where I just wrote like a ton of Quora answers. And so I think if you look at my tweets nowadays, you can basically see that as like the V2 of my three-month like Forrestant where I just like went ham on Quora for a bit. I actually, I think I was back then actually when I was working on Quora, I think the thing that everybody was fascinated in was just like general like deep learning advancements and stuff like GANs and generative like images and just like new architectures that were evolving. And it was a pretty exciting time to be a researcher actually, because you were going in like really understanding some of the new techniques. So I kind of use that as like a learning opportunity, basically just like read a bunch of papers and then answer questions on Quora. And so you can kind of see traces of that basically in my current Twitter where it's just like really about kind of like framing concepts and trying to make it understandable and educate other users on it. Yeah. [00:02:17]Swyx: I've said, so a lot of people come to me for my Twitter advice, but like, I think you are doing one of the best jobs in AI Twitter, which is explaining concepts and just consistently getting hits out. Thank you. I didn't know it was due to the Quora training. Let's just sign on on Quora. A lot of people, including myself, like kind of wrote off Quora as like one of the web 1.0 like sort of question answer forums. But now I think it's becoming, seeing a resurgence obviously due to Poe and obviously Adam and D'Angelo has always been a leading tech figure, but what do you think is kind of underrated about Quora? [00:02:46]Jerry: Well, I mean, I like the, I really liked the mission of Quora when I, when I joined. In fact, I interned there like in 2015 and I joined full time in 2017. One is like they had, and they have like a very talented engineering team and just like really, really smart people. And the other part is the whole mission of the company is to just like spread knowledge and to educate people. And to me that really resonated. I really liked the idea of just like education and democratizing the flow of information. If you imagine like kind of back then it was like, okay, you have Google, which is like for search, but then you have Quora, which is just like user generated, like grassroots type content. And I really liked that concept because it's just like, okay, there's certain types of information that aren't accessible to people, but you can make accessible by just like surfacing it. And so actually, I don't know if like most people know that about like Quora and if they've used the product, whether through like SEO, right, or kind of like actively, but that really was what drew me to it. [00:03:39]Swyx: Yeah. I think most people challenges with it is that sometimes you don't know if it's like a veiled product pitch, right? [00:03:44]Jerry: Yeah. Of course, like quality of the answer matters quite a bit. And then you start running into these like- [00:03:47]Swyx: It's like five alternatives and then here's the one I work on. Yeah. [00:03:50]Jerry: Like recommendation issues and all that stuff. I used, I worked on recsys at Quora actually, so I got a taste of some of that stuff. Well, I mean, I kind of more approached it from machine learning techniques, which might be a nice segue into RAG actually. A lot of it was just information retrieval. We weren't like solving anything that was like super different than what was standard in the industry at the time, but just like ranking based on user preferences. I think a lot of Quora was very metrics driven. So just like trying to maximize like daily active hours, like time spent on site, those types of things. And all the machine learning algorithms were really just based on embeddings. You have a user embedding and you have like item embeddings and you try to train the models to try to maximize the similarity of these. And it's basically a retrieval problem. [00:04:30]Swyx: Okay. So you've been working on RAG for longer than most people think? [00:04:33]Jerry: Well, kind of. So I worked there for like a year, right, just transparently. And then I worked at Uber where I was not working on ranking. It was more like kind of deep learning training for self-driving and computer vision and that type of stuff. But I think in the LLM world, it's kind of just like a combination of like everything these days. I mean, retrieval is not really LLMs, but like it fits within the space of like LLM apps. And then obviously like having knowledge of the underlying deep learning architectures helps. Having knowledge of basic software engineering principles helps too. And so I think it's kind of nice that like this whole LLM space is basically just a combination of just like a bunch of stuff that you probably like people have done in the past. [00:05:11]Swyx: It's good. It's like a summary capstone project. Yeah, exactly. [00:05:14]Jerry: Yeah. [00:05:15]Alessio: And before we dive into LlamaIndex, what do they feed you a robust intelligence that both you and Harrison from LangChain came out of it at the same time? Was there like, yeah. Is there any fun story of like how both of you kind of came up with kind of like core infrastructure to LLM workflows today? Or how close were you at robust? Like any fun behind the scenes? [00:05:37]Jerry: Yeah. Yeah. We, um, we work pretty closely. I mean, we were on the same team for like two years. I got to know Harrison and the rest of the team pretty well. I mean, I have a respect that people there, the people that were very driven, very passionate. And it definitely pushed me to be, you know, a better engineer and leader and those types of things. Yeah. I don't really have a concrete explanation for this. I think it's more just, we have like an LLM hackathon around like September. This was just like exploring GPT-3 or it was October actually. And then the day after I went on vacation for a week and a half, and so I just didn't track Slack or anything. And then when I came back, saw that Harrison started LangChain [00:06:09]Swyx: Oh that's cool. [00:06:10]Jerry: I was like, oh, I'll play around with LLMs a bit and then hacked around on stuff. And I think I've told the story a few times, but you know, I was like trying to feed in information into GPT-3. And then, then you deal with like context window limitations and there was no tooling or really practices to try to understand how do you, you know, get GPT-3 to navigate large amounts of data. And that's kind of how the project started. Really was just one of those things where early days, like we were just trying to build something that was interesting. Like I wanted to start a company. I had other ideas actually of what I wanted to start. And I was very interested in, for instance, like multimodal data, like video data and that type of stuff. And then this just kind of grew and eventually took over the other idea. [00:06:48]Swyx: Text is the universal interface. [00:06:50]Jerry: I think so. I think so. I actually think once the multimodal models come out, I think there's just like mathematically nicer properties of you can just get like join multiple embeddings, like clip style. But text is really nice because from a software engineering principle, it just makes things way more modular. You can just convert everything into text and then you just represent everything as text. [00:07:08]Swyx: Yeah. I'm just explaining retroactively why working on LlamaIndex took off versus if you had chose to spend your time on multimodal, we probably wouldn't be talking about whatever you ended up working on. [00:07:18]Jerry: Yeah. [00:07:19]Swyx: That's true. It's troubled. Interesting. So November 9th, that was a very productive month. I guess October, November, November 9th, you announced GPT-3 Index and you picked a tree logo. Very cool. Every project must have an emoji. [00:07:32]Jerry: Yeah. Yeah. I probably was somewhat inspired by a light train, but I will admit, yeah. [00:07:37]Swyx: It uses GPT to build a knowledge tree in a bottoms-up fashion by applying a summarization prompt for each node. Yep. Which I like that original vision. Your messaging roundabout then was also that you're creating optimized data structures. What's the sort of journey to that and how does that contrast with LlamaIndex today? Okay. [00:07:56]Jerry: Maybe I can tell a little bit about the beginning intuitions. I think when I first started, this really wasn't supposed to be something that was like a toolkit that people use. It was more just like a system. And the way I wanted to think about the system was more a thought exercise of how language models with their reasoning capabilities, if you just treat them as like brains, can organize information and then traverse it. So I didn't want to think about embeddings, right? To me, embeddings just felt like it was just an external thing that was like, well, it was just external to trying to actually tap into the capabilities of language models themselves, right? I really wanted to see, you know, just as like a human brain could like synthesize stuff, could we create some sort of like structure where this neural CPU, if you will, can like organize a bunch of information, you know, auto-summarize a bunch of stuff and then also traverse the structure that I created. That was the inspiration for this initial tree index, to be honest. And I think I said this in the first tweet, it actually works super well, right? Like GPT-4 obviously is much better at reasoning. I'm one of the first to say, you know, you shouldn't use anything pre-GPT-4 for anything that requires complex reasoning because it's just going to be unreliable, okay, disregarding stuff like fine tuning. But it worked okay. But I think it definitely struck a chord with kind of like the Twitter crowd, which is just like new ideas at the time, I guess, just like thinking about how you can actually bake this into some sort of application. Because I think what I also ended up discovering was the fact that there was starting to become a wave of developers building on top of GPT-3 and people were starting to realize that what makes them really useful is to apply them on top of your personal data. And so even if the solution itself was kind of like primitive at the time, like the problem statement itself was very powerful. And so I think being motivated by the problem statement, right, like this broad mission of how do I unlock elements on top of the data also contributed to the development of LOM index to the state it is today. And so I think part of the reason, you know, our toolkit has evolved beyond the just existing set of like data structures is we really tried to take a step back and think, okay, what exactly are the tools that would actually make this useful for a developer? And then, you know, somewhere around December, we made an active effort to basically like push towards that direction, make the code base more modular, right, more friendly as an open source library. And then also start adding in like embeddings, start thinking into practical considerations like latency, cost, performance, those types of things. And then really motivated by that mission, like start expanding the scope of the toolkit towards like covering the life cycle of like data ingestion and querying. Where you also added Llamahub and yeah, so I think that was in like January on the data loading side. And so we started adding like some data loaders, saw an opportunity there, started adding more stuff on the retrieval querying side, right? We still have like the core data structures, but how do you actually make them more modular and kind of like decouple storing state from the types of like queries that you could run on top of this a little bit. And then starting to get into more complex interactions, like chain of thought reasoning, routing and, you know, like agent loops. [00:10:44]Alessio: You and I spent a bunch of time earlier this year talking about Llamahub, what that might become. You were still at Robust. When did you decide it was time to start the company and then start to think about what LlamaIndex is today? [00:10:58]Jerry: Yeah, I mean, probably December. It was kind of interesting. I was getting some inbound from initial VCs, I was talking about this project. And then in the beginning, I was like, oh, yeah, you know, this is just like a design project. But you know, what about my other idea on like video data, right? And then I was trying to like get their thoughts on that. And then everybody was just like, oh, yeah, whatever, like that part's like a crowded market. And then it became clear that, you know, this was actually a pretty big opportunity. And like, coincidentally, right, like this actually did relate to like, my interests have always been at the intersection of AI data and kind of like building practical applications. And it was clear that this was evolving into a much bigger opportunity than the previous idea was. So around December, and then I think I gave a pretty long notice, but I left officially like early March. [00:11:39]Alessio: What were your thinkings in terms of like moats and, you know, founders kind of like overthink it sometimes. So you obviously had like a lot of open source love and like a lot of community. And you're like, were you ever thinking, okay, I don't know, this is maybe not enough to start a company or did you always have conviction about it? [00:11:59]Jerry: Oh, no, I mean, 100%. I felt like I did this exercise, like, honestly, probably more late December and then early January, because I was just existentially worried about whether or not this would actually be a company at all. And okay, what were the key questions I was thinking about? And these were the same things that like other founders, investors, and also like friends would ask me is just like, okay, what happens if context windows get much bigger? What's the point of actually structuring data right in the right way? Right? Why don't you just dump everything into the prompt, fine tuning, like, what if you just train the model over this data? And then, you know, what's the point of doing this stuff? And then some other ideas is what if like OpenAI actually just like takes this like builds upwards on top of the their existing like foundation models and starts building in some like built in orchestration capabilities around stuff like RAG and agents and those types of things. And so I basically ran through this mental exercise and, you know, I'm happy to talk a little bit more about those thoughts as well. But at a high level, well, context windows have gotten bigger, but there's obviously still a need for a rag. I think RAG is just like one of those things that like, in general, what people care about is, yes, they do care about performance, but they also care about stuff like latency and costs. And so my entire reasoning at the time was just like, okay, like, yes, maybe you will have like much bigger context windows, as we've seen with like 100k context windows. But for enterprises, like, you know, data, which is not in just like the scale of like a few documents, it's usually in like gigabytes, terabytes, petabytes. How do you actually just unlock language models over that data, right? And so it was clear there was just like, whether it's RAG or some other paradigm, no one really knew what that answer was. And so there was clearly like technical opportunity here. Like there was just stacks that needed to be invented to actually solve this type of problem, because language models themselves didn't have access to this data. The other piece here is just like, and so if like you just dumped all this data into, let's say a model had like hypothetically an infinite context window, right? And you just dump like 50 gigabytes of data into a context window. That just seemed very inefficient to me, because you have these network transfer costs of uploading 50 gigabytes of data to get back a single response. And so I kind of realized, you know, there's always going to be some curve, regardless of like the performance of the best performing models of like cost versus performance. What RAG does is it does provide extra data points along that access, because you kind of control the amount of context you actually wanted to retrieve. And of course, like RAG as a term was still evolving back then, but it was just this whole idea of like, how do you just fetch a bunch of information to actually, you know, like stuff into the prompt. And so people even back then were kind of thinking about some of those considerations. [00:14:29]Swyx: And then you fundraised in June, or you announced your fundraiser in June. Yeah. Take us through that process of thinking about the fundraise and your plans for the company, you know, at the time. Yeah, definitely. [00:14:41]Jerry: I mean, I think we knew we wanted to, I mean, obviously we knew we wanted to fundraise. There was also a bunch of like investor interest, and it was probably pretty unusual given the, you know, like hype wave of generative AI. So like a lot of investors were kind of reaching out around like December, January, February. In the end, we went with Greylock. Greylock's great. You know, they've been great partners so far. And to be honest, like there's a lot of like great VCs out there. And a lot of them who are specialized on like open source, data, infra, and that type of stuff. What we really wanted to do was, because for us, like time was of the essence, like we wanted to ship very quickly and still kind of build Mindshare in this space. We just kept the fundraising process very efficient. I think we basically did it in like a week or like three days. And so, yeah, just like front loaded it and then just like pick the one named Jerry. Yeah, exactly. Yeah. [00:15:27]Swyx: I'm kidding. I mean, he's obviously great and Greylock's a fantastic firm. [00:15:32]Jerry: Embedding some of my research. So, yeah, just we've had Greylock. They've been great partners. I think in general, when I talk to founders about like the fundraise process, it's never like the most fun period, I think, because it's always just like, you know, there's a lot of logistics, there's lawyers you have to, you know, get in the loop. And like a lot of founders just want to go back to building. I think in the end, we're happy that we kept it to a pretty efficient process. [00:15:54]Swyx: And so you fundraise with Simon. How do you split things with him? How big is your team now? [00:15:57]Jerry: The team is growing. By the time this podcast is released, we'll probably have had one more person join the team. So basically, it's between, we're rapidly getting to like eight or nine people. At the current moment, we're around like six. And so just like there'll be some exciting developments in the next few weeks. I'm excited to announce that. So the team is, has kind of like, we've been pretty selective in terms of like how we like grow the team. Obviously, like we look for people that are really active in terms of contributions to Lum Index, people that have like very strong engineering backgrounds. And primarily, we've been kind of just looking for builders, people that kind of like grow the open source and also eventually this like managed like enterprise platform as well with us. In terms of like Simon, yeah, I've known Simon for a few years now. I knew him back at Uber ATG in Toronto. He's one of the smartest people I knew, has a sense of both like a deep understanding of ML, but also just like first principles thinking about like engineering and technical concepts in general. And I think one of my criteria, criteria is when I was like looking for a co-founder for this project with someone that was like technically better than me, because I knew I wanted like a CTO. And so honestly, like there weren't a lot of people that, I mean, there's, I know a lot of people that are smarter than me, but like that fit that bill. We're willing to do a startup and also just have the same like values that I shared. Right. And just, I think doing a startup is very hard work, right? It's not like, I'm sure like you guys all know this, it's, it's a lot of hours, a lot of late nights and you want to be like in the same place together and just like being willing to hash out stuff and have that grit basically. And I really looked for that. And so Simon really fit that bill and I think I convinced him to bring Trump on board. [00:17:24]Swyx: Yeah. And obviously I've had the pleasure of chatting and working with a little bit with both of you. What would you say those, those like your top one or two values are when, when thinking about that or the culture of the company and that kind of stuff? [00:17:36]Jerry: I think in terms of the culture of the company, it's really like, I mean, there's a few things I can name off the top of my head. One is just like passion, integrity. I think that's very important for us. We want to be honest. We don't want to like, obviously like copy code or, or kind of like, you know, just like, you know, not give attribution, those types of things and, and just like be true to ourselves. I think we're all very like down to earth, like humble people, but obviously I think just willingness to just like own stuff and dive right in. And I think grit comes with it. I think in the end, like this is a very fast moving space and we want to just like be one of the, you know, like dominant forces and helping to provide like production quality outline applications. Yeah. [00:18:11]Swyx: I promise we'll get to more technical questions, but I also want to impress on the audience that this is a very conscious and intentional company building. And since your fundraising post, which was in June, and now it's September, so it's been about three months, you've actually gained 50% in terms of stars and followers. You've 3x'd your download count to 600,000 a month and your discord membership has reached 10,000. So like a lot of ongoing growth. [00:18:37]Jerry: Yeah, definitely. And obviously there's a lot of room to expand there too. And so open source growth is going to continue to be one of our core goals because in the end it's just like, we want this thing to be, well, one big, right? We all have like big ambitions, but to just like really provide value to developers and helping them in prototyping and also productionization of their apps. And I think it turns out we're in the fortunate circumstance where a lot of different companies and individuals, right, are in that phase of like, you know, maybe they've hacked around on some initial LLM applications, but they're also looking to, you know, start to think about what are the production grade challenges necessary to actually, that to solve, to actually make this thing robust and reliable in the real world. And so we want to basically provide the tooling to do that. And to do that, we need to both spread awareness and education of a lot of the key practices of what's going on. And so a lot of this is going to be continued growth, expansion, education, and we do prioritize that very heavily. [00:19:30]Alessio: Let's dive into some of the questions you were asking yourself initially around fine tuning and RAG , how these things play together. You mentioned context. What is the minimum viable context for RAG ? So what's like a context window too small? And at the same time, maybe what's like a maximum context window? We talked before about the LLMs are U-shaped reasoners. So as the context got larger, like it really only focuses on the end and the start of the prompt and then it kind of peters down. Any learnings, any kind of like tips you want to give people as they think about it? [00:20:06]Jerry: So this is a great question. And part of what I wanted to talk about a conceptual level, especially with the idea of like thinking about what is the minimum context? Like, okay, what if the minimum context was like 10 tokens versus like, you know, 2k tokens versus like a million tokens. Right. Like, and what does that really give you? And what are the limitations if it's like 10 tokens? It's kind of like, um, like eight bit, 16 bit games, right? Like back in the day, like if you play Mario and you have like the initial Mario where the graphics were very blocky and now obviously it's like full HD, 3d, just the resolution of the context and the output will change depending on how much context you can actually fit in. So the way I kind of think about this from a more principled manner is like you have like, there's this concept of like information capacity, just this idea of like entropy, like given any fixed amount of like storage space, like how much information can you actually compact in there? And so basically a context window length is just like some fixed amount of storage space, right? And so there's some theoretical limit to the maximum amount of information you can compact until like a 4,000 token storage space. And what does that storage space use for these days with LLMs? For inputs and also outputs. And so this really controls the maximum amount of information you can feed in terms of the prompt plus the granularity of the output. If you had an infinite context window, you're going to have an infinitely detailed response and also infinitely detailed memory. But if you don't, you can only kind of represent stuff in more quantized bits, right? And so the smaller the context window, just generally speaking, the less details and maybe the less, um, and for like specific, precise information, you're going to be able to surface any given point in time. [00:21:34]Alessio: So when you have short context, is the answer just like get a better model or is the answer maybe, Hey, there needs to be a balance between fine tuning and RAG to make sure you're going to like leverage the context, but at the same time, don't keep it too low resolution? [00:21:48]Jerry: Yeah, yeah. Well, there's probably some minimum threat, like I don't think anyone wants to work with like a 10. I mean, that's just a thought exercise anyways, a 10 token context window. I think nowadays the modern context window is like 2k, 4k is enough for just like doing some sort of retrieval on granular context and be able to synthesize information. I think for most intents and purposes, that level of resolution is probably fine for most people for most use cases. I think the question there is just like, um, the limitations actually more on, okay, if you're going to actually combine this thing with some sort of retrieval data structure mechanism, there's just limitations on the retrieval side because maybe you're not actually fetching the most relevant context to actually answer this question, right? Like, yes, like given the right context, 4,000 tokens is enough. But if you're just doing like top-k similarity, like you might not be able to be fetching the right information from the documents. [00:22:34]Alessio: So how should people think about when to stick with RAG versus when to even entertain and also in terms of what's like the threshold of data that you need to actually worry about fine tuning versus like just stick with rag? Obviously you're biased because you're building a RAG company, but no, no, actually, um, I [00:22:52]Jerry: think I have like a few hot takes in here, some of which sound like a little bit contradictory or what we're actually building. And I think to be honest, I don't think anyone knows the right answer. I think this is the truth. [00:23:01]Alessio: Yeah, exactly. [00:23:01]Jerry: This is just like thought exercise towards like understanding the truth. [00:23:04]Alessio: Right. [00:23:04]Jerry: So, okay. [00:23:05]Alessio: I have a few hot takes. [00:23:05]Jerry: One is like RAG is basically just, just a hack, but it turns out it's a very good hack because what is RAG rag is you keep the model fixed and you just figure out a good way to like stuff stuff into the prompt of the language model and everything that we're doing nowadays in terms of like stuffing stuff into the prompt is just algorithmic. We're just figuring out nice algorithms to, to like retrieve right information with top case similarity, do some sort of like, uh, you know, hybrid search, some sort of like a chain of thought decomp and then just like stuff stuff into a prompt. So it's all like algorithmic and it's more like just software engineering to try to make the most out of these like existing APIs. The reason I say it's a hack is just like from a pure like optimization standpoint. If you think about this from like the machine learning lens, unless the software engineering lens, there's pieces in here that are going to be like suboptimal, right? Like, like the thing about machine learning is when you optimize like some system that can be optimized within machine learning, like the set of parameters, you're really like changing like the entire system's weights to try to optimize the subjective function. [00:24:02]Jerry: And if you just cobble a bunch of stuff together, you can't really optimize the pieces are inefficient, right? And so like a retrieval interface, like doing top cam batting lookup, that part is inefficient. [00:24:13]Jerry: If you, for instance, because there might be potentially a better, more learned retrieval algorithm, that's better. If you know, you do stuff like some sort of, I know nowadays there's this concept of how do you do like short-term and long-term memory represent stuff in some sort of vector embedding, do trunk sizes, all that stuff. It's all just like decisions that you make that aren't really optimized and it's not really automatically learned. It's more just things that you set beforehand to actually feed into the system. So I do think like there is a lot of room to actually optimize the performance of an entire LLM system, potentially in a more like machine learning based way. Right. [00:24:48]Jerry: And I will leave room for that. And this is also why I think like in the long term, I do think fine tuning will probably have like greater importance. And just like there will probably be new architectures invented that where you can actually kind of like include a lot of this under the black box, as opposed to having like hobbling together a bunch of components outside the black box. That said, just very practically given the current state of things, like even if I said RAG is a hack, it's a very good hack and it's also very easy to use. Right. [00:25:16]Jerry: And so just like for kind of like the AI engineer persona, which to be fair is kind of one of the reasons generative AI has gotten so big is because it's way more accessible for everybody to get into, as opposed to just like traditional machine learning, it tends to be good enough. [00:25:30]Jerry: Right. And if we can basically provide these existing techniques to help people really optimize how to use existing systems without having to really deeply understand machine learning, I still think that's a huge value add. And so there's very much like a UX and ease of use problem here, which is just like RAG is way easier to onboard and use. And that's probably like the primary reason why everyone should do RAG instead of fine tuning to begin with. If you think about like the 80-20 rule, like RAG very much fits within that and fine tuning doesn't really right now. And then I'm just kind of like leaving room for the future that, you know, like in the end, fine tuning can probably take over some of the aspects of like what RAG does. [00:26:04]Swyx: I don't know if this is mentioned in your explainability also allows for sourcing. And at the end of the day, like to increase trust that we have to source documents. Yeah. [00:26:14]Jerry: So, so I think what RAG does is it increases like transparency, visibility into the actual documents, right. [00:26:19]Jerry: That are getting fed into their context. [00:26:21]Swyx: Here's where they got it from. [00:26:22]Alessio: Exactly. [00:26:22]Jerry: That's definitely an advantage. I think the other piece that I think is an advantage, and I think that's something that someone actually brought up is just you can do access control with, with RAG . If you have an external storage system, you can't really do that with, with large language models. [00:26:35]Jerry: It's just like gate information to the neural net weights, like depending on the type of user for the first point, you could technically, you could technically have the language model. [00:26:45]Jerry: Like if it memorized enough information, just like a site sources, but there's a question of just trust whether or not you're actually, yeah, well, but like it makes it up right now because it's like not good enough, but imagine a world where it is good enough and it does give accurate citations. Swyx: No, I think to establish trust, you just need a direct connection.So it's, it's kind of weird. It's, it's this melding of deep learning systems versus very traditional information retrieval. Yeah, exactly. [00:27:11]Jerry: Well, so, so I think, I mean, I kind of think about it as analogous to like humans, right? [00:27:15]Jerry: Like, uh, we as humans, obviously we use the internet, we use tools. Uh, these tools have API interfaces are well-defined. Um, and obviously we're not like the tools aren't part of us. And so we're not like back propping or optimizing over these tools. And so when you think about like RAG , it's basically, um, LLM is learning how to use like a vector database to look up information that it doesn't know. And so then there's just a question of like how much information is inherent within the network itself and how much does it need to do some sort of like tool used to look up stuff that it doesn't know. [00:27:42]Jerry: And I do think there'll probably be more and more of that interplay as time goes on. [00:27:46]Swyx: Yeah. Some followups on discussions that we've had, you know, we discussed fine tuning a bit and what's your current take on whether you can, you can fine tune new knowledge into LLMs. [00:27:55]Jerry: That's one of those things where I think longterm you definitely can. I think some people say you can't, I disagree. I think you definitely can. Just right now I haven't gotten it to work yet. So, so I think like we've tried, yeah, well, um, not in a very principled way, right? Like this is something that requires like an actual research scientist and not someone that has like, you know, an hour or two per night to actually look at this. [00:28:12]Swyx: Like I, you were a research scientist at Uber. I mean, it's like full-time, full-time working. [00:28:16]Jerry: So, so I think, um, what I specifically concretely did was I took OpenAI's fine tuning endpoints and then tried to, you know, it's in like a chat message interface. And so there's like, um, input question, like a user assistant message format. And so what I did was I tried to take just some piece of text and have the LLM memorize it by just asking it a bunch of questions about the text. So given a bunch of context, I would generate some questions and then generate some response and just fine tune over the question responses. That hasn't really worked super well, but that's also because I'm, I'm just like trying to like use OpenAI's endpoints as is. If you just think about like traditional, like how you train a Transformers model, there's kind of like the, uh, instruction, like fine tuning aspect, right? You like ask it stuff when guided with correct responses, but then there's also just like, um, next token production. And that's something that you can't really do with the OpenAI API, but you can do with, if you just train it yourself and that's probably possible if you just like train it over some corpus of data. I think Shashira from Berkeley said like, you know, when they trained Gorilla, they were like, Oh, you know, this, a lot of these LLMs are actually pretty good at memorizing information. Um, just the way the API interface is exposed is just no one knows how to use them right [00:29:22]Alessio: now. Right. [00:29:22]Jerry: And so, so I think that's probably one of the issues. [00:29:24]Swyx: Just to clue people in who haven't read the paper, Gorilla is the one where they train to use specific APIs. [00:29:30]Jerry: Yeah, I think this was on the Gorilla paper. Like the, the model itself could, uh, try to learn some prior over the data to decide like what tool to pick. But there's also, it's also augmented with retrieval that helps supplement it in case like the, the, the, um, prior doesn't actually work. [00:29:45]Swyx: Is that something that you'd be interested in supporting? [00:29:48]Jerry: I mean, I think in the longterm, like if like, this is kind of how fine tuning, like RAG evolves. Like I do think there'll be some aspect where fine tuning will probably memorize some high level concepts of knowledge, but then like RAG will just be there to supplement like aspects of that, that aren't work that don't, that, that it doesn't know. Jerry: Um, the way I think about this is kind of like, obviously RAG is the default way, like to be clear, RAG right now is the default way to actually augment stuff with knowledge. I think it's just an open question of how much the LM can actually internalize both high level concepts, but also details as you can like train stuff over it. And coming from an ML background, there is a certain beauty and just baking everything into some training process of a language model. Like if you just take raw chat, GPT or chat, GPT code interpreter, right? Like GPT four, it's not like you do RAG with it. You just ask it questions about like, Hey, how do I like to find a pedantic model in Python? And I'm like, can you give me an example? Can you visualize a graph? It just does it right. Like, and we'll run it through code interpreters as a tool, but that's not like a source for knowledge. [00:30:46]Jerry: It's just an execution environment. And so there is some beauty in just like having the model itself, like just, you know, instead of you kind of defining the algorithm for what the data structure should look like the model just learns it under the hood. That said, I think the reason it's not a thing right now is just like, no one knows how to do it. [00:31:01]Jerry: It probably costs too much money. And then also like the API interfaces and just like the actual ability to kind of evaluate and improve on performance, like isn't known to most people. [00:31:12]Alessio: Yeah. [00:31:12]Swyx: It also would be better with browsing. [00:31:14]Alessio: Yeah. [00:31:16]Swyx: I wonder when they're going to put that back. [00:31:18]Alessio: Okay. Yeah. [00:31:19]Swyx: So, and then one more follow up before we go into RAG for AI engineers is on your brief mentioned about security or off. How many of your, the people that you talk to, you know, you talk to a lot of people putting LlamaIndex into production. How many people actually are there versus just like, let's just dump a whole company notion into this thing. [00:31:36]Jerry: Wait, are you talking about from like the security off standpoint? [00:31:39]Alessio: Yeah. [00:31:39]Swyx: How big a need is that? Because I, I talked to some people who are thinking about building tools in that domain, but I don't know if people want it. [00:31:47]Jerry: I mean, I think bigger companies, like just bigger companies, like banks, consulting firms, like they all want this requirement, right? The way they're using LlamaIndex is not with this, obviously. Cause I don't think we have support for like access control or author that have stuff like on a hood. [00:32:02]Jerry: Cause we're more just like an orchestration framework. And so the way they build these initial apps is more kind of like prototype. Like, let's kind of, yeah. Like, you know, use some publicly available data. That's not super sensitive. Let's like, you know, assume that every user is going to be able to have access to the same amount of knowledge, those types of things. I think users have asked for it, but I don't think that's like a P zero. Like I think the P zero is more on like, can we get this thing working before we expand this to like more users within the work? [00:32:25]Alessio: There's a bunch of pieces to rag. Obviously it's not a, just an acronym. And you two recently, you think every AI engineer should build the front scratch at least once. Why is that? I think so. [00:32:37]Jerry: I'm actually kind of curious to hear your thoughts about this. Um, but this kind of relates to the initial like AI engineering posts that you put out and then also just like the role of an AI engineer and the skills that they're going to have to learn to truly succeed because there's an entire On one end, you have people that don't really, uh, like understand the fundamentals and just want to use this to like cobble something together to build something. And I think there is a beauty in that for what it's worth. Like, it's just one of those things. And Gen AI has made it so that you can just use these models in inference only mode, call something together, use it, power your app experiences, but on the other end, what we're increasingly seeing is that like more and more developers building with these apps start running into honestly, like pretty similar issues that like we'll play just a standard engineer building like a classifier model, which is just like accuracy problems, like, and hallucinations, basically just an accuracy problem, right? [00:33:24]Like it's not giving you the right results. So what do you do? You have to iterate on the model itself. You have to figure out what parameters you tweak. You have to gain some intuition about this entire process. That workflow is pretty similar, honestly, like even if you're not training the model to just like tuning a ML model with like hyper parameters and learning like proper ML practices of like, okay, how do I have like define a good evaluation benchmark? How do I define like the right set of metrics to do to use, right? How do I actually iterate and improve the performance of this pipeline for [00:33:52]Alessio: production? What tools do I use? [00:33:53]Jerry: Right? Like every ML engineer use like some form of weights and biases, tensor boards, or like some other experimentation tracking tool. What tools should I use to actually help build like LLM applications and optimize it for production? There's like a certain amount of just like LLM ops, like tooling and concepts and just like practices that people will kind of have to internalize if they want to optimize these. And so I think that the reason I think being able to build like RAG from scratch is important is it really gives you a sense of like how things are working to get, help you build intuition about like what parameters are within a RAG system and which ones actually tweak to make them better. Cause otherwise I think that one of the advantages of the LlamaIndex quick start is it's three lines of code. The downside of that is you have zero visibility into what's actually going on [00:34:37]Alessio: under the hood. [00:34:37]Jerry: And I think there's something that we've kind of been thinking about for a while and I'm like, okay, let's just release like a new tutorial series. That's just like, we're in set, not no three lines of code. We're just going to go in and actually show you how the thing actually works on [00:34:47]Alessio: the hood. Right. [00:34:47]Jerry: And so I like, does everybody need this? Like probably not as for some people, the three lines of code might work, but I think increasingly, like honestly, 90% of the users I talked to have questions about how to improve the performance of their app. And so just like, given this, it's just like one of those things that's like better for the understanding. [00:35:03]Alessio: Yeah. [00:35:03]Swyx: I'd say it is one of the most useful tools of any sort of developer education toolkit to write things yourself from scratch. So Kelsey Hightower famously wrote Kubernetes the hard way, which is don't use Kubernetes. Here's everything that you would have to do by yourself. And you should be able to put all these things together yourself to understand the value of Kubernetes. And the same thing for LLlamaIndex. I've done, I was the guy who did the same for React. And it's a pretty good exercise for you to just fully understand everything that's going on under the hood. And I was actually going to suggest while in one of the previous conversations, there's all these like hyperparameters, like the size of the chunks and all that. And I was thinking like, what would hyperparameter optimization for RAG look [00:35:44]Alessio: like? [00:35:44]Jerry: Yeah, definitely. I mean, so absolutely. I think that's going to be an increasing thing. I think that's something we're kind of looking at because like, I think someone [00:35:52]Swyx: should just put, do like some large scale study and then just ablate everything. And just you, you tell us. [00:35:57]Jerry: I think it's going to be hard to find a universal default that works for [00:36:00]Alessio: everybody. [00:36:00]Jerry: I think it's going to be somewhat, I do think it's going to be somewhat like dependent on the data and use case. I think if there was a universal default, that would be amazing. But I think increasingly we found, you know, people are just defining their own like custom parsers for like PDFs, markdown files for like, you know, SEC filings versus like Slack conversations. And then like the use case too, like, do you want like a summarization, like the granularity of the response? Like it really affects the parameters that you want to pick. I do like the idea of hyperparameter optimization though, but it's kind of like one of those things where you are kind of like training the model basically kind of on your own data domain. [00:36:36]Alessio: Yeah. [00:36:36]Swyx: You mentioned custom parsers. You've designed LlamaIndex, maybe we can talk about like the surface area of the [00:36:41]Alessio: framework. [00:36:41]Swyx: You designed LlamaIndex in a way that it's more modular, like you mentioned. How would you describe the different components and what's customizable in each? [00:36:50]Jerry: Yeah, I think they're all customizable. And I think that there is a certain burden on us to make that more clear through the [00:36:57]Alessio: docs. [00:36:57]Jerry: Well, number four is customization tutorials. [00:36:59]Swyx: Yeah, yeah. [00:37:00]Jerry: But I think like just in general, I think we do try to make it so that you can plug in the out of the box stuff. But if you want to customize more lower level components, like we definitely encourage you to do that and plug it into the rest of our abstractions. So let me just walk through like maybe some of the basic components of LlamaIndex. There's data loaders. You can load data from different data sources. We have Llama Hub, which you guys brought up, which is, you know, a collection of different data loaders of like unstructured and unstructured data, like PDFs, file types, like Slack, Notion, all that stuff. Now you load in this data. We have a bunch of like parsers and transformers. You can split the text. You can add metadata to the text and then basically figure out a way to load it into like a vector store. So, I mean, you worked at like Airbrite, right? It's kind of like there is some aspect like E and T, right? And in terms of like transforming this data and then the L, right, loading it into some storage abstraction, we have like a bunch of integrations with different document storage systems. [00:37:49]Alessio: So that's data. [00:37:50]Jerry: And then the second piece really is about like, how do you retrieve this data? How do you like synthesize this data and how do you like do some sort of higher level reasoning over this data? So retrieval is one of the core abstractions that we have. We do encourage people to like customize, define your own retrievers, that section on kind of like how do you define your own, like custom retriever, but also we have like out of the box ones. The retrieval algorithm kind of depends on how you structure the data, obviously. Like if you just flat index everything with like chunks with like embeddings, then you can really only do like top K like lookup plus maybe like keyword search or something. But if you can index it in some sort of like hierarchy, like defined relationships, you can do more interesting things like actually traverse relationships between nodes. Then after you have this data, how do you like synthesize the data? [00:38:32]Alessio: Right. [00:38:32]Jerry: Um, and, and this is the part where you feed it into the language model. There's some response abstraction that can abstract away over like long contacts to actually still give you a response, even if the context overflows a context window. And then there's kind of these like higher level, like reasoning primitives that I'm going to define broadly. And I'm just going to call them in some general bucket of like agents, even though everybody has different definitions of agents, but you're the first to data agents, [00:38:56]Swyx: which I was very excited. [00:38:57]Alessio: Yeah. [00:38:57]Jerry: We, we kind of like coin, coin that term. And the way we, we thought about it was, you know, we wanted to think about how to use agents for, uh, like data workflows basically. And, and so what are the reasoning primitives that you want to do? So the most simple reasoning primitive you can do is some sort of routing module. It's a classifier, like given a query, just make some automated decision on what choice to pick, right? You could use LLMs. You don't have to use LLMs. You could just try and classifier basically. That's something that we might actually explore. And then the next piece is, okay, what are some higher level things? You can have the LLM like define like a query plan, right. To actually execute over the data. You can do some sort of while loop, right? That's basically what an agent loop is, which is like react a chain of thought, like the open AI function calling, like while loop to try to like take a question and try to break it down into some, some, uh, series of steps to actually try to execute to get back a response. And so there's a range and complexity from like simple reasoning primitives to more advanced ones. The way we kind of think about it is like, which ones should we implement and how do [00:39:50]Alessio: they work? [00:39:50]Jerry: Well, like, do they work well over like the types of like data tasks that we give them? [00:39:54]Alessio: How do you think about optimizing each piece? So take, um, embedding models is one piece of it. You offer fine tuning, embedding models. And I saw it was like fine tuning gives you like 5, 10% increase. What's kind of like the Delta left on the embedding side? Do you think we can get models that are like a lot better? Do you think like that's one piece where people should really not spend too much time? [00:40:16]Jerry: I just think it's, it's not the only parameter. Cause I think in the end, if you think about everything that goes into retrieval, the chunking algorithm, um, how you define like metadata will bias your embedding representations. Then there's the actual embedding model itself, which is something that you can try optimizing. And then there's like the retrieval algorithm. Are you going to just do top K? Are you going to do like hybrid search? Are you going to do auto retrieval? Like there's a bunch of parameters. And so I do think it's something everybody should try. I think by default we use like OpenAI's embedding model. A lot of people these days use like sentence transformers because it's, it's just like free open source and you can actually optimize, directly optimize it. This is an active area of exploration. I do think one of our goals is it should ideally be relatively free for every developer to just run some fine tuning process over their data to squeeze out some more points and performance. And if it's that relatively free and there's no downsides, everybody should basically do [00:41:04]Alessio: it. [00:41:04]Jerry: There's just some complexities, right? In terms of optimizing your embedding model, especially in a production grade data pipeline. If you actually fine tune the embedding model and the embedding space changes, you're going to have to reindex all your documents. And for a lot of people, that's not feasible. And so I think like Joe from Vespa on our webinars, like there's this idea that depending on if you're just using like document and query embeddings, you could keep the document embeddings frozen and just train a linear transform on the query or, or any sort of transform on the query, right? So therefore it's just a query side transformation instead of actually having to reindex all the document embeddings. That's pretty smart. We weren't able to get like huge performance gains there, but it does like improve performance a little bit. And that's something that basically, you know, everybody should be able to kick off. You can actually do that on LLlamaIndex too. [00:41:45]Swyx: OpenAIO has a cookbook on adding bias to the embeddings too, right? [00:41:49]Alessio: Yeah. [00:41:49]Jerry: There's just like different parameters that you can, you can try adding to try to like optimize the retrieval process. And the idea is just like, okay, by default you have all this text. It kind of lives in some latent space, right? [00:42:01]Swyx: Yeah. Shut out, shut out latent space. You should take a drink every time. [00:42:05]Jerry: But it lives in some latent space. But like depending on the type, specific types of questions that the user might want to ask, the latent space might not be optimized to actually retrieve the relevant piece of context that the user want to ask. So can you shift the embedding points a little bit, right? And how do we do that? Basically, that's really a key question here. So optimizing the embedding model, even changing the way you like chunk things, these all shift the embeddings. [00:42:26]Alessio: So the retrieval is interesting. I got a bunch of startup pitches that are like, like ragged school, but like there's a lot of stuff in terms of ranking that could be better. There's a lot of stuff in terms of sun setting data. Once it starts to become stale, that could be better. Are you going to move into that part too? So like you have SEC Insights as one of kind of like your demos. And that's like a great example of, Hey, I don't want to embed all the historical documents because a lot of them are outdated and I don't want them to be in the context. [00:42:55]Jerry: What's that problem space? [00:42:57]Alessio: Like how much of it are you going to also help with and versus how much you expect others to take care of? [00:43:03]Jerry: Yeah, I'm happy to talk about SEC Insights in just a bit. I think more broadly about the like overall retrieval space. We're very interested in it because a lot of these are very practical problems that [00:43:11]Alessio: people have asked us. [00:43:11]Jerry: And so the idea of outdated data, I think, how do you like deprecate or time wait data and do that in a reliable manner, I guess. So you don't just like set some parameter and all of a sudden that affects your, all your retrieval items, like is pretty important because people have started bringing [00:43:25]Alessio: that up. [00:43:25]Jerry: Like I have a bunch of duplicate documents, things get out of date. How do I like sunset documents? And then remind me, what was the, what was the first thing you said? Cause I think there was, there was something like the ranking ranking, right? [00:43:35]Alessio: Yeah. [00:43:35]Jerry: So I think this space is not new. I think everybody who is new to this space starts learning some basic concepts of information retrieval, which to be fair has been around for quite a bit. But our goal is to kind of like take some of like just general ranking and information retrieval concepts. So by encoding, like crossing coding, right? Like we're based models versus like kind of keyword based search. How do you actually evaluate retrieval? These things start becoming relevant. And so I think for us, like rather than inventing like new retriever techniques for the sake of like just inventing better ranking, we want to take existing ranking techniques and kind of like package it in a way that's like intuitive and easy for people to understand. That said, I think there are interesting and new retrieval techniques that are kind of in place that can be done when you tie it into some downstream rack system. The reason for this is just like, if you think about the idea of like chunking text, right? Like that just really wasn't a thing, or at least for this specific purpose, like the reason chunking is a thing in RAG right now is because like you want to fit within the context bundle of an LLM, right? Like why do you want to chunk a document? That just was less of a thing. I think back then, if you wanted to like transform a document, it was more for like structured data extraction or something in the past. And so there's kind of like certain new concepts that you got to play with that you can use to invent kind of more interesting retrieval techniques. Another example here is actually LLM based reasoning, like LLM based chain of thought reasoning. You can take a question, break it down into smaller components and use that to actually send to your retrieval system. And that gives you better results. And it's kind of like sending the full question to a retrieval system. That also wasn't really a thing back then, but then you can kind of figure out an interesting way to like blending old and the new, right? With LLMs and data. [00:45:13]Swyx: There's a lot of ideas that you come across. Do you have a store of them? [00:45:17]Jerry: Yeah, I think I, sometimes I get like inspiration. There's like some problem statement and I'm just like, oh, it's like, following you is [00:45:23]Swyx: very hard because it's just a lot of homework. [00:45:25]Jerry: So I think I've, I've started to like step on the brakes just a little bit. Cause then I start, no, no, no. Well, the, the reason is just like, okay, if I just have invent like a hundred more retrieval techniques, like, like sure. But like, how do people know which one is good and which one's like bad. [00:45:41]Alessio: Right. [00:45:41]Jerry: And so have a librarian, right? [00:45:42]Swyx: Like it's going to catalog it and you're going to need some like benchmarks. [00:45:45]Jerry: And so I think that's probably the focus for the next, next few weeks is actually like properly kind of like having an understanding of like, oh, you know, when should you do this or like, what does this actually work well? [00:45:54]Alessio: Yeah. [00:45:54]Swyx: Some kind of like a, maybe like a flow chart, decision tree type of thing. Yeah, exactly. When this do that, you know, something like that, that would be really helpful for me. [00:46:02]Alessio: Thank you. [00:46:02]Swyx: It seems like your most successful side project. Yeah. What is SEC Insights for our listeners? [00:46:07]Jerry: Um, our SEC Insights is a full stack LLM chatbot application, um, that does. Analysis of your sec 10 K and 10 Q filings. And so the goal for building this project is really twofold. The reason we started building this was one, it was a great way to dog food, the production readiness for our library. We actually ended up like adding a bunch of stuff and fixing a ton of bugs because of this. And I think it was great because like, you know, thinking about how we handle like callbacks streaming, actually generating like reliable sub responses and bubbling up sources, citations. These are all things that like, you know, if you're just building the library in isolation, you don't really think about it. But if you're trying to tie this into a downstream application, like it really starts mattering for your error messages. When you talk about bubbling up stuff for like sources, like if you go into SEC Insights and you type something, you can actually see the highlights in the right side. That was something that like took a little bit of like, um, understanding to figure out how to build wall. And so it was great for dog fooding improvement of the library itself. And then as we're building the app, um, the second thing was we're starting to talk to users and just like trying to showcase like kind of, uh, bigger companies, like the potential of LLM index as a framework, because these days obviously building a chatbot, right. With Streamlight or something, it'll take you like 30 minutes or an hour. Like there's plenty of templates out there on LLM index, like train, like you can just build a chatbot, but how do you build something that kind of like satisfies some of these, uh, this like criteria of surfacing, like citations, being transparent, seeing like, uh, having a good UX, um, and then also being able to handle different types of questions, right? Like more complex questions that compare different documents. That's something that I think people are still trying to explore. And so what we did was like, we showed, well, first like organizations, the possibilities of like what you can do when you actually build something like this. And then after like, you know, we kind of like stealth launched this for fun, just as a separate project, uh, just to see if we could get feedback from users who are using this world to see like, you know, how we can improve stuff. And then we were thought, we thought like, ah, you know, we built this, right? Obviously we're not going to sell like a financial app. Like that's not really our, in our wheelhouse, but we're just going to open source the entire thing. And so that now is basically just like a really nice, like full stack app template you can use and customize on your own, right. To build your own chatbot, whether it is a really financial documents or like other types of documents. Um, and it provides like a nice template for basically anybody to kind of like go in and get started. There's certain components though, that like aren't released yet that we're going to going to, and then next few weeks, like one is just like kind of more detailed guides on like different modular components within it. So if you're like a full stack developer, you can go in and actually take the pieces that you want and actually kind of build your own custom flows. The second piece is like, take, there's like certain components in there that might not be directly related to the LLM app that would be nice to just like have people use, uh, an example is the PDF viewer, like the PDF viewer with like citations. I think we're just going to give that right. So, you know, you could be using any library you want, but then you can just, you know, just drop in a PDF viewer. [00:48:53]Alessio: Right. [00:48:53]Jerry: So that it's just like a fun little module that you can do. [00:48:55]Swyx: Nice. That's really good community service right there. I want to talk a little bit about your cloud offering, because you mentioned, I forget the name that you had for it. [00:49:04]Alessio: Enterprise something. [00:49:04]Jerry: Well, one, we haven't come up with a name. Uh, we're kind of calling it LLM index platform, platform LLM index enterprise. I'm open to suggestions here. Um, and the second thing is I don't actually know how much I can, I can share right now because it's mostly kind of like, uh, we, we, yeah, exactly. [00:49:20]Swyx: To the extent that you can talk about LLM index as a business. Um, always just want to give people in the mind, like, Hey, like you sell things too, you know what I mean? [00:49:28]Jerry: Yeah, a hundred percent. So I think the high level of what I can probably say is just like, I think we're looking at ways of like actively kind of complimenting the developer experience, like building LLM index. We've always been very focused on stuff around like plugging in your data into the language model. And so can we build tools that help like augment that experience beyond the open [00:49:47]Alessio: source library? Right. [00:49:48]Jerry: And so I think what we're going to do is like make a build an experience where it's very seamless to transition from the open source library with like a one line toggle, you can basically get this like complimentary service and then figure out a way to like monetize in a bit. I think where our revenue focus this year is less emphasized. Like it's more just about like, can we build some manage offering that like provides complimentary value to what the open source library provides? [00:50:09]Alessio: Yeah. [00:50:10]Swyx: I think it's the classic thing about all open source is you want to start building the most popular open source projects in your category to own that category. You're going to make it very easy to host. Therefore you're just built your biggest competitor, which is you. [00:50:22]Jerry: I think it will be like complimentary. Cause I think it will be like, you know, use the open source library and then you have a toggle and all of a sudden, you know, you can see this basically like a pipeline ish thing pop up and then it will be able to kind of like, you'll have a UI. There'll be some enterprise guarantees and the end goal would be to help you build like a production RAG app more easily. [00:50:42]Alessio: Data loaders. There's a lot of them. What are maybe some of the most popular, maybe under, not underrated, but like underexpected, you know, and how has the open source side of it helped with like getting a lot more connectors, you only have six people on the team today, so you couldn't have done it all yourself. [00:51:00]Jerry: Yeah. I think the nice thing about like Walmart hub itself, it's supposed to be a community driven hub. Um, and so actually the bulk of the peers are completely community contributed. Um, and so we haven't written that many like first party connectors actually for this, it's more just like a kind of encouraging people to contribute to the community in terms of the most popular tools, uh, or the data loaders. I think we have Google analytics on this and I forgot the specifics. It's some mix of like the PDF loaders. We have like 10 of them, but there's some subset of them that are popular. And then there's Google, like I think Gmail and like G drive. Um, and then I think maybe it's like one of Slack or notion. One thing I will say though, uh, and I think like Swix might probably knows this better than I do, given that you were, she used to work at air bite. It's very hard to build, like, especially for full on service, like notion Slack or like Salesforce to build like a really, really high quality loader that really extracts all the information that people want. [00:51:51]Alessio: Right. [00:51:51]Jerry: And so I think the thing is when people start out, like they will probably use these loaders and it's a great tool to get started. And for a lot of people, it's like good enough. And they submit PRs if they want more additional features. But if you get to a point where you actually want to call like an API that hasn't been supported yet, or, you know, you want to load in stuff that like in metadata or something that hasn't been directly baked into the logic of a loader itself, people start adding up, like writing their own custom loaders. And that is a thing that we're seeing. That's something that we're okay with. [00:52:18]Alessio: Right. [00:52:18]Jerry: Cause like a lot of this is more just like community driven. And if you want to submit a PR to improve the existing one, you can, otherwise you can create your own custom ones. [00:52:24]Alessio: Yeah. [00:52:25]Swyx: And all that is custom loaders all supported within LLlamaIndex, or do you pair it with something else? [00:52:29]Jerry: Oh, it's just like, I mean, you just define your own subclass. I think, I think that's it. [00:52:33]Alessio: Yeah. Yeah. [00:52:33]Swyx: Cause typically in the data ecosystem with everybody, everybody has his own strategies with custom loaders, but also you could write your own with like Dagster or like Prefect or one of those tools. [00:52:43]Alessio: Yeah. [00:52:44]Jerry: Yeah, exactly. So I think for us, it's more, we just have a very flexible like document abstraction that you can fill in with any content that you want. [00:52:50]Swyx: Are people really dumping all their Gmail into these things? You said Gmail is number two. Uh, I'm not sure actually. I mean, that's these, you know, that's the most private data source. [00:52:59]Alessio: That's true. [00:53:00]Swyx: So I'm surprised that people are dumping too. I mean, I'm sure some, some people are, but like, I'm sure I'm surprised it's [00:53:06]Alessio: popular. [00:53:06]Swyx: Well, and then, so, uh, the LLM engine, uh, I assume OpenAI is going to be a majority. Is it an overwhelming majority? Uh, how, what's the market share between like OpenAI, Cohere, Anthropic, you know, whatever you're seeing. [00:53:21]Alessio: OpenSource too. [00:53:21]Jerry: Yeah, I think it's probably some, uh, OpenAI has a majority, but then like there's Anthropic and there's also, um, OpenSource. I think there is a lot of people trying out like Llama 2, um, and, and, um, some variant of like a top OpenSource model. [00:53:33]Swyx: Side note, any confusion there, Llama 2 versus Llama? [00:53:36]Jerry: Yeah, I think whenever I go to these talks, I always open it up with like, we started before it. Yeah, exactly. We start before meta, right? [00:53:43]Alessio: I want to point that out. [00:53:43]Jerry: Uh, but no, for us, we try to use it for like branding. We just add two llamas when we have like a Llama 2 integration instead of one llama. So I think a lot of people are trying out the popular OpenSource models. Uh, there's a lot of toolkits and OpenSource projects that allow you to self-host and deploy Llama 2 and like, oh, Llama is just a very recent example. I think that we, we added integration with, and so we just, uh, by virtue of having more of these services, I think more and more people are trying it out. [00:54:07]Swyx: Do you think there's, there's potential there? Is like, um, is that going to be an increasing trend? Like OpenSource? [00:54:12]Alessio: Yeah. [00:54:12]Jerry: Yeah, definitely. I think in general people hate monopolies. And so, um, like there's a, whenever like OpenAI has something really cool or like any, um, company has something really cool, even meta, like there's just going to be a huge competitive pressure from other people to do something that's more open and better. Um, and so I do think just market pressures will, will improve like OpenSource adoption. [00:54:32]Swyx: Last thing I'll say about this, which is just really like, it gets clicks. It's people like psychologically want that, but then at the end of the day, they want, they fall for brand name and popular and performance benchmarks. You know, at the end of the day, OpenAI still wins on that. I think that's true. [00:54:47]Jerry: But I, I just think like, unless you were like an active employee at OpenAI, right? Like all these research labs are putting out like ML, like PhDs or kind of like other companies too, that are investing a lot of dollars. Uh, there's going to be a lot of like competitive pressures developed, like better models. So is it going to be like all fully open source with like a permissive license? Like, I'm not completely sure, but like, there's just a lot of just incentive for people to develop their stuff here. [00:55:09]Swyx: Have you looked at like RAG specific models, like contextual? [00:55:12]Alessio: No. [00:55:13]Jerry: Is it public? [00:55:14]Swyx: No, they literally just, uh, so Dewey Keeler. I think it's his name. And you probably came across him. He wrote the RAG paper at Meta and just started contextual AI to create a RAG specific model. I don't know what that means. I was hoping that you do, cause it's your business. [00:55:29]Jerry: I had insider information. I mean, you know, to be honest, I think this, this kind of relates to my previous point on like RAG and fine tuning, like a RAG specific model is a model architecture that's designed for better RAG and it's less the software engineering principle of like, how can I take existing stuff and just plug and play different components into it? Um, and there's a beauty in that from ease of use and modularity, but when you want to end to end optimize the thing, you might want a more specific model. I think, I think building your own models is honestly pretty hard. Um, and I think the issue is if you also build your own models, like you're also just gonna have to keep up with like the rate of LM advances, like how, like basically the question is when GPT five and six and whatever, like anthropic cloud three comes out, how can you prove that you're actually better than, uh, software developers cobbling together and components on top of a base model. Right. Even if it's just like conceptually, this is better than maybe like GPT three or GPT four. [00:56:21]Alessio: What about vector stores? I know Spooks is wearing a chroma sweatshirt. [00:56:25]Swyx: Yeah, because they use a swagging. [00:56:27]Jerry: I have, I have the mug from Chroma. [00:56:29]Alessio: Yeah. It's been great. Yeah. [00:56:30]Jerry: What do you think there? [00:56:31]Alessio: Like there's a lot of them. Are they pretty interchangeable for like your users use case? Uh, is HNSW all we need? Is there room for improvements? [00:56:40]Swyx: Is NTRA all we need? [00:56:42]Jerry: I think, um, yeah, we try to remain unopinionated about storage providers. So it's not like we don't try to like play favorites. So we have like a bunch of integrations obviously. And we, the way we try to do it is we just tried to find like some standard interfaces, but obviously like different vector stores will support kind of like, uh, slightly additional things like metadata filters and those things. I mean, the goal is to have our users basically leave it up to them to try to figure out like what makes sense for their use case in terms of like the algorithm itself, I don't think the Delta on like improving the vector store, like. Embedding lookup algorithm. [00:57:10]Alessio: Is that high? [00:57:10]Jerry: I think the stuff has been mostly solved or at least there's just a lot of other stuff you can do to try to improve the overall performance. No, I mean like everything else that we just talked about, like in terms of like [00:57:20]Alessio: accuracy, right. [00:57:20]Jerry: To improve rag, like everything that we talked about, like chunking, like metadata, like. [00:57:24]Swyx: I mean, I was just thinking like, maybe for me, the interesting question is, you know, there are like eight, it's a kind of game of thrones. There's like eight, the war of eight databases right now. Oh, I see. Um, how do they stand out and how did they become very good partners? [00:57:36]Alessio: If not my index. [00:57:36]Jerry: Yeah, we're pretty good partners with, with most of them. [00:57:39]Alessio: Uh, let's see. [00:57:39]Swyx: Well, like if you're a, you know, vector database founder, like what do you, what do you work on? [00:57:44]Alessio: It's a good question. [00:57:44]Jerry: I think one thing I'm very interested in is, and this is something I think I've started to see a general trend towards is combining structured data querying with unstructured data querying. Um, and I think that will probably just expand the query sophistication of these vector stores and basically make it so that users don't have to think about whether they would just call this like hybrid querying. [00:58:05]Swyx: Is that what we've it's doing? [00:58:06]Alessio: Yeah. [00:58:07]Jerry: I mean, I think like, if you think about metadata filters, that's basically a structured filter. It's like our select where something equals something, and then you combine that with semantic search. I think like Lance DB or something was like, uh, try, I was trying to do some like joint interface. The reason is like most data is semi-structured. There's some structured annotations and there's some like unstructured texts. And so like, um, somehow combining all the expressivity of like SQL with like the flexibility of semantic search is something that I think is going to be really important. We have some basic hacks right now that allow you to jointly query both a SQL database and like a separate SQL database and a vector store to like combine the information. That's obviously going to be less efficient than if you just combined it into one [00:58:46]Alessio: system. Yeah. [00:58:46]Jerry: And so I think like PG vector, like, you know, that type of stuff, I think it's starting to get there, but like in general, like how do you have an expressive query language to actually do like structured querying along with like all the capabilities, semantic search. [00:58:57]Swyx: So your current favorite is just put it into Postgres. No, no, no. We don't play with Postgres language, the query language. [00:59:05]Jerry: I actually don't know what the best language would be for this, because I think it will be something that like the model hasn't been fine-tuned over. Um, and so you might want to train the model over this, but some way of like expressing structured data filters, and this could be include time too, right? It could, it doesn't have to just be like a where clause with this idea of like a [00:59:26]Alessio: semantic search. Yeah. [00:59:27]Swyx: And we talked about, uh, graph representations. [00:59:30]Alessio: Yeah. Oh yeah. [00:59:30]Jerry: That's another thing too. And there's like, yeah. So that's actually something I didn't even bring up yet. Like there's this interesting idea of like, can you actually have the language model, like explore like relationships within the data too, right? And somehow combine that information with stuff that's like more and more, um, structured within the DB. [00:59:46]Alessio: Awesome. [00:59:46]Swyx: What are your current strong beliefs about how to evaluate RAG ? [00:59:49]Jerry: I think I have thoughts. I think we're trying to curate this into some like more opinionated principles because there's some like open questions here. I think one question I had to think about is whether you should do like evals like component by component first, or is yours do the end to end thing? I think you should, you might actually just want to do the end to end thing first, just to do a sanity check of whether or not like this, uh, given a query and the final response, whether or not it even makes sense, like you eyeball [01:00:11]Alessio: it, right. [01:00:11]Jerry: And then you like try to do some basic evals. And then once you like diagnose what the issue is, then you go into the kind of like specific area to define some more, uh, solid benchmarks and try to like [01:00:21]Alessio: improve stuff. [01:00:21]Jerry: So what is Antoine evals? Like it's, you, um, have a query, it goes in through retrieval system. You get back something, you synthesize response, and that's your final thing. And you evaluate the quality of the final response. And these days, there's plenty of projects like startups, like companies research, doing stuff around like GPT-4, right. As like a human judge to basically kind of like synthetically generate data. [01:00:41]Swyx: I don't know from the startup side. [01:00:43]Jerry: I just know from a technical side, I think, I think people are going to do more of it. The main issue right now is just, uh, it's really unreliable. Like it's, it's just, uh, like there's like variants on the response, whatever you want. [01:00:54]Alessio: They won't do more of it. [01:00:54]Swyx: I mean, cause it's bad. [01:00:55]Jerry: No, but, but these models will get better and you'll probably fine tune a model to [01:00:59]Alessio: be a better judge. [01:00:59]Jerry: I think that's probably what's going to happen. So I'm like reasonably bullish on this because I don't think there's really a good alternative beyond you just human annotating a bunch of data sets, um, and then trying to like just manually go through and curating, like evaluating eval metrics. And so this is just going to be a more scalable solution in terms of the [01:01:17]Alessio: startups. Yeah. [01:01:17]Jerry: I mean, I think there's a bunch of companies doing this in the end. It probably comes down to some aspect of like UX speed, whether you can like fine tune a model. So that's end to end evals. And then I think like what we found is for rag, a lot of times, like, uh, what ends up affecting this, like end response is retrieval. You're just not able to retrieve the right response. And so I think having proper retrieval benchmarks, especially if you want to do production RAG is, is actually quite important. I think what does having good retrieval metrics tell you? It tells you that at least like the retrieval is good. It doesn't necessarily guarantee the end generation is good, but at least it gives you some, uh, sanity track, right? So you can like fix one component while optimizing the rest, what retrieval like evaluation is pretty standard. And it's been around for a while. It's just like an IR problem. Basically you have some like input query, you get back some retrieves out of context, and then there's some ground truth and that ranked set. And then you try to measure it based on ranking metrics. So the closer that ground truth is to the top, the more you reward the evals. And then the closer it is to the bottom where if it's not in the retrieve side at all, then you penalize the evals. Um, and so that's just like a classic ranking problem. I think like most people starting out probably don't know how to do this right [01:02:28]Alessio: now. [01:02:28]Jerry: We, we just launched them like basic retrieval evaluation modules to help users [01:02:32]Alessio: do this. [01:02:32]Jerry: One is just like curating this data set in the first place. And one thing that we're very interested in is this idea of like synthetic data set generation for evals. So how can you give in some context, generate a set of questions with Drupal 2.4, and then all of a sudden you have like question and then context pairs, and that becomes your ground truth. [01:02:47]Swyx: Are data agent evals the same thing, or is there a separate set of stuff for agents that you think is relevant here? [01:02:53]Jerry: Yeah, I think data agents add like another layer of complexity. Cause then it's just like, you have just more loops in the system. Like you can evaluate like each chain of thought loop itself, like every LLM call to see whether or not the input to that specific step in the chain of thought process actually works or is correct. Or you can evaluate like the final response to see if that's correct. This gets even more complicated when you do like multi-agent stuff, because now you have like some communication between like different agents. Like you have a top level orchestration agent passing it on to some low level [01:03:24]Alessio: stuff. [01:03:24]Jerry: I'm probably less familiar with kind of like agent eval frameworks. I know they're, they're starting to be, become a thing. Talking to like June from the Drown of Agents paper, which is pretty unrelated to what we're doing now. But it's very interesting where it's like, so you can kind of evaluate like overall agent simulations by just like kind of understanding whether or not they like modeled the distribution of human behavior. But that's not like a very macro principle. [01:03:46]Alessio: Right. [01:03:46]Jerry: And that's very much to evaluate stuff, to kind of like model the distribution of [01:03:51]Alessio: things. [01:03:51]Jerry: And I think that works well when you're trying to like generate something for like creative purposes, but for stuff where you really want the agent to like achieve a certain task, it really is like whether or not it achieved the task or not. [01:04:01]Alessio: Right. [01:04:01]Jerry: Cause then it's not like, Oh, does it generally mimic human behavior? It's like, no, like did you like send this email or not? [01:04:07]Alessio: Right. [01:04:07]Jerry: Like, cause otherwise like this, this thing didn't work. [01:04:09]Alessio: Awesome. Let's jump into a lightning round. So we have two questions, acceleration, exploration, and then one final tag away. The acceleration question is what's something that already happened in AI that you thought would take much longer to get here? [01:04:23]Jerry: I think just the ability of LLMs to generate believable outputs and for text and also for images. And I think just the whole reason I started hacking around with LLMs, honestly, I felt like I got into it pretty late. I should've gotten into it like early 2022 because UB23 had been out for a while. Like just the fact that there was this engine that was capable of like reasoning and no one was really like tapping into it. And then the fact that, you know, I used to work in image generation for a while. Like I did GANs and stuff back in the day. And that was like pretty hard to train. You would generate these like 32 by 32 images. And then now taking a look at some of the stuff by like Dolly and, and, you know, mid journey and those things. So it's, it's just, it's, it's very good. [01:04:59]Alessio: Yeah. [01:04:59]Swyx: Exploration. What do you think is the most interesting unsolved question in AI? [01:05:03]Jerry: Yeah, I'd probably work on some aspect of, um, like personalization of memory. Like, I think I actually think that I don't think anyone's like, I think a lot of people have thoughts about that, but like, for what it's worth, I don't think the final state will be right. I think it will be some, some like fancy algorithm or architecture where you like bake it into like the, the architecture of the model itself. Like if, if you have like a personalized assistant that you can talk to that will like learn behaviors over time, right. And learn stuff through like conversation history, what exactly is the right architecture there? I do think that will be part of like the wrong continuous fine tuning. [01:05:38]Swyx: Yeah. [01:05:39]Jerry: Like some aspect of that, right. [01:05:40]Alessio: Right. [01:05:40]Jerry: Like these are like, I don't actually know the specific technique, but I don't think it's just going to be something where you have like a fixed vector store and that, that thing will be like the thing that restores all your memories. [01:05:48]Swyx: It's interesting because I feel like using model weights for memory, it's just such an unreliable storage device. [01:05:56]Jerry: I know. But like, I just think, uh, from like the AGI, like, you know, just modeling like the human brain perspective, I think that there is something nice about just like being able to optimize that system. [01:06:08]Alessio: Right. [01:06:08]Jerry: And to optimize a system, you need parameters and then that's where you just get into the neural net piece. [01:06:12]Alessio: Cool. Cool. Uh, and yeah, take away, you got the audience ear. What's something you want everyone to think about or yeah, take away from this conversation and your thinking. [01:06:24]Jerry: I think there were a few key things. Uh, so we talked about two of them already, which was SEC Insights, which if you guys haven't tracked it out, I've definitely encouraged you to do so because it's not just like a random like sec app, it's like a full stack thing that we open source, right. And so if you guys want to track it out, I would definitely do that. It provides a template for you to build kind of like production grade rack apps. Um, and we're going to open source like, and modularize more components of that soon and do a workshop on, um, yeah. And the second piece is I think we are thinking a lot about like retrieval and evals. Um, I think right now we're kind of exploring integrations with like a few different partners. And so hopefully some of that will be, uh, really soon. And so just like, how do you basically have an experience where you just like write law index code, all of a sudden you can easily run like retrievals, evals, and like traces, all that stuff. And, and like a service. And so I think we're working with like a few providers on that. And then the other piece, which we did talk about already is this idea of like, yeah, building like RAG from scratch. I mean, I think everybody should do it. I think I would check out the guide. If you guys haven't already, I think it's in our docs, but instead of just using, you know, either the kind of like the retriever query engine and lamin decks or like the conversational QA train and Lang train, it's, I would take a look at how do you actually chunk parse data and do like top cam batting retrieval, because I really think that by doing that process, it helps you understand the decisions, the prompts, the language models to use. [01:07:42]Alessio: That's it. Yeah. [01:07:44]Swyx: Thank you so much, Jerry. [01:07:45]Alessio: Yeah. [01:07:45]Jerry: Thank you. [01:07:46] This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
Transcript
Discussion (0)
Hey, everyone. Welcome to the Latenspace podcast. This is Alessio, partner, and CTO and residents of Decibo Partners. And I'm joined by my co-host, Swix, founder of Small AI. And today we finally have Jerry Lee on the podcast. Hey, Jerry. Hey, hey, hey, Swix and Al-Sio. Thanks for having me. It's so weird because we keep running each other in San Francisco AI events. So it's kind of weird to finally just have a conversation recorded for everybody else.
Yeah, I know. I'm really looking forward to this. I have further questions. So I tend to introduce people,
on their formal background and then ask something on the more personal side. So you are part of the
Princeton gang. I don't know if there is like official Princeton gang. I attended your meeting.
There was like four of you. Oh, cool. Okay. Nice. With Prem and the others. Oh yeah, yeah. Yeah. Well, you did
Bachelors and CS and certificate of finance. That's also fun. I also did finance. And I think I saw that you also
interned at Two Sigma where I worked in New York. You were a machine learning engineer. Yeah, very briefly.
Oh, cool. All right. I didn't know that. Okay. That was my first like,
proper engineering job before I went into Deverell.
Oh, okay. Wow.
And then your machine learning engineer at Quora,
AI research scientists at Uber for three years,
and then two years machine learning engineer at robust intelligence
before starting Lama Index.
So that's your LinkedIn.
What's not on your LinkedIn that people should know about you?
I think back during my Quora days,
I had this like three-month phase where I just wrote like a ton of Quora answers.
And so I think if you look at my tweets nowadays,
you could basically see that as like the V2 of my three-months,
month for a stint where I just went ham on Quora for a bit. I actually, I think I was back then,
actually, when I was working on Quora. I think the thing that everybody was fascinated in was
just like general, like, deep learning advancements and stuff like GANS and generative
images and just like new architectures that were evolving. And it was a pretty exciting time to be
a researcher, actually, because you were going in, like, really understanding some of the new
techniques. So I kind of use that as like a learning opportunity, basically, just like read a bunch
papers and then answer questions on Quora. And so you can kind of see traces of that.
that basically in my current Twitter where it's just like really about kind of like framing concepts and
trying to make it understandable and educate other users on it. Yeah, I've said so a lot of people
come to me from my Twitter advice, but like I think you are doing one of the best jobs in the
I Twitter, just explaining concepts and just consistently getting hits out. Thank you.
I didn't know it was due to the Quora training. This is just sign on Quora. A lot of people,
including myself, like kind of wrote off Quora as like one of the Web 1.0 like sort of question
answer forms. But now I think it's becoming a senior resurgence obviously due to
Po. And obviously Adam and DeAngelo has always been a leading tech figure. But what do you think
it's like underrated about Quora? Well, I mean, I really like the mission of Quora when I joined.
In fact, I interned there like in 2015 and I joined full time in 2017. One is like they had and
they have like a very talented engineering team and just like really, really smart people.
And the other part is the whole mission of the company is to just like spread knowledge and to
educate people. And to me that really resonated. I really like the idea.
of just like education and democratizing the flow of information.
And if you imagine like kind of back then, it was like, okay, you have Google, which is like for
search, but then you have Quora, which is just like user generated like grassroots type content.
I really like that concept because it's just like, like, okay, there's certain types of information
that aren't accessible to people, but you can make accessible by just like surfacing it.
And then so actually, I don't know if like most people know that about like Quora, like, and
if they've used the product, whether through like SEO, right, or kind of like actively.
But that really was what it drew me to it.
Yeah. I think most people challenge us with it is that sometimes you don't know if it's like a veiled product pitch, right?
Yeah. Of course, like quality of the answer matters quite a bit and then you start running into these.
And then here's the one I work on. Yeah, like recommendation issues and all that stuff. I worked on Rexis at court actually. So I got a taste of stuff like that. Well, I mean, I kind of more approached it from machine learning techniques, which might be a nice segue into rag actually. A lot of it was just information retrieval. We weren't like solving anything that was like super different than what was standard in the industry at the time. But just like ranking, basically.
on user preferences. I think a lot of Quora was very metrics-driven, so just like trying to maximize, like, you know, daily active hours, like, you know, time spent on site, those types of things. And all the machine learning algorithms were really just based on embeddings. You have a user embedding and you have, like, item embeddings, and you try to train the models to try to maximize the similarity of these. And it's basically a retrieval problem. Okay. So you've been working on RIG for longer than most people think? Well, kind of. So I worked there for like a year, right? Just transparently. And then I worked at
where I was not working on ranking. It was more like kind of deep learning training for
self-driving and computer vision and that type of stuff. But I think in the LLM world, it's kind
of just like a combination of like everything these days. I mean, retrieval is not really LLMs,
but like it fits within the space of like LLM apps. And then obviously like having knowledge of
the underlying deep learning architectures helps. Having knowledge of basic software engineering
principles helps too. And so I think nice, it's kind of nice that like this whole LLM
space is basically just like a combination of just like a bunch of stuff that.
you probably have, like, people have done it in the past.
It's good. It's like a summary capstone project.
Yeah, exactly. Yeah.
And before we dive into Lama Index,
what do they feed you a robust intelligence
that both Hugh and Harrison from Langeen came out of it at the same time?
Was there like, yeah, is there any fun story of like how both of you kind of came
with kind of like core infrastructure to LAM Workpost today?
Or how close were you at robust?
Like, any fun behind the scenes?
Yeah, yeah.
We work pretty closely.
I mean, we were on the same team for like two years.
I got to know Harrison and Brasset a team pretty well.
I mean, I have a respect the people there.
People there were very driven, very passionate.
And it definitely pushed me to be a better engineer and leader and those types of things.
Yeah, I don't really have a concrete explanation for this.
I think it's more just we have like an L-LM hackathon around like September.
This was just like exploring a GPT3 or it was October actually.
And then the day after I went on vacation for a week and a half.
And so I just didn't track slack or anything.
Came back, saw that Harrison started lane chain.
I was like, oh, that's cool.
I was like, oh, I'll play around with LMs a bit and then hacked around on stuff.
And I think I've told the story a few times, but, you know, I was like trying to feed
information into GBT3.
And then then you deal with like context window limitations.
And there was no tooling or really practices to try to understand how do you, you know,
get GBT3 to navigate large amounts of data.
And that's kind of how the project started.
It really was just one of those things where early days, like, we were just trying to build
something that was interesting.
Like, I wanted to start a company.
I had other ideas actually of what I wanted to start.
And I was very interested in, for instance, like multimodal data, like video data and that type of stuff.
And then this just kind of grew and eventually took over the other idea.
Text is the universal interface.
I think so.
I think so.
I actually think once the multimodal models come out, I think there is just like mathematically nicer properties if you can just get like join multiple embeddings like clip style.
But tax is really nice because from a software engineering principle, it just makes things way more modular.
You just convert everything into text and then you just represent everything.
thing as text.
Yeah.
I'm just explaining retroactively why working on Lama Index took off versus if you had
chose to spend your time on multimodal, we probably wouldn't be talking about whatever
you ended up working on.
Yeah.
That's true.
It's struggled.
Interesting.
So November 9th, that was a very productive month, I guess, October, November.
November 9th, you announced GPT tree index and you picked the tree logo.
Very cool.
Every project must have an emoji.
Yeah.
Yeah.
That probably was somewhat inspired by a light train, but I will admit.
Yeah.
It uses GPT to build a knowledge tree in a bottom-up fashion by applying a summarization
prompts for each node.
Yep.
Which I like that original vision.
Your messaging around about then was also that you're creating optimized data structures.
What's the sort of journey to that and how does that contrast with Blamandex today?
Okay.
Maybe I can tell a little bit about like the beginning intuitions.
I think when I first started, this really wasn't supposed to be something that was like
a toolkit that people use.
it was more just like a system.
And the way I wanted to think about the system was more a thought exercise of how language models,
but their reasoning capabilities,
if you just treat them as like brains can organize information and then traverse it.
So I didn't want to think about embeddings, right?
To me, embeddings just felt like it was just an external thing that was like,
well,
it was just external to try to actually tap into the capabilities of language models themselves.
Right?
I really wanted to see, you know,
just as like a human brain could like synthesize stuff.
Could we create some sort of like structure where the,
there's this like neural CPU, if you will,
can organize a bunch of information, you know, auto-summerized a bunch of stuff, and then also
traverse the structure that I created. That was the inspiration for this initial, like,
tree index. To be honest, and I think I said this in the first tweet, it actually works super
well, right? Like, GPD-4 obviously is much better at reasoning. Like, I'm one of the first
to say, like, you know, you shouldn't use anything pre-GPD4 for anything that requires, like,
complex reasoning, because it's just going to be unreliable. Okay, disregarding stuff like fine-tuning.
But it worked okay, but I think it definitely struck a chord with kind of like the Twitter
crowd, which is just like new ideas at the time, I guess just like thinking about how you can
actually bake this into some sort of application, because I think what I also ended up discovering
was the fact that there was starting to become a wave of developers building on top of Tripiti3,
and people were starting to realize that what makes them really useful is to apply them
on top of your personal data. And so even if the solution itself was kind of like primitive at the
time, like the problem statement itself was very powerful. And so I think being motivated by the
problem statement, right, like this broad mission of how do I unlock elements on top of the
data also contributed to the development of Lama Index to the state it is today.
And so I think part of the reason, you know, our toolkit has evolved beyond the just existing
set of like data structures is we really try to take a step back and think, okay, what exactly
are the tools that would actually make this useful for developer?
And then, you know, somewhere around December, we made an active effort to basically like push
towards that direction, make the code base more modular, right, more friendly as an open source
library.
And then also start adding in like embeddings, start thinking into,
practical considerations like latency cost performance, those types of things.
And then really motivated by that mission, like start expanding the scope of the toolkit
towards like covering the life cycle of like data injection and querying.
Where you also added Lama Hub and I don't know.
Yeah.
So I think that was in like January on the data loading side.
And so we started adding like some data loaders, saw an opportunity there, started adding
more stuff on the retrieval querying side, right?
We still have like the core data structures, but how do you actually make them more modular
and kind of like decouple storing state from the types of like queries I could run on top of this a little bit.
And then starting to get into more complex interactions like chain of thought reasoning, routing and, you know, like agent loops.
You and I spent a bunch of time earlier this year talking about Lama Hub, what that might become.
You were still a robust.
When did you decide it was time to start the company and then start to think about what Lama index is today?
Yeah, I mean, probably December.
It was kind of interesting.
I was getting something down from initial V-C...
I was talking about this project.
And then in the beginning, I was like, oh, yeah, you know, this is just like a design project.
But, you know, what about my other idea on, like, video data, right?
And I was trying to get, yeah, there are thoughts on that.
And then everybody was just like, oh, yeah, whatever.
Like, that part's, like, a crowded market.
And then it became clear that, you know, this was actually a pretty big opportunity.
And, like, coincidentally, right, like, this actually did relate to, like, my interests
have always been at the intersection of AI data and kind of, like, building practical
applications and it was clear that this was evolving into a much bigger opportunity than the previous
idea was. So around December and then I think I gave a pretty long notice but I left
officially like early March. What were your thinking in terms of like moats and you know founders
kind of like overthinking sometimes? You obviously had like a lot of open source love and like a lot
of community and yeah like were you ever thinking okay I don't know this is maybe not enough to
start a company or did you always have conviction about it? Oh no I mean.
100%. I felt like I did this exercise, like, honestly, probably more late December and then early
January, because I was just existentially worried about whether or not this would actually be a
company at all. And, okay, what were the key questions I was thinking about? And these were the same
things that like other founders, investors and also like friends would ask me is just like,
okay, what happens if context windows get much bigger? What's the point of actually structuring data,
right, in the right way? Why don't you just dump everything into the prompt? Fine-tuning? Like,
what if you just train the model over this data?
And then, you know, what's the point of doing this stuff?
And then some other ideas is what if like OpenEI actually just like takes this,
like builds upwards on top of the,
their existing like foundation models and starts building in some like built-in
orchestration capabilities around stuff like rag and agents and those types of things.
And so I basically ran through this mental exercise.
And, you know, I'm happy to talk a little bit more about those thoughts as well.
But at a high level, well, context windows have gotten bigger,
but there's obviously still a need for RAG.
I think RAG is just like one of those things that like in general, what people
care about is yes, they do care about performance, but they also care about stuff like
latency and costs.
And my entire reasoning at the time was just like, okay, like, yes, maybe we'll have like much
bigger context windows, as we've seen with like 100K context windows.
But for enterprises like, you know, data, which is not in just like the scale of like a few
documents, it's usually in like gigabytes, terabytes, petabytes.
How do you actually just unlock language models over the?
that data, right? And so it was clear there was just like, whether it's RAG or some other
paradigm, no one really knew what that answer was. And so there was clearly like technical opportunity
here. Like there was just stacks that need to be invented to actually solve this type of problem
because language models themselves didn't have access to this data. The other piece here is just
like, and so if like you just dumped all this data into, let's say a model had like hypothetically
an infinite context window, right? And you just dumped like 50 gigabytes of data into a context window.
That just seemed very inefficient to me because you have these network transfer.
a class of uploading 50 gigabytes of data to get back a single response.
And so I kind of realize, you know, there's always going to be some curve, regardless of
the performance of the best performing models, of like, cost versus performance.
What Ragt does is it does provide extra data points along that access because you can kind
control the amount of context you actually wanted to retrieve.
And of course, like, Rag as a term was still evolving back then, but it was just this whole
idea of like, how do you just fetch a bunch of information to actually, you know, like, stuff
into the prompt. And so people, even back then,
were kind of thinking about some of those considerations.
And then you fundraised in June,
well, you announced your fundraise in June.
Yeah. With Greylock,
take us through that process of thinking about the fundraise
and your plans for the company,
you know, at the time.
Yeah, definitely. I mean, I think we knew we wanted to,
I mean, obviously we knew we wanted to fundraise.
There was also a bunch of like investor interest and it was probably
pretty unusual given the, you know, like,
hype wave of generative AI. So like a lot of investors
are kind of reaching out around like December, January, February.
in the end we went with Greylock.
Greylock's great.
They've been great partners so far.
And to be honest, like, there's a lot of, like, great VCs out there.
And a lot of them who are specialized on, like, open source, data infra and that type of stuff.
What we really wanted to do was because for us, like, time was of the essence.
Like, we wanted to ship very quickly and still kind of build mind sharing in the space.
We just kept the fundraising process very efficient.
I think we basically did it in, like, a week or like three days.
And so just, like, front-loaded it and then just, like...
You picked the one named Jerry.
Yeah, exactly.
Yeah, I'm kidding.
I mean, he's obviously great and Greylock's a fantastic firm.
Embedding similar research.
So, yeah, just we've had Greylock.
They've been great partners.
I think in general, when I talk to founders about the fundraise process,
it's never like the most fun period, I think,
because it's always just like, you know,
there's a lot of logistics, there's lawyers you have to, you know,
get in the loop.
And then, like, a lot of founders just want to go back to building.
I think in the end we're happy that we kept it to a pretty efficient process.
And so you fundraise with Simon.
How do you split things with him?
How big is your team now?
The team is growing.
By the time this podcast is released, we'll probably have had one more person join the team.
So basically, it's between we're rapidly getting to like eight or nine people.
At the current moment, we're around like six.
And so just like there will be some exciting developments in the next few weeks.
I'm excited to announce that.
So the team has kind of like we've been pretty selective in terms of how we like grow the team.
Obviously, like we look for people that are really active in terms of contributions to
Lom Index, people that have very strong engineering backgrounds.
And primarily, we've been kind of just looking for builders, people that kind of like grow the open source and also eventually this like managed like enterprise platform as well with us.
In terms of like Simon, yeah, I've known Simon for a few years now.
I knew him back at Uber ATG in Toronto.
He's one of the smartest people I knew has a sense of both like a deep understanding of ML, but also just like first principles thinking about like engineering and technical concepts in general.
And I think one of my criteria is when I was like looking for a co-founder for this project.
with someone that was like technically better than me because I knew I wanted like a CTO.
And so honestly like there weren't a lot of people that I mean there's I know a lot of people that are smarter than me.
But like that fit that bill. We're willing to do a startup and also just had the same like values that I shared. Right.
And just I think doing a startup is very hard work. Right. It's not like I'm sure like you guys all know this. It's it's a lot of hours. A lot of late nights.
And you want to be like in the same place together and just like being willing to hash out stuff and have that grit basically. And I really look for that.
And so Simon really fit that bill. And I think.
I convince them to bring a jump on board.
Yeah.
And obviously, I've had the pleasure of chatting and working with a little bit with both of you.
What we say, those, like, your top, like, one or two values are when thinking about that or
the culture of the company and that kind of stuff?
I think in terms of the culture of the company, it's really like, I mean, there's a few things
I can aim off the top my head.
One is just, like, passion, integrity.
I think that's very important for us.
We want to be honest.
We don't want to, like, obviously, like, copy code or kind of like, you know, just, like,
you know, not give attribution, those types of things.
and just like be true to ourselves.
I think we're all very like down to earth like humble people.
But obviously I think just willingness to just like own stuff and dive right in.
And I think grit comes with it.
I think in the end, like this is a very fast moving space.
And we want to just like be one of the, you know, like dominant forces and helping to provide like production quality all in applications.
Yeah.
I promise we'll get to a more technical question.
But I also want to impress on the audience that this is a, you know, very conscious and intentional company building.
And since your fundraising post, which is in June, and now it's September, so it's been about three months, you've actually gained 50% in terms of stars and followers.
You've 3x your download count to 600,000 a month and your Discord membership has reached 10,000.
So like a lot of ongoing growth.
Yeah, definitely.
And obviously there's a lot of room to expand there too.
And so open source growth is going to continue to be one of our core goals.
Because in the end, it's just like we want this thing to be, well, one big.
We all have, like, big ambitions, but to just, like, really provide value to developers in helping them in prototyping and also productionization of their apps.
And I think it turns out we're in the fortunate circumstance where a lot of different companies and individuals, right, are in that phase of like, you know, maybe they've hacked around on some initial L on applications.
But they're also looking to, you know, start to think about what are the production grade challenges necessary to actually, that to solve to actually make this thing robust and reliable in the real world.
And so we want to basically provide the tooling to do that.
And to do that, we need to both spread awareness and education of a lot of the key practices of what's going on.
And so a lot of this is going to be continued growth, expansion, and education.
And we do prioritize that very happily.
Let's dive into some of the questions you were asking yourself initially around fine-tuning and rag, how these things play together.
You mentioned context.
What is the minimum viable context for rag?
So what's like a context window too small?
And at the same time, maybe what's like a maximum context window, we talked before about the LLMs are U-shaped reasoners.
So as the context got larger, like it really only focuses on the end and the start of the prompt and then it kind of peters down.
Any learnings, any kind of like tips you want to give people as they think about it.
So this is a great question.
And part of what I wanted to talk about a conceptual level, especially with the idea of like thinking about what is a minimum context.
like, okay, what if the minimum context was like 10 tokens versus like, you know,
2K tokens versus like a million tokens, right?
Like, and what does that really give you?
And what are the limitations if it's like 10 tokens?
It's kind of like, like, 8-bit, 16-bit games, right?
Like, back in the day, like, if you play Mario and you have like the initial Mario
where the graphics were very blocky and now obviously it's like full HD 3D,
just the resolution of the context and the output will change depending on how much context
you can actually fit in.
The way I kind of think about this from a more principal manner is like you have like,
there's a concept of like information capacity, just this idea of entropy, like given any
fixed amount of like storage space, like how much information can you actually compact in there?
And so basically a context bundle length is just like some fixed amount of storage space, right?
And so there's some theoretical limit to the maximum amount of information and compact
until like a 4,000 token storage space.
And what does that storage space use for these days with LMs?
It's for inputs and also outputs.
And so this really controls a maximum amount of information.
you can feed in terms of the prompt plus the granularity of the output.
If you had an infinite context window, you could have an infinitely detailed response and also
infinitely detailed memory.
But if you don't, you can only kind of represent stuff in more quantized bits, right?
And so the smaller context window, just generally speaking about less details and maybe the
less, like specific precise information you're going to be able to surface any given point in time.
When you have short context, is the answer just like get a better model?
Or is the answer maybe, hey, there needs to be a best?
balance between fine-tuning and rag to make sure you're going to leverage the context,
but at the same time, don't keep it too low resolution.
Yeah, yeah.
Well, there's probably some minimum threat.
I don't think anyone wants to work with like a 10.
I mean, that's just a thought exercise anyways, 10 token context window.
I think nowadays the modern context one does like 2K, 4K is enough for just like doing some
sort of retrieval on granular contacts and be able to synthesize information.
I think for most intense and purposes, that level of resolution is probably fine for
most people for most use cases.
I think the question there is just like the limitations actually more on, okay, if you're going to actually combine this thing with some sort of retrieval data structure mechanism, there's just limitations on the retrieval side because maybe you're not actually fetching the most relevant context to actually answer this question.
Right.
Like, yes, like given the right context, 4,000 tokens is enough.
But if you're just doing like top case similarity, like you might not be fetching the right information from the documents.
So how should people think about when to stick with rag versus when to.
even entertain. And also in terms of what's like the threshold of data that you need to actually
worry about fun tuning versus like just stick with rag. Obviously you're biased because you're building
a rack company. No, no, actually. I think I have like a few hot takes in here, some of which
sound like a little bit contradictory or what we're actually building. And I think to be honest,
I don't think anyone knows the right answer. I think this is something. The truth. Yeah, exactly.
This is just like thought exercise towards like understanding the truth. Right. So, okay, I have a few
hot takes. One is like rag is basically just just a hack. That turns out it's a very good hack.
because what is rag rag is you keep the model fixed and you just figure out a good way to like stuff stuff into the prompt of the language model.
And everything that we're doing nowadays in terms of like stuffing stuff into the prompt is just algorithmic.
We're just figuring out nice algorithms to like retrieve right information with top case similarity, do some sort of like hybrid search, some sort of like a chain of thought, decomp and then it's just like stuff stuff into the prompt.
So it's all like algorithmic and it's more like just software engineering to try to make the most out of these like existing.
APIs. The reason I say it's a hack is just like from a pure like optimization standpoint,
if you think about this from like the machine learning lens, unless the software engineering lens,
there's pieces in here that are going to be like suboptimal, right? Like like the thing about machine
learning is when you optimize like some system that can be optimized within machine learning,
like the set of parameters, you're really like changing like the entire system's weights to
try to optimize the subjective function. And if you just cobble a bunch of stuff together,
you can't really optimize the pieces that are inefficient. Right. And so like a
retrieval interface, like doing top K embedding lookup, that part is inefficient. If you, for instance,
because there might be potentially a better, more learned retrieval algorithm, that's better.
If, you know, you do stuff like some sort of, I know nowadays there's this concept of how do you do
like short-term long-term memory, represent stuff in some sort of vector embedding, do chunk sizes,
all that stuff, it's all just like decisions that you make that aren't really optimized. And it's not
really automatically learned. It's more just things that you set beforehand to actually feed into the
system. So I do think, like, there's a lot of room to actually optimize the performance of an
entire LLM system, potentially in a more like machine learning based way, right? And I will leave
room for that. And this is also why I think, like, in the long term, I do think fine tuning will
probably have, like, greater importance. And just like, there will probably be new architectures
invented where you can actually kind of like include a lot of this under the black box as
opposed to having, like, hobbling together a bunch of components outside the black box.
That said, just very practically, given with the current state of things, like, even if I said
rag is a hack, it's a very good hack, and it's also very easy to use, right? And so just like,
for kind of like the AI engineer persona, which to be fair is kind of one of the reasons
generative AI has gotten so big is because it's way more accessible for everybody to
get into, as opposed to just like traditional machine learning, it tends to be good enough,
right? And if we can basically provide these existing techniques to help people really
optimize how to use existing systems
without having to really deeply understand
machine learning, I still think that's a huge value add.
And so there's very much like a
UX and ease of use problem here, which is just
like RAG is way easier to onboard and use.
And that's probably like the primary reason
why everyone should do Ragn instead of fine tune to begin with.
If you think about like the 80-20 rule,
like Ragh very much fits within that and fine-tuning
doesn't really right now.
And then I'm just kind of like leaving room for the future that,
you know, like in the end,
fine-tuning can probably take over some of the aspects of like
what Rag does.
I don't know if this is mentioned in your explainability also allows for sourcing.
And at the end of the day, to increase trust, we have to source documents.
Yeah.
So I think what RAG does is it increases like transparency, visibility into the actual documents
that are getting fed into their context.
Here's where you got it from.
Exactly.
That's definitely an advantage.
I think the other piece that I think is an advantage.
And I think that's something that someone actually brought up is just you can do access
control with Ragn if you have an external storage system.
You can't really do that.
with large language models, which is like gate information to the neural net weights,
like depending on the type of user.
For the first point, you could technically,
you could technically have the language model,
like if it memorized enough information,
just as like a site sources,
but there's a question of just trust whether or not you're actually.
Yeah, well, but like it makes it up right now because it's like not good enough.
But imagine a world where it is good enough and it does give accurate citations.
Yeah, no, I think to establish trust, you just need a direct connection.
So it's kind of weird.
Is this melding of deep learning systems versus very traditional information retrieval?
Yeah, exactly.
Well, so I think, I mean, I kind of think about it as analogous to like humans, right?
Like we as humans, obviously, we use the internet.
We use tools.
These tools have API interfaces are well defined.
And obviously, we're not, like, the tools aren't part of us.
And so we're not like back propping or optimizing over these tools.
And so when you think about like RAG, it's basically LLM is learning how to use like a vector database to look up
information that it doesn't know. And so then there's just a question of like how much information
is inherent within the network itself and how much doesn't need to do some sort of like tool use
to look up stuff that it doesn't know. And I do think there'll probably be more or more of that interplay
as time goes on. Yeah. Some follow-ups on discussions that we've had, you know, we discussed fine-tuning
a bit. And what's your current take on whether you can you can find tune new knowledge into
elements? That's one of those things where I think long-term, you definitely can. I think some people
say you can't. I disagree. I think you definitely can. Just right now, I haven't gotten it to work yet.
So I think like,
you've tried.
Yeah,
well,
not in a very principled way, right?
Like this is something that requires like an actual research scientist and not someone
that has like,
you know,
an hour or two per night to actually look at this.
Like,
you were research sciences at Uber.
Yeah,
but it's like full time,
full time looking at.
So I think what I specifically concretely did was I took open AI's fine
tuning endpoints and then tried to,
you know,
it's in like a chat message interface.
And so there's like input question,
like user assistant message format.
And so what I did was I try to take just some piece of text and have the
I'll memorize it by just asking a bunch of questions about the text.
So given a bunch of context, I would generate some questions and then generate some
response and just fine tune over the question responses.
That hasn't really worked super well.
But that's also because I'm just like trying to like use open AI's endpoints as is.
If you just think about like traditional like how you train a Transformers model,
there's kind of like the instruction like fine tuning aspect, right?
You like ask itself and guide it with correct responses.
But then there's also just like next token production.
And that's something that.
You can't really do with the opening eye API, but you can do with if you just train it yourself.
And that's probably possible if you just like train it over some corpus of data.
I think Shishira from Berkeley said like, you know, when they trained guerrilla, they were like,
oh, you know, a lot of these albums are actually pretty good at memorizing information.
Just the way the API interface is exposed is just no one knows how to use them right now.
And so I think that's probably one of the issues.
Just to clue people then who haven't read the paper, Gorilla is the one where they trained to use specific APIs.
Yeah, I think this was on a guerrilla paper.
the model itself could try to learn some prior over the data to decide what tool to pick,
but it's also augmented with retrieval that helps supplement it in case the prior doesn't actually
work. Is that something that you'll be interested in supporting? I mean, I think in the long term,
like, if like this is kind of how fine-tuning like Ragn evolves, like I do think there will be
some aspect or fine-tuning will probably memorize some high-level concepts of knowledge,
but then like RAG will just be there to supplement like aspects to that, that,
aren't work that don't, that it doesn't know.
The way I think about this is kind of like, obviously, rag is the default way.
Like, to be clear, rag right now is the default way to actually augment stuff with knowledge.
I think it's just an open question of how much the LM can actually internalize both high-level
concepts, but also details as you can like train stuff over it.
And coming from an ML background, there is a certain beauty and just baking everything into some
training process of a language model.
like if you just take raw chat GPD or chat GPD code interpreter, right, like GPD4,
it's not like you do rag with it.
You just ask it questions about like, hey, how do I like to find a pedantic model in Python?
And then like, can you give me an example?
Can you visualize a graph for me?
It just does it.
Right.
And we'll run it through code interpreter as a tool, but that's not like a source for knowledge.
It's just an execution environment.
And so there is some beauty in just like having the model itself, like just, you know,
instead of you kind of defining the algorithm for what the data structure should look like,
the model just learns it under the head.
That said, I think the reason.
it's not a thing right now is just like no one knows how to do it. It probably costs too much money.
And then also like the API interfaces and just like the actual ability to kind of evaluate
and improve on performance like isn't known to most people. Yeah. It also would be better with
browsing. Yeah. I wonder when they're going to put that back. Okay. Yeah. So and then one more
follow up before we go into RAG for AI engineers is on your brief mention about security or off.
How many of your the people that you talk to, you know, you talk to a lot.
of people putting Lama Index into production, how many people actually are there versus just
like, let's just dump a whole company notion into this thing.
Wait, or talking about from like the security auth standpoint?
Yeah.
How big a need is that?
Because I talk to some people who are thinking about building tools in that domain, but I don't
know if people want it.
I mean, I think bigger companies, like just bigger companies, like banks, consulting firms,
like they all want this.
Yeah, it's a requirement, right?
The way they're using Lama Index is not with this, obviously.
I don't think we have support for, like, access control or author that have stuff like on a hood, because we're more just like an orchestration framework. And so the way they build these initial apps is more kind of like prototype. Like let's kind of, yeah, like, you know, use some publicly available data. That's not super sensitive. Let's like, you know, assume that every user is going to be able to have access to the same amount of knowledge, those types of things. I think users have asked for it, but I don't think that's like a P-0. Like, I think the P-0 is more on like, can we get this thing working before we expand this to like more users within the work? There's a bunch of pieces to rag, obviously. It's not.
not just an acronym.
And Et2 recently,
you think every AI engineer
should build a front scratch at least once.
Why is that?
I think so.
I'm actually kind of curious
to hear your thoughts about this,
but this kind of relates to
the initial AI engineering posts that you put out.
And then also just like the role of an AI engineer
and the skills that they're going to have to learn to truly succeed
because there's an entire spectrum.
On one end,
you have people that don't really like understand the fundamentals
and just want to use this to like cobble something together
to build something.
And I think there is a beauty in that for what it's worth.
Like, it's just one of those things.
And Gen AI has made it so that you can just use these models in inference only mode,
cobble something together, use it power your app experiences.
But on the other end, what we're increasingly seeing is that, like,
more and more developers building with these apps start running into,
honestly, like pretty similar issues that, like, will plague just a standard ML engineer
building like a class for our model, which is just like accuracy problems.
Like, and hallucination is basically just an accuracy problem, right?
Like, it's not giving you the right results.
So what do you do?
You have to iterate on the model itself.
you have to figure out what parameters you tweak,
you have to gain some intuition about this entire process.
That workflow is pretty similar, honestly,
even if you're not training the model to just like tuning an ML model with like hyperparameters
and learning like proper ML practices of like,
okay,
how do I have like define a good evaluation benchmark?
How do I define like the right set of metrics to do,
to use, right?
How do I actually iterate and improve the performance of this pipeline for production?
What tools do I use, right?
Like every ML engineer use like some form of weights and biases,
tensor boards or like some of,
some other experimentation tracking tool,
what tools should I use to actually help build
like LLM applications and optimize it for production?
There's like a certain amount of just like LLM ops
like tooling and concepts and just like practices
that people kind of have to internalize if they want to optimize these.
And so I think that the reason I think being able to build like rag from scratch is important
is it really gives you a sense of like how things are working to help you build intuition
about like what parameters are within a RAG system
and which ones actually tweak to make them better.
Because otherwise, I think that one of the advantages of the Laman Dax quick start is it's three lines of code.
The downside of that is you have zero visibility into what's actually going on under the hood.
And I think this is something that we've kind of been thinking about for a while.
And I'm like, okay, let's just release like a new tutorial series.
That's just like we're instead not no three lines of code.
We're just going to go in and actually show you how the thing actually works on that hood.
Right.
And so like does everybody need this?
Like probably not.
As for some people, the three lines of code might work.
But I think increasingly like honestly, 90% of the users,
I talk to have questions about how to improve the performance of their app.
And so just like given this is just like one of those things, it's like better for the
understanding.
Yeah.
I'd say it is one of the most useful tools of any sort of developer education toolkit
to write things to yourself from scratch.
So Kelsey Hightower famously wrote Kubernetes the hard way, which is don't use Kubernetes.
Here's everything that you would have to do by yourself.
And you should be able to put all these things together yourself to understand the value of Kubernetes.
And the same thing for Lama Index.
I was the guy who did the same for React.
And it's a pretty good exercise for you to just fully understand everything that's going on under the hood.
And I was actually going to stress, well, in one of the previous conversations,
there's all these hyperparameters, like the size of the chunks and all that.
And I was thinking, like, what would hyperparaminal optimization for RAG look like?
Yeah, definitely.
I mean, so absolutely.
I think that's going to be an increasing thing.
I think that's something we're kind of looking at.
I think someone should just do like some large-scale study and then just ablate everything.
And just, you tell us.
I think it's going to be hard to find a universal default that works for everybody.
I think it's going to be someone.
Boo.
I do think it's going to be somewhat dependent on the data in use case.
I think if there was a universal default, that would be amazing.
But I think increasingly we found, you know, people are just defining their own, like, custom parsers for like PDFs,
markdown files for like, you know, SEC filings versus like Slack conversations.
And then like the use case too, like, do you want like a summarization, like the granular
of the response. Like, it really affects the parameters that you want to pick. I do like the idea of
hyperparameter optimization, though. But it's kind of like one of those things where you are kind of
like training the model, basically, kind of on your own data domain. Yeah. You mentioned custom
parses. You've designed Lama Index. Maybe you can talk about like the surface area of the
framework. You designed Lama Index in a way that it's more modular, like you mentioned. How would
you describe the different components and what's customizable in each? Yeah, I think they're all
customizable. And I think that there is a certain burden on
to make that more clear through the docs.
Well, number four is customization tutorials.
Yeah, yeah.
But I think, like, just in general, I think we do try to make it so that you can plug in the out-of-the-box stuff.
But if you want to customize more lower-level components, like, we definitely encourage
you to do that and plug it into the rest of our abstractions.
So let me just walk through, like, maybe some of the basic components of Llamindux.
There's data loaders.
You can load data from different data sources.
We have Lava Hub, which you guys brought up, which is, you know, a collection of different
data loaders of, like, unstructured and unstructured data, like PDFs, file types, like,
Slack notion,
all that stuff.
Now you load in this data.
We have a bunch of like parsers and transformers.
You can split the text.
You can add metadata to the text.
And then basically figure out a way to load it into like a vector store.
So I mean,
you worked at like Airbright, right?
It's kind of like there is some aspect like E&T, right?
And in terms of like transforming this data.
And then the L, right,
loading it into some storage abstraction,
we have like a bunch of integrations with different document storage systems.
So that's data.
And then the second piece really is about like,
how do you retrieve this data?
How do you, like, synthesize this data and how do you, like, do some sort of higher level
reasoning over this data?
So retrieval is one of the core abstractions that we have.
We do encourage people to, like, customize to find your own retrievers.
That section on, like, how do you define your own, like, customer retriever?
But also we have, like, out-the-box ones.
The retrieval algorithm kind of depends on how you structure the data, obviously.
Like, if you just flat index everything with, like, chunks with, like, embeddings,
then you can really only do, like, top K lookup plus maybe, like, keyword search or something.
but if you can index it in some sort of like hierarchy, like to find relationships,
you can do more interesting things, like actually traverse relationships between nodes.
Then after you have this data, how do you like synthesize the data?
And this is the part where you feed it into the language model.
There's some response abstraction that can abstract away over like long context to actually still give you a response,
even if the context overflow is a context window.
And then there's kind of these like higher level like reasoning primitives that I'm going to define broadly.
And I'm just going to call them in some general bucket of like agents,
even though everybody has different definitions of agents.
But you're the first to data agents, which I was very excited.
Yeah, we kind of like coined that term.
And the way we thought about it was, you know,
we wanted to think about how to use agents for like data workflows, basically.
And so what are the reasoning primitives that you want to do?
So the most simple reasoning primitive you can do is some sort of routing module.
It's a classifier.
Like given a query, just make some automated decision on what choice to pick, right?
You could use LMs.
You don't have to use LLMs.
You could just train classfire, basically.
That's something that we might actually explore.
And then the next piece is, okay, what are some higher level things?
You can have the LM, like, define like a query plan, right, to actually execute over the data.
You can do some sort of while loop, right?
That's basically what an agent loop is, which is like React, a chain of thought,
like the open AI function calling like while loop to try to like take a question and try to break it down into some,
some series of steps to actually try to execute to get back a response.
And so there's a range in complexity from like simple reasoning primitives to more advanced ones.
The way we kind of think about it is like, which ones should we implement and how do they
work well, like, do they work well over, like, the types of, like, data tasks that we give
them?
How do you think about optimizing each piece?
So take embedding models, there's one piece of it.
You offer fine-tuning embedding models.
And I saw it was like fine-tuning gives you, like, 5-10% increase.
What's kind of like the delta left on the embedding side?
Do you think we can get models that are like a lot better?
Do you think, like, that's one piece where people should really not spend too much time?
I just think it's not the only parameter because I think in the end, if you think,
about everything that goes into retrieval, the chunking algorithm, how you define like metadata
will bias your embedding representations. Then there is the actual embedding model itself,
which is something that you can try optimizing. And then there's like the retrieval algorithm.
Are you going to just do top K? Are you going to do like hybrid search? Are you going to do auto retrieval?
Like there's a bunch of parameters. And so I do think it's something everybody should try.
I think by default, we use like open AI's embedding model. A lot of people these days use like
sentence transformers because it's just like free open source and you can actually optimize,
directly optimize it. This is an active area of exploration. I do think one of our goals is it should
ideally be relatively free for every developer to just run some fine-tuning process over their data
to squeeze out some more points in performance. And if it's that relatively free and there's no downsides,
everybody should basically do it. There's just some complexities, right, in terms of optimizing your
abetting model, especially in a production-grade data pipeline. If you actually fine-tune with the embedding
model and the embedding space changes, you're going to have to re-index all your documents. And for a lot of
people that's not feasible. And so I think like Joe from Vespa on our webinar is like there's this
idea that depending on if you're just using like document and query embeddings, you could keep
the document embeddings frozen and just train a linear transform on the query or any sort of
transform on the query. So therefore it's just a query side transformation instead of actually
having to re-index all the document and beddings. That's pretty smart. We weren't able to get like
huge performance gains there, but it does like improve performance a little bit. And that's
something that basically, you know, everybody should be able to kick off. You can actually do that in Lama
index too. Open AI has a cookbook on
adding bias to the admittings too,
right? Yeah, there's just like different parameters
that you can try adding to try to like
optimize the retrieval process. And the idea
is just like, okay, by default,
you have all this text. It kind of lives
in some latent space,
right?
Yay! You should take a drink every time.
But it lives in some latent space.
But like depending on the type,
specific types of questions that the user might
want to ask, the latent space might not be
optimized to actually retrieve
the relevant piece of context that the user want to ask.
So can you shift the embedding points a little bit, right?
And how do we do that, basically?
That's really the key question here.
So optimizing the embedding model, even changing the way you like chunk things, these all shift
the embeddings.
So the retrieval is interesting.
I got a bunch of startup pitches that are like, like, rag is cool, but like there's a lot
of stuff in terms of ranking that could be better.
There's a lot of stuff in terms of sunsetting data once it starts to become stale,
that could be better.
Are you going to move into that part too?
So you have SECE Insights is one of kind of like your demos.
And that's like a great example of, hey, I don't want to embed all the historical documents
because a lot of them are outdated and I don't want them to be in the context.
What's that problem space like?
How much of it are you going to also help with and versus how much you expect others to take care of?
Yeah, I'm having to talk about SEC Insights in just a bit.
I think more broadly about the like overall retrieval space, we're very interested in it
because a lot of these are very practical problems that people have access.
And so the idea of outdated data, I think,
how do you like deprecate or time weight data and do that in a reliable manner, I guess,
so you don't just like set some parameter and all of a sudden that affects all your retrieval
arguments, like is pretty important because people have started bringing that up.
Like I have a bunch of duplicate documents, things get out of date, how do I like sunset documents?
And then remind me what was the first thing you said?
Because I think there was something.
Yeah, like the ranking.
Yeah.
So I think this space is not new.
I think everybody who is new to this space starts learning some basic concepts of information
retrieval, which, to be fair, has been around for quite a bit. But our goal is to kind of like
take some of like just general ranking and information retrieval concepts. So by encoding,
like cross encoding, right, like word-based models versus like kind of keyword-based search,
how do you actually evaluate retrieval? These things start becoming relevant. And so I think for us,
like, rather than inventing like new retriever techniques for the sake of like just inventing better
ranking, we want to take existing ranking techniques and kind of like package in a way that's
like intuitive and easy for people to understand. That said, I think there are interesting and new
retrieval techniques that are kind of in place that can be done when you tie it into some downstream
rack system. The reason for this is just like if you think about the idea of like chunking text,
right, like that just really wasn't a thing or at least for this specific purpose. Like the reason
chunking is a thing in rag right now is because like you want to fit within the context bundle of
an LM, right? Like why do you want to chunk a document? That just was less of a thing, I think, back then.
If you wanted to like transform a document, it was more for like structured data extraction or something in the past.
And so there's kind of like certain new concepts that you got to play with that you can use to invent kind of more interesting retrieval techniques.
Another example here is actually LM based reasoning, like LLM based chain of thought reasoning.
You can take a question, break it down into smaller components and use that to actually ascend to your retrieval system.
And that gives you better results than kind of like sending the full question to a retrieval system.
That also wasn't really a thing back then.
But then you can kind of figure out an interesting way of like blending old and the new, right, with LMs and data.
There's a lot of ideas that you come across.
Do you have a store of them?
Yeah, I think sometimes I get like inspiration.
There's like some problem statement.
And I'm just like, oh, let's like hack this out.
Following you is very hard because it's just a lot of homework.
So I think I've started to like step on the brakes just a little bit because then I start.
No, no, no.
Well, the reason is just like, okay, if I just have invent like a hundred more retrieval techniques, like, sure.
But how do people know which one is good and which one's bad, right?
And so have a librarian, right?
Like, it's going to catalog it and go like.
You're going to need some like benchmarks.
And so I think that's probably the focus for the next few weeks is actually like properly kind of like having an understanding of like, oh, you know, when should you do this?
Or like, does this actually work well?
Yeah.
Some kind of like maybe like a flowchart decision tree type of thing.
Yeah, exactly.
When this do that, you know, something like that would be really helpful for me.
Thank you.
Yeah.
It seems like your most successful side project.
Yeah.
what is SEC Insights for our listeners? Our SEC Insights is a full stack LM chatbot application
that does analysis of your SEC 10K and 10Q filings. And so the goal for building this project is really twofold. The reason we started building this was one, it was a great way to dog food the production readiness for a library. We actually ended up like adding a bunch of stuff and fixing a ton of bugs because of this. And I think it was great because like, you know, thinking about how we handle like callbacks, streaming.
actually generating like reliable sub-responsees and bubbling up sources citations.
These are all things that like, you know, if you're just building the library in isolation,
you don't really think about, but if you're trying to tie this into a downstream application,
like it really starts mattering.
Is this for your error messages?
When you talk about bubbling up stuff?
So like sources, like if you go into SEC Insights and you type something,
you can actually see the highlights in the right side.
That was something that like took a little bit of like understanding to figure out how to build
law.
And so it was great for dog fooding improvement of the library itself.
And then as we're building the app, the second thing,
is we're starting to talk to users and just like trying to showcase like kind of bigger companies,
like the potential of Lomindex as a framework because these days,
obviously building a chat bot with Streamlet or something, it'll take you like 30 minutes or
hour.
Like there's plenty of templates out there on Laman Dex client train.
Like you can just build a chat bot.
But how do you build something that kind of like satisfies some of these,
this like criteria of surfacing like citations, being transparent, seeing like having a good
UX and then also being able to handle different types of questions, right?
Like more complex questions that compare different documents.
that's something that I think people are still trying to explore.
And so what we did was like we showed, well, first, like, organizations and possibilities
of like what you can do when you actually built something like this.
And then after like, you know, we kind of like stealth launched this for fun just as a separate
project just to see if we could get feedback from users who were using this world to see like,
you know, how we could improve stuff.
And then we were thought, we thought like, you know, we built this, right?
Obviously we're not going to sell like a financial app.
Like that's not really our wheelhouse.
But we're just going to open source the entire thing.
And so that now is basically just like a really nice, like, full stack app template you can use and customize on your own, right, to build your own chat.
Whether it is over like financial documents or over like other types of documents.
And it provides like a nice template for basically anybody to kind of like go in and get started.
There's certain components though that like aren't released yet that we're going to in the next few weeks.
Like one is just like kind of more detailed guides on like different modular components within it.
So if you're like a full stack developer, you can go in and actually take the pieces that you want and actually kind of build your own custom flows.
The second piece is, like, there's like certain components in there that might not be directly related to the LLM app that would be nice to just like have people use.
An example is the PDF viewer.
Like the PDF viewer with like citations, I think we're just going to give that.
Right.
So, you know, you could be using any library you want, but then you can just, you know, just drop in a PDF viewer, right?
So that it's just like a fun little module that you can view a plugin.
Nice.
That's a really good community service right there.
I want to talk a little bit about your cloud offering because you mentioned, I forget the name that you had for Enterprise something.
Well, one, we haven't come up with the name.
We're kind of calling it Lomindex platform.
Platform, Lomindex Enterprise.
Open to suggestions here.
And the second thing is, I don't actually know how much I can share right now
because it's mostly kind of like...
In design.
Yeah, exactly.
To the extent that you can talk about Lom Index as a business,
always just want to give people in the mind like, hey, like, you sell things too.
You know what I mean?
Yeah, 100%.
So I think the high level of what I can probably say is just like,
I think we're looking at ways of like actively kind of complement
the developer experience like building Lama Index.
We've always been very focused on stuff around like
plugging in your data into the language model.
And so can we build tools that help like augment that experience
beyond the open source library?
Right.
And so I think what we're going to do is like make a build an experience
where it's very seamless to transition from the open source library
with like a one line toggle.
You can basically get this like complementary service and then figure out a way to like
monetize in a bit.
I think our revenue focus this year is less emphasized.
It's more just about, can we build some manage offering that provides complementary value to what the open source library provides.
I think it's the classic thing about all open source is you want to start building the most popular open source projects in your category to old net category.
You're going to make it very easy to host.
Therefore, you just built your biggest competitor, which is you.
I think it'll be a complimentary because I think it will be like, you know, use the open source library.
And then you have a toggle and all of a side, you know, you can see this.
basically like a pipeline-ish thing,
pop up, and then it'll be able to kind of like,
you'll have a UI, there'll be some enterprise guarantees,
and the end goal would be to help you build
like a production rag app more easily.
Data loaders, there's a lot of them.
What are maybe some of the most popular,
maybe under, not underrated, but like under-expected,
you know?
And how has the open source side of it helped
with like getting a lot more connectors?
You only have six people on the team today,
so you can have done it all, your cell phone.
Oh, for sure.
Yeah, I think the nice thing about like Blama Hub itself, it's supposed to be a community-driven hub.
And so actually the bulk of the peers are completely community contributed.
And so we haven't written that many like first-party connectors actually for this.
It's more just like kind of encouraging people to contribute to the community.
In terms of the most popular tools or the data loaders, I think we have Google analytics on this.
And I forgot the specifics.
It's some mix of like the PDF loaders.
We have like 10 of them, but there's some subset of them that are popular.
And then there is Google, like I think Gmail and GDrive.
And then I think maybe it's like one of Slack or a notion.
One thing I will say though, and I think like SWIX might probably knows this better than
I do, given that you were used to work at Airbyte.
It's very hard to build like, especially for a full on service like Notion, Slack or like
Salesforce to build like a really, really high quality loader that really extracts all the
information that people want.
Right.
And so I think the thing is when people start out, like they will probably use these loaders
and it's a great tool to get started.
And for a lot of people, it's like good.
enough and they submit PRs if they want more additional features. But if you get to a point where you
actually want to call like an API that hasn't been supported yet, or you know, you want to
load in stuff that like in metadata or something that hasn't been directly baked into the logic
of the loader itself, people start adding up like writing their own custom loaders. And that is
a thing that we're seeing. That's something that we're okay with, right? Because like a lot of
this is more just like community driven. And if you want to submit a PR to improve the existing one
you can, otherwise you can create your own custom ones. Yeah. And all that is custom load is all
supported within Lama Index or do you pair it with something else?
Oh, it's just like, I mean, you just define your own subclass.
I think that's it.
Yeah, yeah.
Because typically in the data ecosystem with Airby, everybody has his own strategies for custom
loaders, but also you could write your own with like Dagster or like prefects or one of those tools.
Yeah, yeah, exactly.
So I think for us, it's more, we just have a very flexible like document abstraction.
They get filled in with any content that you want.
Are people really dumping all their Gmail into these things?
You said Gmail is number two.
I'm not sure, actually.
I mean, that's the most private data source.
That's true.
So I'm surprised that people are done with you.
I mean, I'm sure some people are, but I'm sure I'm surprised it's popular.
Well, and then so the LM engine, I assume opening eye is going to be a majority.
Is it an overwhelming majority?
What's the market share between opening eye, coherent, anthropic, you know, whatever you're seeing, open source too?
Yeah, I think it's probably some open eye as a majority, but then like there's anthropological.
And there's also open source. I think there is a lot of people trying out like Lama 2 and some variant of like a top open source model.
Side note, any confusion there, Lama 2 versus Lama?
Yeah, I think whenever I go to these talks, I always open it up with like we started before it.
Yeah, exactly. We start before meta, right? I want to point that out. But no, process of them, we try to use it for like branding.
We just add two llamas when we have like a Lama 2 integrations instead of one Lama. So I think a lot of people are trying out the popular open source models.
There's a lot of toolkits and open source projects that allow you to self-host and deploy Lama too.
And like, O Lama is just a very recent example, I think, that we add an integration with.
And so we just, by virtue of having more of these services, I think more and more people are trying it out.
Do you think there's potential there?
Is that going to be an increasing trend?
Open source?
Yeah.
Yeah, definitely.
I think in general, people hate monopolies.
And so, like, there's a, whenever, like, Open AI has something really cool or like any company has something really cool, even meta.
like there's just going to be a huge competitive pressure from other people to do something that's more open and better.
And so I do think just market pressures will improve like open source adoption.
Last thing I'll say about this, which is just really like it gets clicks.
People like psychologically want that.
But then at the end of the day they fall for brand name and popular and performance benchmarks, you know, at the end of the day opening eye still wins on that.
I think that's true.
But I just think like unless you're like an active employee at opening eye, right?
all these research labs are putting out like ML, like PhDs or kind of like other companies
to they're investing a lot of dollars. There's just going to be a lot of like competitive pressures
developed like better models. So is it going to be like all fully open source with like
permissive license? Like I'm not completely sure. But like there's just a lot of just incentive
for people to develop better stuff here. Have you looked at like rag specific models like contextual?
No. Is it public? No. They literally just, so Dewey Kila, I think is his name. You probably came across
him. He wrote the rag paper at Meta and just started contextual AI to create a rag-specific model.
I don't know what that means. I was hoping that you do because it's your business.
If I had it inside information. I mean, to be honest, I think this kind of relates to my previous
point on like rag and fine-tuning. Like a rag-specific model is a model architecture that's
designed for a better rag. And it's less the software engineering principle of like,
how can I take existing stuff and just plug and play different components into it? And there's
a beauty in that from ease of use and modularity, but when you want to end to un-optimize the thing,
you might want a more specific model. I think building your own models is honestly pretty hard.
And I think the issue is if you also build your own models, like, you're also just going to have
to keep up with like the rate of L in advances. Like how, like basically the question is when GPT
5 and 6 and whatever like Anthropic Cod 3 comes out, how can you prove that you're actually better
than software developers covling together on components on top of a base model, right? Even if it's just
like conceptually, this is better than maybe like GPD3 or GPD4.
What about vector stores?
I know this book says we're in a Chrome sweatshirt.
Yeah, because they use a swag game.
I have the mug.
Yeah.
Yeah.
Yeah.
What do you think there?
Like, there's a lot of them.
Are they pretty interchangeable for like your user's use case?
Is HNSW all we need is the room for improvements there?
It's an IPRAA all we need.
I think, yeah, we try to remain unopinionate about storage providers.
So it's not like, we don't try to like play favor.
So we have a bunch of integrations, obviously.
And the way we try to do is we just tried to find some standard interfaces,
but obviously, like, different vector stores will support kind of like slightly additional things,
like metadata filters and those things.
And the goal is to have our users basically leave it up to them to try to figure out,
like, what makes sense for their use case.
In terms of, like, the algorithm itself, I don't think the delta on, like, improving the vector store,
like, embedding a lookup algorithm is that high.
I think this stuff has been mostly solved, or at least there's just a lot of other stuff
you can do to try to improve the overall performance.
like what?
No, I mean, like, everything else that we just talked about.
Like, in terms of like accuracy, right, to improve rag, like, everything that we talked
about, like, chunking, like, metadata, like.
I mean, I was just thinking, like, maybe for me, the interesting question is, you know,
there are like eight, it's a kind of game of throws.
There's like eight, the war of eight databases right now.
Oh, oh, I see.
How do they stand out?
And how did they become very good partners if Lama Index?
Yeah, we're pretty good partners with most of them.
Let's see.
Well, like, so if you're, you know, vector database founder, like, what do you work on?
It's a good question.
I think one thing I'm very interested in.
is, and this is something I think I've started to see a general trend towards is combining
structured data querying with unstructured data querying. And I think that will probably just
expand the query sophistication of these vector stores and basically make it so that users
don't have to think about whether they separate. Would you call this like hybrid querying?
Is that what Weviates doing?
Yeah, I mean, I think like if you think about metadata filters, that's basically a structured
filter. It's like our select where something equal something. And then you combine that with
semantic search. I think like Lansd-B.
or something was like
trying to do some like joint interface.
The reason is like most data is semi-structured.
There's some structuring annotations and there's some like unstructured text.
And so like somehow combining all the expressivity of like SQL with like the flexibility
of semantic search is something that I think is going to be really important.
We have some basic hacks right now that allow you to jointly query both a SQL database
and like a separate SQL database and a vector store to like combine the information.
That's obviously going to be less efficient that if you just combined it into one system.
And so I think like PG vector, like, you know, that type of stuff.
I think it's starting to get there.
But like in general, like, how do you have an expressive query language to actually do like structured querying along with like all the capability of semantic search?
So your current favorite is just put it into Postgres?
No, no, no.
We don't play it.
The Postgres language, the query language.
I actually don't know what the best language would be for this.
Because I think it will be something that like the model hasn't been fine-tuned over.
And so you might want to train the model over this.
some way of expressing structured data filters.
And this could include time too, right?
It doesn't have to just be like a where clause with this idea of like semantic search.
Yeah, yeah.
And we talked about graph representations.
Yeah, oh yeah, that's another thing too.
And there's like, yeah, so that's actually something I didn't even bring up yet.
Like there's this interesting idea of like, can you actually have the language model?
Like explore like relationships within the data too, right?
And somehow combine that information with stuff that's like more structured within the DB.
Awesome.
What are your strong belief about how to evaluate, right?
I think I have thoughts.
I think we're trying to curate this into some, like, more opinionated principles because
there are some, like, open questions here.
I think one question I had to think about is whether you should do, like, evils, like,
component by component first, or is you just do the end-to-end thing?
I think you should, you might actually just want to do the end-to-end thing first, just to do
a sanity check of whether or not, like, this given a query and the final response, whether
or not it even makes sense.
Like, you eyeball it, right?
And then you, like, try to do some basic evils.
And then once you, like, diagnose what the issue is, then you go into the,
kind of like specific area to find some more solid benchmarks and try to like improve stuff.
So what is N20 evils? Like it's you have a query. It goes in through retrieval system.
You get back something. You synthesize response and that's your final thing. And you evaluate
the quality of the final response. And these days there's plenty of projects like startups,
like companies, research doing stuff around like GPD4, right, as like a human judge to basically
kind of like synthetically generate data. Do you think those will do well? I don't know from the
startup side. I just know from the technical side, I think people are going to do more of it.
The main issue right now is just, it's really unreliable. Like, it's just, like, there's like
variance in the response. Then they won't do more of it. I mean, it's just bad.
No, but these models will get better and you'll probably fine tune a model to be a better judge.
I think that's probably what's going to happen. So I'm like reasonably bullish on this because
I don't think there's really a good alternative beyond you just human annotating a bunch of
data sets and trying to like just manually go through and curating, like, evaluating evalmetrics.
And so this is just going to be a more scalable solution.
In terms of the startups, yeah, I mean, I think there's a bunch of companies doing this.
In the end, it probably comes down to some aspect of, like, UX speed, whether you can fine-tune a model.
So that's end-to-end devils.
And then I think, like, what we found is for RAG a lot of times, like, what ends up affecting this, like, end response is retrieval.
You're not able to retrieve the right response.
And so I think having proper retrieval benchmarks, especially if you want to do production
Rags is actually quite important.
I think what does having good retrieval metrics tell you?
It tells you that at least like the retrieval is good.
It doesn't necessarily guarantee the end generation is good, but at least it gives you
some sanity track, right?
So you can like fix one component while optimizing the rest.
What retrieval like evaluation is pretty standard and it's been around for a while.
It's just like an IR problem basically.
You have some like input query.
You get back some retrieves out of context.
And then there's some ground truth in that ranked set.
And then you try to measure it based on ranking.
So the closer that ground truth is to the top, the more you reward the evils.
And then the closer it is to the bottom, or if it's not in the retrieves not in the retrieves at all, then you penalize the Evales.
And so that's just like a classic ranking problem.
I think like most people starting out probably don't know how to do this.
Right now, we just launched some like basic retrieval evaluation modules to help users do this.
One is just like curating this data set in the first place.
And one thing that we're very interested in is this idea of like synthetic data set generation for EVLs.
So how can you give in some cost?
context, generate a set of questions with Jupy2.4, and then all of a sudden you have
question and then context pairs, and that becomes your ground truth.
Are data agent evils the same thing, or is there a separate set of stuff for agents that
you think is relevant here?
Yeah, I think data agents add like another layer of complexity because then it's just like
you have just more loops in the system.
Like you can evaluate like each chain of thought loop itself, like every LLM call to see
whether or not the input to that specific step in the chain of thought process actually
works or is correct, or you can evaluate like the final response to see if that's correct.
This gets even more complicated when you do like multi-agent stuff because now you have like
some communication between like different agents. Like you have a top level orchestration agent
passing it on to some low-level stuff. I'm probably less familiar with kind of like agent
email frameworks. I know they're starting to be become a thing. Talking to like June from the
journal of agents paper, which is pretty unrelated to what we're doing now. But it's very interesting
where it's like so you can kind of evaluate like overall agent simulations by just like kind
of understanding whether or not they like model the distribution of human behavior.
But that's out like a very macro principle, right?
And that's very much to evaluate stuff to kind of like model the distribution of things.
And I think that works well when you're trying to like generate something for like creative
purposes.
But for stuff where you really want the agent to like achieve a certain task, it really is like
whether or not achieve the task or not, right?
Because then it's not like, oh, does it generally mimic human behavior?
It's like no.
Like did you like send this email or not?
Right.
because otherwise, like, this thing didn't work.
Awesome.
Let's jump into a lining ground.
So we have two questions, acceleration, exploration, and then one final takeaway.
The acceleration question is, what's something that already happened in AI that you thought would take much longer to get here?
I think just the ability of LLMs to generate believable outputs and for text and also for images.
And I think just the whole reason I started hacking around with LLMs, honestly, I felt like I got into it pretty late.
I sure I got into it like early 2022 because Jupy 3 had been.
now for a while. Like just the fact that there was this engine that was capable, like, reasoning
and no one was really like tapping into it. And then the fact that, you know, I used to work in
image generation for a while. Like I did GANS and stuff back in the day. And that was like pretty
hard to train. You would generate these like 32 by 32 images. And then now taking a look at some of
the stuff by like Dolly and, you know, mid-jurney and those things. So it's just, it's very good.
Yeah. Exploration. What do you think is the most interesting unsolved question in AI?
Yeah. I'd probably work on some aspect of, um,
personalization of memory.
I think,
I actually think that I don't think anyone's,
like,
I think a lot of people have thoughts about that,
but like for what it's worth,
I don't think the final state will be ragged.
I think it'll be some,
some like fancy algorithm or architecture
where you like bake it into like the,
the architecture of the model itself.
Like if you have like a personalized assistant that you can talk to,
that will like learn behaviors over time,
right?
And learn stuff through like conversation history.
What exactly has the right architecture there?
I do think that will be part of like the role that.
Continuous fine-tuning?
Yeah, like some aspect of that.
Right, right.
Like, these are, like, I don't actually know the specific technique,
but I don't think it's just going to be something
where you have, like, a fixed vector store,
and that thing will be, like, the thing that restores all your memories.
It's interesting because I feel like using model weights for memory,
it's just such an unreliable storage device.
I know.
But, like, I just think from, like, the AGI, like, you know,
just modeling, like, the human brain perspective,
I think that there is something nice about just, like,
being able to optimize that system, right?
And to optimize a system, you need parameters.
And that's where you just get into the neural net piece.
Cool.
Cool.
And yeah, take away.
You got the audience ear.
What's something you want everyone to think about or, yeah, take away from this conversation
and you're thinking?
I think there were a few key things.
So we talked about two of them already, which was SEC Insights, which if you guys
haven't tracked it out, I've definitely encouraged you to do so because it's not just
like a random like SEC app.
it's like a full stack thing that we open source, right?
And so if you guys want to track it out,
I would definitely do that.
It provides a template for you to build kind of like production grade rack apps.
And we're going to open source like and moduliers more components of that soon.
Into a workshop.
Yeah.
And the second piece is I think we are thinking a lot about like retrieval and evils.
I think right now we're kind of exploring integrations with like a few different partners.
And so hopefully some of that will be really soon.
And so just like how do you basically have an experience where you just like write law index code?
All of a sudden, you can easily run like retrievals, e-vals, and like traces, all that stuff and
like a service. And so I think we're working with like a few providers on that. And then the
other piece, which we did talk about already is this idea of like, yeah, building like rag from
scratch. I mean, I think everybody should do it. I think I would check out the guide if you guys
haven't already. I think it's in our docs. But instead of just using, you know, either the
kind of like the retriever query engine and Lomindex or like the conversational like QA train and
in Langrain, it's, I would take a look at how do you actually chunk parse data and do like
top can batting retrieval. Because I really think by doing that process, it helps you understand
the decisions, the prompts, the language models to use. That's it. Thank you so much. Thank you, Jerry.
Yeah, thank you.
