No Priors: Artificial Intelligence | Technology | Startups - Improving search with RAG architecture with Pinecone CEO Edo Liberty

Episode Date: February 22, 2024

Accurate, customizable search is one of the most immediate AI use cases for companies and general users. Today on No Priors, Elad and Sarah are joined by Pinecone CEO, Edo Liberty, to talk about how RAG architecture is improving semantic search and making LLMs more reliable. By using a RAG model, Pinecone makes it possible for companies to vectorize their data and query it for the most accurate responses. In this episode, they talk about how Pinecone's Canopy product is making search more accurate by using larger data sets in a way that is more efficient and cost effective, which was almost impossible before there were serverless options. They also get into how RAG architecture uniformly increases accuracy across the board, how these models can increase "operational sanity" in the dataset for their customers, and hybrid search models that use keywords and embeddings.

Sign up for new podcasts every week. Email feedback to show@no-priors.com
Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @EdoLiberty

Show Notes:
(0:00) Introduction to Edo and Pinecone
(2:01) Use cases for Pinecone and RAG models
(6:02) Corporate internal uses for semantic search
(10:13) Removing the limits of RAG with Canopy
(14:02) Hybrid search
(16:51) Why keep Pinecone closed source
(22:29) Infinite context
(23:11) Embeddings and data leakage
(25:35) Fine-tuning the data set
(27:33) What's next for Pinecone
(28:58) Separating reasoning and knowledge in AI

Transcript
Starting point is 00:00:00 Hi, listeners, and welcome to another episode of No Priors. Today, Elad and I are talking with Edo Liberty, the founder and CEO of Pinecone, a vector database company designed to power AI applications by providing long-term memory. Before Pinecone, Edo was the director of research at AWS AI Labs and also previously at Yahoo. We're excited to talk about the increasingly popular RAG architecture and how to make LLMs more reliable. Welcome, Edo. Hi. Okay, let's start with some basic background.
Starting point is 00:00:33 Can you tell us more about Pinecone for listeners who haven't heard of it? Like, what does it do and how does it differ from other databases? So Pinecone is a vector database. And what vector databases do very differently is that they deal with data that has been analyzed and vectorized, and I'll explain in a second what that means, by machine learning models, by large language models, by foundational models, and so on. Large language models, foundation models, actually any models, really understand data in a numeric way.
Starting point is 00:01:11 Models are a thematical objects, right? When they read a document or a paragraph or an image, they don't save the pixels or the words, they save an numerical representation called an embedding or a vector. And that is the object that is manipulated, stored, retrieved and searched over and operated on by vector databases very efficiently at large scale. And that is Pinecone. When we started that category, people called me concerned and said, what is the vector and why are you starting a database? And now I think they know the answer.
Starting point is 00:01:49 How did you think about this early on? Because you started the company in 2019. At the time, this wave of generative AI hadn't happened quite yet. And so I was wondering what applications you had in mind, given that there's so much excitement around pine cone for the AI world, the prior AI world had a slightly different approach to a variety of these things. And I'm just curious, like, were you thinking of different types of embeddings back then?
Starting point is 00:02:12 Were you thinking about other use cases? Like, what was the original thinking in terms of starting pine cone? The tsunami wave of AI that we're going through right now didn't hit yet. But in 2019, the earthquake had already happened. Deep learning models and so on have already been grappled with large language models and transforma models like Bert and others started being used by the more mainstream engineering
Starting point is 00:02:39 cohorts. You could already kind of connect the dots and see that where this is going. In fact, before starting Pinecoe and I myself had founder anxiety between, are we already too late versus nobody knows what the hell this is and we're way too early. And it took me several months. of like wild swings between those two things until I figured maybe the fact that I have those too early, too late mood swings maybe means exactly the right time. Maybe, you know, you can actually just expand a little bit about, you know, in what use cases people want to use embeddings, right? I think there are ways to interact directly with language models and then reasons for,
Starting point is 00:03:21 for example, reliability or constant context length like that people. and performance that people interact with embeddings in, like, a rag architecture or in semantic search. So maybe you can sort of talk about some of the driving use cases. I mean, the obvious way, in some sense, to add knowledge to your conversational agent, whether it's chat or what have you. We talk about it as generative AI now, but it's much more general than that, is to, again, not shockingly bring the relevant information into the context, right, so that you can actually arm the foundational model with the right pieces of content, with text, with images, with what have you, right? You want to be able to retrieve that from a very large corpus of knowledge that you have, whether it's your own company's data or whether it's the Internet or what have you.
Starting point is 00:04:14 It so happens that LLMs are already very, very good at representing data in the way that they want to consume it, which is these embeddings. And so you can, at question time, in real time, at the time of the interaction, go and find relevant information. And relevant might be associated with or collided with or something that is similar to whatever it is that you're being asked about. And once you bring that into the context, you can now give much more accurate answers, right? And as a site experiment, we're actually loaded, it's called Common Crawl, which is the top internet pages crawled fairly frequently. We'll load that into Pine Cone and saw what happens when you augment GPT 3.5 and 4 and Lama and Mixtral and models. from coherent,thropic, and you could see that if you augment all of them with RAG, even on the internet, which is data that they were trained on, you can reduce hallucinations significantly
Starting point is 00:05:25 up to 50% sometimes. Interestingly enough, many of them actually start behaving quite similarly in terms of level of accuracy, even though without RAG, they actually have quite different behaviors. So it's sort of both like a uniform improvement and a little bit of leveling the plane field. Now, you know, because we know we can do that very well now, now you can do that also with proprietary data, with the company inside your company and so on, stuff that, of course, is not available on the Internet, and stuff that those models were never trained on. And interestingly enough, again, the quality ends up being incredibly high. I assume most Pinecone users are not, you know, using LMs and retrieving against general Internet data.
Starting point is 00:06:07 Like, what kinds of companies were your earliest or biggest users? Like, what kind of data do they want to retrieve against? So most companies do use their own company data. It could be whatever it is, depends on the application they're building. It could be legal data, medical records, internal Wiki information, sales calls, you name it. There's an infinite variety. I want to say that this is just rag. I mean, this is just semantic search.
Starting point is 00:06:35 I mean, there are many other applications that we didn't talk about, but we can keep it focused on this application for this conversation. And is it dominated by a specific use case? Like, were there customers that you feel like really represent the pine cone use case well? Yeah, 100%. First, text is probably most of what we see. Nowadays, models are really good at images and so on. But text is still the predominant data type. Notion, Q&A now runs on Pine Cone, and they serve essentially question answering with AI to tens of thousands, probably hundreds of thousands of their own customers.
Starting point is 00:07:16 Gong does the same thing with sales calls. Again, serves all of their use cases for all of their customers and so on. So one of the most common patterns is companies that themselves become trailblazers and innovators with AI, and they themselves hold a lot of their own. own users or customers text, and they want to search over it or generate information on top of it, that ends up being an incredibly common pattern. I guess earlier this month, one of the things that Pankone announced was the serverless offering called Canopy. Can you tell us a little bit about why he decided to go down the serverless direction
Starting point is 00:07:52 and how you view that in terms of either use cases or adoption or other things? So Canopy is actually an open source that we put out there as a framework for people to learn how to use RAG, pine cone serverless is just going to pine cone. It's just pine cone, but serverless. What it does is basically removes the limits from what people used to experience before. When we started Pinecone, a lot of the applications had to do with the recommendation engines
Starting point is 00:08:25 and anomaly detection and other problems where usually the scale was actually fairly small. And the requirements had to do with super low latencies and sometimes high throughput. And as a result, you still see a lot of databases kind of play in that field. We very quickly figured out with our own customers and our own experimentation, that something else is much more significant with just a scale and cost. If you want to be able to answer correctly, you just have to know a lot. And if you want to do that, you have to ingest hundreds of millions, billions, sometimes tens of billions of vectors into your own, into your vector database. And you want to query it efficiently in terms of cost.
Starting point is 00:09:11 You just don't want, you know, you don't want that to explode in terms of, again, spend. And finally, you want to do that easily. So you don't want to spend weeks and months setting things up and getting it to work. And doing that in our old architecture, and frankly, with any other architecture today that's not serverless is very difficult. And serverless is here to basically resolve those main problems. It's incredibly easy to operate. It scales massively. I mean, again, there's no theoretical limit to how much it can scale.
Starting point is 00:09:46 We've tested it with tens and tens of billions with live customers and live traffic. And I'm not going to go into the architectural design, but it's actually designed to be incredibly efficient, like asymptotically better than what can be done with any other architecture. It's fundamentally about removing all limits so people can actually have all the information they need ready for the foundational models. You mentioned Canopy is to help enable more people to build rag products. Like where do you see developers or your customer,
Starting point is 00:10:21 struggled to get embedding space to AI products generally successful? Or what were you trying to achieve with with canopy? Yeah, so the vector databases in Pinecone specifically are very foundational model, are very foundational pieces of technology. We're very deep in the stack. And to build a proper full end to end solution,
Starting point is 00:10:44 say like notion Q&A, there's quite a lot that you have to build on top of it. You have to ingest documents and, and what's called chunk them. You have to figure out how to break them into like factoids and pieces of information. You have to embed everything with models. You have to ingest them into the vector base. You know, when you get a query,
Starting point is 00:11:03 you have to figure out how to manipulate it and how to embed that. You have to search over it. You have to re-rank. You know, there's a lot. There's a whole system you have to build around it. And a lot of people told us that this is actually quite complex.
Starting point is 00:11:18 And they're right, right? We put out Canopy as really an example. It is an end-to-end kind of cookbook. If you just take this, it should work. You should probably, once it works, you should figure out how to make it better for your own application, right? Because medical data is not GER tickets, you know, and GERA tickets are not Slack messages and you might be building a different product.
Starting point is 00:11:42 But at least you have some end-to-end starting point that already does something and you can start improving on. Two of, I think, the most commison comparison points for vector databases that people use are, A, like, traditional databases, right? Like, why not just use Postgres, PGVector or some index associated within an existing database? Or, B, like, sort of more traditional search or incumbent search technologies or services like Elastic or Algilia? Can you talk about, like, you know, why not other databases or, like, how you think about traditional search? Yeah. I'll just go back to the fundamentals about what are you trying to achieve, right?
Starting point is 00:12:23 What we're trying to achieve is to give as much context and as much knowledge to foundational models as possible, do that easily at scale, you know, on a budget, get to a unit economics that actually works for your product, which is incredibly hard to deal with AI with like many discussions going on about that now. Those other products don't work. They don't work either because they don't scale in terms of the efficiency, scale, cost, the trade-offs that they can offer because they're not designed to do this. They're designed to do something else. They kind of thought about vector index as a bolt-on, you know, retrofitted feature.
Starting point is 00:13:10 And so, yes, it works at small scale, but when you try to actually go to production with it, you know, understand the limitations. With other search technologies, this is, again, this is the wrong search mode. If you're searching with keywords and just not finding the relevant information because the embeddings, the contextual space in which these pieces of text, documents, or images live is in vector space
Starting point is 00:13:37 in high dimensional numeric space, not in keyword space. And like everyone that's ever searched their inbox for an email, you know, for a fact. you have and not find it knows that keyword search has a deeply flawed retrieval system. I'm just curious if customers or, you know, developers are trying to combine the existing search systems they have. I know you also are increasingly supporting hybrid search. So kind of wanted to understand that. Where are embeddings like amazing and useful and like delivering new experiences and where they're not enough or not like the full experience and end users?
Starting point is 00:14:15 on. So it's interesting. Our research actually shows that when you do this well, you very rarely need keywords alongside embeddings. But getting embeddings to perform perfectly is actually, it could be quite intricate. And we find that it's very convenient to have keywords alongside embeddings and to score those things together. We call this hybrid search. And in fact, we made this even more. general and we said okay why not you know keywords under the hood are actually represented it's as sparse vectors that's true of any keyword search by the way this is not this is just kind of the mathematically identical and then we said why don't we just make this more general and just say hey you can give either sparse or dense vectors or both of them and kind of have the best of both worlds and people find that very convenient and so i'd highly encourage people to look at it and improve you know by boosting and all sorts of other tricks that you can bake into spouse vectors, including keywords. My guess is that that's not going to be the dominant mode of search in the very near future.
Starting point is 00:15:28 You think we progress, like you think hybrid search is a more temporary convenience. I mean, I think it'll be used for boosting and other types of levers to control your search. I think the mode of you baking keywords into that is going away. Yes. And maybe just going back to like the traditional database market, like why not in my postgres or my long or whatever I'm using already? Again, I mean, we see this in the market. A lot people tell me, hey, I already use tool X or database Y and why not. And frankly, oftentimes when it's some tiny workload, just learning how to use embeddings for the first time and so on, it might actually work okay. It's when people try to actually do something in production, they're trying to scale up, they're trying to actually push the envelope, or they're trying to launch a product that needs to have some unit economics attached to it that makes sense for that product.
Starting point is 00:16:26 That's where people run into huge problems. And so many of them just, you know, start with us to begin with, to be honest, a lot of them are enthusiasts and they actually kind of enjoy learning how to use a new kind of database. and are you, you know, user experience is smooth enough and, you know, there's so many tutorials and notebooks and examples that they actually find it exciting. But I guess some don't, and that's, that's fine. So maybe one more on database dynamics. Pinecone is closed source. It's gotten great adoption.
Starting point is 00:16:56 But many databases and like, you know, mature market are open source. How do you think about this decision? And has that, has it been an issue for you? I'll say that most databases start. before cloud was really a fully mature product or a market or platform. Okay. And so that was the precursor to PLG essentially or whatever. It was PLG, right?
Starting point is 00:17:26 It was, you know, that was the only way to put a technically complex product at the hands of engineers was to open source it. And you see, I think all, I mean, maybe not all, but definitely the larger databases that are open source out there, I think that's the reason they did that. When we started Pine Cone, we asked the very basic question of why do people open source the platform, right? One of it was to earn trust. One of them was to get contribution from the community and one of them was a channel to, you know, users.
Starting point is 00:18:08 And we figured we can earn trust by being excellent while we do in providing an amazing service. We don't need external contribution. And in fact, if you look at statistics, even companies that are open source, 99% of the contributions are actually from the company itself, not 99, but high 90s. And so that doesn't actually make a huge difference. And in terms of experience, we figured that we can actually provide a much better experience and much better access to the platform than what open source does. And Pinecone is a fully managed and multi-tenant service.
Starting point is 00:18:44 And to be able to run that at scale and provide the cost scale tradeoffs, we actually run a very, very complicated system. And in some sense, even if we gave it as open sales to somebody, they wouldn't know what to do with it. It will be a Herculean effort to even run this thing. The right decision was basically that we should offer this as a service. we should manage it end to end. And as long as you give people a fully reliable interface and you keep doing that year after year,
Starting point is 00:19:13 you earn the trust and the ease of use, that open source becomes, I hope, not an issue. It's funny because the two anecdotes around along those lines. I remember talking with, I think it was Ali from Databricks, and he said that if you can avoid doing open source, you should, you know, he felt like it was an incremental challenge because you get distribution through open source, but then you have to figure out the business model. And so he viewed it as like, you know, I think. I think the analogy he uses is like making an open source project work is like hitting a whole one one in golf. And then you pick up a baseball bat and you have to hit a grand slam because then you have to do the second act to make sure the thing actually works as a company.
Starting point is 00:19:52 That's right. No, I mean, I agree 100%. I mean, this is exactly what we're experiencing. And in fact, we already see even though new players in the vector database space that basically started to try to take. us down, all took the open source angle, we already see them, even young as they might be, they are already struggling with their open source strategy. Serverless is the fourth almost complete rewrite of the entire database at Pinecon. Yeah.
Starting point is 00:20:23 The one other thing that's coming in terms of the LLM world, which may or may not impact you, and I'm sort of curious how you think about it, is increasing long context windows for foundation models. Does that change how people interact with embeddings and vector databases, or does it not really impact things much. There's things people are talking about in terms of infinite context or other things like that. Like, I mean, I don't know what infinite context means, to be honest. It's like very big. It's infinite. It's like huge.
Starting point is 00:20:51 Oh, oh, I got it. Thank you. Yeah, yeah. You're welcome. I should take a note. First of all, those companies sell their services by the token. So the fact that they allow you to be used infinite context windows is not shocking, okay? That's good for business. The second thing is there there's plenty of evidence that increasing the context size doesn't actually improve results unless, you know, you do this very carefully, right? So just what's called concept stuffing is not helping. You just pay more and don't actually get much for it. And the last thing, that, even that, even if you kind of buy in to the the marketing,
Starting point is 00:21:36 that runs its course, right? If you're, it's like saying, oh, I don't need Google because I can, every time I query Google, I can send the internet along with my query, right? It's like, yeah, well, theoretically that's maybe possible, but clearly practically that's not feasible, right? So at some point, the context window just becomes gigabytes and gigabytes and gigabytes of data like terabytes.
Starting point is 00:22:04 I mean, where do you stop, right? And so already today, we have users who use not even very large models, you know, maybe a few billion parameters. And the vector database next to their model contains trillions of parameters, right? And they get, you know, much better performance that way, right? Just attaching all the context to everything you do, I think, runs its course very, very quickly. And it's also unnecessary to be. be honest.
Starting point is 00:22:37 Yeah, I guess related to that, another place where people have been talking about embeddings and vector databases is in sort of aspects of personalization and privacy. And I'm a little bit curious how you think about that because, you know, one of the risk people view is running an LLM over a large data corpus or fine-tuning it against a specific company's data is the issue of data leakage. You know, say for example, you're an HR company and you don't want different people's salaries to leak across an LLM because you're using it as like a chat bot to help you with context regarding your own personal data in an enterprise or things like that.
Starting point is 00:23:11 Can you talk a bit more about how embeddings can provide personalization and in some cases potentially other features that may be attractive to enterprises? Yeah. So that's a very common and reasonable thing to be concerned about. Data leakage can happen in two main ways. A, if you use a service for your foundational model that that frankly retrains their models with your data or records it, right, or saves it in some way that is opaque to you, right? That is a huge problem. And I think a lot of people are a lot of people are struggling with that. The second is if you're building an application in-house, whatever it might be, and you fine-tune your models on added data,
Starting point is 00:24:00 that data data might end up popping where it shouldn't in answers to, you know, other people's questions or whatever. What people do with vector databases is actually incredibly simple, right? You don't fine-tune a model on your own proprietary data, at which point you know for a fact it doesn't contain any proprietary data because it's never seen any of it, okay? And then at retrieval time or at, you know, whenever you apply, by the chat or the agent, you retrieve the right information from the database,
Starting point is 00:24:38 give it as context of the model, but only do inference. You don't actually retrain and you don't save that interaction, at which point that data doesn't exist anywhere. It's like an ephemeral thing. And the added benefit of that is, by the way, that you can be GDPR compliant. You can actually delete data. So if, you know, if you're a company,
Starting point is 00:24:59 like a legal company and somebody deletes a document, you can just delete it from the vector database. And that information will never be available to your foundational model again. So you don't even have to devise some complex mechanism for forgetting. You just don't know it anymore. What are the main reasons why people
Starting point is 00:25:17 attach vector databases to foundational models? It gives you this operational sanity that is almost completely impossible without it. That's interesting. I guess it feels like there's three different approaches that people are using. They're not much exclusive for models that kind of overlap in terms of what the hope for output is. One is really changing or engineering prompts or adding more information into the prompt.
Starting point is 00:25:48 The second is fine-tuning. And the third would be rag slash different aspects of embeddings or other approaches like that. How do you think about fine-tuning in this context? Like, when should you fine-tune versus, you know, use some of the approaches that you've talked about earlier? I can answer both the scientists and as a business owner, right? As a scientist, I'm all for fine-tuning. We have all the evidence to show that done right, it helps tremendously. As a business owner, I can tell you that it's actually extremely hard to do well.
Starting point is 00:26:22 I mean, this is something that unless you have the research team and the AI experts that know how to fine tune, you might actually make things significantly worse. Okay. So there is nothing that says that more data is going to make your model do better. In fact, it oftentimes gets, uh, regresses to something significantly worse. With prompt engineering, again, I think it's necessary, especially when you build applications. You want the response to, you know, conform to some format or have some property. I think that's a given you should do that.
Starting point is 00:27:00 It runs its course after a while. I mean, it's in some sense you get what you get. It's necessary, but there's a limit to what you can do with that. And RAG, I think, is incredibly powerful. But like I said before, when we talked about canopy, that's not, you know, that's not simple either. I mean, it's simpler than the other ones, but still acquires work and understanding, experimentation, and so on. This is almost a hallmark of a nascent mark. when the simplest solution is still somewhat complex.
Starting point is 00:27:32 Yeah, makes sense. What's next for a pine cone? What are some major things coming that you'd like to talk about? So, I mean, there's a ton. We're an infrastructure company. And so we obsessed about ease of use and security and stability and cost and scale and performance. Also, as an engineer at heart, I'm very excited about those things.
Starting point is 00:27:59 And all of that is coming. Again, serverless is becoming faster, bigger, better, more secure, easier to use. And we're starting to really grapple with what very large companies and very, you know, kind of trailblazing tech companies are going through. I said that getting AI to be truly knowledgeable is still complex. I think we're starting to grapple with deeper. issues that the entire information retrieval community has been dealing with for about 40, 50 years now. We're starting to see those, you know, come to the fore in RAG and in AI in general.
Starting point is 00:28:40 I guess putting aside Pine Cone and sort of the database world and everything else, what are you most excited about in terms of what's coming next in AI? It's hard to say. I really do want to see a distillate. in some sense of foundational models. And by distillation, I know it's a, I don't mean what usually people say, but there's distillation of models. I don't mean that. I mean, the separation of reasoning and, and knowledge, right?
Starting point is 00:29:12 Foundational models get it fundamentally wrong. When we learn how to build the subsystems of AI correctly and for each one of them to do their roles, optimally, either we're going to be able to do the, to achieve the same tasks, much cheaper, faster, better, or we're going to still want to use the same amount of resources, but achieve much more. What happens today is that we have very crude tools and we try to use everything for everything. Delightfully or shockingly enough, depending on who you are, that kind of works. I mean, we found this like very, very efficient and very general purpose tools. But they'll still very general purpose. They're still super blunt instruments.
Starting point is 00:30:00 Again, as a technology is somebody who cares deeply about how things are built. You kind of see the inefficiency and it hurts the brain to figure out that, you know, we take half the internet and cram it into GPU memory. I'm like, holy, what, why? This can't be the right thing to do. So I'm very excited about us as a community truly understanding how the, the, you know, how the different components interact and how to build everything much more in some sense correctly. I hope we get to build some exciting products. By we, I mean the community gets to build some exciting products this year.
Starting point is 00:30:39 I think we're going to see a year of a lot of experimentation that when the people went through last year, they're going to take the production and to build cool products this year. and I can't see, I can't wait to see how that looks like. I have a feeling that this feels going to be very, very exciting for consumers of AI. Yeah, I totally agree. It's a very exciting year ahead. So thank you so much for joining us today. Thanks, Edo.
Starting point is 00:31:05 Thank you, guys. Find us on Twitter at No Pryor's Pod. Subscribe to our YouTube channel if you want to see our faces, follow the show on Apple Podcasts, Spotify, or wherever you listen. That way you get a new episode every week. And sign up for emails or find transcripts for every episode. at no dash priors.com.
