No Priors: Artificial Intelligence | Technology | Startups - Improving search with RAG architecture with Pinecone CEO Edo Liberty
Episode Date: February 22, 2024. Accurate, customizable search is one of the most immediate AI use cases for companies and general users. Today on No Priors, Elad and Sarah are joined by Pinecone CEO, Edo Liberty, to talk about how RAG architecture is improving semantic search and making LLMs more available. By using a RAG model Pinecone makes it possible for companies to vectorize their data and query it for the most accurate responses. In this episode, they talk about how Pinecone’s Canopy product is making search more accurate by using larger data sets in a way that is more efficient and cost effective—which was almost impossible before there were serverless options. They also get into how RAG architecture uniformly increases accuracy across the board, how these models can increase “operational sanity” in the dataset for their customers, and hybrid search models that are using keywords and embeddings. Sign up for new podcasts every week. Email feedback to show@no-priors.com Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @EdoLiberty Show Notes: (0:00) Introduction to Edo and Pinecone (2:01) Use cases for Pinecone and RAG models (6:02) Corporate internal uses for semantic search (10:13) Removing the limits of RAG with Canopy (14:02) Hybrid search (16:51) Why keep Pinecone closed source (22:29) Infinite context (23:11) Embeddings and data leakage (25:35) Fine tuning the data set (27:33) What’s next for Pinecone (28:58) Separating reasoning and knowledge in AI
Transcript
Discussion (0)
Hi, listeners, and welcome to another episode of No Priors.
Today, Elad and I are talking with Edo Liberty, the founder and CEO of Pinecone,
a vector database company designed to power AI applications by providing long-term memory.
Before Pinecone, Edo was the director of research at AWS AI Labs and also previously at Yahoo.
We're excited to talk about the increasingly popular RAG architecture and how to make LLMs more reliable.
Welcome, Edo.
Hi.
Okay, let's start with some basic background.
Can you tell us more about Pine Cone for listeners who haven't heard of it?
Like, what does it do and how does it differ from other databases?
So Pine Cone is a vector database.
And what vector databases do very differently is that they deal with data that has been analyzed and vectorized.
I'll explain in a second what that means: by machine learning models, by large
language models, by foundation models, and so on.
Most large language models or foundation models,
actually any models really, understand data in a numeric way.
Models are mathematical objects, right?
When they read a document or a paragraph or an image,
they don't save the pixels or the words,
they save a numerical representation called an embedding or a vector.
And that is the object that is manipulated, stored,
retrieved and searched over and operated on by vector databases very efficiently at large scale.
And that is Pinecone. When we started that category, people called me, concerned, and said,
what is a vector and why are you starting a database? And now I think they know the answer.
How did you think about this early on? Because you started the company in 2019.
At the time, this wave of generative AI hadn't happened quite yet.
And so I was wondering what applications you had in mind,
given that there's so much excitement around Pinecone
for the AI world, the prior AI world
had a slightly different approach to a variety of these things.
And I'm just curious, like, were you thinking
of different types of embeddings back then?
Were you thinking about other use cases?
Like, what was the original thinking
in terms of starting Pinecone?
The tsunami wave of AI that we're going through right now
didn't hit yet.
But in 2019, the earthquake had already
happened. Deep learning models and so on had already been grappled with; large language models
and transformer models like BERT and others started being used by the more mainstream engineering
cohorts. You could already kind of connect the dots and see where this was going. In fact,
before starting Pinecone, I myself had founder anxiety between, are we already too late,
versus, nobody knows what the hell this is and we're way too early. And it took me several months
of wild swings between those two things until I figured maybe the fact that I had those
too-early, too-late mood swings meant it was exactly the right time.
Maybe, you know, you can actually just expand a little bit about, you know, in what use cases
people want to use embeddings, right?
I think there are ways to interact directly with language models, and then reasons,
for example reliability, context length constraints, and performance,
why people interact with embeddings in, like, a RAG architecture or in semantic
search. So maybe you can sort of talk about some of the driving use cases.
I mean, the obvious way, in some sense, to add knowledge to your conversational agent,
whether it's chat or what have you (we talk about it as generative AI now, but it's much more
general than that), is to, again, not shockingly, bring the relevant information into the context,
right, so that you can actually arm the foundational model with the right pieces of content, with text, with images, with what have you, right?
You want to be able to retrieve that from a very large corpus of knowledge that you have, whether it's your own company's data or whether it's the Internet or what have you.
It so happens that LLMs are already very, very good at representing data in the way that they want to consume it, which is these embeddings.
And so you can, at question time, in real time, at the time of the interaction, go and find relevant information.
And relevant might be associated with or collided with or something that is similar to whatever it is that you're being asked about.
And once you bring that into the context, you can now give much more accurate answers, right?
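To make the retrieval pattern Edo just described a bit more concrete, here is a minimal, self-contained sketch: embed a corpus, embed the query, find the nearest vectors, and prepend them to the prompt. Everything here is an assumption for illustration; embed() is a hash-based stand-in rather than a real embedding model, and the in-memory numpy store is a toy substitute for a vector database like Pinecone.

```python
# Minimal sketch of RAG retrieval: a toy in-memory "vector database" using
# cosine similarity over numpy arrays. A real system would use a learned
# embedding model and a vector database; embed() is a placeholder only.
import numpy as np

def embed(text: str, dim: int = 8) -> np.ndarray:
    """Placeholder embedding: deterministic pseudo-vector (illustration only)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

corpus = [
    "Pinecone is a vector database for AI applications.",
    "RAG retrieves relevant context before the model answers.",
    "Keyword search matches exact terms, not meaning.",
]
corpus_vectors = np.stack([embed(doc) for doc in corpus])

def retrieve(query: str, top_k: int = 2) -> list[str]:
    """Return the top_k documents most similar to the query embedding."""
    q = embed(query)
    scores = corpus_vectors @ q  # cosine similarity, since vectors are unit norm
    best = np.argsort(scores)[::-1][:top_k]
    return [corpus[i] for i in best]

# The retrieved passages become context that gets prepended to the LLM prompt.
context = "\n".join(retrieve("How does RAG reduce hallucinations?"))
prompt = f"Answer using this context:\n{context}\n\nQuestion: ..."
```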
And as a side experiment, we actually loaded what's called Common Crawl, which is the top internet pages crawled fairly frequently.
We loaded that into Pinecone and saw what happens when you augment GPT-3.5 and 4 and Llama and Mixtral and models
from Cohere and Anthropic, and you could see that if you augment all of them with RAG, even on the
internet, which is data that they were trained on, you can reduce hallucinations significantly,
up to 50% sometimes. Interestingly enough, many of them actually start behaving quite similarly in
terms of level of accuracy, even though without RAG, they actually have quite different behaviors.
So it's sort of both a uniform improvement and a little bit of leveling the playing field.
Now, you know, because we know we can do that very well now, you can do that also with proprietary data,
with data inside your company and so on, stuff that, of course, is not available on the Internet,
and stuff that those models were never trained on.
And interestingly enough, again, the quality ends up being incredibly high.
I assume most Pinecone users are not, you know, using LLMs and retrieving against general Internet data.
Like, what kinds of companies were your earliest or biggest users?
Like, what kind of data do they want to retrieve against?
So most companies do use their own company data.
It could be whatever it is, depends on the application they're building.
It could be legal data, medical records, internal Wiki information, sales calls, you name it.
There's an infinite variety.
I do want to say that this is just RAG, I mean, just semantic search.
There are many other applications that we didn't talk about, but we can keep it focused on this application for this conversation.
And is it dominated by a specific use case?
Like, were there customers that you feel like really represent the pine cone use case well?
Yeah, 100%.
First, text is probably most of what we see.
Nowadays, models are really good at images and so on.
But text is still the predominant data type.
Notion Q&A now runs on Pinecone, and they serve essentially question answering with AI to tens of thousands, probably hundreds of thousands of their own customers.
Gong does the same thing with sales calls.
Again, serves all of their use cases for all of their customers and so on.
So one of the most common patterns is companies that themselves become trailblazers and innovators with AI, and they themselves hold a lot of their
own users' or customers' text, and they want to search over it or generate information on
top of it. That ends up being an incredibly common pattern.
I guess earlier this month, one of the things that Pinecone announced was the serverless
offering called Canopy.
Can you tell us a little bit about why you decided to go down the serverless direction
and how you view that in terms of either use cases or adoption or other things?
So Canopy is actually an open source framework that we put out there for people to learn
how to use RAG. Pinecone serverless is just going to be Pinecone.
It's just Pinecone, but serverless.
What it does is basically removes the limits
from what people used to experience before.
When we started Pinecone, a lot of the applications
had to do with the recommendation engines
and anomaly detection and other problems where
usually the scale was actually fairly small.
And the requirements had to do with super low latencies and sometimes high throughput.
And as a result, you still see a lot of databases kind of play in that field.
We very quickly figured out with our own customers and our own experimentation that something else is much more significant, which is scale and cost.
If you want to be able to answer correctly, you just have to know a lot.
And if you want to do that, you have to ingest hundreds of millions, billions, sometimes tens of billions of vectors into your vector database.
And you want to query it efficiently in terms of cost.
You just don't want that to explode in terms of, again, spend.
And finally, you want to do that easily.
So you don't want to spend weeks and months setting things up and getting it to work.
And doing that in our old architecture, and frankly, with any other architecture today that's not serverless is very difficult.
And serverless is here to basically resolve those main problems.
It's incredibly easy to operate.
It scales massively.
I mean, again, there's no theoretical limit to how much it can scale.
We've tested it with tens and tens of billions with live customers and live traffic.
And I'm not going to go into the architectural design,
but it's actually designed to be incredibly efficient,
like asymptotically better than what can be done with any other architecture.
It's fundamentally about removing all limits
so people can actually have all the information they need ready for the foundational models.
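For a rough idea of what that serverless workflow looks like from the client side, here is a hedged sketch based on my understanding of the Pinecone Python SDK's v3-style API; the index name, dimension, cloud, region, and vector values are placeholder choices, and exact parameter names may differ across client versions.

```python
# Hedged sketch: create a serverless index, upsert embedded chunks, query it.
# Index name, dimension, cloud/region, and vector values are placeholders.
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_API_KEY")

pc.create_index(
    name="company-docs",
    dimension=1536,  # must match the embedding model you use
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)

index = pc.Index("company-docs")

# Upsert a few embedded chunks (the values would come from an embedding model).
index.upsert(vectors=[
    {"id": "doc1-chunk0", "values": [0.1] * 1536, "metadata": {"text": "..."}},
    {"id": "doc1-chunk1", "values": [0.2] * 1536, "metadata": {"text": "..."}},
])

# Query with an embedded question and get back the nearest chunks.
results = index.query(vector=[0.15] * 1536, top_k=5, include_metadata=True)
```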
You mentioned Canopy is to help enable more people to build RAG products.
Where do you see developers or your customers
struggle to get embedding-based AI products
to be generally successful?
Or what were you trying to achieve with Canopy?
Yeah, so vector databases, and Pinecone specifically,
are very foundational pieces of technology.
We're very deep in the stack.
And to build a proper full end to end solution,
say, like Notion Q&A, there's quite a lot that you have to build on top of it.
You have to ingest documents
and, what's called, chunk them.
You have to figure out how to break them
into factoids and pieces of information.
You have to embed everything with models.
You have to ingest them into the vector database.
You know, when you get a query,
you have to figure out how to manipulate it
and how to embed that.
You have to search over it.
You have to re-rank.
You know, there's a lot.
There's a whole system you have to build around it.
And a lot of people told us
that this is actually quite complex.
And they're right, right?
We put out Canopy as really an example.
It is an end-to-end kind of cookbook.
If you just take this, it should work.
You should probably, once it works,
you should figure out how to make it better for your own application, right?
Because medical data is not Jira tickets, you know,
and Jira tickets are not Slack messages, and you might be building a different product.
But at least you have some end-to-end starting point that already does something
and you can start improving on.
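As a sketch of the ingestion steps just listed (chunk, embed, upsert), the skeleton might look something like the following. Here embed_fn and index are assumptions standing in for an embedding model client and a vector index handle, and the word-window chunker is deliberately naive; a real pipeline would chunk by document structure and add query handling and re-ranking on top.

```python
# Rough sketch of the ingestion side of a RAG pipeline: split documents into
# overlapping chunks, embed each chunk, and upsert it with an id and metadata.
# embed_fn and index are supplied by the caller (placeholders, not a real API).
from typing import Callable, Iterable

def chunk(text: str, max_words: int = 200, overlap: int = 40) -> Iterable[str]:
    """Yield overlapping word-window chunks of a document."""
    words = text.split()
    step = max_words - overlap
    for start in range(0, max(len(words) - overlap, 1), step):
        yield " ".join(words[start:start + max_words])

def ingest(doc_id: str, text: str, embed_fn: Callable, index) -> None:
    """Embed each chunk and upsert it into the vector index."""
    vectors = [
        {"id": f"{doc_id}-{i}", "values": embed_fn(c), "metadata": {"text": c}}
        for i, c in enumerate(chunk(text))
    ]
    index.upsert(vectors=vectors)
```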
Two of, I think, the most common comparison points for vector databases that people use are, A, like, traditional databases, right?
Like, why not just use Postgres, pgvector, or some index associated with an existing database?
Or, B, like, sort of more traditional search or incumbent search technologies or services like Elastic or Algolia?
Can you talk about, like, you know, why not other databases or, like, how you think about traditional search?
Yeah.
I'll just go back to the fundamentals about what are you trying to achieve, right?
What we're trying to achieve is to give as much context and as much knowledge to foundational
models as possible, do that easily at scale, you know, on a budget, get to a unit economics
that actually works for your product, which is incredibly hard to do with AI, with many
discussions going on about that now.
Those other products don't work.
They don't work either because they don't scale in terms of the efficiency, scale, cost, the trade-offs that they can offer because they're not designed to do this.
They're designed to do something else.
They kind of thought about vector index as a bolt-on, you know, retrofitted feature.
And so, yes, it works at small scale, but when you try to actually go to production with it, you know, you start to
understand the limitations.
With other search technologies, this is, again,
the wrong search mode.
You're searching with keywords and just not
finding the relevant information, because
the contextual space in which these pieces of text,
documents, or images live is vector space,
high-dimensional numeric space, not keyword space.
And anyone that's ever searched their inbox
for an email they know for a fact they have,
and not found it, knows that keyword search is a deeply flawed retrieval system.
I'm just curious if customers or, you know, developers are trying to combine the existing
search systems they have. I know you also are increasingly supporting hybrid search. So kind of
wanted to understand that. Where are embeddings amazing and useful and delivering
new experiences, and where are they not enough, or not the full experience, for end users?
So it's interesting. Our research actually shows that when you do this well, you very rarely need keywords alongside embeddings.
But getting embeddings to perform perfectly can actually be quite intricate. And we find that it's very convenient to have keywords alongside embeddings and to score those things together. We call this hybrid search.
And in fact, we made this even more general. We said, okay, why not? Keywords under the hood are actually represented as sparse vectors. That's true of any keyword search, by the way; it's just mathematically identical.
And then we said, why don't we just make this more general and say, hey, you can give either sparse or dense vectors or both of them, and kind of have the best of both worlds?
And people find that very convenient. So I'd highly encourage people to look at it and improve, you know, by
boosting and all sorts of other tricks that you can bake into sparse vectors, including keywords.
My guess is that that's not going to be the dominant mode of search in the very near future.
You think as we progress, hybrid search is more of a temporary convenience?
I mean, I think it'll be used for boosting and other types of levers to control your search.
I think the mode of you baking keywords into that is going away. Yes.
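To illustrate the sparse-plus-dense idea, here is a tiny conceptual sketch of the convex-combination weighting commonly used to blend the two signals before querying; alpha is a tuning knob, and this is an illustration of the general trick, not a claim about Pinecone's exact server-side scoring.

```python
# Conceptual sketch of hybrid search weighting: scale the dense (embedding)
# component by alpha and the sparse (keyword-style) component by 1 - alpha,
# so a single similarity computation blends both signals.
from typing import Dict, List, Tuple

def weight_hybrid(
    dense: List[float], sparse: Dict[int, float], alpha: float = 0.8
) -> Tuple[List[float], Dict[int, float]]:
    """alpha=1.0 is pure semantic search, alpha=0.0 is pure keyword scoring."""
    weighted_dense = [alpha * v for v in dense]
    weighted_sparse = {idx: (1 - alpha) * v for idx, v in sparse.items()}
    return weighted_dense, weighted_sparse

# Example: mostly semantic, with a little keyword boosting.
dense_q = [0.12, -0.03, 0.88]        # from an embedding model
sparse_q = {101: 1.0, 2048: 0.5}     # token-id -> weight (e.g. BM25-like)
dq, sq = weight_hybrid(dense_q, sparse_q, alpha=0.7)
```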
And maybe just going back to, like, the traditional database market, like why not in my Postgres or my Mongo or whatever I'm using already?
Again, I mean, we see this in the market.
A lot of people tell me, hey, I already use tool X or database Y, so why not use that.
And frankly, oftentimes when it's some tiny workload, just learning how to use embeddings for the first time and so on, it might actually work okay.
It's when people try to actually do something in production, they're trying to scale up, they're trying to actually push the envelope, or they're trying to launch a product that needs to have some unit economics attached to it that makes sense for that product.
That's where people run into huge problems.
And so many of them just, you know, start with us to begin with. To be honest, a lot of them are enthusiasts and they actually kind of enjoy learning how to use a new kind of database.
And our user experience is smooth enough and, you know,
there are so many tutorials and notebooks and examples that they actually find it exciting.
But I guess some don't, and that's fine.
So maybe one more on database dynamics.
Pinecone is closed source.
It's gotten great adoption.
But many databases and like, you know, mature market are open source.
How do you think about this decision?
And has that, has it been an issue for you?
I'll say that most databases started
before cloud was really a fully mature product or a market or platform.
Okay.
And so that was the precursor to PLG, essentially.
It was PLG, right?
You know, the only way to put a technically complex product in the hands of engineers was to open source it.
And you see, I think all, I mean, maybe
not all, but definitely the larger databases that are open source out there, I think that's
the reason they did that.
When we started Pinecone, we asked the very basic question of why do people open source
the platform, right?
One of them was to earn trust.
One of them was to get contributions from the community, and one of them was a channel to, you know, users.
And we figured we can earn trust by being excellent at what we do and providing an amazing service.
We don't need external contribution.
And in fact, if you look at statistics, even companies that are open source, 99% of the
contributions are actually from the company itself, not 99, but high 90s.
And so that doesn't actually make a huge difference.
And in terms of experience, we figured that we can actually provide a much better experience
and much better access to the platform than what open source does.
And Pinecone is a fully managed and multi-tenant service.
And to be able to run that at scale and provide the cost scale tradeoffs,
we actually run a very, very complicated system.
And in some sense, even if we gave it as open source to somebody,
they wouldn't know what to do with it.
It would be a Herculean effort to even run this thing.
The right decision was basically that we should offer this as a service;
we should manage it end to end.
And as long as you give people a fully reliable interface and you keep doing that year after year,
you earn the trust and the ease of use, and open source becomes, I hope, not an issue.
It's funny because there are a couple of anecdotes along those lines.
I remember talking with, I think it was Ali from Databricks, and he said that if you can avoid
doing open source, you should, you know, he felt like it was an incremental challenge because
you get distribution through open source, but then you have to figure out the business model.
And so he viewed it as, like, you know,
I think the analogy he uses is that making an open source project work is like hitting a hole in one in golf.
And then you pick up a baseball bat and you have to hit a grand slam, because then you have to do the second act to make sure the thing actually works as a company.
That's right.
No, I mean, I agree 100%.
I mean, this is exactly what we're experiencing.
And in fact, the new players in the vector database space that basically started to try to take
us down all took the open source angle, and we already see them, young as they might
be, already struggling with their open source strategy.
Serverless is the fourth almost complete rewrite of the entire database at Pinecone.
Yeah.
The one other thing that's coming in terms of the LLM world, which may or may not impact you,
and I'm sort of curious how you think about it, is increasing long context windows for foundation
models.
Does that change how people interact with embeddings and vector databases, or does it not
really impact things much? There are things people are talking about in terms of infinite
context or other things like that.
Like, I mean, I don't know what infinite context means, to be honest.
It's like very big. It's infinite. It's like huge.
Oh, oh, I got it. Thank you. Yeah, yeah. You're welcome.
I should take a note. First of all, those companies sell their services by the token.
So the fact that they allow you to use infinite
context windows is not shocking, okay? That's good for business. The second thing is,
there's plenty of evidence that increasing the context size doesn't actually improve results
unless, you know, you do this very carefully, right? So just what's called context stuffing is not
helping. You just pay more and don't actually get much for it. And the last thing is that,
even if you kind of buy into the marketing,
that runs its course, right?
If you're, it's like saying, oh, I don't need Google
because I can, every time I query Google,
I can send the internet along with my query, right?
It's like, yeah, well, theoretically that's maybe possible,
but clearly practically that's not feasible, right?
So at some point, the context window just becomes gigabytes
and gigabytes and gigabytes of data like terabytes.
I mean, where do you stop, right?
And so already today, we have users who use not even very large models, you know,
maybe a few billion parameters.
And the vector database next to their model contains trillions of parameters, right?
And they get, you know, much better performance that way, right?
Just attaching all the context to everything you do, I think, runs its course very, very quickly.
And it's also unnecessary, to be honest.
Yeah, I guess related to that, another place where people have been talking about embeddings
and vector databases is in sort of aspects of personalization and privacy.
And I'm a little bit curious how you think about that because, you know, one of the risks
people see with running an LLM over a large data corpus, or fine-tuning it against a specific
company's data, is the issue of data leakage.
You know, say for example, you're an HR company and you don't want different people's salaries
to leak across an LLM because you're using it as like a chat bot to help you with context
regarding your own personal data in an enterprise or things like that.
Can you talk a bit more about how embeddings can provide personalization and in some cases
potentially other features that may be attractive to enterprises?
Yeah. So that's a very common and reasonable thing to be concerned about.
Data leakage can happen in two main ways. A, if you use a
service for your foundational model that frankly retrains their models with your data or
records it, right, or saves it in some way that is opaque to you, right? That is a huge problem.
And I think a lot of people are struggling with that. The second is if you're
building an application in-house, whatever it might be, and you fine-tune your models on added data,
that data might end up popping up where it shouldn't in answers to, you know, other people's
questions or whatever.
What people do with vector databases is actually incredibly simple, right?
You don't fine-tune a model on your own proprietary data, at which point you know for a fact
it doesn't contain any proprietary data because it's never seen any of it, okay?
And then at retrieval time, or, you know, whenever you apply
the chat or the agent,
you retrieve the right information from the database,
give it as context to the model, but only do inference.
You don't actually retrain and you don't save that interaction,
at which point that data doesn't exist anywhere.
It's like an ephemeral thing.
And the added benefit of that is, by the way,
that you can be GDPR compliant.
You can actually delete data.
So if, you know, if you're a company,
like a legal company and somebody
deletes a document, you can just delete it from
the vector database. And that information will never
be available to
your foundational model again. So you don't
even have to devise some complex mechanism for
forgetting. You just don't know it anymore.
Why do people
attach vector databases to
foundational models? It gives you
this operational sanity
that is almost completely impossible without it.
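As a concrete example of that operational sanity, deleting a document really does come down to a delete call against the vector index. The calls below are Pinecone-client-style sketches; the index name, ids, namespace, and metadata filter are hypothetical, and filter-based deletes may depend on index type and client version.

```python
# Hedged sketch: remove a document's vectors so it can never be retrieved into
# a model's context again. Index name, ids, namespace, and filter are made up.
from pinecone import Pinecone

index = Pinecone(api_key="YOUR_API_KEY").Index("company-docs")

# Delete the specific chunk ids belonging to the deleted document.
index.delete(ids=["doc-42-chunk0", "doc-42-chunk1"], namespace="legal-docs")

# Or, if the index type supports metadata filtering on delete:
index.delete(filter={"source_doc": {"$eq": "doc-42"}}, namespace="legal-docs")
```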
That's interesting.
I guess it feels like there are three different approaches that people are using.
They're not mutually exclusive; they kind of overlap in terms of what the hoped-for output is.
One is really changing or engineering prompts or adding more information into the prompt.
The second is fine-tuning.
And the third would be RAG, slash, different aspects of embeddings or other approaches like that.
How do you think about fine-tuning in this context?
Like, when should you fine-tune versus, you know, use some of the approaches that you've talked about earlier?
I can answer both as a scientist and as a business owner, right?
As a scientist, I'm all for fine-tuning.
We have all the evidence to show that done right, it helps tremendously.
As a business owner, I can tell you that it's actually extremely hard to do well.
I mean, this is something that unless you have the research team and the AI experts that
know how to fine tune, you might actually make things significantly worse.
Okay.
So there is nothing that says that more data is going to make your model do better.
In fact, it oftentimes regresses to something significantly worse.
With prompt engineering, again, I think it's necessary, especially when you build applications.
You want the response to, you know, conform to some format or have some property.
I think that's a given you should do that.
It runs its course after a while.
I mean, it's in some sense you get what you get.
It's necessary, but there's a limit to what you can do with that.
And RAG, I think, is incredibly powerful.
But like I said before, when we talked about Canopy, that's not, you know, that's not simple either.
I mean, it's simpler than the other ones, but it still requires work and understanding, experimentation, and so on.
This is almost a hallmark of a nascent market,
when the simplest solution is still somewhat complex.
Yeah, makes sense.
What's next for Pinecone?
What are some major things coming that you'd like to talk about?
So, I mean, there's a ton.
We're an infrastructure company.
And so we obsess about ease of use and security
and stability and cost and scale and performance.
Also, as an engineer at heart, I'm very excited about those things.
And all of that is coming.
Again, serverless is becoming faster, bigger, better, more secure, easier to use.
And we're starting to really grapple with what very large companies and very, you know,
kind of trailblazing tech companies are going through.
I said that getting AI to be truly knowledgeable is still complex.
I think we're starting to grapple with deeper
issues that the entire information retrieval community has been dealing with for about 40, 50 years now.
We're starting to see those, you know, come to the fore in RAG and in AI in general.
I guess putting aside Pinecone and sort of the database world and everything else, what are you most excited about in terms of what's coming next in AI?
It's hard to say. I really do want to see a distillation,
in some sense, of foundational models.
And by distillation,
I don't mean what people usually say;
there is distillation of models, and I don't mean that.
I mean the separation of reasoning and knowledge, right?
Foundational models get it fundamentally wrong.
When we learn how to build the subsystems of AI correctly
and for each one of them to do its role
optimally, either we're going to be able to achieve the same tasks much cheaper, faster, better, or we're going to still want to use the same amount of resources, but achieve much more.
What happens today is that we have very crude tools and we try to use everything for everything.
Delightfully or shockingly enough, depending on who you are, that kind of works. I mean, we found these very, very efficient and very general purpose tools.
But they're still very general purpose.
They're still super blunt instruments.
Again, as a technologist, as somebody who cares deeply about how things are built,
you kind of see the inefficiency and it hurts the brain to figure out that, you know, we take half the internet and cram it into GPU memory.
I'm like, holy, what, why?
This can't be the right thing to do.
So I'm very excited about us as a community truly understanding how, you know,
the different components interact and how to build everything, in some sense, much more correctly.
I hope we get to build some exciting products.
By we, I mean the community gets to build some exciting products this year.
I think the year of a lot of experimentation that people went through last year,
they're going to take to production and build cool products this year.
I can't wait to see what that looks like.
I have a feeling this is going to be very, very exciting for consumers of AI.
Yeah, I totally agree.
It's a very exciting year ahead.
So thank you so much for joining us today.
Thanks, Edo.
Thank you, guys.
Find us on Twitter at @NoPriorsPod.
Subscribe to our YouTube channel if you want to see our faces,
follow the show on Apple Podcasts, Spotify, or wherever you listen.
That way you get a new episode every week.
And sign up for emails or find transcripts for every episode
at no-priors.com.