The Infra Pod - Vector databases is not a feature? Let's dive deeper with Chang
Episode Date: April 1, 2024
Ian and Tim sat down with Chang She from LanceDB to talk about what's happening in the vector DB world, and how Lance started from a file format into a tensor data lake. ...
Transcript
Welcome back to yet another Infra Deep Dive podcast.
This is Tim from Essence VC.
And Ian, let's go.
Let's do it.
This is Ian Livingston, helping turn Snyk from a bag of stuff into a platform.
And I'm so excited today, Tim.
We're joined by Chang She at LanceDB.
And I am so excited to talk about what LanceDB is doing. It's an embedded vector store.
I don't really yet know what that actually means, but I think we're going to find out.
Tell us about yourself.
Hey, guys. It's a pleasure to be here. I'm super excited. And I think we're going to have fun today.
I'm the CEO and co-founder of LanceDB. And I'd like to say what we're building is more than just a vector database.
What we're building is a new type of lake house for multimodal AI.
Certainly that includes the embedded vector store, but it is very different because we actually start from the bottom up.
We have a new data storage layer, a new data format. And the whole approach that we take with LanceDB starts from the fact that it's really, really difficult to manage the unstructured data that you need for multimodal AI. So we make that easy, not just for vector search, but for a whole slew of other things that you need to do when managing that data. And that is basically, at a high level, how we approach the problem.
That's really interesting.
For our audience,
what is it about multimodal that requires a new format?
What is it?
Why is it complicated?
What's the underlying problem
that spurred,
oh, we need to build something for this?
Yeah, absolutely.
So I've been making tools
to make it easier for data science and machine learning for a while now, starting with working on pandas and then working on recommender systems.
And basically, before the current era with multimodal AI, you got data that more or less fit into Excel sheets or tables.
And if you've got these like data frame tools or like
SQL databases, your life was okay. It wasn't, it wasn't perfect, but it also wasn't terrible
until you start dealing with like images and videos and, you know, vector embeddings and
PDFs and all that stuff that surprisingly makes up, like, most of enterprise data by volume, but it just doesn't really fit well into a table. And so when people want to work with it, like manage it, store it, it inevitably becomes a pile of goo. Like, I don't know how many times I've run into these problems, where, you know, I set up a table with links to all the images, and it works great for about two weeks until somebody moves that directory of images, and my whole data set's broken, and I have no idea what happened. I have to chase that around. And that's just one tip of the iceberg of managing that kind of data.
But it turns out, you know, this is data that's, like, what we call dark data in the enterprise, where people don't really know how to extract value from it very easily. And it turns out that is exactly the type of data that's actually valuable for AI today.
I want to talk about your journey of making LanceDB. And I actually still want to understand what the overall product you're building actually is. So when we talked, you were doing Eto,
and you started with the file format, Lance, right?
So you're probably one of the weirdest database companies, one that starts with an open source file format and goes upwards, right?
Everybody else goes downwards.
So maybe talk about your journey.
Why did you start with this file format in the beginning?
And what led you to even build a database on top of that?
Well, in the very beginning, we actually did try to go from the top down. We started building this query engine and it was loosely based on Spark and Parquet and a number of open source components.
And we ran into the same problems over and over again that we were trying to solve now with Lance Format.
And it wasn't until we sort of banged our head enough that we figured out,
okay, there's not enough lipstick in the world that you can put on this pig to make it work.
And then we talked to a lot of practitioners.
This was 2022.
And, you know, pre-ChatGPT, autonomous vehicles were still the king of AI and machine learning. And we talked to a lot of, like, computer vision practitioners, and everyone complained about the same thing: you know, I can't have the same system do data exploration versus training. I tried to make it work with Parquet 10 different ways, and it never works. And, you know, data exploration for my image data set boils down to, hey, there's an app called Finder on the Mac, and when you click on a directory with images, it gives you a bunch of thumbnails. That's how a lot of them did their, you know, quote unquote, data exploration and curation. And that gave us the confidence to say, okay, it's 2022. We're not crazy to want to create a low-level format layer. And there's a reasonable chance that this is something
that is actually valuable to the community.
That was sort of where we started at the file format layer.
We weren't thinking about a vector database at all.
What we were thinking about is essentially,
how can we make managing unstructured data as structured as possible?
We want to make that feel like you're working with a data warehouse with schemas and tables and structured queries and things like
that. And one of the benefits of the file format was, hey, we can support rapid random access.
So now it makes sense to put an index along with the file format. So then the computer vision people can do things like deduplicating images, active learning, finding similar images. And so it's sort of starting to rhyme, like, okay, this is an index for, you know, like B-trees and bitmaps, but also for vectors.
And we added a vector index to do those types of tasks people want to do in computer vision. And then I think this was, like, maybe beginning of 2023, end of 2022, when other folks in the open source community discovered Lance and were like, hey, I can use this as a vector database. Like, why am I paying out the nose for other solutions when I can just get this for free? And we sort of felt the pull from the community
and created this embedded database layer
that was much more targeted towards like RAG and AI
and search and recommender systems.
So a winding road to this embedded vector database.
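To make that format-level story concrete, here's a minimal sketch of writing a Lance dataset, doing random access, and running a nearest-neighbor query; it assumes the pylance Python package (import name lance), and the paths, column names, and data are hypothetical.

```python
# A minimal sketch of the format-level story, assuming the pylance package
# (import name `lance`); paths, column names, and data here are hypothetical.
import lance
import numpy as np
import pyarrow as pa

# Write a Lance dataset: a schema'd table where the embeddings live right next
# to pointers to the raw images.
table = pa.table({
    "image_uri": [f"s3://my-bucket/img_{i}.png" for i in range(1000)],
    "vector": pa.array(
        [np.random.rand(128).astype(np.float32) for _ in range(1000)],
        type=pa.list_(pa.float32(), 128),
    ),
})
ds = lance.write_dataset(table, "./images.lance", mode="overwrite")

# Fast random access by row, which Parquet-style layouts struggle with.
print(ds.take([5, 500]))

# Nearest-neighbor query over the same files (brute force here; a vector index
# can be added on top of the format for scale).
query = np.random.rand(128).astype(np.float32)
hits = ds.to_table(nearest={"column": "vector", "q": query, "k": 3})
print(hits)
```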
Yeah, it's super fascinating.
I think you already alluded to why we need a file format, and that kind of leads to why you're building this. But probably one thing that isn't super clear: this feels more like a data lake than just purely, like, an enclosed database company, where with most databases you put your data in, and the only way to access it is to pull data out.
But when you build an open source file format, I guess the purpose is to be able to use it like a data lake where multiple things can access it at the same time.
Is that the intention?
Like why put out the file formats as a database company?
Because I don't think that's a typical thing,
you know, Snowflake isn't putting out its internal data format, because it has a lot of iterations, has a lot of proprietary information in it.
Like what is almost like your intention
to make it more open
and build a data engine on top of that?
Do you want this to become like an open source ecosystem built around the file format? What is the intention here?
Yeah, so I think the intention, yes, is to become a much bigger ecosystem with the data lake. And the purpose of us making that totally open source at the format level is because the file format actually brings a lot of unexpected benefits that work really well across all these different modes.
We have a meme internally on the team that we just pass around pictures of
Marie Antoinette with, you know, have your cake and eat it too.
And that's essentially what we want to give to our users who are, you know,
managing AI data. It's like, previously there was no system that was good for managing both tabular and unstructured data. Previously there was no system that would allow you to use the same piece of data in real-time serving and OLAP and training. And there was no system that allowed you to do ML experimentation, training, fine-tuning, and data exploration all with the same piece of data within the same system.
It's very easy to store and query petabytes of data in this open-source layer.
And then on top of that, we're building an entire data ecosystem on top, starting with the vector database.
I have so many questions.
So you built this file format.
You're focused on non-tabular data, right?
If you have tabular data, Parquet's great, but why a vector database?
Why is that the first thing in your journey?
So a lot of it is, where are the sharp pains in the community?
And for AI ML practitioners, what do they have the most trouble with and what's most
lacking? And so we started
that vector database journey in the end of 2022, beginning in 2023. I think if I were in the same
place, in the same spot today, I'd maybe have second thoughts around starting a vector database,
given all that's happened in the past year. But at the time, that seemed like the biggest pain and the biggest gap in the market, because at that time there was no real viable option for lightweight vector search. And I think there are still not great options outside of Lance that allow you to combine that vector search and data management. That's sort of the motivation for us to do that in the beginning.
Is your philosophy behind this embedded vector store, is it a lot like the DuckDB philosophy for analytics? Is that an apt analogy? Help those of us who are uninformed be better
informed in our mental model. Yeah, help us understand.
Yeah. So I'm a math nerd. I wasn't good enough to be a math major in school, but I always liked math.
So one of the things that I like to do is think about
how can I reduce the solution for a new problem
to a previously solved state?
And so that's our approach for that.
I think DuckDB is a great analogy.
We've had folks on Twitter and LinkedIn say,
hey, they like LanceDB because it feels like SQLite for vector search.
And so it's sort of that same lightweight feel.
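As a rough illustration of that SQLite-like form factor, here's a minimal sketch using the lancedb Python package; the directory, table name, and toy vectors are hypothetical.

```python
# A minimal sketch of the "SQLite for vector search" feel, assuming the
# lancedb Python package; the directory, table name, and vectors are hypothetical.
import lancedb

db = lancedb.connect("./my-lance-data")  # just a directory on disk, no server to run
table = db.create_table(
    "quotes",
    data=[
        {"vector": [0.9, 0.1], "text": "embedded databases feel like SQLite"},
        {"vector": [0.1, 0.9], "text": "client-server databases need ops"},
    ],
    mode="overwrite",
)

# Similarity search runs in-process, right against the files on disk.
results = table.search([0.85, 0.2]).limit(1).to_pandas()
print(results["text"].iloc[0])
```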
Got it, yeah.
And one of the things you said in your introduction
when you were explaining multimodal,
I'd love to dig in more as you talked about the fact
that the vast majority of the best data
that will be the most valuable data
for this new generation of AI,
or let's say this new generation of large models,
right, large vision models or image models, text-to-speech models, whatever, isn't tabular.
It's this other undefined format.
Help us understand, like, what does the ecosystem look like today for tools?
And what do you think the ecosystem is going to look like as time moves on?
Like, one of the things you said at the beginning that I thought was unique and interesting
is that, well, when you went and thought about autonomous vehicles, it was like, well, the actual infrastructure they have to build
those systems was very poor. Can you help us understand what's the state of that tooling
ecosystem today and where do you think it's heading? Yeah, absolutely. We were just talking
about what a LAMP stack for AI looks like. And the answer is, right now, there is no LAMP stack. Anybody who
tells you this is the LAMP stack for AI is trying to sell you a bridge to nowhere. And that's partly
why this market is so fun right now to try to build that out. I think certainly around data
infrastructure, retrieval, fine-tuning, modeling, prompt management, and then orchestration.
Those are all layers where folks are trying to standardize
and build the best-in-class solutions in.
And then, of course, you have the frameworks like LangChain and all that, and those are stitching it all together.
And so the trend around this is that previous generations of machine learning evolution were primarily in Python
and primarily required you to be
at least familiar with the mathematics
behind the modeling and ML concepts.
And it's no longer true.
I think it's a great trend and it's democratizing AI.
So you can have a lot more folks
that are coming into the field building
compelling applications without having to spend years understanding the complicated mathematics
behind it. But on the flip side, it also means there's now a flood of new developers coming into the field who aren't necessarily familiar with data infrastructure and don't have the battle scars from the shared pains of managing that data.
So there's a lot of relearning about what works, what doesn't work.
And last year, we spent a lot of time building demo ware and going to hackathons.
And this year, everyone's like, okay, how do I put this thing in production?
And essentially relearning a lot of the lessons from, like, the 2010s. And so my prediction is that the stack, as it forms, will look closer and closer to kind of the stack that we had before.
It won't look exactly the same, but it'll look much closer than what it is today.
I think what everybody's trying to predict and figure out is, is there going to be a LAMP stack?
And we're trying to have this sort of standard four-letter word stack or five-letter word stack.
In the meantime, LAMP was the starting point.
And there's a whole bunch of, even back in the day when we were working at Mesosphere, we had the SMACK stack and all these things. Some last longer, some don't.
And I guess maybe the question is, how do you see the vector database role continuing to evolve? Because, you know, most people have this idea in their head that a vector database is just storing vectors and doing similarity search and doing some simple things. I feel like everybody in the ecosystem is evolving quickly, and we're not all actually catching up while the technology is shifting quickly. There are so many more models now, and embedding models are trying to be innovative on their own too. What are the things you believe are the most important, besides just doing search on vectors, that help people use LanceDB? And the follow-on question is, can that become much more crucial for folks to understand, so they can tell you're the best option as well?
Totally. So the way I'm
looking at it is not so much like, you know, what vector databases can do, but like, what are the
pressing problems the community needs to solve? And typically retrieval for RAG and better semantic search are problems that folks
are tackling today. And going from a demo to production means a lot of them need that extra
20% in retrieval quality, right? Going from like a 60% retrieval quality with simple vector search alone to 80 to 90 percent retrieval
quality that you need for production. And typically, that gap is filled in with a lot of
things. One is you can experiment with preparing your data and chunk data differently. You can
clean your data differently. So you need tools to store that data, to run experiments, to make it easy to query that raw data.
And then you need not just vector search, but you need different ways to retrieve information from full-text search, like just SQL queries.
And folks are now experimenting with, like, ColBERT embedders and retrievers, knowledge graphs and
graph databases, and all that. So now you've got this diverse set of ways to retrieve information. Now you need a way to actually combine all those results and re-rank them so that the real top-quality contexts can then go into your RAG. And that's sort of a pipeline that's starting to
look more and more like what I worked on before, which is recommender systems.
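As a rough sketch of that combine-and-re-rank step, here's a small reciprocal rank fusion example in plain Python; the retrievers and document IDs are hypothetical stand-ins for vector and full-text results.

```python
# A small, self-contained sketch of fusing results from different retrievers
# (e.g. vector search and full-text search) with reciprocal rank fusion.
# The document IDs below are hypothetical.
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked lists of doc IDs into one ranking.

    Each doc scores sum(1 / (k + rank)) over the lists it appears in, so
    documents ranked highly by multiple retrievers float to the top.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_7", "doc_2", "doc_9"]   # e.g. from a vector index
keyword_hits = ["doc_2", "doc_4", "doc_7"]  # e.g. from full-text search
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))  # doc_2 and doc_7 rise
```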
And the next big step then, for a lot of folks I run into today building RAG, is: we got something up and going as quickly as possible. Typically, that means, okay, we used OpenAI for the embedding model and completions.
But the next step for them is always, okay,
how do we actually leverage our own data to fine-tune a lot of those models?
That's really how they build their edge, right?
If you build a standard RAG pipeline,
every one of your competitors can also build the same pipeline,
but nobody has
your data. So if you can use your data
to make that embedding model
make retrieval more accurate
or your completion model
better, then you have a lasting
edge. And so
some of the more advanced users I've run into
have told me that, hey,
they're getting
really good results with fine-tuning.
So you've got the top of the MTEB leaderboard,
and then you've got small open-source models.
And right now, I think everyone's like,
okay, let's create better and better and bigger generic embedding models.
But some of our users are finding, hey, with like $10 in generated synthetic data, they can fine-tune an open source model to be better than the top of the MTEB leaderboard, which is pretty insane.
And also, I think really great for a lot of enterprises with private data looking to build an AI stack that's actually differentiated. Put all that together, it's like that feedback loop
and you need data management,
you need versioning reproducibility,
you need different ways to query your data.
Again, we get to that previous state
in autonomous vehicles where,
okay, I need to have my data split
into three different formats
in three different places
with different systems talking to all of them.
And then I'm spending maybe like half my time just trying to keep the data in sync with each other.
And it becomes a huge mess.
That's the core of the problem we want to solve with LanceDB: you can put everything together.
You can query it any way you want.
You can run DuckDB on LanceDB data tables.
You can get the data into Pandas and Polars and Spark.
And we're working on like Presto and Trino integration.
It's sort of taking your existing compute and you can just plug
your Lance data set right in. You don't have to worry about
making copies for experiments and rollbacks and time travel.
That's all taken care of as well.
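As a rough illustration of plugging existing compute into the same Lance data, here's a minimal sketch assuming the duckdb and pylance packages; the dataset path is hypothetical and reuses the earlier example.

```python
# A minimal sketch of pointing existing compute at the same Lance data,
# assuming the duckdb and pylance packages; the dataset path is hypothetical.
import duckdb
import lance

ds = lance.dataset("./my-lance-data/quotes.lance")  # same files LanceDB wrote

# DuckDB can scan the Arrow data directly by referring to the Python variable.
arrow_table = ds.to_table()
print(duckdb.query("SELECT count(*) AS n FROM arrow_table").to_df())

# Or pull it into pandas for experimentation; no export or copy step needed.
df = arrow_table.to_pandas()
print(df.head())
```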
That's really interesting. I think there's a couple of key points
where you just said that I'd love to dig in a little bit more.
But one is, actually, we had Chris on, Chris Riccomini,
and he was talking a lot about how the rise of object store
is going to remove the need for data integration.
Because if everything's sitting on top of the same S3 bucket,
and let's assume we've all agreed on some intermediate format
where all the different data tools and ecosystem components can all talk the same language
and potentially have the same bucket layout, then what do you need Kafka for?
Because everything's in one place in the first place.
And so it sounds a lot like what you're saying is very similar to that,
which is your goal is to have your vector store that you're using for doing RAG,
plus your training pipelines you're using for doing fine tuning, all sitting on top of the same warehouse.
Is that what you are saying effectively?
Yes, that's a huge part of it.
For example, we're the only open source vector database, I think,
that lets you just put data into object store directly and query it from anywhere.
So a bunch of our users say, hey, I can just have my vectors sitting in S3 next to my images or next to the text. They can point LanceDB at it and query it.
And I don't have to pay for anything, essentially,
other than the S3 storage.
And also, once it's in object storage,
a lot of other things are taken care of.
Like, you don't need to worry about replication.
S3 comes with a lot of tools around encryption, right?
And key management and permissions.
So when you have a system that can interact really well
with object store, it really simplifies your stack.
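As a rough sketch of that object-store-first setup, assuming the lancedb package with S3 support; the bucket, prefix, table name, and query vector are hypothetical.

```python
# A minimal sketch of pointing LanceDB straight at object storage, assuming the
# lancedb package with S3 support; the bucket, prefix, table, and query vector
# are hypothetical.
import lancedb

# The "database" is just Lance files under an S3 prefix, so replication,
# encryption, and permissions ride on what S3 already provides.
db = lancedb.connect("s3://my-company-bucket/lance")
table = db.open_table("product_images")

results = table.search([0.12, 0.48, 0.33]).limit(10).to_pandas()
print(results)
```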
That's interesting.
Curious to get your experience and your thought process.
Like, one of the things I've been thinking a lot about is
a lot of the similarity score stuff we're doing in RAG,
you made the point already, it's good enough for
the 60% solution.
One thing you suggested is knowledge graphs.
Are we missing fundamental data representations to get from 60 to 80 or 90%?
And is it just graphs, or are there other things
you think we're missing to get to the point where we can retrieve
the context, even give it to the
model to have the reasoning model actually generate that precise response? How big is that gap?
I guess the next question I have, since you're a practitioner in the field, you're talking to
people building these systems all the time, is this one of those situations where 80% of the
effort is going to be in the last 20% or not? How close do you think we are to actually being able
to get to reasonable, precise systems with what we have today? And how much of that gap do we still have
to fill in? Yeah, I think for a lot of use cases today, the common retrieval methods
are good enough for a lot of them to get to that 80-90% level. So this means like, you know,
you're finding optimal ways to chunk your data.
You're finding optimal ways to do re-ranking of full-text search and vector search results.
And maybe you're doing a little bit of fine-tuning on that vector embedding model.
If you have a specialized use case where your context goes very deep into a knowledge graph, that's when you might need to think about starting to reach for a graph database.
And I'm an embedded database guy, so as far as graph databases go,
I love new solutions like Kuzu, for example,
which I think you guys are also familiar with.
And when I look at Kuzu,
it feels like we share a lot of the same
sort of design philosophies around columnar storage
and making it easy for people to use it,
making it lightweight,
and having it be an embedded system.
So I think there's a lot of interesting developments
there as well.
It feels like there's so much going on in this space.
Do you feel like you need to take a bet
on what could work in the future
and start building around,
like maybe Knowledge Graph
is going to take off a little bit more?
Or are there any particular things you believe in this space, around the future of RAG or vector search, that you think are going to be very important, and that you're taking some bets on at all or not?
I think it's not so much a particular technique that we're going to bet on. So there's a couple of things. One is the big bet that we're essentially making with the philosophy of the company, which is that the RAG stack and AI will still need sort of the same high-level data infrastructure practices that we have from data warehouses, right? Like the same schemas, organization, scale, data management, and all of that, just for a new set of data types, for a new set of workloads, and things like that.
That's what we're betting on with the company. Now for new techniques, I think there are lots of new
techniques coming up on a weekly basis. And I think framework companies like LangChain and LlamaIndex are also at the forefront of that. And they're trying to see what are the really
interesting things people are doing and thinking about integrating that into those frameworks.
I think from our perspective,
the work that we want to do there is essentially,
that's why having that open ecosystem is so important to make it play well with the layers above it.
So that's why we made a huge emphasis
on our Apache Arrow integration.
Once you integrate with Arrow, chances are the higher-level systems already talk to Arrow. So that integration becomes very easy.
And that's why we make that SQLite form factor.
So whatever the new techniques are, most of the time it reduces down into a,
okay, I can squint and this looks like a modified DataFrame API,
or this looks like a SQL query with some variation. So by doing that,
I think we don't have to essentially say this technique is going to be the future for RAG versus
that one. I have one question. It's still about RAG, but actually more about the multi-modalities.
Are there specific types of problems that are better for similarity search or better for different modalities of RAG?
Like, for example, are there specific problems like code?
Like, is code better as a graph problem or similarity search problem?
Do you have an experience and thoughts on which types of algorithms and approaches are better suited for what types of modalities and problem cases?
That's really interesting.
I've been working with a couple of friends on using LanceDB for things like text-to-SQL, not necessarily code.
But it's hard to say whether a graph database search
will definitely offer that next step function up.
I think partly because traditional graph databases are just really hard to use, and so not enough folks are experimenting with that.
Conceptually, I think there is a lot of potential there because, for example, we've had folks
come to LanceDB and say, hey, I'm writing this search-over-code or chat-with-code thing, and I want to be able to retrieve information as,
find the most similar classes,
and then each class will have methods.
And then I want to be able to then traverse into each method
and search for the most relevant methods.
And then each method has function signatures and comments and all that,
and be able to traverse down and find a search
through those spaces as well.
On the surface, I think it does
look like a graph search problem to me.
So I'm sort of eagerly
anticipating folks who are
deep into graph databases
to experiment with that, and hopefully
we'll see some positive results there.
Alright. Let's start
our spicy future section right here.
Spicy futures.
And so as you know,
we've been doing this for some time now.
I want you to hit us with the punch, man.
What do you believe about AI or the vector databases of the world that's going to be the future?
Okay.
I actually kind of have two slightly spicy takes.
One is a lot of folks are
focusing on the features for RAG and features for vector search. And not a lot of folks are
talking about the underlying engineering, the scalability and performance aspects of it.
And I don't think scale and performance is a solved problem when it comes to vector database infrastructure.
So our spicy take is that the traditional microservices, kind of, like, share-nothing architecture of that era of distributed databases doesn't really work for the top end of the scale for vector databases. It just tends to be very complicated to scale and turns out to be very expensive to scale. I'm talking
about like multiple billions of embeddings with, you know, thousands of QPS and like, you know,
hundreds or thousands of, you know, insertions per second and that type of scale. So my spicy
take is basically how we architect at Lance: that storage layer allows us to have a separate shared storage layer and a sort of stateless query layer.
So you can find this type of architecture
in databases like Neon, for example,
that offers folks a much simpler experience
on that Postgres-like OLTP experience.
And we're essentially giving folks roughly the same thing
for scalable vector search.
So that's one.
And then number two is, when I look at most vector databases today, they look less like full-fledged databases and more like just an index with a service around it. And so if you
imagine back in the day, maybe when B-tree indices were first introduced, it was amazing. You could
wrap a B-tree index in a service and be able to reference your data, but those don't really
stand the test of time. On the other hand, traditional databases that have added vector
indices, it's also not a long-term solution for enterprises that are introducing more and more
AI-native solutions and need more sophisticated data management, querying capabilities, and higher scale. So I think there's going to be
a convergence of solutions there
where vector database companies in general
will try to look more and more like
a traditional database,
and maybe certain traditional database companies
that don't have as much design architecture baggage
can re-architect to become much more AI native. And I think we'll
see that one convergence maybe a couple years down the road, but narrow vector search companies will
cease to exist or become like a library that plugs in somewhere. By the same token, I think there's
going to be a divergence where people who are using vector databases today for RAG
will find that there's two sets of needs and directions
for data infrastructure.
One is, okay, what I actually want is a search engine.
I don't care about the vector embedding.
I don't care about these different indexes.
I have this bunch of images and text and videos,
maybe all three of those, in some S3 bucket.
Hey, search engine, go look at that thing and make it searchable.
And I just want to be able to ask natural language questions or send images to it.
And it gives me really good results.
On the other hand, it would be like folks who are, okay, I need to actually train my model.
I need to fine-tune stuff.
I need very good management capabilities.
And I need full control and customizability.
And what I really need then is,
okay, I need a data warehouse
or a data lake for all that AI data.
So I think folks who are creating
data infrastructure for RAG today, I think we'll see that divergence. Some of them might become services that are search engines. Some of them might become data warehouses.
Can you think of like a dividing line
in the future as to all the small startups,
the SMB and the mid-market,
they're going to go with like the out-of-the-box,
simple like nice DX version
and then the enterprise is going to go with the data warehouse.
Like, have you thought about how the market segments?
Yeah.
So I think AI-native companies, large and small,
will want that data warehouse
because being able to customize that model
is core to their value proposition as a business.
And so it's not so much on the company size,
but what you're actually doing
and what is the role of AI in your company.
But even though every company today is trying to add AI, I think most of them are still thinking of it as,
I have a core value proposition to my users or customers that is orthogonal to AI.
AI is a great value add that I need to tackle. Fundamentally, I am just improving on
that core value proposition. And so a lot of them, if they didn't have AI teams before,
they may not want to hire in-house talent to do all that custom model training. And so they're
going to want an Algolia for the AI era, if you will, that takes care of all that underlying data
management and fine-tuning all that for them.
Do you believe that every company in the world will have their own fine-tuned models?
Or do you believe that the people that will actually fine-tune models and do the work
to bring their private data are going to be the set of folks where AI is like the company,
right?
Like you have OpenAI obviously is an AI company.
That is what they sell. And then you have Nike who make shoes, but will want to add AI into the
website and AI into their backend processes. And they'll want to automate a bunch of things they
can, but ultimately at the end of the day, they actually just sell some shoes and their innovation
points go to making better shoes, not like making better models from the ground up. I'm kind of
curious, like, have you thought about that market segmentation as well? Is fine-tuning an example
of a thing? Is data set curation a thing that will be true across everybody? Or are we going to have
this sort of like middle layer of verticalized vendors that might be use case specific?
Right. So there's a couple of points here. I mean, this is a great question to think of. And a lot of it is just speculation, to be perfectly honest, because who knows what it's going to look like by the end of 2024.
So what I think is for smaller companies where maybe AI is like one application or one use case, most of them probably won't really care to do the fine-tuning themselves.
And what they really want is just a way to plug in their own data and use their own data to make
retrieval better. And ideally, it would just automatically get better without them having
to do anything. The only thing they have to do is just use the system. And then, like, on the top end, these AI-native companies, you know, your Midjourney or Runway, they're always going to need that level of customization, and it's not just fine-tuning, but, like, they're training their own models.
In the middle, I think, is where it's, like, you know, the Nikes and maybe, like, the Walmarts and Macy's. These are large companies that are not AI-native; traditionally, you don't think of them as having the most advanced AI practices.
But I think AI in those large enterprises,
it's a diverse set of use cases across the company.
There are probably fairly stringent data privacy
and security requirements
and very complicated procurement process
and also an internal team that's managing all that.
And so my prediction is they will want to have
some sort of customization that they're implementing on their own.
And I think fine-tuning would be a perfect example of that
where fine-tuning becomes easier and easier
and fine-tuning is much cheaper than just pre-training your own model from scratch on your own data.
And so my prediction is larger enterprises, even non-AI native ones, will want to do at
least part of that in-house.
So I want to ask the question: fundamentally, there's the pure vector search, and there's, like, the platform that's more production-ready. There's always a debate, like, hey, why don't you use pgvector? Or why don't you use, like, some existing database with a vector index added? Can you give us an example? What's, like, a fundamental thing that's hard to scale when you use, like, pgvector, for example, and that you've done something about to make sure it actually works at scale or at some production quality? Is there any particular example we can talk about that sort of highlights why we need almost, like, a vector database, or even LanceDB specifically, that's designed from the ground up, instead of just, like, plugging into Neon and a pgvector-ish kind of thing?
So I think it's interesting to think about pgvector in particular.
But in general, there are three buckets of things that I hear a lot about.
So one is scale and performance.
And pgvector is sort of part of the Postgres architecture.
And it works great when you have a small amount of data
and your queries are not that complex.
But if you're talking about like tens of millions
of embeddings or even hundreds of millions,
it's really hard to handle that with PG Vector.
Postgres does not scale out, right?
You scale up, you just get bigger nodes,
but that has a limit and it gets costly.
Also, like Vector Search is a very IO-intensive workload that is very different from your
traditional OLTP workload that Postgres is used for.
So when you mix the two, it gives you wonky SLAs in production.
I still remember when I used to work at Tubi, the data science team wanted access to the
MySQL server
that housed all of the metadata
for their analysis.
And that was like part of the production system.
And so I think while I wasn't looking,
they convinced somebody to open that up.
And the first query they sent
promptly brought it down
because it was just like this massive OLAP query
that got sent to MySQL.
And so that's one set of things, whereas, you know, LanceDB, and also many purebred vector databases, really pay attention to this problem of, like, you know, how do you scale out? How do you have a distributed vector database that can handle, you know, up to, like, billions of embeddings? I think LanceDB in particular also excels at scaling because our indices are fully disk-backed, so the amount of memory you need to use is actually very little. We allow you to scale compute and storage separately, right?
And so that's one bucket. The second bucket is workflows. And so, you know,
I hear, like, I started on pgvector, but I'm having to manage my own embedding pipeline and it's a real pain. Like, hey, LanceDB has this embedding registry where I can just specify the embedding model, and LanceDB takes care of calling the embedding model and adding it and all that, so it's much easier. Or things like, you know, for pgvector it's not a dedicated API, so you're writing a SQL query with specialized syntax for vector search.
And sometimes if your query is very complicated,
the query planner will just skip the index.
And so now you get to this really slow state.
And it's hard to predict when that happens,
when your query gets complicated.
And then the third bucket is just features.
Like, hey, I need that extra little bit of retrieval quality. I want to do hybrid search. I want to do re-ranking. You know, I want to be able to, like, store that data to fine-tune my model and all that. And, like, Postgres isn't a good fit for any of those things.
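As a rough sketch of the embedding registry and hybrid search workflow described above, assuming a recent lancedb Python package; the model choice, path, table name, and documents are hypothetical.

```python
# A sketch of the embedding-registry plus hybrid-search workflow, assuming a
# recent lancedb Python package; the model, path, table, and documents are
# hypothetical.
import lancedb
from lancedb.embeddings import get_registry
from lancedb.pydantic import LanceModel, Vector

# Pick an embedding function from the registry; LanceDB calls it on ingest and
# on query, so there's no separate embedding pipeline to manage.
embedder = get_registry().get("sentence-transformers").create(name="BAAI/bge-small-en-v1.5")

class Doc(LanceModel):
    text: str = embedder.SourceField()                          # raw text to embed
    vector: Vector(embedder.ndims()) = embedder.VectorField()   # filled in automatically

db = lancedb.connect("./rag-demo")
table = db.create_table("docs", schema=Doc, mode="overwrite")
table.add([
    {"text": "LanceDB stores vectors next to the raw data."},
    {"text": "pgvector lives inside Postgres."},
])

# Full-text index plus vector search gives you hybrid retrieval in one call.
table.create_fts_index("text", replace=True)
hits = table.search("where do my vectors live?", query_type="hybrid").limit(5).to_pandas()
print(hits)
```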
Cool. Well, I think we got all the spicy take we could have got,
but this is awesome.
We have probably even a whole lot more
we're going to talk about,
but for people that want to know more,
learn more about LanceDB or you,
where should we find information about you?
So if you want to learn the details
and see how the sausage is made,
come to GitHub.
So our GitHub org is LanceDB,
and then we have Lance, which is the columnar format repo, and LanceDB, which is the vector database.
If you're looking for examples and help, we have a vectordb-recipes repository with dozens of worked examples.
Come to our Discord
for real-time conversations with folks.
If you want more spicy takes
and see me troll people,
come to Twitter.
I think our company account is LanceDB and my personal Twitter handle is changhiskhan. It's been such a pleasure.
Thank you so much. Thanks, guys.