The Infra Pod - Vector databases is not a feature? Let's dive deeper with Chang

Episode Date: April 1, 2024

Ian and Tim sat down with Chang She from LanceDB to talk about what's happening in the vector DB world, and how Lance started as a file format and grew into a data lake for multimodal AI. ...

Transcript
Starting point is 00:00:00 Welcome back to yet another Infra Deep Dive podcast. This is Tim from Essence VC. And Ian, let's go. Let's do it. This is Ian Livingston, helping turn Snyk from a bag of stuff into a platform. And I'm so excited today, Tim. We're joined by Chang She at LanceDB. And I am so excited to talk about what LanceDB is doing. It's an embedded vector store.
Starting point is 00:00:26 I don't really yet know what that actually means, but I think we're going to find out. Tell us about yourself. Hey, guys. It's a pleasure to be here. I'm super excited. And I think we're going to have fun today. I'm the CEO and co-founder of LanceDB. And I'd like to say what we're building is more than just a vector database. What we're building is a new type of lakehouse for multimodal AI. Certainly that includes the embedded vector store, but it is very different because we actually start from the bottom up. We have a new data storage layer, a new data format. And the whole approach that we take with LanceDB is that it's really, really difficult to manage unstructured data that you need for multimodal AI. So we make that easy,
Starting point is 00:01:14 not just for vector search, but there's like a whole slew of other things that you need to do with managing that data. And that is basically at a high level how we approach the problem. That's really interesting. For our audience, what is it about multimodal that requires a new format? What is it? Why is it complicated? What's the underlying problem
Starting point is 00:01:34 that spurred, oh, we need to build something for this? Yeah, absolutely. So I've been making tools to make it easier for data science and machine learning for a while now, starting with working on pandas and then working on recommender systems. And basically, before the current era with multimodal AI, you got data that more or less fit into Excel sheets or tables. And if you've got these like data frame tools or like SQL databases, your life was okay. It wasn't, it wasn't perfect, but it also wasn't terrible
Starting point is 00:02:12 until you start dealing with like images and videos and, you know, vector embeddings and PDFs and all that stuff that surprisingly makes up like most of enterprise data by volume but it just doesn't really fit well into a table and so when people want to work with it like manage it store it it inevitably becomes a pile of goo like i don't know how many times i've run into these problems all like you know i set up a table with links to all the images, and it works great for about two weeks until somebody moves that directory of images, and my whole data set's broken, and I have no idea what happened. I have to chase that around. And there's just like one tip of the iceberg of managing that kind of data.
Starting point is 00:02:57 But it turns out, you know, these are data that's like, we call it dark data in the enterprise, where people don't really know how to extract value out of very easily. And so it turns out that is exactly the type of data that's actually valuable for AI today. I want to probably talk about your journey of making into LanceDB. And I actually still want to also understand what's the overall products you actually do. So when we talked, you're doing Edo, and you started with the file format, Lance, right?
Starting point is 00:03:33 So you're probably one of the weirdest database company that starts with the file open source format and go upwards, right? Everybody else goes downwards. So maybe talk about your journey. Why do you start this file format as a beginning? And what led you to even build a database on top of that? Well, in the very beginning, we actually did try to go from the top down. We started building this query engine and it was loosely based on Spark and Parquet and a number of open source components.
Starting point is 00:04:00 And we ran into the same problems over and over again that we were trying to solve now with Lance Format. And it wasn't until we sort of banged our head enough that we figured out, okay, there's not enough lipstick in the world that you can put on this pig to make it to work. And then we talked to a lot of practitioners. This was 2022. And, you know, pre-ChatGPT autonomous vehicles was still the king of ai and machine learning and we talked to a lot of like computer vision practitioners and everyone complained about the same thing you know i can't have the same system to do data exploration versus training
Starting point is 00:04:36 i tried to make it work with parquet 10 different ways never works and you know data exploration for my image data set boils down to, hey, there's an app called Finder in Mac. And when you click on a directory with images, it gives you a bunch of thumbnails. That's how a lot of them did their, you know, quote unquote, data exploration and curation. And that gave us the confidence to say, okay, it's 2022. We're not crazy. Don't want to create low-level format layer. And there's a reasonable chance that this is something that is actually valuable to the community. That was sort of where we started at the file format layer.
Starting point is 00:05:13 We weren't thinking about a vector database at all. What we were thinking about is essentially, how can we make managing unstructured data as structured as possible? We want to make that feel like you're working with a data warehouse with schemas and tables and structured queries and things like that. And one of the benefits of the file format was, hey, we can support rapid random access. So now it makes sense to put an index along with the file format. So then the computer vision, people can do things like deduplicate images, active learning, finding similar images. And so it's sort of starting to rhyme like, okay, this is a index for, you know, like D trees and bitmaps, but also for vectors.
Starting point is 00:06:00 And we added a vector index to do those taps of taps people want to do in computer vision. And then I think this was like maybe beginning of 2023, end of 2022, when the other folks in the open source community discovered Lance and was like, hey, I can use this as a vector database. Like, why am I paying out the nose for other solutions when I can just get this for free. And we sort of felt the pull from the community and created this embedded database layer that was much more targeted towards like RAG and AI and search and recommender systems. So a winding road to this embedded vector database. Yeah, it's super fascinating. I think you already alluded to why we need a file format
Starting point is 00:06:42 and kind of leads you to why you're building this, but probably one thing that isn't super clear, this feels more like a data lake than just purely like an enclosed database company where most databases you put your data in, the only way to access it is to put data out. But when you build an open source file format, I guess the purpose is to be able to put it like a data lake
Starting point is 00:07:01 where multiple things can access at the same time. Is that the intention? Like why put out the file formats as a database company? Because I don't think that's a typical thing, you know, Snowflake isn't putting out its internal data form because it has a lot of iterations, had a lot of proprietary information in it.
Starting point is 00:07:18 Like what is almost like your intention to make it more open and build a data engine on top of that? Do you want this to become like an open source ecosystem with a file format on top what is the intention here yeah so i think the intention yes is to become a much bigger ecosystem with the data lake and the purpose of us making that totally open source at the format level is because the file format actually brings a lot of unexpected benefits that works really well across all these different modes.
Starting point is 00:07:51 We have a meme internally on the team that we just pass around pictures of Marie Antoinette with, you know, have your cake and eat it too. And that's essentially what we want to give to our users who are, you know, managing AI data. it's like previously there were there was no system that was good for managing both tabular and unstructured data previously there was no system where it would allow you to use the same piece of data in real time serving and olap and training and there was no system that allowed you to do ml experimentation training fine tuning and data exploration all with the same piece of data within the same system.
Starting point is 00:08:29 It's very easy to store and query petabytes of data in this open-source layer. And then on top of that, we're building an entire data ecosystem on top, starting with the vector database. I have so many questions. So you built this file format. You're focused on non-tabular data, right? If you have tabular data, Parquet's great, but why a vector database? Why is that the first thing in your journey? So a lot of it is, where are the sharp paints in the community?
Starting point is 00:08:57 And for AI ML practitioners, what do they have the most trouble with and what's most lacking? And so we started that vector database journey in the end of 2022, beginning in 2023. I think if I were in the same place, in the same spot today, I'd maybe have second thoughts around starting a vector database, given all that's happened in the past year. But at the time, that seemed like that was the biggest pain and the biggest gap in the market. Because at that time, that seemed like that was the biggest pain and the biggest gap in the market because at that time, there was no real viable option for a lightweight vector search. And I think there's still not great options outside of Lance where it allows you to combine that vector
Starting point is 00:09:39 search and data management. That's sort of the motivation for us to do that in the beginning. Is your philosophy behind this embedded vector store, is it a lot like the DuckDB philosophy for analytics? Is that an apt analogy? Help those of us who are uninformed be better informed in our mental model. Yeah, help us understand. Yeah. So I'm a math nerd. I wasn't good enough to be a math major in school, but I always liked math. So one of the things that I like to do is think about how can I reduce the solution for a new problem to a previously solved state? And so that's our approach for that.
Starting point is 00:10:16 I think DuckDB is a great analogy. We've had folks on Twitter and LinkedIn say, hey, they like Lancey B because it feels like SQLite for vector search. And so it's sort of that same lightweight feel. Got it, yeah. And one of the things you said in your introduction when you were explaining multimodal, I'd love to dig in more as you talked about the fact
Starting point is 00:10:34 that the vast majority of the best data that will be the most valuable data for this new generation of AI, or let's say this new generation of large models, right, large vision models or image models, text-to-speech models, whatever, isn't tabular. It's this other undefined format. Help us understand, like, what does the ecosystem look like today for tools? And what do you think the ecosystem is going to look like as time moves on?
Starting point is 00:10:58 Like, one of the things you said at the beginning that I thought was unique and interesting is that, well, when you went and thought about autonomous vehicles, it was like, well, the actual infrastructure they have to build those systems was very poor. Can you help us understand what's the state of that tooling ecosystem today and where do you think it's heading? Yeah, absolutely. We were just talking about what a LAMP stack for AI looks like. And the answer is, right now, there is no LAMP stack. Anybody who tells you this is the LAMP stack for AI is trying to sell you a bridge to nowhere. And that's partly why this market is so fun right now to try to build that out. I think certainly around data infrastructure, retrieval, fine-tuning modeling, prompt management, and then orchestration.
Starting point is 00:11:45 Those are all layers where folks are trying to standardize and build the best-in-class solutions in. And then, of course, you have the frameworks like lane-chain, all that, and all those are stitching together. And so the trends around this is previous generations of machine learning evolutions were primarily in Python and primarily required you to be
Starting point is 00:12:09 at least familiar with the mathematics behind the modeling and ML concepts. And it's no longer true. I think it's a great trend and it's democratizing AI. So you can have a lot more folks that are coming into the field building compelling applications without having to spend years understanding the complicated mathematics behind it. But on the flip side, it also means there are now a flood of new developers that
Starting point is 00:12:36 aren't necessarily familiar with data infrastructure and don't have the battle scars from the shared pains of managing that data into the field. So there's a lot of relearning about what works, what doesn't work. And last year, we spent a lot of time building demo ware and going to hackathons. And this year, everyone's like, okay, how do I put this thing in production? And essentially relearning a lot of the lessons from like the 2010s. And so my prediction is that stack as it forms will look closer and closer to kind of the stack that we had before. It won't look exactly the same, but it'll look much closer than what it is today. I think what everybody's trying to predict and figure out is, is there going to be a LAMP stack?
Starting point is 00:13:25 And we're trying to have this sort of standard four-letter word stack or five-letter word stack. In the meantime, LAMP was the starting point. And there's a whole bunch of, even back in the day when we were working at Mesosphere, we had this max stack and all these things. Some last longer, some doesn't. And I guess maybe the question is,, how do you see vector database, the role continue to evolve? Because, you know, for most people, they put this sort of like head that vector database is just storing vectors and doing similarity search and doing some simple things here. I feel like everybody in the ecosystem is all evolving quickly and we're not actually are
Starting point is 00:14:01 all catching up while the technology is shifting quickly there's so much more models now and embedding models are becoming more being trying to be innovative on its own too what are things you believe are the most important besides just doing search on vectors that helps people to use lance db and that's sort of like a question is like, can that become so much more crucial for folks to like understand how to tell you're the best option as well? Totally. So the way I'm looking at it is not so much like, you know, what vector databases can do, but like, what are the pressing problems the community needs to solve? And typically retrieval for RAG and better semantic search are problems that folks are tackling today. And going from a demo to production means a lot of them need that extra
Starting point is 00:14:56 20% in retrieval quality, right? Going from like a 60% retrieval quality with simple vector search alone to 80 to 90 percent retrieval quality that you need for production. And typically, that gap is filled in with a lot of things. One is you can experiment with preparing your data and chunk data differently. You can clean your data differently. So you need tools to store that data, to run experiments, to make it easy to query that raw data. And then you need not just vector search, but you need different ways to retrieve information from full-text search, like just SQL queries. And folks are now experimenting with like cohort embedders and retrievers knowledge graphs and graph databases and all that so now you've you've got this like diverse set of ways to retrieve information now you need a way to actually combine all those results and re-rank them so that the
Starting point is 00:15:59 real top quality contexts can then go into your rack. And that's sort of a pipeline that's starting to look more and more like what I worked on before, which is recommender systems. And that next big step then is a lot of folks who I ran into today building RAC is, we got something up and going as quickly as possible. Typically, that means, okay, we use open AI for embedding model and completions. But the next step for them is always, okay, how do we actually leverage our own data to fine-tune a lot of those models? That's really how they built their adage, right? If you build a standard RAC pipeline,
Starting point is 00:16:41 every one of your competitors can also build the same pipeline, but nobody has your data. So if you can use your data to make that embedding model make retrieval more accurate or your completion model better, then you have a lasting edge. And so
Starting point is 00:16:57 some of the more advanced users I've run into have told me that, hey, they're getting really good results with fine-tuning. So you've got the top of the MTEB leaderboard, and then you've got small open-source models. And right now, I think everyone's like, okay, let's create better and better and bigger generic embedding models.
Starting point is 00:17:19 But some of our users are finding, hey, like $10 in generated synthetic data, they can fine tune an open source model to be better than the top of the MTB leaderboard, which is pretty insane. And also, I think really great for a lot of enterprises with private data looking to build an AI stack that's actually differentiated. Put all that together, it's like that feedback loop and you need data management, you need versioning reproducibility, you need different ways to query your data. Again, we get to that previous state in autonomous vehicles where, okay, I need to have my data split
Starting point is 00:17:58 into three different formats in three different places with different systems talking to all of them. And then I'm spending maybe like half my time just trying to keep the data in sync with each other. And it becomes a huge mess. That's at the core of the problem we want to solve with LanceDB is you can put everything together. You can create any way you want. You can run DuckDB on LanceDB data tables.
Starting point is 00:18:20 You can get the data into Pandas and Polars and Spark. And we're working on like Presto and Trino integration. It's sort of taking your existing compute and you can just plug your Lance data set right in. You don't have to worry about making copies for experiments and rollbacks and time travel. That's all taken care of as well. That's really interesting. I think there's a couple of key points where you just said that I'd love to dig in a little bit more.
Starting point is 00:18:47 But one is, actually, we had Chris on, Rick Comoni, and he was talking a lot about how the rise of object store is going to remove the need for data integration. Because if everything's sitting on top of the same S3 bucket, and let's assume we've all agreed on some intermediate format where all the different data tools and ecosystem components can all talk the same language and potentially have the same bucket layout, then what do you need Kafka for? Because everything's in one place in the first place.
Starting point is 00:19:13 And so it sounds a lot like what you're saying is very similar to that, which is your goal is to have your vector store that you're using for doing RAG, plus your training pipelines you're using for doing fine tuning, all sitting on top of the same warehouse. Is that what you are saying effectively? Yes, that's a huge part of it. For example, we're the only open source vector database, I think, that lets you just put data into ObjectStore directly and create it from anywhere.
Starting point is 00:19:41 So a bunch of our users, hey, I can just have my vectors sitting in S3 next to my images or next to the text. They can run LAN CB and they'd be slammed out to query it. And I don't have to pay for anything, essentially, other than the S3 storage. And also, once it's in object storage, a lot of other things are taken care of.
Starting point is 00:20:01 Like, you don't need to worry about replication. S3 comes with a lot of tools around encryption, right? And key management and permissions. So when you have a system that can interact really well with object store, it really simplifies your stack. That's interesting. Curiously get your experience and your thought process. Like, one of the things I've been thinking a lot about is
Starting point is 00:20:22 a lot of the similarity score stuff we're doing in RAG, you made the point already, it's good enough for the 60% solution. One thing you suggested is knowledge graphs. Are we missing fundamental data representations to get from the 60, 80 to 90%? And is it just graphs, or are there other things
Starting point is 00:20:39 you think we're missing to get to the point where we can retrieve the context, even give it to the model to have the reasoning model actually generate that precise response? How big is that gap? I guess the next question I have, since you're a practitioner in the field, you're talking to people building these systems all the time, is this one of those situations where 80% of the effort is going to be in the last 20% or not? How close do you think we are to actually being able to get to reasonable, precise systems with what we have today? And how much of that gap do we still have to fill in? Yeah, I think for a lot of use cases today, the common retrieval methods
Starting point is 00:21:16 are good enough for a lot of them to get to that 80-90% level. So this means like, you know, you're finding optimal ways to chunk your data. You're finding optimal ways to do re-ranking of well-text search and vector search results. And maybe you're doing a little bit of fine-tuning on that vector embedding model. If you have a specialized use case where your context goes very deep into a knowledge graph is when you might need to think about starting to reach out for a graph database.
Starting point is 00:21:52 And I'm an embedded database guy, so as far as graph databases go, I love new solutions like Kuzu, for example, which I think you guys are also familiar with. And when I look at Kuzu, it feels like we share a lot of the same sort of design philosophies around columnar storage and making it easy for people to use it, making it lightweight,
Starting point is 00:22:14 and having it be an embedded system. So I think there's a lot of interesting developments there as well. It feels like there's so much going on in this space. Do you feel like you need to take a bet on what could work in the future and start building around, like maybe Knowledge Graph
Starting point is 00:22:30 is going to take off a little bit more? Or is there any particular things you believe in this space that are around these future of rags or a vector that you think is going to be very important that you're taking some bets on at all or not? I think it's not so much that a particular technique we're going to bet on. So there's a couple of things. One is the big bet that we're essentially making with the philosophy of the
Starting point is 00:22:56 company is that the Rackstack and AI will still need sort of the same high-level data for infrastructure practices that we have from data warehouses, right? Like same schemas, organization scale, data management, and all of those just for a new set of data types, for a new set of workloads, and things like that. That's what we're betting on with the company. Now for new techniques, I think there are lots of new techniques coming up on a weekly basis. And I think framework companies like LaneChain and Lama Index also are at the forefront of that. And they're trying to see what are the really interesting things people are doing and thinking about integrating that into those frameworks.
Starting point is 00:23:43 I think from our perspective, the work that we want to do there is essentially, that's why having that open ecosystem is so important to make it play well with the layers above it. So that's why we made a huge emphasis on our Apache Arrow integration. Once you integrate with Arrow, high-level systems, chances are they already talk to Arrow. So that integration becomes very easy.
Starting point is 00:24:08 And that's why we make that SQLite form factor. So whatever the new techniques are, most of the time it reduces down into a, okay, I can squint and this looks like a modified DataFrame API, or this looks like a SQL query with some variation. So by doing that, I think we don't have to essentially say this technique is going to be the future for RAG versus that one. I have one question. It's more involved in RAG, which is actually more in the multi-modalities. Are there specific types of problems that are better for similarity search or better for different modalities of RAG? Like, for example, are there specific problems like code?
Starting point is 00:24:50 Like, is code better as a graph problem or similarity search problem? Do you have an experience and thoughts on which types of algorithms and approaches are better suited for what types of modalities and problem cases? That's really interesting. I've been working with a couple of friends on using Lansi before, things like Texas SQL, not necessarily code. But it's hard to say whether a graph database search will definitely offer that next step function up. I think partly because traditional graph databases are just really hard to use and that's not
Starting point is 00:25:29 enough. Folks are experimenting with that. Conceptually, I think there is a lot of potential there because, for example, we've had folks come to the Lansi Bee and say, hey, I'm writing this search over code or chat with code thing, I want to be able to retrieve information as, find the most similar classes, and then each class will have methods. And then I want to be able to then traverse into each method and search for the most relevant methods.
Starting point is 00:25:59 And then each method has function signatures and comments and all that, and be able to traverse down and find a search through those spaces as well. On the surface, I think it does look like a graph search problem to me. So I'm sort of eagerly anticipating folks who are deep into graph databases
Starting point is 00:26:17 to experiment with that, and hopefully we'll see some positive results there. Alright. Let's start our spicy future section right here. Spicy futures. And so as you know, we've been doing this for some time now. I want you to hit us with the punch, man.
Starting point is 00:26:35 What do you believe about AI or record databases of the world? It's going to be a future. Okay. I actually kind of have two slightly spicy takes. One is a lot of folks are focusing on the features for RAG and features for vector search. And not a lot of folks are talking about the underlying engineering, the scalability and performance aspects of it.
Starting point is 00:26:58 And I don't think scale and performance is a solved problem when it comes to vector database infrastructure. So our spicy take is traditional microservices kind of like share-nothing architecture of that era of distributed databases doesn't really work for the top end of the scale for vector databases. It just tends to be very complicated to scale and turns out to be very expensive to scale. I'm talking about like multiple billions of embeddings with, you know, thousands of QPS and like, you know, hundreds or thousands of, you know, insertions per second and that type of scale. So my spicy take is basically how we architect at Lance is with that storage layer, this allows us to have a separate shared storage layer and a sort of stateless query layer. So you can find this type of architecture in databases like Neon, for example,
Starting point is 00:27:54 that offers folks a much simpler experience on that Postgres-like OLTP experience. And we're essentially giving folks roughly the same thing for scalable vector search. So that's one. And then number two is, even though I at most vector databases today, they look less like full-fledged databases and more just like an index with a service around that. And so if you imagine back in the day, maybe when B-tree indices were first introduced, it was amazing. You could
Starting point is 00:28:39 wrap a B-tree index in a service and be able to reference your data, but those don't really stand the test of time. On the other hand, traditional databases that have added vector indices, it's also not a long-term solution for enterprises that are introducing more and more AI-native solutions and need more sophisticated data management, querying capabilities, and higher scale. So I think there's going to be a convergence of solutions there where vector database companies in general will try to look more and more like a traditional database,
Starting point is 00:29:16 and maybe certain traditional database companies that don't have as much design architecture baggage can re-architect to become much more AI native. And I think we'll see that one convergence maybe a couple years down the road, but narrow vector search companies will cease to exist or become like a library that plugs in somewhere. In the same token, I think there's going to be a divergence where people who are using vector databases today for RAG will find that there's two sets of needs and directions for data infrastructure.
Starting point is 00:29:53 One is, okay, what I actually want is a search engine. I don't care about the vector embedding. I don't care about these different index. I have this bunch of images and text and videos, maybe all three of those, in some S3 bucket. Hey, search engine, go look at that thing and make it searchable. And I just want to be able to ask natural language questions or send images to it. And it gives me really good results.
Starting point is 00:30:19 On the other hand, it would be like folks who are, okay, I need to actually train my model. I need to fine-tune stuff. I need very good management capabilities. And I need full control and customizability. And what I really need then is, okay, I need a data warehouse or a data lake for all that AI data. So I think folks who are creating
Starting point is 00:30:40 data infrastructure for RAD today, I think we'll see that divergence. Some of them might become services that are search divergence. Some of them might become like services that are search engines. Some of them become like data warehouses. Can you think of like a dividing line in the future as to all the small startups, the SMB and the mid-market,
Starting point is 00:30:55 they're going to go with like the out-of-the-box, simple like nice DX version and then the enterprise is going to do with the data ware. Like, have you thought about how the market segments? Yeah. So I think AI-native companies, large and small, will want that data warehouse because being able to customize that model
Starting point is 00:31:16 is core to their value proposition as a business. And so it's not so much on the company size, but what you're actually doing and what is the role of AI in your company. But even though every company today is trying to add AI, I think most of them are still thinking of it as, I have a core value proposition to my users or customers that is orthogonal to AI. AI is a great value add that I need to tackle. Fundamentally, I am just improving on that core value proposition. And so a lot of them, if they didn't have AI teams before,
Starting point is 00:31:53 they may not want to hire in-house talent to do all that custom model training. And so they're going to want an Algolia for the AI era, if you will, that takes care of all that underlying data management and fine-tuning all that for them. Do you believe that every company in the world will have their own fine-tuned models? Or do you believe that the people that will actually fine-tune models and do the work to bring their private data are going to be the set of folks where AI is like the company, right? Like you have OpenAI obviously is an AI company.
Starting point is 00:32:26 That is what they sell. And then you have Nike who make shoes, but will want to add AI into the website and AI into their backend processes. And they'll want to automate a bunch of things they can, but ultimately at the end of the day, they actually just sell some shoes and their innovation points go to making better shoes, not like making better models from the ground up. I'm kind of curious, like, have you thought about that market segmentation as well? Is fine-tuning an example of a thing? Is data set curation a thing that will be true across everybody? Or are we going to have this sort of like middle layer of verticalized vendors that might be use case specific? Right. So there's a couple of points here. I mean, this is a great question to think of. And a lot of it is just speculation, to be perfectly honest, because who knows what it's going to look like by the end of 2024.
Starting point is 00:33:13 So what I think is for smaller companies where maybe AI is like one application or one use case, most of them probably won't really care to do the fine-tuning themselves. And what they really want is just a way to plug in their own data and use their own data to make Retrieval better. And ideally, it would just automatically get better without them having to do anything. The only thing they have to do is just use the system and then like on the top end like these ai native companies you know your mid journey or runway they're always going to need that level of customization and it's not just fine tuning but like they're training their own models in the middle i think where it's like you know the nikes and maybe like walmart and macy's these are like large companies that are not native traditionally you don't think of them as having the most advanced AI practices. But I think AI in those large enterprises,
Starting point is 00:34:09 it's a diverse set of use cases across the company. There are probably fairly stringent data privacy and security requirements and very complicated procurement process and also an internal team that's managing all that. And so my prediction is they will want to have some sort of customization that they're implementing on their own. And I think fine-tuning would be a perfect example of that
Starting point is 00:34:35 where fine-tuning becomes easier and easier and fine-tuning is much cheaper than just training your own data, pre-training your own model from scratch. And so my prediction is larger enterprises, even non-AI native ones, will want to do at least part of that in-house. So I want to ask the question, fundamentally, there's the neural vector search and there's like the platform or more production ready. There's always a debate about like, hey, why don't you use PG vector?
Starting point is 00:35:07 Or why don't you use like some like native company databases, existing vector added? Can you give us an example? What's like a fundamental thing that's hard to scale when you use like a BG vector, for example, that you've done to make sure it actually works at scale or some production quality is there any particular example we can sort of like talk about so it's sort of like highlights why we need almost like a vector database or even like lance tv specifically
Starting point is 00:35:35 that's being designed from growing up instead of just like yeah totally fucking neon and vg pg vector ish kind of thing and you can just. So I think it's interesting to think about PG Vector in particular. But in general, there's three buckets of things that I hear a lot on. So one is scale and performance. And PG Vector is sort of part of the Postgres architecture. And it works great when you have a small amount of data and your queries are not that complex. But if you're talking about like tens of millions
Starting point is 00:36:09 of embeddings or even hundreds of millions, it's really hard to handle that with PG Vector. Postgres does not scale out, right? You scale up, you just get bigger nodes, but that has a limit and it gets costly. Also, like Vector Search is a very IO-intensive workload that is very different from your traditional OLTP workload that Postgres is used for. So when you mix the two, it gives you wonky SLAs in production.
Starting point is 00:36:39 I still remember when I used to work at Tubi, the data science team wanted access to the MySQL server that housed all of the metadata for their analysis. And that was like part of the production system. And so I think while I wasn't looking, they convinced somebody to open that up. And the first query they sent
Starting point is 00:36:58 promptly brought it down because it was just like this massive OLAP query that got sent to MySQL. And so that's one set of things, whereas, you know, LAN-CV, and also many purebred vector databases really pay attention to this problem of like, you know, how do you scale out?
Starting point is 00:37:14 How do you have a distributed vector database that can handle, you know, up to like billions of embeddings? I think LAN-CV in particular also excels at scaling because our indices are fully disk backed so that the amount of memory you need to use is actually very little. We can allow you to scale compute and storage separately. Right. And so that's one bucket. The second bucket is workflows. And so, you know, I hear like I start on PG vector,
Starting point is 00:37:44 but I'm like having to manage my own embedding pipeline and it's really pain like hey lance cb has this like embedding registry where i can just specify the embedding model and lance cb takes care of calling the embedding model and adding it and all that so it's much easier or things like you know for pg vector it's not a dedicated api so you're a sql query with specialized syntax for vector search. And sometimes if your query is very complicated, the query planner will just skip the index. And so now you get to this really slow state.
Starting point is 00:38:16 And it's hard to predict when that happens, when your query gets complicated. And then the third bucket is just features. Like, hey, I need that a little bit of retrieval quality. I want to do hybrid search. I want to do re-ranking. You know, I want to be able to like store that data to fine tune my model and all that. And like none of those things, Postgres is a good fit for. Cool. Well, I think we got all the spicy take we could have got,
Starting point is 00:38:42 but this is awesome. We have probably even a whole lot more we're going to talk about, but for people that want to know more, learn more about LanceDB or you, where we should find information about him? So if you want to learn the details and see how the sausage is made,
Starting point is 00:38:58 come to GitHub. So our GitHub org is LanceDB, and then we have Lance, which is the columnar format repo, and LanceDB, which is the columnar format repo, and LanceCB, which is the vector database. If you're looking for examples and help, so we have a vector DB recipes repository with dozens of worked examples.
Starting point is 00:39:17 Come to our Discord for real-time conversations with folks. If you want more spicy takes and see me troll people, come to Twitter. I think our company account is LanceyV and my personal Twitter handle is ChangusKhan. It's been such a pleasure. Thank you so much. Thanks, guys.
