The Data Stack Show - 159: What Is a Vector Database? Featuring Bob van Luijt of Weaviate

Episode Date: October 11, 2023

Highlights from this week's conversation include:
How music impacted Bob's data journey (3:16)
Music's relationship with creativity and innovation (11:38)
The genesis of Weaviate and the idea of vector databases (14:09)
The joy of creation (19:02)
OLAP databases (22:21)
The progression of complexity in databases (24:31)
Vector databases (29:23)
Scaling suboptimal algorithms (34:34)
The future of vector space representation (35:51)
Databases' role in different industries (39:14)
The brute force approach to discovery (45:57)
Retrieval augmented generation (51:26)
How generative models interact with the database (57:55)
Final thoughts and takeaways (1:03:20)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week, we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at rudderstack.com. Welcome back to the Data Stack Show, Kostas. Today we are talking with Bob from Weaviate and so many interesting things about Bob. He's never had a boss, number one. He started building websites when he was 15. He reminds me
Starting point is 00:00:42 a lot of you from that aspect. And then fast forward, he actually has built a company around vector databases and embeddings that support AI type use cases. I mean, what a journey. That's pretty amazing. I want to ask about databases in general on a spectrum, right? So we've, actually, this is sort of a theme. I almost feel like Brooks could write a thesis on databases from the data section
Starting point is 00:01:12 because we've talked about every type of database. But we haven't talked about vector databases. We've talked, I mean, of course, ad nauseam about sort of OLAP, OLTP workflows, graph, we've had a number of graph databases, but I think this is the first vector database. And so I want to zero in on that and put that in the context of sort of the spectrum
Starting point is 00:01:35 of like basic SQL OLAP, you know, to graph, to vector. So that's what I'm going to ask. But how about you? I mean, there's so much here, so. Yeah, I mean, okay, there are two main areas of questions. I think the first one is to make clear to our audience what a vector database is, how it relates to the rest of the databases out there and why we need a different one, another one. And the other one, which is also I think super interesting, is what businesses you can build
Starting point is 00:02:13 around them. Why would someone build one? Why can it even be a sustainable business? And it's not just like a feature of another database, right? So I think we have the right person to talk both on the technical side of things, but also to have a very good and deep conversation around the business of this type of database. All right, well, let's dig in and talk with Bob and solve all these problems and more. Yeah. One last thing. He had at least one boss, as he said, right? And that was his mother. His mother. That's true.
Starting point is 00:02:58 The CRO, right? The CRO. Okay. Let's talk with Bob and we'll hear about Bob's mom being the CRO of Bob's life. Bob, welcome to the Data Stack Show. We're so excited to have you. Well, thanks for having me. All right. Well, you have a really interesting history. So I guess technically you've never had a boss, which is fascinating. So you've always been your own boss, which we may have a few questions about that. But tell us about how you got into data.
Starting point is 00:03:32 Where did you start? And then what are you doing today? Yeah, no, that's, that is correct. If you, I mean, if you exclude my mother, I never had a boss. So the mother is boss or like chief product officer for your life. Exactly. Exactly. Yeah, exactly.
Starting point is 00:03:51 Well, more chief revenue officer, but that's different. So, but no, so I'm a millennial, I'm born in 85. And so that means that, you know what? Yeah, there you go. So that was like when I was 15 years old, we had, you know, the internet, just the internet connection at home and at school too. So I just started to play around, you know,
Starting point is 00:04:15 build websites, that kind of stuff. And at some point there were people, I lived in a small village in the Netherlands, and people said, hey, you know, we need websites for stuff. And I was like, you know, I can build you a website. So then I got a gig to sell toothbrushes and lighters on a website. I don't want to know anything about the security that the website had back then, but they basically,
Starting point is 00:04:41 they said, so how much, you know, they said like, how much money do you want to make? So I asked my dad, I said, how much money? And my dad said, you just ask for like 500 bucks. I was like, I'll do it for 500 bucks. And the guy said, deal. And I was like, whoa, that's a lot of money. I'm rich. Yeah, exactly.
Starting point is 00:04:58 Exactly. So I went to the, as you do in Holland, I went to the Chamber of Commerce and I registered my company. And then you grow, right? So you learn a lot. And then I grew into being more of a software consultant. I did study in between. No CS or anything. So I studied music, because it's another passion I have. I always kept working in technology because, like I even said
Starting point is 00:05:25 to someone, I studied in Boston. I got a grant to study in Boston, and then on the side I was just, yeah, writing. It was like remote work avant la lettre. So, yeah, I was working in Boston for these Dutch companies, and that grew and grew, and then at some point I was introduced to machine learning, and that kind of changed everything, because then I was like, okay, I'm going to stop being like a freelance consultant. I'm going to start a company. So like a product company. So that's, so for a long time already, I'm in software.
Starting point is 00:06:00 I love it. Were you studying music in Boston? Yes. Okay. At Berklee? Yes. Okay. And what's your instrument of choice? I mean, I know multi-talented people go to Berklee, but what's your instrument? Yeah. So I studied bass guitar. And one of the funny things is that people that you now hear on the radio, they were at Berklee at the same time as I was.
Starting point is 00:06:27 And that is just super exciting. But Berklee has been super important for me running a business now. So a lot of things that I've learned being there. I was very young, like really early 20s, when I flew in to Boston, and a lot of things that I learned at Berklee are things that I'm using in building the business today. So there was a very important lesson in my life, and if I could go back in history I
Starting point is 00:07:00 would probably do the same thing again. It was just a great, fantastic period in my life. So yeah, I'm proud of that. Okay, so two questions, because I want to talk about databases and vector databases in particular, but two quick questions on Berklee, because I
Starting point is 00:07:19 can't help myself. So you said that you learned a lot of lessons that helped you on the business side. Can you give us maybe like the top thing? Because you're an entrepreneur, right? Yes. What did you learn at Berklee that helped you from an entrepreneurial standpoint? And then I have a follow-up question. So there are two things. So this sounds a little bit cliche, but it's really true. I learned at Berklee, if people talk about the American dream, I learned at Berklee what that means.
Starting point is 00:07:51 People were living 24-7. They were living that lifestyle. And everybody was dreaming big and working together. And you had these amazing artists coming. So that was one thing that I learned there. But another thing that I learned was that, you know, if the people listening have musicians among their friends or family, they know that you need to do a lot besides just playing, right? You need to promote your music, you need to get it out there, you need
Starting point is 00:08:25 to present it online and those kinds of things. So a lot of things that I learned there. When I started, it's very similar to starting an open source project, but instead of code you're shipping MP3 files, right? But the mechanics are very similar. So that was something I learned. And not to go too deep into that, but I have a strong belief in this, and this is from a futurist author, Bruce Sterling. He has this talk, and I don't know exactly what the name is, but if you Google this you'll probably find it, where he says: if you want to know what the impact is of technology on society, you need to look at
Starting point is 00:09:08 what the technology is doing to musicians. Because it will always happen to the music industry first. And that is something, I mean, you can fill a whole episode with that discussion, but I think that's true. Because it's,
Starting point is 00:09:24 you know, the industry is, you know, not a very strong industry, right? People really do this for the love of making art, but the technology plays a tremendously important role in that.
Starting point is 00:09:38 So those kinds of things that I learned there, I notice now, growing older, building the business, like, hey, I actually learned that in my time at Berklee. Man, that's fascinating.
Starting point is 00:09:48 Okay, we should probably do a follow-up episode just because, I mean, music is hard to monetize. You obviously saw that early at Berklee. Yet there are people with a lot of money who figure out how to sort of exploit, you know, things that people are passionate about creating, which sounds, you know, ironically identical to the venture SaaS industry, which is interesting. Okay, so second follow-up question, then I want to dig into your current company and vector databases. But one thing that I've noticed, I mean, even on the podcast, but throughout my career, is that people who study music tend to think with both sides of their brain in a way that's unique.
Starting point is 00:10:35 And I'm not saying that, like, I don't have any science to back that up. It's just something that I've noticed enough to where, whenever someone's doing software and they are a musician, I'm immediately interested, because I've noticed that pattern over the years. You said you discovered machine learning. What I'd love to know is, did or does your study of music influence the way you think about machine learning? Because there is this relationship of structure and pattern within music that is required to create a foundation, but there's sort of unlimited combinations of notes and melody that you can use to create things that are new, right? I mean, they say, you know,
Starting point is 00:11:23 maybe we've only discovered five or 10% of the possible songs that are ever, you know, able to be created since the history of the world, which is actually very interesting in terms of machine learning. It feels the same way, right? Is that relationship there for you? Yes.
Starting point is 00:11:39 And that relationship mostly sits in my mind. And let me explain what I mean with that. So when I was very young, around the same time that I was 15, right, I got interested in things like, you know, the Red Hot Chili Peppers, that kind of stuff. So you had like guitar solos, right? And I was interested in that, in what was happening there. And then a teacher in high school was like, well, if it's purely that what you like, you might want to listen to the later music of Miles Davis, because instead of 30 seconds of guitar, you have six minutes. Yeah.
Starting point is 00:12:09 And if you go into that, you double click onto that, you go like, hey, let's see what John Coltrane was doing. And if you double click on that, you get to the classics. So you look at Bach, Stravinsky, that kind of stuff, and so on and so forth. So that is how you study these kinds of things. And so it gets more and more complex. And there's an aesthetic in that complexity. And every time that I was working on it and I figured something out, I could see it.
Starting point is 00:12:34 I have these structures that I just can visualize, and then I see it. And that is the exact same thing that's happening to me with machine learning. So when I started with Weaviate and with these early models, the moment that I figured out what they were doing, I could see it. And everything that I currently work on is like, I hate it if there's something happening that I don't understand, but if I say I understand it, I can visualize it. And that's the exact same mechanic in my mind. So to give
Starting point is 00:13:06 you a third example, as a hobby I'm very interested in language philosophy. I once even gave a TED talk, or TEDx talk, about software and language and those kinds of things. If I read these kinds of books, like, I don't know, the work of Wittgenstein, that kind of stuff, then you're reading it, you're not getting it, and then at some point I can see it. And when I can see it, I understand it. So my point is that the mechanism in my mind is very similar. Yep, super interesting. Okay, I love it. I mean, we definitely, I mean, Kostas, we should do an episode just on that, because that is, I mean, that is so fun to talk about. But okay, let's get down to business.
Starting point is 00:13:51 Weaviate. Can you tell, so this is a company you founded several years ago, you've been working on it for some time. What is it? What does it do? And then I want to dig into databases in general from there and learn from you about sort of the progression of databases on the spectrum of complexity. Yeah, so I think this is best explained by giving a little bit of history of what I went through. So I was, as a consultant, I was working at a big publishing firm, and they hired me to work on something. And they were looking at new types of products, what they could do with scientific papers
Starting point is 00:14:30 and that kind of stuff. And I was introduced to GloVe, which is a model that produces embeddings for single words, so word embeddings. And for people familiar with this, there's like this famous calculation where they do king minus man plus woman, and then in vector space it moves to the word queen. Yeah. And I was immediately, I was like, oh, this is exciting, this is cool.
Starting point is 00:15:01 So I started to play around with this, and I got this very simple idea, very simple. I said, rather than doing this with individual words, if I take a sentence or a paragraph, and I take all these words from the paragraph, and I calculate a centroid, so the center of those vector embeddings, then I can represent that paragraph in vector space. So now I can do semantic search over those kind of...
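A rough sketch of the centroid idea Bob describes, in Python. The four-dimensional word vectors and the helper names are made up for illustration; real GloVe embeddings have far more dimensions.

```python
import numpy as np

# Toy word embeddings (real GloVe vectors have 50+ dimensions).
word_vectors = {
    "the": np.array([0.1, 0.0, 0.2, 0.1]),
    "cat": np.array([0.9, 0.3, 0.1, 0.7]),
    "sat": np.array([0.2, 0.8, 0.4, 0.1]),
}

def paragraph_vector(words):
    # Centroid: the mean of the word vectors stands in for the whole paragraph.
    return np.mean([word_vectors[w] for w in words if w in word_vectors], axis=0)

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

doc = paragraph_vector(["the", "cat", "sat"])
query = paragraph_vector(["cat"])
print(cosine_similarity(doc, query))  # higher means more semantically similar
```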
Starting point is 00:15:27 That was a very early on, that was an idea. And I wasn't sure how to build it yet, how to structure it yet, those kind of things. And so something very logical that you started to do is that you take a database, right? And we experiment with different databases. You start to, you try to store these embeddings in there. Storing is not the problem, retrieving is. So then you get in a situation, how are we going to retrieve that stuff? And so we were experimenting with the very early so-called approximate nearest neighbor
Starting point is 00:15:56 algorithms. We can double click on that if you like, but I was excited to experiment with it. And then my co-founder, Etienne, started to play a very important role, because he said, well, actually for search back then, like traditional search, the library that's kind of used is Lucene. So Lucene sits in Solr, it sits in
Starting point is 00:16:15 Elasticsearch, it sits in MongoDB, it sits in all these kinds of solutions. And the way that Lucene shards on scale is suboptimal for sharding the approximate nearest neighbor index. So then we were like, hey, wait a second. Now we have this semantic search use case. And we know that there's like a mode for a new database.
Starting point is 00:16:39 Let's build a new database. And that's how the idea was born to start to work on Weaviate. And so in Weaviate currently, you store a data object like you would do in any NoSQL database, and you can add a vector embedding to it, right? And that is how that was born and how all these things came together. Back then, we did not use the term vector database, vector search engine, anything like that. We were just looking for ways to position it.
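To make the "NoSQL object plus vector embedding" idea concrete, here is a minimal in-memory stand-in. This is not the actual Weaviate client API, just a hypothetical sketch of the data model.

```python
import numpy as np

class TinyVectorStore:
    """Toy stand-in for a vector database: each record is an object plus a vector."""
    def __init__(self):
        self.objects, self.vectors = [], []

    def add(self, properties, vector):
        self.objects.append(properties)
        self.vectors.append(np.asarray(vector, dtype=float))

    def nearest(self, query_vector, k=1):
        q = np.asarray(query_vector, dtype=float)
        sims = [float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
                for v in self.vectors]
        best = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
        return [self.objects[i] for i in best]

store = TinyVectorStore()
store.add({"title": "Paper about embeddings"}, [0.9, 0.1, 0.3])
store.add({"title": "Recipe for pancakes"}, [0.1, 0.8, 0.2])
print(store.nearest([0.85, 0.15, 0.25]))  # returns the embeddings paper
```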
Starting point is 00:17:09 What did you call it back then? Just out of curiosity, what did you call it? Because I mean, vector database and embeddings have become really popular terms with the hype in the last year around LLMs. But what did you call it back then? Because you were doing things that were sort of primitives for Knowledge Graph. Yeah, because you had these data objects floating around the vector space, and then you could make a link between them.
Starting point is 00:17:37 And it's like, okay, so that was... But that caused a lot of confusion, because knowledge graph was kind of adopted by the semantic web people. And I remember that I went to semantic web conferences presenting Weaviate and people didn't get the concept of using embeddings.
Starting point is 00:17:54 They were like, so you have a keyword list or something, or how do you map? I said, no, I'm using the representations from the ML models. By the way, this is before the Transformers paper, etc. was released, right? So it took a long time for people to get an understanding of what the data type of embeddings is,
Starting point is 00:18:16 what you could do with them. I've been talking about this stuff for a long time. Probably on YouTube, you can find some of these old talks. Don't try to find them, but they're probably floating around somewhere. We won't put them in the show notes, yeah. So, okay, so one more question. I'm sorry, again, I can't
Starting point is 00:18:32 help myself, but before we get into databases, how did you... That's a lot of stamina, right? I mean, you've been doing Weaviate for, what, like, four or five years now? I mean, quite some time. And so, before embeddings were cool, before the language around vector was cool.
Starting point is 00:18:51 And obviously you've never had a boss, so you're of course paving your own way, but that's a lot of endurance it seems like to sort of present at a conference where people don't get it. How did you deal with that? You know, I mean, you obviously believe in it enough to, to continue on, but that's hard. Yeah, I, this is, that is an, that's a great question. I'm, so one of the things, so this
Starting point is 00:19:18 is maybe also a relation back to what we talked about with art: I can really fall in love with something, right? And then I just put my, you know, my intellectual claws in it and I just don't let go. Yeah. And I'm also, I have a very, I'm blessed with a very wonderful life, and I meet amazing people, and I'm able to build an amazing team. And I'm just, that was not planned. I'm just, it's like,
Starting point is 00:19:49 like, you know, I just go with the flow. Yeah. So I never saw that as a, as an issue. It's just, I'm just enjoying the ride and it's just amazing.
Starting point is 00:19:59 It's just, so I appreciate that you say that, but that's not how it feels. It just feels like I'm just, yeah, going with the flow. It's interesting, if we tie it back to music, you hear musicians talk about, you know, writing a certain song, right? And they're like, well, how did you do that, you know? And a lot of times you'll hear musicians
Starting point is 00:20:25 describe it as like, well, it wasn't like I set out to write a catchy song. I just had something inside of me that I needed to express. And so this is just a process of me expressing what's inside of me, right? And it happens to be that it's a song, and it happens to be that it resonated with a lot of other people, or whatever, but that sounds really similar to what you're talking about. Yes. And one of the things that I do right now, so I'm in a very fortunate position that I can talk to a lot of young people, right? Who are studying or doing experiments with things. And one of the things that I'm trying to get across to them is like,
Starting point is 00:21:05 whatever you do, make something, create something. And if you're now a student and I, so for example, when I talk to you at the business school or something, because then if I talk about software, it's always very,
Starting point is 00:21:16 like a lot of people show up, because, you know, tech. So, and I said like, if you now work at a Starbucks, keep working at the Starbucks and try to build something. Don't get excited by these big companies, you know, offering you big...
Starting point is 00:21:33 Try to use this time you'll have to make something. And I don't care. So the two talents, I guess, that I have are in working with software and in music. But if it's cooking, then cook, right? If it's writing, write. If it's branding, start a design, whatever, right? But make something. And because life is so much fun,
Starting point is 00:22:00 if you make stuff, whatever stuff you make. And that is what I've been doing always. I've been making stuff and that's a, and what I'm doing now is just the company building. That's a form of making that gives a lot of joy. Right. So that's a, so yeah. So that.
Starting point is 00:22:21 I love it. I mean, gosh, what a good, we could keep going down the path, but let's, okay, so I want to get technical. And can we talk about OLAP databases? So let's start with the, so Weaviate is a form of database, and we can talk about the specifics. But I want to go back to basics. Most people who are interacting with a database are interacting with an OLTP or OLAP database, right? And mostly OLAP, if you're doing any sort of SQL-based workflow where you're building analytics or anything,
Starting point is 00:23:00 right? And so that runs the world, right? I mean, any KPIs at any company of any size, it's just, we're talking about OLAP workflows, right? Yes. Okay. Weaviate is sort of on a, like, if we think about the spectrum of complexity of databases and use cases, Weaviate to me, as I understand it, is much further along the spectrum of complexity. And I think the step in between, and you correct me if I'm wrong because you're the expert, but a lot of teams, when they're working in OLAP and they have 10,000 lines of SQL and it's getting really crazy, a lot of times they'll say, like, okay, maybe we need to do like graph, which will help us solve some of the relationships that we're trying to represent in
Starting point is 00:23:52 SQL. Okay, great. So they like move to graph. So I think, you know, probably a lot of people are familiar with a graph database. And then we have a vector database. And so we actually haven't on the show, I think, had a specific discussion about vector databases. And so can you just help us understand when you go from sort of OLAP to graph to vector use cases, and paint the spectrum of database use cases for us in terms of complexity. Yeah. So I can offer a way for people to think about this. Right.
Starting point is 00:24:37 So, so, and, or I think about it. So if you, if you envision it, like, let's say you have, like, a big circle. And in the center of the circle, you have databases like Postgres, MySQL, those kind of things. And these kind of databases are, you could say, catch-all databases, right? So you can do everything with them. So you can make graph connections, like, in the form of, like, joins. You can store vector embeddings nowadays in them. You can store strings in them and those kinds of things. And that's great.
Starting point is 00:25:13 For a lot of use cases, these databases are fantastic. And the people designing these databases, they make trade-offs in the decisions they make to build these kinds of databases to support all these cases. But that means that there is a limit to that, right? There's a limit. So let's say graph as an example, I think, because graph is a great example. If it turns out that your data set is very reliant on these graph connections, then you
Starting point is 00:25:41 run into an issue at some point. And we all know, or well, maybe not everybody, but there's this term like join hell, right? So at some point. Sure. Yeah. So then you say, well, actually, we run into join hell. So in that circle in the center, we have these core SQL databases. So we move a little bit outside of that.
Starting point is 00:26:01 Right. So we start to move into the NoSQL space. It's not SQL anymore. So we're moving out. We're saying, we've got to design something from the ground up that is very good at dealing with these graph structures. So if you don't have these graph structures, or just a tiny graph, then it's fine. Stay in the center. But if you want to do something more towards the fringes. Sure. And what we often see is that if you look at the data types you have in the center, a relation became graph databases. Timestamps became time series databases.
Starting point is 00:26:32 Yep. Searches became search engines. Yep. So the specific data types that you have in these databases, they kind of ask for a bigger use case. They kind of ask for their own category, basically. Sure. As you start to scale, databases emerge that solve these particular problems, for sure. Exactly. So what you start to see with vector databases is exactly the same thing.
Starting point is 00:27:11 I think, so you have a vector embedding, which is a data type in itself, right? And the uniqueness in these databases is not so much in storing them as much as in retrieving them fast. Now, one of the things, or a thing that we started to see, right, if you look back at history, is that that path, starting from the perspective of the SQL database and then going out to the fringes, is kind of skipped. So then we go and say, okay, we see this new data type, e.g. vector embedding. Let's just start in that category, right? So let's just create that category and work in that category. Because what starts to happen is that these databases,
Starting point is 00:27:47 hence everything being not SQL, NoSQL, is that you start to have different ways to interact with the database that are very well suited for that specific data type. You want to have different APIs with a time series database
Starting point is 00:28:02 than you might want to have with a vector database, for example. So that is how I visualize it. So yes, in that center, everything comes together in the center. But the moment you want to really double down on one of these data types in the SQL databases, you're probably better off with a purpose-built database, regardless if it's graph, time series, vector, whatever. Yeah, 100%. Okay, well, one more question for me, because I've been monopolizing here, because I know Kostas has a bunch of questions, but can you just define a vector database?
Starting point is 00:28:37 I mean, a graph database, I think, makes sense to a lot of people because creating relationships between nodes and edges in SQL is brutal at scale. And so a graph database is a very logical conclusion if you need to represent social relationships or something like that. So that's logical, I think. But what's a vector database? And what's sort of the graph thing? Let's say like that's social, right? You move from the center out of Postgres and you need to represent complex social relationships.
Starting point is 00:29:14 And so graph makes sense. What's the thing that pulls a vector out of the center? And can you describe the vector database? Yeah. So a vector database is, in essence, often a search engine in the majority of cases. It's a type of search engine where the vector embedding is the first-class citizen, right? So that is the first-class citizen in the database. So the way that the database shards, the way that you scale with it, all those kinds of things, these architectural decisions in building a scalable database, they go all the way back to looking at it from the perspective of the vector index,
Starting point is 00:29:56 right? That sits at the heart of it. So that's how I would define it. It's just a database where the vector is a first-class citizen. And then you have a UX element to that. So the way that developers interact with the database is tailored to that, to those types of use cases. Okay.
Starting point is 00:30:14 What's the difference between a vector database and something like Lucene, right? Because I'm old enough to experience, let's say, the introduction of inverted indexes. And suddenly we were like, oh my God, like we can have so, so much like fast retrieval of data. By the way, I think I'm revealing a little bit like what's going on here because Lucene is also about retrieval, right? It's how we can trade off upfront processing to be able to go and search very quickly the kind of unstructured data, right? And there's a lot of NLP work
Starting point is 00:30:57 that has been going on inside this library, right? But we have had this for many years, right? And we have products and businesses, actually, that have been built on top of that, right? We have Algolia, for example. So, and, okay, obviously another
Starting point is 00:31:15 company that came from the Netherlands, if I'm not wrong, which is Elastic, right? Yes. So there is something in the Netherlands about... What's that? It's a saying. Yeah. So what's the difference? And, by the way, the second question, like a follow-up question to that: do they compete? Do they complement each other?
Starting point is 00:31:38 Like, and how do you see like the business case there? This is, thank you, this is an excellent question, because I get this question a lot. So I'm happy that you're asking it, because now we can broadcast the insight. So, a little bit of a preamble before I go into the answer. So there are like three things that play roles here,
Starting point is 00:31:56 right? So one is search algorithms, regardless if that's for vector, like approximate nearest neighbor, or for keyword search, BM25, those kind of things. So we have the algorithms. Then we have libraries. And these libraries, well, a library contains one or more of these algorithms.
Starting point is 00:32:19 And then you have databases. And a database can be built around a library. It doesn't have to be. And what you try to do with the database is that you offer functionality that people expect from a database. So, for example, CRUD support, create, read, update, delete support. It can be backups. It can be storage. It can be, if it's transactional, you know, certain guarantees, et cetera, et cetera.
Starting point is 00:32:43 So those are three distinct different things. So Lucene is a collection, is a library of a collection of search algorithms, mostly tailored around keyword search. It's relatively, in the world of software, relatively old.
Starting point is 00:33:04 And I don't mean that in a negative way; it's an old library that has brought a lot of value to a lot of people. And there's also an equivalent actually for ML, and that's called Faiss. That's built by Facebook. So those are two libraries. And so what you started to see with Lucene was people saying, hey, we can take Lucene and we can turn it into a database. Add that layer of functionality around it that makes it a database.
Starting point is 00:33:34 Elastic, Solr. I believe Neo4j uses Lucene for sure. Et cetera, et cetera, et cetera. So people started to add that. So now a very logical question would be, okay, great. So now we have a new data type, vector embeddings. Let's add it to Lucene, right? That's a very logical thing.
Starting point is 00:33:54 People did that. So if you now see these databases that I just mentioned talk about Lucene, sorry, talk about vector search, they're often Lucene-based and they use the ANN algorithm that's in Lucene. So now the question becomes, so why, then why
Starting point is 00:34:10 new databases, right? So why just not leverage Lucene? And the answer has to do with if you use Lucene at the core
Starting point is 00:34:20 of your database for search, you're bound to the JVM and those kinds of things. So you shard the database in a specific way, right? And it turns out that the algorithm used to scale approximate nearest neighbor is suboptimal in Lucene.
Starting point is 00:34:41 You can even see this in the open source Lucene community that people are debating this till today. Some people disagree with Lucene doing this at all. So now all of a sudden, and that's why we thought, hey, we believe that there's room in the market to build something new because, of course, a production database needs to shard and replicate and what have you.
Starting point is 00:35:00 And that's where that's coming from. So what people will notice is that if they use a Lucene based search engine or database for really heavy vector processing work, they will
Starting point is 00:35:18 run into scalability problems and that is why you see these new databases and you had a second question but I forgot what the second question was sorry about that. I also forgot to be honest but I remembered it I think so might be a new one but it's okay
Starting point is 00:35:34 it's fine. Yeah if they compete actually like or if you see like how you see the future right because if you think about it like okay let's take solar and I'll keep like elastic a little bit like outside of the equation here for a very particular reason, which is business oriented.
Starting point is 00:35:51 I would like to come back to that. But if we talk primarily about information retrieval and search over unstructured text, right? Maybe someone can argue, okay, why do we need both, right? Or do we need both, is the question. So the question here is, is there a future for these kinds of inverted indexing techniques? Is this going to be, let's say, abandoned because vector space is just a better representation and adds all the semantics? Or is there a reason to keep them both there, right?
Starting point is 00:36:29 Even if they are not in the same system, let's say they're just different systems. Let's not talk about the systems here. It's more about how complementary of a technology they are in terms of the use cases that they serve when we use them as products, right? Yes. So this is an excellent question. I can actually marry the two things together. So if you kept Elastic, if I may, I'll bring that back in,
Starting point is 00:36:55 right, to answer your question. So the thing is, there's one thing, for example, that we know when it comes to NLP, because of course we can do ML not only with text but also with other things, but purely when it comes to text, we know actually that mixing the two indices works best. So a hybrid search yields the best results, where you mix the dense and the sparse index together. So the embedding from the model and then, for example, a BM25 index.
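A minimal sketch of what hybrid fusion can look like: a weighted mix of a dense vector score and a sparse BM25-style keyword score. The alpha weighting and the toy scores are assumptions for illustration, not Weaviate's actual fusion algorithm.

```python
def hybrid_score(dense_score, sparse_score, alpha=0.5):
    # alpha = 1.0 -> pure vector search, alpha = 0.0 -> pure keyword (BM25) search.
    return alpha * dense_score + (1 - alpha) * sparse_score

# Toy example: two documents scored by both retrievers (scores assumed normalized to 0..1).
docs = {
    "doc_a": {"dense": 0.82, "sparse": 0.10},  # semantically close, few exact keyword hits
    "doc_b": {"dense": 0.55, "sparse": 0.90},  # exact keyword match, weaker semantics
}
ranked = sorted(docs, key=lambda d: hybrid_score(docs[d]["dense"], docs[d]["sparse"]), reverse=True)
print(ranked)  # ['doc_b', 'doc_a'] with alpha = 0.5
```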
Starting point is 00:37:25 That works best. And it turns out, and the word scale plays an important role here, because especially at scale, that starts to play a role. Now, what's interesting though, and now I'm taking off the tech hat and I'm putting on my product slash business hat: for everybody listening, if they have the ambition to build their own database company or those kinds of things, that's like a little trick, or like a thing that I can share.
Starting point is 00:37:58 Right. And with the exception of the SQL databases that we talked about, the MySQLs and Postgreses of this world, but everything around that, what you start to see is that people start to position these databases at something that they are uniquely good at, right? And so if you take Elastic as an example, right. So yes, the database is the search engine. And yes, that's what a lot of people use and it makes a lot of people happy. Right. But actually, what they built as a business is focused on observability and cybersecurity.
Starting point is 00:38:41 So what I'm doing in my role is that I'm asking the question, if I take the vector database, what's that for us? And it turned out, and the release of ChatGPT played a very important role in that, we've learned what that unique thing is for the vector database. And that comes in the form of something called retrieval augmented generation. We can talk about that if you like, but that is to answer your question. So that is the big difference, right? So at some point it goes out of the realm of purely that architectural decision. So like now this database exists and it's structured in a certain way with a certain
Starting point is 00:39:21 architecture. What kind of use cases does that enable that are unique for this database? And that is how these companies start to grow around the database. So yes, Elastic is a search engine, but maybe a business buyer might say, well, for us, it's just an observability tool, right? And it plays a tremendously important role in that. So, and vector databases start to gravitate
Starting point is 00:39:48 in another direction, right? More the generative AI direction where they just play such a crucial role. Actually, you did like an excellent job in answering my next question. That's why I wanted like to keep Elastic outside like for a follow-up question, exactly for that reason, because I would like to say exactly what you said,
Starting point is 00:40:09 that Elastic ended up as something that is a product for observability, used primarily for that. And that's why things actually get really interesting when you start building businesses and products on top of the technology itself. And my follow-up question would be exactly that. Like, where do you see these embedding databases or vector databases, whatever you want to call them, like, going towards, right? What's the equivalent of observability for these systems?
Starting point is 00:40:44 And we will get to that. But before we get to that, I think I'd like to go a little bit back to high school, I would say, and talk a little bit about algebra, vectors, cosine similarity, and talk a little bit about the basics of how you retrieve information from these systems. And hopefully it will sound extremely
Starting point is 00:41:11 familiar, like to everyone. Before we go to the more, let's say like in the, into the indexing part, like the, the sophisticated algorithms that are approximated and all these things. But let's, I'd love to hear from you, like what is like the basic operation that actually happens. That's pretty much like everyone who went through high school probably knows about, right?
Starting point is 00:41:31 Which I find very fascinating, actually. Yeah, it's a, I've been, so it's funny that you asked this question because I've been looking for a lot of metaphors and those kinds of things to explain it. And it's always the question is like, how deep do you want to go? I think if I, and maybe this is interesting for the show notes. So Stephen Wolfram did a very interesting blog post where really in a very easy to understand language
Starting point is 00:41:57 goes really into, hey, how do you create an embedding? And he takes Wikipedia pages as an example. So he says, for example, if you have a sentence that finishes, the cat sat on the, then how do you get to the word mat, right? How do you get there? And then he explains that from, like, distances between words and sentences and those kinds of things. So I won't go into that, because then we just need another 30 minutes to get into it. But let me take as my starting point
Starting point is 00:42:30 that what the machine learning model does, and technically speaking, you don't need a machine learning model for it, it's just, otherwise you need to do it brute force and it's just going to take way too long, so you want to predict. And what you're predicting with the vector embedding is a geographical representation of these data objects close to each other. So, very simply put, if you think about a two-dimensional
Starting point is 00:42:58 you know sheet of paper and you have individual words, then probably the word banana and apple are going to be more closely together in that two-dimensional space than the word monkey, right? So, and maybe the word monkey sits closer to banana than it sits to apple, right? Because somewhere when it was training these texts, so it's like, you know, if you say like, you know, monkeys live there, they're there, and they like to eat bananas, blah, blah, blah. So the word banana in these sentences is more closely related to a monkey than apple, for example. But, you know, if you are at the, for example, Wikipedia page of fruit, they make, you know, examples of fruits, apples, bananas, et cetera. And so what we also learned was if we do that in two dimensions, we lose too much context. So we can't represent it in two dimensions. We
Starting point is 00:43:54 can't represent it in three dimensions, but it kind of starts to make sense from 90 dimensions and up. So the smallest representation back in the days from GloVe was 90 dimensions. But geographically speaking, it's the same. And the distance calculations that you use, so cosine similarity, Euclidean distance, those kinds of things, those are the same mathematical distance metrics that you would use in a two-dimensional or three-dimensional space, you just apply them on a multi-dimensional space. But conceptually, it's the exact same thing. And, so for example, Stephen Wolfram does a great example in his article where he says, if I want to brute force calculate this from the Wikipedia page for dog and cat, you know, that's kind of doable, but
Starting point is 00:44:51 then it doesn't really make sense. But if I want to do it for the whole of Wikipedia, I need a prediction model. And if I want to build GPT and I want to do it for the whole web, then it's impossible to do that brute force, right? So, I mean, technically, theoretically speaking, it's possible, but it's very impractical. So what the models do, the models predict where these words sit in the distance metric. And later that evolved from single words to sentences. So, for example, that's why you got things like Sentence-BERT, right? So that you did that for a full sentence, et cetera.
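The point that the same distance metrics carry over from two dimensions to hundreds can be shown directly; the vectors below are random stand-ins for real embeddings.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    return float(np.linalg.norm(a - b))

rng = np.random.default_rng(0)
for dims in (2, 3, 300):
    a, b = rng.normal(size=dims), rng.normal(size=dims)
    # The exact same formulas apply on a sheet of paper (2D) and in embedding space (300D).
    print(dims, round(cosine_similarity(a, b), 3), round(euclidean_distance(a, b), 3))
```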
Starting point is 00:45:26 And that's how that started to evolve. But in the end, there are two types of models. And the first type of model is one that generates a vector embedding. And we're not talking about the generative AI, which you get with ChatGPT, just talking about a model that generates these vector embeddings. And it turns out you can represent text, images, audio, heat maps, what have you, in vector space. And then if you store them in a database, you can find similar items.
Starting point is 00:45:55 And that's how we do search. Awesome. So, okay. One way to do it is, let's say, you have like a bag of all these vectors there. You get a query, which is like, when we say query, it's say you have like a bag of all these vectors there, you get a query, which is like, when we say query, it's not like a SQL query, it's just a textual question, like, that the human would write down, right?
Starting point is 00:46:14 Turning again into like this vector representation. And then you start finding similarities with a bag of, let's say, vectors that you have, right? And you can do it like, let's say, vectors that you have, right? And you can do it like, let's say, you can brute force that, right? Like you can go there, do like a cosine similarity across all of them and choose, let's say, I don't know, like the best one or the five best ones or like whatever, right? When, why don't we just do that for a retrieval?
Starting point is 00:46:42 Like, at what point should someone who builds an application start considering indexes, approximate algorithms for that, and in the end, a system, a DB system like Weaviate, that does that at scale?
Starting point is 00:47:00 Yeah, so that's an excellent question. So the way of doing that brute force, as you described, is a way people do it. To go even a step further, when I was starting to play with this, that's how I did it, right? I did it brute force. So what you do is this. So let's say that you store the vector embedding again for, what did we have? Apple, banana, and monkey.
Starting point is 00:47:25 And you have a semantic search question where you say, like, you know, which animal lives in the jungle, right? Then you get a vector embedding. So, if you did brute force, you compare that to apple and it gives you a distance. Then you compare that to banana, it gives you a distance. And then you compare that with monkey and it gives you a distance. So now you have three distances and you basically can organize it based on the shortest distance. Yep. Great. So that's like a linear function, right? So if I now add like a fourth data
Starting point is 00:47:58 object, you know, it takes a little bit long. Fifth takes a little bit long, et cetera, et cetera. So if you now have a database or like a serious project, and we're not having three data objects, but we have 100,000, a million, 10 million, we have users that have that in the billion scale. So now imagine that you have a production use case for an e-commerce player, right? And you not only have like a billion data objects, but you also have multiple requests
Starting point is 00:48:26 at the same time. You don't want to do that brute force because then people can go on a holiday and when they come back, they get a search result, right? So I exaggerated a little bit, but that's what the brute force problem. And this is where the academic world started to help because they invented something called approximate nearest neighbors. So these things live in vector space. So you place this query ahead, the animal that lives in the jungle, and you look for
Starting point is 00:48:57 the nearest neighbors. And what that does, those algorithms, they are lightning fast. They're super fast in retrieving that information from vector space. You pay with something though. You pay with the A, with the approximation. So now with brute force, you can guarantee 100% accuracy. You can't do that with the approximate algorithm. So you pay with approximation, but what you get in return is that speed improvement. And now all of a sudden we can build these production systems.
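A sketch of the brute-force search Bob walks through, with made-up embeddings for apple, banana and monkey; real systems replace this full scan with an approximate nearest neighbor index (for example HNSW) once the collection grows, trading a little accuracy for speed.

```python
import numpy as np

embeddings = {                       # toy vectors, not real model output
    "apple":  np.array([0.9, 0.1, 0.0]),
    "banana": np.array([0.8, 0.3, 0.1]),
    "monkey": np.array([0.2, 0.9, 0.7]),
}
query = np.array([0.1, 0.8, 0.8])    # "which animal lives in the jungle?"

def cosine_distance(a, b):
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Brute force: one distance per stored object, i.e. cost grows linearly with the data set.
ranked = sorted(embeddings, key=lambda w: cosine_distance(query, embeddings[w]))
print(ranked)  # ['monkey', 'banana', 'apple'] - shortest distance first
```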
Starting point is 00:49:28 And what the database does for you is make sure that you can run that production system, rather than just adopting or using the algorithm. Awesome. We can talk a lot about that stuff, but I want to spend some time on the use cases and more about where these embedding databases are going forward. What kinds of categories, product categories, are starting to form there? You mentioned something about RAG and OpenAI and LLMs, so tell us a little bit more about that. How is Weaviate, or any system like Weaviate, adding value to a user, and what's the use case, right? Yeah. So again I'll take off my tech hat and put on my business hat. So the way you can look at this is like
Starting point is 00:50:35 if you build a new type of database based on a new data type, e.g. a vector database, you can look at the use cases. So we are open source, right? So people start building stuff, and you just look at what these people build and you ask them, you know, what are you building? And then what you try to do with the answers that people give you, you basically put them in a box, right?
Starting point is 00:50:55 So you put them in the box. Are they building a displacement service or are they doing something completely new? Yep. And the displacement service for us is, as I like to call it, better search and better recommendation. So people were like, you know, we're not happy with the keyword search result that we're getting. We're going to adopt these machine
Starting point is 00:51:13 learning models to better search. So that is, for example, why we got functionality like hybrid search and those kind of things, because it helps people to do better search. But then all of a sudden, we, and this is really in the, as I like to call it, post-GPT era, people started to do something.
Starting point is 00:51:36 And what they started to do was that they said, well, we love the whole generative model thing, like, you know, GPT, ChatGPT, et cetera. So the generative models, but also the open source models, the Cohere models, nowadays the Anthropic models, whatever, right? But they said, like, we have one problem. We want to do that with our own data. And so what started to emerge was that people used the vector database to run a semantic search query, return these data objects, and inject them into the prompt. And that process is called retrieval augmented generation.
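A sketch of that "primitive" retrieval augmented generation flow: search the vector database, paste the hits into the prompt, send the prompt to a generative model. The vector_search and generate functions are hypothetical stand-ins, not a real Weaviate or OpenAI client.

```python
def vector_search(question, k=2):
    # Stand-in for the vector database query: in reality you would embed the
    # question and return its nearest neighbors from your own data.
    corpus = [
        "Weaviate stores each object together with its vector embedding.",
        "Hybrid search combines BM25 keyword scores with vector similarity.",
        "Our support hours are 9 to 5 CET.",
    ]
    return corpus[:k]

def generate(prompt):
    # Stand-in for the generative model call (GPT, an open source model, etc.).
    return f"[model completion for a {len(prompt)}-character prompt]"

def answer_with_rag(question):
    context = "\n".join(f"- {c}" for c in vector_search(question))
    prompt = (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n\nQuestion: {question}\n"
    )
    return generate(prompt)

print(answer_with_rag("How does Weaviate store data?"))
```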
Starting point is 00:52:16 So what I did now is that now you can double-click on that again, right? And you can say, okay, so, you know, but that's quite primitive, right? It's a primitive way of, it's nice, it works, it makes people very happy, but it's a primitive way of injecting that into the prompt. And now you see there's a lot of research happening and not research as in like, we need to wait two more years before we have it.
Starting point is 00:52:38 No, it's released now and you can see the first signs of that already. Where research has said, well, let's not do that. Let's marry the vector store that's in the vector database with the retrieval in the model. So the model knows that it needs to do retrieval, and it gets back a bunch of vectors, and it's able to generate something based on that. So now from a business perspective, you have a very unique use case that not only uniquely positions the vector database, it also solves the user problems of data privacy, not having to fine-tune your model, and explainability. So the model generated X, and that came from document Y in the database. So it ticks all these kinds of boxes.
Starting point is 00:53:27 So now zooming back out, now I have my displacement services, too, which are great. But now there's also this new use case that we see, that people use the vector database to do every nice, cool Gen AI thing with their own data. Okay, so how does this work? I mean, you mentioned that, okay, you retrieve some data, you inject them into the prompt, right? That's straightforward. You mentioned also that there's research that goes beyond that. Tell us a little bit more about that. That sounds quite fascinating. What happens there? The user is actually questioning the model directly and the model goes and retrieves the data from the index, from the vector database?
Starting point is 00:54:28 Is the user aware of that? Like, how does this work? So, first of all, the user is not aware of this, right? So, this is a great thing, right? So, that is, by the way, something, on a quick side note, what I believe. I think we should not make it too complex
Starting point is 00:54:44 for our users that they need to create their embeddings. We need to help them to do that. And especially for the RAG use cases, we just need to do that for them. So the injection in the prompt is a very, I call it primitive RAG, right? It's a very straightforward primitive. But, and this is something I could point people to that they might find interesting.
Starting point is 00:55:07 If you go into Hugging Face, in the Transformers library, there's a folder in there that's called RAG. And there they have a little bit more of what I call sophisticated RAG. So they use two models, one model to create a vector embedding to store data in the vector space and retrieve it. And another model where you feed in the vector embeddings and tokens come out on the other end to generate an answer.
Starting point is 00:55:37 So the more efficiently you want to marry them, as we like to say, you want to weave them together, right? The model and the database, rather than having them as two separate entities. But that means what you can do is, for example, you can create smaller ML models. Because now you don't have to store all the knowledge in the model. You just need to have its language understanding. And the model just needs to know, I now need to retrieve something from the database. You get to real-time use cases, because these databases can update very fast and you can just keep doing vector search over them. So the closer we can
Starting point is 00:56:09 bring them together and marry them together, the more efficiently they work. So now you see, for example, that out of the models don't come vector embeddings anymore, but like binary representations of these embeddings and those kinds of things. And if the database can eat that information and provide an answer, then the UX tremendously increases, right? Because you just marry the two things together. Yeah, 100%. I mean, it's like, I guess we have the UX, like the user experience, the developer experience,
Starting point is 00:56:41 and probably we start to have, like, the model experience to work on. Because from what I hear, we are starting to build databases that are going to be primarily interfacing with the models, not with humans. Yes, exactly. And this is something, so I'm happy that you bring this up, because this is like a second use case that we're working on. And ironically, we kind of discovered that one before we had the whole RAG use case.
Starting point is 00:57:09 And my colleague Connor wrote a beautiful blog post about that on the Weaviate blog, about what we call generative feedback loops. So what you now start to do is you have a bunch of data in your database, in your vector database, that interacts with a generative model. But everything we've just been discussing until now is like a flow, right? So you have a user query, the model processes that, knows that it needs to get something from the database, injects it into the generative model, something comes out. So we go from left to right. But there's no reason why you can't feed that back into your database. So the use case that Connor is describing in the blog post is that he has Airbnb data that has, like, the name of the host, the price, the location, but not a description. Yep. So basically what he does is that he says, okay,
Starting point is 00:57:59 I use the RAG use case to, okay, show me all listings without a description, generate a description for those listings, for an elderly couple traveling, a younger couple traveling, and store that with a vector embedding back into the database. And now, all of a sudden, the database starts to populate itself. You can use it for the database to clean itself, and those kinds of things.
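A rough sketch of that generative feedback loop is shown below. Every helper name here (fetch_listings_without_descriptions, generate_description, embed, store_with_vector) is a hypothetical placeholder standing in for your database client and your LLM client, not Weaviate's actual API, and the "database" is just an in-memory dict so the loop runs on its own.

```python
# Sketch of a generative feedback loop in the spirit of the Airbnb example.
# All helper names are hypothetical placeholders; the "database" is a dict.

def fetch_listings_without_descriptions(db: dict) -> list[dict]:
    """Find listing objects that have no generated descriptions yet."""
    return [obj for obj in db["listings"] if not obj.get("descriptions")]

def generate_description(listing: dict, audience: str) -> str:
    """Stand-in for an LLM call that writes a description for a given audience."""
    return (f"Hosted by {listing['host']} in {listing['location']} "
            f"for {listing['price']}/night, ideal for {audience}.")

def embed(text: str) -> list[float]:
    """Stand-in for an embedding model call (returns a placeholder vector)."""
    return [float(len(text)), float(sum(map(ord, text)) % 1000)]

def store_with_vector(listing: dict, audience: str, text: str, vector: list[float]) -> None:
    """Write the generated description and its vector back onto the listing."""
    listing.setdefault("descriptions", []).append(
        {"audience": audience, "text": text, "vector": vector}
    )

db = {"listings": [{"host": "Anna", "price": 120, "location": "Amsterdam"}]}

# The feedback loop: retrieve objects that need content, generate it,
# embed it, and store it back so it becomes searchable like any other data.
for listing in fetch_listings_without_descriptions(db):
    for audience in ("an elderly couple traveling", "a younger couple traveling"):
        text = generate_description(listing, audience)
        store_with_vector(listing, audience, text, embed(text))

print(db["listings"][0]["descriptions"][0]["text"])
```

The point is the direction of the arrows: the model writes back into the database, so the next query can retrieve content the model itself produced.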
Starting point is 00:58:20 So we now not only have human interaction with the database, but also model interaction with the database. And that's the second new use case that I'm extremely excited about. Yeah, okay, we are close, I mean, we are at the end here, so I'd like to have you back to talk more about that stuff, but there is one question that I can't not ask. And I want you to put your business hat back on again, right? As the founder of Weaviate, right, building this database and entering this new era where, let's say, Weaviate itself is not going to be exposed to the user, right?
Starting point is 00:59:05 But to the model. And it sounds like a complement problem here. Like, which one is the complement, as a product, between the LLM and the vector database? So how do you see this playing out from a business viability perspective, right? Is this going to be a feature of the LLMs? Is this going to be a category that can stand on its own?
Starting point is 00:59:33 What's your take on that? So this is an excellent question. And as you can imagine, in the role that I have in this company, it's one that I think about literally daily. So let me try to answer it by saying what I think it's not. I don't think that pure, standalone vector search, so looking at the traditional plays, just the vector search use cases, I don't think that's going to be the answer, right? So yes, people want that. Yes, there will be some market for that.
Starting point is 01:00:10 But I think that indeed marrying the two together, or again, weaving the two together, that is where the big opportunity lies. Now, how users will interact with it, how they will get access to that, what path they will take to that, we do not know, right? So by the time this episode airs, one of the things you will see is that we have
Starting point is 01:00:33 this combined offering with AWS, right? So you use the models in SageMaker, the Weaviate database, and it intertwines. Yeah. So people press a button and it all comes together. But we're going to learn if people will take that path through the models or through the database,
Starting point is 01:00:52 but they need both of them. So hopefully in a couple of weeks or a couple of months, I'll have the answer. But this is all very new and very fresh. But I do think that for us, at least,
Starting point is 01:01:05 that's the big new step. So how do you, as a vector database company, stay two steps ahead of what's happening in the world? And I think this is the answer. Yep, yep. Awesome. Okay, Eric, unfortunately, I have to give the microphone back to you.
Starting point is 01:01:22 Well, on yours. Yes, I guess. You know, it's funny. We sort of fight over halves of the show. Okay. This has been amazing, but Bob, I actually want to end on a very simple question. When you've had a really long week and you don't want to think about vector databases or embeddings or business, and you put a record on the record player, what's your go-to?
Starting point is 01:01:58 Like, what are the top couple of records that you put on the turntable? So I am a... Recently, to relax, I've been listening a lot to the solo concerts from Keith Jarrett. So I love that. It goes back to what we discussed. It's like, for people who don't know him,
Starting point is 01:02:23 he gives a one-hour concert, he sits behind the grand piano, and he just starts improvising, and every night is different. And to go on that kind of a journey, that is something that I'm listening to a lot. I also like a lot of what's coming out of the scene in LA now, with people like Thundercat, etc. I like that a lot. So those kinds of things. So I would answer with music that takes me on that journey, where I can take my mind off things. So that's what I'm currently listening to. Love it. All right, well, Bob, it's been so wonderful to have you on the show. We've learned so much, and I think we need
Starting point is 01:03:03 to get another one on the books, because obviously we've gone over time, which, sorry, Brooks, but we just had so much to cover. So Bob, yeah, thanks for giving us your time, and we'll have you back on soon. Thank you so much. And I would love to join you. Thank you. Kostas, I think one of the things that is... Actually, my takeaway is in many ways a question for you. We've had some really prestigious academic personas on the show who have done really significant things. I think about the Materialize team. I mean, there's just some
Starting point is 01:03:48 people who have done some really cool things. And what was interesting about Bob is that he studied music, number one, but he also drew a lot of influence from academia, but he's self-taught. And he's building a
Starting point is 01:04:04 vector database and dealing with embeddings, which is really interesting to me. And so I guess my takeaway question for you is, how do you think about that? Because you, of course, studied a lot of complex engineering stuff in school. But it's amazing to me that he can sort of study music and apply some of those concepts to very deep engineering concepts and mathematical concepts to produce an actual product without formal training. I mean, that's pretty amazing. So I don't know. I'm thinking about that, but I think you're the best person to help me understand that.
Starting point is 01:04:49 Yeah. I think if you ask me, we probably need more people from music going into tech and business. I don't find it that strange that music studies helped him so much, because at the core of, let's say, music itself, there's creativity, right? People who play, let's say, an instrument, or who even try to have a career around that, they have a very strong need to express and create. They are creators, right? Like the definition of a creator, right? And it's a definition of a creator who is doing, I'll say, a mental
Starting point is 01:05:40 creation, right? Like, music always comes out of something that's in your mind, right? Yeah. So I think there are many similarities with even going and writing code, in the end, right? You start from something very abstract, something that, yeah, can be represented with math or whatever, right? That's also true for music. But at the same time, okay, and that's something we talked about with him, he learned a lot of things about business, right? Because music, like, if you want to survive in there as an industry, it's brutal.
Starting point is 01:06:14 Like, you have to expose yourself. Like, it's the definition of exposing yourself and getting rejected, right? So I think that, for the people who are there, there are many similarities in a way. The platform is different, let's say: it's not a keyboard and writing code, it's an instrument. But in the end, they have things in common that are very important, like creativity, and also liking to create something completely new and take it out there, like convince people that there's value in
Starting point is 01:06:51 that, right? So that's one thing. The other thing is that, okay, he's, like, amazingly good at expressing himself and explaining some very deep and complex, let's say, concepts, which I think is very important for anything that has to do with all this craziness around AI. And that's one of the reasons that I would ask anyone to go and listen to him, because I think they are going to feel much more confident around AI as a technology and how it actually has
Starting point is 01:07:30 substance and value. And we also had some very interesting conversations about businesses, right? New business categories, like new product categories that are out there. So please listen to him. And I hope that we are going to have him back again and talk more about that, because we can spend hours with him for sure. I agree. Well, if you're interested in vector databases, embeddings, or sort of database
Starting point is 01:08:00 history in general, listen to the show, subscribe if you haven't, tell a friend, and of course, we will catch you on the next one. We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.
