The Data Stack Show - 159: What Is a Vector Database? Featuring Bob van Luijt of Weaviate
Episode Date: October 11, 2023

Highlights from this week's conversation include:
How music impacted Bob's data journey (3:16)
Music's relationship with creativity and innovation (11:38)
The genesis of Weaviate and the idea of vector databases (14:09)
The joy of creation (19:02)
OLAP databases (22:21)
The progression of complexity in databases (24:31)
Vector databases (29:23)
Scaling suboptimal algorithms (34:34)
The future of vector space representation (35:51)
Databases' role in different industries (39:14)
The brute force approach to discovery (45:57)
Retrieval augmented generation (51:26)
How the generative model interacts with the database (57:55)
Final thoughts and takeaways (1:03:20)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack, visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week, we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by Rudderstack, the CDP for developers.
You can learn more at rudderstack.com.
Welcome back to the Data Stack Show, Kostas.
Today we are talking with Bob from Weaviate, and there are so many interesting things about Bob.
He's never had a boss, number one. He started building websites when he was 15. He reminds me
a lot of you from that aspect. And then fast forward, he actually has built a company around vector databases and embeddings
that support AI type use cases.
I mean, what a journey.
That's pretty amazing.
I want to ask about databases in general on a spectrum, right?
So we've, actually, this is sort of a theme.
I almost feel like Brooks could write a thesis
on databases from the data section
because we've talked about every type of database.
But we haven't talked about vector databases.
We've talked, I mean, of course,
ad nauseam about sort of OLAP, OLTP workflows,
graph, we've had a number of graph databases,
but I think this is the first vector database.
And so I want to zero in on that
and put that in the context of sort of the spectrum
of like basic SQL OLAP, you know, to graph, to vector.
So that's what I'm going to ask.
But how about you?
I mean, there's so much here, so.
Yeah, I mean, okay, there are two main areas of questions.
I think the first one is to make clear to our audience what a vector database is, how
it relates to the rest of the databases out there and why we need a different one, another one.
And the other one, which is also I think super interesting, is what businesses you can build
around them. Why would someone build one? Why can it even be a sustainable business, and not just a feature of another database, right?
So I think we have the right person to talk both on the technical side of things,
but also have a very good and deep conversation
around the business of this type of databases.
All right, well, let's dig in and talk with Bob
and solve all these problems and more.
Yeah. One last thing. He had at least one boss, as he said, right? And that was his mother.
His mother. That's true.
The CRO, right?
The CRO. Okay. Let's talk with Bob and we'll hear about Bob's mom being the CRO of Bob's life.
Bob, welcome to the Data Stack Show. We're so excited to have you.
Well, thanks for having me.
All right. Well, you have a really interesting history. So I guess technically you've never had
a boss, which is fascinating.
So you've always been your own boss, which we may have a few questions about.
But tell us about how you got into data.
Where did you start?
And then what are you doing today?
Yeah, no, that's, that is correct.
If you, I mean, if you exclude my mother, I never had a boss.
So the mother is boss or like chief product officer for your life.
Exactly.
Exactly.
Yeah, exactly.
Well, more chief revenue officer, but that's different.
So, no, I'm a millennial, I'm born in '85.
And so that means that, you know what?
Yeah, there you go.
So when I was 15 years old,
like we had, you know, the internet,
just the internet connection at home and at school too.
So I just started to play around, you know,
build websites, that kind of stuff.
And at some point there were people,
I lived in a small village in the Netherlands
and people said, hey, you know, we need websites for stuff.
And I was like, you know, I can build you a website.
So then I got a gig to sell toothbrushes and lighters on a website.
I don't want to know anything about the security that the website had back then, but they basically,
they said, so, how much, you know, how much money do you want to make?
So I asked my dad, I said, how much money?
And my dad said, you just ask for like 500 bucks.
I was like, I'll do it for 500 bucks.
And the guy said, deal.
And I was like, whoa, that's a lot of money.
I'm rich.
Yeah, exactly.
Exactly.
So I went to the, as you do in Holland, I went to the Chamber of Commerce and I registered my company. And then you grow, right?
So you learn a lot.
And then I grew into being more of a software consultant. I did study in between. No CS or anything; I studied music, because it's another passion I have. But I always kept working in technology. I studied in Boston, I got a grant to study in Boston, and on the side I was writing software. It was like remote work avant la lettre: I was working in Boston for these Dutch companies. And that grew and grew, and then at some point I was introduced
to machine learning and that kind of changed
everything because then I was like, okay, I'm going to stop being like a freelance consultant.
I'm going to start a company.
So like a product company.
So for a long time already, I've been in software.
I love it.
Were you studying music in Boston?
Yes.
Okay.
At Berklee?
Yes.
Okay. And what's your instrument of choice? I mean, I know multi-talented people go to Berklee, but what's your instrument?
Yeah. So I studied bass guitar. And one of the funny things is that people that you now hear on the radio were at Berklee at the same time as I was.
And that is just super exciting.
But Berklee has been super important for me running a business now.
I was very young, well, I would not argue very young, it was really early 20s, when I flew into Boston. And a lot of things that I learned at Berklee are things that I'm using in building the business today. It was a very important lesson in my life, and if I could go back in history, I would probably do the same thing again. It was just a great, fantastic period in my life. So yeah, I'm proud of that.
Okay, so two questions,
because I want to talk about databases
and vector databases
in particular, but
two quick questions on Berklee, because I
can't help myself. So you said
that you learned a lot of lessons that
helped you on the business side.
Can you give us maybe like the top thing? Because you're an entrepreneur, right?
Yes.
What did you learn at Berklee that helped you from an entrepreneurial standpoint? And then I
have a follow-up question. So there are two things. This sounds a little bit cliché, but it's really true: if people talk about the American dream, I learned at Berklee what that means.
People were living 24-7.
They were living that lifestyle.
And everybody was dreaming big and working together.
And you had these amazing artists coming.
So that was one thing that I learned there. But another thing I learned: if the people listening have musicians among their friends or family, they know that you need to do a lot besides just playing, right? You need to promote your music, you need to get it out there, you need to present it online, those kinds of things. A lot of what I learned there, when I started, is that it's very similar to starting an open source project, but instead of code you're shipping MP3 files, right? The mechanics are very similar. And not to go too deep into that, but I have a strong belief in something from a futurist author, Bruce Sterling. He has this talk, and I don't know exactly what the name is, but if you Google it you'll probably find it, where he says: if you want to know what the impact of technology on society is, you need to look at what the technology is doing to musicians. Because it will always happen to the music industry first.
And that is something, I mean, you can fill a whole episode with that discussion, but I think that's true. Because it's not a very strong industry, right? People really do this for the love of making art, but technology plays a tremendously important role in it. Those are the kinds of things that I learned there, and now that I'm older, building the business, I go, hey, I actually learned that in my time at Berklee.
Man, that's fascinating.
Okay, we should probably do a follow-up episode just because, I mean, music is hard to monetize.
You obviously saw that early at Berklee.
Yet there are people with a lot of money who figure out how to sort of exploit, you know, things that people are
passionate about creating, which sounds, you know, ironically identical to the venture SaaS industry,
which is interesting. Okay, so second follow up question, then I want to dig into your current
company and vector databases. But one thing that I've noticed just, I mean, even on the podcast, but throughout my career
is that people who study music tend to think with both sides of their brain in a way that's
unique.
And I'm not saying that, like, I don't have any science to back that up.
It's just something that I've noticed enough to where I, whenever someone's doing software
and they are a musician, I'm immediately interested
because I've noticed that pattern over the years. You said you discovered machine learning.
What I'd love to know is, did or does your study of music influence the way you think about machine
learning? Because there is this relationship of structure and pattern within music that is required to
create a foundation, but there's sort of unlimited combinations of notes and everything and melody
that you can use to create things that are new, right? I mean, they say, you know,
maybe we've only discovered five or 10%
of the possible songs that are ever, you know,
able to be created since the history of the world,
which is actually very interesting
in terms of machine learning.
It feels the same way, right?
Is that relationship there for you?
Yes.
And that relationship mostly sits in my mind.
And let me explain what I mean with that.
So when I was very young, around the same time, when I was 15, right, I got interested in things like, you know, the Red Hot Chili Peppers, that kind of stuff. So you had guitar solos, right? And I was interested in that, in what was happening there. And then a teacher in high school was like, well, if that's purely what you like, you might want to listen to the later music of Miles Davis, because instead of 30 seconds of guitar solo, you have six minutes.
Yeah.
And if you go into that, you double click on that. You go like, hey, let's see what John Coltrane was doing. And if you double click on that, you get to the classics. So you look at Bach, Stravinsky, that kind of stuff, and so on and so forth.
So that is how you study these kind of things.
And so it gets more and more complex.
And there's an aesthetic in that complexity.
And every time that I was working on it and I figured something out, I could see it.
I have these structures that I just can visualize and then I see it.
And that is the exact same thing that's happening to me with machine learning.
So when I started with Weaviate and with these early models, the moment that I figured out what they were doing, I could see it. And everything that I currently work on is like that. I hate it if there's something happening that I don't understand, but if I say I understand it, I can visualize it.
And that's the exact same mechanic in my mind. So to give you a third example: as a hobby, I'm very interested in the philosophy of language. I once even gave a TED talk, or TEDx talk, about software and language and those kinds of things. If I read these kinds of books, like, I don't know, the work of Wittgenstein, that kind of stuff, then you're reading it and you're not getting it, and then at some point I can see it. And when I can see it, I understand it. So my point is that the mechanism in my mind is very similar.
Yep, super interesting. Okay, I love it. I mean, Kostas, we should do an episode just on that, because that is so fun to talk about. But okay, let's get down to business. Weaviate: can you tell us, so this is a company you founded several years ago, you've been working on it for some time. What is it? What does it do? And then I want to dig into databases in general from there, and learn from you about sort of the progression of databases on the spectrum of complexity.
Yeah, so I think this is best explained by giving a little bit of the history of what I went through. So as a consultant, I was working at a big publishing firm, and they hired me to work on something.
And they were looking at new types of products,
what they could do with scientific papers
and that kind of stuff.
And I was introduced to GloVe,
which is a model that produces embeddings for single words,
so word embeddings.
And people familiar with this,
there's this famous calculation where they do king minus man plus woman, and in vector space it moves to the word queen.
Yeah, and when I saw that I was immediately like, oh, this is exciting, this is cool.
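The arithmetic Bob describes can be sketched with toy vectors. A hedged note: the 3-dimensional numbers below are invented purely for illustration; real GloVe embeddings are learned from co-occurrence statistics and typically have 50-300 dimensions, but the mechanics are the same.

```python
import numpy as np

# Toy 3-dimensional "word embeddings" -- made-up values, NOT real GloVe
# vectors, chosen only so the famous analogy works out.
embeddings = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.1, 0.1]),
    "woman": np.array([0.5, 0.1, 0.9]),
    "queen": np.array([0.9, 0.8, 0.9]),
    "apple": np.array([0.1, 0.9, 0.4]),
}

def nearest(vector, exclude=()):
    """Return the vocabulary word whose embedding is closest by cosine similarity."""
    def cosine(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    candidates = {w: v for w, v in embeddings.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine(vector, candidates[w]))

# king - man + woman lands closest to queen in this toy space
result = nearest(
    embeddings["king"] - embeddings["man"] + embeddings["woman"],
    exclude=("king", "man", "woman"),
)
print(result)  # queen
```

The point is only that vector offsets can encode relations: subtracting "man" and adding "woman" moves "king" near "queen" under cosine similarity.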
So I started to play around with this, and I got this very simple idea, very simple. I said, rather than doing this with individual words, if I take a sentence or a paragraph, and I take all these words from the paragraph, and I calculate a centroid, so the center of those vector embeddings, then I can represent that paragraph in vector space. So now I can do semantic search over those kinds of...
That was very early on; that was the idea.
And I wasn't sure how to build it yet, how to structure it yet, those kind of things.
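The centroid idea can be sketched in a few lines. Again, the word vectors below are made up for illustration; the approach Bob describes used embeddings from a pretrained model such as GloVe.

```python
import numpy as np

# Toy word embeddings (invented values; originally these would come
# from a pretrained word-embedding model such as GloVe).
word_vectors = {
    "dog":    np.array([0.9, 0.1, 0.0]),
    "puppy":  np.array([0.8, 0.2, 0.1]),
    "barks":  np.array([0.7, 0.0, 0.2]),
    "stock":  np.array([0.0, 0.9, 0.1]),
    "market": np.array([0.1, 0.8, 0.2]),
    "falls":  np.array([0.2, 0.7, 0.0]),
}

def embed(text):
    """Represent a text as the centroid (mean) of its word embeddings."""
    vectors = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    return np.mean(vectors, axis=0)

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

docs = ["the dog barks", "the stock market falls"]
doc_vectors = [embed(d) for d in docs]

# Semantic search: "puppy" never appears in either document,
# but its vector is closest to the dog document's centroid.
query = embed("puppy")
best = max(range(len(docs)), key=lambda i: cosine(query, doc_vectors[i]))
print(docs[best])  # the dog barks
```

Notice that this matches on meaning rather than keywords, which is exactly what a plain inverted index cannot do.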
And so something very logical that you start to do is take a database, right? And we experimented with different databases. You try to store these embeddings in there.
Storing is not the problem, retrieving is.
So then you get in a situation, how are we going to retrieve that stuff?
And so we were experimenting with the very early so-called approximate nearest neighbor
algorithms.
We can double click on that if you like, but I was excited to experiment with it.
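To make the retrieval problem concrete, here is the brute-force baseline that approximate nearest neighbor (ANN) algorithms improve on: exact search compares the query against every stored vector, O(n) per query, which is the cost ANN indexes such as HNSW avoid in exchange for approximate results. A sketch, with random vectors standing in for real embeddings:

```python
import numpy as np

# Exact ("brute force") nearest-neighbor retrieval: compare the query
# against every stored vector. This is O(n) per query, which is why
# ANN indexes exist -- they trade a little recall for sub-linear search.
rng = np.random.default_rng(0)
stored = rng.normal(size=(10_000, 64))            # 10k 64-dim embeddings
stored /= np.linalg.norm(stored, axis=1, keepdims=True)  # unit-normalize

def knn(query, k=5):
    """Return indices of the k most similar stored vectors."""
    query = query / np.linalg.norm(query)
    scores = stored @ query                       # cosine similarity (unit vectors)
    return np.argsort(scores)[::-1][:k]           # top-k, most similar first

ids = knn(rng.normal(size=64))
print(ids)
```

At ten thousand vectors this is instant; at hundreds of millions, scanning everything per query is exactly the scaling wall the conversation keeps returning to.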
And then my co-founder, Etienne, started to play a very important role because he said, well, actually
for search
back then, like traditional search,
the library
that's kind of used is Lucene.
So Lucene sits in Solr, it sits in
Elasticsearch, it sits in MongoDB,
it sits in all these kinds of solutions.
And the way that
Lucene shards on scale
is suboptimal for sharding the approximate nearest neighbor index.
So then we were like, hey, wait a second.
Now we have this semantic search use case.
And we know that there's like a mode for a new database.
Let's build a new database.
And that's how the idea was born to start to work on Weaviate.
And so in Weaviate currently, you store a data object like you would do in any NoSQL database,
and you can add a vector embedding to it, right?
And that is how that was born and how all these things came together.
Back then, we did not use the term vector database, vector search engine, anything like
that.
We were just looking for ways to position it.
What did you call it back then?
Just out of curiosity, what did you call it?
Because I mean, vector database and embeddings have become really popular terms with the
hype in the last year around LLMs.
But what did you call it back then? Because you were doing things that were sort of primitives
for Knowledge Graph.
Yeah, because you had these data objects floating around
the vector space, and then you could make a link between them.
And it's like, okay, so that was...
But that caused a lot of confusion, because knowledge graph was kind of adopted by the semantic web people.
And I remember that I
went to semantic web conferences
presenting Weaviate
and people didn't get the concept
of using embeddings.
They were like, so you have a keyword list
or something, or how do you map?
I said, no, I'm using the representations
from the ML models.
By the way, this is before the Transformers
paper, etc. was released, right?
So it took a long time for people to get an understanding
of the embeddings data type and what you could do with it.
I've been talking about this stuff for a long time.
Probably on YouTube, you can find some of these old talks.
Don't try to find them,
but they're probably floating around somewhere.
We won't put them in the show notes, yeah.
So, okay, so
one more question. I'm sorry, again, I can't
help myself, but
before we get into databases,
how did you...
That's a lot of stamina, right? I mean, you've been doing Weaviate for, what, like, four or five years now? I mean, quite some time.
And so, before embeddings were cool,
before the language around vector was cool.
And obviously you've never had a boss,
so you're of course paving your own way,
but that's a lot of endurance it seems like
to sort of present at a conference
where people don't get it.
How did you deal with that? You know, I mean,
you obviously believe in it enough to, to continue on, but that's hard.
Yeah, that's a great question. So one of the things, and this maybe also relates back to what we talked about with art, is that I can really fall in love with something, right? And then I just put my, you know, my intellectual claws in it and I just don't let go. And I'm blessed with a very wonderful life: I meet amazing people and I'm able to build an amazing team.
And that was not planned. I just go with the flow, you know? So I never saw that as an issue. I'm just enjoying the ride, and it's just amazing. So I appreciate that you say that, but that's not how it feels. It just feels like I'm going with the flow.
It's interesting, if we tie it back to music, you hear musicians talk about, you know, writing a certain song, right? And they're like, well, how did you do that? And a lot of times you'll hear musicians
describe it as, well, it wasn't that I set out to write a catchy song. I just
had something inside of me that I needed to express. And so this is just a process of me
expressing what's inside of me, right? And it happens to be that it's a song and it happens
to be that it resonated with
a lot of other people or whatever, but that sounds really similar to what you're talking about.
Yes. And one of the things that I do right now, so I'm in a very fortunate position that I can
talk to a lot of young people, right? Who are studying or do experiments with things that I,
and one of the things that I'm trying to get across to them is like,
whatever you do,
make something,
create something.
And if you're now a student, so for example, when I talk at a business school or something, if I talk about software, a lot of people show up because, you know, tech. And I say: if you now work at a Starbucks, keep working at the Starbucks and try to build something. Don't get excited by these big companies, you know, with their big offers. Try to use the time you have to make something.
And I don't care.
So the two talents, I guess, that I have are in working with software and in music.
But if it's cooking, then cook, right?
If it's writing, write.
If it's branding, start a design, whatever, right?
But make something.
And because life is so much fun,
if you make stuff,
whatever stuff you make.
And that is what I've been doing always.
I've been making stuff, and what I'm doing now is just company building. That's a form of making that gives a lot of joy. So yeah.
I love it.
I love it. I mean, gosh, we could keep going down that path, but okay, I want to get technical.
And can we talk about OLAP databases?
So let's start with this: Weaviate is a form of database, and we can talk about the specifics.
But I want to go back to basics. Most people who are
interacting with a database are interacting with an OLTP or OLAP database, right? And mostly OLAP,
if you're doing any sort of SQL-based workflow where you're building analytics or anything,
right? And so that runs the world, right? I mean, any KPIs at any company of any
size, we're talking about OLAP workflows, right? Yes. Okay. If we think about the spectrum of complexity of databases and use cases, Weaviate to me, as I understand it, is much further along that spectrum.
And I think the step in between, and you correct me if I'm wrong because you're the expert,
but a lot of teams, when they're working in OLAP and they have 10,000 lines of SQL and
it's getting really crazy, a lot of times they'll say like, okay, maybe we need to do
like graph, which will help us solve some of the relationships that we're trying to represent in
SQL. Okay, great. So they like move to graph. So I think, you know, probably a lot of people are
familiar with a graph database. And then we have a vector database. And so we actually haven't on the show, I think,
had a specific discussion about vector database. And so can you just help us understand when you
go from sort of OLAP to graph to vector use cases, and paint the spectrum of database use cases for us in terms of complexity?
Yeah.
So I can offer a way for people to think about this, right? Or at least how I think about it.
So if you envision it, let's say you have a big circle. And in the center of the circle, you have databases like Postgres, MySQL, those kinds of things.
And these kind of databases are, you could say, catch-all databases, right?
So you can do everything with them.
So you can make graph connections, like, in the form of, like, joins.
You can store vector embeddings nowadays in them.
You can store strings in them and those kinds of things.
And that's great.
For a lot of use cases, these databases are fantastic.
And the people designing these databases,
they make trade-offs in the decisions they make
to build these kinds of databases to support all these cases.
But that means that there is a limit to that, right?
There's a limit.
So let's say graph as an example, I think, because graph is a great example.
If it turns out that your data set is very reliant on these graph connections, then you
run into an issue at some point.
And we all know the, or what, maybe not everybody, but there's this term like join hell, right?
So at some point.
Sure.
Yeah.
So then you say, well, actually, we run into join hell. So in that circle in the center, we have these core SQL databases, and we move a little bit outside of that.
Right.
So we start to move in the no SQL space.
It's not SQL anymore. So we're moving it. We're saying, we got to design something from the ground up
that is very good at dealing with these graph structures. So if you don't have these graph
structures or just a tiny graph, then it's fine. Stay in the center. But if you want to do
something more towards the fringe.
Sure.
And what we often see is that, if you look at the data types you have in the center: a relation became graph databases.
Date stamps became time series databases.
Yep.
Searches became search engines.
Yep. The specific data types that you have in these databases, they kind of ask for a bigger use case.
They kind of ask for their own category, basically.
Sure.
As you start to scale, databases emerge that solve these particular problems, for sure.
Exactly.
So what you start to see with vector databases is exactly the same thing.
I think, so you have a vector embedding, which is a data type in itself, right?
And the uniqueness in these databases is not so much in storing them as much as in retrieving them fast.
One of the things that we started to see, right, if you look back at history, is that starting from the perspective of that SQL database and then going out to the fringes is kind of skipped.
So then we go and say, okay, we see this new data type, e.g. vector embedding.
Let's just start in that category, right?
So let's just create that category and work in that category. Because what starts to happen
is that with these databases, hence the name: not SQL, NoSQL, you start to have different ways to interact with the database that are very well suited for that specific data type.
You want to have different APIs
with a time series database
than you might want to have
with a vector database, for example. So that is how I visualize it. So yes, in that center, everything comes
together in the center. But the moment you want to really double down on one of these data types
in the SQL databases, you're probably better off with a purpose-built database, regardless if it's
graph, time series, vector, whatever.
Yeah, 100%.
Okay, well, one more question for me, because I've been monopolizing here, because I know
Kostas has a bunch of questions, but can you just define a vector database?
I mean, a graph database, I think, makes sense to a lot of people because creating relationships
between nodes and edges in SQL is brutal at scale.
And so a graph database is a very logical conclusion if you need to represent social relationships or something like that.
So that's logical, I think.
But what's a vector database?
And what's sort of the graph thing?
Let's say like that's social, right?
You move from the center out of Postgres and you need to represent complex social relationships.
And so graph makes sense.
What's the thing that pulls a vector out of the center?
And can you describe the vector database? Yeah. So a vector database is, in essence, often a search engine, in the majority of cases. It's a type of search engine
where the vector embedding is the first class citizen, right? So that is the first class citizen
in the database. So the way that the database shards, the way that you scale
with it, all those kinds of things, these architectural decisions in building a scalable database,
they go all the way back to looking at it from the perspective of the vector index,
right?
That sits at the heart of it.
So that's how I would define it.
It's just a database where the vector is a first-class citizen.
And then you have a UX element to that.
So the way that developers interact with the database
is tailored to that, to those types of use cases.
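As a sketch of that definition, the following is a minimal in-memory store where the vector embedding travels with the object and retrieval is by similarity. To be clear, this class and its method names are hypothetical illustrations, not Weaviate's API; a real vector database adds an ANN index, sharding, replication, and persistence around this core idea.

```python
import numpy as np

# A minimal, hypothetical sketch of "the vector as a first-class citizen":
# each stored object carries ordinary properties plus an embedding, with
# CRUD operations and similarity-based retrieval. NOT Weaviate's API.
class TinyVectorStore:
    def __init__(self):
        self.objects = {}  # id -> (properties dict, unit-normalized vector)

    def upsert(self, obj_id, properties, vector):
        v = np.asarray(vector, dtype=float)
        self.objects[obj_id] = (properties, v / np.linalg.norm(v))

    def delete(self, obj_id):
        self.objects.pop(obj_id, None)

    def nearest(self, vector, k=1):
        """Return the k objects whose embeddings are most similar to the query."""
        q = np.asarray(vector, dtype=float)
        q = q / np.linalg.norm(q)
        scored = sorted(
            self.objects.items(),
            key=lambda item: float(item[1][1] @ q),  # cosine on unit vectors
            reverse=True,
        )
        return [(obj_id, props) for obj_id, (props, _) in scored[:k]]

store = TinyVectorStore()
store.upsert("a", {"title": "dog article"}, [0.9, 0.1])
store.upsert("b", {"title": "finance article"}, [0.1, 0.9])
print(store.nearest([1.0, 0.2], k=1))  # [('a', {'title': 'dog article'})]
```

The `nearest` call here scans every object; replacing that scan with a scalable approximate index is precisely the architectural decision Bob says has to sit at the heart of the database.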
Okay.
What's the difference between a vector database
and something like Lucene, right?
Because I'm old enough to experience, let's say, the introduction of inverted indexes.
And suddenly we were like, oh my God, we can have so much faster retrieval of data. By the way, I think I'm revealing a little bit here, because Lucene is also about retrieval, right? It's how we can trade off upfront processing
to be able to go and search very quickly
the kind of unstructured data, right?
And there's a lot of NLP work
that has been going inside this library, right?
But we have had this for many years, right? And we have products and businesses, actually, that have been built on top of that, right?
We have Algolia, for example.
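For readers who haven't seen one, a minimal inverted index shows the trade-off Kostas is describing: you pay at indexing time so that a term lookup touches only a posting list instead of scanning every document. This toy version skips everything Lucene adds on top, such as analyzers, BM25 scoring, and compression.

```python
from collections import defaultdict

# A minimal inverted index: do the work up front (tokenize and index)
# so that a query touches only the posting lists of its terms,
# not every document in the collection.
docs = {
    0: "the quick brown fox",
    1: "the lazy dog",
    2: "the quick dog barks",
}

index = defaultdict(set)            # term -> set of doc ids ("posting list")
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def search(*terms):
    """Return ids of documents containing ALL query terms (boolean AND)."""
    postings = [index[t] for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

print(search("quick", "dog"))       # [2]
```

Note the contrast with the embedding example earlier: this index only ever matches exact terms, which is why the two techniques keep coming up as complements rather than substitutes.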
And, okay, obviously another company that came from the Netherlands, if I'm not wrong, is Elastic, right?
Yes.
So there is something in the Netherlands about...
What's that?
It's a saying. Yeah.
So what's the difference? And by the way, a follow-up question to that: how do they compete? Or do they complement each other? And how do you see the business case there?
Thank you. This is an excellent question, because I get this question a lot. So I'm happy that you're asking it, because now we can broadcast the insights.
So, a little bit of a preamble before I go into the answer. There are three things that play a role here, right? One is search algorithms, regardless of whether that's for vectors, like approximate nearest neighbor, or more traditional keyword search, BM25, those kinds of things.
So we have the algorithms.
Then we have libraries.
And a library contains one or more of these algorithms.
And then you have databases.
And a database can be built around a library. It doesn't have to be.
And what you try to do with the database is that you offer functionality that people expect from a database.
So, for example, CRUD support, create, read, update, delete support.
It can be backups.
It can be storage.
It can be, if it's transactional, you know, certain guarantees or et cetera, et cetera.
So those are three distinct different things.
So Lucene is a library of a collection of search algorithms, mostly tailored around keyword search. It's, in the world of software, relatively old. And I mean that in a positive way: an old library that has brought a lot of value to a lot of people.
And there's also an equivalent, actually, for ML, and that's called Faiss. That's built by Facebook.
So those are two libraries.
And so what you started to see with Lucene was people saying, hey, we can take Lucene
and we can turn it into a database.
Add that layer of functionality around it that makes it a database.
Elastic, Solr.
I believe Neo4j uses Lucene for sure.
Et cetera, et cetera, et cetera.
So the people started to add that.
So now a very logical question would be, okay, great.
So now we have a new data type, vector embeddings.
Let's add it to Lucene, right?
That's a very logical thing.
People did that.
So if you now see these databases that I just mentioned talk about vector search, they're often Lucene-based and they use the ANN algorithm that's in Lucene.
So now the question becomes: why, then, new databases, right? Why not just leverage Lucene? And the answer has to do with the fact that if you use Lucene at the core of your database for search, you're bound to the JVM and those kinds of things.
So you shard the database in a specific way, right?
And it turns out that the algorithms used to scale approximate nearest neighbor are suboptimal in Lucene.
You can even see this in the open source Lucene community
that people are debating this till today.
Some people disagree with Lucene doing this at all.
So now all of a sudden, and that's why we thought,
hey, we believe that there's room in the market
to build something new because, of course,
a production database needs to shard and replicate
and what have you.
And that's where that's coming from.
So what people will notice is that if they use a Lucene-based search engine or database for really heavy vector processing work, they will run into scalability problems. And that is why you see these new databases. And you had a second question, but I forgot what the second question was, sorry about that.
I also forgot, to be honest, but I remembered it, I think. It might be a new one, but it's okay, it's fine. Yeah, whether they compete, actually, or how you see the future, right? Because if you think about it, okay, let's take Solr, and I'll keep Elastic a little bit outside of the equation here for a very particular reason, which is business oriented. I would like to come back to that.
But if we talk like primarily for information retrieval and search
of unstructured text, right?
Maybe someone can argue, okay, why do we need both, right? Or do we need both, that is the question. So the question here is: is there a future for these kinds of inverted indexing techniques? Are they going to be, let's say, abandoned because vector space is just a better representation and adds all the semantics? Or is there a reason to keep them both, right? Even if they are not in the same system; let's say they're just different systems. Let's not talk about the systems here. It's more about how complementary these technologies are, in terms of the use cases they share when we use them as products, right?
Yes.
So this is an excellent question.
I can actually marry the two things together.
So since you kept Elastic out, if I may, I'll bring that back in, right, to answer your question.
So the thing is, there's one thing we know when it comes to NLP, because of course we can do ML not only with text but also with other things. But purely when it comes to text, we know actually that mixing the two indices works best.
So a hybrid search yields the best results, where you mix the dense and the sparse index
together.
So the embedding from the model, and then, for example, a BM25 index.
That works best.
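The hybrid idea described here can be sketched in a few lines. This is a minimal illustration with made-up scores; the min-max normalization and the `alpha` fusion weight are assumptions for the sketch, not any particular database's exact implementation:

```python
# Toy corpus with precomputed scores from two separate indices.
# In a real system the sparse scores come from BM25 and the dense
# scores from embedding-model cosine similarity; these numbers are
# invented for illustration.
sparse_scores = {"doc1": 2.4, "doc2": 0.3, "doc3": 1.1}    # keyword (BM25-like)
dense_scores  = {"doc1": 0.62, "doc2": 0.91, "doc3": 0.55}  # vector similarity

def minmax(scores):
    """Normalize scores to [0, 1] so the two indices are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    return {d: (s - lo) / (hi - lo) if hi > lo else 0.0 for d, s in scores.items()}

def hybrid(sparse, dense, alpha=0.5):
    """Fuse the two rankings: alpha=1 is pure vector, alpha=0 is pure keyword."""
    sp, de = minmax(sparse), minmax(dense)
    fused = {d: alpha * de[d] + (1 - alpha) * sp[d] for d in sp}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

print(hybrid(sparse_scores, dense_scores, alpha=0.5))
```

Sliding `alpha` between 0 and 1 trades keyword precision against semantic recall, which is the knob hybrid search exposes.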
And it turns out, and here the word scale plays an important role, because especially at scale, that starts to play a role.
Now, what's interesting though,
and now I'm taking off the tech hat and putting on my product slash business hat.
For everybody listening, if they have the ambition to build their own database company or those kinds of things, that's a little trick, a little thing that I can share.
Right.
And this is with the exception of the SQL databases that we talked about, the MySQLs and Postgreses of this world. But for everything around that, what you start
to see is that people start to position these databases at something that they are uniquely
good at, right?
And so if you take Elastic as an example, right. So yes, the database is the search engine.
And yes, that's what a lot of people use and makes a lot of people happy.
Right. It is actually what they build as a business is focusing on observability and cybersecurity.
So what I'm doing in my role is that I'm asking the question,
if I take the vector database, what's that for us? And it turned out that we learned, and the release of ChatGPT played a very important role in that, what that unique thing is for the vector database. And that comes in the form of something called retrieval augmented generation. We can talk about that if you like, but that is to answer your question.
So that is the big difference, right?
So at some point it goes out of the realm of purely that architectural decision.
So like now this database exists and it's structured in a certain way with a certain
architecture.
What kind of use cases does that enable that
are unique for this database?
And that is how these companies start to grow around the database.
So yes, Elastic is a search engine, but maybe a business buyer might say, well, for us,
it's just an observability tool, right?
And it plays a tremendously important role in that.
So, and vector databases start to gravitate
in another direction, right?
More the generative AI direction
where they just play such a crucial role.
Actually, you did like an excellent job
in answering my next question.
That's why I wanted like to keep Elastic outside
like for a follow-up question,
exactly for that reason, because I would like to say exactly what you said,
that Elastic ended up as something that is a product for observability,
used primarily for that.
And that's why things actually get really interesting
when you start building businesses and products on top of the technology itself.
And my follow-up question would be exactly that.
Like, where do you see these embedding databases or vector databases,
whatever you want to call them, like, going towards, right?
What's the equivalent of observability for these systems?
And we will get to that. But before we do, I think I'd like to go a little bit back to high school, I would say, and talk a little bit about algebra, vectors, cosine similarity, and the basics of how you retrieve information from these systems. And hopefully it will sound extremely familiar to everyone, before we go into the indexing part, the sophisticated approximate algorithms and all these things. But I'd love to hear from you: what is the basic operation that actually happens?
That's pretty much like everyone who went through high school probably knows about,
right?
Which I find very fascinating, actually.
Yeah, so it's funny that you ask this question, because I've been looking for a lot of metaphors and those kinds of things to explain it. And the question is always, how deep do you want to go?
And maybe this is interesting for the show notes: Stephen Wolfram did a very interesting blog post where, in very easy to understand language, he goes really into, hey, how do you create an embedding?
And he takes Wikipedia pages as an example.
So he says, for example, if you have the sentence "the cat sat on the", then how do you get to the word "mat", right? So how do you get there? And then he explains that from distances between words and sentences and those kinds of things. So I will not go into that, because then we just need another 30 minutes to get into it.
But let me take as my starting point
that what the machine learning model does,
and technically speaking,
you don't need a machine learning model for it.
It's just, otherwise you need to do it brute force
and it's just going to take way too long.
So you want to predict.
And what you're predicting with the vector embedding is a geometrical representation of these data objects close to each other. So very simply put, if you think about a two-dimensional sheet of paper, and you have individual words, then probably the words banana and apple are going to be more closely together in that two-dimensional space than the word monkey, right?
So, and maybe the word monkey sits closer to banana than it sits to apple, right?
Because somewhere in the texts it was trained on, it says something like, you know, monkeys live in the jungle and they like to eat bananas, blah, blah, blah. So the word banana in these sentences is more
closely related to a monkey than apple, for example. But, you know, if you are at the,
for example, Wikipedia page of fruit, they make, you know, examples of fruits, apples, bananas, et cetera. And so what we also learned was if we do
that in two dimensions, we lose too much context. So we can't represent it in two dimensions. We
can't represent it in three dimensions, but it kind of starts to make sense from 90 dimensions and up. So the smallest representation back in the days from
GloVe was 90 dimensions. But geometrically speaking, it's the same. And the distance
calculations that I use, so cosine similarity, Euclidean distance, those kinds of things,
those are just the same mathematical distance metrics that you would use in a two-dimensional or three-dimensional space; you just apply them in a multi-dimensional space. But conceptually, it's the exact same thing. For example, Stephen Wolfram gives a great example in his article where he says, if I want to brute force calculate this from the Wikipedia pages for dog and cat, that's kind of doable, but then it doesn't really make sense. And if I want to do it for the whole of Wikipedia, I need a prediction model. And if I want to build GPT and do it for the whole web, then it's impossible to do that brute force, right? I mean, technically, theoretically speaking, it's possible, but it's very impractical.
So what the models do, the models predict where these words sit in the distance metric.
And later that evolved from single words to sentences.
So, for example, that's why you got things like Sentence-BERT, right? So that you did that for a full sentence,
et cetera.
And that's how that started to evolve.
But in the end, there are two types of models.
And the first type of model is one that generates a vector embedding. And we're not talking about generative AI, which you get with something like ChatGPT; just talking about a model that generates these vector embeddings. And it turns out you can represent text, images, audio, heat maps,
what have you, in vector space.
And then if you store them in a database,
you can find similar items.
And that's how we do search.
Awesome.
So, okay.
One way to do it is, let's say, you have a bag of all these vectors there, and you get a query. When we say query, it's not like a SQL query, it's just a textual question that a human would write down, right? That turns again into this vector representation.
And then you start finding similarities with a bag of, let's say, vectors that you
have, right?
And you can brute force that, right?
Like you can go there, do like a cosine similarity across all of them
and choose, let's say, I don't know, like the best one
or the five best ones or like whatever, right?
Now, why don't we just do that for retrieval? And at what point should someone who builds an application start considering indexes, approximate algorithms, and in the end, a database system like Weaviate that does that at scale?
Yeah, so that's an excellent question. So the way of doing that brute force, as you described, is a way people do it.
To go even a step further, when I started, when I was starting to play with this, that's how I did it, right?
I did a brute force.
So what you do is this.
So let's say that you store the vector embedding again for, what did we have?
Apple, banana, and monkey.
And you have a semantic search question where you say, like, you know, which animal lives in the jungle, right?
Then you get a vector embedding back.
So if you did brute force, you compare that to apple, and it gives you a distance. Then you compare that to banana, which gives you a distance. And then you compare that with monkey, and it gives you a distance. So now you have three distances, and you can basically order them by the shortest distance.
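That three-distance comparison can be sketched directly. The embeddings and the query vector here are made up for illustration; real models emit hundreds or thousands of dimensions, but the ranking logic is identical:

```python
import math

# Invented 4-dimensional embeddings purely for illustration.
embeddings = {
    "apple":  [0.9, 0.1, 0.0, 0.2],
    "banana": [0.8, 0.3, 0.1, 0.1],
    "monkey": [0.1, 0.9, 0.8, 0.0],
}
query = [0.2, 0.8, 0.9, 0.1]  # pretend embedding of "which animal lives in the jungle?"

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

# Brute force: one distance computation per stored object, then sort.
ranked = sorted(embeddings, key=lambda w: cosine_distance(query, embeddings[w]))
print(ranked)  # monkey comes out closest to the jungle question
```

Note the cost: one distance per stored vector, which is exactly the linear scaling discussed next.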
Yep.
Great. So that's a linear function, right? So if I now add a fourth data object, you know, it takes a little bit longer. A fifth takes a little bit longer, et cetera, et cetera. So if you now have a database or a serious project,
and we're not having three data objects,
but we have 100,000, a million, 10 million,
we have users that have that in the billion scale.
So now imagine that you have a production use case
for an e-commerce player, right?
And you not only have like a billion data objects, but you also have multiple requests
at the same time.
You don't want to do that brute force because then people can go on a holiday and when they
come back, they get a search result, right?
So I exaggerated a little bit, but that's the brute force problem.
And this is where the academic world started to help because they invented something called
approximate nearest neighbors.
So these things live in vector space. So you place this query, hey, which animal lives in the jungle, and you look for the nearest neighbors.
And what that does, those algorithms, they are lightning fast.
They're super fast in retrieving that information
from vector space. You pay with something though. You pay with the A, with the approximation.
So now with brute force, you can guarantee 100% accuracy. You can't do that with the
approximate algorithm. So you pay with approximation, but what you get in return
is that speed improvement.
And now all of a sudden we can build these production systems.
And what the database does for you is make sure that you can run that production system rather than just adopting or using the algorithm.
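A toy illustration of that speed-for-accuracy trade: this is a simple random-hyperplane hash, not the HNSW-style algorithm production vector databases actually use, but it shows the idea of scanning one bucket instead of everything, and accepting that the true nearest neighbor may be missed:

```python
import random

random.seed(7)

DIM, N_PLANES = 8, 3  # tiny sizes for illustration

# A toy locality-sensitive hash: the sign pattern of dot products with a few
# random hyperplanes. Nearby vectors tend to land in the same bucket.
planes = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N_PLANES)]

def bucket(v):
    return tuple(1 if sum(p * x for p, x in zip(plane, v)) >= 0 else 0
                 for plane in planes)

vectors = {f"doc{i}": [random.gauss(0, 1) for _ in range(DIM)] for i in range(1000)}

# Build the index: group every vector by its bucket.
index = {}
for name, v in vectors.items():
    index.setdefault(bucket(v), []).append(name)

def ann_search(q):
    # Only score candidates in the query's bucket: fast, but the true
    # nearest neighbor may live in another bucket. That miss is the
    # "A" (approximate) you pay with.
    candidates = index.get(bucket(q), [])
    return min(candidates,
               key=lambda n: -sum(a * b for a, b in zip(q, vectors[n])),
               default=None)

q = [random.gauss(0, 1) for _ in range(DIM)]
print(ann_search(q), "candidates scanned:", len(index.get(bucket(q), [])))
```

Instead of 1,000 distance computations, the search scores only one bucket's worth, which is the shape of the speedup ANN algorithms deliver at scale.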
Awesome. We can talk a lot about that stuff, but I want to spend some time about the use cases and more about where these embedding databases are going forward.
What kind of categories, product categories, are starting to form there.
You mentioned something about RAG and OpenAI and LLMs. So tell us a little bit more about that: how is Weaviate, or any system like Weaviate, adding value to a user, and what's the use case, right? Yeah. So I'll again take off my tech hat
and put on my business hat.
So the way you can look at this is like
if you build a new type of database
based on a new data type, e.g. a vector database,
you can look at the use cases. So we are open source, right? So people start building stuff, and you just look at what these people build, and you ask them, you know, what are you building?
And then what you try to do with the answers that people give you, you basically put them
in a box, right?
So you put them in the box.
Are they building a displacement service or are they doing something completely new?
Yep.
And the displacement service for us is, as I like to call it,
better search and better recommendation.
So people were like, you know, we're not
happy with the keyword search result that we're
getting. We're going to adopt these machine
learning models to better search.
So that is, for example, why
we got functionality like hybrid
search and those kind of things, because
it helps people to do
better search.
But then all of a sudden, we, and this is really in the, as I like to call it, post-GPT
era, people started to do something.
And what they started to do was that they said, well, we love these generative models, you know, GPT, ChatGPT, et cetera. So the generative models, but also the open source models, the Cohere models, nowadays the Anthropic models, whatever, right?
But they said, like, we have one problem.
We want to do that with our own data.
And so what started to emerge was that people used the vector database
to have a semantic search query, return these data objects, and inject them into the prompt.
And that process is called retrieval augmented generation.
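That retrieve-then-inject flow can be sketched as follows. Everything here is a stand-in: the two documents, the fake `embed` function, and the prompt template are all invented to show the shape of primitive RAG, not any real API:

```python
import math

docs = {
    "policy":   "Refunds are accepted within 30 days of purchase.",
    "shipping": "Orders ship within 2 business days.",
}
# Pretend embeddings; a real system would call an embedding model.
doc_vecs = {"policy": [0.9, 0.1], "shipping": [0.1, 0.9]}

def embed(text):
    # Hypothetical stand-in for an embedding model: route refund-ish
    # questions toward the 'policy' region of the toy vector space.
    return [1.0, 0.0] if "refund" in text.lower() else [0.0, 1.0]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def retrieve(question, k=1):
    # Semantic search: rank documents by similarity to the question vector.
    qv = embed(question)
    return sorted(docs, key=lambda d: cosine(qv, doc_vecs[d]), reverse=True)[:k]

def build_prompt(question):
    # The "injection": paste retrieved objects into the generative prompt.
    context = "\n".join(docs[d] for d in retrieve(question))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

print(build_prompt("Can I get a refund?"))
```

The resulting prompt would then be sent to a generative model; the model never sees the whole database, only what retrieval surfaced.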
So what I did now is that now you can double-click on that again, right?
And you can say, okay, so, you know, but that's quite primitive, right?
It's a primitive way of, it's nice, it works,
it makes people very happy,
but it's a primitive way of injecting that into the prompt.
And now you see there's a lot of research happening
and not research as in like,
we need to wait two more years before we have it.
No, it's released now, and you can see the first signs of that already.
Where research has said like, well, let's not do that.
Let's marry the vector store that's in the vector database with the retrieval in the model.
So the model knows that it needs to do retrieval, and it gets back a bunch of vectors, and it's able to generate something based on that. So now from a business perspective, you have a very unique use case within vector databases: you not only have a uniquely positioned vector database, it also solves the user problems
of data privacy, not having to fine tune your model, explainability.
So the model generated X that came from document Y in the database.
So it ticks all these kind of boxes.
So now zooming back out, I now have my two displacement services, which are great. But then the first one in my new use cases is what we see: people use the vector database to do every nice, cool Gen AI thing with their own data.
Okay, so how does this work?
I mean, you mentioned that, okay, you retrieve some data, you ingest them like in the prompt, right?
That's straightforward. You mentioned also that there's research around
that goes beyond that. Tell us a little bit more about that. That sounds quite fascinating.
What happens there? The user is actually questioning the model directly and the model goes and retrieves the data from the index, from the
vector database?
Is the user aware of that? Like, how does
this work?
So,
first of all, the user is not aware of this, right?
So, this is a great thing, right?
So, that is, by the way, a quick side note on something I believe: I think we should not make it too complex for our users, that they need to create their embeddings. We need to help them do that. And especially for the RAG use cases, we just need to do that for them. So the injection into the prompt is what I call primitive RAG, right? It's a very straightforward primitive.
But, and this is something I could point people to
that they might find this interesting.
If you go to Hugging Face, in the Transformers library, there's a folder in there called RAG. And there they have a little bit more of what I call sophisticated RAG.
So they use two models,
one model to create a vector embedding
to store data in the vector space and retrieve them.
And another model where you feed in the vector embeddings and tokens come out on the other end to generate an answer.
So the more efficiently you want to marry them, as we like to say, you want to weave them together, right?
The model and the database.
Rather than them as two separate entities.
But that means what you can do is, for example, you can create smaller ML models.
Because now you don't have to store all the knowledge in the model.
You just need to have its language understanding.
And the model just needs to know, I now need to retrieve something from the database.
You get to real-time use cases because these databases can update very fast and you can just keep doing vector search over them. So the closer we can
bring them together and marry them together, the more efficiently they work. So now you see,
for example, that out of the models don't come vector embeddings anymore, but like binary
representations of these embeddings and those kinds of things. And if the database can eat that information and provide an answer,
then the UX tremendously increases, right?
Because you just marry the two things together.
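One way to picture the binary-representation idea mentioned above: collapse each dimension of a float embedding to its sign bit, then compare with Hamming distance instead of cosine. This is a common quantization scheme in general, not necessarily the exact one any given model or database uses:

```python
def binarize(vec):
    # Keep only the sign of each dimension: one bit per dimension.
    return tuple(1 if x >= 0 else 0 for x in vec)

def hamming(a, b):
    # Distance between binary codes: count differing bits.
    return sum(x != y for x, y in zip(a, b))

v1 = [0.8, -0.2, 0.5, -0.9, 0.1, 0.3, -0.4, 0.7]
v2 = [0.7, -0.1, 0.4, -0.8, 0.2, 0.2, -0.5, 0.6]   # very similar to v1
v3 = [-0.8, 0.2, -0.5, 0.9, -0.1, -0.3, 0.4, -0.7]  # roughly v1 negated

b1, b2, b3 = binarize(v1), binarize(v2), binarize(v3)
print(hamming(b1, b2), hamming(b1, b3))  # 0 8: similar vectors share bits
```

The appeal is that a bit code is tiny and Hamming distance is extremely cheap, so a database that can "eat" these codes can search much faster, at some cost in precision.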
Yeah, 100%.
I mean, it's like, I guess we have the UX,
like the user experience, the developer experience,
and probably we start to have like working on the model experience.
Because from what I hear, we are starting to build databases that are going to primarily interface with the models, not with humans.
Yes, exactly.
And this is something, so I'm happy that you bring this up because this is like
a second use case that we're working on.
And ironically, we kind of discovered that one before we had the whole RAG use case.
And my colleague Connor wrote a beautiful blog post on the Weaviate blog about what we call generative feedback loops.
So what you now start to do is you have a bunch of data in your database, in your vector database, that interacts with a generative model.
But everything we've just been discussing until now is like a flow, right?
So you have a user query, the model processes that, knows that it needs to get something from the database, injects it into the generative model, something comes out.
So we go from left to right.
But there's no reason why you can't feed that back into your database. So the use case that Connor is describing in the blog post is that he has Airbnb data that has the name of the host, the price, the location, but not a description.
Yep.
So basically what he does is say, okay, I use the RAG use case to show me all listings without a description, generate a description for each of these listings, for an elderly couple traveling, a younger couple traveling, and store that with a vector embedding back into the database. And now all of a sudden,
the database starts to populate itself.
You can use it for the database to clean itself
and those kinds of things.
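The loop described here can be sketched like this. The `generate` and `embed` functions are hypothetical stand-ins for real model calls, and the listing data is invented:

```python
listings = [
    {"host": "Ana",  "price": 120, "city": "Lisbon", "description": None},
    {"host": "Marc", "price": 95,  "city": "Ghent",  "description": "Cozy loft."},
]

def generate(listing):
    # Hypothetical generative-model call; here just a template.
    return f"A {listing['city']} stay hosted by {listing['host']} for ${listing['price']}/night."

def embed(text):
    # Hypothetical embedding call; here a trivial placeholder vector.
    return [len(text), text.count(" ")]

# The feedback loop: model output flows back into the database,
# so the database starts to populate itself.
for row in listings:
    if row["description"] is None:
        row["description"] = generate(row)
        row["vector"] = embed(row["description"])

print(listings[0]["description"])
```

The key point is the direction of the arrow: instead of only reading from the database into the prompt, generated results are written back as first-class, searchable data.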
So we now not only have human interaction with the database,
but also model interaction with the database. And that's the second new use case that I'm extremely excited about.
Yeah, okay, we are close, I mean, we are at the end here. So I'd like to have you back to talk more about that stuff, but there is one question that I cannot not ask. And I want you to put your business hat back on again, right?
And as the founder of Weaviate, right, building this database and entering this new era where, let's say, Weaviate itself is not going to be exposed to the user, right? But to the model.
And it sounds like a complement problem here. Who is the complement, as a product, between the LLM and the vector database?
So how do you see this playing out
like from a business viability perspective, right?
Is this going to be a feature of the LLMs?
Is this going to be a category that can stand out on its own?
What's your take on that?
So this is an excellent question.
And as you could imagine, in the role that I have in this company,
it's one that I think about literally daily.
So let me try to answer with what I think it's not. I don't think that pure, standalone vector search, so looking at those traditional plays, just the vector use cases, I don't think that's going to be the answer, right?
So yes, people want that.
Yes, there will be some market for that.
But I think that indeed marrying the two together,
or again, weaving the two together,
that is where the big opportunity lies.
Now, how users will interact with it,
how they will get access to that,
what path they will take to that,
we do not know, right?
So by the time this episode airs, then one of the things you would see is that we have
this combined offering with AWS, right?
So that you use the models in SageMaker and the Weaviate database, and it intertwines.
Yeah.
So people press button and it comes all together.
But we're going to learn
if people will take that path
through the models
or through the database,
but they need both of them.
So hopefully in a couple of months
or a couple of weeks,
I have the answer.
But this is all very new
and very fresh.
But I do think that for us,
at least,
that's the big new step.
So how do you, as a vector database, stay two steps ahead of what's happening in the world?
And I think this is the answer.
Yep, yep.
Awesome.
Okay, Eric, unfortunately,
I have to give the microphone back to you.
Well, only yours. Yes, I guess. You know, it's funny, we sort of split the show in halves.
Okay.
This has been amazing, but Bob, I actually want to end on a very simple question.
When you've had a really long week and you don't want to think about vector databases or embeddings or business
and you put a record on the record player,
what's your go-to?
Like what are the top couple records
that you put on the turntable?
So I am a...
Recently, to relax, I've been listening a lot to the solo concerts of Keith Jarrett.
So I love that.
It goes back to what we discussed.
It's like, for people who don't know him,
he gives a one-hour concert where he sits behind the grand piano and just starts improvising, and every night is different, and you go on that kind of journey. So that is something that I'm listening to a lot. I also like a lot of what's coming out of the scene in LA now, with people like Thundercat, et cetera. I like that a lot. So those kinds of things. So I would answer with music that takes me on that journey, where I can take my mind off things. So that's what I'm currently listening to.
Love it. All right, well,
Bob, it's been so wonderful to have you on the show. We've learned so much, and I think we need to get another one on the books, because obviously we've gone over time, which, sorry, Brooks, but we just had so much to cover. So Bob, yeah, thanks for giving us your time, and we'll have you back on soon.
Thank you so much. And I would love to join you. Thank you.
Kostas, I think one of the things that is... Actually, my takeaway is in many ways a question for you.
We've had some really prestigious academic personas on the show
who have done really significant things.
I think about the Materialize
team. I mean, there's just some
people who have done some really cool
things. And what
was interesting about Bob is that
he studied music, number one,
but he also drew a lot
of influence from academia,
but he's self-taught.
And he's building a
vector database and dealing with embeddings,
which is really interesting to me. And so I guess my takeaway question for you
is, how do you think about that? Because you, of course, studied a lot of complex engineering
stuff in school. But it's amazing to me that he can sort of study music
and apply some of those concepts to very deep engineering concepts and mathematical concepts
to produce an actual product without formal training. I mean, that's pretty amazing. So I
don't know. I'm thinking about that, but I think you're the best
person to help me understand that.
Yeah.
I think if you ask me, we probably need like more people from music into
tech and business.
I don't find it that strange that music studies helped him so much because at the core of,
let's say, music itself, there's creativity, right?
It's like people who play, let's say, an instrument, or who even go and try to have a career around that; they have a very strong need to express and create. They are creators, right? I think the definition of a creator. And it's a creator whose work is, I'd say, a mental creation, right? Music always comes out of something that's in your mind, right?
Yeah, so I think there are many similarities even with going and writing code, at the end, right? You start from something very abstract, something that, yeah, can be represented with math or whatever, right? That's also true for music.
But at the same time, okay, and that's something we talked about with him, he learned a lot of things about business, right? Because music, if you want to survive in there as an industry, is brutal.
like, as an industry, it's brutal.
Like, you have to expose yourself.
Like, it's the definition of, like, exposing yourself
and getting rejected, right?
So I think that for the people who are there, there are many similarities in a way. The platform is not a keyboard and writing code, it's an instrument, but at the end they have things in common that are very important, like creativity, and also wanting to create something completely new and take it out there, to convince people that there's value in that, right? So that's one thing. The other thing is that he's amazingly good at expressing himself, and some very deep and complex,
let's say, concepts,
which I think is very important for anything that has to do with
all this craziness around AI.
And that's one of the reasons that I would
ask anyone to go and listen to him
because I think they are going to feel
much more confident around AI as a technology and how like it actually has
substance and value.
And we have also like some very interesting conversations about businesses,
right?
New business categories,
like new product categories that they are out there.
So please listen to him. And I hope that we are going to
have him back again and talk more about that because we can spend hours with him for sure.
I agree. Well, if you're interested in vector databases, embeddings, or sort of database
history in general, listen to the show, subscribe if you haven't, tell a friend, and of
course, we will catch you on the next one. We hope you enjoyed this episode of the Data Stack Show.
Be sure to subscribe on your favorite podcast app to get notified about new episodes every week.
We'd also love your feedback. You can email me, ericdodds, at eric at datastackshow.com. That's E-R-I-C at datastackshow.com.
The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.