Software Huddle - SQL Meets Vector Search with Linpeng Tang of MyScale
Episode Date: April 2, 2024
Welcome back to an episode where we're talking Vectors, Vector Databases, and AI with Linpeng Tang, CTO and co-founder of MyScale. MyScale is a super interesting technology. They're combining the best of OLAP databases with Vector Search. The project started back in 2019 where they forked ClickHouse and then adapted it to support Vector Storage, Indexing, and Search. The really unique and cool thing is you get the familiarity and usability of SQL with the power of being able to compare the similarity between unstructured data. We think this has really fascinating use cases for analytics well beyond what we're seeing with other vector database technology that's mostly restricted to building RAG models for LLMs. Also, because it's built on ClickHouse, MyScale is massively scalable, which is an area that many of the dedicated vector databases actually struggle with. We cover a lot about how vector databases work, why they decided to build off of ClickHouse, and how they plan to open source the database.
Timestamps
02:29 Introduction
06:22 Value of a Vector Database
12:40 Forking ClickHouse
18:53 Transforming ClickHouse into a SQL vector database
32:08 Data modeling
32:56 What data can be Vectorized
38:37 Indexing
43:35 Achieving Scale
46:35 Bottlenecks
48:41 MyScale vs other dedicated Vector Databases
51:38 Going Open Source
56:04 Closing thoughts
Transcript
How does MyScale compare to some of the dedicated vector databases?
So if you have some familiarity with SQL, then you basically need no learning curve with MyScale.
Do you think that basically all databases in some fashion need to be open source at some point?
Certainly not all. I think most, I would say. For more independent companies, I think open source
is often a good move. So you can get more feedback and you can get more use case, right?
And you can get more trust between you and the users, more technology transparency.
But frankly, you also need to think carefully about how to commercialize your product.
In terms of a vector, you mentioned at the beginning there
that essentially a vector can be the representation of an object
in sort of compute space. So can you just explain a little bit about
how that works? Like if I have an image, how do I essentially turn an image into
a vector that then represents that image?
Hey everyone, Sean here from Software Huddle. Welcome back to an episode where we're talking vectors,
vector databases and AI with Linpeng Tang,
CTO and co-founder of MyScale.
MyScale is a super interesting technology.
They're combining the best of OLAP databases
with vector search.
The project started back in 2019
where they forked ClickHouse
and then adapted it to support vector storage,
indexing, and search. The really unique and cool thing is you get the familiarity and usability of
SQL with the power of being able to compare the similarity between unstructured data. I think this
has really fascinating use cases for analytics well beyond what we're seeing with other vector
database technology that's mostly restricted to building RAG models for LLMs. Also because it's
built on ClickHouse,
MyScale is massively scalable, hence the name,
which is an area that many of the dedicated vector databases
actually struggle with.
We cover a lot about how vector databases work,
why they decided to build off of ClickHouse,
and how they plan to open source the database.
Before I get you over to the interview,
Alex and I will be in Miami in April
at the Shift Developer Conference doing interviews.
If you're in the area, you should come on by, see some great talks, say hi to us.
You can learn more about the conference at shift.infobip.com.us.
Okay, let's kick you over to my interview with Linpeng.
Linpeng, welcome to Software Huddle.
Hi.
Hey, Sean.
Nice to meet you.
Yeah.
Thanks for jumping on and doing this.
So you're the CTO and co-founder of MyScale, a SQL vector database.
So can you give me the backstory of MyScale?
How did it start?
Where are you guys today?
Yeah.
Actually, ever since I finished my PhD at Princeton, I have been doing
things related to machine learning, deep learning, and big data. And actually, we started this idea, I think, in 2019 or 2020, something like that.
And we have been doing this for the first four or five years.
But the SaaS version just launched last year.
And our MyScale company is headquartered in Singapore.
But we have a distributed team across
Asia, Africa,
and North America.
It seems like there's a couple different
companies in the vector database
space that have
had their headquarters in Asia.
Do you think, is there
something that led to
some of these companies leading the
forefront of the vector database technology coming out of Asia?
Or is it just a coincidence?
I think currently it's a hot area and we have different companies.
Pinecone is primarily in the U.S.
And Weaviate and Qdrant are in Europe.
And I think it's an important area.
So you see companies popping up all around, right?
Yeah, I guess there's complete geographical coverage by everybody.
So you went directly from completing your PhD to starting MyScale, right?
We did something related to machine learning and deep learning before, but it's also related to vectors.
And this product launched, like, 2019, 2020, something like that. Yeah. But I have been doing startups since my graduate school.
Okay. And then how was that transition, kind of, moving from working on academic projects? Like, how did you balance, I guess, sort of working on an academic project but also starting a company, and the transition to doing the company full-time?
Yeah, I think it's just a comparison
between working, say, in big corporates
and doing your own startup, right?
I was actually interning at Facebook for several years,
actually doing internship and contracts.
And, you know, it's a hyperscaler, right?
Because what you do, even if it's a small amount, you can affect, say, 1 billion or
2 billion people.
So actually, you can make a difference.
But in a startup, the scale is much smaller, right?
You can only affect a much smaller percentage of people,
but you can actually make a huge difference.
You can make the product 10 times better or even 100 times better.
So they are all possible.
So I think it's just different pros and cons.
And I think it's more interesting to work on 10x and 100x problems and hopefully to affect more people.
Yeah, absolutely. I think I'm there with you.
I think the impact of working for one of the big FANG companies
or one of the big companies is that a product you work on
or a feature you work on could legitimately be used by a billion people,
which is pretty cool.
But it's a lot harder to make these huge leaps in terms of, like, a hundred X improvement and so forth, because, you know, the product's been around for 20 years. A lot of those easy wins or improvements have already been made.
And then it's also harder, I think, to make company-level impact when you're in, you know, a hundred thousand person organization versus...
I totally feel you there.
Yeah, exactly. This is your baby.
You can do whatever you want.
But that also means there's also a lot of pressure on you.
Yeah, of course.
Of course.
But it's different choices, right?
Yeah, exactly. So let's get back to vector databases.
So before we get into the specifics of MyScale, what is sort of the value of a vector database?
Why are they so popular
and used in combination with LLM?
Just so that we can kind of set context
for people who are maybe
less familiar with this area.
So even before large language models, I think vector databases are pretty important in machine learning, because a vector is a universal representation for, like, images, audio, and also text.
So they were used in similarity search,
and also recommendation and other systems.
And I have been working on those areas before the LLM era. But since LLMs, because they are such an important and universal technology, and actually, the vector database basically serves as the external memory for the large language models. Because large language models, they have several issues, although they are very useful as such.
The knowledge is static, right?
So once you train the model,
they don't update until you get a newer version,
which might be half a year or one year from the initial version.
But you cannot get the latest knowledge.
And second, there are so many vertical areas of knowledge.
Say you are interested in finance or a knowledge base within an enterprise, the large language model cannot access that knowledge.
So basically, the vector database serves as a way to bind the large language model with this up-to-date and specific and vertical-area knowledge.
So ultimately, you get a more useful system.
So that's, that's a common pattern and that's why it's so universally useful.
Yeah.
In terms of a vector, you mentioned at the beginning there that essentially a
vector can be the representation of an object in sort of compute space.
So can you just explain a little bit about how that works?
Like if I have an image, how do I essentially turn an image into a vector that then represents that image?
Basically, you have different models. Before the transformers got popular, you were actually using, like, ConvNets or ResNets, so convolution-based architectures, and you take the pixels of the 2D image and then transform it through several convolutions and then to a 1D vector, something like that.
And nowadays, the transformer architecture, because it can take care of both the global correlations and also look at non-local correlations compared to the ConvNet.
So they are more powerful and
they are basically taking over both
the text, the images,
the videos. So nowadays people are
largely using transformer
architecture to transform
the different modalities of data into vectors.
And then semantically similar objects will be also nearby in the high dimensional vector
space and also they are often tuned for different purposes.
For example, when you are doing, like, question answering, you not only cluster similar paragraphs into nearby vectors, but you also train it so that the vector representation of the question is very close to the context for answering the question. So that's why it's so much more powerful than traditional, say, keyword-based searches like BM25 and such, because they can basically consider the semantic relationships between the question and the answer.
Right, so in terms of how this relates to LLMs, I'm using it as kind of like this external memory where I might have domain-specific
information or I might have information that's more sort of like real-time and the foundation
model doesn't necessarily need to be as up-to-date. It's kind of like I have things going on in my
mind, but then if I need additional information, I can go and Google additional information. It
serves as my external memory essentially, right. So sort of same idea in terms
of an LLM. But then in terms of the use of the vector database, essentially, we're taking a
bunch of objects, we're transforming them into vectors into this high-dimensional space. And
then the job of the vector database is to kind of organize those vectors into a way that we can figure out, okay, this object's
actually similar to this object, or for this question, these are the potential answers.
Is that right?
Yes, yes, yes.
So database basically provides you a way to manage and query vectors and other metadata,
right?
So database is such a technology that dates like 70 years ago, right?
And they have been moving around structured data,
then NoSQL data, then Big Data, and then SQL again,
and then now Vectors.
Yeah, so you see these traditional technologies.
But the basic theme is to manage query search, right?
So, yeah.
Yeah, I mean, like the relational database,
like you said, goes back to like,
I think the first commercial relational database
is from like the late 1970s.
And then SQL came, you know, shortly after that.
So it's been around, it's been with us for a long time.
And then it's really only been sort of the last decade or so
that we figured out ways of starting to sort of unlock
like unstructured data so that we can actually use it.
And it feels like the vector database
is kind of like the next natural evolution of that,
of being able to take things like videos and images
and actually make them useful
in a way that we can search, index them, and compare them.
Yes, yes, of course.
So it's a very, very, very exciting area.
We're seeing many, many potentials in this area, yeah.
Yeah, awesome.
Yeah, and I think that's probably why this is such an interesting area. And we have a lot of players that are starting to come to the forefront of the vector database space.
So I want to start to talk a little bit about the specifics of MyScale.
So ClickHouse is this super fast, open source, column-oriented DBMS.
And actually, recently, I took part in this 1 billion rows challenge back in January,
where the idea was to take a file that contains a billion rows about data from temperature
observations.
And the goal was to figure out the min, the max, and the average temperatures by location,
print that out in alphabetical order, and do that as fast as possible.
And the original contest was for Java programmers. So they wanted to see like,
how fast could you do this using just pure Java, the latest things in Java. But a bunch of people
started to do this with other types of technologies. Like I did it with Snowflake,
but I saw one person actually do it with ClickHouse. And without any kind of like
optimizations, they got super impressive results just running it on their local machine, which is pretty cool.
But I guess essentially MyScale is a fork of ClickHouse.
That's how it started.
Why did it make sense to essentially take this column-oriented, super-fast database that's for analytical store and turn that into a vector database.
Yeah. There are different ways of doing a vector database.
You see all these specialized vector databases such as Pinecone,
Weaviate, Qdrant, Milvus.
But when we started, we actually didn't only want to do a vector database, we wanted to do a database that's designed for AI.
And back then there was not even LLM; we wanted to do it for all AI data.
So that would include structured data, like JSON data, text, time series, geolocation, and others.
So they naturally include all the modalities because they're so widely used.
You see they are also useful in many situations.
So we didn't want to reinvent the wheel.
So we want to also take
existing technology and transform it into something greater.
So we want to do it based on a SQL database because I think,
mathematically speaking, SQL is very beautiful. It's like a relational algebra. So we want to do that.
And then there is the question of choosing whether to use a row storage database like PostgreSQL and MySQL, or a columnar storage. ClickHouse was a natural choice for that because it's a very fast, open-source, Apache-licensed column-based database.
And then when we started
looking at vectors,
we realized that
actually you need
a lot of fast scanning
in order to combine
structured data
and vector data well.
For example,
you want to first,
often you want to scan
on the structured data first,
very fast,
and then filter, and then find out a subset, and then run the search on the subset of the data.
So you see this pattern again and again.
So you want to do a local search, or you want to remove the non-related data, and something else.
So actually, column data is very bad at that.
Sorry, a row-based database is very bad at that
because each time you need to read the entire row, right?
But for ClickHouse, the column is compressed, and you have all the skipping indexes, and also you have the SIMD-like execution. So it's very fast, and you can store all the big data and query with SQL. So it's very nice.
So that's why we decided to implement based on ClickHouse.
And also, it has some disadvantages, because you cannot process the transactions well. If you need to have a lot of small transactions, then ClickHouse is very bad.
But we find actually for
AI workloads,
for most of the AI workloads, you don't need
that. So you
can actually work around that. But we
still need to optimize, say, for point
reads and such and such. So we actually
need to modify ClickHouse a bit to
suit the AI workloads better.
So yeah, that's the story.
Yeah, in terms of the transaction speed,
so there you're talking about, like,
transactions in terms of writes to the database.
Is that correct?
Yeah, so, like, a row-based database
is very fast at write transactions.
Yeah.
Yeah, but that's not, like, the most common use case,
I would think, in terms of, like, a vector database,
is you need to be
able to read and compare and basically find out that this object is similar to this other
object as quickly as possible.
And the writes probably happen in batch if we're talking about something like retrieval
augmented generation, or RAG.
Yeah.
Yeah, yeah, yeah.
And they don't modify the data often. They just do writes in batches. That's the most common workload. Yeah.
Yeah, so that changes.
That's why the row level wouldn't necessarily make sense.
So the ClickHouse, my understanding is it supports some level of vector search natively.
So I guess, did that already exist when you started MyScale?
When we started, it didn't exist yet.
Now, I think it has the experimental support for vector search,
but I don't think it's very mature yet.
We are actually doing even better than the mature vector databases, because we did so many optimizations on both the vector algorithms and also modified the SQL execution engine and the storage engine. So we are actually open sourcing the product very soon, I think by late March. So we'll see. Yeah. So you'll see the code.
And so actually we forked from ClickHouse, but the whole project, I think, is oriented at just the AI workloads.
So that's our differentiation with ClickHouse.
We are mostly interested in the vector and structure-related search and analytics.
And for the OLAP features, we actually contribute a lot of the features to the ClickHouse open community.
So I think there are also, because ClickHouse is such a great product, so I think there
are also other folks like us.
What was involved with actually transforming ClickHouse into something that could support
the vectors?
So first of all, you need vector algorithms that are suitable for the OLAP system. I think, for example, the graph-based vector algorithms are very popular, but they actually build very slowly and they also take a lot of resources. So I think that's not very good.
So actually, we only use the graph structure in our algorithm very consciously.
We would take the tree structure, which actually builds faster and does better at filtered search,
and combine it with the graph structure and design our own vector algorithm.
And also we fuse the algorithm with the SQL execution engine and
the storage engine. You also need a major modification like that. So, how do you query,
combine the structured data and vector in a single query? And the query can be very complex.
And sometimes you need to do the structured data filter first. Sometimes you need to do the vector search first. How do you do that?
And also the storage engine is also very complex, because for structured data, you basically have something like a merge tree, and structured data, they merge very fast. You can easily merge two data parts. But for vectors, when you do the merge, you also need to rebuild the vector index, at least some of the time.
But vector index rebuilding is very slow and takes a lot of resources. So how do you handle that?
So we need many modifications on both the execution and the storage. And also, in addition from vectors,
we are also adding many advanced search technologies.
We are adding inverted indexes for keyword search and other stuff.
So gradually, you deviate from the original OLAP
and add all these features.
So what's a situation where I want to join information
from the vector side of my database
with the non-vectorized data store
that's maybe traditionally organized
into tables, rows, and columns?
Yeah.
So there are many use cases.
So for example, you want to search for a restaurant,
say, near your current location, like say within
10 miles or something like that, then basically the geolocation is more like a structured
data.
We also see people want to search against very complex documents, and they have seen
financial documents, then they sometimes want
to filter on the company name,
which is like a
string filter, and then
do the
vector search.
And also
in many cases,
search is quite fuzzy, so
in many ways, you want
to narrow your search
in order to
get accurate results
and you see
all these combinations.
And in real case,
we also find
like the SQL data model
we think is quite useful
because
so recently we have been collaborating with our clients on these academic projects where they actually store and query hundreds of millions of papers. You actually come up with, like, 20 tables for this project. And they store it in MyScale, and sometimes you store some of the attributes with vectors and store others in other tables, and you do the joins, you do the updates.
So actually, you see in these real-world
examples, actually, you need the SQL semantics, right?
Just the vectors is not enough. And we see this again and again.
And we are seeing this is a norm in the future.
Right. Like in an e-commerce sense, I could do something like, I want to, I don't know, show me navy blue t-shirts that are within three miles from my home and are of this brand.
So then essentially three miles from my home is going to be a box based on the geolocation.
So I can do that using traditional database lookup. I can look up the brand, but then
navy blue, I might want to do
some sort of vector-based similarity search. Because if I asked a person that, and they
really knew where shirts are sold and were a retail guru, they might recognize, okay, well,
there's not a navy blue shirt at this place that's only a walking distance from your home,
but they have a baby blue shirt or something like that,
and they would give you that option, essentially.
But traditional search can't really perform that kind of operation very well.
Yeah, yeah, yeah, yeah.
So I think, like, Elasticsearch, they are also adding this functionality, but I don't think they perform well with all these complex filters and analytics.
And what I think is even more useful is to do not only search, but also do the analytics.
You want to analyze which styles people prefer in different age groups or different locations.
You can do all these analytics.
How do you do that with just a simple vector search?
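To make that concrete, here is a rough sketch of the kind of combined analytics query he is describing. The purchases table, its columns, and the distance() function are illustrative assumptions for this example, not MyScale's documented schema:

```sql
-- Hypothetical schema: a purchases table with structured columns plus an
-- image embedding per item. Count purchases of shirts similar to a reference
-- style, broken down by age group and city.
SELECT
    age_group,
    city,
    count() AS purchases
FROM purchases
WHERE category = 'shirt'
  -- distance() stands in for a vector-similarity function; the short literal
  -- vector stands in for the embedding of a reference "navy blue t-shirt".
  AND distance(style_embedding, [0.12, 0.07, 0.93]) < 0.3
GROUP BY age_group, city
ORDER BY purchases DESC;
```

The point is that the vector condition sits next to ordinary filters and aggregations, which is the kind of query a pure vector search API does not express.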
Can you walk me through what's actually happening
when I, at the query engine level, in the indexing level,
when I write a query like that, where I want to essentially
do a join across, or essentially doing a lookup
on a particular column data, but I'm also essentially
doing a similarity search across the vector store as well?
Yeah. So there are different execution paths.
For example, PGVector,
they do this post-filtering,
so they do the vector search first.
Then after you get the candidates,
they basically filter on the conditions.
But if the percentage of matching rows for the condition is small, then you might filter out almost all the results, or you might get very inaccurate results.
So that's why we benchmark against PGVector. ClickHouse is actually mostly pre-filtering.
Pre-filtering means you condition on the structured data first. You find the rows that are matching this condition, say 1% of the rows match, and then within that 1%, you basically compute a bitmap.
And sometimes it can be done very efficiently, especially if you organize the data. Like, according to this label, you have all these skipping indexes and such.
You can do it very efficiently.
And then you do the vector search within that small range.
And then the vector search over there needs to be very fast and very accurate with these almost arbitrary filtering conditions. And then you
return the results. So that's
how we usually do it. But because
SQL is very flexible, we also provide
the option if you want to do the post-filtering.
Sometimes that's useful.
If the conditioning is very
complex and costly. So
you see this flexibility with SQL
and Vector together.
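As a loose illustration of the two execution paths described here, the following two queries sketch pre-filtering and post-filtering over the same hypothetical documents table. The schema and the distance() function are assumptions made for the example, and a real engine chooses its own plan:

```sql
-- Pre-filtering sketch: restrict by structured columns first, then rank the
-- surviving rows by vector distance (internally the engine can use skipping
-- indexes plus the vector index on that subset).
SELECT id, title, distance(embedding, [0.1, 0.4, 0.2]) AS dist
FROM documents
WHERE company = 'ACME' AND published_at >= '2023-01-01'
ORDER BY dist ASC
LIMIT 10;

-- Post-filtering sketch: take the vector top-k first, then apply the
-- structured condition. If the filter is selective, most of the k candidates
-- can be thrown away, which is the accuracy problem described above.
SELECT id, title, dist
FROM
(
    SELECT id, title, company, published_at,
           distance(embedding, [0.1, 0.4, 0.2]) AS dist
    FROM documents
    ORDER BY dist ASC
    LIMIT 100
)
WHERE company = 'ACME' AND published_at >= '2023-01-01'
LIMIT 10;
```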
Okay, so basically you're reducing the search space in sort of the traditional case.
You're first reducing the search space by matching against sort of the traditional database, like columns and indices.
And then let's say we get like 100 potential matches there, then you're taking essentially the vector input
and comparing it against that reduced search space
of the hundred and figuring out
what is the most likely candidates.
And then you're going to have some level of similarity
between the input value for the vector search
and all the potential results.
So how do you know where to cut off
essentially the value of similarity
that is high enough to actually be
valuable to the search?
That's a good question.
So currently we
just provide the
top-k search.
So after narrowing down the
candidates, say 1%,
based on the metadata filtering, then
we do search and we just
return you, say, the top 10 or top 100.
And then you can decide whether those are useful or not based on the distance or something like that.
But also, I think sometimes it's also useful to provide, say, the range query.
Say you want to return all the results with similarity larger than, say, 0.7.
Then even if it's 1,000 or 10,000 results, you want to return all of that.
So I think that's also an interesting feature.
But we haven't actually done that in MyScale yet.
So the idea there would be,
I would essentially, as part of my query conditions, I could say,
check the similarity between the input vector and the results, and make sure that it's above
a certain value or certain threshold that makes sense for the query operation I'm running.
Yeah, yeah.
Usually, I think most of the use case is just a top-k search.
You just return the top-k. But sometimes you want to return all the results larger than a threshold. You can also mimic that with top-k, because if you find, say, 100 is not enough, then you can increase it to 200 or 400 or 1,000.
Right. So since the interface into MyScale is SQL, there's certain advantages that make it more accessible
to, say, a traditional data analyst that is used to working in SQL, where if they were working with
a different type of vector database, then they need to learn a new way of actually querying
those vector databases. That maybe doesn't feel natural, if you look at even in the warehousing space, you know, Bigtable originally was, or sorry, BigQuery didn't
actually support SQL, it supported like a domain specific language that Google had come up with.
And they, that was fine internally when it was an internal project. But then when they released it
to the public, it kind of, like, lost traction, essentially, because analysts were like,
well, I don't want to learn a new domain-specific language.
I've been working in SQL for a decade.
And then that essentially helped, I think, companies like Snowflake and so forth really corner that market.
So there's, I would think, some value, essentially, of enabling this kind of search to people who have been working in SQL forever that maybe is not available to
them with some of the other vector database
technologies?
I think for data analysis, SQL is the de facto programming language.
We have customers in financial that prefer SQL a lot. And also the other customer, in the news agency, I think it's the New York Times, they also analyze the government data. They also want to use SQL.
And also, because SQL is so widely used, the large language models know SQL very well.
So they can even help you translate from text to SQL.
So even people who don't want to write long, complex SQL queries, they can also do that. And for complex use cases, SQL is very powerful. Even Elasticsearch, which started from search with their own domain-specific language, they also gradually added a SQL-like interface to their system.
So I think SQL is pretty important.
And basically, you see SQL and then NoSQL and then NewSQL.
So SQL is coming back and taking, I think, a very large portion of the data management
system space.
Yeah.
I mean, even NoSQL, like the sort of traditional NoSQL databases, support SQL now.
So NoSQL is not really a valid name
for MongoDB and so forth anymore
because it actually supports SQL.
Every time someone tries to kill SQL
and bring something else in,
it ends up being pulled back in
because I think it's too widely used
and it's hard to shift away
from the existing momentum
and all the users that have been working in it forever.
Yeah, yeah. I think that's very true.
It's something designed like 50 years ago and still widely used.
And people just cannot get away.
But also, we are seeing that SQL is also incorporating NoSQL because you see this native support of JSON in many databases.
And also, you are seeing some sort of mix
between row and column
storage.
So we are seeing this
fusion of
different technologies and we are also
doing that. We are basically fusing
SQL with vectors.
Snowflake now is launching
Unistore, which brings essentially
the transactional database and the analytical database together in some fashion.
So in terms of, you know, MyScale, like how does data modeling work?
Am I creating a table that essentially contains traditional structured data as well as vectors?
Yeah, yeah, yeah.
So basically you have different columns. Like, you can have traditional structured data columns and you can have vector columns. You can even have more than one vector column, you can have two or three. And then you can decide you want to have three vector columns, and you can decide that you want to build indexes on two of them, and that's fine. And then as long as you build the index, then you can do very efficient approximate nearest neighbor search. Otherwise, you can only do brute-force search.
You scan the whole column, which is not very efficient.
So you can see you can blend all these data modalities
and different operations with SQL.
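A minimal sketch of that data model, written in ClickHouse-style DDL. The vector index clause is an assumption modeled loosely on MyScale's approach, so treat the exact syntax as illustrative:

```sql
-- One table mixing structured columns and vector columns.
CREATE TABLE products
(
    id        UInt64,
    brand     String,
    category  String,
    price     Float32,
    image_vec Array(Float32),   -- embedding of the product image
    text_vec  Array(Float32),   -- embedding of the product description
    CONSTRAINT img_len  CHECK length(image_vec) = 512,
    CONSTRAINT text_len CHECK length(text_vec) = 768
)
ENGINE = MergeTree
ORDER BY id;

-- Build an approximate-nearest-neighbor index on one (or more) of the vector
-- columns; without an index, similarity queries fall back to a brute-force
-- scan of the whole column. The index type name here is illustrative.
ALTER TABLE products ADD VECTOR INDEX image_idx image_vec TYPE MSTG;
```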
What kind of data can I vectorize?
Now, almost all unstructured or complex data can be vectorized, like images, videos, text, sound, time series.
Yeah, almost all.
Yeah, I think.
And basically, you also see this convergence of the algorithms.
Because they are now all using transformer
architecture.
And people are also talking about the ultimate algorithm.
I don't think the transformer is the ultimate algorithm.
I think it needs some improvement, but it's a good structure.
And also, you need the algorithm to adapt to the hardware well. You see, GPUs are getting faster and faster.
So basically, you need to adapt the algorithm to the GPU platform.
So you see this very beautiful fusion of the software, the algorithms, and the hardware,
and then coping with the unstructured data,
all coming together.
And we talked a little bit about
essentially determining the similarity between vectors
so that you can figure out,
okay, well, these are potential answers,
or this object is similar to this other object.
But in terms of determining similarity,
there's lots of different ways to essentially do that in vector space.
So what is the metric in MyScale in terms of determining
that two vectorized objects are actually similar?
Yeah.
So I think the most commonly used metrics for floating-based vectors,
they are the L2 distance and the IP distance, the inner product, and also the cosine
similarity. So basically you compare this after normalizing
the vectors, that's the floating vectors. And also for the binary
vectors, I think you can just compare the L1 distance or the
Jaccard distance. So that's, I think those are the most useful.
And you have other metrics.
I cannot ignore them all,
but those are the ones we also support in MyScale.
Yeah.
What is L1 and L2 distance?
L1 is just, like, the other name for L1 is called the Manhattan distance.
So think about the Manhattan street grid, and then you cannot travel in a diagonal, right? You can just go vertical and horizontal.
So that's the L1. And for L2, you can travel from one point to the other in a straight line.
Yeah, so that's L1 and L2 you can think about.
Okay, so it's like, are you traversing
a city grid versus essentially
I can go as the crow flies and go
direct from point to point?
Yeah.
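For reference, the metrics mentioned in this exchange have simple standard definitions (nothing MyScale-specific), which can be written out as:

```latex
% For two vectors a, b \in \mathbb{R}^d:
d_{L1}(a,b) = \sum_{i=1}^{d} |a_i - b_i|                       % Manhattan distance
d_{L2}(a,b) = \sqrt{\textstyle\sum_{i=1}^{d} (a_i - b_i)^2}    % Euclidean distance
\mathrm{IP}(a,b) = \sum_{i=1}^{d} a_i\, b_i                    % inner (dot) product
\cos(a,b) = \frac{\mathrm{IP}(a,b)}{\lVert a \rVert\,\lVert b \rVert}   % cosine similarity
```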
Is this a choice as a user of MyScale
that I'm making in terms of which similarity
metric I'm using, or is this
something that MyScale figures out
for me? Or essentially, what are the options there?
Actually, for those distances, it usually depends on the model that you are using.
I think nowadays, for the semantic retrieval, most of the models are using the cosine similarity.
And sometimes they are returning the normalized vectors.
So the squares of the elements of the vector sum up to 1.
And then in this case, the cosine similarity is actually the same as the L2 distance, in this sense. And also the dot product. They are all the same if you are using normalized vectors.
So that simplifies things a lot.
So, yeah.
So I don't think
this is an issue.
But you need to figure it out.
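To spell out the simplification being described: if the vectors are normalized to unit length, these metrics agree in how they rank results, since for unit vectors

```latex
\lVert a - b \rVert_2^2
  = \lVert a \rVert^2 + \lVert b \rVert^2 - 2\,(a \cdot b)
  = 2 - 2\,(a \cdot b)
  \qquad \text{when } \lVert a \rVert = \lVert b \rVert = 1
```

so ranking by smallest L2 distance, largest dot product, or largest cosine similarity gives the same order for unit-length vectors.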
I think the other simplification
we made that
we made the algorithm
parameter very simple.
So if you use
the vector algorithms
like IVF or HNSW, you have
all these parameters that you cannot figure out why, but we actually made our in-house algorithm
very simple. So you actually don't need to configure anything when you build index.
And we provide just one single knob. We call it alpha. Alpha is from one to four. And if you choose four, then you are very accurate,
but a bit slower.
And if you choose one, it's very fast,
but you lose a bit of accuracy.
And you can do trade-off in between.
So I think that's actually one of the pain
points of many of the vector search libraries
and also the database.
So you really need to make the...
Because users, there's so many complexities
behind these different algorithms,
and you actually want to just provide
very simple tuning knobs for the users.
Yeah, I mean, it's like there was a time
with even regular relational databases
where you needed a DBA that was doing
a lot of performance tuning and so forth. But now with most databases and managed services, unless you're reaching a certain
like, you know, like you're really pushing the limits of the database, you can kind of use a lot
of the out of the box settings. And for most people, that's like good enough until they reach
a certain level. And the ideal, of course, in the world of vector databases, those much newer would
be that a lot of those things are offloaded to me because
I'm not an expert in how to
tune these things. I just want to get
my RAG model working or something like that.
Yeah, yeah, yeah. So that's why we
simplify most of the things for our users.
Yeah.
And then,
I want to talk about the indexing.
So in a conventional database,
you're adding an index to a column
so I can avoid a full table scan
when I'm doing a lookup or partial match.
With vectors, what's the inner workings
of creating a vector index?
Yeah, it's somewhat very similar
to the traditional structured data indexing.
Because you basically do some sketches and
clusterings of the vectors. So you have different clusterings, like you can compute
discrete clusters. Say you have one million data, then you cluster the data into, say,
1,000 clusters. And then when you have a query, you have the locality, right? You don't need to search against all the clusters, maybe you just search, like, 10 or 5 percent of the nearest clusters. Then you can just get a speedup of, like, 10x or 20x, right?
And for the graph structure, you are basically doing greedy descent. So you start from one point, from a query point, and then you traverse through the neighbors. So you basically build the index according to the locality of the vectors to accelerate your computation.
And you can also do compression, right?
You can do product quantization or other technologies.
So you don't need to do computation on the full precision of the floating vectors, but only do an approximate computation, only on a sketch of the
original data. So that's also much faster. And often you need to employ the combination of
these different technologies and then you can get to the acceleration of say 100x or even
like 1000 times faster or more efficient. So that's why, if you're dealing with vector data
or a large quantity of vectors,
you need to have a vector database
and vector index.
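A back-of-the-envelope version of the cluster-based speedup described above, using the numbers from the example (illustrative arithmetic only):

```latex
% Assumed numbers: n = 10^6 vectors, c = 1000 clusters, probing p = 50 clusters (5%):
\text{brute force: } n = 1\,000\,000 \text{ distance computations}
\text{cluster-based: } c + n \cdot \tfrac{p}{c} \approx 1\,000 + 50\,000 = 51\,000
\text{speedup: } \approx 1\,000\,000 / 51\,000 \approx 20\times
```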
And then, so for the graph approach,
like essentially I'm building a graph
where essentially, you know,
one node is going to be connected
to other nodes that are similar to it.
And then I'm going to walk that graph,
essentially the index of that graph,
like a breadth-first search,
where it's going to cut off any of the branching
when it gets too far away from whatever the similarity threshold is.
And then with the cluster approach, am I having,
let's say I split up all the vectors into 10 different clusters to start with.
And then within those clusters,
are there essentially a subset of clusters as well?
So if I determine that, okay,
cluster five is the place where I want to start the search.
And like essentially within there,
is there a subset of clusters as well
that I'm then going to like branch down into?
Like once I get a cluster,
how does the next part of the index lookup work?
Yeah, I think for a large amount of data, the hierarchical clustering can help. For smaller
amount of data, say just even hundreds of thousands, then just one level of cluster
would work fine. And for the popular graph-based algorithms, like the HNSW algorithm, you also have this hierarchical structure: you basically start from the upper layer, and at each layer you find the nearest point to your query, and then you go down to the next level. So you can see this hierarchical structure again and again in different algorithms.
Yeah, and that's kind of similar to the B-tree index
that I might use in a standard relational database.
And then in terms of what MyScale is doing,
are you using one of these particular techniques
or are you doing something unique
where you're combining a couple of these different things?
We are actually combining different algorithms in our in-house algorithm, and we are also doing more innovations. Like, we are actually combining the trees and the graphs, and also we are combining memory and NVMe disk for different parts of the data, because it naturally forms a hierarchy, as we just mentioned. So we can maximize the performance while minimizing the resources, because vector data, they are
very large and cost a lot of resources.
By doing this, we can save the resources.
That's why you can see in our open source benchmark that we are much more cost efficient
than even the specialized vector databases.
Often like three times, four times, or even sometimes ten times more cost efficient for the simple vector search and also the filter search cases.
How does scaling work? Do you end up having to
shard the vector indexes across multiple nodes or basically multiple machines?
That's also a very good question.
So actually today's machines, they get very large.
You can have hundreds of gigabytes of memory.
And especially when we put the data on disk files, you have terabytes of space on very fast NVMe SSDs.
So even one machine, on one machine, we can host more than one billion vectors.
So one machine, more than one billion vectors. So I think actually a single machine would work for
most of the use cases. However, like if you are say building a search engine with MyScale, which we are actually
helping some of our customers doing right now, then they need hundreds of billions of vectors.
Then we need tens of machines or even hundreds of machines distributedly to build this search engine.
And then we need data sharding.
And then when you do data sharding, there are also many ways you can do that.
You can shard according to time or according to certain topics,
or even you can cluster the data,
and then you can put different shards according to the clustering.
So you can arrange the data in different ways and
you can accelerate
by exploiting
this data locality.
So it's very
flexible. So we are utilizing
the data partitioning and sharding
very mature technology in SQL database.
And then is that something
when I'm using the managed service,
like essentially how those shards get created, is that offloaded for me?
So for the managed service, you need to specify the data partition manually, actually. And we have a guide to tell you how to do that. So for example, one of our users, they are actually providing a very popular
online chat service. Then they partition their vector data according to the user ID. And then
within each partition, they also order by the user ID and the session ID. So you get very good
locality. So the messages of one session is stored continuously, and then you can do very fast search for that
session data or for the user data.
So actually you need to specify that manually, but we also have some guides telling you how
to do that.
For very large scale systems like building a search engine or something like that, then
definitely you need the experts to tell you how to do that.
So if you want to do that, we can chat more.
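A hedged sketch of the chat-service layout he describes, using ClickHouse-style PARTITION BY and ORDER BY clauses. The table and column names are made up for illustration, and the modulo bucketing of user_id is an added assumption to keep the partition count bounded:

```sql
-- Chat messages partitioned by user and clustered by (user_id, session_id),
-- so one session's rows and their embeddings sit together on disk and can be
-- scanned quickly for that user or session.
CREATE TABLE chat_messages
(
    user_id    UInt64,
    session_id UInt64,
    ts         DateTime,
    body       String,
    body_vec   Array(Float32)
)
ENGINE = MergeTree
PARTITION BY user_id % 128          -- bucket users so the partition count stays bounded
ORDER BY (user_id, session_id, ts);
```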
Yeah.
Yeah.
But you're getting into sort of rarefied air in terms of probably
the complexity of scale there.
But where do people kind of run into bottlenecks with vector database?
I think a lot of the traditional, or maybe they're not traditional since they're not that old, but sort of like the vector databases like Pinecone and stuff, they reach certain scale issues where they might run into certain issues, either from a scale perspective or just from certain, I don't know, tripwires that they
should be aware of?
So for just a simple vector search, I think we already optimize for our users very well.
So most of the case, they don't need to worry about that.
But the question comes in when you need to do the complex queries.
So say you want to filter on certain metadata first and do the vector search,
or you want to return also multiple columns of data and then also do joins and such.
And then the query can become slow.
They can easily drop from hundreds of queries per second to even, say, 10 or 20 queries per second.
And this is actually quite normal for the OLAP cases
because for like analytic processing,
you can easily run a query like for several seconds
and that's very normal, right?
So, and if you want to optimize for this query,
actually you need to model your data very carefully.
Like, how do you put your data into different tables? Which data should be in one table? Which indexes should you put on, how do you partition the data, what are the primary keys?
So if you want to optimize for these queries, then you need to design your data model
very, very carefully. And also if you are building, say, a large-scale system with very high QPS,
then you also need more careful tuning, like how do you do data sharding,
how do you do the partitioning.
So actually, you cannot avoid that.
We cannot do it for all, but we are also,
we hope to make it more friendly for our users.
And then how does MyScale compare to some of the dedicated vector databases? We've touched on a little bit of this, but in terms of Pinecone, Milvus, Chroma, Vespa, and so on,
as a user, how is that experienced? What's the experience difference in the value that you bring?
If you have some familiarity with SQL, then you basically need no learning curve.
You need no learning curve with MyScale.
Because I think many people have learned SQL in high school or in their career.
So then it's very familiar.
You just need, like, one hour, two hours, to create a table,
build index, and do the search.
And it's very flexible, right?
You can do all the structured data
and non-structured data together.
You can have different tables.
So it's a very powerful tool.
And you will see it's very different
from just a specialized database.
So if you just want a very simple vector search,
I think those APIs are fine.
And you have this, I say, designed like open APIs.
But if you want more complex data modeling, you want to do complex queries,
then I think those databases, they just cannot do that.
Then you need the hassle of managing data in different systems.
You need synchronization and you need to worry about
performance issues because you are dealing with different systems. So then I think a SQL Vector
database will help you a lot to resolve those issues. And if you want to build your application
to be future-proof, if you really want to build a more advanced application, then I think the area will gradually move towards SQL vector, because I don't see any inherent difficulty with this paradigm, and I can see many of these advantages.
Yeah, it seems like more and more of the databases that we're all familiar with, whether they're essentially a row level or a column store,
are adding vector support in some fashion.
Postgres now has that.
There's a lot of different traditional databases that are adding that.
So it feels like that is the natural direction that people might go in
because of some of the things you're talking about.
It unlocks use cases that you couldn't really necessarily solve
with just a purely dedicated vector database.
So if you have a very specific problem that requires only vectors,
then maybe the traditional vector database makes sense.
But if you need to do sort of this more nuanced, complex analysis
across both structured data, unstructured data,
where I think a lot of people are probably going to live, you need basically support for both of those and SQL feels like a natural interface to that.
Yeah, I think if you are going beyond those toy projects, you basically need that. You cannot get away without it.
So you're releasing MyScale as an open source project, you mentioned, I think towards the end of March.
What kind of motivated that move? Why open source it now?
Yeah, because we have been working on this for the past
few years. Actually, we actually iterated the design quite a bit.
I think the query engine has been rewritten four times by now. So actually, it's not such...
Yeah, it looks easy now, but it was actually quite a detour, because you need to figure things out right.
And so currently, I think we are...
I think we have reached a design
that we are pleased with.
And we actually want more users. Because with the SaaS service, for many users, you don't need to manage the service yourself. But some of our, like, engineers say they actually want just to run the software on their laptop or on their server. They want lower latency. So actually, we want them to use MyScale as well, and we want their feedback so we can continuously improve the product.
And if they want to contribute, that's even better.
We, of course, want a healthy open-source community.
We're all around that.
So I think the Software Huddle audience, I think many of the audience belong to this category.
So we very much welcome you to try MyScale, either as a SaaS or just pull the image
or even build the software yourself
to just try it out.
And I think you can feel the power
of SQL and Vector together.
I think you'll feel something very new, right?
So compared to the specialized Vector database.
So I think that's good.
Yeah, I'm super interested in actually trying it out
with some of,
like I do a lot of data analysis stuff
in my day job
for figuring out
the performance of our company,
how are we doing as a business,
go-to-market opportunities
and things like that.
And a lot of times
that requires,
I think, combining
both structured and unstructured data
and that hasn't really been something
that we've been able to
really take
advantage of. So I think, you know, super interested in trying it out myself.
Yeah.
Do you think that, you know,
I feel like more and more databases, probably the majority, I'm kind of guessing here, but I feel like they're open source.
And I think over the last, you know,
20 years we've seen that with programming languages as well.
I think it's pretty hard to have a closed source programming language at this point.
Even things like C Sharp that came out of Microsoft 25 years ago is open source now.
Do you think that this is basically like all databases in some fashion need to be open source at some point in order to continue to contribute and grow and be part of the zeitgeist?
I think certainly not all. I think most, I would say. I think those products supported, especially supported by big corporates like Oracle or Microsoft, because they have so many customers and such a large engineering team, they can still afford to be closed source, right? Because you have all these features and you have the large engineering team.
But for more independent companies, I think open source is often a good move.
So you can get more feedback and you can get more use case, right?
And you can get more trust between you and the users, more technology transparency. But frankly,
you also need to think carefully about how to commercialize your product. So actually,
we open source, I think, more than 95% of the code, but some of the more advanced features
are still only available in our SaaS version
or enterprise version. But we also still hope to bring value and transparency
to our users. And we also try to
commercialize by providing better support
to our users. So that's how
also, I think, many of the database companies
try to make profit.
Yeah, I mean, I think that's the trick, right?
I think lots of people,
especially that come from engineering backgrounds,
see value in open sourcing their technology,
but can you essentially monetize it
and turn it into a business?
It's also a tricky beast.
So as we start to wrap up, is there anything else you'd like to share?
Yeah, I think it's a fascinating
area. I want to maybe chat more about
AGI and that stuff. I think you
also think a lot about that because we have chatted so much
on this with our friends.
So
human brains, they have, like, I think the neuron parameters are, like, 100 times larger than GPT-4, on that scale.
And I think Sam Altman just
said that they are going to reach GPT-5
I think probably 10 times larger than GPT-4, maybe half a year or maybe three months from now.
And then I think GPT-6 or GPT-7, maybe very soon, like maybe three years or
five years.
So we are seeing also many problems, like for instance, it's very costly to run the
large models.
And it's not very efficient to just crunch all those parameters into your GPU memory.
So we are actually exploring ways to actually offload much of the knowledge from the large language model parameters to a very efficient vector database.
And also to do some co-optimization of the model and the database. So that's where I think you can get 10x or even more efficiency
if you do it properly with a very fast vector database
and also some co-optimization.
So I think that's an interesting route that we are taking.
And basically, you actually need to build a search engine
actually using your database because you are actually, the large models, they are actually embedding a search engine, the web data in the model, right?
So they are training on tens of terabytes of web data, right?
And that's one thing that I want to point out. And the other thing that you need to think carefully about the relationship between
like AGI and human because they basically coexist, right? So how can you turn, how can you make this
a healthy relationship, right? I think the AGI, they are going to be very powerful, but how do
you find, like, how do most people find a purpose? And how do they
contribute to this very powerful system, right? And how does this system just not,
they need to be an open system, right? They cannot just generate fake images or tell fake stories, right?
I think they need to connect
better to the real
world and then take feedback
from humans. And
this bridge is actually
data, right? Data is a bridge
between real
world and digital world.
And so that's why actually
databases, I think, they have a purpose, a bigger role to play here. And we want to build this bridge to be a much
stronger one than current. So then I think human and AGI, they can have a
more harmonious interplay. You can inject your knowledge
or your sensory data
into the system
and also get feedback.
And then we can...
I think people will find more purpose
and this will be a healthier system
than a closed-loop system
just relying on the model parameters.
So I think that's the other thing
that the data people want to carefully think about.
I think that almost all IT people
should think about this nowadays.
That's just my two cents,
but I'm just thinking about these issues nowadays.
And I think we really need to jump out
of our narrow domains to think about this issue, because it's not far away.
It's just going to happen in just three or five years.
Yeah, I tend to agree with that.
I think I had Bob Muglia, the former CEO of Snowflake, on a few months ago.
And in his book, he predicts AGI by 2030.
So, you know, we're looking at, you know,
five to six years essentially from now.
And I think there's some work to be done.
There's a lot of challenges to figure out,
like, you know, everything from bias and ethics.
But, you know, to your point,
like data is really the love language of AI.
You can't build these large language models without really good data.
And the database is going to play a major role in the current evolution of AI and sort of the future evolution as well.
So maybe we'll have you back sometime down the road just to do an AGI-focused episode.
I think that would be fun.
Yeah, probably. Yeah, yeah. I think that would be fun.
Yeah, hopefully.
Yeah, yeah.
Thank you for the invitation.
Okay.
Yeah.
All right.
Linpeng, thank you so much for being here.
I really enjoyed it. And best of luck with open sourcing MyScale.
Yeah, yeah, sure.
Thank you.