Software Huddle - SQL Meets Vector Search with Linpeng Tang of MyScale

Episode Date: April 2, 2024

Welcome back to an episode where we're talking Vectors, Vector Databases, and AI with Linpeng Tang, CTO and co-founder of MyScale. MyScale is a super interesting technology. They're combining the best of OLAP databases with Vector Search. The project started back in 2019 where they forked ClickHouse and then adapted it to support Vector Storage, Indexing, and Search. The really unique and cool thing is you get the familiarity and usability of SQL with the power of being able to compare the similarity between unstructured data. We think this has really fascinating use cases for analytics well beyond what we're seeing with other vector database technology that's mostly restricted to building RAG models for LLMs. Also, because it's built on ClickHouse, MyScale is massively scalable, which is an area that many of the dedicated vector databases actually struggle with. We cover a lot about how vector databases work, why they decided to build off of ClickHouse, and how they plan to open source the database.

Timestamps:
02:29 Introduction
06:22 Value of a Vector Database
12:40 Forking ClickHouse
18:53 Transforming ClickHouse into a SQL vector database
32:08 Data modeling
32:56 What data can be Vectorized
38:37 Indexing
43:35 Achieving Scale
46:35 Bottlenecks
48:41 MyScale vs other dedicated Vector Databases
51:38 Going Open Source
56:04 Closing thoughts

Transcript
Starting point is 00:00:00 How does MyScale compare to some of the dedicated vector databases? So if you have some familiarity with SQL, then you basically need no learning curve with MyScale. Do you think that basically all databases in some fashion need to be open source at some point? Certainly not all. I think most, I would say. For more independent companies, I think open source is often a good move. So you can get more feedback and you can get more use cases, right? And you can get more trust between you and the users, more technology transparency. But frankly, you also need to think carefully about how to commercialize your product. In terms of a vector, you mentioned at the beginning there
Starting point is 00:00:48 that essentially a vector can be the representation of an object in sort of compute space. So can you just explain a little bit about how that works? Like if I have an image, how do I essentially turn an image into a vector that then represents that image? Hey everyone, Sean here from Software Huddle. Welcome back to an episode where we're talking vectors, vector databases and AI with Linpeng Tang, CTO and co-founder of MyScale. MyScale is a super interesting technology.
Starting point is 00:01:15 They're combining the best of OLAP databases with vector search. The project started back in 2019 where they forked ClickHouse and then adapted it to support vector storage, indexing, and search. The really unique and cool thing is you get the familiarity and usability of SQL with the power of being able to compare the similarity between unstructured data. I think this has really fascinating use cases for analytics well beyond what we're seeing with other vector
Starting point is 00:01:38 database technology that's mostly restricted to building RAG models for LLMs. Also because it's built on ClickHouse, MyScale is massively scalable, hence the name, which is an area that many of the dedicated vector databases actually struggle with. We cover a lot about how vector databases work, why they decided to build off of ClickHouse, and how they plan to open source the database.
Starting point is 00:02:00 Before I get you over to the interview, Alex and I will be in Miami in April at the Shift Developer Conference doing interviews. If you're in the area, you should come on by, see some great talks, say hi to us. You can learn more about the conference at shift.infobip.com. Okay, let's kick you over to my interview with Linpeng. Linpeng, welcome to Software Huddle. Hi.
Starting point is 00:02:19 Hey, Sean. Nice to meet you. Yeah. Thanks for jumping on and doing this. So you're the CTO and co-founder of MyScale, a SQL vector database. So can you give me the backstory of MyScale? How did it start? Where are you guys today?
Starting point is 00:02:34 Yeah. Actually, ever since I finished my PhD at Princeton in computer science, I have been doing things related to machine learning, deep learning, and big data. And actually, we started this idea, I think, in 2019 or 2020, something like that. And we have been doing this for the past four or five years. But the SaaS version just launched last year. And our MyScale company is headquartered in Singapore. But we have a distributed team across Asia, Africa,
Starting point is 00:03:08 and North America. It seems like there's a couple different companies in the vector database space that have had their headquarters in Asia. Do you think, is there something that led to some of these companies leading the
Starting point is 00:03:24 forefront of the vector database technology coming out of Asia? Or is it just a coincidence? I think currently it's a hot area and we have different companies. Pinecone is primarily in the U.S. And Weaviate and Qdrant are in Europe. And I think it's an important area. So you see companies popping up all around, right? Yeah, I guess there's complete geographical coverage by everybody.
Starting point is 00:03:52 So you went directly from completing your PhD to starting MyScale, right? We did something related to machine learning and deep learning before, but it's also related to vectors. And this product launched, like, 2019, 2020, something like that. Yeah. But I have been doing startups since my graduate school. Yeah. Okay. And then how was that transition, kind of like moving from more, like, you know, working on academic projects? Like, how did you balance, I guess, sort of working on an academic project but also starting a company
Starting point is 00:04:25 and the transition to doing the company full-time? Yeah, I think it's just a comparison between working, say, in big corporates and doing your own startup, right? I was actually interning at Facebook for several years, actually doing internship and contracts. And, you know, it's a hyperscaler, right? Because what you do, even if it's a small amount, you can affect, say, 1 billion or
Starting point is 00:04:49 2 billion people. So actually, you can make a difference. But in a startup, the scale is much smaller, right? You can only affect a much smaller percentage of people, but you can actually make a huge difference. You can make the product 10 times better or even 100 times better. So they are all possible. So I think it's just different pros and cons.
Starting point is 00:05:19 And I think it's more interesting to work on 10x and 100x problems and hopefully to affect more people. Yeah, absolutely. I think I'm more interesting to work on 10x and 100x problems and hopefully to affect more people. Yeah, absolutely. I think I'm there with you. I think the impact of working for one of the big FANG companies or one of the big companies is that a product you work on or a feature you work on could legitimately be used by a billion people, which is pretty cool. But it's a lot harder to make these huge
Starting point is 00:05:45 leaps in terms of like a 100x improvement and so forth, because, you know, the product's been around for 20 years. A lot of those easy wins or improvements have already been made. And then it's also harder, I think, to make like company-level impact when you're in, you know, a hundred-thousand-person organization versus... I totally feel you there. Yeah, exactly. This is your baby. You can do whatever you want.
Starting point is 00:06:11 But that also means there's also a lot of pressure on you. Yeah, of course. Of course. But it's different choices, right? Yeah, exactly. So let's get back to vector databases. So before we get into the specifics of MyScale, what is sort of the value of a vector database? Why are they so popular and used in combination with LLM?
Starting point is 00:06:29 Just so that we can kind of set context for people who are maybe less familiar with this area. So even before large language models, I think vector databases were pretty important in machine learning, because a vector is a universal representation
Starting point is 00:06:40 for things like images, audio, and also text. So they were used in similarity search, and also recommendation and other systems. And I have been working on those areas since before the LLM era. But since LLMs, because it's such an important and universal technology, the vector database basically serves as the external memory for the large language models,
Starting point is 00:07:11 because large language models have several issues, although they are very useful as such. The knowledge is static, right? So once you train the model, it doesn't update until you get a newer version, which might be half a year or one year from the initial version. But you cannot get the latest knowledge. And second, there are so many vertical areas of knowledge.
Starting point is 00:07:35 Say you are interested in finance, or a knowledge base within an enterprise, the large language model cannot access that knowledge. So basically, the vector database serves as a way to bind the large language model with this up-to-date, specific, vertical-area knowledge. So ultimately, you get a more useful system. So that's a common pattern, and that's why it's so universally useful.
Starting point is 00:08:11 Yeah. In terms of a vector, you mentioned at the beginning there that essentially a vector can be the representation of an object in sort of compute space. So can you just explain a little bit about how that works? Like if I have an image, how do I essentially turn an image into a vector that then represents that image? Basically, you have different models. Before the transformers got popular, you were actually using, like, ConvNets or ResNets, so convolution-based architectures, mainly, and you take the pixels of the 2D image and then transform it through several convolutions and then to a 1D vector, something like that.
Starting point is 00:08:53 And nowadays, the transformer architecture, because it can take care of both the global correlations and also look at non-local correlations compared to the ConvNet, they are more powerful, and they are basically taking over text, images, and videos. So nowadays people are
Starting point is 00:09:19 largely using the transformer architecture to transform the different modalities of data into vectors. And then semantically similar objects will also be nearby in the high-dimensional vector space, and also they are often tuned for different purposes. For example, when you are doing, like, question answering, you not only cluster similar paragraphs into nearby vectors, but you also
Starting point is 00:09:50 train it so that the vector representation of the question is very close to the context for answering the question. So that's why it's more powerful than traditional, say, keyword-based searches like BM25 and such, because they can basically consider the semantic relationships between the question and the answer. So in terms of how this relates to LLMs, I'm using it as kind of like this external memory where I might have domain-specific
Starting point is 00:10:25 information or I might have information that's more sort of like real-time and the foundation model doesn't necessarily need to be as up-to-date. It's kind of like I have things going on in my mind, but then if I need additional information, I can go and Google additional information. It serves as my external memory essentially, right. So sort of same idea in terms of an LLM. But then in terms of the use of the vector database, essentially, we're taking a bunch of objects, we're transforming them into vectors into this high-dimensional space. And then the job of the vector database is to kind of organize those vectors into a way that we can figure out, okay, this object's actually similar to this object, or for this question, these are the potential answers.
Starting point is 00:11:11 Is that right? Yes, yes, yes. So a database basically provides you a way to manage and query vectors and other metadata, right? The database is such a technology that dates back like 70 years, right? And they have been moving from structured data, then NoSQL data, then big data, and then SQL again, and now vectors.
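A minimal sketch of what turning text into vectors and comparing them can look like in practice, assuming the sentence-transformers library and the all-MiniLM-L6-v2 model purely as stand-ins for whatever embedding model you use; any model that returns normalized vectors behaves the same way:

```python
# Sketch only: embed a question and two passages, then compare them.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # stand-in embedding model
texts = [
    "How do I reset my password?",                                        # the question
    "To reset your password, open Settings and choose Forgot password.",  # relevant passage
    "Our quarterly revenue grew 12% year over year.",                     # unrelated passage
]
vecs = model.encode(texts, normalize_embeddings=True)  # unit-length vectors

# With normalized vectors, cosine similarity is just a dot product.
sims = vecs[1:] @ vecs[0]
print(sims)  # the relevant passage should score noticeably higher than the unrelated one
```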
Starting point is 00:11:36 Yeah, so you see these traditional technologies. But the basic theme is to manage query search, right? So, yeah. Yeah, I mean, like the relational database, like you said, goes back to like, I think the first commercial relational database is from like the late 1970s. And then SQL came, you know, shortly after that.
Starting point is 00:11:54 So it's been around, it's been with us for a long time. And then it's really only been sort of the last decade or so that we figured out ways of starting to sort of unlock like unstructured data so that we can actually use it. And it feels like the vector database is kind of like the next natural evolution of that, of being able to take things like videos and images and actually make them useful
Starting point is 00:12:15 in a way that we can search, index them, and compare them. Yes, yes, of course. So it's a very, very, very exciting area. We're seeing many, many potentials in this area, yeah. Yeah, awesome. Yeah, and I think that's probably why this is such an interesting area. And we have a lot of players that are starting to come to the forefront of the vector database space. So I want to start to talk a little bit about the specifics of MyScale. So ClickHouse is this super fast, open source, column-oriented DBMS.
Starting point is 00:12:45 And actually, recently, I took part in this 1 billion rows challenge back in January, where the idea was to take a file that contains a billion rows about data from temperature observations. And the goal was to figure out the min, the max, and the average temperatures by location, print that out in alphabetical order, and do that as fast as possible. And the original contest was for Java programmers. So they wanted to see like, how fast could you do this using just pure Java, the latest things in Java. But a bunch of people started to do this with other types of technologies. Like I did it with Snowflake,
Starting point is 00:13:16 but I saw one person actually do it with ClickHouse. And without any kind of optimizations, they got super impressive results just running it on their local machine, which is pretty cool. But I guess essentially MyScale is a fork of ClickHouse. That's how it started. Why did it make sense to essentially take this column-oriented, super-fast database that's an analytical store and turn that into a vector database? Yeah. There are different ways of doing a vector database. You see all these specialized vector databases such as Pinecone, Weaviate, Qdrant, Milvus.
Starting point is 00:13:57 But when we started, we actually not only wanted to do a vector database, we wanted to do a database that's designed for AI. And back then there was not even LLM; we wanted to do it for all AI data. So that would include structured data, like JSON data, text, time series, geolocation, and others. So they naturally include all the modalities because they're so widely used. You see they are also useful in many situations. So we didn't want to reinvent the wheel. So we wanted to also take
Starting point is 00:14:31 existing technology and transform it into something greater. So we wanted to do it based on a SQL database because I think, mathematically speaking, SQL is very beautiful. It's like relational algebra. So we wanted to do that. And then there was the question of choosing either, like, a row-storage database like PostgreSQL and MySQL, or a columnar storage. And ClickHouse was a natural choice for that because it's a very fast, open-source, Apache-licensed
Starting point is 00:15:05 columnar database. And then when we started looking at vectors, we realized that actually you need a lot of fast scanning in order to combine structured data
Starting point is 00:15:16 and vector data well. For example, often you want to scan on the structured data first, very fast, and then filter and find out a subset, and then run the search on the subset of the data.
Starting point is 00:15:30 So you see this pattern again and again. So either you want to do a local search, or you want to remove the non-related data, or something else. So actually, column data is very... sorry, a row-based database is very bad at that, because each time you need to read the entire row, right? But for ClickHouse, the column is compressed and you have all the skipping indexes,
Starting point is 00:15:55 and also you have the SIMD-like execution. So it's very fast, and you can store all the big data and query with SQL. So it's very nice. So that's why we decided to implement based on ClickHouse. It also has some disadvantages, because you cannot process transactions well. If you need to have a lot of small transactions,
Starting point is 00:16:21 then ClickHouse is very bad. But we find actually for most of the AI workloads, you don't need that. So you can actually work around that. But we still need to optimize, say, for point reads and such. So we actually
Starting point is 00:16:38 need to modify ClickHouse a bit to suit the AI workloads better. So yeah, that's the story. Yeah, in terms of the transaction speed, so there you're talking about, like, transactions in terms of writes to the database. Is that correct? Yeah, so, like, a row-based database
Starting point is 00:16:54 is very fast at write transactions. Yeah. Yeah, but that's not, like, the most common use case, I would think, in terms of, like, a vector database, is you need to be able to read and compare and basically find out that this object is similar to this other object as quickly as possible. And the writes probably happen in batch if we're talking about something like retrieval
Starting point is 00:17:16 augmented generation, or RAG. Yeah, yeah, yeah. And they don't modify the data often. They just do writes in batches. That's the most common workload. Yeah, so that changes things. That's why the row level wouldn't necessarily make sense.
Starting point is 00:17:30 So the ClickHouse, my understanding is it supports some level of vector search natively. So I guess, did that already exist when you started MyScale? When we started, it didn't exist yet. Now, I think it has experimental support for vector search, but I don't think it's very mature yet. We are actually doing even better than the mature vector databases, because we did so many optimizations on both the vector algorithms and also modified the SQL execution engine and the storage engine. We are actually open-sourcing
Starting point is 00:18:13 the product very soon. I think by late March. So we'll see. Yeah. So you'll see the code. And so actually we forked from ClickHouse, but the whole project, I think, is oriented toward just the AI workloads. So that's our differentiation with ClickHouse. We are mostly interested in the vector and structured-data-related search and analytics. And for the OLAP features, we actually contribute a lot of the features back to the ClickHouse open-source community. Because ClickHouse is such a great product, I think there are also other folks like us. What was involved with actually transforming ClickHouse into something that could support
Starting point is 00:18:59 the vectors? So first of all, you need a vector algorithm that's suitable for the OLAP system. I think, for example, the graph-based vector algorithms are very popular, but they actually build very slowly and they also take a lot of resources. So I think that's not very good. So actually, we only use the graph structure in our algorithm very cautiously. We would combine the tree structure, which actually builds faster and does better at filtered search, and combine it with the graph structure and design our own vector algorithm. And also we fuse the algorithm with the SQL execution engine and
Starting point is 00:19:46 the storage engine. You also need major modifications like that. So, how do you combine the structured data and vectors in a single query? And the query can be very complex. Sometimes you need to do the structured data filter first. Sometimes you need to do the vector search first. How do you do that? And also the storage engine is very complex, because for structured data, you basically have something like a merge tree, and structured data, they merge very fast. You can easily merge two data parts. But for vectors, when you do the merge, you also need to rebuild the vector index, at least some of the time. And vector index rebuilding is very slow and takes a lot of resources. So how do you handle that? So we need many modifications on both the execution and the storage. And also, in addition to vectors,
Starting point is 00:20:45 we are also adding many advanced search technologies. We are adding inverted indexes for keyword search and other stuff. So gradually, you deviate from the original OLAP system and add all these features. So what's a situation where I want to join information from the vector side of my database with the non-vectorized data store that's maybe traditionally organized
Starting point is 00:21:14 into tables, rows, and columns? Yeah. So there are many use cases. So for example, you want to search for a restaurant, say, near your current location, like say within 10 miles or something like that; then basically the geolocation is more like structured data. We also see people who want to search against very complex documents. They have, say,
Starting point is 00:21:43 financial documents, then they sometimes want to filter on the company name, which is like a string filter, and then do the vector search. And also in many cases,
Starting point is 00:21:59 search is quite fuzzy, so in many ways, you want to narrow your search in order to get accurate results and you see all these combinations. And in real case,
Starting point is 00:22:10 we also find like the SQL data model we think is quite useful because so recently we have been collaborating with our
Starting point is 00:22:20 like clients on these academic projects where they actually store and query hundreds of millions of papers. You actually come up with like 20 tables for this project. And they store it in MyScale, and sometimes you store some of the attributes with vectors and store others in other tables, and you do the joins, you do the updates.
Starting point is 00:22:41 So actually, you see in these real-world examples, actually, you need the SQL semantics, right? Just the vectors is not enough. And we see this again and again. And we are seeing this as the norm in the future. Right. Like in an e-commerce sense, I could do something like, I don't know, show me navy blue t-shirts that are within three miles from my home and are of this brand. So then essentially three miles from my home is going to be a box based on the geolocation. So I can do that using a traditional database lookup. I can look up the brand, but then navy blue, I might want to do
Starting point is 00:23:26 some sort of vector-based similarity search. Because if I asked a person that, and they really knew where shirts are sold and were a retail guru, they might recognize, okay, well, there's not a navy blue shirt at this place that's only walking distance from your home, but they have a baby blue shirt or something like that, and they would give you that option, essentially. But traditional search can't really perform that kind of operation very well. Yeah, yeah, yeah, yeah. So I think, like, Elasticsearch, they are also adding this functionality,
Starting point is 00:23:57 but I don't think they perform well with all these complex filters and analytics. And what I think is even more useful is to do not only search, but also the analytics. You want to analyze, say, which styles people prefer in different age groups or different locations. You can do all these analytics. How do you do that with just a simple vector search?
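To make the navy-blue-t-shirt example above concrete, here is a rough sketch of what such a combined query can look like, driven from Python with clickhouse-connect (MyScale speaks the ClickHouse protocol). The table, column names, and embedding dimension are invented, and the vector-index and distance() syntax is approximated from MyScale's documentation, so treat this as a sketch rather than copy-paste code:

```python
import clickhouse_connect

client = clickhouse_connect.get_client(
    host="your-cluster.myscale.com", port=8443, username="user", password="***"
)

# A table mixing structured columns with a vector column (384 dimensions assumed).
client.command("""
CREATE TABLE IF NOT EXISTS products (
    id        UInt64,
    brand     String,
    lat       Float64,
    lon       Float64,
    title     String,
    embedding Array(Float32),
    CONSTRAINT emb_len CHECK length(embedding) = 384
) ENGINE = MergeTree ORDER BY id
""")
client.command("ALTER TABLE products ADD VECTOR INDEX emb_idx embedding TYPE MSTG")

query_vec = [0.0] * 384  # in practice: the embedding of "navy blue t-shirt"
rows = client.query(f"""
SELECT id, title, brand, distance(embedding, {query_vec}) AS dist
FROM products
WHERE brand = 'SomeBrand'
  AND geoDistance(lon, lat, -122.41, 37.77) < 4800   -- roughly 3 miles, in meters
ORDER BY dist ASC
LIMIT 10
""").result_rows
print(rows)
```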
Starting point is 00:24:22 Can you walk me through what's actually happening when I, at the query engine level, in the indexing level, when I write a query like that, where I want to essentially do a join across, or essentially doing a lookup on a particular column data, but I'm also essentially doing a similarity search across the vector store as well? Yeah. So there are different execution paths. For example, PGVector,
Starting point is 00:24:50 they do this post-filtering, so they do the vector search first. Then after you get the candidates, they basically filter on the conditions. But if the percentage of rows matching the condition is small, then you might filter out almost all the results, or you might get very inaccurate results. So that's what we see when we benchmark PGVector. MyScale is actually mostly pre-filtering. Pre-filtering means you condition on the structured data first: you find the rows that are matching this condition, say 1% of the rows match, and then within that 1%, then you basically
Starting point is 00:25:40 compute a bitmap. And sometimes it can be done very efficiently, especially if you organize the data, like according to this label; you have all these skipping indexes and such. You can do it very efficiently. And then you do the vector search within that small range. And your vector search over there needs to be very fast and very accurate with these almost arbitrary filtering conditions. And then you
Starting point is 00:26:06 return the results. So that's how we usually do it. But because SQL is very flexible, we also provide the option if you want to do the post-filtering. Sometimes that's useful, if the condition is very complex and costly. So you see this flexibility with SQL and vectors together.
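A toy illustration of the pre-filtering versus post-filtering trade-off just described, using plain NumPy in place of a real index; the numbers are made up, and a real ANN index would replace the brute-force scan:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 100_000, 64, 10
vectors = rng.normal(size=(n, d)).astype(np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)   # normalize once
category = rng.integers(0, 100, size=n)                     # structured column, ~1% per value
query = vectors[rng.integers(n)]

def top_k(candidates, k):
    sims = vectors[candidates] @ query            # cosine similarity (vectors are unit length)
    return candidates[np.argsort(-sims)[:k]]

# Post-filtering (vector search first, filter after): may return far fewer than k rows.
ann_hits = top_k(np.arange(n), 200)               # pretend the ANN index returned 200 hits
post = ann_hits[category[ann_hits] == 7][:k]

# Pre-filtering (filter on the structured column first, search only that subset).
subset = np.where(category == 7)[0]
pre = top_k(subset, k)

print(len(post), "rows via post-filtering;", len(pre), "rows via pre-filtering")
```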
Starting point is 00:26:22 and Vector together. Okay, so basically you're reducing the search space in sort of the traditional case. You're first reducing the search space by matching against sort of the traditional database, like columns and indices. And then let's say we get like 100 potential matches there, then you're taking essentially the vector input and comparing it against that reduced search space of the hundred and figuring out what is the most likely candidates. And then you're going to have some level of similarity
Starting point is 00:26:54 between the input value for the vector search and all the potential results. So how do you know where to cut off essentially the value of similarity that is high enough to actually be valuable to the search? That's a good question. So currently we
Starting point is 00:27:11 just provide the top-k search. So after narrowing down the candidates, say to 1%, based on the metadata filtering, then we do the search and we just return you, say, the top 10 or top 100. And then you can decide whether those are useful or not based on the distance or something like that.
Starting point is 00:27:37 But also, I think sometimes it's also useful to provide, say, the range query. Say you want to return all the results with similarity larger than, say, 0.7. Then even if it's 1,000 or 10,000 results, you want to return all of that. So I think that's also an interesting feature. But we haven't actually done that in MyScale yet. So the idea there would be,
Starting point is 00:28:02 I would essentially, as part of my query conditions, I could say, check the similarity between the input vector and the results, and make sure that it's above a certain value or certain threshold that makes sense for the query operation I'm running. Yeah, yeah. Usually, I think most of the use cases are just a top-k search. You just return the top-k. But sometimes you want to return all the results larger than a threshold. You can also mimic that with top-k, because if you find, say, 100 is not enough, then you can increase it to 200 or 400 or 1,000. I would think that because the interface into MyScale is SQL, there's certain advantages that make it more accessible to, say, a traditional data analyst that is used to working in SQL, where if they were working with a different type of vector database, then they need to learn a new way of actually querying
Starting point is 00:28:59 those vector databases. That maybe doesn't feel natural, if you look at even in the warehousing space, you know, Bigtable originally was, or sorry, BigQuery didn't actually support SQL, it supported like a domain specific language that Google had come up with. And they, that was fine internally when it was an internal project. But then when they released it to the public, it was kind of a like, lost traction, essentially, because analysts were like, well, I don't want to learn a new domain-specific language. I've been working in SQL for a decade. And then that essentially helped, I think, companies like Snowflake and so forth really corner that market. So there's, I would think, some value, essentially, of enabling this kind of search to people who have been working in SQL forever that maybe is not available to
Starting point is 00:29:46 them with some of the other vector database technologies? For data analysis, I think SQL is the de facto programming language. We have customers in finance that prefer SQL a lot, and also
Starting point is 00:30:01 another customer, a news agency, I think it's the New York Times, they also analyze government data. They also want to use SQL. And also, because SQL is so widely used, the large language models know SQL very well. So they can even help you translate from text to SQL.
Starting point is 00:30:21 So even people who don't want to write long, complex SQL queries, they can also do that. And for complex use cases, SQL is very powerful. Even Elasticsearch, which started from search with their own domain-specific language, also gradually added a SQL-like interface to their system. So I think SQL is pretty important. And basically, you see SQL and then NoSQL and then NewSQL. So SQL is coming back and taking, I think, a very large portion of the data management system space. Yeah.
Starting point is 00:30:59 I mean, even NoSQL, like, the sort of traditional NoSQL databases support SQL now. So NoSQL is not really a valid name for MongoDB and so forth anymore because it actually supports SQL. Every time someone tries to kill SQL and bring something else in, it ends up being pulled back in because I think it's too widely used
Starting point is 00:31:19 and it's hard to shift away from the existing momentum and all the users that have been working in it forever. Yeah, yeah. I think that's very true. It's something designed like 50 years ago and still widely used. And people just cannot get away. But also, we are seeing that SQL is also incorporating NoSQL, because you see this native support of JSON in many databases. And also, you are seeing some sort of mix
Starting point is 00:31:46 between row and column storage. So we are seeing this fusion of different technologies, and we are also doing that. We are basically fusing SQL with vectors. Snowflake now is launching
Starting point is 00:32:02 Unistore, which brings essentially the transactional database and the analytical database together in some fashion. So in terms of, you know, MyScale, like how does data modeling work? Am I creating a table that essentially contains traditional structured data as well as vectors? Yeah, yeah, yeah. So basically you have different columns. Like you can have traditional structured data columns and you can have vector columns. You can even have more than one vector column. You can have two or three. And then you can decide
Starting point is 00:32:30 you want to have three vector columns, and you can decide that you want to build indexes on two of them, and that's fine. And then as long as you build the index, then you can do very efficient approximate nearest neighbor search. Otherwise, you can only do brute-force search: you scan the whole column, which is not very efficient. So you can see you can blend all these data modalities and different operations with SQL. What kind of data can I vectorize? Now, almost all unstructured or complex data can be vectorized, like images, videos, text, sound, time series.
Starting point is 00:33:14 Yeah, almost all. Yeah, I think. And basically, you also see this convergence of the algorithms. Because they are now all using transformer architecture. And people are also talking about the ultimate algorithm. I don't think the transformer is the ultimate algorithm. I think it needs some improvement, but it's a good structure.
Starting point is 00:33:39 And also, you need the algorithm to adapt to the hardware well. You see GPUs are getting faster and faster. So basically, you need to adapt the algorithm to the GPU platform. So you see this very beautiful fusion of software, algorithms and the hardware, and then coping with the unstructured data, all coming together. And we talked a little bit about essentially determining the similarity between vectors
Starting point is 00:34:11 so that you can figure out, okay, well, these are potential answers, or this object is similar to this other object. But in terms of determining similarity, there's lots of different ways to essentially do that in vector space. So what is the metric in MyScale in terms of determining that two vectorized objects are actually similar? Yeah.
Starting point is 00:34:32 So I think the most commonly used metrics for floating-point vectors, they are the L2 distance and the IP distance, the inner product, and also the cosine similarity. So basically you compare these after normalizing the vectors; that's for the floating-point vectors. And also for the binary vectors, I think you can just compare the L1 distance or the Jaccard distance. So I think those are the most useful. And there are other metrics, I cannot enumerate them all,
Starting point is 00:35:08 but those are the ones we also support in MyScale. Yeah. What is L1 and L2 distance? L1 is just, like, the other name for L1 is the Manhattan distance. So think about the Manhattan grid, and then you cannot travel on a diagonal, right? You can just move vertically and horizontally.
Starting point is 00:35:31 So that's the L1. And for L2, you can travel from one point to the other in a straight line. Yeah, so that's L1 and L2, you can think about it that way. Okay, so it's like, are you traversing a city grid versus essentially I can go as the crow flies and go direct from point to point? Yeah. Is this a choice as a user of MyScale
Starting point is 00:35:58 that I'm making in terms of which similarity metric I'm using, or is this something that MyScale figures out for me, or essentially what are that my scale figures out for me? Or essentially, what are the options there? Actually, for those distances, it usually depends on the model that you are using. I think nowadays, for the semantic retrieval, most of the models are using the cosine similarity. And sometimes they are returning the normalized vectors.
Starting point is 00:36:29 So the squares of the elements of the vector sum up to 1. And then in this case, the cosine similarity is effectively the same as the L2 distance. And also for the dot product, they are all the same if you are using normalized vectors.
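A small worked example of the metrics being discussed, L1 (Manhattan), L2 (Euclidean), inner product, and cosine, and of why the choice stops mattering for ranking once vectors are normalized; NumPy is used purely for illustration:

```python
import numpy as np

a = np.array([0.6, 0.8])        # already unit length
b = np.array([1.0, 0.0])        # also unit length

l1  = np.abs(a - b).sum()       # Manhattan distance: only horizontal/vertical moves
l2  = np.linalg.norm(a - b)     # Euclidean distance: straight line between the points
ip  = a @ b                     # inner (dot) product
cos = ip / (np.linalg.norm(a) * np.linalg.norm(b))

# For unit-length vectors: ||a - b||^2 = 2 - 2 * (a . b), so ranking neighbors by L2
# distance, by inner product, or by cosine similarity all give the same order.
print(l1, l2, ip, cos, 2 - 2 * ip, l2 ** 2)
```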
Starting point is 00:36:49 So that simplifies things a lot. So, yeah. So I don't think this is an issue. But you need to figure it out. I think the other simplification we made is that we made the algorithm
Starting point is 00:37:00 parameters very simple. So if you use vector algorithms like IVF or HNSW, you have all these parameters that you cannot really figure out, but we actually made our in-house algorithm very simple. So you actually don't need to configure anything when you build the index. And we provide just one single knob. We call it alpha. Alpha is from one to four. And if you choose four, then you are very accurate, but a bit slower.
Starting point is 00:37:26 And if you choose one, it's very fast, but you lose a bit of accuracy. And you can do trade-off in between. So I think that's actually one of the pain points of many of the vector search libraries and also the database. So you really need to make the... Because users, there's so many complexities
Starting point is 00:37:48 behind these different algorithms, and you actually want to just provide very simple tuning knobs for the users. Yeah, I mean, it's like there was a time with even regular relational databases where you needed a DBA that was doing a lot of performance tuning and so forth. But now with most databases and managed services, unless you're reaching a certain like, you know, like you're really pushing the limits of the database, you can kind of use a lot
Starting point is 00:38:14 of the out-of-the-box settings. And for most people, that's like good enough until they reach a certain level. And the ideal, of course, in the world of vector databases, which are much newer, would be that a lot of those things are offloaded for me, because I'm not an expert in how to tune these things. I just want to get my RAG model working or something like that. Yeah, yeah, yeah. So that's why we simplify most of the things for our users.
Starting point is 00:38:37 Yeah. And then, I want to talk about the indexing. So in a conventional database, you're adding an index to a column so I can avoid a full table scan when I'm doing a lookup or partial match. With vectors, what's the inner workings
Starting point is 00:38:54 Yeah, it's somewhat similar to the traditional structured data indexing, because you basically do some sketches and clusterings of the vectors. So you have different clusterings; like, you can compute discrete clusters. Say you have one million data points, then you cluster the data into, say, 1,000 clusters. And then when you have a query, you have the locality, right? You don't need to search against all the clusters; maybe you just search, like, 10
Starting point is 00:39:31 or 5 percent of the nearest clusters, and then you can just get a speedup of like 10x or 20x, right? And for the graph structure, you are basically doing greedy descent. So you start from one point, from a query point, and then you traverse through the neighbors, and then you basically build the index according to the locality of the vectors to accelerate your computation. And you can also do compression, right? You can do product quantization or other technologies, so you don't need to do computation on the full precision of the floating-point vectors, but only do a partial computation on only a sketch of the original data. So that's also much faster. And often you need to employ a combination of these different technologies, and then you can get to an acceleration of, say, 100x or even
Starting point is 00:40:39 like 1,000 times faster or more efficient. So that's why, if you're dealing with vector data or a large quantity of vectors, you need to have a vector database and vector index. And then, so for the graph approach, like essentially I'm building a graph where essentially, you know, one node is going to be connected
Starting point is 00:40:57 to other nodes that are similar to it. And then I'm going to walk that graph, essentially the index of that graph, like a breadth-first search, where it's going to cut off any of the branching when it gets too far away from whatever the similarity threshold is. And then with the cluster approach, am I having, let's say I split up all the vectors into 10 different clusters to start with.
Starting point is 00:41:22 And then within those clusters, are there essentially a subset of clusters as well? So if I determine that, okay, cluster five is the place where I want to start the search. And like essentially within there, is there a subset of clusters as well that I'm then going to like branch down into? Like once I get a cluster,
Starting point is 00:41:40 how does the next part of the index lookup work? Yeah, I think for a large amount of data, the hierarchical clustering can help. For smaller amounts of data, say even just hundreds of thousands, then just one level of clusters would work fine. And for the popular graph-based algorithms, like the HNSW algorithm, you also have this hierarchical structure: you basically start from the upper layer, and at each layer you find the nearest point to your query, and then you go down to the next level. So you can see this hierarchical structure again and again in different algorithms. Yeah, and that's kind of similar to the B-tree index that I might use in a standard relational database. And then in terms of what MyScale is doing,
Starting point is 00:42:35 are you using one of these particular techniques or are you doing something unique where you're combining a couple of these different things? We are actually combining different algorithms in our in-house algorithm, and we are also doing more innovations. Like, we are actually combining the trees and the graphs, and also we are combining memory and also NVMe disk for different parts of the data, because it naturally forms a hierarchy, as we just mentioned. So we can maximize the performance while minimizing the resources, because vector data, they are
Starting point is 00:43:10 very large and cost a lot of resources. By doing this, we can save the resources. That's why you can see in our open-source benchmark that we are much more cost-efficient than even the specialized vector databases. Often like three times, four times, or even sometimes ten times more cost-efficient for the simple vector search and also the filtered search cases. How does scaling work? Do you end up having to shard the vector indexes across multiple nodes or basically multiple machines? That's also a very good question.
Starting point is 00:43:47 So actually today's machines, they get very large. You can have hundreds of gigabytes of memory. And especially when we put the data on disk, you have terabytes of space on very fast NVMe SSDs. So even on one machine, we can host more than one billion vectors. One machine, more than one billion vectors. So I think actually a single machine would work for most of the use cases. However, if you are, say, building a search engine with MyScale, which we are actually helping some of our customers do right now, then they need hundreds of billions of vectors.
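As a rough back-of-envelope for why NVMe matters at that scale (the 768-dimension figure below is an assumption; dimensions vary by embedding model):

```python
n_vectors = 1_000_000_000          # one billion vectors
dim = 768                          # assumed embedding dimension
bytes_per_vector = dim * 4         # float32
raw_tb = n_vectors * bytes_per_vector / 1e12
print(f"{raw_tb:.2f} TB of raw vectors")   # ~3.07 TB before any index overhead,
# far more than typical RAM, which is why keeping part of the data and index on
# fast NVMe SSD is what makes a single machine feasible.
```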
Starting point is 00:44:35 Then we need tens of machines or even hundreds of machines distributedly to build this search engine. And then we need data sharding. And then when you do data sharding, there are also many ways you can do that. You can shard according to time or according to certain topics, or even you can cluster the data, and then you can put different shards according to the clustering. So you can arrange the data in different ways and you can accelerate
Starting point is 00:45:07 by exploiting this data locality. So it's very flexible. So we are utilizing data partitioning and sharding, very mature technologies in SQL databases. And then is that something, when I'm using the managed service,
Starting point is 00:45:23 like essentially how those shards get created, Is that like offloaded for me? So for the managed service, like you can, you need to specify the data partition, like manually actually. And we have a guide to tell you how to do that. So for example, one of our users say they are actually providing a very popular online chat service. Then they partition their vector data according to the user ID. And then within each partition, they also order by the user ID and the session ID. So you get very good locality. So the messages of one session is stored continuously, and then you can do very fast search for that session data or for the user data. So actually you need to specify that manually, but we also have some guides telling you how to do that.
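A minimal sketch of the chat-history layout just described. The table and column names are invented, and the DDL is standard ClickHouse-style syntax as MyScale inherits it; check MyScale's docs for the exact form:

```python
import clickhouse_connect

client = clickhouse_connect.get_client(
    host="your-cluster.myscale.com", port=8443, username="user", password="***"
)

# Partitioning by user id range and ordering by (user_id, session_id, ts) keeps one
# session's messages stored contiguously, so filtered vector searches touch few parts.
client.command("""
CREATE TABLE IF NOT EXISTS chat_messages (
    user_id    UInt64,
    session_id UInt64,
    ts         DateTime,
    message    String,
    embedding  Array(Float32),
    CONSTRAINT emb_len CHECK length(embedding) = 384
) ENGINE = MergeTree
PARTITION BY intDiv(user_id, 1000000)
ORDER BY (user_id, session_id, ts)
""")
```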
Starting point is 00:46:16 For very large scale systems like building a search engine or something like that, then definitely you need the experts to tell you how to do that. So if you want to do that, we can try more. Yeah. Yeah. But you're getting into sort of rarefied air in terms of probably the complexity of scale there. But where do people kind of run into bottlenecks with vector database?
Starting point is 00:46:42 I think a lot of the traditional, or maybe they're not traditional since they're not that old, but sort of like the vector databases like Pinecone and stuff, they reach certain scale issues where they might run into certain issues, either from a scale perspective or just from certain, I don't know, tripwires that they should be aware of? So for just a simple vector search, I think we already optimize for our users very well. So most of the case, they don't need to worry about that. But the question comes in when you need to do the complex queries. So say you want to filter on certain metadata first and do the vector search, or you want to return also multiple columns of data and then also do joins and such. And then the query can become slow.
Starting point is 00:47:36 They can easily drop from hundreds of queries per second to even, say, 10 or 20 queries per second. And this is actually quite normal for the OLAP cases, because for analytical processing, you can easily run a query for several seconds, and that's very normal, right? So, if you want to optimize for this query, actually you need to model your data very carefully. Like, how do you put your data
Starting point is 00:48:03 into different tables? Which data should be in one table? Which indexes should you put on? How do you partition the data? What are the primary keys? So if you want to optimize for these queries, then you need to design your data model very, very carefully. And also if you are building, say, a large-scale system with very high QPS, then you also need more careful tuning, like how do you do data sharding, how do you do the partitioning. So actually, you cannot avoid that. We cannot do it all for you, but
Starting point is 00:48:41 we also hope to make it more friendly for our users. And then how does MyScale compare to some of the dedicated vector databases? We've touched on a little bit of this, but in terms of Pinecone, Milvus, Chroma, Vespa, and so on, as a user, how is that experienced? What's the experience difference, and the value that you bring? If you have some familiarity with SQL, then you basically need no learning curve. You need no learning curve with MyScale. Because I think many people have learned SQL in high school or in their career. So then it's very familiar. It takes like one hour, two hours, to just create a table,
Starting point is 00:49:21 build index, and do the search. And it's very flexible, right? You can do all the structured data and non-structured data together. You can have different tables. So it's a very powerful tool. And you will see it's very different from just a specialized database.
Starting point is 00:49:40 So if you just want a very simple vector search, I think those APIs are fine. And they have these, I'd say, nicely designed open APIs. But if you want more complex data modeling, you want to do complex queries, then I think those databases, they just cannot do that. Then you need the hassle of managing data in different systems. You need synchronization, and you need to worry about performance issues because you are dealing with different systems. So then I think a SQL vector
Starting point is 00:50:13 database will help you a lot to resolve those issues. And if you want to build your application to be future-proof, if you really want to build a more advanced application, I think the area will gradually move towards SQL plus vector, because I don't see any inherent difficulty with this paradigm and I can see many of its advantages. Yeah, it seems like more and more of the databases that we're all familiar with, whether they're essentially a row-level or a column store, are adding vector support in some fashion. Postgres now has that. There's a lot of different traditional databases that are adding that.
Starting point is 00:50:57 So it feels like that is the natural direction that people might go in because of some of the things you're talking about. It unlocks use cases that you couldn't really necessarily solve with just a purely dedicated vector database. So if you have a very specific problem that requires only vectors, then maybe the traditional vector database makes sense. But if you need to do sort of this more nuanced, complex analysis across both structured data, unstructured data,
Starting point is 00:51:27 where I think a lot of people are probably going to live, you need basically support for both of those, and SQL feels like a natural interface to that. Yeah, I think if you are going beyond those toy projects, you basically need that. You cannot get away without it. So you're releasing MyScale as an open source project, you mentioned, I think towards the end of March. What kind of motivated that move? Why open source it now? Yeah, because we have been working on this for the past few years. Actually, we iterated the design quite a bit. I think the core engine has been rewritten four times by now. So actually, it's not such...
Starting point is 00:52:04 Yeah, it looks easy now, but it actually took quite a few detours, because you need to figure things out right. And so currently, I think we have reached a design that we are pleased with. And we actually want more users, because with the SaaS service,
Starting point is 00:52:25 for many users, you don't need to manage the service yourself. But some of our users, like the engineers, they actually just want to run the software on their laptop or on their server. They want lower latency. So actually, we want them
Starting point is 00:52:41 to use MyScale as well, and we want their feedback so we can continuously improve the product. And if they want to contribute, that's even better. We, of course, want a healthy open-source community. We're all about that. So for Software Huddle's audience, I think many of the audience belong to this category. So we very much welcome you to try MyScale, either as a SaaS, or just pull the image
Starting point is 00:53:06 or even build the software yourself to just try it out. And I think you can feel the power of SQL and Vector together. I think you'll feel something very new, right? So compared to the specialized Vector database. So I think that's good. Yeah, I'm super interested in actually trying it out
Starting point is 00:53:24 with some of, like I do a lot of data analysis stuff in my day job for figuring out the performance of our company, how are we doing as a business, go-to-market opportunities and things like that.
Starting point is 00:53:36 And a lot of times that requires, I think, combining both structured and unstructured data and that hasn't really been something that we've been able to really take advantage of. So I think, you know, super interested in trying it out myself.
Starting point is 00:53:49 Yeah. Do you think that, you know, I feel like more and more databases, probably the majority, I'm kind of guessing here, but it feels like they're open source. And I think over the last, you know, 20 years we've seen that with programming languages as well. I think it's pretty hard to have a closed-source programming language at this point. Even things like C Sharp that came out of Microsoft 25 years ago are open source now.
Starting point is 00:54:13 Do you think that this is basically like all databases in some fashion need to be open source at some point in order to continue to contribute and grow and be part of the zeitgeist? I think certainly not all. I think most, I would say. I think those products that are supported by big corporates like Oracle or Microsoft, because they have so many customers and such a large engineering team, they can still afford to be closed source, right? Because you have all these features and you have the large engineering team. But for more independent companies, I think open source is often a good move. So you can get more feedback and you can get more use cases, right?
Starting point is 00:54:59 And you can get more trust between you and the users, more technology transparency. But frankly, you also need to think carefully about how to commercialize your product. So actually, we open source, I think, more than 95% of the code, but some of the more advanced features are still only available in our SaaS version or enterprise version. But we still hope to bring value and transparency to our users. And we also try to commercialize by providing better support to our users. So that's how
Starting point is 00:55:43 I think many of the database companies try to make a profit. Yeah, I mean, I think that's the trick, right? I think lots of people, especially that come from engineering backgrounds, see value in open sourcing their technology, but can you essentially monetize it and turn it into a business?
Starting point is 00:56:02 It's also a tricky beast. So as we start to wrap up, is there anything else you'd like to share? Yeah, I think it's a fascinating area. I want to maybe chat more about AGI and that stuff. I think you also think a lot about that, because we have chatted so much about this with our friends. So
Starting point is 00:56:26 human brains, they have, like, I think the number of neuron parameters is like 100 times larger than GPT-4, on that scale. And I think Sam Altman just said that they are going to reach GPT-5,
Starting point is 00:56:44 I think probably 10 times larger than GPT-4, maybe half a year or maybe three months from now. And then I think GPT-6 or GPT-7, maybe very soon, like maybe three years or five years. So we are also seeing many problems; like, for instance, it's very costly to run the large models. And it's not very efficient to just crunch all those parameters into your GPU memory. So we are actually exploring ways to offload much of the knowledge from the large language model parameters to a very efficient vector database, and also to do some co-optimization of the model and the database. So that's where I think you can get 10x or even more efficiency
Starting point is 00:57:47 if you do it properly with a very fast vector database and also some co-optimization. So I think that's an interesting route that we are taking. And basically, you actually need to build a search engine using your database, because the large models, they are actually embedding a search engine, the web data, in the model, right? So they are training on tens of terabytes of web data, right? And that's one thing that I want to point out. And the other thing is that you need to think carefully about the relationship between, like, AGI and humans, because they basically coexist, right? So how can you make this
Starting point is 00:58:35 a healthy relationship, right? I think the AGI, they are going to be very powerful, but how do most people find a purpose? And how do they contribute to this very powerful system, right? And this system, it needs to be an open system, right? They cannot just generate fake images or tell fake stories, right? I think they need to connect better to the real world and then take feedback from humans. And
Starting point is 00:59:13 this bridge is actually data, right? Data is a bridge between the real world and the digital world. And so that's why actually databases, I think, they have a purpose, a bigger role to play here. And we want to build this bridge to be a much stronger one than the current one. So then I think humans and AGI, they can have a
Starting point is 00:59:41 more harmonious interplay. You can inject your knowledge or your sensory data into the system and also get feedback. And then we can... I think people will find more purpose, and this will be a healthier system than a closed-loop system
Starting point is 01:00:02 just relying on the model parameters. So I think that's the other thing that data people want to think carefully about. I think that almost all IT people should think about this nowadays. That's just my two cents, but I am just thinking about these issues nowadays. And I think we really need to jump out
Starting point is 01:00:24 of our narrow domains, I think, about this issue because it's not far away. It's just going to happen in just three or five years. Yeah, I tend to agree with that. I think I had Bob Muglia, the former CEO of Snowflake, on a few months ago. And in his book, he predicts AGI by 2030. So, you know, we're looking at, you know, five to six years essentially from now. And I think there's some work to be done.
Starting point is 01:00:53 There's a lot of challenges to figure out, like, you know, everything from bias and ethics. But, you know, to your point, like data is really the love language of AI. You can't build these large language models without really good data. And the database is going to play a major role in the current evolution of AI and sort of the future evolution as well. So maybe we'll have you back sometime down the road just to do an AGI-focused episode. I think that would be fun.
Starting point is 01:01:24 Yeah, hopefully. Yeah, yeah. I think that would be fun. Thank you for the invitation. Okay. Yeah. All right. Linpeng, thank you so much for being here.
Starting point is 01:01:33 I really enjoyed it. And best of luck with open sourcing MyScale. Yeah, yeah, sure. Thank you.
