The Good Tech Companies - 6 Critical Challenges of Productionizing Vector Search

Episode Date: April 23, 2024

This story was originally published on HackerNoon at: https://hackernoon.com/6-critical-challenges-of-productionizing-vector-search. Prepare for the complexities of deploying vector search in production with insights on indexing, metadata filtering, query language, and vector lifecycle management. Check more stories related to machine-learning at: https://hackernoon.com/c/machine-learning. You can also check exclusive content about #vector-search, #vector-database, #app-development, #rockset, #cloud-computing, #scaling-vector-search, #vector-lifecycle-management, #good-company, and more. This story was written by: @rocksetcloud. Learn more about this writer by checking @rocksetcloud's about page, and for more stories, please visit hackernoon.com. Productionizing vector search involves addressing challenges in indexing, metadata filtering, query language, and vector lifecycle management. Understanding these complexities is crucial for successful deployment and application development.

Transcript
This audio is presented by Hacker Noon, where anyone can learn anything about any technology. Six Critical Challenges of Productionizing Vector Search, by Rockset.

You've decided to use vector search in your application, product, or business. You've researched how and why embeddings and vector search make a problem solvable or can enable new features. You've dipped your toes into the hot, emerging area of approximate nearest-neighbor algorithms and vector databases. Almost immediately upon productionizing vector search applications, you will start to run into very hard and potentially unanticipated difficulties.
This blog attempts to arm you with some knowledge of your future, the problems you will face, and questions you may not know yet that you need to ask.

1. Vector search does not equal vector database. Vector search and all the associated clever algorithms are the central intelligence of any system trying to leverage vectors. However, all of the associated infrastructure to make it maximally useful and production-ready is enormous and very, very easy to underestimate. To put this as strongly as I can: a production-ready vector database will solve many, many more database problems than vector problems. By no means is vector search itself an easy problem, and we will cover many of the hard sub-problems below, but the mountain of traditional database problems that a vector database needs to solve certainly remains the hard part.
Databases solve a host of very real and very well-studied problems, from atomicity and transactions, consistency, performance and query optimization, durability, backups, access control, multi-tenancy, scaling and sharding, and much more. Vector databases will require answers in all of these dimensions for any product, business, or enterprise. Be very wary of home-rolled vector search infra. It's not that hard to download a state-of-the-art vector search library and start approximate-nearest-neighboring your way towards an interesting prototype. Continuing down this path, however, is a path to accidentally reinventing your own database. That's probably a choice you want to make consciously.
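To make the prototype point concrete, here is a minimal sketch of the kind of thing being described, assuming the open-source hnswlib library and synthetic data (neither is named in the original). It shows how little code an approximate-nearest-neighbor prototype takes, and nothing more.

```python
# A minimal approximate-nearest-neighbor prototype (illustrative only).
# Assumes the hnswlib package and synthetic data; not a production design.
import numpy as np
import hnswlib

dim, num_vectors = 128, 10_000
vectors = np.random.rand(num_vectors, dim).astype(np.float32)

# Build an HNSW index over all vectors up front.
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=num_vectors, ef_construction=200, M=16)
index.add_items(vectors, np.arange(num_vectors))

# Query: find the 5 nearest neighbors of one embedding.
query = np.random.rand(dim).astype(np.float32)
labels, distances = index.knn_query(query, k=5)
print(labels, distances)
```

Everything this snippet does not do (durability, replication, access control, multi-tenancy, consistent metadata) is the mountain of traditional database problems described above.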
2. Incremental indexing of vectors. Due to the nature of the most modern ANN vector search algorithms, incrementally updating a vector index is a massive challenge. This is a well-known, hard problem. The issue here is that these indexes are carefully organized for fast lookups, and any attempt to incrementally update them with new vectors will rapidly deteriorate the fast-lookup properties. As such, in order to maintain fast lookups as vectors are added, these indexes need to be periodically rebuilt from scratch. Any application hoping to stream new vectors continuously, with requirements that both the vectors show up in the index quickly and the queries remain fast, will need serious support for the incremental indexing problem.
This is a very crucial area for you to understand about your database and a good place to ask a number of hard questions. There are many potential approaches that a database might take to help solve this problem for you. A proper survey of these approaches would fill many blog posts of this size. It's important to understand some of the technical details of your database's approach, because it may have unexpected trade-offs or consequences in your application. For example, if a database chooses to do a full re-index with some frequency, it may cause high CPU load and therefore periodically affect query latencies. You should understand your application's need for incremental indexing and the capabilities of the system you're relying on to serve you.
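As one illustration of the trade-off, and only one of the many possible approaches alluded to above, here is a hedged sketch of a common pattern: a periodically rebuilt main index plus a small exact-search buffer for vectors that arrived since the last rebuild. The hnswlib usage and all names here are assumptions made for the example, not anything prescribed by the post.

```python
# Sketch of one possible freshness strategy: a periodically rebuilt main index
# plus a small exact-search buffer for vectors added since the last rebuild.
# Illustrative only; real databases use more sophisticated techniques.
import numpy as np
import hnswlib

dim = 128
base = np.random.rand(10_000, dim).astype(np.float32)
index = hnswlib.Index(space="l2", dim=dim)
index.init_index(max_elements=50_000, ef_construction=200, M=16)
index.add_items(base)

buffer = []  # new vectors not yet folded into the rebuilt index

def insert(vec):
    buffer.append(vec)  # visible to queries immediately, at brute-force cost

def search(query, k=5):
    labels, dists = index.knn_query(query, k=k)
    candidates = list(zip(dists[0].tolist(), labels[0].tolist()))
    # Exact squared-L2 distances against the buffer keep recent vectors searchable.
    for i, vec in enumerate(buffer):
        candidates.append((float(np.sum((vec - query) ** 2)), -(i + 1)))  # negative ids = buffered
    return sorted(candidates)[:k]

def rebuild():
    # Periodic, CPU-heavy rebuild folds the buffer into a fresh index.
    global index, base, buffer
    base = np.vstack([base] + buffer) if buffer else base
    index = hnswlib.Index(space="l2", dim=dim)
    index.init_index(max_elements=len(base) + 50_000, ef_construction=200, M=16)
    index.add_items(base)
    buffer = []

insert(np.random.rand(dim).astype(np.float32))
print(search(np.random.rand(dim).astype(np.float32), k=3))
```

The rebuild() step is exactly the kind of periodic, CPU-heavy work warned about above, which is why it is worth asking how your database schedules and isolates it.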
3. Data latency for both vectors and metadata. Every application should understand its need and tolerance for data latency. Vector-based indexes have, at least by other database standards, relatively high indexing costs. There is a significant trade-off between cost and data latency. How long after you create a vector do you need it to be searchable in your index? If it's soon, vector latency is a major design point in these systems. The same applies to the metadata of your system. As a general rule, mutating metadata is fairly common, e.g. changing whether a user is online or not, and it is typically very important that metadata-filtered queries rapidly react to updates to metadata. Taking the above example, it's not useful if your vector search returns a result for someone who has recently gone offline.
If you need to stream vectors continuously to the system, or update the metadata of those vectors continuously, you will require a different underlying database architecture than if it's acceptable for your use case to, e.g., rebuild the full index every evening to be used the next day.

4. Metadata filtering. I will strongly state this point: I think in almost all circumstances, the product experience will be better if the underlying vector search infrastructure can be augmented by metadata filtering, or hybrid search. For example: "Show me all the restaurants I might like (a vector search) that are located within 10 miles and are low to medium priced (a metadata filter)."
The second part of this query is a traditional SQL-like WHERE clause intersected with, in the first part, a vector search result. Because of the nature of these large, relatively static, relatively monolithic vector indexes, it's very difficult to do joint vector plus metadata search efficiently. This is another of the well-known, hard problems that vector databases need to address on your behalf. There are many technical approaches that databases might take to solve this problem for you. You can pre-filter, which means to apply the filter first and then do a vector lookup. This approach suffers from not being able to effectively leverage the pre-built vector index. You can post-filter the results after you've done a full vector search.
This works great unless your filter is very selective, in which case you spend huge amounts of time finding vectors you later toss out because they don't meet the specified criteria. Sometimes, as is the case in Rockset, you can do single-stage filtering, which is to attempt to merge the metadata filtering stage with the vector lookup stage in a way that preserves the best of both worlds. If you believe that metadata filtering will be critical to your application (and I posit above that it almost always will be), the metadata filtering trade-offs and functionality will become something you want to examine very carefully.
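The pre-filter versus post-filter trade-off is easier to see in code. This is a deliberately naive, exhaustive-search sketch over made-up restaurant-style metadata; nothing here reflects how any real vector database implements either strategy.

```python
# Naive illustration of pre-filtering vs. post-filtering (not how a real
# vector database implements either; exhaustive search keeps the logic visible).
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.random((10_000, 64)).astype(np.float32)
price_tier = rng.integers(0, 5, size=10_000)          # example metadata
distance_to_user_miles = rng.random(10_000) * 50.0    # example metadata

def nearest(candidate_ids, query, k):
    dists = np.sum((vectors[candidate_ids] - query) ** 2, axis=1)
    return candidate_ids[np.argsort(dists)[:k]]

query = rng.random(64).astype(np.float32)
match = (price_tier <= 2) & (distance_to_user_miles <= 10.0)

# Pre-filter: apply the metadata predicate first, then rank only the survivors.
# Correct, but a real system loses much of the benefit of its prebuilt index.
pre = nearest(np.where(match)[0], query, k=10)

# Post-filter: rank everything first, then drop rows that fail the predicate.
# If the filter is selective, most of the vector work is thrown away and you
# may even end up with fewer than k results.
top = nearest(np.arange(len(vectors)), query, k=100)
post = [i for i in top if match[i]][:10]

print(len(pre), len(post))
```

Single-stage filtering, the approach attributed here to Rockset, aims to push the predicate into the vector lookup itself so that neither failure mode applies; exactly how a given database achieves that is one of the questions worth asking.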
5. Metadata query language. If I'm right and metadata filtering is crucial to the application you are building, congratulations, you have yet another problem: you need a way to specify filters over this metadata. This is a query language. Coming from a database angle, and as this is a Rockset blog, you can probably expect where I am going with this. SQL is the industry-standard way to express these kinds of statements. Metadata filters in vector search language are simply the WHERE clause of a traditional database query. SQL has the advantage of also being relatively easy to port between different systems. Furthermore, these filters are queries, and queries can be optimized. The sophistication of the query optimizer can have a huge impact on the performance of your queries. For example, sophisticated optimizers will try to apply the most selective of the metadata filters first, because this will minimize the work later stages of the filtering require, resulting in a large performance win.
If you plan on writing non-trivial applications using vector search and metadata filters, it's important to understand and be comfortable with the query language, both ergonomics and implementation, you are signing up to use, write, and maintain.
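To make the optimizer example concrete, here is a toy sketch of why applying the most selective metadata filter first pays off. The data, predicates, and sampling-based selectivity estimate are all invented for illustration; a real SQL optimizer works from collected statistics and does this for you behind a WHERE clause.

```python
# Sketch of why filter ordering matters: evaluate the most selective predicate
# first so later (possibly more expensive) predicates see fewer rows.
# Purely illustrative; a real query optimizer works on statistics, not guesses.
rows = [
    {"id": i, "is_online": i % 2 == 0, "country": "NZ" if i % 100 == 0 else "US"}
    for i in range(100_000)
]

predicates = {
    "is_online": lambda r: r["is_online"],              # passes ~50% of rows
    "country = 'NZ'": lambda r: r["country"] == "NZ",   # passes ~1% of rows
}

def estimated_selectivity(pred, sample):
    return sum(pred(r) for r in sample) / len(sample)

# Order predicates by estimated selectivity on a small sample: cheapest cut first.
sample = rows[:1000]
ordered = sorted(predicates.items(), key=lambda kv: estimated_selectivity(kv[1], sample))

filtered = rows
for name, pred in ordered:
    filtered = [r for r in filtered if pred(r)]
    print(f"after {name}: {len(filtered)} rows remain")
```

In a SQL engine you would simply write both predicates in the WHERE clause and rely on the optimizer to pick this ordering for you, which is part of why the query language you sign up for matters.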
6. Vector lifecycle management. Alright, you've made it this far. You've got a vector database that has all the right database fundamentals you require, has the right incremental indexing strategy for your use case, has a good story around your metadata filtering needs, and will keep its index up to date with latencies you can tolerate. Awesome. Your ML team, or maybe OpenAI, comes out with a new version of their embedding model. You have a gigantic database filled with old vectors that now need to be updated. Now what? Where are you going to run this large batch ML job? How are you going to store the intermediate results? How are you going to do the switchover to the new version? How do you plan to do this in a way that doesn't affect your production workload?
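Those questions map naturally onto a versioned, blue/green-style switchover. Below is a hedged sketch of that shape; embed_v2, build_index, and the alias store are hypothetical stand-ins for your batch ML job, your database's index-build path, and its routing layer, none of which are specified here.

```python
# Sketch of a versioned re-embedding and switchover flow (blue/green style).
# embed_v2, build_index, and the alias store are hypothetical stand-ins for
# your batch ML job, your vector database's build path, and its routing layer.
from typing import Callable, Dict, List

alias: Dict[str, str] = {"prod_search": "embeddings_v1"}  # queries read through the alias

def reembed_and_switch(
    documents: List[str],
    embed_v2: Callable[[List[str]], List[List[float]]],
    build_index: Callable[[str, List[List[float]]], None],
    batch_size: int = 1024,
) -> None:
    # 1. Run the large batch job against the new model, off the serving path,
    #    writing each batch into a brand-new index so production traffic is untouched.
    for start in range(0, len(documents), batch_size):
        batch = documents[start : start + batch_size]
        build_index("embeddings_v2", embed_v2(batch))

    # 2. Validate the new index (recall, latency, spot checks) before exposing it.
    #    ... omitted ...

    # 3. Atomically repoint the alias; queries now hit v2, and v1 is kept
    #    around briefly in case a rollback is needed.
    alias["prod_search"] = "embeddings_v2"
```

In this shape, the batch job runs off the serving path, the intermediate results live in the not-yet-live index, and the switchover is a single alias flip that can be rolled back.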
Ask the hard questions. Vector search is a rapidly emerging area, and we're seeing a lot of users starting to bring applications to production. My goal for this post was to arm you with some of the crucial hard questions you might not yet know to ask, and you'll benefit greatly from having them answered sooner rather than later. What I didn't cover in this post was how Rockset has worked, and is working, to solve all of these problems, and why some of our solutions to these are groundbreaking and better than most other attempts at the state of the art. Covering that would require many blog posts of this size, which is, I think, precisely what we'll do. Stay tuned for more.

Thank you for listening to this HackerNoon story, read by Artificial Intelligence. Visit HackerNoon.com to read, write, learn, and publish.
