The Good Tech Companies - Beyond Text Embeddings: Addressing the Gaps in RAG Applications for Structured Data Queries

Episode Date: October 29, 2024

This story was originally published on HackerNoon at: https://hackernoon.com/beyond-text-embeddings-addressing-the-gaps-in-rag-applications-for-structured-data-queries. ...Discover the limits of text embeddings in RAG applications and explore how knowledge graphs handle structured data for accurate information retrieval. Check more stories related to programming at: https://hackernoon.com/c/programming. You can also check exclusive content about #neo4j, #rag-applications, #ai, #text-embeddings, #structured-data-queries, #knowledge-graph, #data-retrieval, #good-company, and more. This story was written by: @neo4j. Learn more about this writer by checking @neo4j's about page, and for more stories, please visit hackernoon.com. While text embeddings are powerful for unstructured text in RAG applications, they fall short for structured queries. Knowledge graphs offer a solution.

Transcript
Starting point is 00:00:00 This audio is presented by Hacker Noon, where anyone can learn anything about any technology. Beyond Text Embeddings: Addressing the Gaps in RAG Applications for Structured Data Queries, by Neo4j. Everyone loves text embedding models, and for good reason: they excel at encoding unstructured text, making it easier to discover semantically similar content. It's no surprise that they form the backbone of most RAG applications, especially with the current emphasis on encoding and retrieving relevant information from documents and other textual resources. However, there are clear examples of questions one might ask where the text embedding approach to RAG applications falls short and delivers incorrect information.
Starting point is 00:00:41 As mentioned, text embeddings are great at encoding unstructured text. On the other hand, they aren't that great at dealing with structured information and operations such as filtering, sorting, or aggregation. Imagine a simple question like "What is the highest rated movie released in 2024?" To answer this question, we must first filter by release year and then sort by rating. We'll examine how a naive approach with text embeddings performs and then demonstrate how to deal with such questions. This blog post shows that when dealing with structured data operations such as filtering, sorting, or aggregating, you need other tools that provide structure, such as knowledge graphs.
Starting point is 00:01:20 The code is available on GitHub. Environment setup. For this blog post, we'll use the Recommendations project in Neo4j Sandbox. The Recommendations project uses the MovieLens dataset, which contains movies, actors, ratings, and more. The following code instantiates a LangChain wrapper to connect to the Neo4j database. Additionally, you will need an OpenAI API key, which you pass in the following code. The database contains 10,000 movies, but text embeddings are not yet stored. To avoid calculating embeddings for all of them, we'll tag the 1,000 top-rated films with a secondary label called Target.
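A minimal sketch of that setup, assuming placeholder Sandbox credentials and an imdbRating property on the Movie nodes (substitute your own values):

```python
import os
from langchain_community.graphs import Neo4jGraph

os.environ["OPENAI_API_KEY"] = "sk-..."  # your OpenAI API key

# Connect to the Neo4j Sandbox instance (placeholder credentials).
graph = Neo4jGraph(
    url="bolt://<sandbox-ip>:7687",
    username="neo4j",
    password="<sandbox-password>",
)

# Tag the 1,000 top-rated films with a secondary Target label
# so we only embed a subset of the 10,000 movies.
graph.query("""
MATCH (m:Movie)
WHERE m.imdbRating IS NOT NULL
WITH m ORDER BY m.imdbRating DESC
LIMIT 1000
SET m:Target
""")
```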
Starting point is 00:02:02 consideration. Since we'll be demonstrating filtering by year and sorting by rating, it wouldn't be fair to exclude those details from the embedded text. That's why I chose to capture the release year, rating, title, and description of each movie. Here is an example of text we will embed for the Wolf of Wall Street movie. You might say this is not a good approach to embedding structured data, and I wouldn't argue since I don't know the best approach. Maybe instead of key value items, we should convert them to text or something. Let me know if you have some ideas about what might work better. The Neo4j vector object in langchain has a convenient method from __existing__graph
Starting point is 00:02:39 where you can select which text properties should be encoded in this example. We use OpenAI's text embedding 3 small model for embedding generation. We initialize the Neo4j vector object using the from underscore existing underscore graph method. The node underscore label parameter filters the nodes to be encoded, specifically those labeled target. The text underscore node underscore properties parameter defines the node properties to be embedded, including plot, title, year, and IMDB rating. Finally, the embedding underscore node underscore property defines the property where the generated embeddings will be stored, designated as embedding. The naive approach, let's start by trying to find a movie based on its plot or description results the results seem pretty solid overall. There's consistently a little boy involved, though I'm not sure if he always meets his hero.
Starting point is 00:03:35 Then again, the dataset only includes 1000 movies, so the options are somewhat limited. Now let's try a query that requires some basic filtering results it's funny, but not a single movie from 2016 was selected. Maybe we could get better results with different text preparation for encoding. However, text embeddings aren't applicable here since we're dealing with a simple structured data operation where we need to filter documents or, in this example, movies based on a metadata property. Metadata filtering is a well-established technique often employed to enhance the accuracy of RAG systems. The next query we'll try requires a bit of sorting results if you're familiar with IMDb ratings, you know there are plenty of movies scoring above 8.
Starting point is 00:04:13 3. The highest rated title in our database is actually a series, Band of Brothers, with an impressive 9. 6 rating. Once again, text embeddings perform poorly when it comes to sorting results. Let's also evaluate a question that requires some sort of aggregation results. The results are definitely not helpful here because we get 4 random movies returned. It's virtually impossible to get from these random 4 movies a conclusion that there are a total of 1000 movies we tagged and embedded for this example. So what's the solution? It's straightforward.
Starting point is 00:04:48 Questions involving structured operations like filtering, sorting, and aggregation need tools designed to work with structured data. Tools for structured data. At the moment, most people seem to think of the text-to-query approach, where an LLM generates a database query to interact with a database based on the provided question and schema. For Neo4j, this is Text2Cypher; for SQL databases, there is Text2SQL. However, it turns out in practice that this approach isn't reliable or robust enough for production use.
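For context, a text-to-query setup in LangChain might look like this sketch (import paths and flags vary across LangChain versions; this is an assumption, not code from the original post):

```python
from langchain.chains import GraphCypherQAChain
from langchain_openai import ChatOpenAI

# The LLM writes a Cypher query from the question and the graph schema,
# runs it, and summarizes the result.
text2cypher = GraphCypherQAChain.from_llm(
    llm=ChatOpenAI(model="gpt-4o", temperature=0),
    graph=graph,
    verbose=True,
    allow_dangerous_requests=True,  # generated queries run directly against the DB
)
text2cypher.invoke({"query": "What is the highest rated movie released in 2024?"})
```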
Starting point is 00:05:18 Cypher Statement Generation Evaluation Taken from my blog post about Cypher evaluation, you can use techniques like chain of thought, few shot examples, or fine tuning, but achieving high accuracy remains nearly impossible at this stage. The text-to-query approach works well for simple questions on straightforward database schemas, but that's not the reality of production environments. To address this, we shift the complexity of generating database queries away from an LLM and treat it as a code problem where we generate database queries deterministically based on function inputs. The advantage is significantly improved robustness, though it comes at the cost of reduced flexibility. It's better to narrow the scope of the RAG application and answer those
Starting point is 00:05:58 questions accurately, rather than attempt to answer everything but do so inaccurately. Since we are generating database queries, in this case, cipher statements, based on function inputs, we can leverage the tool capabilities of LLMs. In this process, the LLM populates the relevant parameters based on user input, while the function handles retrieving the necessary information. For this demonstration, we'll first implement two tools, one for counting movies and another for listing them, and then create an LLM agent using Lang Graph. Tool for counting movies. We begin by implementing a tool for counting movies based on predefined filters.
Starting point is 00:06:36 First, we have to define what those filters are and describe to an LLM when and how to use them. Langchain offers several ways to define function inputs, but I prefer the pydantic approach. In this example, we have three filters available to refine movie results. Min underscore year, max underscore year, and min underscore rating. These filters are based on structured data and are optional, as the user may choose to include any, all, orn one of them. Additionally, we've introduced a grouping underscore key input that tells the function whether to group the count by a specific property. In this case, the only supported grouping is by year, as defined in the enum section. Now let's define the actual
Starting point is 00:07:16 function the movie underscore count function generates a cipher query to count movies based on optional filters and grouping key. It begins by defining a list of filters with corresponding values provided as arguments. The filters are used to dynamically build the where clause, which is responsible for applying the specified filtering conditions in the cipher statement, including only those conditions where values are not none. The return clause of the cipher query is then constructed, either grouping by the provided grouping underscore key or simply counting the total number of movies. Finally, the function executes the query and returns the results. The function can be extended with more arguments and
Starting point is 00:07:55 more involved logic AS needed, but it's important to ensure that it remains clear so that an LLM can call it correctly and accurately. Tool for listing movies. Again, we have to start by defining the arguments of the function we keep the same three filters as in the movie count function but add the description argument. This argument lets us search and list movies based on their plot using vector similarity search. Just because we're using structured tools and filters doesn't mean we can't incorporate text embedding in vector search methods. Since we don't want to return all movies most of the time, why include an optional k input with a default value? Additionally, for listing, we want to sort the movies to return only the most relevant ones. In this case, we can sort them by rating or release year. Let's implement
Starting point is 00:08:41 the function this function retrieves a list of movies based on multiple optional filters, description, year range, minimum rating, and sorting preferences. If only a description is given with no other filters, it performs a vector index similarity search to find relevant movies. When additional filters are applied, the function constructs a cipher query to match movies based on the specified criteria, such as release year and IMDB rating, combining them with an optional description-based similarity. The results are then sorted by either the similarity score, IMDB rating, or year, and limited to K movies. Putting it all together as a Lang graph agent, we will implement a straightforward React agent using Lang graph. The agent consists of an LLM and
Starting point is 00:09:25 tools step. As we interact with the agent, we'll first call the LLM to decide if we should use tools. Then we'll run a loop 1. If the agent said to take an action, i.e. call tool, we'll run the tools and pass the results back to the agent. 2. If the agent did not ask to run tools, we'll finish, respond to the user. The code implementation is as straightforward as it gets. First we bind the tools to the LLM and define the assistant step next we define the lang graph flow we define two nodes in the lang graph and link them with a conditional edge. If a tool is called, the flow is directed to the tools, otherwise, the results are resent back to the user. Let's now test our agent results. In the first step, the agent chooses to use the movie
Starting point is 00:10:10 list tool with the appropriate description parameter. It's unclear why it selects a k value of 5, but it seems to favor that number. The tool returns the top 5 most relevant movies based on the plot, and the LLM simply summarizes them for the user at the end. If we ask ChadGPT why it likes K value of 5, we get the following response. Next, let's ask a slightly more complex question that requires metadata filtering results. This time, additional arguments were used to filter movies only from the 1990s. This example would be a typical example of metadata filtering using the pre-filtering approach. The generated cipher statement first narrows down the movies by
Starting point is 00:10:50 filtering on their release year. In the next part, the cipher statement uses text embeddings and vector similarity search to find movies about a little girl meeting her hero. Let's try to count movies based on various conditions results. With a dedicated tool for counting, the complexity shifts from the LLM to the tool, leaving the LLM responsible only for populating the relevant function parameters. This separation of tasks makes the system more efficient and robust and reduces the complexity of the LLM input. Since the agent can invoke multiple tools sequentially or in parallel, let's test it with something even more complex results as mentioned. The agent can invoke multiple tools to gather all the necessary information to answer the question.
Starting point is 00:11:33 In this example, it begins by listing the highest rated movies to identify when the top rated film was released. Once it has that data, it calls the movie count tool to gather the number of movies released after the specified year, using a grouping key as defined in the question. Summary, while text embeddings are excellent for searching through unstructured data, they fall short when it comes to structured operations like filtering, sorting, and aggregating. These tasks require tools designed for structured data, which offer the precision and flexibility needed to handle these operations. The key takeaway is that expanding the set of tools in your system allows you to address a broader range of user queries, making your applications more robust and versatile. Combining structured data approaches and unstructured text search techniques can deliver more accurate and relevant responses, ultimately enhancing the user experience in
Starting point is 00:12:23 RAG applications. As always, the code is available on GitHub. To learn more about this topic, join us at Nodes 2024 on November 7, our free virtual developer conference on intelligent apps, knowledge graphs, and AI. Register now. Thank you for listening to this HackerNoon story, read by Artificial Intelligence. Visit HackerNoon.com to read, write, learn and publish.
