The Good Tech Companies - Building Knowledge Graphs for RAG: Exploring GraphRAG with Neo4j and LangChain

Episode Date: October 22, 2024

This story was originally published on HackerNoon at: https://hackernoon.com/building-knowledge-graphs-for-rag-exploring-graphrag-with-neo4j-and-langchain. Combine text extraction, network analysis, and LLM prompting and summarization for improved RAG accuracy. Check more stories related to tech-stories at: https://hackernoon.com/c/tech-stories. You can also check exclusive content about #graphrag, #retrieval-augmented-generation, #knowledge-graph, #neo4j, #langchain, #llms, #llmgraphtransformer, #good-company, and more. This story was written by: @neo4j. Learn more about this writer by checking @neo4j's about page, and for more stories, please visit hackernoon.com. This article explores the implementation of a "From Local to Global" GraphRAG pipeline using Neo4j and LangChain. It covers the process of constructing knowledge graphs from text, summarizing communities of entities using Large Language Models (LLMs), and enhancing Retrieval-Augmented Generation (RAG) accuracy by combining graph algorithms with LLM-based summarization. The approach condenses information from multiple sources into structured graphs and generates natural language summaries, offering an efficient method for complex information retrieval.

Transcript
Starting point is 00:00:00 This audio is presented by Hacker Noon, where anyone can learn anything about any technology. Building Knowledge Graphs for RAG: Exploring GraphRAG with Neo4j and LangChain, by Neo4j. I am always intrigued by new approaches to implementing retrieval-augmented generation (RAG) over graphs, often called GraphRAG. However, it seems that everyone has a different implementation in mind when they hear the term GraphRAG. In this blog post, we will dive deep into the "From Local to Global" GraphRAG article and implementation by Microsoft researchers. We will cover the knowledge graph construction and summarization part and leave the retrievers for the next blog post. The researchers were so kind as to provide us with the code repository,
Starting point is 00:00:43 and they have a project page as well. The approach taken in the article mentioned above is quite interesting. As far as I understand, it involves using a knowledge graph as a step in the pipeline for condensing and combining information from multiple sources. Extracting entities and relationships from text is nothing new. However, the authors introduce a novel (at least to me) idea of summarizing condensed graph structure and information back as natural language text. The pipeline begins with input text from documents, which are processed to generate a graph. The graph is then converted back into natural language text, where the generated text contains condensed information about specific entities or graph communities previously spread
Starting point is 00:01:25 across multiple documents. At a very high level, the input to the graph rag pipeline are source documents containing various information. The documents are processed using an LLM to extract structured information about entities appearing in the papers along with their relationships. This extracted structured information is then used to construct a knowledge graph. The advantage of using a knowledge graph data representation is that it can quickly and straightforwardly combine information from multiple documents or data sources about particular entities. As mentioned, the knowledge graph is not the only data representation, though. After the knowledge graph has been constructed, they use a combination of graph algorithms and LLM prompting to generate natural language summaries of communities of entities found
Starting point is 00:02:09 in the knowledge graph. These summaries then contain condensed information spreading across multiple data sources and documents for particular entities and communities. For a more detailed understanding of the pipeline, we can refer to the step-by-step description provided in the original paper. Steps in the pipeline. Image from the graph rag paper, licensed under CC by 4.0 following is a high-level summary of the pipeline that we will use to reproduce their approach using Neo4j and Langchain. Indexing, graph generation source documents to text chunks. Source documents are split into smaller text chunks for processing. Text chunks to element instances. Each text chunk is analyzed to extract entities and relationships,
Starting point is 00:02:52 producing a list of tuples representing these elements. Element instances to element summaries. Extracted entities and relationships are summarized by the LLM into descriptive text blocks for each element. Element summaries to graph communities. These entity summaries form a graph, which is then partitioned into communities using algorithms like Leiden for hierarchical structure. Graph communities to community summaries. Summaries of each community are generated with the LLM to understand the dataset's global topical structure and semantics. Retrieval. Answering community summaries to global answers. Community summaries are used to answer a user query by generating intermediate answers,
Starting point is 00:03:32 which are then aggregated into a final global answer. Note that my implementation was done before their code was available, so there might be slight differences in the underlying approach or LLM prompts being used. I'll try to explain those differences as we go along. The code is available on GitHub. Setting up the Neo4j environment, we will use Neo4j as the underlying graph store. The easiest way to get started is to use a free instance of Neo4j Sandbox, which offers cloud instances of the Neo4j database with the Graph Data Science plugin installed. Alternatively, you can set up a local instance of the Neo4j database with the Graph Data Science plugin installed. Alternatively, you can set up a local instance of the Neo4j database by downloading the Neo4j
Starting point is 00:04:10 desktop application and creating a local database instance. If you are using a local version, make sure to install both APOC and GDS plugins. For productions at UPS, you can use the paid, managed AuraDS instance, which provides the GDS plugin. We start by creating a Neo4j Graph instance, which is the convenience wrapper wedded to Langchain dataset. We will use a news article dataset I created some time ago using DiffBots API. I have uploaded it to my GitHub for easier reuse let's examine the first couple of
Starting point is 00:04:44 rows from the dataset. We have the title and text of the articles available along with their publishing date and token count using the TikTok and library. Text chunking. The text chunking step is crucial and significantly impacts downstream results. The paper authors found that using smaller text chunks results in extracting more entities overall. Number of extract entities given the size of text chunks. Image from the graph rag paper, licensed under CC by 4.0. As you can see, using text chunks of 2400 tokens results in fewer extracted entities than when they used 600 tokens. Additionally, they identified that LLMs might not extract all entities on the first run.
Starting point is 00:05:25 In that case, they introduce AH heuristics to perform the extraction multiple times. We will talk about that more in the next section. However, there are always trade-offs. Using smaller text chunks can result in losing the context and CO references of specific entities spread across the documents. For example, if a document mentions John and he in separate sentences, breaking the text into smaller chunks might make it unclear that he refers to John. Some of the CO reference issues can be solved using an overlap text chunking strategy, but not all of them. Let's examine the size of our article texts. The distribution of article token counts is approximately normal, with a peak of around 400 tokens. The frequency of chunks gradually
Starting point is 00:06:09 increases up to this peak, then decreases symmetrically, indicating most text chunks are near the 400 token mark. Due to this distribution, we will not perform any text chunking here to avoid co-reference issues. By default, the GraphRag project uses chunk sizes of 300 tokens with 100 tokens of overlap. Extracting nodes and relationships, the next step is constructing knowledge from text chunks. For this use case, woos an LLM to extract structured information in the form of nodes and relationships from the text. You can examine the LLM prompt the authors used in the paper. They have LLM prompts where we can predefine node labels if needed, but by default, that's optional.
Starting point is 00:06:51 Additionally, the extracted relationships in the original documentation don't really have a type, only a description. I imagine the reason behind this choice is to allow the LLM to extract and retain richer and more nuanced information as relationships. But it's difficult to have a clean knowledge graph with no relationship type specifications, the descriptions could go into a property. In our implementation, we will use the LLM Graph Transformer, which is available in the Langchain library. Instead of using pure prompt engineering, as the implementation in the article paper does, the LLM graph transformer uses the built-in function calling support to extract structured information, structured output LLMs in langchain. You can inspect the system prompt in this example.
Starting point is 00:07:36 We use GPT-4-0 for graph extraction. The authors specifically instruct the LLM to extract entities and relationships and their descriptions. With the LANG chain implementation, you can use the in-attributes to specify which node or relationship properties you want the LLM to extract. The difference with the LLM graph transformer implementation is that all node or relationship properties are optional, so not all nodes will have the property. If we wanted, we could define a custom extraction to have a mandatory property, but we will skip we could define a custom extraction to have a mandatory property, but we will skip that in this implementation. We will parallelize the requests to make the graph extraction faster and store results to Neo4j in this example. We extract
Starting point is 00:08:15 graph information from 2000 articles and store results to Neo4j. We have extracted around 13,000 entities and 16,000 relationships. Here is an example of an extracted document in the graph. It takes about 35, plus or minus 5, minutes to complete extraction and costs about $30 with GPT-4-0. In this step, the authors introduce heuristics to decide whether to extract graph information in more than one pass. For simplicity's sake, we will only doon pass. However, if we wanted to do multiple passes, we could put the first extraction results as conversational history and simply instruct the LLM that many entities are missing, and it should extract more, like the
Starting point is 00:08:56 graph RA gothers do. Previously, I mentioned how vital text chunk size is and how it affects the number of entities extracted. Since we didn't perform any additional text chunking, we can evaluate the distribution of extracted entities based on text chunk size. The scatter plot shows that while there is a positive trend indicated by their add line, the relationship is sublinear. Most data points cluster at lower entity counts even as token counts increase. This indicates that the number of entities extracted does not scale proportionally with the size of the text chunks. Although some outliers exist, the general pattern shows that higher token counts do not consistently lead to higher entity counts.
Starting point is 00:09:36 This validates the author's finding that lower text chunk sizes will extract more information. I also thought it would be interesting to inspect the node degree distributions of the constructed graph. The following code retrieves and visualizes node degree distributions. The node degree distribution follows a power law pattern, indicating most node-a-shave very few connections while a few nodes are highly connected. The mean degree is 2.45 and the median is 1.00, showing that more than half the node-a-shave only one connection. Most nodes, 75%, have two or fewer connections, and 90% have five or fewer. This distribution is typical of many real-world networks, where a small number of hubs have many connections, and most nodes have few. Since both node and relationship descriptions are not
Starting point is 00:10:23 mandatory properties, we will also examine how many were extracted. The results show that 5,926 nodes out of 12,994, 45.6% have the description property. On the other hand, only 5,569 relationships out of 15,921, 35%, have such a property. Note that due to the probabilistic nature of LLMs, the numbers can vary on different runs and different source data, LLMs, and prompts. Entity resolution. Entity resolution, deduplication, is crucial when constructing knowledge graphs because it ensures that each entity is uniquely and accurately represented, preventing duplicates and merging records that refer to the same real-world entity. This process is essential for maintaining data integrity and consistency within the graph. Without entity resolution,
Starting point is 00:11:14 knowledge graphs would suffer from fragmented and inconsistent data, leading to errors and unreliable insights. This image demonstrates how a single real-world entity might appear under slightly different names in different documents and, consequently, in our graph. Moreover, sparse data becomes a significant issue without entity resolution. Incomplete or partial data from various sources can result in scattered and disconnected pieces of information, making it difficult to form a coherent and comprehensive understanding of entities. Accurate entity resolution addresses this by consolidating data, filling in gaps, and creating a unified view of each entity. Before, after using sensing entity resolution to connect the International Consortium
Starting point is 00:11:56 of Investigative Journalists, ICIJ. Offshore leaks data, image from Paco Nathan, the left part of the visualization presents a sparse and unconnected graph. However, as shown on the right-hand side, such a graph can become well connected with efficient entity resolution. Greater than overall, entity resolution enhances the efficiency of data retrieval and greater than integration, providing a cohesive view of information across different greater than sources. It ultimately enables more effective question answering based on a greater-than-reliable and complete knowledge graph. Unfortunately, the authors of the GraphRag paper did not include any entity resolution code in their repo, although they mention it in their paper.
Starting point is 00:12:38 One reason for leaving this code out could be that it is tough to implement a robust and well-performing entity resolution for any given domain. You can implement custom heuristics for different nodes when dealing with predefined types of nodes. When they aren't predefined, they aren't consistent enough, like company, organization, business, etc. However, if the node labels or types aren't known in advance, as in our case, this becomes an even harder problem. Nonetheless, we will implement a version of entity resolution in our project here, combining text embeddings and graph algorithms with word distance and LLMs. Our process for entity resolution involves the following steps. 1. Entities in the graph. Start with all entities within the graph.
Starting point is 00:13:21 2. K-nearest graph. Construct a K-nearest neighbor graph, connecting similar entities based on text embeddings. 3. Weekly connected components. Identify weekly connected components in the K-nearest graph, grouping entities that are likely to be similar. Add a word distance filtering step after these components have been identified.
Starting point is 00:13:41 4. LLM evaluation. Use an LLM to evaluate these components and decide whether the entities within each component should be merged, resulting in a final decision on entity resolution, for example, merging Silicon Valley Bank and Silicon Valley Bank, while rejecting the merge for different dates like September 16, 2023 and September 2, 2023. We begin by calculating text embeddings for the name and description properties of entities. We can use the method in the integration in langchain to achieve this. We can use these embeddings to find potential candidates that are similar based in the cosine distance of these embeddings. We will use graph algorithms
Starting point is 00:14:22 available in the graph data Science, GDS, library. Therefore, we can use the GDS Python client for ease of use in a Pythonic way. If you are not familiar with the GDS library, we first have to project an in-memory graph before we can execute any graph algorithms. First, the Neo4j stored graph is projected into an in-memory graph for faster processing and analysis. Next, a graph algorithm is executed on the in-memory graph. Optionally, the algorithm's results can be stored back into the Neo4j database. Learn more about it in the documentation.
Starting point is 00:14:57 To create the k-nearest neighbor graph, we will project all entities along with their text embeddings now that the graph is projected under the name, we can execute graph algorithms. We will begin by constructing a k-nearest graph. The two most important parameters influencing how sparse or dense the k-nearest graph will be are in. The is the number of neighbors defined for each node, with a minimum value of 1. The filters out relationships with similarity below this threshold. Here, we will use a default of 10 and a relatively high similarity cutoff of 0. 95. Using a high similarity cutoff, such as 0. 95, ensures that only highly similar pairs are considered matches, minimizing false positives and improving accuracy. Since we want to store the results back to the projected in-memory graph instead of the knowledge graph, we will use the mode of the algorithm. The next step is to identify groups of entities that are connected with the newly inferred similarity relationships. Identifying groups of connected nodes is a frequent process in network analysis,
Starting point is 00:15:58 often called community detection or clustering, which involves finding subgroups of densely connected nodes. In this example, we will use the weakly connected components algorithm, which helps us find parts of a graph where all nodes are connected, even if we ignore the direction of the connections. We use the algorithms mode to store the results back to the database, stored graph, text embedding comparison helps find potential duplicates, but it is only part of the entity resolution process. For example, Google and Apple are very close in the embedding space, 0.96 cosine similarity using the embedding model. The SOME goes for BMW and Mercedes-Benz, 0.97 cosine similarity.
Starting point is 00:16:40 High text embedding similarity is a good start, but we can improve it. Therefore, we will add an additional filter allowing only pairs of words with a text distance of 3 or fewer, meaning that only the characters can be changed. This cipher statement is slightly more involved, and its interpretation is beyond the scope of this blog post. You can always ask an LLM to interpret it. Additionally, the word distance cutoff could be a function of the length of the word instead of a single number and the implementation could be more scalable. What is important is that it outputs groups of potential entities we might want to merge.
Starting point is 00:17:14 Here is an list of potential nodes to merge as you can see, our resolution approach works better for some node types th and others. Based on a quick examination, it seems to work better for people and organizations while it's pretty bad for dates. If we used predefined node types, we could prepare different heuristics for various node types. In this example, we do not have predefined node labels, so we will turn to an LLM to make the final decision about whether entities should be merged or not. First, we need to formulate the LLM prompt to effectively guide and inform the final decision regarding the merging of the nodes I always like to use method in langchain when expecting structured data output O avoid having to parse the outputs manually. Here, we will define the
Starting point is 00:17:57 output as A, where each inner list contains the entities that should be merged. This structure is used to handle scenarios where, for example, the input might be. In such cases, you would want to merge Sony and Sony Inc. separately from Google and Google Inc. Next, we integrate the LLM prompt with the structured output to create a chain using Lang Chain Expression Language
Starting point is 00:18:21 syntax and encapsulate it with Theena function. We need to run all potential candidate nodes through the function to decide whether they should be merged. To speed up the process, we will again parallelize the LLM calls the final step of entity resolution involves taking the results from the LLMond writing them back to the database by merging the specified nodes. This entity resolution is not perfect, but it gives us a starting point upon which we can improve. Additionally, we can improve the logic for determining which entities should be retained. Element summarization. In the next step, the authors perform an element summarization step. Essentially, every node and relationship gets passed through an entity summarization prompt. The authors note the novelty and interest of their approach greater than, overall, our use of rich descriptive text for homogeneous nodes in a
Starting point is 00:19:10 greater than potentially noisy graph structure is aligned with both the capabilities of greater than LLMs and the needs of global, query-focused summarization. These qualities greater than also differentiate our graph index from typical knowledge graphs, which rely greater than on concise and consistent knowledge triples, subject, predicate, object, for greater than downstream reasoning tasks. The idea is exciting. We still extract subject and object IDs or names from text, which allows us to link relationships to correct entities, even when entities appear across multiple text chunks. However, the relationships are introduced to a single type. Instead, the relationship type is actually a free-form text that allows us to retain richer and more nuanced information. Additionally, the entity information is
Starting point is 00:19:55 summarized using an LLM, allowing us toned and index this information and entities more efficiently for more accurate retrieval. One could argue that this richer and more nuanced information could also be retained by adding additional, possibly arbitrary, node and relationship properties. One issue with arbitrary node and relationship properties is that it could be hard to extract the information consistently because the LLM might use different property names or focus on various details on every execution. Some of these problems could be solved using predefined property names with additional type and description information. names or focus on various details on every execution. Some of these problems could be solved using predefined property names with additional type and description information.
Starting point is 00:20:29 In that case, you would need a subject matter expert to help define those properties, leaving little room for an LLM to extract any vital information outside the predefined descriptions. It's an exciting approach to represent richer information in a knowledge graph. One potential issue with the element summarization step is that it does not scale well since it requires an LLM call for every entity and relationship in the graph. Our graph is relatively tiny with 13,000 nodes and 16,000 relationships. Even for such a small graph, we would require 29,000 LLM calls, and each call would use a couple hundred tokens, making it quite expensive and time-intensive. Therefore, we will avoid this step here. We
Starting point is 00:21:11 can still use the description properties extracted during the initial text processing. Constructing and Summarizing Communities The final step in the graph construction and indexing process involves identifying communities within the graph. In this context, a community is a group of nodes that are more densely connected to each other than to the rest of the graph, indicating a higher level of interaction or similarity. The following visualization shows an example of community detection results. Once these entity communities are identified with a clustering algorithm, an LLM generates a summary for each community,
Starting point is 00:21:44 providing insights into their individual characteristics and relationships. Again, we use the Graph Data Science Library. We start by projecting an in-memory graph. To follow the original article precisely, we will project the graph of entities as an undirected weighted network, where the network represents the number of connections between two entities the authors employed the Leiden algorithm, a hierarchical clustering method, to identify communities within the graph. One advantage of using a hierarchical community detection algorithm is the ability to examine communities at multiple levels of granularity. The authors suggest
Starting point is 00:22:18 summarizing all communities at each level, providing a comprehensive understanding of the graph's structure. First, we will use the Weakly Connected Components, WCC, algorithm to assess the connectivity of our graph. This algorithm identifies isolated sections within the graph, meaning it detects subsets of nodes or components that are connected to each other but not to the rest of the graph. These components helpunderstand the fragmentation within the network and identify groups of nodes that are independent from others. WCC is vital for analyzing the overall structure and connectivity of the graph. The WCC algorithm results identified 1,119 distinct components. Notably, the largest component comprises 9,109 nodes, common in real-world networks where a
Starting point is 00:23:06 single supercomponent coexists with numerous smaller isolated components. The smellest component has one node, and the average component size is about 11.3 nodes. Next, we will run the Leiden algorithm, which is also available in the GDS library, and enable the parameter to return and store communities at all levels. We have also included a parameter to run the weighted variant of the Leiden algorithm. Using the mode of the algorithm stores the results as a node property. The algorithm identified five levels of communities, with the highest, least granular level where communities are largest, having 1,188 communities,
Starting point is 00:23:43 as opposed to 1 to 1119 components. Here is the visualization of the communities on the last level using Gephi. Visualizing more than 1000 communities is hard. Even picking the colors for each one is practically impossible. However, they make for nice artistic renditions. Building on this, we will create a distinct node for each community and represent their hierarchical structure as an interconnected graph. Later, we will also store community summaries and other attributes as node properties. The authors also introduce a indicating the number of distinct text chunks in which the entities within the community appear. Now let's examine a sample hierarchical structure with many intermediate communities merging at higher levels. The communities are non-overlapping, meaning that each entity belongs to precisely a single community at each level. The image represents a hierarchical structure
Starting point is 00:24:34 resulting from the Leiden community detection algorithm. The purple nodes represent individual entities, while the orange nodes represent hierarchical communities. The hierarchy shows the organization of these entities into various communities, with smaller communities merging into larger ones on higher levels. Let's now examine how smaller communities merge at higher levels. This image illustrates that less connected entities and consequently smaller communities experience minimal changes across levels. For example, the community structure here only changes in the first two levels but remains identical fourth last three levels. Consequently, the hierarchical levels often appear redundant for these entities,
Starting point is 00:25:14 as the overall organization does not significantly alter at different tiers. Let's examine the number of communities and their sizes and different levels in more detail in the original implementation. Communities on every level were summarized. In our case, that would be 8,590 communities and, consequently, 8,590 LLM calls. I would argue that depending on the hierarchical community structure, not every level needs to be summarized. For example, the difference between the last and the next-to-last level is only four communities. Therefore, we would be creating a lot of redundant summaries. One solution is to create an implementation that can make a single summary for communities on different levels that don't change. Another one would be to collapse community hierarchies that
Starting point is 00:26:03 don't change. Also, I am unsure if community hierarchies that don't change. Also, I am unsure if we want to summarize communities with only one member, as they might not provide much value or information. Here, we will summarize communities on levels 0, 1, and 4. First, we need to retrieve their information from the database at the moment. The community information has the following structure now. We need to prepare an LLM prompt that generates a natural language summarization based on the information provided by the elements of our community. We can take some inspiration from the prompt the researchers used. The authors not only summarized communities but also generated findings for each of them.
Starting point is 00:26:40 A finding can be defined as concise information regarding a specific event or a piece of information. One such example my intuition suggests that extracting findings with just a single pass might not be as comprehensive as we need, much like extracting entities and relationships. Furthermore, I have not found any references or examples of their use in their code in either local or global search retrievers. As a result, we'll refrain from extracting findings in this instance. Or, as academics often put it, this exercise is left to the reader. Additionally, we have also skipped the claims or covariate information extraction, which looks similar to findings at first glance. The prompt we'll use to produce the community summaries is fairly straightforward.
Starting point is 00:27:23 The only thing left is to turn community representations into strings to reduce the number of tokens by avoiding JSON token overhead and wrap the chain as a function now we can generate community summaries for the selected levels. Again we parallelize calls for faster execution one aspect I didn't mention is that the authors also address the potential issue of exceeding context size when inputting community information. As graphs expand, the communities can grow significantly as well. In our case, the largest community comprised 545 members. Given that GPT-4-0 has a context size exceeding 100,000 tokens, we decided to skip this step. As our final step, we will store the community summaries back to the database the final graph structure. The graph now contains the original documents, extracted entities and
Starting point is 00:28:10 relationships, as well as hierarchical community structure and summaries. Summary. The authors of the From Local to Global paper have done a great job in demonstrating a new approach to GraphRag. They show how we can combine and summarize information from various documents into a hierarchical knowledge graph structure. One thing that isn't explicitly mentioned is that we can also integrate structured data sources in a graph. The input doesn't have to be limited to unstructured text only. What I particularly appreciate about their extraction approach is that they capture descriptions
Starting point is 00:28:42 for both nodes and relationships. Descriptions allow the LLM to retain more information than reducing everything to just node IDs and relationship types. Additionally, they demonstrate that a single extraction pass over the text might not capture all relevant information and introduce logic to perform multiple passes if necessary. The authors also present an interesting idea for performing summarizations over graph communities, allowing us to embed and index condensed topical information across multiple data sources. In the next blog post, we will go over the local and global search retriever
Starting point is 00:29:15 implementations and talk about other approaches we could implement based on the given graph structure. As always, the code is available on GitHub. This time, I've also uploaded the database dump so you can explore the results and experiment with different retriever options. You can also import this dump into a forever free Neo4j AuraDB instance, which we can use for the retrieval explorations since we don't need graph data science algorithms for those, just graph pattern matching, vector, and full text indexes. Learn more about the Neo4j integrations with all the GenAI frameworks and practical graph algorithms in my book, Graph Algorithms for Data Science. To learn more about this topic, join us at Nodes 2024 on November 7,
Starting point is 00:29:58 our free virtual developer conference on intelligent apps, knowledge graphs, and AI. Register now. Thank you for listening to this Hackernoon story, read graphs, and AI. Register now.
