The Good Tech Companies - Building Knowledge Graphs for RAG: Exploring GraphRAG with Neo4j and LangChain
Episode Date: October 22, 2024. This story was originally published on HackerNoon at: https://hackernoon.com/building-knowledge-graphs-for-rag-exploring-graphrag-with-neo4j-and-langchain. Combine text extraction, network analysis, and LLM prompting and summarization for improved RAG accuracy. Check more stories related to tech-stories at: https://hackernoon.com/c/tech-stories. You can also check exclusive content about #graphrag, #retrieval-augmented-generation, #knowledge-graph, #neo4j, #langchain, #llms, #llmgraphtransformer, #good-company, and more. This story was written by: @neo4j. Learn more about this writer by checking @neo4j's about page, and for more stories, please visit hackernoon.com. This article explores the implementation of a "From Local to Global" GraphRAG pipeline using Neo4j and LangChain. It covers the process of constructing knowledge graphs from text, summarizing communities of entities using Large Language Models (LLMs), and enhancing Retrieval-Augmented Generation (RAG) accuracy by combining graph algorithms with LLM-based summarization. The approach condenses information from multiple sources into structured graphs and generates natural language summaries, offering an efficient method for complex information retrieval.
Transcript
This audio is presented by Hacker Noon, where anyone can learn anything about any technology.
Building Knowledge Graphs for RAG: Exploring GraphRAG with Neo4j and LangChain, by Neo4j.
I am always intrigued by new approaches to implementing retrieval-augmented generation (RAG) over graphs, often called GraphRAG. However, it seems that everyone has a different implementation in mind when they hear the term GraphRAG. In this blog post, we will dive deep into the "From Local to Global" GraphRAG article and implementation by Microsoft researchers. We will cover the knowledge graph construction and summarization part and leave the retrievers for the next blog post.
The researchers were so kind as to provide us with the code repository,
and they have a project page as well. The approach taken in the article mentioned above is quite
interesting. As far as I understand, it involves using a knowledge graph as a step in the pipeline
for condensing and combining information from multiple sources. Extracting entities and
relationships from text is nothing new. However, the authors introduce a novel, at least to me,
idea of summarizing condensed graph structure and information back as natural language text.
The pipeline begins with input text from documents, which are processed to generate a graph.
The graph is then converted back into natural language text, where the generated text contains
condensed information about specific entities or graph communities previously spread
across multiple documents. At a very high level, the inputs to the GraphRAG pipeline are source
documents containing various information. The documents are processed using an LLM to extract
structured information about entities appearing in the papers along with their relationships.
This extracted structured information is then used to construct a knowledge graph. The advantage of using a knowledge graph data representation is that it can quickly and
straightforwardly combine information from multiple documents or data sources about particular
entities. As mentioned, the knowledge graph is not the final data representation, though.
After the knowledge graph has been constructed, they use a combination of graph algorithms and
LLM prompting to generate natural language summaries of communities of entities found
in the knowledge graph. These summaries then contain condensed information spreading across
multiple data sources and documents for particular entities and communities.
For a more detailed understanding of the pipeline, we can refer to the step-by-step
description provided in the original paper.
Steps in the pipeline. Image from the GraphRAG paper, licensed under CC BY 4.0.

The following is a high-level summary of the pipeline that we will use to reproduce their approach using Neo4j and LangChain.

Indexing (graph generation):
1. Source documents to text chunks. Source documents are split into smaller text chunks for processing.
2. Text chunks to element instances. Each text chunk is analyzed to extract entities and relationships, producing a list of tuples representing these elements.
3. Element instances to element summaries. Extracted entities and relationships are summarized by the LLM into descriptive text blocks for each element.
4. Element summaries to graph communities. These entity summaries form a graph, which is then partitioned into communities using algorithms like Leiden for hierarchical structure.
5. Graph communities to community summaries. Summaries of each community are generated with the LLM to understand the dataset's global topical structure and semantics.

Retrieval (answering):
6. Community summaries to global answers. Community summaries are used to answer a user query by generating intermediate answers, which are then aggregated into a final global answer.
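At a very high level, the indexing and retrieval flow above can be sketched as a chain of stubbed-out functions. The function names and stub bodies below are purely illustrative and not taken from the GraphRAG codebase:

```python
# Minimal sketch of the pipeline steps above, with every step stubbed out.
# Function names are illustrative, not from the GraphRAG repository.

def chunk(documents):
    # Source documents -> text chunks (stubbed: one chunk per document)
    return [doc for doc in documents]

def extract_elements(chunk_text):
    # Text chunk -> (entity, relationship, entity) tuples (stubbed)
    return [("Neo4j", "INTEGRATES_WITH", "LangChain")]

def summarize_elements(elements):
    # Element instances -> element summaries (stubbed)
    return {e[0]: f"Entity {e[0]}" for e in elements}

def detect_communities(summaries):
    # Element summaries -> graph communities (stubbed: one community)
    return [list(summaries)]

def summarize_communities(communities):
    # Graph communities -> community summaries (stubbed)
    return [f"Community of {len(c)} entities" for c in communities]

docs = ["Neo4j integrates with LangChain."]
chunks = chunk(docs)
elements = [t for c in chunks for t in extract_elements(c)]
summaries = summarize_elements(elements)
community_summaries = summarize_communities(detect_communities(summaries))
print(community_summaries)
```

In the real pipeline, the extraction and summarization stubs are LLM calls and the community detection is a graph algorithm, as described in the rest of this post.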
Note that my implementation was done before their code was available,
so there might be slight differences in the underlying approach or LLM prompts being used.
I'll try to explain those differences as
we go along. The code is available on GitHub.

Setting up the Neo4j environment. We will use Neo4j as the underlying graph store. The easiest way to get started is to use a free instance of Neo4j Sandbox, which offers cloud instances of the Neo4j database with the Graph Data Science plugin installed. Alternatively, you can set up a local instance of the Neo4j database by downloading the Neo4j Desktop application and creating a local database instance. If you are using a local version, make sure to install both the APOC and GDS plugins. For production setups, you can use the paid, managed AuraDS instance, which provides the GDS plugin.
We start by creating a Neo4jGraph instance, which is the convenience wrapper available in LangChain.

Dataset. We will use a news article dataset I created some time ago using the Diffbot API. I have uploaded it to my GitHub for easier reuse. Let's examine the first couple of rows from the dataset. We have the title and text of the articles available, along with their publishing date and token count calculated using the tiktoken library.

Text chunking. The text chunking step is crucial and
significantly impacts downstream results. The paper authors found that using smaller text chunks
results in extracting more entities overall. Number of extracted entities given the size of text chunks. Image from the GraphRAG paper, licensed under CC BY 4.0. As
you can see, using text chunks of 2400 tokens results in fewer extracted entities than when
they used 600 tokens. Additionally, they identified that LLMs might not extract all
entities on the first run.
In that case, they introduce heuristics to perform the extraction multiple times.
We will talk about that more in the next section. However, there are always trade-offs.
Using smaller text chunks can result in losing the context and coreferences of specific entities spread across the documents. For example, if a document mentions John and he in separate sentences, breaking the text into smaller chunks might make it unclear that he refers to John. Some of the coreference issues can be solved using an overlapping text chunking
strategy, but not all of them. Let's examine the size of our article texts. The distribution of
article token counts is approximately normal, with a peak of around 400 tokens. The frequency of chunks gradually
increases up to this peak, then decreases symmetrically, indicating most text chunks
are near the 400-token mark. Due to this distribution, we will not perform any text chunking here, to avoid coreference issues. By default, the GraphRAG project uses chunk sizes of 300 tokens with 100 tokens of overlap.

Extracting nodes and relationships. The next step is constructing a knowledge graph from the text chunks. For this use case, we use an LLM to extract structured information in the form of nodes and relationships from the text. You can examine the LLM prompt the authors
used in the paper. They have LLM
prompts where we can predefine node labels if needed, but by default, that's optional.
Additionally, the extracted relationships in the original documentation don't really have a type,
only a description. I imagine the reason behind this choice is to allow the LLM to extract and
retain richer and more nuanced information as relationships.
However, it's difficult to have a clean knowledge graph without relationship type specifications; the descriptions could instead go into a property. In our implementation, we will use the LLMGraphTransformer, which is available in the LangChain library. Instead of using pure prompt engineering, as the implementation in the paper does, the LLMGraphTransformer uses the built-in function-calling support to extract structured information (structured-output LLMs in LangChain). You can inspect the system prompt in this example. We use GPT-4o for graph extraction. The authors specifically instruct the LLM to extract entities
and relationships and their descriptions.
With the LangChain implementation, you can use the provided attributes to specify which node or relationship properties you want the LLM to extract. The difference with the LLMGraphTransformer implementation is that all node or relationship properties are optional, so not all nodes will have the description property. If we wanted, we could define a custom extraction to have a mandatory description property, but we will skip that in this implementation. We will parallelize the requests
to make the graph extraction faster and store the results to Neo4j. In this example, we extract graph information from 2,000 articles and store the results to Neo4j. We have extracted around 13,000
entities and 16,000 relationships.
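The parallelized extraction loop can be sketched roughly as follows. Here, extract_graph is a placeholder for the real LLMGraphTransformer call; since each call is I/O-bound, a thread pool gives near-linear speedup over sequential requests:

```python
from concurrent.futures import ThreadPoolExecutor

def extract_graph(article):
    # Placeholder for the real LLM-based extraction call (I/O-bound),
    # returning a graph document with nodes and relationships.
    return {"title": article["title"], "nodes": [], "relationships": []}

# Stand-in for the 2,000-article dataset.
articles = [{"title": f"Article {i}"} for i in range(10)]

# pool.map preserves input order, so results line up with articles.
with ThreadPoolExecutor(max_workers=4) as pool:
    graph_documents = list(pool.map(extract_graph, articles))

print(len(graph_documents))
```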
Here is an example of an extracted document in the graph.
It takes about 35 (plus or minus 5) minutes to complete the extraction, and it costs about $30 with GPT-4o.
In this step, the authors introduce heuristics to decide whether to extract graph information in more than one pass.
For simplicity's sake, we will only do one pass.
However, if we wanted to do multiple passes, we could put the first extraction results as conversational history
and simply instruct the LLM that many entities are missing and it should extract more, as the GraphRAG authors do. Previously, I mentioned how vital text chunk size is and how it affects the
number of entities extracted. Since we didn't
perform any additional text chunking, we can evaluate the distribution of extracted entities
based on text chunk size. The scatter plot shows that while there is a positive trend, indicated by the trend line, the relationship is sublinear. Most data points cluster at lower
entity counts even as token counts increase. This indicates that the number of entities extracted
does not scale proportionally with the size of the text chunks. Although some outliers exist,
the general pattern shows that higher token counts do not consistently lead to higher entity counts.
This validates the author's finding that lower text chunk sizes will extract more information.
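As an aside, a token-level sliding-window chunker along the lines of the GraphRAG default (300-token chunks, 100-token overlap) can be sketched as follows. The token list here stands in for real tokenizer output (for example, from tiktoken), and the sizes are parameters:

```python
def chunk_tokens(tokens, size=300, overlap=100):
    # Sliding window: each chunk shares `overlap` tokens with the previous one,
    # which preserves some cross-chunk coreference context.
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks

tokens = list(range(700))  # stand-in for a tokenized article
chunks = chunk_tokens(tokens, size=300, overlap=100)
print([len(c) for c in chunks])
```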
I also thought it would be interesting to inspect the node degree distributions of the constructed graph. The following code retrieves and visualizes node
degree distributions. The node degree distribution follows a power-law pattern, indicating most nodes have very few connections while a few nodes are highly connected. The mean degree is 2.45 and the median is 1.00, showing that more than half the nodes have only one connection.
Most nodes, 75%, have two or fewer connections, and 90% have five or fewer.
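Summary statistics like these can be computed from a plain edge list; the toy edges below are illustrative, not from the actual dataset:

```python
from collections import Counter
from statistics import mean, median

# Toy undirected edge list standing in for the extracted entity graph.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("e", "f")]

# Each edge contributes one degree to both endpoints.
degree = Counter()
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

degrees = sorted(degree.values())
print(mean(degrees), median(degrees))
```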
This distribution is typical of many real-world networks, where a small number of hubs have many
connections, and most nodes have few. Since both node and relationship descriptions are not
mandatory properties, we will also examine
how many were extracted. The results show that 5,926 nodes out of 12,994 (45.6%) have the description property. On the other hand, only 5,569 relationships out of 15,921 (35%) have such a property. Note that due to the probabilistic nature of LLMs,
the numbers can vary on different runs and different source data, LLMs, and prompts.
Entity resolution. Entity resolution (deduplication) is crucial when constructing
knowledge graphs because it ensures that each entity is uniquely and accurately represented,
preventing duplicates and merging records that refer to the same real-world entity. This process is essential
for maintaining data integrity and consistency within the graph. Without entity resolution,
knowledge graphs would suffer from fragmented and inconsistent data, leading to errors and
unreliable insights. This image demonstrates how a single real-world entity might appear
under slightly different
names in different documents and, consequently, in our graph. Moreover, sparse data becomes a
significant issue without entity resolution. Incomplete or partial data from various sources
can result in scattered and disconnected pieces of information, making it difficult to form a
coherent and comprehensive understanding of entities. Accurate entity resolution addresses this by consolidating data, filling in gaps, and creating a unified view of
each entity. Before/after using Senzing entity resolution to connect the International Consortium of Investigative Journalists (ICIJ) Offshore Leaks data. Image from Paco Nathan. The left part of the visualization presents a sparse and unconnected graph. However, as shown on the right-hand side, such a graph can become well connected with efficient entity resolution. Overall, entity resolution enhances the efficiency of data retrieval and integration, providing a cohesive view of information across different sources. It ultimately enables more effective question answering based on a reliable and complete knowledge graph. Unfortunately, the authors of the GraphRAG paper did not include
any entity resolution code in their repo, although they mention it in their paper.
One reason for leaving this code out could be that it is tough to implement a robust and
well-performing entity resolution for any given domain. You can implement custom heuristics for different nodes when dealing with predefined types of nodes. Even then, extracted labels aren't always consistent; for example, the same kind of entity might be labeled company, organization, or business. However, if the node labels or types aren't known in advance,
as in our case, this becomes an even harder problem. Nonetheless, we will implement
a version of entity resolution in our project here, combining text embeddings and graph algorithms
with word distance and LLMs. Our process for entity resolution involves the following steps.
1. Entities in the graph. Start with all entities within the graph.
2. K-nearest neighbor graph. Construct a k-nearest neighbor graph, connecting similar entities based on text embeddings.
3. Weakly connected components. Identify weakly connected components in the k-nearest neighbor graph, grouping entities that are likely to be similar. Add a word-distance filtering step after these components have been identified.
4. LLM evaluation. Use an LLM to evaluate these components and decide whether the entities within each component should be merged, resulting in a final decision on entity resolution. For example, it might merge two differently spelled references to Silicon Valley Bank, while rejecting the merge for different dates like September 16, 2023 and September 2, 2023.

We begin by calculating text
embeddings for the name and description properties of entities. We can use the corresponding method in the LangChain integration to achieve this. We can use these embeddings to find potential candidates that are similar based on the cosine distance of these embeddings. We will use graph algorithms available in the Graph Data Science (GDS) library; we can use the GDS Python client for ease of use in a Pythonic way.
If you are not familiar with the GDS library, we first have to project an in-memory graph
before we can execute any graph algorithms. First, the Neo4j stored graph is projected
into an in-memory graph for faster processing and analysis.
Next, a graph algorithm is executed on the in-memory graph.
Optionally, the algorithm's results can be stored back into the Neo4j database.
Learn more about it in the documentation.
To create the k-nearest neighbor graph, we will project all entities along with their text embeddings. Now that the graph is projected, we can execute graph algorithms. We will begin by constructing a k-nearest neighbor graph. The two most important parameters influencing how sparse or dense the k-nearest graph will be are topK and similarityCutoff. topK is the number of neighbors found for each node, with a minimum value of 1. similarityCutoff filters out relationships with similarity below the given threshold. Here, we will use the default topK of 10 and a relatively high similarity cutoff of 0.95. Using a high similarity cutoff, such as 0.95, ensures that only highly similar pairs are considered matches, minimizing false positives and improving accuracy. Since we want to store the results back to the projected in-memory graph instead of the knowledge graph, we will use the mutate mode of the algorithm. The next step is to identify
groups of entities that are connected with the newly inferred similarity relationships.
Identifying groups of connected nodes is a frequent process in network analysis,
often called community detection or clustering, which involves finding subgroups of densely
connected nodes. In this example,
we will use the weakly connected components algorithm, which helps us find parts of a graph
where all nodes are connected, even if we ignore the direction of the connections.
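We run this with the GDS implementation, but the idea itself is simple enough to sketch in pure Python. The following is an illustrative flood-fill version, not the code we actually execute:

```python
from collections import defaultdict, deque

def weakly_connected_components(edges):
    # Treat every edge as undirected and flood-fill from each unvisited node.
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, components = set(), []
    for node in adjacency:
        if node in seen:
            continue
        queue, component = deque([node]), set()
        seen.add(node)
        while queue:
            current = queue.popleft()
            component.add(current)
            for neighbor in adjacency[current]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components

# Toy similarity edges between candidate duplicate entities.
edges = [("sony", "sony inc"), ("google", "google inc"), ("google inc", "alphabet")]
components = weakly_connected_components(edges)
print(sorted(len(c) for c in components))
```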
We use the algorithm's write mode to store the results back to the database (the stored graph). Text embedding comparison helps find potential duplicates, but it is only part of the entity resolution process. For example, Google and Apple are very close in the embedding space (0.96 cosine similarity using the embedding model). The same goes for BMW and Mercedes-Benz (0.97 cosine similarity).
High text embedding similarity is a good start, but we can improve on it. Therefore, we will add an additional filter allowing only pairs of words with a text edit distance of 3 or fewer, meaning that at most three characters can differ between them.
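The actual filter runs as a Cypher statement, but the underlying idea is a textbook Levenshtein edit distance. A sketch, with illustrative candidate pairs:

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance (insert/delete/substitute).
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,                  # deletion
                               current[j - 1] + 1,               # insertion
                               previous[j - 1] + (ca != cb)))    # substitution
        previous = current
    return previous[-1]

# Keep only candidate pairs within 3 edits of each other; embedding
# similarity alone would also pair up e.g. Google and Apple.
candidates = [("Silicon Valley Bank", "Silicon_Valley_Bank"),
              ("Google", "Apple")]
kept = [(a, b) for a, b in candidates if levenshtein(a, b) <= 3]
print(kept)
```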
This Cypher statement is slightly more involved, and its interpretation is beyond the scope of this blog post. You can always ask an LLM to interpret it. Additionally, the word-distance cutoff could be a function of the length of the word instead of a single number, and the implementation could be more scalable.
What is important is that it outputs groups of potential entities we might want to merge.
Here is a list of potential nodes to merge. As you can see, our resolution approach works better for some node types than others. Based on a quick examination, it seems to work better for people and organizations,
while it's pretty bad for dates. If we used predefined node types, we could prepare different
heuristics for various node types. In this example, we do not have predefined node labels,
so we will turn to an LLM to make the final decision about whether entities should be
merged or not. First, we need to formulate the LLM prompt to effectively guide and inform the final decision regarding the merging of the nodes. I always like to use the structured output method in LangChain when expecting structured data output, to avoid having to parse the outputs manually. Here, we will define the output as a list of lists, where each inner list contains the entities that should be merged. This structure is used to handle scenarios where, for example, the input contains candidates from several distinct organizations. In such cases, you would want to merge Sony and Sony Inc. separately from Google and Google Inc.
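The expected output shape, and one way merge groups might be applied to entity names, can be sketched as follows. Choosing each group's first name as the canonical one is an illustrative assumption, not the actual merge logic:

```python
def apply_merge_groups(entities, merge_groups):
    # Map every entity in a group to the group's first (canonical) name.
    canonical = {}
    for group in merge_groups:
        for name in group:
            canonical[name] = group[0]
    return [canonical.get(e, e) for e in entities]

# The structured LLM output: one inner list per set of entities judged
# to refer to the same real-world thing.
merge_groups = [["Sony", "Sony Inc"], ["Google", "Google Inc"]]
entities = ["Sony", "Sony Inc", "Google", "Google Inc", "Apple"]
print(apply_merge_groups(entities, merge_groups))
```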
Next, we integrate the LLM prompt with the structured output to create a chain using LangChain Expression Language syntax and encapsulate it within a function. We need to run all potential candidate nodes through the function to decide whether they should be merged. To speed up the process, we will again parallelize the LLM calls. The final step of entity resolution involves taking the results from the LLM and writing them back to the database by merging the specified nodes.
This entity resolution is not perfect, but it gives us a starting point upon which we can improve. Additionally, we can improve the logic for determining which entities
should be retained. Element summarization. In the next step, the authors perform an element summarization step. Essentially, every node and relationship gets passed through an entity summarization prompt. The authors note the novelty and interest of their approach: "Overall, our use of rich descriptive text for homogeneous nodes in a potentially noisy graph structure is aligned with both the capabilities of LLMs and the needs of global, query-focused summarization. These qualities also differentiate our graph index from typical knowledge graphs, which rely on concise and consistent knowledge triples (subject, predicate, object) for downstream reasoning tasks." The idea is exciting. We still extract subject and object IDs or names
from text, which allows us to link relationships to correct entities, even when entities appear
across multiple text chunks. However, the relationships are not reduced to a single type. Instead, the relationship type is actually free-form text that allows us to retain richer and more nuanced information. Additionally, the entity information is summarized using an LLM, allowing us to embed and index this information and entities more
efficiently for more accurate retrieval. One could argue that this richer and
more nuanced information could also be retained by adding additional, possibly arbitrary, node
and relationship properties. One issue with arbitrary node and relationship properties is
that it could be hard to extract the information consistently because the LLM might use different
property names or focus on various details on every execution. Some of these problems could be solved using predefined property names with additional type and description information.
In that case, you would need a subject matter expert to help define those properties,
leaving little room for an LLM to extract any vital information outside the predefined
descriptions. It's an exciting approach to represent richer information in a knowledge graph.
One potential issue with the element summarization step is that it does not scale well since it
requires an LLM call for every entity and relationship in the graph. Our graph is
relatively tiny with 13,000 nodes and 16,000 relationships. Even for such a small graph,
we would require 29,000 LLM calls, and each call would use a couple hundred tokens,
making it quite expensive and time-intensive. Therefore, we will avoid this step here. We
can still use the description properties extracted during the initial text processing.
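The back-of-the-envelope arithmetic behind that decision is easy to reproduce; the per-call token figure below is an assumption based on the "couple hundred tokens" estimate:

```python
# Element summarization would need one LLM call per node and per relationship.
nodes, relationships = 13_000, 16_000
llm_calls = nodes + relationships
tokens_per_call = 300  # assumption: "a couple hundred tokens" per call
total_tokens = llm_calls * tokens_per_call
print(llm_calls, total_tokens)
```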
Constructing and Summarizing Communities
The final step in the graph construction and indexing process involves identifying communities
within the graph. In this context,
a community is a group of nodes that are more densely connected to each other than to the rest
of the graph, indicating a higher level of interaction or similarity. The following
visualization shows an example of community detection results. Once these entity communities
are identified with a clustering algorithm, an LLM generates a summary for each community,
providing insights
into their individual characteristics and relationships. Again, we use the Graph Data
Science Library. We start by projecting an in-memory graph. To follow the original article
precisely, we will project the graph of entities as an undirected weighted network, where the weight represents the number of connections between two entities. The authors employed the Leiden algorithm, a hierarchical clustering method, to identify communities
within the graph. One advantage of using a hierarchical community detection algorithm
is the ability to examine communities at multiple levels of granularity. The authors suggest
summarizing all communities at each level, providing a comprehensive understanding of
the graph's structure. First, we will use the Weakly Connected Components, WCC, algorithm to assess the
connectivity of our graph. This algorithm identifies isolated sections within the graph,
meaning it detects subsets of nodes or components that are connected to each other but not to the
rest of the graph. These components help us understand the fragmentation within the network and identify
groups of nodes that are independent from others. WCC is vital for analyzing the overall structure
and connectivity of the graph. The WCC algorithm results identified 1,119 distinct components.
Notably, the largest component comprises 9,109 nodes, common in real-world networks where a
single supercomponent coexists with numerous smaller isolated components. The smallest component has one node, and the average component size is about 11.3 nodes.
Next, we will run the Leiden algorithm, which is also available in the GDS library,
and enable the includeIntermediateCommunities parameter to return and store communities at all levels. We have also included the relationshipWeightProperty parameter to run the weighted variant of the Leiden algorithm. Using the write mode of the algorithm stores the results as a node property.
The algorithm identified five levels of communities, with the highest (least granular) level, where communities are largest, having 1,188 communities (as opposed to 1,119 components). Here is the visualization of
the communities on the last level using Gephi. Visualizing more than 1000 communities is hard.
Even picking the colors for each one is practically impossible. However, they make for nice artistic
renditions. Building on this, we will create a distinct node for each community and represent their hierarchical structure as an interconnected graph. Later, we will also store
community summaries and other attributes as node properties. The authors also introduce a community rank, indicating the number of distinct text chunks in which the entities within the community appear.
Now let's examine a sample hierarchical structure with many intermediate communities merging at higher levels. The communities are non-overlapping, meaning that each entity belongs
to precisely a single community at each level. The image represents a hierarchical structure
resulting from the Leiden community detection algorithm. The purple nodes represent individual
entities, while the orange nodes represent hierarchical communities. The hierarchy shows
the organization of these entities into various communities, with smaller communities merging
into larger ones on higher levels. Let's now examine how smaller communities merge at higher
levels. This image illustrates that less connected entities and consequently smaller communities
experience minimal changes across levels. For example, the community structure here only changes in the first two levels but remains identical for the last three levels.
Consequently, the hierarchical levels often appear redundant for these entities,
as the overall organization does not significantly alter at different tiers.
Let's examine the number of communities, and their sizes at different levels, in more detail. In the original implementation, communities on every level were summarized. In our case, that would be 8,590
communities and, consequently, 8,590 LLM calls. I would argue that depending on the hierarchical
community structure, not every level needs to be summarized. For example, the difference between the last and the next-to-last level is
only four communities. Therefore, we would be creating a lot of redundant summaries.
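One illustrative way to detect such redundancy is to keep only the levels whose partition differs from the previously kept one. This is a sketch with toy data, not code from the GraphRAG project:

```python
def collapse_redundant_levels(levels):
    # Keep a level only if its community partition differs from the
    # previous kept level, so unchanged tiers are summarized once.
    kept = []
    for level, partition in enumerate(levels):
        if not kept or levels[kept[-1]] != partition:
            kept.append(level)
    return kept

# Each level maps entity -> community id (non-overlapping partitions).
levels = [
    {"a": 0, "b": 1, "c": 1},
    {"a": 0, "b": 0, "c": 0},
    {"a": 0, "b": 0, "c": 0},  # identical to the previous level -> redundant
]
print(collapse_redundant_levels(levels))
```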
One solution is to create an implementation that can make a single summary for communities on
different levels that don't change. Another one would be to collapse community hierarchies that
don't change. Also, I am unsure if community hierarchies that don't change. Also,
I am unsure if we want to summarize communities with only one member, as they might not provide
much value or information. Here, we will summarize communities on levels 0, 1, and 4. First, we need to retrieve their information from the database. At the moment, the community information has the following structure. Now we need to prepare an LLM
prompt that generates a natural language summarization based on the information provided
by the elements of our community. We can take some inspiration from the prompt the researchers used.
The authors not only summarized communities but also generated findings for each of them.
A finding can be defined as concise information regarding a specific event or a piece of information. Here is one such example. My intuition suggests that extracting findings with just
a single pass might not be as comprehensive as we need, much like extracting entities and
relationships. Furthermore, I have not found any references or examples of their use in their code
in either local or global search retrievers. As a result, we'll refrain from extracting findings in this instance.
Or, as academics often put it, this exercise is left to the reader.
Additionally, we have also skipped the claims or covariate information extraction,
which looks similar to findings at first glance.
The prompt we'll use to produce the community summaries is fairly straightforward.
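To keep the prompt compact, a community's nodes and relationships can be flattened to plain text rather than JSON. A minimal sketch of that flattening, with hypothetical field names (the actual community structure retrieved from the database differs):

```python
def community_to_string(nodes, relationships):
    # Flatten a community into plain lines instead of JSON, saving the
    # tokens spent on braces, quotes, and repeated keys.
    lines = ["Nodes:"]
    lines += [f"- {n['id']}: {n.get('description', '')}" for n in nodes]
    lines.append("Relationships:")
    lines += [f"- {r['start']} -> {r['end']}: {r.get('description', '')}"
              for r in relationships]
    return "\n".join(lines)

# Hypothetical community content, for illustration only.
nodes = [{"id": "Neo4j", "description": "graph database"}]
relationships = [{"start": "Neo4j", "end": "LangChain",
                  "description": "integrates with"}]
text = community_to_string(nodes, relationships)
print(text)
```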
The only thing left is to turn community representations into strings, to reduce the number of tokens by avoiding JSON token overhead, and to wrap the chain as a function. Now we can generate community summaries for the selected levels. Again, we parallelize the calls for faster execution. One aspect I didn't mention is that
the authors also address the potential issue of exceeding context size when inputting community information. As graphs expand, the communities can grow significantly as well. In our case,
the largest community comprised 545 members. Given that GPT-4o has a context size exceeding 100,000 tokens, we decided to skip this step. As our final step, we will store the community summaries back to the database. The final graph structure. The graph now contains the original documents, extracted entities and
relationships, as well as hierarchical community structure and summaries. Summary. The authors of
the From Local to Global paper have done a great job of demonstrating a new approach to GraphRAG.
They show how we can combine and summarize information from various documents into a
hierarchical knowledge graph structure.
One thing that isn't explicitly mentioned is that we can also integrate structured data
sources in a graph.
The input doesn't have to be limited to unstructured text only.
What I particularly appreciate about their extraction approach is that they capture descriptions
for both nodes and relationships.
Descriptions allow the LLM to retain more information than reducing everything to just
node IDs and relationship types. Additionally, they demonstrate that a single extraction pass
over the text might not capture all relevant information and introduce logic to perform
multiple passes if necessary. The authors also present an interesting idea for performing
summarizations over graph
communities, allowing us to embed and index condensed topical information across multiple
data sources. In the next blog post, we will go over the local and global search retriever
implementations and talk about other approaches we could implement based on the given graph
structure. As always, the code is available on GitHub. This time, I've also uploaded the database dump so you
can explore the results and experiment with different retriever options. You can also
import this dump into a forever free Neo4j AuraDB instance, which we can use for the retrieval
explorations since we don't need graph data science algorithms for those, just graph pattern
matching, vector, and full text indexes. Learn more about the Neo4j integrations
with all the GenAI frameworks and practical graph algorithms in my book, Graph Algorithms
for Data Science. To learn more about this topic, join us at Nodes 2024 on November 7,
our free virtual developer conference on intelligent apps, knowledge graphs, and AI.
Register now. Thank you for listening to this HackerNoon story.