The Good Tech Companies - How to Save $70K Building a Knowledge Graph for RAG on 6M Wikipedia Pages
Episode Date: October 15, 2024. This story was originally published on HackerNoon at: https://hackernoon.com/how-to-save-$70k-building-a-knowledge-graph-for-rag-on-6m-wikipedia-pages. We show how content-centric knowledge graphs – a vector store allowing links between chunks – are an easy-to-use and efficient approach to improving RAG results. Check more stories related to machine-learning at: https://hackernoon.com/c/machine-learning. You can also check exclusive content about #vector-search, #generative-ai, #retrieval-augmented-generation, #knowledge-graphs, #graphvectorstore, #scalable-ai-solutions, #langchain-integrations, #good-company, and more. This story was written by: @datastax. Learn more about this writer by checking @datastax's about page, and for more stories, please visit hackernoon.com. We've argued that content-centric knowledge graphs – a vector store allowing links between chunks – are an easier-to-use and more efficient approach to improving RAG results. Here, we put that to the test.
Transcript
This audio is presented by Hacker Noon, where anyone can learn anything about any technology.
How to save $70,000 building a knowledge graph for RAG on 6M Wikipedia pages.
By DataStax. Using knowledge graphs to improve the results of retrieval augmented generation
RAG applications has become a hot topic. Most examples demonstrate how to build a
knowledge graph using a relatively small number of documents. This might be because the typical approach, extracting fine-grained,
entity-centric information, just doesn't scale. Running each document through a model to extract
the entities (nodes) and relationships (edges) takes too long and costs too much to run on large
datasets. We've argued that content-centric knowledge graphs, a vector store
allowing links between chunks, are an easier-to-use and more efficient approach. Here, we put that to
the test. We load a subset of the Wikipedia articles from the 2WikiMultiHop dataset
using both techniques and discuss what this means for loading the entire dataset. We demonstrate the
results of some questions over the loaded data. We'll also
load the entire dataset, nearly 6 million documents, into a content-centric GraphVectorStore.
Entity-centric: LLMGraphTransformer. Loading documents into an entity-centric graph store
like Neo4j was done using LangChain's LLMGraphTransformer. The code is based on LangChain's How to Construct Knowledge Graphs guide.
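The loading code itself isn't read out in this episode, but as a rough sketch of what that LangChain how-to describes, the entity-centric path looks something like the following; the model name, Neo4j credentials, and sample document are placeholders rather than the article's exact code.

```python
# Sketch of the entity-centric path: one LLM call per document to extract
# entities (nodes) and relationships (edges), then a write to Neo4j.
# Model name, Neo4j credentials, and the sample document are placeholders.
from langchain_core.documents import Document
from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain_community.graphs import Neo4jGraph
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o", temperature=0)
llm_transformer = LLMGraphTransformer(llm=llm)

# The per-document LLM extraction is the step that dominates time and cost.
docs = [Document(page_content="Marie Curie, born in 1867, won two Nobel Prizes ...")]
graph_documents = llm_transformer.convert_to_graph_documents(docs)

# Persist the extracted graph, keeping a link back to the source chunk.
graph = Neo4jGraph(url="bolt://localhost:7687", username="neo4j", password="password")
graph.add_graph_documents(graph_documents, include_source=True)
```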
Content-centric: GraphVectorStore. Loading the data into GraphVectorStore is
roughly the same as loading it into a vector store. The only addition is that we compute
metadata indicating how each page links to other pages. This is also a good example of how you can
add your own links between nodes.
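Again, the transcript doesn't include the loading code, but a minimal sketch of the content-centric idea is shown below. The CassandraGraphVectorStore class, the Link and add_links helpers, the table name, and the page/link schema are assumptions based on the graph vector store preview that shipped in langchain_community around the time of the article, not the article's exact code.

```python
# Sketch of the content-centric path: load pages as ordinary Documents, but
# attach link metadata describing which pages each chunk points at (outgoing)
# and can be reached from (incoming). Import paths follow the graph vector
# store preview in langchain_community and may have moved in later releases.
import cassio
from langchain_core.documents import Document
from langchain_community.graph_vectorstores import CassandraGraphVectorStore
from langchain_community.graph_vectorstores.links import Link, add_links
from langchain_openai import OpenAIEmbeddings

cassio.init(auto=True)  # Cassandra / Astra DB connection taken from environment variables

# Tiny stand-in for parsed 2WikiMultiHop page records (title, url, text, outgoing hrefs).
pages = [
    {"title": "Page A", "url": "https://en.wikipedia.org/wiki/A",
     "text": "Page A mentions B ...", "links": ["https://en.wikipedia.org/wiki/B"]},
    {"title": "Page B", "url": "https://en.wikipedia.org/wiki/B",
     "text": "Page B ...", "links": []},
]

def to_document(page: dict) -> Document:
    """Turn one page record into a Document carrying incoming/outgoing link metadata."""
    doc = Document(page_content=page["text"], metadata={"title": page["title"]})
    add_links(doc, Link.incoming(kind="href", tag=page["url"]))   # other pages can link to this one
    for target in page["links"]:
        add_links(doc, Link.outgoing(kind="href", tag=target))    # this page links out to target
    return doc

store = CassandraGraphVectorStore.from_documents(
    [to_document(p) for p in pages],
    embedding=OpenAIEmbeddings(),
    table_name="wiki_pages",  # placeholder table name
)
```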
Loading benchmarks. Running at 100 rows, the entity-centric approach using GPT-4o took 405.93s to extract the graph documents
and 10.99s to write them to Neo4j, while the content-centric approach took 1.43s. Extrapolating,
it would take 41 weeks to load all 5,989,847 pages using the entity-centric
approach, and about 24 hours using the content-centric approach. But thanks to parallelism,
the content-centric approach runs in only 2.5 hours. Assuming the same parallelism benefits,
it would still take over four weeks to load everything using the entity-centric approach.
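Those figures are easy to sanity-check. The short script below is mine, not the article's; it reproduces the extrapolation from the 100-row benchmark numbers, and whether the 41-week figure includes the Neo4j write time is my assumption.

```python
# Back-of-the-envelope check of the extrapolations quoted above (numbers from the
# 100-row benchmark; including the Neo4j write time in the entity-centric total
# is an assumption).
TOTAL_PAGES = 5_989_847
SAMPLE_PAGES = 100

entity_secs_per_100 = 405.93 + 10.99   # GPT-4o extraction + Neo4j write
content_secs_per_100 = 1.43            # GraphVectorStore write

entity_total_s = entity_secs_per_100 / SAMPLE_PAGES * TOTAL_PAGES
content_total_s = content_secs_per_100 / SAMPLE_PAGES * TOTAL_PAGES

print(f"entity-centric, serial : {entity_total_s / 86_400 / 7:.1f} weeks")   # ~41 weeks
print(f"content-centric, serial: {content_total_s / 3_600:.1f} hours")       # ~24 hours

# Parallelism took the content-centric load from ~24 hours to 2.5 hours; applying
# the same speedup to the entity-centric estimate still leaves roughly 4 weeks.
speedup = (content_total_s / 3_600) / 2.5
print(f"entity-centric, parallel: {entity_total_s / 86_400 / 7 / speedup:.1f} weeks")
```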
I didn't try it since the estimated cost would be $58,700, assuming everything worked the first time. Bottom line. The entity-centric approach of extracting knowledge graphs from content using an
LLM was both time and cost prohibitive at scale. On the other hand, using GraphVectorStore was
fast and inexpensive.
Example answers. In this section, a few questions drawn from the subset of loaded documents
are used to assess the quality of the answers. Entity-centric used 7,324 prompt tokens and cost 3 cents to produce basically useless answers, while content-centric used 450 prompt tokens and cost $0.002 to produce concise answers
directly answering the questions. It may be surprising that the fine-grained Neo4j graph
returns useless answers. Looking at the logging from the chain, we see some of why this happens:
the fine-grained schema only returned information about the record label. It makes
sense that the LLM wasn't able to answer
the question based on the retrieved information. Conclusion. Extracting fine-grained, entity-specific
knowledge graphs is time and cost prohibitive at scale. When asked questions about the subset of
data that was loaded, the additional granularity and extra cost of loading the fine-grained graph
returned more tokens to include in the prompt but generated
useless answers. GraphVectorStore takes a coarse-grained, content-centric approach that
makes it fast and easy to build a knowledge graph. You can start with your existing code for
populating a vector store using LangChain and add links (edges) between chunks to improve the retrieval process.
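As an illustration of what that retrieval step might look like, here is a hedged sketch against the same langchain_community graph vector store preview; the "traversal" search type, its depth parameter, and the sample question are assumptions, and `store` refers to the GraphVectorStore built in the loading sketch earlier.

```python
# Sketch: use graph traversal at retrieval time instead of plain similarity search.
# `store` is the GraphVectorStore built in the loading sketch above; the
# "traversal" search type and its depth parameter follow the langchain_community
# preview API and are assumptions, not code from the article.
retriever = store.as_retriever(
    search_type="traversal",
    search_kwargs={"k": 4, "depth": 1},  # retrieve seed chunks, then follow their links one hop
)

question = "Which label released the artist's debut album?"  # hypothetical multi-hop question
for doc in retriever.invoke(question):
    print(doc.metadata.get("title"), "->", doc.page_content[:80])
```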
GraphRAG is a useful tool for enabling generative AI RAG
applications to retrieve more deeply relevant contexts. But using a fine-grained, entity-centric
approach does not scale to production needs. If you're looking to add knowledge graph capabilities
to your RAG application, try GraphVectorStore. By Ben Chambers, DataStax. Thank you for
listening to this HackerNoon story, read by Artificial Intelligence. Visit hackernoon.com to read, write, learn and publish.