The Good Tech Companies - How to Save $70K Building a Knowledge Graph for RAG on 6M Wikipedia Pages

Episode Date: October 15, 2024

This story was originally published on HackerNoon at: https://hackernoon.com/how-to-save-$70k-building-a-knowledge-graph-for-rag-on-6m-wikipedia-pages. We show how content-centric knowledge graphs – a vector store allowing links between chunks – are an easy-to-use and efficient approach to improving RAG results. Check more stories related to machine-learning at: https://hackernoon.com/c/machine-learning. You can also check exclusive content about #vector-search, #generative-ai, #retrieval-augmented-generation, #knowledge-graphs, #graphvectorstore, #scalable-ai-solutions, #langchain-integrations, #good-company, and more. This story was written by: @datastax. Learn more about this writer by checking @datastax's about page, and for more stories, please visit hackernoon.com. We've argued that content-centric knowledge graphs – a vector store allowing links between chunks – are an easier-to-use and more efficient approach to improving RAG results. Here, we put that to the test.

Transcript
Starting point is 00:00:00 This audio is presented by Hacker Noon, where anyone can learn anything about any technology. How to save $70,000 building a knowledge graph for RAG on 6M Wikipedia pages. By DataStax. Using knowledge graphs to improve the results of retrieval-augmented generation (RAG) applications has become a hot topic. Most examples demonstrate how to build a knowledge graph using a relatively small number of documents. This might be because the typical approach – extracting fine-grained, entity-centric information – just doesn't scale. Running each document through a model to extract the entities (nodes) and relationships (edges) takes too long and costs too much on large datasets. We've argued that content-centric knowledge graphs – a vector store
Starting point is 00:00:46 allowing links between chunks – are an easier-to-use and more efficient approach. Here, we put that to the test. We load a subset of the Wikipedia articles from the 2WikiMultiHop dataset using both techniques, and discuss what this means for loading the entire dataset. We demonstrate the results of some questions over the loaded data. We'll also load the entire dataset – nearly 6 million documents – into a content-centric GraphVectorStore. Entity-centric: LLMGraphTransformer. Loading documents into an entity-centric graph store like Neo4j was done using LangChain's LLMGraphTransformer; the code is based on LangChain's "How to Construct Knowledge Graphs" guide. Content-centric: GraphVectorStore. Loading the data into GraphVectorStore is
Starting point is 00:01:31 roughly the same as loading it into a Vector Store. The only addition is that we compute metadata indicating how each page links to other pages. This is also a good example of how you can add your own links between nodes. Loading benchmarks. Running at 100 rows, the entity-centric approach using GPT-4-0 took 405.93s to extract the graph documents and 10.99s to write them to Neo4j, while the content-centric approach took 1.43s. Extrapolating, it would take 41 weeks to load all 5,989,847 pages using the entity-centric approach and about 24 hours using the content-centric approach but thanks to parallelism the content-centric approach runs in only 2.5 hours assuming the same parallelism benefits
Starting point is 00:02:20 it would still take over four weeks to load everything using the entity-centric approach. I didn't try it, since the estimated cost would be $58,700 – assuming everything worked the first time. Bottom line: the entity-centric approach of extracting knowledge graphs from content using an LLM was both time- and cost-prohibitive at scale. On the other hand, using GraphVectorStore was fast and inexpensive. Example answers. In this section, a few questions drawn from the subset of loaded documents are used to assess the quality of answers. Entity-centric used 7,324 prompt tokens and cost 3 cents to produce basically useless answers, while content-centric used 450 prompt tokens and cost $0.002 to produce concise answers directly answering the questions. It may be surprising that the fine-grained Neo4j graph returns useless answers. Looking at the logging from the chain, we see some of why this happens.
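As an aside, the loading extrapolation quoted in the benchmarks can be reproduced with quick arithmetic. This is a stdlib-only sketch using the timings from the transcript; the one assumption is that per-page cost stays uniform across the full dataset.

```python
# Reproduce the loading-time extrapolation from the benchmark figures above.
# Assumption: per-page time is uniform across all 5,989,847 pages.
PAGES = 5_989_847

entity_s_per_100 = 405.93 + 10.99   # GPT-4o extraction + Neo4j write, per 100 rows
content_s_per_100 = 1.43            # GraphVectorStore write, per 100 rows

entity_weeks = PAGES * entity_s_per_100 / 100 / 3600 / 24 / 7
content_hours = PAGES * content_s_per_100 / 100 / 3600

# Content-centric loading actually finished in 2.5 hours thanks to parallelism;
# apply the same speedup factor to the entity-centric estimate.
speedup = content_hours / 2.5
entity_parallel_weeks = entity_weeks / speedup

print(f"entity-centric, serial:   ~{entity_weeks:.0f} weeks")           # ~41 weeks
print(f"content-centric, serial:  ~{content_hours:.0f} hours")          # ~24 hours
print(f"entity-centric, parallel: ~{entity_parallel_weeks:.1f} weeks")  # ~4.3 weeks
```

The results line up with the transcript's claims: roughly 41 weeks versus 24 hours serially, and over four weeks versus 2.5 hours with parallelism.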
Starting point is 00:03:17 So: the fine-grained schema only returned information about the record label. It makes sense that the LLM wasn't able to answer the question based on the retrieved information. Conclusion. Extracting fine-grained, entity-specific knowledge graphs is time- and cost-prohibitive at scale. When asked questions about the subset of data that was loaded, the additional granularity and extra cost of loading the fine-grained graph returned more tokens to include in the prompt, but generated useless answers. GraphVectorStore takes a coarse-grained, content-centric approach that makes it fast and easy to build a knowledge graph. You can start with your existing code for
Starting point is 00:03:55 populating a vector store using LangChain and add links (edges) between chunks to improve the retrieval process. GraphRAG is a useful tool for enabling generative AI RAG applications to retrieve more deeply relevant contexts, but using a fine-grained, entity-centric approach does not scale to production needs. If you're looking to add knowledge graph capabilities to your RAG application, try GraphVectorStore. By Ben Chambers, DataStax. Thank you for listening to this HackerNoon story, read by Artificial Intelligence. Visit hackernoon.com to read, write, learn and publish.
