The Good Tech Companies - Build Efficient Knowledge Graphs with Relik and LlamaIndex: Entity Linking & Relationship Extraction

Episode Date: November 5, 2024

This story was originally published on HackerNoon at: https://hackernoon.com/build-efficient-knowledge-graphs-with-relik-and-llamaindex-entity-linking-and-relationship-extraction. ... Explore how to construct cost-effective knowledge graphs using Relik for entity linking and Neo4j for relationship extraction, bypassing expensive LLMs. Check more stories related to machine-learning at: https://hackernoon.com/c/machine-learning. You can also check exclusive content about #artificial-intelligence, #knowledge-graph, #entity-linking, #relationship-extraction, #relik-framework, #llamaindex, #llms, #good-company, and more. This story was written by: @neo4j. Learn more about this writer by checking @neo4j's about page, and for more stories, please visit hackernoon.com. Learn how Relik and LlamaIndex enable efficient knowledge graph creation without large language models. This guide covers entity linking, relationship extraction, and Neo4j integration.

Transcript
Starting point is 00:00:00 This audio is presented by HackerNoon, where anyone can learn anything about any technology. Build Efficient Knowledge Graphs with Relik and LlamaIndex: Entity Linking and Relationship Extraction, by Neo4j. Constructing knowledge graphs from text has been a fascinating area of research for quite some time. With the advent of large language models (LLMs), this field has gained more mainstream attention. However, LLMs can be quite costly. An alternative approach is to fine-tune smaller models, which has been supported by academic research, yielding more efficient solutions. Today, we will explore Relik, a
Starting point is 00:00:37 framework for running blazing-fast and lightweight information extraction models, developed by the NLP group at the Sapienza University of Rome. A typical information extraction pipeline without an LLM looks like the following. The image illustrates an information extraction pipeline, starting from input data that consists of the text: "Tomas likes to write blog posts. He is particularly interested in drawing diagrams." The process begins with coreference resolution to identify "Tomas" and "He" as the same entity. Named entity recognition (NER) then identifies entities such as Tomas, blog, and diagram. Entity linking is the process that follows NER, where recognized entities are mapped to corresponding entries in a database or knowledge base. For example,
Starting point is 00:01:25 Tomas is linked to "Tomas Bratanic" (Q12345) and blog to "blog" (Q321), but diagram has no match in the knowledge base. Relationship extraction is the subsequent step, where the system identifies and extracts meaningful relationships between the recognized entities. This example identifies that Tomas has a relationship with blog characterized by WRITES, indicating that Tomas writes blogs. Additionally, it identifies that Tomas has a relationship with diagram characterized by INTERESTED_IN, indicating that Tomas is interested in diagrams. Finally, this structured information, including the entities and the relationships, is stored in a knowledge graph, allowing for organized and accessible data for further analysis or retrieval.
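To make the pipeline's output concrete, here is a minimal illustration of the structured result for this example; the triple format and identifiers below are illustrative placeholders mirroring the example, not the output of any specific model.

```python
# Illustrative output of the extraction pipeline for the example text.
# The (head, relation, tail) triples and Q-identifiers are hypothetical
# placeholders that mirror the example above, not real model output.
triples = [
    ("Tomas", "WRITES", "blog"),
    ("Tomas", "INTERESTED_IN", "diagram"),
]

entity_links = {
    "Tomas": "Q12345",  # mapped to the "Tomas Bratanic" knowledge-base entry
    "blog": "Q321",     # mapped to the "blog" knowledge-base entry
    "diagram": None,    # no match found in the knowledge base
}
```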
Starting point is 00:02:15 Traditionally, without the power of LLMs, this entire process relies on a suite of specialized models, each handling a specific task from coreference resolution to relationship extraction. While integrating these models demands more effort and coordination, it offers a significant advantage: reduced costs. By fine-tuning smaller, task-specific models, the overall expense of building and maintaining the system can be kept in check. The code is available on GitHub. Environment setup. I suggest you use a separate Python environment like Google Colab, as we will have to play around with dependencies a bit. The models are faster on GPU, so you can use a GPU-powered runtime if you have the Pro version. Additionally, we need to set up Neo4j, a native graph database, to store the extracted information.
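As a rough sketch of that setup, a Colab install cell might look like the following; the exact package list and the version pins that resolve the dependency issues live in the notebook, so treat this as an assumption rather than the notebook's actual install commands.

```python
# Approximate dependency set for this pipeline (run in a Colab cell).
# The notebook pins specific versions to work around the Coreferee
# conflicts mentioned later; this list is an assumption, not the
# notebook's exact install cell.
!pip install relik llama-index llama-index-graph-stores-neo4j spacy coreferee
```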
Starting point is 00:03:06 There are many ways to set up your database instance. However, I recommend using Neo4j Aura, a fully managed cloud solution that provides a free cloud instance which can easily be accessed from a Google Colab notebook. After the database has been created, we can define a connection using LlamaIndex.
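For reference, a minimal sketch of that connection step might look like the following, using the LlamaIndex Neo4j graph-store integration; the credentials are placeholders for your own Aura instance.

```python
from llama_index.graph_stores.neo4j import Neo4jPropertyGraphStore

# Placeholder Aura credentials -- substitute the values from your own instance.
graph_store = Neo4jPropertyGraphStore(
    username="neo4j",
    password="your-password",
    url="neo4j+s://your-instance.databases.neo4j.io",
)
```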
Starting point is 00:03:49 Dataset. We will use a news dataset I obtained via the Diffbot API some time ago. The dataset is conveniently available on GitHub for us to reuse. Coreference resolution. The first step in the pipeline is a coreference resolution model. Coreference resolution is the task of identifying all expressions in a text that refer to the same entity. To my knowledge, there aren't many open-source models available for coreference resolution. I tried Maverick coref, but in my tests Coreferee from spaCy worked better, so we will use that. The only disadvantage of using Coreferee is that we have to deal with dependency hell, which is solved in the notebook, but we won't go through it here. You can load the coreference model in spaCy with code along the lines sketched below. The Coreferee model detects clusters of expressions that refer to the same entity or entities. To rewrite the text based on these clusters, we have to implement our own function.
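A sketch of both pieces, assuming Coreferee's documented spaCy-pipe and chain-resolution API; the replacement logic below is the simple approach described above, not necessarily the notebook's exact function, and the choice of the en_core_web_lg model is an assumption.

```python
import coreferee  # registers the "coreferee" pipeline component with spaCy
import spacy

# Model choice is an assumption; any compatible English pipeline should work.
coref_nlp = spacy.load("en_core_web_lg")
coref_nlp.add_pipe("coreferee")

def coref_text(text: str) -> str:
    """Rewrite text so every mention in a coreference chain is replaced
    by its chain's main mention (simple token-replace logic)."""
    doc = coref_nlp(text)
    resolved = []
    for token in doc:
        # resolve() returns the main mention tokens for a coreferring
        # token, or None if the token is not part of any chain.
        mentions = doc._.coref_chains.resolve(token)
        if mentions:
            resolved.append(" and ".join(t.text for t in mentions))
        else:
            resolved.append(token.text)
    return " ".join(resolved)
```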
Starting point is 00:04:27 Let's test the function to make sure the models and dependencies are set up properly. In this example, the model identifies that Tomas and He refer to the same entity, and using the coref_text function, we replace He with Tomas. Info: note that the rewriting doesn't always return grammatically correct sentences, due to the simple replace logic used for entities within the cluster. However, it should be good enough for most scenarios. Now we apply the coreference resolution to our news dataset and wrap the results as LlamaIndex documents.
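Applied to the dataset, that step might look like the following sketch; the news DataFrame with title and text columns, and the CSV filename, are assumptions about how the GitHub dataset is loaded.

```python
import pandas as pd
from llama_index.core import Document

# Hypothetical filename: the dataset lives in the article's GitHub repo.
news = pd.read_csv("news_dataset.csv")

# Rewrite each article with the coreference function, then wrap the
# results as LlamaIndex documents for the graph-construction step.
news["coref_text"] = news["text"].apply(coref_text)
documents = [
    Document(text=f"{row['title']}: {row['coref_text']}")
    for _, row in news.iterrows()
]
```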
Starting point is 00:04:55 Entity linking and relationship extraction. Relik is a library with models for entity linking (EL) and relationship extraction (RE), and it also supports models that combine the two. In entity linking, Wikipedia is used as the target knowledge base to map entities in text to their corresponding entries in the encyclopedia. On the other hand, relationship extraction involves identifying and categorizing the relationships between entities within a text, enabling the extraction of structured information from unstructured data. If you are using the free Colab version, use the relik-ie/relik-relation-extraction-small model, which performs only relationship extraction. If you have the Pro version, or you will run it on a stronger local machine, you can test the relik-ie/relik-cie-small model, which performs both entity linking and relationship extraction. Additionally, we have to define the embedding model that will be used to embed entities, and the LLM for the question-answering flow. Note that the LLM will not be used during graph construction.
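Putting those choices into code, a sketch might look like the following; the RelikPathExtractor import path follows the LlamaIndex Relik integration, and the OpenAI model choices are illustrative assumptions rather than requirements.

```python
from llama_index.extractors.relik.base import RelikPathExtractor
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# Relationship-extraction-only model for the free Colab tier; swap in
# "relik-ie/relik-cie-small" for combined entity linking + relationship
# extraction on stronger hardware.
relik = RelikPathExtractor(model="relik-ie/relik-relation-extraction-small")

# The LLM is used only at question-answering time, not during graph
# construction; model names here are illustrative assumptions.
llm = OpenAI(model="gpt-4o", temperature=0.0)
embed_model = OpenAIEmbedding(model="text-embedding-3-small")
```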
Starting point is 00:05:49 Now that we have everything in place, we can instantiate a property graph index and use the news documents as input data to build a knowledge graph. Additionally, we need to pass the Relik model as the kg_extractors value to extract the relationships. After constructing the graph, you can open Neo4j Browser to validate the imported graph; you should get a similar visualization by running a Cypher statement that returns a sample of the graph. Question answering. Using LlamaIndex, it is now easy to perform question answering. To use the default graph retrievers, you can ask questions as straightforward as the one sketched below. Here is where the defined LLM and embedding model come into play. Of course, you can also implement custom retrievers for potentially better accuracy.
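A sketch of these final steps, building the index and querying it; the question string is a hypothetical placeholder, since the episode doesn't spell out the exact one.

```python
from llama_index.core import PropertyGraphIndex

# Build the knowledge graph: Relik handles extraction, Neo4j stores the result.
index = PropertyGraphIndex.from_documents(
    documents,
    kg_extractors=[relik],
    llm=llm,
    embed_model=embed_model,
    property_graph_store=graph_store,
    show_progress=True,
)

# Question answering with the default graph retrievers; the question
# below is a hypothetical example, not the one used in the article.
query_engine = index.as_query_engine(include_text=True)
response = query_engine.query("What do you know about the companies in the news?")
print(str(response))
```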
Starting point is 00:06:35 Constructing knowledge graphs without relying on LLMs is not only feasible but also cost-effective and efficient. By fine-tuning smaller, task-specific models, such as those in the Relik framework, you can achieve high-performance information extraction for your retrieval-augmented generation (RAG) applications. Entity linking, a critical step in this process, ensures that recognized entities are accurately mapped to corresponding entries in a knowledge base, thereby maintaining the integrity and utility of the knowledge graph. By using frameworks like Relik and platforms such as Neo4j, it's possible to construct advanced knowledge graphs that facilitate complex data analysis and retrieval tasks, all without the high costs typically associated with deploying LLMs. This method not only makes powerful data processing tools more accessible but also promotes innovation and efficiency in information extraction workflows. Make sure to give the Relik library a star. The code is available on GitHub.
Starting point is 00:07:32 To learn more about this topic, join us at NODES 2024 on November 7, our free virtual developer conference on intelligent apps, knowledge graphs, and AI. Register now. Thank you for listening to this HackerNoon story, read by artificial intelligence. Visit hackernoon.com to read, write, learn and publish.
