The Good Tech Companies - Transforming CSV Files into Graphs with LLMs: A Step-by-Step Guide

Episode Date: October 29, 2024

This story was originally published on HackerNoon at: https://hackernoon.com/transforming-csv-files-into-graphs-with-llms-a-step-by-step-guide. Learn how to use LLMs to convert CSV files into graph data models for Neo4j, enhancing data modeling and insights from flat files. Check more stories related to machine-learning at: https://hackernoon.com/c/machine-learning. You can also check exclusive content about #llms, #neo4j, #csv-to-graph, #graph-data, #cypher, #lang-graph, #data-transformation, #good-company, and more. This story was written by: @neo4j. Learn more about this writer by checking @neo4j's about page, and for more stories, please visit hackernoon.com. Explore using LLMs to convert CSV files into graph structures, improving data modeling in Neo4j with an iterative, prompt-based approach.

Transcript
Starting point is 00:00:00 This audio is presented by Hacker Noon, where anyone can learn anything about any technology. Transforming CSV files into graphs with LLMs. A step-by-step guide by Neo4j. How do LLMs fare when attempting to create graphs from flat CSV files? A large part of my job is improving users' experience with Neo4j. Often, getting data into Neo4j and modeling it efficiently is a key challenge for users, especially in the early days. Although the initial data model is important and needs contemplation, it can be easily refactored to improve performance as the data size or number of users grows. So, as a challenge to myself, I thought I would see if an LLM could help with the initial data
Starting point is 00:00:42 model. If nothing else, it would demonstrate how things are connected and provide the user with some quick results they can show others. Intuitively, I know data modeling is an iterative process, and certain LLMs can be easily distracted by large amounts of data, so this presented a good opportunity to use LangGraph to work in cycles through the data. Let's dive into the prompts that made it happen. Graph Modeling Fundamentals. The Graph Data Modeling Fundamentals course on GraphAcademy guides you through the basics of modeling data in a graph. But as a first pass, I use the following rules of thumb: nouns become labels, they describe the thing that the node represents; verbs become relationship types, they describe how things are connected; everything else becomes properties, particularly adverbs. You have a name and may drive a gray car.
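As a quick aside, here is a minimal sketch of those rules of thumb expressed as a Cypher pattern held in a Python string. The names (Alice, the gray car) are invented for illustration; this is not code from the article:

# Nouns (Person, Car) become labels, the verb (DRIVES) becomes a relationship type,
# and the remaining detail (name, color) becomes properties.
example_pattern = """
MERGE (p:Person {name: 'Alice'})
MERGE (c:Car {color: 'gray'})
MERGE (p)-[:DRIVES]->(c)
"""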
Starting point is 00:01:31 Verbs can also be nodes. You may be happy to know that a person has ordered a product, but that basic model doesn't allow you to know where and when the product was ordered. In this case, Order becomes a new node in the model. I'm sure this could be distilled into a prompt to create a zero-shot approach to graph data modeling. An iterative approach. I attempted this briefly a few months ago and found that the model I was using became easily distracted when dealing with larger schemas, and the prompts quite quickly reached the LLM's token limits. I thought I'd try an iterative approach this time, taking the keys one at a time. This should help avoid distraction because the LLM only needs to consider one item at a time. The final approach used the following steps: 1. Load the CSV file into a pandas DataFrame. 2. Analyze each column in the CSV and append it
Starting point is 00:02:18 to a data model loosely based on JSON schema. 3. Identify and add missing unique IDs for each entity. 4. Review the data model for accuracy. 5. Generate Cypher statements to import the nodes and relationships. 6. Generate the unique constraints that underpin the import statements. 7. Create the constraints and run the import. The data. I took a quick look on Kaggle for an interesting dataset. The dataset that stood out was Spotify Most Streamed Songs. 5 rows × 25 columns. It's relatively simple, but I can see straight away that there should be relationships between tracks and artists. There are also data cleanliness challenges to overcome, in terms of column names and artists being comma-separated values within the artist(s)_name column.
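For steps 1 and 2, a rough sketch of the load-and-inspect pass might look like the following. The file name is an assumption, and this is an illustration of the described approach rather than the article's exact code:

import pandas as pd

# Load the Kaggle CSV into a DataFrame (file name assumed).
df = pd.read_csv("Spotify Most Streamed Songs.csv")
print(df.shape)

# Inspect each column: its name, its pandas dtype, and a handful of sample values.
for column in df.columns:
    print(column, df[column].dtype, df[column].dropna().unique()[:5])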
Starting point is 00:03:02 Choosing an LLM. I really wanted to use a local LLM for this, but I found out early on that Llama 3 wouldn't cut it. If in doubt, fall back on OpenAI. Creating a data model. I used an abridged set of modeling instructions to create the data modeling prompt. I had to engineer the prompt a few times to get a consistent output. The zero-shot example worked relatively well, but I found that the output was inconsistent. Defining a structured output to hold the JSON output really helped. Few-shot example output. The JSON itself was also inconsistent, so I ended up defining a schema based on the movie recommendations dataset. Example output. I had to deviate from strict JSON schema and add the
Starting point is 00:03:45 column_name field to the output to help the LLM generate the import script. Providing examples of descriptions also helped in this regard; otherwise, the properties used in the MATCH clause were inconsistent. The chain. Here is the final prompt. Executing the chain. To iteratively update the model, I iterated over the keys in the DataFrame and passed each key, its data type, and the first five unique values to the prompt. Console output. After a few tweaks to the prompt to handle use cases, I ended up with a model I was quite happy with. The LLM had managed to determine that the dataset consisted of Track and Artist nodes, with a PERFORMED_BY relationship to connect the two.
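A loose sketch of that column-by-column chain follows. The Pydantic schema, prompt wording, and model name are assumptions for illustration (the article's real schema is based on the movie recommendations dataset and includes the column_name field); df is the DataFrame from the earlier sketch:

from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field

class PropertyDefinition(BaseModel):
    name: str = Field(description="Property name in the graph, e.g. track_name")
    column_name: str = Field(description="The CSV column this property is loaded from")
    type: str = Field(description="Data type, e.g. string, integer, float")

class EntityDefinition(BaseModel):
    entity_type: str = Field(description="Either 'node' or 'relationship'")
    label: str = Field(description="Node label or relationship type")
    properties: list[PropertyDefinition]

llm = ChatOpenAI(model="gpt-4o", temperature=0)
model_chain = llm.with_structured_output(EntityDefinition)

# One column at a time: pass the key, its data type and the first five unique values.
definitions = []
for column in df.columns:
    samples = df[column].dropna().unique()[:5].tolist()
    definitions.append(model_chain.invoke(
        f"Column: {column}\nData type: {df[column].dtype}\nSample values: {samples}\n"
        "Decide whether this column describes a node, a relationship or a property, "
        "and return the definition it belongs to."
    ))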
Starting point is 00:04:26 Adding unique identifiers. I noticed that the schema didn't contain any unique identifiers, and this may become a problem when it comes to importing relationships. It stands to reason that different artists would release songs with the same name, and two artists may have the same name. For this reason, it was important to create an identifier for tracks so they could be differentiated within a larger dataset. This step is only really required for nodes, so I extracted the nodes from the schema, ran the chain for each, and then combined the relationships with the updated definitions.
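Conceptually, the identifier pass just adds an id-style property to each node definition that the import can later MERGE on. A tiny, purely illustrative before-and-after (the property names are assumptions, not the article's schema):

# Before the identifier step: a Track node with no unique property.
track_node = {
    "label": "Track",
    "properties": [{"name": "track_name", "column_name": "track_name", "type": "string"}],
}

# After the identifier step: the chain adds a property the import can MERGE on.
track_node["properties"].append({"name": "track_id", "type": "string", "unique": True})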
Starting point is 00:05:04 Data model review. For sanity, it is worth checking the model for optimizations. The model_prompt did a good job of identifying the nouns and verbs, but on a more complex model it can be a little overzealous. One iteration treated the *_playlists and *_charts columns as IDs and attempted to create Stream nodes and IN_PLAYLIST relationships. I assume this was due to counts over 1,000 including formatting with a comma, e.g. 1,001. Nice idea, but maybe a little too clever. Still, this shows the importance of having a human in the loop who understands the data structure. In a real-world scenario, I'd want to run this a few times to iteratively improve the data model: I would put a maximum limit, then iterate up to that point or until the data model object no longer changes.
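That review loop might be sketched roughly as follows; review_data_model is a placeholder standing in for the LLM review chain, not the article's code:

def review_data_model(model: dict) -> dict:
    # Placeholder for the LLM-backed review chain; a real implementation would
    # return a (possibly) revised copy of the data model.
    return model

data_model = {"nodes": [], "relationships": []}
MAX_ITERATIONS = 5  # hard upper bound on review passes

for _ in range(MAX_ITERATIONS):
    reviewed = review_data_model(data_model)
    if reviewed == data_model:  # stop as soon as the model no longer changes
        break
    data_model = reviewed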
Starting point is 00:05:45 Generate import statements. By this point, the schema should be robust enough and include as much information as possible to allow an LLM to generate a set of import scripts. In line with Neo4j data importing recommendations, the file should be processed several times, each pass importing a single node or relationship to avoid eager operations and locking. This chain requires a different output object to the previous steps. In this case, the cypher member is most important, but I also wanted to include a chain_of_thought key to encourage chain-of-thought reasoning. The same process then applies: iterate over each of the reviewed definitions and generate the Cypher. Console output.
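A hedged sketch of what such an output object could look like, again using Pydantic with the llm from the earlier sketch (the field descriptions are mine, not the article's):

from pydantic import BaseModel, Field

class ImportStatement(BaseModel):
    chain_of_thought: str = Field(description="Reasoning about how the CSV rows map onto the graph")
    cypher: str = Field(description="A single Cypher statement importing one node or relationship")

import_chain = llm.with_structured_output(ImportStatement)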
Starting point is 00:06:24 This prompt took some engineering to achieve consistent results. Sometimes the Cypher would include a MERGE statement with multiple fields defined, which is suboptimal at best: if any of the columns are null, the entire import will fail. At times the result would include apoc.periodic.iterate, which is no longer required, and I wanted code I could execute with the Python driver. I had to reiterate that the specified column_name should be used when creating relationships; the LLM just wouldn't follow the instructions when using the unique identifier on the nodes at each end of the relationship, so it took a few attempts to get it to follow the instructions in the description. There was some back and forth between this prompt and the model_prompt. Backticks were needed for column names that include special characters, e.g. energy_%.
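Put together, the kind of statement the prompt is pushed towards looks roughly like the following: MERGE on a single unique key, SET everything else, and backtick-quote awkward column names. The connection details, property names, and the track_id key are assumptions for illustration, and df is the DataFrame from earlier:

from neo4j import GraphDatabase

IMPORT_TRACKS = """
UNWIND $rows AS row
MERGE (t:Track {track_id: row.track_id})  // one unique key, so a null in another column can't break the MERGE
SET t.track_name = row.track_name,
    t.bpm = toInteger(row.bpm),
    t.energy = toFloat(row.`energy_%`)    // backticks for special characters
"""

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    session.run(IMPORT_TRACKS, rows=df.to_dict("records"))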
Starting point is 00:07:10 It would also be beneficial to split this into two prompts, one for nodes and one for relationships. But that is a task for another day. Create the unique constraints. Next, the import scripts can be used as a basis to create unique constraints in the database. Console output. Sometimes this prompt would return statements for both indexes and constraints, hence the split on the semicolon. Run the import. With everything in place, it was time to execute the Cypher statements. QA on the dataset. This post wouldn't be complete without some QA on the dataset using the GraphCypherQAChain. Most popular artists. Who are the most popular artists in the database? The LLM seemed to judge popularity in terms of the number of tracks an artist has been on rather than their overall number of streams. Beats per minute. Which track has the highest BPM?
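Circling back to the constraint step described a moment ago, a minimal sketch could look like this; the constraint names and properties are assumptions, and driver is the one created in the earlier snippet:

CONSTRAINTS = """
CREATE CONSTRAINT track_id IF NOT EXISTS FOR (t:Track) REQUIRE t.track_id IS UNIQUE;
CREATE CONSTRAINT artist_name IF NOT EXISTS FOR (a:Artist) REQUIRE a.name IS UNIQUE
"""

with driver.session() as session:
    # The prompt sometimes returned several statements at once, hence splitting on the semicolon.
    for statement in CONSTRAINTS.split(";"):
        if statement.strip():
            session.run(statement)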
Starting point is 00:07:57 Improving the Cypher generation prompt. In this case, the Cypher looks fine and the correct result was included in the prompt, but GPT-4o couldn't interpret the answer. It looks like the cypher_generation_prompt passed to the GraphCypherQAChain could do with additional instructions to make the column names more verbose: "Always use verbose column names in the Cypher statement, using the label and property names. For example, use person_name instead of name." GraphCypherQAChain with custom prompt. Tracks performed by the most artists. Graphs excel at returning a count of the number of relationships by type and direction.
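A sketch of wiring that advice into the QA chain, assuming the LangChain community integrations and reusing the llm from earlier (prompt text abridged, connection details assumed):

from langchain_community.graphs import Neo4jGraph
from langchain_community.chains.graph_qa.cypher import GraphCypherQAChain
from langchain_core.prompts import PromptTemplate

graph = Neo4jGraph(url="bolt://localhost:7687", username="neo4j", password="password")

cypher_generation_prompt = PromptTemplate(
    input_variables=["schema", "question"],
    template=(
        "Generate a Cypher statement to answer the question.\n"
        "Always use verbose column names in the Cypher statement, using the label and "
        "property names, e.g. person_name instead of name.\n"
        "Schema: {schema}\nQuestion: {question}"
    ),
)

qa_chain = GraphCypherQAChain.from_llm(
    llm,
    graph=graph,
    cypher_prompt=cypher_generation_prompt,
    allow_dangerous_requests=True,  # recent LangChain releases require opting in explicitly
)
print(qa_chain.invoke({"query": "Which tracks were performed by the most artists?"}))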
Starting point is 00:08:44 Summary. The CSV analysis and modeling is the most time-intensive part; it could take more than 5 minutes to generate. The costs themselves were pretty cheap: in 8 hours of experimentation, I must have sent hundreds of requests, and I ended up spending a dollar or so. There were a number of challenges to get to this point. The prompts took several iterations to get right; this could be overcome by fine-tuning the model or providing few-shot examples. JSON responses from GPT-4o can be inconsistent; I was recommended json-repair, which was better than trying to get the LLM to validate its own JSON output. I can see this approach working well in a LangGraph implementation, where the operations are run in sequence, giving an LLM the ability to build and refine the model. As new models are released, they may also benefit from fine-tuning.
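For the json-repair recommendation above, a minimal usage sketch (assuming the PyPI json-repair package) might be:

import json
from json_repair import repair_json

broken = '{"label": "Track", "properties": [{"name": "track_name"}'  # truncated LLM output
fixed = repair_json(broken)  # best-effort repair into valid JSON
data = json.loads(fixed)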
Starting point is 00:09:26 Learn more. Check out Harnessing Large Language Models with Neo4j for more information about streamlining the knowledge graph creation process with LLMs. Read Create a Neo4j GraphRAG Workflow Using LangChain and LangGraph for more about LangGraph and Neo4j. And to learn more about fine-tuning, check out Knowledge Graphs and LLMs: Fine-Tuning vs. Retrieval Augmented Generation. Feature image: graph model showing Track nodes with PERFORMED_BY relationships to Artist nodes. Photo by the author. To learn more about this topic, join us at NODES 2024 on November 7, our free virtual developer conference on intelligent apps, knowledge
Starting point is 00:10:05 graphs, and AI. Register now. Thank you for listening to this HackerNoon story, read by Artificial Intelligence. Visit HackerNoon.com to read, write, learn and publish.
