The Data Stack Show - 38: Graph Databases & Data Governance with David Allen of Neo4j
Episode Date: June 2, 2021

Highlights from this week's episode include:

- David's background in comparative databases (1:50)
- David's experience and lessons he learned from writing his book (3:23)
- How writing a technical book compares to writing technical documentation (4:41)
- The process of writing a book (6:30)
- The best and worst parts of David's book writing experience (8:02)
- An introduction to what Neo4j is (9:08)
- What you need to graph (11:13)
- Typical problems a graph database is a good solution for (13:00)
- The performance difference between graph and relational databases (18:41)
- How Neo4j addresses performance and ergonomics (23:30)
- Neo4j and scalability (26:20)
- How Neo4j fits in the modern data stack (31:48)
- Neo4j use cases (35:45)
- Practical implementation of Neo4j (40:51)
- Neo4j's relationship with open source (45:50)

The Data Stack Show is a weekly podcast powered by RudderStack. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
The Data Stack Show is brought to you by Rudderstack, the complete customer data pipeline solution.
Thanks for joining the show today.
All right, today on the show, we have David Allen, and he works for a company called Neo4j, and they make a graph database.
And this is probably going to be a pretty technical discussion, which we really like.
And I'm pretty interested to ask about how graph can be used in the context of event data. That's something that I deal with every day in my job, and we see that sort of clickstream thing. And graph seems like a really interesting application for that.
So that's my burning question.
Costas?
Yeah, I want to see how Neo4j, and graph databases in general, fit into a modern data stack, to be honest. I'm quite familiar with the product.
I've seen how it grew from the early days of Neo4j as an open source
project, and it has matured a lot. And of course, the market has changed a lot, right? So I really
want to see what kind of use cases exist out there today for graph databases. Awesome. Well, let's
jump in and talk graph with David. David, welcome to the Data Stack Show. Really excited to talk with you about all sorts of
things, but especially graph. Yeah, thanks for having me today. I'm looking forward to the
conversation. We love getting into the technical details, but as I'm known to do, I want to start
with just hearing your background, and then I have a question about something non-technical.
So do you just want to give us the two-minute overview of who you are, what you're doing, and how you ended up where you are?
Sure. Well, I started off my career in data management consulting. And so in that capacity,
I was usually working with corporate customers, doing things like building ETL pipelines,
helping them with data quality problems, helping them with master data management and governance type stuff.
After a couple of jobs in that area, I ended up doing some applied research and development
for government at a company called MITRE.
And it was actually when I was at MITRE that I kind of ran into graph for the first time.
I didn't actually go straight to Neo4j after MITRE.
I went on and did some work as a CTO
for a startup and did a couple of other jobs and ended up coming back to Neo4j when I found a
position opened up and I wanted to kind of re-engage in the graph space. But I, you know,
I think of myself as coming from a background of like comparative databases. And at one point or
another in my career, I seem to have ended up having a
situation where I needed to use all of them. Very cool. Well, I want to dig into graph stuff
because there's so much there, but you're actually an author. So you wrote a book and you have spent
some time in the academic space. And so I'm really interested to know on the authorship side of things, what was that experience like? And were
there any lessons from sort of like data management or the engineering side of things that you took
into the process of writing a book? Now, of course, it was a technical book, but that's kind
of a cool thing that you're an author and sort of work in the technical side of data.
Well, it's been a while since that book was published.
I mean, I think that when you start the process
of writing a book,
you definitely bring all of your technical experience in.
And I think you set out to try to summarize
some of what you've learned
in the context of like technical publications,
whether it's a book or a blog post,
any form of publication like that.
In terms of the lessons learned? Wow. There were a lot of them for me in that process. I think one of them is, you know, don't get into technical book publishing if you're trying to make money, that's for sure. Another one is, you know, you set out with a clear outline and you're like, okay,
I know what I want to say in this book, but you find that in developing the connective
tissue and making the entire story cohesive throughout that you end up having to go do some more basic research to fill in the gaps,
you know, areas where you thought you knew something, but you didn't actually, to make the whole thing hang together. And so I thought when I started the process, it was going to be,
okay, we're going to sit down and write down everything we've learned. And it wasn't that exactly. We had to fill in a lot of gaps.
Interesting. And how would you compare the experience of writing a technical book
to maybe something more like writing technical documentation?
Well, that's an interesting question. I think that technical documentation, I've written a lot of that over time. It feels like it's definitely much more narrow in scope and it has a more focused audience. Whereas in a longer form book type of a setup, you're kind of expected to give some element across the whole spectrum of like, well, there's understanding oriented material of what
are the key concepts and what is the theory behind this? And then there's also the how to
oriented material where you say, okay, this line of code and then that line of code.
And then there's the tutorial element, which is how to apply your knowledge to a novel situation.
So when you're writing technical documentation, I find it usually falls into one of those categories, like explanatory, conceptual, how to, or tutorial. And when
you're in the longer form, like if you're writing a book, you end up having to figure out how to do
the proper balance of all of that to give somebody like a really comprehensive view on a subject.
Right. Kind of like a narrative arc almost.
Yeah, there you go. And also, you know, sort of like, you know, if you've been to college and
you've seen how they structured the course materials, there's the sort of like 101 progression
up to the higher level course material. And so you got to really think about that knowledge path
when you're writing a longer piece of technical information, because you want to go jump in and explain the
complicated stuff, but you have to lay the groundwork and get some conceptual machinery
out of the way so that the more complicated stuff will make sense later on. In the context of a
tutorial or a blog post, you never do that. You just say up front, you need to know these three
things before you read this. And then that's that.

That's super cool. Can you tell us a little bit more about what it feels like to write a book? Like, what is the process, and how long does it take?
Well, I've only done this once. I wouldn't represent myself as an expert in the area,
but it's a pretty grinding process, to be honest. You first start with a proposed outline before
your project is even accepted. And you typically work with an editor who provides feedback on that outline.
And then you get to something like an annotated outline, which is, you know, think of it as a
fleshed out outline with just like a list of bullets for each subheading of what you would
talk about or how you would approach that. And then you go through the grinding process of
developing out, you know, a first draft of each of the sections. And then typically there are expert reviewers. So they get other people in your field to read the book and provide notes so that the publisher
themselves understands that they're not putting out, you know, bad information about this or that
topic. And when you get to the point where you've got a relatively mature draft, there are a lot of edits that go into that. And then there might be post-production stuff like,
is the publisher going to work with a company
that's going to build an index and those sorts of things.
That's super interesting.
Yeah, so it's a long process.
Yeah, it sounds like it is.
I mean, the reason I'm asking is I have no idea
how a book is written.
I'm pretty sure many people don't know how much work goes into all these books that we see out there.
And yeah, it's a very good opportunity.
So what was the...
Okay, last question about the book, I promise.
What was the best part of writing this book and what was the worst part for you?
Well, I'll give you the worst part first.
The worst part was the endless revisions and not
being sure what the finish line is. And I think that all book authors kind of go through that at
one point or another, but not to dwell on that too much. The best part is like, I mean, hey,
I'm in the tech industry in the first place because I enjoy learning and I enjoy the process
of learning. And so I would say discovering the gaps in my knowledge was the most fun part. I set out
to write down what I thought I knew. And as I started getting deeper into it, I started to
discover, okay, the way you're thinking about this or that topic is a little fuzzy and needs
some more detail. So now I have to go do basic research. And so it is as much a learning project
as it is a writing project, if that makes sense. And that's
actually fun to me. I mean, I really like that element about software and technology is that
it puts me in a position to keep learning new stuff. Oh, that's an amazing point. Nice. Really,
really nice. So David, let's start talking a little bit about Neo4j and graph databases. Do you want to do us a quick introduction on what Neo4j is?
Sure. Well, Neo4j is what we call a native graph database.
And so graphs are a data structure that folks may or may not be familiar with.
But basically, every time you go to a whiteboard
and you draw a bunch of circles and lines connecting the circles on the whiteboard,
you are describing
a graph. So graphs are composed of nodes, which are those circles that you draw, and relationships
that link the nodes together. And what people find, like particularly in the whiteboarding context,
is that this structure is very rich, very easy to work with, and very associative and sort of more similar to how the
inside of your own head operates than, for example, something like a table or a document.
You know, when you're working with tables, you make a list of records. And so you can think of
every table at the end of the day as being a list of sorts. Graphs are just this flexible,
open-ended data structure with a lot of links back to,
you know, basic computer science that you can use to represent any form of data under the sun.
When people most often get exposed to graphs for the first time, it's usually in the context of
something like a social network. So you go onto Twitter and your account, like imagine on the
whiteboard, I draw my account as a circle. And then I might draw all the accounts that I follow as other circles, and then draw a line from me to them saying I follow those accounts, and so forth. That is data most naturally represented as a graph. You go to LinkedIn and you say who's connected to who,
it's the same structure again.
Facebook, same structure again.
So that's how people usually first come to it.
And it's in the context of other business problems
where we introduce how to apply that kind of graph thinking
to lots of other kinds of data.
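(To make the whiteboard picture concrete, here is a minimal Cypher sketch of the Twitter-style follow graph David describes. The :Account label and :FOLLOWS relationship type are illustrative, not from the episode.)

    // Create two accounts (the circles) and a follow relationship (the line).
    CREATE (me:Account {handle: 'david'})
    CREATE (them:Account {handle: 'datastackshow'})
    CREATE (me)-[:FOLLOWS]->(them);

    // Ask the graph: who does 'david' follow?
    MATCH (:Account {handle: 'david'})-[:FOLLOWS]->(followed:Account)
    RETURN followed.handle;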
And why do we need a graph? Are the relational databases that we have so far not expressive enough for these types of problems that we usually solve with graph databases?
Yeah, so it's not a question of expressiveness. So I've used a lot of different databases, though not all of them.
And so if you think about,
there are these different data models.
You've got graph,
relational, JSON documents, key value stores. Now, it would be silly to say that there's some kind of fact about the world that you can say with graphs that you can't say with tables,
because you can represent literally anything with tables. But I find it better to think about these
different kinds of databases as more like tools in a toolbox. And so the question is not whether it's more representationally powerful. The question is whether it's the right tool for the job. If you have a million customers and you want to know all the ones where the zip code is 23229, tables can do that just fine; that's not a particularly graphy problem. However, if you wanted to, for
example, calculate somebody's Kevin Bacon score, like how many hops away they are from Kevin Bacon
in terms of the movies that they've made, that requires you to navigate a complex set of
relationships between records. And that is a very graphy problem. So the way I try to sum this up in the simplest possible way
is to say that sometimes the relationships
between the data items matter more
than the data items themselves.
And if that sounds like it fits your problem,
you're probably in the graph space.
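(As an illustration, the Kevin Bacon question in Cypher might look like the sketch below, assuming a hypothetical model of :Person nodes tied to :Movie nodes by ACTED_IN relationships.)

    // Shortest chain of co-acting links between two people. Each person-to-person
    // hop passes through a movie (two relationships), so the Bacon number is the
    // path length divided by two.
    MATCH p = shortestPath(
      (kb:Person {name: 'Kevin Bacon'})-[:ACTED_IN*]-(other:Person {name: 'Some Actor'})
    )
    RETURN length(p) / 2 AS baconNumber;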
Yeah, that's a very good way to put it.
I really like how you described the difference there.
So can you give us,
I mean, you said like people usually
get introduced to graphs by the social graph, right? That's what everyone has heard about.
Can you give us a few examples of, like, typical problems that a graph database is a very good solution for?

Yeah, sure. So let me tell you a story about how I actually started
using graphs, because this is the one that's most direct to me. So I was working for that government research and development company called MITRE, and we were developing a solution for data provenance. Executives would say: I got this intelligence report, or I got this estimate, or I got this financial number.
And I don't know whether I can trust it or not because we have internal process weaknesses in our organization.
And so what they wanted to know was how did we arrive at this judgment or decision?
And so in order to know whether the information was good or not, we had to trace backwards to figure out how it was put together.
So we might say, well, we gathered some facts from Bob and then Bob summarized them in this report. And then that report was processed by that system and so forth. And so basically I want you to picture like a
family tree for any given report that you might see. Well, that family tree is a graph because
we can say data came from sources, was transformed in certain ways, and then got summarized into the report.
And the way that I first came to graph is that I wanted to build directed acyclic graphs of data provenance so that I could answer these questions for these executives.
And I first did it on top of MySQL and just did it using two tables and joins between the tables, which MySQL is perfectly
capable of doing, and discovered Neo4j later when I found that it was much easier to develop with
and much faster for my purpose. So that little story, that's data lineage, which is fundamentally
a graph. So we have a lot of banks and financial institutions that use it for fraud analysis.
So a question might arise, well, is this particular payment or bank transfer fraudulent?
Sometimes that's very difficult to answer with the isolated details of the payment.
So it's a large payment or a small payment. That doesn't really mean that it's fraudulent or not fraudulent.
But if you can connect it with the wider community and you can say, this payment is one of 10
other payments that is all going into one bank account, which has been accumulating
a suspicious amount of funds, the pattern of relationships would give you a stronger
basis to cast suspicion on that one transaction, for example.
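(A sketch of the kind of pattern David describes, with made-up labels and properties: many payments funneling into one account, which is hard to spot one transaction at a time.)

    // Accounts receiving 10+ payments that add up to a suspicious total.
    MATCH (p:Payment)-[:CREDITS]->(a:Account)
    WITH a, count(p) AS payments, sum(p.amount) AS total
    WHERE payments >= 10 AND total > 100000
    RETURN a.id AS account, payments, total;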
So a lot of other companies use it for recommendation engines.
And so we have some retailers, for example, you go under their website and you add something to
your shopping cart and they would like to recommend a product that you might like.
One of the ways you can do that with graph is say, do a social type of recommendation and say,
you have added an item to your cart that a lot of
other users liked. So we're going to recommend the other items that they really enjoyed,
or we're going to recommend items purchased by users whose behavior is similar to yours.
And so when you think about how to express those questions, it's all about the relationships
between users, products, order baskets, and behavior over time.
And those tend to be naturally graphy problems where some of the techniques that Neo4j has make your life easier.
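(For instance, a minimal collaborative-filtering-style query under an assumed model of :User nodes with :PURCHASED relationships to :Product nodes; the names are hypothetical.)

    // "Users who bought the item in your cart also bought..."
    MATCH (:Product {sku: 'cart-item'})<-[:PURCHASED]-(u:User)-[:PURCHASED]->(rec:Product)
    RETURN rec.name AS recommendation, count(u) AS buyers
    ORDER BY buyers DESC
    LIMIT 5;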
That's super cool. So a graph database is a good tool for any problem that cares more about the relationships and connections between entities than about the entities themselves.

That's definitely a part of it: if you care more about the relationships than the items themselves, that's probably a good tip-off.
Usually we just talk about this in terms of connected data.
And that's a pretty wide space, but
folks know these problems when they see them inside of their organizations. Like there are
other cases where we've done a lot of work, like master data management, where you might need to
connect the metadata from 15 different databases and ask questions about, you know, where are our
customer IDs split across lots and lots of different databases.
And so that in turn is also a connected data kind of problem.
So these kinds of connected problems run the gamut.
I don't usually say that the graph database is the optimal solution for every single problem
under the sun, but the expanse of connected data problems is wider than most people appreciate when
they first see it.
Yeah, yeah.
But it seems from what you've shared so far with us that there are a couple of problems
that let's say fall under the category of data governance in general, that graph databases
are a very good fit for that.
Is this correct?
Yes, absolutely.
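(As one concrete data-governance example, a lineage traversal over David's provenance family tree might look like this sketch; the :Report label and :DERIVED_FROM relationship are illustrative.)

    // Walk the provenance DAG backwards from one report to everything
    // it was ultimately derived from.
    MATCH (r:Report {id: 'quarterly-estimate'})-[:DERIVED_FROM*]->(src)
    RETURN DISTINCT labels(src) AS kind, src.name AS source;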
Oh, that's very interesting.
To be honest, I wasn't aware at all that these kinds of problems are solved with graph databases. That's super, super interesting.
We hear a lot about data governance,
to be honest, in this show.
And it's a very hot space right now.
There are many products that are popping up out there
about solving very specific problems
around data governance.
And I know, and at least I knew in the past
that data lineage is, like, a pretty tough problem to solve.
So I'd like to ask you a little bit more about that.
And there are also like some selfish reason behind that
because I'm very interested in it.
You said that you used a graph database to do that.
And you also tried to do that using MySQL. My first question is: what was the difference between the
two in terms of performance and also in terms of, let's say, ergonomics of using one system and the
other to solve the same problem? How would you describe this?

Okay, so this kind of goes back
to what we were talking about a little while ago about the difference in representational strength and how there isn't one. So the way that you model
a graph in a relational database, we've seen 100 different customers do this 100 different ways,
but the patterns are all very similar. Basically, what you do is you create a node table, like
imagine that we have a person table. And then you create a separate many to many join table. Let's call the
table person knows person. And so then if I want to query the database to see if David knows Costas,
then what I do is I join the person table to the join table back to the person table. And then if
that crosswalk exists, then that relationship in your graph exists. Make sense so far?
Yeah, absolutely.
Okay.
So that's a perfectly fine way of doing a graph.
And other times, if all people need is a hierarchy, they will put, for example, a parent ID in a column.
And they'll say, well, you know, the parent organization is ID 15. And so a foreign key that links back to the same table
could be a form of a graph link, if that makes sense. So if you want to then traverse the graph,
let's say that you want to navigate from one node to another node in this graph,
you're always going to be doing that by SQL joins. And that's okay. But here we've already planted the seeds of where your
performance problems are going to be. So the issue with SQL joins is that they need to be
recomputed each time. And so relational databases have gotten really good at this. Don't get me
wrong. I mean, they got 30 years of research that goes into projecting out just the right rows very
quickly and optimizing that join process.
But the un-get-aroundable fact is that you're typically going to recompute this every time.
So this in turn means that if you want to do lots and lots and lots of joins, I'm talking
about 8, 10, 15, or more, you're going to be multiplying that computation burden.
So from a performance perspective, you can see that
navigating the graph via joins is going to scale poorly, no matter how you set your relational
database up. From an ergonomic perspective, SQL isn't meant to do this. And so if you're
expressing a relationship traversal as a join, you already have an ergonomic gap there. And it is extremely difficult to express
things like, I want to know the people that I am connected to that are between two and five hops
away from me. So in other words, when you want to place constraints on the length of the path that
you're navigating. Okay. I tried to do this when I wrote a custom stack of Java software on top of MySQL for my provenance database, and I found that it is possible. I kind of had to go up the mountain and consult with the local SQL gurus, and they gave me a set of SQL constructs that I could use for recursive table joining, and even, you know, bounded recursion, to do those sorts of things.
But I'm telling you that the SQL was crazy complicated.
And when I finally managed to write it, it performed really poorly.
And I could see that that was going to get worse as my traversals got deeper and as the
total volume of data that I was dealing with got larger.
And that's the point where,
you know, in my journey with MySQL for storing graphs, I had to say, whoa, I got to take a step
back. What am I trying to do here? I'm trying to implement a graph abstraction on top of something
that's not a graph. So are there any options for me out there that will actually store graph as a
data structure? And that's when I found Neo4j. Sometimes when we're talking with
customers, we show them this slide where it's a very graphy query. The slide is something like: find all people that this manager manages, up to three levels down, and count the number of people per manager. And you show, like, a four-line Cypher query, and then you show some gigantic, awful SQL
query that does the same thing.
And we use that as a jumping off point to describe these ergonomic differences between the query
language. But really, if you boil out all the detail, what it really gets down to is
using the right tool for the job. It's easier to use a graph query language on top of a graph
structure than to use a table query language on top of a table abstraction of a graph.
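(A sketch of what that four-line Cypher query might look like, assuming :Person nodes connected by a hypothetical :MANAGES relationship; the SQL equivalent needs a recursive common table expression or a chain of self-joins.)

    // Everyone under this manager, up to three levels down,
    // counted per direct or intermediate manager.
    MATCH (top:Person {name: 'Alice'})-[:MANAGES*0..2]->(m:Person)-[:MANAGES]->(report:Person)
    RETURN m.name AS manager, count(report) AS reports;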
Yeah, yeah. Makes total sense.
And that's exactly what I was hoping to hear from you, to be honest.
And I had SQL in my mind when I was asking this question.
So how does Neo4j address these two things, performance and ergonomics? Like, what is Neo4j doing differently than MySQL?

So let me take these individually.
On the performance side, you'll see Neo4j talk about itself
as a native graph database.
And what that really means is that the data structure, all the way down to what gets written to disk and what is stored in memory, is optimized for a graph structure.
So some folks may be familiar with the idea of sparse matrices or how you can represent a graph
as a matrix. The way Neo4j does it is basically nodes and relationships are always fixed length
records. And relationships are effectively not much more complicated than two pointers,
a pointer to the originating node and a pointer to the terminating node of the relationship.
Now, Neo4j likes to have most of the graph structure live in memory where we can.
And so the fixed length record aspect means that from a performance perspective,
you can jump to the right node just by doing some pointer arithmetic and memory offset. And the fact that relationships are pointers means that
traversing them is literally just dereferencing a pointer. And so the way a native graph database
works in memory is the combination of those techniques loaded hot in RAM so that graph
traversal really, when you strip away all the fancy aspects of it,
becomes pointer chasing. And that's a very fast operation to do in main memory.
So in terms of the ergonomics, Neo4j has the Cypher query language. And so for people who
are not familiar with it, I usually just describe it as SQL for graphs, really.
It has SQL-inspired syntax: things like WHERE, SKIP, LIMIT, all of that stuff operates the same in Neo4j as it does in SQL. But Cypher focuses on
letting you describe the graph pattern that you're trying to match and then letting the database
figure out how to go get it. And so in terms of ergonomics, I would, as a developer, I draw a
distinction between declarative languages and imperative languages. And so declarative languages
sort of broadly are, you tell the database what you want, and it's the database's problem to go
figure out how to do that. And an imperative language is more like a traversal, where you give the database explicit instructions and say: go to this node,
now expand out to that node, now expand out to that node, and so forth. And so Cypher is a
declarative language. And that's part of why it has better ergonomics for graph: you just describe the pattern that you want, you use SQL-like syntax, and you let the database work it out.

Sounds great.
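(To make the declarative point concrete: the "between two and five hops away from me" question from earlier is just a pattern with a bounded variable-length relationship. The :KNOWS model here is illustrative.)

    // Describe the pattern; the database plans the traversal.
    MATCH (me:Person {name: 'David'})-[:KNOWS*2..5]-(other:Person)
    WHERE other <> me
    RETURN DISTINCT other.name
    LIMIT 20;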
And you said that Neo4j, I mean, it prefers to operate in memory, right?
Yes.
How does this translate into scalability?
How well does the system scale?
Well, you know, as somebody who's been in engineering for a while,
I honestly kind of dislike the scalability question
because I usually want to break it up into lots of different kinds of scalability.
You know, you can scale storage, compute, you can scale high availability attributes and so on.
So Neo4j does like to live in memory.
And so usually what we advise customers is to have, you know, memory that is some healthy percentage of the total size of their database.
We have an arrangement called Fabric that's used for distributed databases.
So if you can't fit your entire graph in memory, what you can do is partition your graph out
and you can have many different database management systems that store pieces or shards of that
graph so that you're not restricted to how much RAM you can get in one box.
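(Roughly, a Fabric query fans a Cypher fragment out across shards and combines the results. The sketch below assumes Neo4j 4.x Fabric with a fabric database named 'fabric' holding two graph shards.)

    // Run the same pattern on each shard, then aggregate across them.
    UNWIND [0, 1] AS shard
    CALL {
      USE fabric.graph(shard)
      MATCH (p:Person) RETURN p.name AS name
    }
    RETURN count(name) AS totalPeople;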
David, your point about scaling was great.
And from all these different, like, scalability problems, what was the most interesting and challenging from an engineering perspective to solve for Neo4j?
And I ask this question having in mind
graph partitioning in particular, to be honest,
because I always thought that making a graph database distributed is kind of a hard problem.
So yeah, can you share a little bit more information about that?
Oh, it is a hard problem. So, let's think how best to approach this. Usually the trouble with scalability, or the hardest part about scalability for me, is that you're usually trying to optimize across a lot of different axes at the
same time. And so for example, when people say that they want to be able to scale the number of
reads, they're also typically implying that they don't want read or write performance to degrade
while they're doing that. And so you can't just make some element bigger, you have to retain a lot of other things. And so at the
extremes of scalability, what I've always found with databases is that you end up making some
compromises. So for example, the original eventual consistent databases all came along at a time
where people were trying to scale write volumes to the point where they couldn't maintain the strong ACID guarantees and scale writes to that degree. And so the hardest part of scalability is, you know, the TANSTAAFL rule, like there ain't no such thing as a free lunch. Or, you know, what's that other classic saying: better, faster, cheaper, pick two. So Neo4j as a system is basically trying to maintain strong consistency throughout and
puts a premium on making sure that the read performance and throughput is really quite
good.
And so in taking those design decisions, certain kinds of scalability might be a little bit
more difficult.
Other systems might take a drastically different approach and have different scalability attributes at the cost of different trade-offs.
Does that make sense?

Yeah, yeah, absolutely. Actually, I love your definition of scalability in relation to trade-offs, because at the end, that's exactly what scalability is: finding the right trade-offs based on the problem you're trying to solve. So yeah,
that's perfect.
That's an amazing definition.
To take that one step further, graph partitioning. You said graph partitioning is a hard problem.
I completely agree.
So there are a lot of other systems where they will automatically partition your graph.
And so they just say, hey, you just got a bunch of nodes and relationships.
We'll handle the sharding for you.
They make an explicit trade-off that users are not usually aware of. If you basically threw everything, let's say, into a MySQL table, and then you did horizontal or vertical partitioning of those tables to distribute your, quote-unquote, graph, you can sort of see that in doing so, you would be creating a lot of breakages where
in order to traverse a relationship,
you would need to cross between shards. That's an expensive thing to do in all distributed databases, moving, you know, doing the network coordination between shards. And so when we think
about how to partition a graph, there's one way to partition it where it allows you to write an
unlimited amount of data at the cost that your read queries might get increasingly expensive and difficult to do.
And there might be another way of partitioning your graph that is really taking into account what the schema and connected components of the graph are, which can preserve really high throughput and performance at the expense of making the partitioning scheme
more difficult to create and maybe not automatic.
Does that make sense?
Yeah, absolutely.
So that's kind of like manual sharding
versus automatic sharding.
And automatic sharding is clearly possible,
but it creates some trade-offs
that you might not realize you've made
until you're already at scale.

And in the case of Neo4j, which one do you support, or which one do you recommend to
your customers? So in the case of Fabric, what we're usually doing is recommending that the
customers come up with a partitioning scheme that makes sense for their data. I was just writing this crazy Twitter thread about this last week.
I can link you to it later if you want,
but it has to do with how to create cut points in your data model
such that when you distribute your graph across multiple partitions,
you are minimizing the number of times you're going to have to cross partitions.
And that is the property that's going to mean
that your queries are still going to
perform well at scale. Yeah. Yeah. Please, please share this. It's very, very interesting.
All right. One last question from me, and then I'll let Eric ask his questions because I'm
monopolizing the conversation here. So can you give us a little bit more information from an
architectural standpoint, how Neo4j
fits in a modern data stack and how it also integrates or interoperates, let's say, with
other common data systems in an organization, from what you've seen from your customers,
right?
Sure.
So let's first picture Neo4j as a complete black box in order to attack the interoperability
point.
And we need to see that there's got to be
some way to get data into the box and some way to get data out of the box. So on that front, we have a set of supported connectors that are available at the same price as the commercial software. And so we do Kafka, for example, and Spark, and we have a connector for business intelligence
that basically treats the database like a JDBC endpoint.
And so between those options, driver applications, and also the ability to do things like load
CSV, a lot of things that I do with customers involve creating ingest and egress pipelines.
And so on the ingest route, you can do it either batch or streaming.
And a common architecture might be something like my upstream Oracle system is publishing
all changes onto a Kafka topic. And so Neo4j is reading from that Kafka topic and is transforming
those records into a graph pattern, let's say three or four nodes linked together with some relationships. And so Neo4j is following along with whatever is coming from Oracle. And we are
augmenting that with some reference metadata that comes from other systems. And here we have a
knowledge graph in the center. Okay. On the egress side, it's again, kind of a situation of whether
you want to do it batch or streaming. So you can
do it streaming via Kafka or MSK, things like that. If it's batch, you can write a program,
you can use things like cloud functions, like Amazon lambdas and so forth. Or you can use stuff
like Databricks notebooks. I've been using a lot of Spark and Databricks myself lately for working
with Neo4j to get the data downstream.
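(On the batch side, the LOAD CSV route he mentions looks roughly like this; the file name, headers, and graph pattern are invented for illustration.)

    // Batch-ingest a CSV export as a small graph pattern:
    // (Customer)-[:PLACED]->(Order)
    LOAD CSV WITH HEADERS FROM 'file:///orders.csv' AS row
    MERGE (c:Customer {id: row.customer_id})
    MERGE (o:Order {id: row.order_id})
    MERGE (c)-[:PLACED]->(o);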
So let's zoom out from that perspective for a second. So you've got this black box Neo4j,
and you've got some good set of options to get data into graphs, and you've got good options
to get data out of graphs back to tables or documents or whatever it is that you need.
So you can look at that entire architecture as kind of like a graph coprocessor
on top of any other system.
So for example, some people will do things
like graph assisted search.
They might have Elasticsearch that they're using
for website search, but they might also load
their product taxonomy into Neo4j
and then change their website search
so that it's still primarily
Elasticsearch text, but it's also being informed by search expansion from the knowledge graph.
You might search for black tube socks and via the knowledge graph, I might expand that out
to also include search results for leggings, for example, because we know that those are related
from the knowledge graph, even if tube socks as a piece of text would never match leggings in text form.
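(A hypothetical sketch of that search-expansion step: take the matched term, walk its neighborhood in the product knowledge graph, and hand the expanded term list back to the text search engine.)

    // Expand a matched term to related terms one or two hops away.
    MATCH (:Term {text: 'tube socks'})-[:RELATED_TO*1..2]-(related:Term)
    RETURN DISTINCT related.text AS expandedTerm;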
So then you can basically take this graph architecture and you can add it on to other
systems.
So we're not necessarily trying to replace them or say that Neo4j has to do everything
for you, but it can add a lot of graph value to whatever it is that you're already doing.
And that same kind of pattern tends to
repeat itself when it comes to financial fraud detection engines, recommendation engines,
and so on. And some of the more interesting stuff that I get to play with these days has to do with
putting graphs into machine learning pipelines in exactly that pattern I'm describing.
That's super interesting. Do you also see use cases where Neo4j is used together with an OLAP system, like for more, let's say, BI or analytics workloads?
Definitely, yes. Well, so for BI or analytics. Can you give me just a little bit more on the
question? What kind of scenarios do you have in mind? Yeah, I'm talking more about the typical
cases inside the company and around reporting. We try to figure out, for example, how marketing
is performing, right? One of the issues, and I think that Eric will have much more questions
around that in marketing is attribution, which in my mind, at least you can consider like attribution
as a kind of graph, or you have the customer journey, which is okay.
I have the customer.
This is the first touch point.
This is the second touch point.
Like, all these problems that companies usually, let's say, put under the umbrella of traditional BI, because it's more something that tries to explain what happened, right?
That's what I mean, if you see use cases there.
Yes, definitely.
The intent of what you're asking, I kind of break it into two categories.
Sometimes Neo4j is the system of record because it's a transactional database.
And basically, people are trying to do BI on what's happening in their Neo4j system. And in that case, we have this connector for BI
I was describing earlier, where you just treat it like a JDBC endpoint, and then you can use
Tableau against Neo4j if you want to do that traditional BI stuff. A different use case is
when Neo4j isn't the system of record, but it might store, for example, a knowledge graph or
a set of taxonomies that you're going to use as a reference data source
in the BI or in the analytics process. That is also a pattern, but a different one.
That's super interesting. One follow-up question on that. And I think, of course,
being from a marketing background that got my wheels turning on the marketing attribution side
and thinking about the customer journey. So, you know, the knowledge
graph is sort of one component of that, but do you see any use cases around behaviors, right?
So when we think about behaviors that are related, and if you think about that in the context of
maybe BI or, you know, sort of just a traditional database, you're talking about just tons and tons
of SQL, which you talked about before, right? So you have different behaviors represented in different tables, and then
you're sort of trying to tie that together using unique identifiers, which sort of change by table.
And so you end up with these sort of monster queries to pull together like a basic user
journey. I'd love to know about any use cases where graph helps make that more elegant or
easier to do. Yeah, absolutely.
Well, so here I'd probably point people to,
there's a use case on our website from Nordstrom
and it has to do with personalized product recommendations
and click stream data.
And so, you know, earlier we talked about
how people use graphs for recommendations
like what might a person want to buy.
And I think at Nordstrom,
what they're doing
is taking all of these different events that are occurring on their website. And that's kind of,
you know, partially the user journey through a website and then using that to inform what
the recommendation is. To your point, every time somebody makes a touch point with your company,
whether it's downloading a white paper or sending an email or downloading a piece of software or something along those lines, that is
part of their journey.
And you can use graphs to basically pull that together and look at a whole bunch of users
in a cohort.
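(As a sketch, journeys for a cohort, say the "ghosted us" group he mentions next, might be pulled together like this; :User, :DID, and :Touchpoint stand in for whatever the real event model is.)

    // For each user in a cohort, the first three touchpoints of their journey.
    MATCH (u:User {cohort: 'ghosted-us'})-[d:DID]->(t:Touchpoint)
    WITH u, d, t ORDER BY d.timestamp
    RETURN u.id AS user, collect(t.kind)[..3] AS earlySteps;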
And so a very useful analysis to do might be to say of the people who ghosted us and never called us back, what similarities did their journey have?
You know, of the people who later bought the product, what were the most common early steps that they took on our website. And so you build graphs of these kinds of user journeys. And this is where
some of our graph algorithm stuff comes in, where you can use it to establish similarities between
nodes, or you can do certain kinds of graph partitioning that can help you see some of these
patterns.

Very cool. And I know we're getting close to time here, but one thing that I think
would be helpful, we always like to try to talk practical implementation stuff with our guests, just because I think it's helpful for our audience. When companies decide to connect into Neo4j, what do you typically see in terms of stage of company,
sort of other tools that they might be using? Because I'm sure that there are people in our
audience maybe thinking graph might be a really interesting way for us to make some of the things
we're trying to do with sort of your traditional warehouse a lot easier. What's the point at which it makes sense? And what's the process of sort of implementing that
into your stack look like? So the companies that we work with are all over the gamut in terms of
their technical capabilities. I mean, in the last couple of weeks, I've worked with some that said
we're entirely on-prem and data can only get in and out using this particular enterprise ETL suite.
And so it's got to be Informatica, for example. And that ranges all the way up to companies that
have adopted a whole lot of cloud native services. And they'll say things like,
all of our workloads are running on top of Google Kubernetes engine. And so we need help doing the network bits to use such and such a managed service together with GKE to ingest data. So they're really all over the map. Usually when I go talk to a customer first, I try to establish what their baseline is. I love all of those native cloud services, and I think that they have a lot of value to offer, but it's also a massive learning curve.
And it's something that I find within some of these companies is best adopted piecemeal or like a bit at a time.
Because if I go in with an architecture that calls for the use of, you know, Kafka as a managed service and, you know, storage triggers based on S3 or something like that, I could really lose people.
So I think a good architecture is one that has to be operable by the people, organizations,
and policies that you have around you. So how to get started? Well, maybe don't get started on a big architecture before you've proved the value to yourself. Usually what I tell people is, like, take a look at the Neo4j Sandbox at sandbox.neo4j.com. And that will give you a playground where you can
play with technology, load your own data, see if you can get some value out of it. And that helps
people get started thinking about, well, what does my data look like as a graph? And, you know, can Cypher help me, and that sort of stuff.
And I find that the architecture bits then come as a maturation of that basic value proposition.
You say, all right, I want that value, but I want it at larger scale or with more timeliness.
And then all of those questions about ingress and egress and architecture get tackled in the context of
how do I extend some piece of value that I already know I want?
Sure. And would you say there's sort of a light bulb moment people have when they start playing
with the sandbox? Because I just know there are a lot of people out there who have lived and worked
in sort of your traditional data warehouse paradigm,
you know, maybe for their entire career and are familiar with graph, but haven't really
dug into it in terms of practical value in their data stack. What is it? Can you describe that
light bulb moment a little bit when you say, Oh, wow, like this could be a really valuable
component of the stack?

So I'll give you an example of a light bulb moment, but I think for a lot of people,
they might have to try it out with their data to see what it's going to be in their business
context. But let's just say, imagine that you had a list of roads and you knew how long each road
was and how much distance there was between each city. And you had a shortest path problem.
And you said, well, I live in Richmond, Virginia, and I want to get to Washington, DC along the
shortest possible path. Sort of think about the kind of thing that Google Maps does every single day. So each road that you take has a distance and a speed limit, and so it has a cost to traverse.
And so you want to figure out the shortest path from
Richmond to Washington, figure out how you would do that with a relational database, and then get back to me. Then go take a look at what it would take to do that in Cypher, and you'll have that light bulb moment. Now, some people won't connect with that experience
because they'll say, well, I don't have Google Maps data. And that's where
you've got to try it out with the sandbox, load your data into a basic data model. And you're
going to find that usually the aha moment is some sort of a variable length path query,
because that is the sort of thing that graphs make super, super easy and relational databases
and other databases just don't. So there are certain
use cases that are more relational where you're not going to have that aha moment. So like I gave
the example earlier, if you have a million customers, you want to know all the ones where
the zip code is 23229, graphs can do that, but that's not a particularly graphy problem.
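(For reference, the fewest-hops version of the Richmond-to-DC question in Cypher, assuming :City nodes joined by :ROAD relationships. Weighting the path by distance or travel time takes a dedicated shortest-path procedure, such as a Dijkstra implementation from Neo4j's graph algorithm libraries, rather than plain Cypher.)

    // Fewest road segments between two cities.
    MATCH (a:City {name: 'Richmond'}), (b:City {name: 'Washington'}),
          p = shortestPath((a)-[:ROAD*]-(b))
    RETURN [n IN nodes(p) | n.name] AS route, length(p) AS hops;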
Makes total sense. I have one last question for you, David,
and it's about open source.
So what's the relationship of Neo4j with open source, and how important has the whole community been for Neo4j so far?
Neo4j was born as an open source product
and the community edition is out there right now
under, I forget which variant of the GPL,
but there's an open source version available right now.
I'm a person who has written multiple different open source packages that relate to Neo4j.
One called Neo4j Helm that helps you deploy it on Kubernetes.
Another called Halin, which is a monitoring tool.
So, I mean, I don't know what else to say other than we're
strong proponents of open source and that's been a core part of our business since the beginning.
You know, now speaking not as a Neo4j employee, but I think that open source is going through an interesting evolution right now, one that comes from, you know, the increase in cloud platforms and also people's shift towards managed services.
I mean, consider that nobody ever asks whether Gmail is open source or not.
And so when you use things as managed services, what open source means really starts to change.
And I don't really have anything so great to say about that other than that as like
a practitioner and somebody in the industry, I'm pretty interested to see what's going to happen in the coming years. Absolutely.
I think, you know, at RudderStack, where Costas and I both work, we're open source as well. And
it's going to be a fascinating, fascinating environment to operate in, in the coming years.
Well, we are at time, David. This has been so interesting. I learned a ton. I think our
audience has learned a ton and I just have all
sorts of interesting ideas around graph with all the new knowledge you've given us. So really
appreciate you taking the time. Yeah. Thanks for having me.
Always an interesting conversation on the Data Stack Show. We took a little aside there to talk
about writing books, but I think it's just interesting when someone has done something, you know, sort of
maybe related, but as an activity outside of their day-to-day work with data. So it was fun to hear
about the process of writing a book. I think one of the interesting things that stuck out to me
was how sort of straightforward it seems to integrate something like Neo4j with an existing data
stack, right?
So we talked about just sort of reading from a Kafka topic, right, which is a really, really
common structure within data stacks these days.
And so that was exciting to me, I think.
And I hope our listeners sort of had some ideas about how they may be able to try this
out with some of their existing data stack infrastructure.
Yeah, absolutely.
From my side,
I was quite surprised to hear
for all these different use cases,
to be honest.
It was super interesting
to see that a graph database
can be used as part
of the product experience,
but at the same time,
it can also be used
for a lot of typical
analytics workloads.
And what was really,
really interesting for me
is how it can be used
inside the context of data governance. That's something that I definitely want to learn more
about. Okay, so we'll extend this show just another two minutes, because I want to ask you
about this. So we talked about data governance a good bit in one of our recent shows. Do you think
that graph is potentially one of the ways to solve data governance at a
sort of comprehensive level?
If you can connect all of the data in your stack to it, it may make some of that governance
workload easy.
Yeah, I think, okay, it's not going to solve all the problems that are under data governance,
right?
I don't think that's, okay, what you need in order to manage access to data, like things like Immuta is doing.

Right, right, right.

But the thing with data is that data gets continuously transformed inside the company, right? It is consumed by many different stakeholders, many different systems. And at the
end we need to track the evolution of data.
And a graph structure makes a lot of sense there, right?
Like it's a very native way of representing
how a piece of data has evolved in time, right?
And with what people it has interacted and all these things.
So for stuff like data lineage, for example,
yeah, that's probably like a very, very good way to represent this. Now, that's one part of the problem, right?
Then you need to feed the database with all the data and most importantly, the metadata here,
which is another question that we need to ask our guests.
How do you get this data, and how do you generate this metadata, to feed a graph database to solve
the problem?
Very interesting.
Well, time will certainly tell.
Thank you again for joining us on the Data Stack Show.
Be sure to hit subscribe in your favorite podcast app.
That way you'll get notified of new episodes when they go live and we will catch you on the next one.
The Data Stack Show is brought to you by Rudderstack,
the complete customer data pipeline solution. Learn more at rudderstack.com.