The Data Stack Show - 38: Graph Databases & Data Governance with David Allen of Neo4j
Episode Date: June 2, 2021

Highlights from this week's episode include:

- David's background in comparative databases (1:50)
- David's experience and lessons he learned from writing his book (3:23)
- How writing a technical book compares to writing technical documentation (4:41)
- The process of writing a book (6:30)
- The best and worst parts of David's book writing experience (8:02)
- An introduction to what Neo4j is (9:08)
- What you need to graph (11:13)
- Typical problems a graph database is a good solution for (13:00)
- The performance difference between graph and relational databases (18:41)
- How Neo4j addresses performance and ergonomics (23:30)
- Neo4j and scalability (26:20)
- How Neo4j fits in the modern data stack (31:48)
- Neo4j use cases (35:45)
- Practical implementation of Neo4j (40:51)
- Neo4j's relationship with open source (45:50)

The Data Stack Show is a weekly podcast powered by RudderStack. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
The Data Stack Show is brought to you by Rudderstack, the complete customer data pipeline solution.
Thanks for joining the show today.
All right, today on the show, we have David Allen, and he works for a company called Neo4j, and they make a graph database.
And this is probably going to be a pretty technical discussion, which we really like.
And I'm pretty interested to ask about how graph can be used in the context of event data. That's something that I deal with every day in my job, and we see that sort of clickstream thing. And graph seems like a really interesting application for that.
So that's my burning question.
Costas?
Yeah, I want to see how Neo4j, and graph databases in general, fit into a modern data stack, to be honest. I'm quite familiar with the product.
I've seen how it grew from the early days of Neo4j as an open source
project, and it has matured a lot. And of course, the market has changed a lot, right? So I really
want to see what kind of use cases exist out there today for graph databases. Awesome. Well, let's
jump in and talk graph with David. David, welcome to the Data Stack Show. Really excited to talk with you about all sorts of
things, but especially graph. Yeah, thanks for having me today. I'm looking forward to the
conversation. We love getting into the technical details, but as I'm known to do, I want to start
with just hearing your background, and then I have a question about something non-technical.
So do you just want to give us the two-minute overview of who you are, what you're doing, and how you ended up where you are?
Sure. Well, I started off my career in data management consulting. And so in that capacity,
I was usually working with corporate customers, doing things like building ETL pipelines,
helping them with data quality problems, helping them with master data management and governance type stuff.
After a couple of jobs in that area, I ended up doing some applied research and development
for government at a company called MITRE.
And it was actually when I was at MITRE that I kind of ran into graph for the first time.
I didn't actually go straight to Neo4j after MITRE.
I went on and did some work as a CTO
for a startup and did a couple of other jobs and ended up coming back to Neo4j when I found a
position opened up and I wanted to kind of re-engage in the graph space. But I, you know,
I think of myself as coming from a background of like comparative databases. And at one point or
another in my career, I seem to have ended up having a
situation where I needed to use all of them. Very cool. Well, I want to dig into graph stuff
because there's so much there, but you're actually an author. So you wrote a book and you have spent
some time in the academic space. And so I'm really interested to know on the authorship side of things, what was that experience like? And were
there any lessons from sort of like data management or the engineering side of things that you took
into the process of writing a book? Now, of course, it was a technical book, but that's kind
of a cool thing that you're an author and sort of work in the technical side of data.
Well, it's been a while since that book was published.
I mean, I think that when you start the process
of writing a book,
you definitely bring all of your technical experience in.
And I think you set out to try to summarize
some of what you've learned
in the context of like technical publications,
whether it's a book or a blog post,
any form of publication like that.
In terms of the lessons learned? Wow. There were a lot of them for me in that process. I think one of them is, you know, don't get into technical book publishing if you're trying to make money, that's for sure. Another one is, you know, you set out with a clear outline and you're like, okay,
I know what I want to say in this book, but you find that in developing the connective
tissue and making the entire story cohesive throughout that you end up having to go do some more basic research to fill in the gaps,
you know, areas where you thought you knew something, but you didn't actually, to make the whole thing hang together. And so I thought when I started the process, it was going to be,
okay, we're going to sit down and write down everything we've learned. And it wasn't that exactly. We had to fill in a lot of gaps.
Interesting. And how would you compare the experience of writing a technical book
to maybe something more like writing technical documentation?
Well, that's an interesting question. I think that technical documentation, I've written a lot of that over time. It feels like it's definitely much more narrow in scope and it has a more focused audience. Whereas in a longer form book type of a setup, you're kind of expected to give some element across the whole spectrum of like, well, there's understanding oriented material of what
are the key concepts and what is the theory behind this? And then there's also the how to
oriented material where you say, okay, this line of code and then that line of code.
And then there's the tutorial element, which is how to apply your knowledge to a novel situation.
So when you're writing technical documentation, I find it usually falls into one of those categories, like explanatory, conceptual, how to, or tutorial. And when
you're in the longer form, like if you're writing a book, you end up having to figure out how to do
the proper balance of all of that to give somebody like a really comprehensive view on a subject.
Right. Kind of like a narrative arc almost.
Yeah, there you go. And also, you know, sort of like, you know, if you've been to college and
you've seen how they structured the course materials, there's the sort of like 101 progression
up to the higher level course material. And so you got to really think about that knowledge path
when you're writing a longer piece of technical information, because you want to go jump in and explain the
complicated stuff, but you have to lay the groundwork and get some conceptual machinery
out of the way so that the more complicated stuff will make sense later on. In the context of a
tutorial or a blog post, you never do that. You just say up front, you need to know these three
things before you read this. And then that's that.

That's super cool. Can you tell us a little bit more about what it feels like to write a book? Like, what is the process, and how long does it take?
Well, I've only done this once. I wouldn't represent myself as an expert in the area,
but it's a pretty grinding process, to be honest. You first start with a proposed outline before
your project is even accepted. And you typically work with an editor who provides feedback on that outline.
And then you get to something like an annotated outline, which is, you know, think of it as a
fleshed out outline with just like a list of bullets for each subheading of what you would
talk about or how you would approach that. And then you go through the grinding process of
developing out, you know, a first draft of each of the sections. And then typically there are expert reviewers. So they get other people in your field to read the book and provide notes so that the publisher
themselves understands that they're not putting out, you know, bad information about this or that
topic. And when you get to the point where you've got a relatively mature draft, there are a lot of edits that go into that. And then there might be post-production stuff like,
is the publisher going to work with a company
that's going to build an index and those sorts of things.
That's super interesting.
Yeah, so it's a long process.
Yeah, it sounds like it is.
I mean, the reason I'm asking is I have no idea
how a book is written.
I'm pretty sure many people don't know how much work goes into all these books that we see out there.
And yeah, it's a very good opportunity.
So what was the...
Okay, last question about the book, I promise.
What was the best part of writing this book and what was the worst part for you?
Well, I'll give you the worst part first.
The worst part was the endless revisions and not
being sure what the finish line is. And I think that all book authors kind of go through that at
one point or another, but not to dwell on that too much. The best part is like, I mean, hey,
I'm in the tech industry in the first place because I enjoy learning and I enjoy the process
of learning. And so I would say discovering the gaps in my knowledge was the most fun part. I set out
to write down what I thought I knew. And as I started getting deeper into it, I started to
discover, okay, the way you're thinking about this or that topic is a little fuzzy and needs
some more detail. So now I have to go do basic research. And so it is as much a learning project
as it is a writing project, if that makes sense. And that's
actually fun to me. I mean, I really like that element about software and technology is that
it puts me in a position to keep learning new stuff. Oh, that's an amazing point. Nice. Really,
really nice. So David, let's start talking a little bit about Neo4j and graph databases. Do you want to do us a quick introduction on what Neo4j is?
Sure. Well, Neo4j is what we call a native graph database.
And so graphs are a data structure that folks may or may not be familiar with.
But basically, every time you go to a whiteboard
and you draw a bunch of circles and lines connecting the circles on the whiteboard,
you are describing
a graph. So graphs are composed of nodes, which are those circles that you draw, and relationships
that link the nodes together. And what people find, like particularly in the whiteboarding context,
is that this structure is very rich, very easy to work with, and very associative and sort of more similar to how the
inside of your own head operates than, for example, something like a table or a document.
You know, when you're working with tables, you make a list of records. And so you can think of
every table at the end of the day as being a list of sorts. Graphs are just this flexible,
open-ended data structure with a lot of links back to,
you know, basic computer science that you can use to represent any form of data under the sun.
When people most often get exposed to graphs for the first time, it's usually in the context of
something like a social network. So you go onto Twitter and your account, like imagine on the
whiteboard, I draw my account as a circle. And then I might draw all the accounts that I follow as other circles, and then draw a line from me to them saying I follow those accounts, and so forth. That is data most naturally represented as a graph. You go to LinkedIn and you say who's connected to who,
it's the same structure again.
Facebook, same structure again.
So that's how people usually first come to it.
And it's in the context of other business problems
where we introduce how to apply that kind of graph thinking
to lots of other kinds of data.
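(To make the whiteboard picture concrete, here is a minimal Cypher sketch of the Twitter-style follow graph David describes. The :Account label and :FOLLOWS relationship type are illustrative, not from the episode.)

    // Create two accounts (the circles) and a follow relationship (the line).
    CREATE (me:Account {handle: 'david'})
    CREATE (them:Account {handle: 'datastackshow'})
    CREATE (me)-[:FOLLOWS]->(them);

    // Ask the graph: who does 'david' follow?
    MATCH (:Account {handle: 'david'})-[:FOLLOWS]->(followed:Account)
    RETURN followed.handle;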
And why do we need a graph? Are the relational databases that we have so far not expressive enough for these types of problems that we usually solve with graph databases?
Yeah, so it's not a question of expressiveness. So I've used a lot of different databases, though not all of them.
And so if you think about,
there are these different data models.
You've got graph,
relational, JSON documents, key value stores. Now, it would be silly to say that there's some kind of fact about the world that you can say with graphs that you can't say with tables,
because you can represent literally anything with tables. But I find it better to think about these
different kinds of databases as more like tools in a toolbox. And so the question is not whether it's more representationally powerful. The question is whether it's the right tool for the job. If you have a million customers and you want to know all the ones where the zip code is 23229, tables can do that just fine; that's not a particularly graphy problem. However, if you wanted to, for
example, calculate somebody's Kevin Bacon score, like how many hops away they are from Kevin Bacon
in terms of the movies that they've made, that requires you to navigate a complex set of
relationships between records. And that is a very graphy problem. So the way I try to sum this up in the simplest possible way
is to say that sometimes the relationships
between the data items matter more
than the data items themselves.
And if that sounds like it fits your problem,
you're probably in the graph space.
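(As an illustration, the Kevin Bacon question in Cypher might look like the sketch below, assuming a hypothetical model of :Person nodes tied to :Movie nodes by ACTED_IN relationships.)

    // Shortest chain of co-acting links between two people. Each person-to-person
    // hop passes through a movie (two relationships), so the Bacon number is the
    // path length divided by two.
    MATCH p = shortestPath(
      (kb:Person {name: 'Kevin Bacon'})-[:ACTED_IN*]-(other:Person {name: 'Some Actor'})
    )
    RETURN length(p) / 2 AS baconNumber;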
Yeah, that's a very good way to put it.
I really like how you described the difference there.
So can you give us,
I mean, you said like people usually
get introduced to graphs by the social graph, right? That's what everyone has heard about.
Can you give us a few examples of, like, typical problems that a graph database is a very good solution for?

Yeah, sure. So let me tell you a story about how I actually started
using graphs, because this is the one that's most direct to me. So I was working for that government research and development company called MITRE, and we were developing a solution for data provenance. Executives would say: I got this intelligence report, or I got this estimate, or I got this financial number.
And I don't know whether I can trust it or not because we have internal process weaknesses in our organization.
And so what they wanted to know was how did we arrive at this judgment or decision?
And so in order to know whether the information was good or not, we had to trace backwards to figure out how it was put together.
So we might say, well, we gathered some facts from Bob and then Bob summarized them in this report. And then that report was processed by that system and so forth. And so basically I want you to picture like a
family tree for any given report that you might see. Well, that family tree is a graph because
we can say data came from sources, was transformed in certain ways, and then got summarized into the report.
And the way that I first came to graph is that I wanted to build directed acyclic graphs of data provenance so that I could answer these questions for these executives.
And I first did it on top of MySQL and just did it using two tables and joins between the tables, which MySQL is perfectly
capable of doing, and discovered Neo4j later when I found that it was much easier to develop with
and much faster for my purpose. So that little story, that's data lineage, which is fundamentally
a graph. So we have a lot of banks and financial institutions that use it for fraud analysis.
So a question might arise, well, is this particular payment or bank transfer fraudulent?
Sometimes that's very difficult to answer with the isolated details of the payment.
So it's a large payment or a small payment. That doesn't really mean that it's fraudulent or not fraudulent.
But if you can connect it with the wider community and you can say, this payment is one of 10
other payments that is all going into one bank account, which has been accumulating
a suspicious amount of funds, the pattern of relationships would give you a stronger
basis to cast suspicion on that one transaction, for example.
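(A sketch of the kind of pattern David describes, with made-up labels and properties: many payments funneling into one account, which is hard to spot one transaction at a time.)

    // Accounts receiving 10+ payments that add up to a suspicious total.
    MATCH (p:Payment)-[:CREDITS]->(a:Account)
    WITH a, count(p) AS payments, sum(p.amount) AS total
    WHERE payments >= 10 AND total > 100000
    RETURN a.id AS account, payments, total;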
So a lot of other companies use it for recommendation engines.
And so we have some retailers, for example, you go under their website and you add something to
your shopping cart and they would like to recommend a product that you might like.
One of the ways you can do that with graph is say, do a social type of recommendation and say,
you have added an item to your cart that a lot of
other users liked. So we're going to recommend the other items that they really enjoyed,
or we're going to recommend items purchased by users whose behavior is similar to yours.
And so when you think about how to express those questions, it's all about the relationships
between users, products, order baskets, and behavior over time.
And those tend to be naturally graphy problems where some of the techniques that Neo4j has make your life easier.
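(For instance, a minimal collaborative-filtering-style query under an assumed model of :User nodes with :PURCHASED relationships to :Product nodes; the names are hypothetical.)

    // "Users who bought the item in your cart also bought..."
    MATCH (:Product {sku: 'cart-item'})<-[:PURCHASED]-(u:User)-[:PURCHASED]->(rec:Product)
    RETURN rec.name AS recommendation, count(u) AS buyers
    ORDER BY buyers DESC
    LIMIT 5;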
That's super cool. So a graph database is a good tool for any problem that cares more about the relationships and connections between entities than about the entities themselves.

That's definitely a part of it: if you care more about the relationships than the items themselves, that's probably a good tip-off.
Usually we just talk about this in terms of connected data.
And that's a pretty wide space, but
folks know these problems when they see them inside of their organizations. Like there are
other cases where we've done a lot of work, like master data management, where you might need to
connect the metadata from 15 different databases and ask questions about, you know, where are our
customer IDs split across lots and lots of different databases.
And so that in turn is also a connected data kind of problem.
So these kinds of connected problems run the gamut.
I don't usually say that the graph database is the optimal solution for every single problem
under the sun, but the expanse of connected data problems is wider than most people appreciate when
they first see it.
Yeah, yeah.
But it seems from what you've shared so far with us that there are a couple of problems
that let's say fall under the category of data governance in general, that graph databases
are a very good fit for that.
Is this correct?
Yes, absolutely.
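(As one concrete data-governance example, a lineage traversal over David's provenance family tree might look like this sketch; the :Report label and :DERIVED_FROM relationship are illustrative.)

    // Walk the provenance DAG backwards from one report to everything
    // it was ultimately derived from.
    MATCH (r:Report {id: 'quarterly-estimate'})-[:DERIVED_FROM*]->(src)
    RETURN DISTINCT labels(src) AS kind, src.name AS source;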
Oh, that's very interesting.
To be honest, I wasn't aware at all that these kinds of problems are solved with graph databases. That's super, super interesting.
We hear a lot about data governance,
to be honest, in this show.
And it's a very hot space right now.
There are many products that are popping up out there
about solving very specific problems
around data governance.
And I know, and at least I knew in the past
that data lineage is, like, a pretty tough problem to solve.
So I'd like to ask you a little bit more about that.
And there are also like some selfish reason behind that
because I'm very interested in it.
You said that you used a graph database to do that.
And you also tried to do that using MySQL. My first question is: what was the difference between the
two in terms of performance and also in terms of, let's say, ergonomics of using one system and the
other to solve the same problem? How would you describe this?

Okay, so this kind of goes back
to what we were talking about a little while ago about the difference in representational strength and how there isn't one. So the way that you model
a graph in a relational database, we've seen 100 different customers do this 100 different ways,
but the patterns are all very similar. Basically, what you do is you create a node table, like
imagine that we have a person table. And then you create a separate many to many join table. Let's call the
table person knows person. And so then if I want to query the database to see if David knows Costas,
then what I do is I join the person table to the join table back to the person table. And then if
that crosswalk exists, then that relationship in your graph exists. Make sense so far?
Yeah, absolutely.
Okay.
So that's a perfectly fine way of doing a graph.
And other times, if all people need is a hierarchy, they will put, for example, a parent ID in a column.
And they'll say, well, you know, the parent organization is ID 15. And so a foreign key that links back to the same table
could be a form of a graph link, if that makes sense. So if you want to then traverse the graph,
let's say that you want to navigate from one node to another node in this graph,
you're always going to be doing that by SQL joins. And that's okay. But here we've already planted the seeds of where your
performance problems are going to be. So the issue with SQL joins is that they need to be
recomputed each time. And so relational databases have gotten really good at this. Don't get me
wrong. I mean, they got 30 years of research that goes into projecting out just the right rows very
quickly and optimizing that join process.
But the un-get-aroundable fact is that you're typically going to recompute this every time.
So this in turn means that if you want to do lots and lots and lots of joins, I'm talking
about 8, 10, 15, or more, you're going to be multiplying that computation burden.
So from a performance perspective, you can see that
navigating the graph via joins is going to scale poorly, no matter how you set your relational
database up. From an ergonomic perspective, SQL isn't meant to do this. And so if you're
expressing a relationship traversal as a join, you already have an ergonomic gap there. And it is extremely difficult to express
things like, I want to know the people that I am connected to that are between two and five hops
away from me. So in other words, when you want to place constraints on the length of the path that
you're navigating. Okay. I tried to do this when I wrote a custom stack of Java software on top of MySQL for my provenance database, and I found that it is possible. I kind of had to go up the mountain and consult with the local SQL gurus, and they gave me a set of SQL constructs that I could use for recursive table joining, and even, you know, bounded recursion, to do those sorts of things.
But I'm telling you that the SQL was crazy complicated.
And when I finally managed to write it, it performed really poorly.
And I could see that that was going to get worse as my traversals got deeper and as the
total volume of data that I was dealing with got larger.
And that's the point where,
you know, in my journey with MySQL for storing graphs, I had to say, whoa, I got to take a step
back. What am I trying to do here? I'm trying to implement a graph abstraction on top of something
that's not a graph. So are there any options for me out there that will actually store graph as a
data structure? And that's when I found Neo4j. Sometimes when we're talking with
customers, we show them this slide where it's a very graphy query. The slide is something like: find all people that this manager manages, up to three levels down, and count the number of people per manager. And you show, like, a four-line Cypher query, and then you show some gigantic, awful SQL
query that does the same thing.
And we use that as a jumping off point to describe these ergonomic differences between the query
language. But really, if you boil out all the detail, what it really gets down to is
using the right tool for the job. It's easier to use a graph query language on top of a graph
structure than to use a table query language on top of a table abstraction of a graph.
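(A sketch of what that four-line Cypher query might look like, assuming :Person nodes connected by a hypothetical :MANAGES relationship; the SQL equivalent needs a recursive common table expression or a chain of self-joins.)

    // Everyone under this manager, up to three levels down,
    // counted per direct or intermediate manager.
    MATCH (top:Person {name: 'Alice'})-[:MANAGES*0..2]->(m:Person)-[:MANAGES]->(report:Person)
    RETURN m.name AS manager, count(report) AS reports;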
Yeah, yeah. Makes total sense.
And that's exactly what I was hoping to hear from you, to be honest.
And I had SQL in my mind when I was asking this question.
So how does Neo4j address these two things, performance and ergonomics? Like, what is Neo4j doing differently than MySQL?

So let me take these individually.
On the performance side, you'll see Neo4j talk about itself
as a native graph database.
And what that really means is that the data structure, all the way down to what gets written to disk and what is stored in memory, is optimized for a graph structure.
So some folks may be familiar with the idea of sparse matrices or how you can represent a graph
as a matrix. The way Neo4j does it is basically nodes and relationships are always fixed length
records. And relationships are effectively not much more complicated than two pointers,
a pointer to the originating node and a pointer to the terminating node of the relationship.
Now, Neo4j likes to have most of the graph structure live in memory where we can.
And so the fixed length record aspect means that from a performance perspective,
you can jump to the right node just by doing some pointer arithmetic and memory offset. And the fact that relationships are pointers means that
traversing them is literally just dereferencing a pointer. And so the way a native graph database
works in memory is the combination of those techniques loaded hot in RAM so that graph
traversal really, when you strip away all the fancy aspects of it,
becomes pointer chasing. And that's a very fast operation to do in main memory.
So in terms of the ergonomics, Neo4j has the Cypher query language. And so for people who
are not familiar with it, I usually just describe it as SQL for graphs, really.
It has SQL-inspired syntax: things like WHERE, SKIP, LIMIT, all of that stuff operates the same in Neo4j as it does in SQL. But Cypher focuses on
letting you describe the graph pattern that you're trying to match and then letting the database
figure out how to go get it. And so in terms of ergonomics, I would, as a developer, I draw a
distinction between declarative languages and imperative languages. And so declarative languages
sort of broadly are, you tell the database what you want, and it's the database's problem to go
figure out how to do that. And an imperative language is more like a traversal, where you give the database explicit instructions and say: go to this node,
now expand out to that node, now expand out to that node, and so forth. And so Cypher is a
declarative language. And that's part of why it has better ergonomics for graph: you just describe the pattern that you want, you use SQL-like syntax, and you let the database work it out.

Sounds great.
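(To make the declarative point concrete: the "between two and five hops away from me" question from earlier is just a pattern with a bounded variable-length relationship. The :KNOWS model here is illustrative.)

    // Describe the pattern; the database plans the traversal.
    MATCH (me:Person {name: 'David'})-[:KNOWS*2..5]-(other:Person)
    WHERE other <> me
    RETURN DISTINCT other.name
    LIMIT 20;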
And you said that Neo4j, I mean, it prefers to operate in memory, right?
Yes.
How does this translate into scalability?
How well does the system scale?
Well, you know, as somebody who's been in engineering for a while,
I honestly kind of dislike the scalability question
because I usually want to break it up into lots of different kinds of scalability.
You know, you can scale storage, compute, you can scale high availability attributes and so on.
So Neo4j does like to live in memory.
And so usually what we advise customers is to have, you know, memory that is some healthy percentage of the total size of their database.
We have an arrangement called Fabric that's used for distributed databases.
So if you can't fit your entire graph in memory, what you can do is partition your graph out
and you can have many different database management systems that store pieces or shards of that
graph so that you're not restricted to how much RAM you can get in one box.
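(Roughly, a Fabric query fans a Cypher fragment out across shards and combines the results. The sketch below assumes Neo4j 4.x Fabric with a fabric database named 'fabric' holding two graph shards.)

    // Run the same pattern on each shard, then aggregate across them.
    UNWIND [0, 1] AS shard
    CALL {
      USE fabric.graph(shard)
      MATCH (p:Person) RETURN p.name AS name
    }
    RETURN count(name) AS totalPeople;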
David, your point about scaling was great.
And from all these different, like, scalability problems, what was the most interesting and challenging from an engineering perspective to solve for Neo4j?
And I ask this question having in mind
graph partitioning in particular, to be honest,
because I always thought that making a graph database distributed is kind of a hard problem.
So yeah, can you share a little bit more information about that?
Oh, it is a hard problem. So, let's think how best to approach this. Usually the trouble with scalability, or the hardest part about scalability for me, is that you're usually trying to optimize across a lot of different axes at the
same time. And so for example, when people say that they want to be able to scale the number of
reads, they're also typically implying that they don't want read or write performance to degrade
while they're doing that. And so you can't just make some element bigger, you have to retain a lot of other things. And so at the
extremes of scalability, what I've always found with databases is that you end up making some
compromises. So for example, the original eventual consistent databases all came along at a time
where people were trying to scale write volumes to the point where they couldn't maintain the strong ACID guarantees and scale writes to that degree. And so the hardest part of scalability is, you know, the TANSTAAFL rule, like there ain't no such thing as a free lunch. Or, you know, what's that other classic saying: better, faster, cheaper, pick two. So Neo4j as a system is basically trying to maintain strong consistency throughout and
puts a premium on making sure that the read performance and throughput is really quite
good.
And so in taking those design decisions, certain kinds of scalability might be a little bit
more difficult.
Other systems might take a drastically different approach and have different scalability attributes at the cost of different trade-offs.
Does that make sense?

Yeah, yeah, absolutely. Actually, I love your definition of scalability in relation to trade-offs, because at the end, that's exactly what scalability is: finding the right trade-offs based on the problem you're trying to solve. So yeah,
that's perfect.
That's an amazing definition.
To take that one step further, graph partitioning. You said graph partitioning is a hard problem.
I completely agree.
So there are a lot of other systems where they will automatically partition your graph.
And so they just say, hey, you just got a bunch of nodes and relationships.
We'll handle the sharding for you.
They make an explicit trade-off that users are not usually aware of. If you basically threw everything, let's say, into a MySQL table, and then you did horizontal or vertical partitioning of those tables to distribute your, quote-unquote, graph, you can sort of see that in doing so, you would be creating a lot of breakages where
in order to traverse a relationship,
you would need to cross between shards. That's an expensive thing to do in all distributed databases, moving, you know, doing the network coordination between shards. And so when we think
about how to partition a graph, there's one way to partition it where it allows you to write an
unlimited amount of data at the cost that your read queries might get increasingly expensive and difficult to do.
And there might be another way of partitioning your graph that is really taking into account what the schema and connected components of the graph are, which can preserve really high throughput and performance at the expense of making the partitioning scheme
more difficult to create and maybe not automatic.
Does that make sense?
Yeah, absolutely.
So that's kind of like manual sharding
versus automatic sharding.
And automatic sharding is clearly possible,
but it creates some trade-offs
that you might not realize you've made
until you're already at scale.

And in the case of Neo4j, which one do you support, or which one do you recommend to
your customers? So in the case of Fabric, what we're usually doing is recommending that the
customers come up with a partitioning scheme that makes sense for their data. I was just writing this crazy Twitter thread about this last week.
I can link you to it later if you want,
but it has to do with how to create cut points in your data model
such that when you distribute your graph across multiple partitions,
you are minimizing the number of times you're going to have to cross partitions.
And that is the property that's going to mean
that your queries are still going to
perform well at scale. Yeah. Yeah. Please, please share this. It's very, very interesting.
All right. One last question from me, and then I'll let Eric ask his questions because I'm
monopolizing the conversation here. So can you give us a little bit more information from an
architectural standpoint, how Neo4j
fits in a modern data stack and how it also integrates or interoperates, let's say, with
other common data systems in an organization, from what you've seen from your customers,
right?
Sure.
So let's first picture Neo4j as a complete black box in order to attack the interoperability
point.
And we need to see that there's got to be
some way to get data into the box and some way to get data out of the box. So on that front, we have a set of supported connectors that are available at the same price as the commercial software. And so we do Kafka, for example, and Spark, and we have a connector for business intelligence
that basically treats the database like a JDBC endpoint.
And so between those options, driver applications, and also the ability to do things like load
CSV, a lot of things that I do with customers involve creating ingest and egress pipelines.
And so on the ingest route, you can do it either batch or streaming.
And a common architecture might be something like my upstream Oracle system is publishing
all changes onto a Kafka topic. And so Neo4j is reading from that Kafka topic and is transforming
those records into a graph pattern, let's say three or four nodes linked together with some relationships. And so Neo4j is following along with whatever is coming from Oracle. And we are
augmenting that with some reference metadata that comes from other systems. And here we have a
knowledge graph in the center. Okay. On the egress side, it's again, kind of a situation of whether
you want to do it batch or streaming. So you can
do it streaming via Kafka or MSK, things like that. If it's batch, you can write a program,
you can use things like cloud functions, like Amazon lambdas and so forth. Or you can use stuff
like Databricks notebooks. I've been using a lot of Spark and Databricks myself lately for working
with Neo4j to get the data downstream.
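(On the batch side, the LOAD CSV route he mentions looks roughly like this; the file name, headers, and graph pattern are invented for illustration.)

    // Batch-ingest a CSV export as a small graph pattern:
    // (Customer)-[:PLACED]->(Order)
    LOAD CSV WITH HEADERS FROM 'file:///orders.csv' AS row
    MERGE (c:Customer {id: row.customer_id})
    MERGE (o:Order {id: row.order_id})
    MERGE (c)-[:PLACED]->(o);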
So let's zoom out from that perspective for a second. So you've got this black box Neo4j,
and you've got some good set of options to get data into graphs, and you've got good options
to get data out of graphs back to tables or documents or whatever it is that you need.
So you can look at that entire architecture as kind of like a graph coprocessor
on top of any other system.
So for example, some people will do things
like graph assisted search.
They might have Elasticsearch that they're using
for website search, but they might also load
their product taxonomy into Neo4j
and then change their website search
so that it's still primarily
Elasticsearch text, but it's also being informed by search expansion from the knowledge graph.
You might search for black tube socks and via the knowledge graph, I might expand that out
to also include search results for leggings, for example, because we know that those are related
from the knowledge graph, even if tube socks as a piece of text would never match leggings in text form.
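(A hypothetical sketch of that search-expansion step: take the matched term, walk its neighborhood in the product knowledge graph, and hand the expanded term list back to the text search engine.)

    // Expand a matched term to related terms one or two hops away.
    MATCH (:Term {text: 'tube socks'})-[:RELATED_TO*1..2]-(related:Term)
    RETURN DISTINCT related.text AS expandedTerm;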
So then you can basically take this graph architecture and you can add it on to other
systems.
So we're not necessarily trying to replace them or say that Neo4j has to do everything
for you, but it can add a lot of graph value to whatever it is that you're already doing.
And that same kind of pattern tends to
repeat itself when it comes to financial fraud detection engines, recommendation engines,
and so on. And some of the more interesting stuff that I get to play with these days has to do with
putting graphs into machine learning pipelines in exactly that pattern I'm describing.
That's super interesting. Do you also see use cases where Neo4j is used together with an OLAP system, like for more, let's say, BI or analytics workloads?
Definitely, yes. Well, so for BI or analytics. Can you give me just a little bit more on the
question? What kind of scenarios do you have in mind? Yeah, I'm talking more about the typical
cases inside the company and around reporting. We try to figure out, for example, how marketing
is performing, right? One of the issues, and I think that Eric will have much more questions
around that in marketing is attribution, which in my mind, at least you can consider like attribution
as a kind of graph, or you have the customer journey, which is okay.
I have the customer.
This is the first touch point.
This is the second touch point.
Like, all these problems that companies usually, let's say, put under the umbrella of traditional BI, because it's more something that tries to explain what happened, right?
That's what I mean, if you see use cases there.
Yes, definitely.
The intent of what you're asking, I kind of break it into two categories.
Sometimes Neo4j is the system of record because it's a transactional database.
And basically, people are trying to do BI on what's happening in their Neo4j system. And in that case, we have this connector for BI
I was describing earlier, where you just treat it like a JDBC endpoint, and then you can use
Tableau against Neo4j if you want to do that traditional BI stuff. A different use case is
when Neo4j isn't the system of record, but it might store, for example, a knowledge graph or
a set of taxonomies that you're going to use as a reference data source
in the BI or in the analytics process. That is also a pattern, but a different one.
That's super interesting. One follow-up question on that. And I think, of course,
being from a marketing background that got my wheels turning on the marketing attribution side
and thinking about the customer journey. So, you know, the knowledge
graph is sort of one component of that, but do you see any use cases around behaviors, right?
So when we think about behaviors that are related, and if you think about that in the context of
maybe BI or, you know, sort of just a traditional database, you're talking about just tons and tons
of SQL, which you talked about before, right? So you have different behaviors represented in different tables, and then
you're sort of trying to tie that together using unique identifiers, which sort of change by table.
And so you end up with these sort of monster queries to pull together like a basic user
journey. I'd love to know about any use cases where graph helps make that more elegant or
easier to do. Yeah, absolutely.
Well, so here I'd probably point people to,
there's a use case on our website from Nordstrom
and it has to do with personalized product recommendations
and click stream data.
And so, you know, earlier we talked about
how people use graphs for recommendations
like what might a person want to buy.
And I think at Nordstrom,
what they're doing
is taking all of these different events that are occurring on their website. And that's kind of,
you know, partially the user journey through a website and then using that to inform what
the recommendation is. To your point, every time somebody makes a touch point with your company,
whether it's downloading a white paper or sending an email or downloading a piece of software or something along those lines, that is
part of their journey.
And you can use graphs to basically pull that together and look at a whole bunch of users
in a cohort.
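(As a sketch, journeys for a cohort, say the "ghosted us" group he mentions next, might be pulled together like this; :User, :DID, and :Touchpoint stand in for whatever the real event model is.)

    // For each user in a cohort, the first three touchpoints of their journey.
    MATCH (u:User {cohort: 'ghosted-us'})-[d:DID]->(t:Touchpoint)
    WITH u, d, t ORDER BY d.timestamp
    RETURN u.id AS user, collect(t.kind)[..3] AS earlySteps;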
And so a very useful analysis to do might be to say of the people who ghosted us and never called us back, what similarities did their journey have?
You know, of the people who later bought the product, what were the most common early steps that they took on our website. And so you build graphs of these kinds of user journeys. And this is where
some of our graph algorithm stuff comes in, where you can use it to establish similarities between
nodes, or you can do certain kinds of graph partitioning that can help you see some of these
patterns.

Very cool. And I know we're getting close to time here, but one thing that I think
would be helpful, we always like to try to talk practical implementation stuff with our guests, just because I think it's helpful for our audience. When companies decide to connect into Neo4j, what do you typically see in terms of stage of company,
sort of other tools that they might be using? Because I'm sure that there are people in our
audience maybe thinking graph might be a really interesting way for us to make some of the things
we're trying to do with sort of your traditional warehouse a lot easier. What's the point at which it makes sense? And what's the process of sort of implementing that
into your stack look like? So the companies that we work with are all over the gamut in terms of
their technical capabilities. I mean, in the last couple of weeks, I've worked with some that said
we're entirely on-prem and data can only get in and out using this particular enterprise ETL suite.
And so it's got to be Informatica, for example. And that ranges all the way up to companies that
have adopted a whole lot of cloud native services. And they'll say things like,
all of our workloads are running on top of Google Kubernetes engine. And so we need help doing the network bits to use such and such a managed service together with GKE to ingest data. So they're really all over the map. Usually when I go talk to a customer first, I try to establish what their baseline is. I love all of those native cloud services, and I think that they have a lot of value to offer, but it's also a massive learning curve.
And it's something that I find within some of these companies is best adopted piecemeal or like a bit at a time.
Because if I go in with an architecture that calls for the use of, you know, Kafka as a managed service and, you know, storage triggers based on S3 or something like that, I could really lose people.
So I think a good architecture is one that has to be operable by the people, organizations,
and policies that you have around you. So how to get started? Well, maybe don't get started on a big architecture before you've proved the value to yourself. Usually what I tell people is, like, take a look at the Neo4j Sandbox at sandbox.neo4j.com. And that will give you a playground where you can
play with technology, load your own data, see if you can get some value out of it. And that helps
people get started thinking about, well, what does my data look like as a graph? And, you know, can Cypher help me, and that sort of stuff.
And I find that the architecture bits then come as a maturation of that basic value proposition.
You say, all right, I want that value, but I want it at larger scale or with more timeliness.
And then all of those questions about ingress and egress and architecture get tackled in the context of
how do I extend some piece of value that I already know I want?
Sure. And would you say there's sort of a light bulb moment people have when they start playing
with the sandbox? Because I just know there are a lot of people out there who have lived and worked
in sort of your traditional data warehouse paradigm,
you know, maybe for their entire career and are familiar with graph, but haven't really
dug into it in terms of practical value in their data stack. What is it? Can you describe that
light bulb moment a little bit when you say, Oh, wow, like this could be a really valuable
component of the stack?

So I'll give you an example of a light bulb moment, but I think for a lot of people,
they might have to try it out with their data to see what it's going to be in their business
context. But let's just say, imagine that you had a list of roads and you knew how long each road
was and how much distance there was between each city. And you had a shortest path problem.
And you said, well, I live in Richmond, Virginia, and I want to get to Washington, DC along the
shortest possible path. Sort of think about the kind of thing that Google Maps does every single day. So each road that you take has a distance and a speed limit, and so it has a cost to traverse.
And so you want to figure out the shortest path from
Richmond to Washington, figure out how you would do that with a relational database, and then get back to me. Then go take a look at what it would take to do that in Cypher, and you'll have that light bulb moment. Now, some people won't connect with that experience
because they'll say, well, I don't have Google Maps data. And that's where
you've got to try it out with the sandbox, load your data into a basic data model. And you're
going to find that usually the aha moment is some sort of a variable length path query,
because that is the sort of thing that graphs make super, super easy and relational databases
and other databases just don't. So there are certain
use cases that are more relational where you're not going to have that aha moment. So like I gave
the example earlier, if you have a million customers, you want to know all the ones where
the zip code is 23229, graphs can do that, but that's not a particularly graphy problem.
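(For reference, the fewest-hops version of the Richmond-to-DC question in Cypher, assuming :City nodes joined by :ROAD relationships. Weighting the path by distance or travel time takes a dedicated shortest-path procedure, such as a Dijkstra implementation from Neo4j's graph algorithm libraries, rather than plain Cypher.)

    // Fewest road segments between two cities.
    MATCH (a:City {name: 'Richmond'}), (b:City {name: 'Washington'}),
          p = shortestPath((a)-[:ROAD*]-(b))
    RETURN [n IN nodes(p) | n.name] AS route, length(p) AS hops;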
Makes total sense. I have one last question for you, David,
and it's about open source.
So what's the relationship of Neo4j with open source, and how important has the whole community been for Neo4j so far?
Neo4j was born as an open source product
and the community edition is out there right now
under, I forget which variant of the GPL,
but there's an open source version available right now.
I'm a person who has written multiple different open source packages that relate to Neo4j.
One called Neo4j Helm that helps you deploy it on Kubernetes.
Another called Halin, which is a monitoring tool.
So, I mean, I don't know what else to say other than we're
strong proponents of open source and that's been a core part of our business since the beginning.
You know, now speaking not as a Neo4j employee, but I think that open source is going through an interesting evolution right now, one that comes from, you know, the increase in cloud platforms and also people's shift towards managed services.
I mean, consider that nobody ever asks whether Gmail is open source or not.
And so when you use things as managed services, what open source means really starts to change.
And I don't really have anything so great to say about that other than that as like
a practitioner and somebody in the industry, I'm pretty interested to see what's going to happen in the coming years. Absolutely.
I think, you know, at RudderStack, where Costas and I both work, we're open source as well. And
it's going to be a fascinating, fascinating environment to operate in, in the coming years.
Well, we are at time, David. This has been so interesting. I learned a ton. I think our
audience has learned a ton and I just have all
sorts of interesting ideas around graph with all the new knowledge you've given us. So really
appreciate you taking the time. Yeah. Thanks for having me.
Always an interesting conversation on the Data Stack Show. We took a little aside there to talk
about writing books, but I think it's just interesting when someone has done something, you know, sort of
maybe related, but as an activity outside of their day-to-day work with data. So it was fun to hear
about the process of writing a book. I think one of the interesting things that stuck out to me
was how sort of straightforward it seems to integrate something like Neo4j with an existing data
stack, right?
So we talked about just sort of reading from a Kafka topic, right, which is a really, really
common structure within data stacks these days.
And so that was exciting to me, I think.
And I hope our listeners sort of had some ideas about how they may be able to try this
out with some of their existing data stack infrastructure.
Yeah, absolutely.
From my side,
I was quite surprised to hear
for all these different use cases,
to be honest.
It was super interesting
to see that a graph database
can be used as part
of the product experience,
but at the same time,
it can also be used
for a lot of typical
analytics workloads.
And what was really,
really interesting for me
is how it can be used
inside the context of data governance. That's something that I definitely want to learn more
about. Okay, so we'll extend this show just another two minutes, because I want to ask you
about this. So we talked about data governance a good bit in one of our recent shows. Do you think
that graph is potentially one of the ways to solve data governance at a
sort of comprehensive level?
If you can connect all of the data in your stack to it, it may make some of that governance
workload easy.
Yeah, I think, okay, it's not going to solve all the problems that are under data governance,
right?
I don't think that's, okay, what you need in order to manage access to data, like things like Immuta is doing.

Right, right, right.

But the thing with data is that data gets continuously transformed inside the company, right? It is consumed by many different stakeholders, many different systems. And at the
end we need to track the evolution of data.
And a graph structure makes a lot of sense there, right?
Like it's a very native way of representing
how a piece of data has evolved in time, right?
And with what people it has interacted and all these things.
So for stuff like data lineage, for example,
yeah, that's probably like a very, very good way to represent this. Now, that's one part of the problem, right?
Then you need to feed the database with all the data and most importantly, the metadata here,
which is another question that we need to ask our guests.
How do you get this data, and how do you generate this metadata, to feed a graph database to solve
the problem?
Very interesting.
Well, time will certainly tell.
Thank you again for joining us on the Data Stack Show.
Be sure to hit subscribe in your favorite podcast app.
That way you'll get notified of new episodes when they go live and we will catch you on the next one.
The Data Stack Show is brought to you by Rudderstack,
the complete customer data pipeline solution. Learn more at rudderstack.com.