Orchestrate all the Things - Graph data science is moving one step closer to the mainstream: Neo4j releases v2.0 of its eponymous product. Featuring Senior Director of Product Management for Data Science Alicia Frame
Episode Date: April 12, 2022

Whether you're genuinely interested in getting insights and solving problems using data, or just attracted by what has been called “the most promising career” by LinkedIn and the “best job ...in America” by Glassdoor, chances are you're familiar with data science. But what about graph data science? As we've elaborated previously, graphs are a universal data structure with manifestations that span a wide spectrum: from analytics to databases, and from knowledge management to data science, machine learning and even hardware. Graph data science is when you want to answer questions, not just with your data, but with the connections between your data points -- that's the 30-second explanation according to Alicia Frame. Frame is the Senior Director of Product Management for Data Science at Neo4j, a leading graph database vendor. She has a PhD in computational biology, and has spent ten years as a practicing data scientist working with connected data. When she joined Neo4j about 3 years ago, she set out to build a best-in-class solution for dealing with connected data for data scientists. Today, the product Frame is leading at Neo4j, aptly called Graph Data Science, is celebrating its two-year anniversary with version 2.0, which brings some important advancements: new features, a native Python client, and availability as a managed service under the name AuraDS on Google Cloud. We caught up with Frame to discuss graph data science the concept, and Graph Data Science the product.
Transcript
Welcome to the Orchestrate All the Things podcast.
I'm George Anadiotis and we'll be connecting the dots together.
Whether you're genuinely interested in getting insights and solving problems using data,
or just attracted by what has been called the most promising career by LinkedIn
and the best job in America by Glassdoor, chances are you're familiar with data science.
But what about graph data science?
Graphs are a universal data
structure with manifestations that span a wide spectrum, from analytics to databases and from
knowledge management to data science, machine learning and even hardware. Graph data science
is when you want to answer questions not just with data but with the connections between your data
points. That's the 30-second explanation according to Alicia Frame.
Frame is the Senior Director of Product Management for Data Science at Neo4j,
a leading graph database vendor.
She has a PhD in Computational Biology and has spent 10 years as a practicing data scientist
working with connected data.
When she joined Neo4j about three years ago, she set out to build a best-in-class solution
for dealing with connected data for data scientists.
Today, the product Frame is leading at Neo4j, aptly called Graph Data Science, is celebrating its two-year anniversary with version 2.0.
That brings some important advancements. New features, a native Python client, and availability as a managed service under the name AuraDS on Google Cloud. We caught up with Frame to discuss Graph Data Science,
the concept, and Graph Data Science, the product.
I hope you will enjoy the podcast. If you like my work, you can follow Linked Data
Orchestration on Twitter, LinkedIn, and Facebook.
I'm Alicia Frame. I'm the Senior Director of Product Management for Data Science
at Neo4j. I've been with Neo for just over three years now. My background is I have a PhD in
computational biology. I spent 10 years as a practicing data scientist working with connected
data. And I joined Neo4j to really help build a best-in-class solution for dealing
with connected data for data scientists and probably put myself out of any future job.
But I've been really excited to figure out how to take all of the power of graph and make it
possible for really anyone to get started and find value.
Thank you.
Great, thanks for the intro.
And since the topic of the conversation today is a new release in the product that you're leading, Graph Data Science in the context of Neo4j, a good way to get started is to just
talk a little bit about what graph data science is. That's obviously known to yourself
and myself as well, but it's not necessarily known to everyone. So let's start from there.
Of course. So I think the, you know, 30-second explanation is just: graph data science is when you want to answer questions, not just with your data, but with the connections between your data points. It's taking that connected data most data scientists work with, for better insights, to answer questions you can't
get at without connections, or just to more faithfully represent your data. So when we talk
about, you know, doing graph data science, it's really just using connections to inform your
questions: whether you're using queries just to find the patterns that you know exist, or you're using
unsupervised methods like graph algorithms to sift through your data and figure out, you know,
what's most important, where are the communities, what are the patterns that I should be looking at,
or, you know, supervised machine learning where you're actually classifying what type of graph
is this, or where will a relationship form in the future?
It's all about just getting that extra value from those connections that you can't get at
in any other way. And when we talk about doing graph data science at Neo4j, you know, Neo4j is
a graph database company. We have the labeled property graph, we have Cypher. Graph data science is
something you can run on top of that to start understanding and getting extra value from that
data. So as your data grows, maybe you start off with a few thousand nodes and relationships or
nouns and verbs. And as your data grows, it's harder to know, what do I look at? What's important?
What's the pattern? What's the trend? And graph data science becomes so much more
important to understand, where should I be looking? What do I pay attention to?
And so graph data science at Neo4j is really that same framework of I'm answering questions with connected data. We just offer
a built-in platform to do connected data analysis. So we let users reshape their underlying connected
data, transform it, run algorithms, machine learning pipelines, train models, all of that
in a single platform environment. So it becomes super easy.
There's no ETL.
There's no complicated figuring out how do I represent this data as a graph;
it's all just 100% graph native.
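To make that concrete, here is a minimal sketch in Python of the difference between querying for a pattern you already know and running an unsupervised algorithm. It uses the official neo4j driver; the connection details, the Account/SENT schema, and the pre-projected graph named "accounts" are all hypothetical.

```python
# A hedged illustration, not code from the episode. Labels, relationship
# types, and the projected graph name "accounts" are invented.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("neo4j://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Queries: find a pattern you already know exists.
    known = session.run(
        "MATCH (a:Account)-[:SENT]->(:Transfer)-[:TO]->(b:Account) "
        "RETURN a.id, b.id LIMIT 25"
    ).data()

    # Unsupervised: let a community detection algorithm surface structure
    # you didn't know about (assumes an in-memory projection "accounts").
    communities = session.run(
        "CALL gds.louvain.stream('accounts') YIELD nodeId, communityId "
        "RETURN communityId, count(*) AS size ORDER BY size DESC LIMIT 10"
    ).data()

driver.close()
```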
Okay.
And I think, as a separate product line, let's say,
within the broader Neo4j ecosystem, the graph data science framework is relatively new.
So if you wanted to frame it, let's say, in terms of its target audience, I think it sounds like it's sort of multiple. So on the one hand, I would say you're addressing
with this product data scientists
who are not necessarily graph users,
but want to do graph analytics, for example.
So the value proposition for this audience
seems to be, well,
instead of storing your data
in whatever else you have been using to store it,
you can use a graph database,
and Neo4j for that, and
do graph analytics and graph data science on top of that. And on the other hand, you have the sort
of more traditional, let's say, user segment. So people who have been using Neo4j for their
transactional, operational or analytics applications. And for them, the value proposition would be, well, since you're using that anyway for
your applications, well, here's a new set of tools that you can use to get additional
value out of your data.
Would you say that's it, more or less?
That's exactly right.
I think when you think about, you know, who have we built the graph data science library for? It's very much been built for the data scientists you described and also the
analysts, business analysts, data analysts, who are working directly with that data,
trying to answer questions. And you're right that the main value proposition is not only are you storing your connected data in a connected shape,
but it's a single workspace, single environment to do everything from data analysis, querying, persistence, training your models.
So it's very streamlined, very simple.
And for folks who want to get started doing, you know, data analysis with graph, data science with graph, it really kind of reduces the learning curve, cognitive burden of, oh, gosh, I have to figure out like 30 different libraries.
It's very simple, streamlined, easy to get started with.
And I'd say a lot of our user base is those folks, data scientists who are totally new to
graph databases. They want to get started with connected data. But I think it's really important
that we don't forget the developers who've been using Neo4j all along. And we've seen a lot of
success for those folks in terms of helping them just get extra value out of the applications that they've already
built. And so we talk a lot about, you know, like Meredith building their, you know, user journeys
out in Neo4j and using Neo4j to identify anonymous readers on their websites. And that really grew
out of a longtime Neo4j developer who'd really enjoyed
Neo4j, the graph database. And they were keen to figure out how they could get more value.
And they were like, wait a second, this algorithm solves this really complex
application question that we have and really just fits neatly into our pipeline. So it was really
a nice synergy or a nice evolution for
those developers. Okay. So you have a sort of dual product announcement that's coming up. On the one
hand, you're announcing the general availability of Aura Data Science. That's the cloud version of
the Graph Data Science framework, basically. And
on the other hand, you're also announcing a set of new features. So let's start from the general
availability part. And to be honest with you, this caught me a little bit by surprise, because I was
sort of assuming that this was already available. And the reason that I was under this assumption was that I know that Neo4j has been moving to the cloud, let's say, for a while now.
So I was assuming, OK, it's pretty much complete.
And obviously, graph data science framework is part of that.
So, well, it must be there as well
It turns out that I was only half right: it was available in preview for the last few months, and
now it's generally available. So why the delay, basically?
Why did it take longer than the rest of the platform to get there?
So, our cloud platform is called Aura.
And I think Aura started with a database configured for transactional applications built for and by developers, right?
And they started there because that's really where Neo4j as a company got started. Neo4j has been
around for over a decade, building an amazing database to back applications for transactional
use cases. And so Aura really started with that use case, building out a high availability,
fault tolerant, 24-7 availability database as a service.
AuraDS is our data science product. And it's not that we took that existing database product,
slapped a new name on it, installed a plugin, and we're like, here you go. It's the
same platform, the same Aura cloud platform, but AuraDS has been rebuilt from the ground up
to provide a custom experience built for data scientists.
So when you talk about Neo4j, the database, on AuraDS,
it's a different configuration, optimized for a different setup:
data science workloads are typically
much more memory intensive, and use many more
threads. So we wanted to make sure we had the right configuration for data scientists to be
successful. And then probably where most of our time and effort was spent was building out a user
interface that actually works for data scientists. So I think a lot of companies make a mistake when they don't think about the fact that data scientists are very different from developers.
They have a different background, expectations, experience.
They're familiar with different things.
So AuraDS is really built to solve the problem that data scientists are not developers.
They're not database administrators.
And really, they want to focus on getting value from their data,
finding new insights, building more predictive models.
They don't want to figure out how to set up and maintain a database.
And so AuraDS has a completely rebuilt UI that's centered around
a much more data scientist friendly experience.
So we have sizing guidelines where a user can say, you know, this is how many nodes, this is how many relationships I have.
This is the type of algorithm or the type of model I want to run.
And we give them a recommendation of the size that they need.
The metrics that we track are metrics that are much more relevant for data scientists.
So CPU usage, memory usage.
And then we also have kind of redone our documentation,
reconfigured what we support
so that in our documentation for AuraDS,
connectors like the Spark ecosystem,
the BI connector, and Kafka are front and center for how do I get data in, how do I get data out.
Everything is positioned around our new Python client, so it's very data scientist friendly. And then there are the things data scientists expect to be there, like the ability to pause an instance.
So the way a data scientist works, they're going to use a pretty large instance to do something very complex, but when it's done, they probably want to pause it and come back to their work and not pay for it all the time. So if you think about it, the workflows of a data scientist
are just quite different from a developer.
So AuraDS is really built custom for data scientists
from how it runs on the backend to how it's monitored
and presented to users to the user interface
and the capabilities that are there.
Okay, I see.
Yeah, that makes sense.
And I think it also ties back to what we mentioned previously about the target audience, really,
for this, well, product within a product, in a way, within the broader Neo4j ecosystem,
let's say.
Just out of curiosity, because you did mention revamped user interface
and user experience, I'm assuming that you must have used or at least hooked in some way to
notebooks, because that's, I think, a typical way that data scientists like to work,
right? Exactly. So as part of the AuraDS and Graph Data Science 2.0 release,
we have introduced a native Python client. It's just pip install graphdatascience,
really easy, that reduces a lot of the friction of using Neo4j with a notebook. So we hide a lot
of what happens with transactions and things data
scientists don't need to know about. It's much simpler, more streamlined: you can call functions
in Neo4j GDS just like you would any other Python function, and get the results back as a data frame.
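As a rough illustration of that workflow (the connection details and the Page/LINKS_TO graph are hypothetical, and the exact client API should be checked against the graphdatascience docs):

```python
# pip install graphdatascience
# A hedged sketch of the "Python function in, DataFrame out" experience.
from graphdatascience import GraphDataScience

gds = GraphDataScience("neo4j+s://<your-aura-host>", auth=("neo4j", "<password>"))

# Project an in-memory graph (hypothetical label and relationship type).
G, _ = gds.graph.project("pages", "Page", "LINKS_TO")

# No sessions or transactions to manage: call the algorithm like any other
# Python function and get a pandas DataFrame back.
df = gds.pageRank.stream(G)
print(df.sort_values("score", ascending=False).head(10))
```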
And that's all built into AuraDS as well. So if you look in our docs for AuraDS,
everything is shown side by side. Here's how you do it in Cypher if you want to use Neo4j browser.
Here's how you do it in Python if you want to work from your notebook. And we actually have
embedded Colab notebooks. So if someone just wants to pull something up and look at it,
they've already got the code snippets in a notebook format that they're
ready to go with. Okay. Yeah, that sounds like an important upgrade, actually, I would say,
which makes me wonder whether you're also going to backport those new interfaces and those new ways of working with the Graph Data Science framework
to the on-premises version as well.
So the Python client, Graph Data Science, that we've introduced will work on-prem or in the cloud,
self-managed, wherever you need it.
It works with GDS 2.0 and more recent versions.
So anyone who is running the latest version of our graph data science library can use that Python
client wherever they are. And hopefully, you know, this is going to make adoption a lot smoother,
much easier to get started with, and just reduce the need for
a data scientist to learn yet another language. But we wanted to make sure that it was available
on the cloud and self-managed at the same time. Okay. I think, again, it makes sense.
To be honest with you, it wasn't clear to me that there has been this sort of upgrade,
let's say, in the way people will interact with the product.
And also that it's going to be available not exclusively for the cloud version.
So it's good to know.
You know, when we do a release like this, it's a big release with a lot of stuff.
But that means that sometimes things get a little lost in the noise where it's like, oh, there's a cloud product, there's a 2.0 platform, there's a whole new Python client.
So I'm glad we got to call it out and talk about it.
Yeah. That said,
we also need to clarify that at this time, general availability is only for the Google
Cloud version, right?
And I presume the other versions, so Amazon and Azure, are going to follow.
And again, I'm kind of assuming that the reason you started with Google, same as what happened with the Neo4j managed cloud version, by the way, which also started on Google.
It's probably the fact that there is a sort of special relationship, let's say, with Google.
So there's a partnership there.
Exactly. We have a partnership with the Google Cloud platform. So we release things first on
Google Cloud. What's actually really exciting for us in the data science space is that we are also
partnering with the Vertex AI team at Google. So we've done workshops, blog posts with Google
on using Neo4j Graph Data Science with their Vertex AI data science platform in the cloud
to show how those two can be integrated and how you can build graph and kind of connected data
science into their AutoML pipelines. And so actually,
someone from Google will be joining us for a release webinar at the end of the month
to show how all the pieces fit together into a really fully featured, production-ready
pipeline going from raw data to AutoML pretty seamlessly between Neo4j, the Graph database, the Graph
Data Science library, and then all of Google's sophisticated machine learning tooling.
Yeah, that sounds like an interesting proposition for Google's clients, let's say, and also
a good way for Neo4j to address a larger audience. So, a win-win,
really. Okay, so you already kind of touched upon some of the new features. The ones that you already mentioned have to do with the upgrade in the user interface and user experience. But I know there's more. So let's go through the
listing in any order you choose, basically. Yeah, I don't want to just read a list,
because I think that might be quite boring for folks. But I think highlighting some of the ones
that are really standouts for us, I would maybe start with the work we've done actually
around making graph data science
more compatible with transactional clusters.
So one of the things in this release
is we've introduced cluster compatibility.
So you can run Neo4j graph data science
alongside your transactional cluster.
You don't have to worry about moving data
between those instances. So you don't have to worry about copying data from your cluster to
a single instance or getting data back from that dedicated data science instance into your cluster.
That worry goes away and you don't end up with, you know, something that's not configured for either workload.
So what you can do now is you can attach a dedicated GDS node to your cluster.
It automatically gets that updated data in real time.
You can run your data science workload without interfering with the transactional workload on your cluster.
And then all of that write back is handled internally.
So you don't have to worry about ETL. And I think that's a pretty big game changer.
I was actually surprised by how quickly it was adopted. I have already seen customers picking
this up and running it, you know, before it's even released as a GA feature. Alongside that, we've also introduced backup and restore for your graph projections.
So GDS operationally runs on an in-memory projection of your underlying database.
And by definition, in-memory is transient.
But we've actually, in this release, introduced backup and restore for that transient image.
And that's a cool story, because that feature actually came from AuraDS, where we had to
introduce it so that users never saw any kind of interruption in their workload: if
we had to push a rolling update or a patch, users don't see that. And then we added that feature
back to our self-managed distribution.
So now everyone kind of has that enterprise functionality. And I think that's kind of
where we're looking at how we make it possible for developers to go into production. And then
if you shift gears to talk more about things that are custom built for data scientists,
probably the most exciting piece of this release for us is
we've really matured our machine learning and auto ML capabilities. So we introduced the ability to
create machine learning pipelines for tasks like link prediction, so filling in missing relationships
on your graph, or node classification, where you are filling in missing labels like this person is a fraudster
or this person is innocent. And we've introduced this concept of a pipeline catalog where you can
say, hey, I want to train a model for this end goal. These are the features that I want to use,
like these are the node labels and the relationship types. I want to use a graph embedding. And then I want to use that graph embedding
for a classifier model.
And Neo4j will automatically do all those feature generation
embedding steps and then select the best performing model
for you, and then return to an end user, pretty seamlessly:
here's your predictive model that you can then apply
to incoming data and new data as your database grows,
and really simply move from proof of concept,
like, can I build a model, all the way to production with a persistent, published model? And I think that's a pretty cool feature.
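For a sense of what that looks like from the Python client, here is a hedged sketch of a link prediction pipeline. The graph contents are hypothetical, and the method names follow the GDS 2.x pipeline docs, so treat them as assumptions to verify rather than verbatim API.

```python
# A hedged sketch, not verbatim from the release.
from graphdatascience import GraphDataScience

gds = GraphDataScience("neo4j://localhost:7687", auth=("neo4j", "password"))
# Link prediction expects an undirected projection (hypothetical schema).
G, _ = gds.graph.project("social", "Person", {"KNOWS": {"orientation": "UNDIRECTED"}})

pipe, _ = gds.beta.pipeline.linkPrediction.create("friend-recs")
# Feature generation: compute a graph embedding on the projection...
pipe.addNodeProperty("fastRP", mutateProperty="embedding", embeddingDimension=128)
# ...and combine the endpoint embeddings into link features.
pipe.addFeature("hadamard", nodeProperties=["embedding"])
# Candidate model; training selects the best performing configuration.
pipe.addLogisticRegression()

model, stats = pipe.train(G, modelName="friend-recs-model")
# Apply the persisted model to predict likely future relationships.
predicted = model.predict_stream(G, topN=10)
```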
Okay. There was one thing I wanted to ask you about the persistence that you already mentioned. Again, this was one of the things that took me by surprise a little bit because, again, assumptions: I thought that, well, it couldn't possibly not have been there. But, well, just hearing you elaborate on it now, I guess I was again wrong. It probably wasn't there. And the fact that you
actually needed, absolutely needed to have it for the cloud version made you add it. And well,
as you said, now it's also there for the self-managed version. You know, it may sound
like a naive question, but how could people possibly work without it so far? I mean, if there was,
I don't know, an outage, or something wrong with their machine, or whatever,
that just meant that they basically lost whatever it was they were working on?
Yeah, so you would start again. And that is kind of the status people were at. So the fact that we
use an in-memory projection of the underlying database is kind of like a double-edged sword.
There are some amazing things that it lets you do.
It's highly optimized so you can run graph algorithms at like 100 billion node scale and they finish in minutes.
But it's a projection of the underlying database.
So it's not running on disk, it's not persistent. And so up until this point, what you would do is you would create your projection,
execute your algorithm, and then either write those results back to the database if you needed
them to be persisted, or you could stream them out. And this just kind of lets you keep that intermediate state
persisted in case you need to restart your database.
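Sketched with the Python client from the earlier hypothetical session, the loop Frame describes looks roughly like this; until this release, the projection itself was the only piece lost on a restart:

```python
# A hedged sketch of the project -> run -> write back loop.
G, _ = gds.graph.project("accounts", "Account", "SENT")  # transient, in-memory

gds.wcc.write(G, writeProperty="componentId")  # results persisted to the database

G.drop()  # the in-memory projection itself goes away; results remain on disk
```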
I think the reason this wasn't really impacting
people's ability to run in production
is our algorithms are quite fast.
This isn't you're going to wait hours or days
and hope nothing bad happens to your database.
It's sub-second to
minutes, depending on what you're running. And so you could project, run, write back,
and not really worry about it. I think the other piece is probably that GDS, Graph Data Science as a platform, has a lot of safeguards in place to prevent someone
from accidentally knocking over their database. So if you request an algorithm that takes up more
memory than you have available on your system, instead of just knocking over the database,
you get a message that says, hey, you don't have enough memory
available right now. Either wait, add more memory, or run at your own risk. And so we kind of built
in a lot of safeguards to prevent knocking over the database, and then it's fast enough.
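Those guardrails surface as an estimate mode on the algorithms; continuing the earlier hypothetical session, a sketch:

```python
# A hedged sketch: ask GDS what an algorithm will cost before running it.
est = gds.pageRank.write.estimate(G, writeProperty="rank")
print(est["requiredMemory"])  # a range such as "[21 MiB ... 58 MiB]"
# If the estimate exceeds the available heap, GDS refuses to run the
# algorithm rather than knocking the database over.
```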
But as we expand from what Emil likes to call penicillin users, you know, the people who
have like this problem that's blocking their entire business to folks who just want to get
value from graph, being able to back up and restore those projections so you don't interrupt your
workflow becomes really important. Okay. Yeah, that makes sense again.
And another thing that sort of stood out for me from the list of new
features was the fact that I saw a nice list of integrations with third
party products. Well, BI, business intelligence products, mostly,
I think.
And I was wondering, what exactly does that mean for graph data science?
I know that Neo4j already had a number of integrations,
which, if I'm not mistaken, for the most part,
basically meant being able to show graph data in a user interface that was not really designed for graphs. So I wonder what does integration mean specifically for you here?
Yeah, I think the most important thing to remember is that data scientists don't exist in a vacuum.
So most data scientists have a really complicated ecosystem of tools that, you know, often they're sitting downstream from data warehouses, enterprise data warehouses that are populated by other teams.
And when they ship something into production, that means, you know, I've published this model to some other platform. And Neo4j has to fit in that ecosystem so that it's kind of that better together story
instead of trying to say, you know, throw out everything you've done, try something
different.
So when we talk about our ecosystem integrations, some of them are focused on getting data in.
And so, sneak peek: at the end of this month, we're having some pretty big updates to our
Spark connector to support more types of data warehouses, much more simply.
Getting data in via Spark connector, or if you've got a streaming service, we've got the Kafka
connector. Once you've got data in, you know, then it's integrating with downstream tasks. So that's
the BI connector. If a data scientist is already using Tableau for visualization
and they don't want something purpose-built for graph like Bloom.
But it's also integrating with kind of AutoML vendors.
So we have connectors that exist for KNIME and Dataiku, which are two very common, widely
used data science model building platforms.
And we're working with the team at Vertex AI to show how easy it is to use Python and our
graph data science client to glue all those pieces together in a notebook on GCP. So that
it's very much a story of using Neo4j for the graphy stuff, but then working really seamlessly with everything else in the data science ecosystem, so that there's no friction for that data scientist trying to get into production.
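For flavor, a hedged sketch of the "getting data in" side with the Neo4j Spark connector (connection details hypothetical; the connector JAR must be on the Spark classpath):

```python
# A hedged sketch: read Neo4j nodes into a Spark DataFrame.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("neo4j-etl").getOrCreate()

people = (
    spark.read.format("org.neo4j.spark.DataSource")
    .option("url", "neo4j://localhost:7687")
    .option("authentication.basic.username", "neo4j")
    .option("authentication.basic.password", "password")
    .option("labels", "Person")  # pull all :Person nodes
    .load()
)
people.show(5)
```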
Okay, I see.
And I think we need to be wrapping up soon,
so let's have one last topic to address. You already alluded
to some things that you're going to be announcing pretty soon. I was wondering if you
could, well, broaden the scope a little bit and share with us what's in the broader roadmap, let's
say, as well as any metrics or stories or anything else you'd like to share in
terms of adoption: how has it been, and how do you see things going forward?
Sure. So roadmap, two things to look out for probably in the next one to two quarters that
I think are really exciting are we are working on building integrations between
Bloom, which is our visualization platform, and Neo4j Graph Data Science so that in Bloom,
you can automatically run graph algorithms on data that's available in the scene. So I'm looking
at some connected data, I pull from a dropdown menu: I want to run PageRank to see which nodes are most important, and then automatically
apply rule-based styling.
So important nodes with high PageRank all of a sudden pop up and are much bigger in
my visualization, and unimportant nodes get smaller.
And so it becomes a really seamless kind of no code way of running graph algorithms
on data and getting insights really, really quickly. And I know we're like on a podcast,
I usually have like a really pretty visualization, but I am so psyched about that.
The other thing to look for in the next couple months, couple quarters, is making it a lot easier to move data
in and out of Neo4j graph data science
in really big volumes.
So one of the most important things for data scientists
is speed of getting data in and out of Neo
and the scale.
So I have generated embeddings
on my 20 billion node graph. Great. I'm so glad those
finished in 10 minutes, but how do I get them out to my downstream pipeline? So we're working on
some really exciting ecosystem integrations for really big data movement and also making that
super simple for the end user. So those are probably the two big road map things I would
just touch on. And in terms of adoption metrics and stories, it's been a wild ride. You know,
graph data science is just about two years old. We saw 370% plus year over year growth in the
number of enterprise customers. We have hundreds of thousands of folks who've
downloaded our graph algorithms book. And I think one of my favorite like AuraDS things is actually
we've seen two folks who started off in our early access program. We let in a handful of users just
to test it out, give us feedback. They're already in production with their, you know,
company's workflows. And already they went from idea to proof of concept to production in a matter
of months on their own with very little help from the team at Neo4j, which is just mind-blowing.
So really amazing adoption in the last year.
And AuraDS, data science as a service, Python client,
everything is combining in like a perfect storm
just to let folks find value so much faster.
I hope you enjoyed the podcast.
If you like my work,
you can follow Linked Data Orchestration
on Twitter, LinkedIn, and Facebook.