Orchestrate all the Things - Graph data science is moving one step closer to the mainstream: Neo4j releases v2.0 of its eponymous product. Featuring Senior Director of Product Management for Data Science Alicia Frame
Episode Date: April 12, 2022

Whether you're genuinely interested in getting insights and solving problems using data, or just attracted by what has been called “the most promising career” by LinkedIn and the “best job ...in America” by Glassdoor, chances are you're familiar with data science. But what about graph data science? As we've elaborated previously, graphs are a universal data structure with manifestations that span a wide spectrum: from analytics to databases, and from knowledge management to data science, machine learning and even hardware. Graph data science is when you want to answer questions, not just with your data, but with the connections between your data points -- that's the 30-second explanation according to Alicia Frame. Frame is the Senior Director of Product Management for Data Science at Neo4j, a leading graph database vendor. She has a PhD in computational biology, and has spent ten years as a practicing data scientist working with connected data. When she joined Neo4j about 3 years ago, she set out to build a best-in-class solution for dealing with connected data for data scientists. Today, the product Frame is leading at Neo4j, aptly called Graph Data Science, is celebrating its two-year anniversary with version 2.0, which brings some important advancements: new features, a native Python client, and availability as a managed service under the name AuraDS on Google Cloud. We caught up with Frame to discuss graph data science the concept, and Graph Data Science the product.
Transcript
Welcome to the Orchestrate All the Things podcast.
I'm George Anadiotis and we'll be connecting the dots together.
Whether you're genuinely interested in getting insights and solving problems using data,
or just attracted by what has been called the most promising career by LinkedIn
and the best job in America by Glassdoor, chances are you're familiar with data science.
But what about graph data science?
Graphs are a universal data
structure with manifestations that span a wide spectrum, from analytics to databases and from
knowledge management to data science, machine learning and even hardware. Graph data science
is when you want to answer questions not just with data but with the connections between your data
points. That's the 30-second explanation according to Alicia Frame.
Frame is the Senior Director of Product Management for Data Science at Neo4j,
a leading graph database vendor.
She has a PhD in Computational Biology and has spent 10 years as a practicing data scientist
working with connected data.
When she joined Neo4j about three years ago, she set out to build a best-in-class solution
for dealing with connected data for data scientists.
Today, the product Frame is leading at Neo4j, aptly called Graph Data Science, is celebrating its two-year anniversary with version 2.0.
That brings some important advancements. New features, a native Python client, and availability as a managed service under the name AuraDS on Google Cloud. We caught up with Frame to discuss Graph Data Science,
the concept, and Graph Data Science, the product.
I hope you will enjoy the podcast. If you like my work, you can follow Linked Data
Orchestration on Twitter, LinkedIn, and Facebook.
I'm Alicia Frame. I'm the Senior Director of Product Management for Data Science
at Neo4j. I've been with Neo for just over three years now. My background is I have a PhD in
computational biology. I spent 10 years as a practicing data scientist working with connected
data. And I joined Neo4j to really help build a best-in-class solution for dealing
with connected data for data scientists and probably put myself out of any future job.
But I've been really excited to figure out how to take all of the power of graph and make it
possible for really anyone to get started and find value.
Thank you.
Great, thanks for the intro.
And since the topic of the conversation today is a new release in the product that you're leading, Graph Data Science in the context of Neo4j, a good way to get started is to just
talk a little bit about what graph data science is. That's obviously known to yourself
and myself as well, but it's not necessarily known to everyone. So let's start from there.
Of course. So I think the, you know, 30-second explanation is just: graph data science is when you want to answer questions, not just with your data, but with the connections between your data points. It's taking that connected data most data scientists work with, for better insights, to answer questions you can't
get at without connections, or just to more faithfully represent your data. So when we talk
about, you know, doing graph data science, it's really just using connections to inform your
questions: whether you're using queries just to find the patterns that you know exist, or you're using
unsupervised methods like graph algorithms to sift through your data and figure out, you know,
what's most important, where are the communities, what are the patterns that I should be looking at,
or, you know, supervised machine learning where you're actually classifying what type of graph
is this, or where will a relationship form in the future?
It's all about just getting that extra value from those connections that you can't get at
in any other way. And when we talk about doing graph data science at Neo4j, you know, Neo4j is
a graph database company. We have the labeled property graph, we have Cypher. Graph data science is
something you can run on top of that to start understanding and getting extra value from that
data. So as your data grows, maybe you start off with a few thousand nodes and relationships or
nouns and verbs. And as your data grows, it's harder to know, what do I look at? What's important?
What's the pattern? What's the trend? And graph data science becomes so much more
important to understand, where should I be looking? What do I pay attention to?
And so graph data science at Neo4j is really that same framework of I'm answering questions with connected data. We just offer
a built-in platform to do connected data analysis. So we let users reshape their underlying connected
data, transform it, run algorithms, machine learning pipelines, train models, all of that
in a single platform environment. So it becomes super easy.
There's no ETL.
There's no complicated figuring out how do I represent this data as a graph;
it's all just 100% graph native.
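To make that concrete, here is a minimal sketch in Python of the difference between querying for a pattern you already know and running an unsupervised algorithm. It uses the official neo4j driver; the connection details, the Account/SENT schema, and the pre-projected graph named "accounts" are all hypothetical.

```python
# A hedged illustration, not code from the episode. Labels, relationship
# types, and the projected graph name "accounts" are invented.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("neo4j://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Queries: find a pattern you already know exists.
    known = session.run(
        "MATCH (a:Account)-[:SENT]->(:Transfer)-[:TO]->(b:Account) "
        "RETURN a.id, b.id LIMIT 25"
    ).data()

    # Unsupervised: let a community detection algorithm surface structure
    # you didn't know about (assumes an in-memory projection "accounts").
    communities = session.run(
        "CALL gds.louvain.stream('accounts') YIELD nodeId, communityId "
        "RETURN communityId, count(*) AS size ORDER BY size DESC LIMIT 10"
    ).data()

driver.close()
```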
Okay.
And I think, as a separate product line, let's say,
within the broader Neo4j ecosystem, the graph data science framework is relatively new.
So if you wanted to frame it, let's say, in terms of its target audience, I think it sounds like it's sort of multiple. So on the one hand, I would say you're addressing
with this product data scientists
who are not necessarily graph users,
but want to do graph analytics, for example.
So the value proposition for this audience
seems to be, well,
instead of storing your data
in whatever else you have been using to store it,
you can use a graph database,
and Neo4j for that, and
do graph analytics and graph data science on top of that. And on the other hand, you have the sort
of more traditional, let's say, user segment. So people who have been using Neo4j for their
transactional, operational or analytics applications. And for them, the value proposition would be, well, since you're using that anyway for
your applications, well, here's a new set of tools that you can use to get additional
value out of your data.
Would you say that's it, more or less?
That's exactly right.
I think when you think about, you know, who have we built the graph data science library for? It's very much been built for the data scientists you described and also the
analysts, business analysts, data analysts, who are working directly with that data,
trying to answer questions. And you're right that the main value proposition is not only are you storing your connected data in a connected shape,
but it's a single workspace, single environment to do everything from data analysis, querying, persistence, training your models.
So it's very streamlined, very simple.
And for folks who want to get started doing, you know, data analysis with graph, data science with graph, it really kind of reduces the learning curve, cognitive burden of, oh, gosh, I have to figure out like 30 different libraries.
It's very simple, streamlined, easy to get started with.
And I'd say a lot of our user base is those folks, data scientists who are totally new to
graph databases. They want to get started with connected data. But I think it's really important
that we don't forget the developers who've been using Neo4j all along. And we've seen a lot of
success for those folks in terms of helping them just get extra value out of the applications that they've already
built. And so we talk a lot about, you know, like Meredith building their, you know, user journeys
out in Neo4j and using Neo4j to identify anonymous readers on their websites. And that really grew
out of a longtime Neo4j developer who'd really enjoyed
Neo4j, the graph database. And they were keen to figure out how they could get more value.
And they were like, wait a second, this algorithm solves this really complex
application question that we have and really just fits neatly into our pipeline. So it was really
a nice synergy or a nice evolution for
those developers. Okay. So you have a sort of dual product announcement that's coming up. On the one
hand, you're announcing the general availability of Aura Data Science. That's the cloud version of
the Graph Data Science framework, basically. And
on the other hand, you're also announcing a set of new features. So let's start from the general
availability part. And to be honest with you, this caught me a little bit by surprise, because I was
sort of assuming that this was already available. And the reason that I was under this assumption was that I know that Neo4j has been moving to the cloud, let's say, for a while now.
So I was assuming, OK, it's pretty much complete.
And obviously, graph data science framework is part of that.
So, well, it must be there as well
It turns out that I was only half right: it was available in preview for the last few months, and
now it's generally available. So why the delay, basically?
Why did it take longer than the rest of the platform to get there?
So, our cloud platform is called Aura.
And I think Aura started with a database configured for transactional applications built for and by developers, right?
And they started there because that's really where Neo4j as a company got started. Neo4j has been
around for over a decade, building an amazing database to back applications for transactional
use cases. And so Aura really started with that use case, building out a high availability,
fault tolerant, 24-7 availability database as a service.
AuraDS is our data science product. And it's not that we took that existing database product,
slapped a new name on it, installed a plugin, and we're like, here you go. It's the
same platform, the same Aura cloud platform, but AuraDS has been rebuilt from the ground up
to provide a custom experience built for data scientists.
So when you talk about Neo4j, the database, on AuraDS,
it's a different configuration, optimized for a different setup:
data science workloads are typically
much more memory intensive, and use many more
threads. So we wanted to make sure we had the right configuration for data scientists to be
successful. And then probably where most of our time and effort was spent was building out a user
interface that actually works for data scientists. So I think a lot of companies make a mistake when they don't think about the fact that data scientists are very different from developers.
They have a different background, expectations, experience.
They're familiar with different things.
So AuraDS is really built to solve the problem that data scientists are not developers.
They're not database administrators.
And really, they want to focus on getting value from their data,
finding new insights, building more predictive models.
They don't want to figure out how to set up and maintain a database.
And so AuraDS has a completely rebuilt UI that's centered around
a much more data scientist friendly experience.
So we have sizing guidelines where a user can say, you know, this is how many nodes, this is how many relationships I have.
This is the type of algorithm or the type of model I want to run.
And we give them a recommendation of the size that they need.
The metrics that we track are metrics that are much more relevant for data scientists.
So CPU usage, memory usage.
And then we also have kind of redone our documentation,
reconfigured what we support
so that in our documentation for AuraDS,
connectors like the Spark ecosystem,
the BI connector, and Kafka are front and center for how do I get data in, how do I get data out.
Everything is positioned around our new Python client, so it's very data scientist friendly. And then there are the things data scientists expect to be there, like the ability to pause an instance.
So the way a data scientist works, they're going to use a pretty large instance to do something very complex, but when it's done, they probably want to pause it and come back to their work and not pay for it all the time. So if you think about it, the workflows of a data scientist
are just quite different from a developer.
So AuraDS is really built custom for data scientists
from how it runs on the backend to how it's monitored
and presented to users to the user interface
and the capabilities that are there.
Okay, I see.
Yeah, that makes sense.
And I think it also ties back to what we mentioned previously about the target audience, really,
for this, well, product within a product, in a way, within the broader Neo4j ecosystem,
let's say.
Just out of curiosity, because you did mention revamped user interface
and user experience, I'm assuming that you must have used or at least hooked in some way to
notebooks, because that's, I think, a typical way that data scientists like to work,
right? Exactly. So as part of the AuraDS and Graph Data Science 2.0 release,
we have introduced a native Python client. It's just pip install graphdatascience,
really easy, that reduces a lot of the friction of using Neo4j with a notebook. So we hide a lot
of what happens with transactions and things data
scientists don't need to know about. It's much simpler, more streamlined: you can call functions
in Neo4j GDS just like you would any other Python function, and get the results back as a data frame.
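As a rough illustration of that workflow (the connection details and the Page/LINKS_TO graph are hypothetical, and the exact client API should be checked against the graphdatascience docs):

```python
# pip install graphdatascience
# A hedged sketch of the "Python function in, DataFrame out" experience.
from graphdatascience import GraphDataScience

gds = GraphDataScience("neo4j+s://<your-aura-host>", auth=("neo4j", "<password>"))

# Project an in-memory graph (hypothetical label and relationship type).
G, _ = gds.graph.project("pages", "Page", "LINKS_TO")

# No sessions or transactions to manage: call the algorithm like any other
# Python function and get a pandas DataFrame back.
df = gds.pageRank.stream(G)
print(df.sort_values("score", ascending=False).head(10))
```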
And that's all built into AuraDS as well. So if you look in our docs for AuraDS,
everything is shown side by side. Here's how you do it in Cypher if you want to use Neo4j browser.
Here's how you do it in Python if you want to work from your notebook. And we actually have
embedded Colab notebooks. So if someone just wants to pull something up and look at it,
they've already got the code snippets in a notebook format that they're
ready to go with. Okay. Yeah, that sounds like an important upgrade, actually, I would say,
which makes me wonder whether you're also going to backport those new interfaces and those new ways of working with the Graph Data Science framework
to the on-premises version as well.
So the Python client, Graph Data Science, that we've introduced will work on-prem or in the cloud,
self-managed, wherever you need it.
It works with GDS 2.0 and more recent versions.
So anyone who is running the latest version of our graph data science library can use that Python
client wherever they are. And hopefully, you know, this is going to make adoption a lot smoother,
much easier to get started with, and just reduce the need for
a data scientist to learn yet another language. But we wanted to make sure that it was available
on the cloud and self-managed at the same time. Okay. I think, again, it makes sense.
To be honest with you, it wasn't clear to me that there has been this sort of upgrade,
let's say, in the way people will interact with the product.
And also that it's going to be available not exclusively for the cloud version.
So it's good to know.
You know, when we do a release like this, it's a big release with a lot of stuff.
But that means that sometimes things get a little lost in the noise where it's like, oh, there's a cloud product, there's a 2.0 platform, there's a whole new Python client.
So I'm glad we got to call it out and talk about it.
Yeah. That said,
we also need to clarify that at this time, general availability is only for the Google
Cloud version, right?
And I presume the other versions, so Amazon and Azure, are going to follow.
And again, I'm kind of assuming that the reason you started with Google, same as what happened with the Neo4j managed cloud version, by the way, which also started on Google.
It's probably the fact that there is a sort of special relationship, let's say, with Google.
So there's a partnership there.
Exactly. We have a partnership with the Google Cloud platform. So we release things first on
Google Cloud. What's actually really exciting for us in the data science space is that we are also
partnering with the Vertex AI team at Google. So we've done workshops, blog posts with Google
on using Neo4j Graph Data Science with their Vertex AI data science platform in the cloud
to show how those two can be integrated and how you can build graph and kind of connected data
science into their AutoML pipelines. And so actually,
someone from Google will be joining us for a release webinar at the end of the month
to show how all the pieces fit together into a really fully featured, production-ready
pipeline going from raw data to AutoML pretty seamlessly between Neo4j, the Graph database, the Graph
Data Science library, and then all of Google's sophisticated machine learning tooling.
Yeah, that sounds like an interesting proposition for Google's clients, let's say, and also
a good way for Neo4j to address a larger audience. So, a win-win,
really. Okay, so you already kind of touched upon some of the new features. The ones that you already mentioned have to do with the upgrade in the user interface and user experience. But I know there's more. So let's go through the
listing in any order you choose, basically. Yeah, I don't want to just read a list,
because I think that might be quite boring for folks. But I think highlighting some of the ones
that are really standouts for us, I would maybe start with the work we've done actually
around making graph data science
more compatible with transactional clusters.
So one of the things in this release
is we've introduced cluster compatibility.
So you can run Neo4j graph data science
alongside your transactional cluster.
You don't have to worry about moving data
between those instances. So you don't have to worry about copying data from your cluster to
a single instance or getting data back from that dedicated data science instance into your cluster.
That worry goes away and you don't end up with, you know, something that's not configured for either workload.
So what you can do now is you can attach a dedicated GDS node to your cluster.
It automatically gets that updated data in real time.
You can run your data science workload without interfering with the transactional workload on your cluster.
And then all of that write back is handled internally.
So you don't have to worry about ETL. And I think that's a pretty big game changer.
I was actually surprised by how quickly it was adopted. I have already seen customers picking
this up and running it, you know, before it's even released as a GA feature. Alongside that, we've also introduced backup and restore for your graph projections.
So GDS operationally runs on an in-memory projection of your underlying database.
And by definition, in-memory is transient.
But we've actually, in this release, introduced backup and restore for that transient image.
And that's a cool story, because that feature actually came from AuraDS, where we had to
introduce it so that users never saw any kind of interruption in their workload: if
we had to push a rolling update or a patch, users don't see that. And then we added that feature
back to our self-managed distribution.
So now everyone kind of has that enterprise functionality. And I think that's kind of
where we're looking at how we make it possible for developers to go into production. And then
if you shift gears to talk more about things that are custom built for data scientists,
probably the most exciting piece of this release for us is
we've really matured our machine learning and auto ML capabilities. So we introduced the ability to
create machine learning pipelines for tasks like link prediction, so filling in missing relationships
on your graph, or node classification, where you are filling in missing labels like this person is a fraudster
or this person is innocent. And we've introduced this concept of a pipeline catalog where you can
say, hey, I want to train a model for this end goal. These are the features that I want to use,
like these are the node labels and the relationship types. I want to use a graph embedding. And then I want to use that graph embedding
for a classifier model.
And Neo4j will automatically do all those feature generation
embedding steps and then select the best performing model
for you, and then return to an end user, pretty seamlessly:
here's your predictive model that you can then apply
to incoming data and new data as your database grows,
and really simply move from proof of concept,
like, can I build a model, all the way to production with a persistent, published model? And I think that's a pretty cool feature.
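For a sense of what that looks like from the Python client, here is a hedged sketch of a link prediction pipeline. The graph contents are hypothetical, and the method names follow the GDS 2.x pipeline docs, so treat them as assumptions to verify rather than verbatim API.

```python
# A hedged sketch, not verbatim from the release.
from graphdatascience import GraphDataScience

gds = GraphDataScience("neo4j://localhost:7687", auth=("neo4j", "password"))
# Link prediction expects an undirected projection (hypothetical schema).
G, _ = gds.graph.project("social", "Person", {"KNOWS": {"orientation": "UNDIRECTED"}})

pipe, _ = gds.beta.pipeline.linkPrediction.create("friend-recs")
# Feature generation: compute a graph embedding on the projection...
pipe.addNodeProperty("fastRP", mutateProperty="embedding", embeddingDimension=128)
# ...and combine the endpoint embeddings into link features.
pipe.addFeature("hadamard", nodeProperties=["embedding"])
# Candidate model; training selects the best performing configuration.
pipe.addLogisticRegression()

model, stats = pipe.train(G, modelName="friend-recs-model")
# Apply the persisted model to predict likely future relationships.
predicted = model.predict_stream(G, topN=10)
```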
Okay. There was one thing I wanted to ask you about the persistence that you already mentioned. Again, this was one of the things that took me by surprise a little bit because, again, assumptions: I thought that, well, it couldn't possibly not have been there. But, well, just hearing you elaborate on it now, I guess I was again wrong. It probably wasn't there. And the fact that you
actually needed, absolutely needed to have it for the cloud version made you add it. And well,
as you said, now it's also there for the self-managed version. You know, it may sound
like a naive question, but how could people possibly work without it so far? I mean, if there was,
I don't know, an outage, or something wrong with their machine, or whatever,
that just meant that they basically lost whatever it was they were working on?
Yeah, so you would start again. And that is kind of the status people were at. So the fact that we
use an in-memory projection of the underlying database is kind of like a double-edged sword.
There are some amazing things that it lets you do.
It's highly optimized so you can run graph algorithms at like 100 billion node scale and they finish in minutes.
But it's a projection of the underlying database.
So it's not running on disk, it's not persistent. And so up until this point, what you would do is you would create your projection,
execute your algorithm, and then either write those results back to the database if you needed
them to be persisted, or you could stream them out. And this just kind of lets you keep that intermediate state
persisted in case you need to restart your database.
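Sketched with the Python client from the earlier hypothetical session, the loop Frame describes looks roughly like this; until this release, the projection itself was the only piece lost on a restart:

```python
# A hedged sketch of the project -> run -> write back loop.
G, _ = gds.graph.project("accounts", "Account", "SENT")  # transient, in-memory

gds.wcc.write(G, writeProperty="componentId")  # results persisted to the database

G.drop()  # the in-memory projection itself goes away; results remain on disk
```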
I think the reason this wasn't really impacting
people's ability to run in production
is our algorithms are quite fast.
This isn't you're going to wait hours or days
and hope nothing bad happens to your database.
It's sub-second to
minutes, depending on what you're running. And so you could project, run, write back,
and not really worry about it. I think the other piece is probably that GDS, Graph Data Science as a platform, has a lot of safeguards in place to prevent someone
from accidentally knocking over their database. So if you request an algorithm that takes up more
memory than you have available on your system, instead of just knocking over the database,
you get a message that says, hey, you don't have enough memory
available right now. Either wait, add more memory, or run at your own risk. And so we kind of built
in a lot of safeguards to prevent knocking over the database, and then it's fast enough.
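Those guardrails surface as an estimate mode on the algorithms; continuing the earlier hypothetical session, a sketch:

```python
# A hedged sketch: ask GDS what an algorithm will cost before running it.
est = gds.pageRank.write.estimate(G, writeProperty="rank")
print(est["requiredMemory"])  # a range such as "[21 MiB ... 58 MiB]"
# If the estimate exceeds the available heap, GDS refuses to run the
# algorithm rather than knocking the database over.
```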
But as we expand from what Emil likes to call penicillin users, you know, the people who
have like this problem that's blocking their entire business to folks who just want to get
value from graph, being able to back up and restore those projections so you don't interrupt your
workflow becomes really important. Okay. Yeah, that makes sense again.
And another thing that sort of stood out for me from the list of new
features was the fact that I saw a nice list of integrations with third
party products. Well, BI, business intelligence products, mostly,
I think.
And I was wondering, what exactly does that mean for graph data science?
I know that Neo4j already had a number of integrations,
which, if I'm not mistaken, for the most part,
basically meant being able to show graph data in a user interface that was not really designed for graphs. So I wonder what does integration mean specifically for you here?
Yeah, I think the most important thing to remember is that data scientists don't exist in a vacuum.
So most data scientists have a really complicated ecosystem of tools that, you know, often they're sitting downstream from data warehouses, enterprise data warehouses that are populated by other teams.
And when they ship something into production, that means, you know, I've published this model to some other platform. And Neo4j has to fit in that ecosystem so that it's kind of that better together story
instead of trying to say, you know, throw out everything you've done, try something
different.
So when we talk about our ecosystem integrations, some of them are focused on getting data in.
And so, sneak peek: at the end of this month, we're having some pretty big updates to our
Spark connector to support more types of data warehouses, much more simply.
Getting data in via Spark connector, or if you've got a streaming service, we've got the Kafka
connector. Once you've got data in, you know, then it's integrating with downstream tasks. So that's
the BI connector. If a data scientist is already using Tableau for visualization
and they don't want something purpose-built for graph like Bloom.
But it's also integrating with kind of AutoML vendors.
So we have connectors that exist for KNIME and Dataiku, which are two very common, widely
used data science model building platforms.
And we're working with the team at Vertex AI to show how easy it is to use Python and our
graph data science client to glue all those pieces together in a notebook on GCP. So that
it's very much a story of using Neo4j for the graphy stuff, but then working really seamlessly with everything else in the data science ecosystem, so that there's no friction for that data scientist trying to get into production.
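For flavor, a hedged sketch of the "getting data in" side with the Neo4j Spark connector (connection details hypothetical; the connector JAR must be on the Spark classpath):

```python
# A hedged sketch: read Neo4j nodes into a Spark DataFrame.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("neo4j-etl").getOrCreate()

people = (
    spark.read.format("org.neo4j.spark.DataSource")
    .option("url", "neo4j://localhost:7687")
    .option("authentication.basic.username", "neo4j")
    .option("authentication.basic.password", "password")
    .option("labels", "Person")  # pull all :Person nodes
    .load()
)
people.show(5)
```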
Okay, I see.
And I think we need to be wrapping up soon,
so let's have one last topic to address. You already alluded
to some things that you're going to be announcing pretty soon. I was wondering if you
could, well, broaden the scope a little bit and share with us what's in the broader roadmap, let's
say, as well as any metrics or stories or anything else you'd like to share in
terms of adoption: how has it been, and how do you see things going forward?
Sure. So roadmap, two things to look out for probably in the next one to two quarters that
I think are really exciting are we are working on building integrations between
Bloom, which is our visualization platform, and Neo4j Graph Data Science so that in Bloom,
you can automatically run graph algorithms on data that's available in the scene. So I'm looking
at some connected data, I pull from a dropdown menu: I want to run PageRank to see which nodes are most important, and then automatically
apply rule-based styling.
So important nodes with high PageRank all of a sudden pop up and are much bigger in
my visualization, and unimportant nodes get smaller.
And so it becomes a really seamless kind of no code way of running graph algorithms
on data and getting insights really, really quickly. And I know we're like on a podcast,
I usually have like a really pretty visualization, but I am so psyched about that.
The other thing to look for in the next couple months, couple quarters, is making it a lot easier to move data
in and out of Neo4j graph data science
in really big volumes.
So one of the most important things for data scientists
is speed of getting data in and out of Neo
and the scale.
So I have generated embeddings
on my 20 billion node graph. Great. I'm so glad those
finished in 10 minutes, but how do I get them out to my downstream pipeline? So we're working on
some really exciting ecosystem integrations for really big data movement and also making that
super simple for the end user. So those are probably the two big road map things I would
just touch on. And in terms of adoption metrics and stories, it's been a wild ride. You know,
graph data science is just about two years old. We saw 370% plus year over year growth in the
number of enterprise customers. We have hundreds of thousands of folks who've
downloaded our graph algorithms book. And I think one of my favorite like AuraDS things is actually
we've seen two folks who started off in our early access program. We let in a handful of users just
to test it out, give us feedback. They're already in production with their, you know,
company's workflows. And already they went from idea to proof of concept to production in a matter
of months on their own with very little help from the team at Neo4j, which is just mind-blowing.
So really amazing adoption in the last year.
And AuraDS, data science as a service, Python client,
everything is combining in like a perfect storm
just to let folks find value so much faster.
I hope you enjoyed the podcast.
If you like my work,
you can follow Linked Data Orchestration
on Twitter, LinkedIn, and Facebook.