Orchestrate all the Things - Neo4j 5 advances ease of use and performance as the graph market maintains momentum. Featuring Jim Webber, Chief Scientist at Neo4j

Episode Date: November 9, 2022

Graph database Neo4j has just released version 5, which promises better ease of use and performance through improvements in its query language and engine, as well as automated scale-out and convergence across deployments. We caught up with Jim Webber, Chief Scientist at Neo4j, to discuss Neo4j 5 as well as the bigger picture in the graph market. Article published on VentureBeat

Transcript
Starting point is 00:00:00 Welcome to the Orchestrate all the things podcast. I'm George Anadiotis and we'll be connecting the dots together. Graph database Neo4j has just released version 5, which promises better ease of use and performance through improvements in its query language and engine, as well as automated scale-out and convergence across deployments. We caught up with Jim Webber, Chief Scientist at Neo4j, to discuss Neo4j 5 as well as the bigger picture in the graph market. I hope you will enjoy the podcast. If you like my work, you can follow Linked Data Orchestration on Twitter, LinkedIn and Facebook. Hello, my name is Jim Webber. I am the Chief Scientist at Neo4j and I work on interesting systems problems. Great, that's a very brief and succinct summary. Thank you, Jim.
Starting point is 00:00:50 And so the reason we're having this conversation today, the occasion actually, because there's a number of good reasons, but the occasion is that there's a new Neo4j version coming out really soon, Neo4j 5. By the way, I was just looking around a little bit before we connected, and it seems like you have actually silently released it already. I mean, I saw some release docs and the like. So you haven't really officially announced it, but it looks like the code is out there, right? The code has been baking internally for a couple of months now
Starting point is 00:01:26 and you can imagine that we have to get a bunch of i's dotted and t's crossed before we can release that to the public i'm surprised that you've been able to find documentation for it because there is meant to be a kind of formal release but yeah you're absolutely right near for j5 is now ready for public consumption and that will be hopefully happening, I think, this week. So hence why we're talking today. Great. So it's been a while since the last major release. It was in February 2020.
Starting point is 00:01:55 And I should know because I also got the chance to cover that and discuss about it with Emil, the CEO at Neo4j. And so what I tried to do as a sort of starting point was kind of figure out what the differentiation is. So what has elapsed between then and now. And there's a few key elements that also stand out in the draft press release that I got a chance to see. So improvements around query language and what you can do with it, performance in queries, which is very important, obviously. Some improvements around ops, basically, so cluster management and things like that. But I thought an interesting angle to try and decipher all of those things would be to follow up from the last thing that Emil used to close that conversation we had back then.
Starting point is 00:02:48 So he said that the major goal for Neo4j going forward, we're talking about like two and a half years now, would be to focus on ease of use. So I wanted to ask you, OK, so out of these key improvements that you have made for Neo4j 5, how would you say ease of use sort of permits through them? Yeah, I think that's a really lovely question to start with. So I'll answer in the way that Emil answered last time here, two and a bit years ago. Our 200 engineers have been busy. Now you'll recall Emil responded with my 100 engineers have been busy. Yes. And I'm sure that the next time we talk,
Starting point is 00:03:27 it will be our 300 engineers have been busy as we're still busily trying to hire the brightest minds in databases to come and work on graph problems. But actually, I think our developer experience, our ease of use, if you like, has come along substantially. And in my mind, that's in
Starting point is 00:03:46 quite a few buckets. So the primary surface, if you like, for Neo4j is Cypher, our query language. And if you look at Neo4j 5, Cypher has evolved considerably, somewhat under spontaneous improvements from the Neo4j engineers and product management team so for example we have simplified some of the path patterns in cypher so that they behave kind of how you'd expect them to behave if you if you're in the ascii art world rather than having to drop down into the world of predicates as you did in near 4j uh for and earlier for things like hey i'd like to say if this label, but not this other label.
Starting point is 00:04:27 And in Neo4j 4, you can do that, but it takes a few lines of code, right? A few lines of predicates to check that. And now in Neo4j 5, you are able to kind of just sketch that out as kind of Boolean expressions inside your path expression. It makes the code shorter.
Starting point is 00:04:41 It makes the code more readable. And you can see actually, not only has some of this stuff come along from the very long-tenured group of cipher language designers here at Neo4j, but it's also now, I think, being refined by experience, maturing experience with the GQL standardization process. So I guess your listeners are fairly clued in, but I'll remind them. Not everyone. Okay, so for those of you that are uninitiated
Starting point is 00:05:11 and not standards wonks, then GQL is graph query language. It's the graph analog of SQL, our beloved structured query language, and it's being standardized by the same ISO committee that standardized SQL 100 million years ago. They're now standardizing their second ever language, which is delightful because they are called the Database Query Languages, plural working group. So they've earned that plural finally by standardizing GQL. And some of the ideas that have come out in GQL,
Starting point is 00:05:42 which is an industry-wide collaboration, Neo4j, other pure play graph database folks, some of the web hyperscalers and database old guard are in there as well. I think that kind of crucible of innovation has also surfaced some useful things so that Cypher has evolved. It's improved because of that collaboration. So I think if you're a Neo4j user coming from Neo4j 4 or earlier to Neo4j 5, you're going to find that a lot of the ways that you work with Cypher has become radically simplified. If you're new to Neo4j and listening to this podcast has made you think, hey, I should try some of that graph stuff finally, then what you're going to find is all these claims that graph people make about Cypher being lovely and humane, you're going to be hopefully underwhelmed because you say, yeah, it's kind of true. It just works, right?
Starting point is 00:06:27 I type things in as I kind of imagine they should be, and it sort of works. So I think that's one of the big kind of ease of use features that has changed in Neo4j. I think also what's changed in the kind of second bucket is the operational ease of use. So, of course, ops is super super easy if you choose to use Neo4j's AuraDB that's our cloud offering but if you want to run Neo4j in
Starting point is 00:06:52 house historically that's taken a bit of expertise right and I slightly apologize for that because I built Neo4j's clustering system and you've had to operate that at all you lovely people out there and sometimes it's been fine and sometimes it's been intricate and we've had to operate that at all you lovely people out there. And sometimes it's been fine and sometimes it's been intricate. And we've tried to kind of chisel off some of those intricate edges, if you like. So one of the new features in Neo4j 5 is called autonomous clustering. And it plays a few roles, but in terms of ops simplification, it means that you can take, particularly if you're an enterprise using quite a lot of Neo4j, you can take those three or five or 20 clusters of Neo4j that you currently have, you know, dotted around each supporting different projects and different bits of
Starting point is 00:07:33 infrastructure, and you can rationalize them into a single cluster, which is really nice because it takes your, it decimates your ops complexity. Now, instead of running 30 machines spread across, you know, 10 clusters of three, as to make the arithmetic easy, you run a single cluster of 30. You have a single operational surface, and you declare onto that cluster databases, graphs that you would like to host. And when you declare those graphs onto the cluster, you declare a redundancy level. I'd like this to be replicated three times or five times, depending on your redundancy and scale characteristics. And then the underlying infrastructure makes sure that that constraint is always upheld, even when machines start to fail and get replaced, which is a really nice thing.
Starting point is 00:08:22 Because you're still just managing, from an infrastructural point of view, a single cluster. But as that cluster evolves and machines do fail, and someone's ringing my doorbell, of course, mid-podcast, they can pause. While you are running that cluster, machines fail, networks get, you know, network traffic gets blocked, NICs fail, power supplies fail, and the autonomous clustering infrastructure will reload the cluster.
Starting point is 00:08:58 It kind of plays Tetris with your graphs so that those redundancy constraints are always upheld and so that it minimizes any contention between databases. So it will redeploy your workload to keep it running all the time while you're just managing a single cluster. Now, that has taken a lot of work. It's a good job we've got 200 engineers, approximately, since I think that that was a big piece of work. I think that benefits, if you like, the top end of town, where you've got multiple instances of Neo4j running in a typically larger enterprise. I think for the bottom end of town, it's a simplified operational surface. But if you're only running one cluster to power your startup or your departmental app, you won't see too much benefit from there. I guess the other operational benefit is that we've taken some time to think about the Neo4j admin tool. So Neo4j-admin, which is the command line
Starting point is 00:09:39 tool that people use when they're monitoring and running Neo4j. And we've just taken a lot of time there to simplify the syntax there, to deprecate some of the things that seemed a bit clunky that had accreted over time. So that again, Neo4j admin works much more as you would expect to type it as a kind of ops professional. A whole bunch of good stuff there, really. Indeed.
Starting point is 00:10:01 And well, I have a whole bunch of follow-up questions for you because, well, it sounds only fair. That's your job as interviewer, right? So bring them on. Right. So by starting with the query language, it sounds like what you have done is basically some, well, some people in having a shared background would call it syntactic sugar. So basically, the engine sounds like, well, we can talk about the engine later, because there's also some performance improvements. I'm guessing some things have changed about the engine as well. But let's talk about the interface to the engine. My main concern about what you have just described, so making things
Starting point is 00:10:39 easier and all of that would be, is it backwards compatible? So people with legacy code, do they have to rewrite everything? or does it still work? Yeah, I mean, I think the gold standard for backward compatibility is Microsoft, right? I'm sure that you could take a 16-bit binary from 1991 and still run it on Windows 11, right? Those folks are amazing. We try to be like that. So, for example, in Neo4j 5, there are definitely deprecated things and there are definitely breaking changes.
Starting point is 00:11:08 After all, it's a major version. So if you've got some stuff from Neo4j 3, you think I really want to run that unaltered on Neo4j 5. If you use the default runtime, then it won't work. I say, hey, I don't really understand this pattern or I don't really understand this keyword or so on. But in Neo4j, you can always specify a runtime that you're interested in. We have good backward compatibility there. So even in Neo4j 5,
Starting point is 00:11:30 you're probably going to be able to pick out an older version of a runtime, explicitly specified at the beginning of your query, and have it run in that runtime. Now, you'll probably be missing out on some performance optimizations and that kind of stuff. But while you're in the process of migrating code from Neo4j 3.5 onto Neo4j 5, that can kind of keep the lights on until you're ready to port your code over to the new patterns and expressions and get the performance benefits
Starting point is 00:11:53 that have come along with that. Now, like I said, we don't have Cypher queries from 1991 that you could run on Neo4j 5, but we are pretty good at keeping people's old Cypher queries alive. Okay, yeah, that sounds like a reasonable solution, actually. So if you have to introduce breaking changes at some point, at least you give people the option to run their old runtimes.
Starting point is 00:12:15 I had a similar kind of question about the admin tool, actually, because it sounds like you've done a similar thing right there. And I'm guessing probably the approach will be similar as well. You can run your legacy admin command line to code, but you have to specify the runtime, right? In this case, it's just a slight bit different. So we don't specify a runtime, but all of the legacy commands have now been prepended with the word legacy.
Starting point is 00:12:42 So you do dash dash legacy dash backup. So if you've got scripts that would use just dash, dash, backup as an example, you are going to have to tweak those to dash, dash, legacy, dash, backup. So there is some work to do there. But the functionality of those commands is identical. It's the same code that existed predominantly, modular, any bug fixes or tweaks in Neo4j 4. So if you're familiar and you like that experience, you can continue to use that experience. As I picked on backup, I'll continue that. If you happen to choose the new Neo4j 5 backup, it has a bunch of useful features like you can, for example, decide to do a fast backup, which would postpone the consistency check part of backup until restore time, that kind of thing.
Starting point is 00:13:25 So you've got some more operational options available to you if you choose to use the new stuff. But if you prefer the old stuff from Neo4j 4 and previously, that still exists with a minor syntactic tweak. Okay. So again, there is a solution. You may have to go back and edit your existing scripts a little bit, but it's workable, I think.
Starting point is 00:13:46 I think it is quite manageable. It's literally a global find and replace in your script. Yeah, exactly. So another thing that sort of stood out for me when I was reading the draft press release was the bit about ensuring compatibility between self-managed or workloads and on-premise workloads and i was wondering what could that possibly refer to and i'm hoping you can enlighten me i have after listening to the uh the first part of your answer about the uh the things you did with the uh with cluster management basically I'm starting to get haunts that what you, it sounds a little bit like what you did with on-premise
Starting point is 00:14:30 cloud cluster management sort of resembles what you are possibly doing with your in-cloud cluster management. So I'm guessing that may have something to do with it. Well, I'm glad that you're guessing because the last thing I want users of our cloud to know is our internals, right? It's supposed to be a service. So beyond the endpoint, you shouldn't care.
Starting point is 00:14:51 It could be, what's the old adage? Hamsters running on wheels in order to make this work. So I'm pleased that you can't tell. But you're absolutely right. This kind of infrastructure allows our cloud operations to be able to lay databases out across available physical hardware in some optimal way. So we can run clusters in a great way for our end users and hopefully run it in a way that also minimizes costs for Neo4j because it turns out those cloud providers demand
Starting point is 00:15:22 their fees as well. So there is some of that going on. But I think more broadly, what that attests to, George, is that a lot of folks have run Neo4j on-premise. They've run it for a year because we've been doing that for years, right? I mean, this is my 12th year and third week of being on Neo4j payroll. So we've been doing this on-premise stuff for a long time. And then we come up with AuraDB, and you'll notice that in things like AuraDB, there's not quite feature parity. So for example, when you're running Neo4j on-premise, you have complete carte blanche around, for example, functions or procedures that you might choose to write for yourself and install into the database. Whereas with AuraDB, that's a much
Starting point is 00:16:05 narrower set of functions and procedures. In fact, it's a security checked subset of the APOC library, the kind of vast utility library for Neo4j. And you're not today allowed to push your own procedures necessarily into Aura, because you might do nefarious things you might push a procedure that's really a root kit right which would be uh extremely damaging so what we're what we've been doing some preparatory work in the background there and you'll see this over the course of neo4j5 with actually some some research work research with a capital r uh that my team are doing uh around uh intermediate uh intermediate uh representation analysis for being able to analyze your procedures and check them to set that they're not doing any nefarious things so that you will be able to deploy them potentially into AuraDB. And at that point, we actually do reach feature parity. The
Starting point is 00:17:01 last great building block of feature parity between AuraDB and Neo4j on-prem becomes the same. And so what we're doing over time is we're eroding into that gap and making sure that the experience that you have with AuraDB is as familiar as you have with Neo4j on-prem. For folks new to Neo4j who come in straight into Aura, they're not going to notice. Aura is relatively friction-free. They can get going and be productive that way. But for certainly people who've got sophisticated on-premises installations, we want to ease their path into the cloud should they choose to go there over the medium term. Okay. So just to get the facts straight, do you have feature parity at this point or are you close?
Starting point is 00:17:43 No, we're closing. So you'll notice over time the feature parity diminishes. So when we first had AuraDB, we didn't have any procedures at all. Now we have AuraDB running with a subset of APOC, a substantial subset of APOC called APOC Core. And then over time, we will be able to close that gap completely. But for things like other important enterprise considerations, for example, like security and so on,
Starting point is 00:18:08 those things are there and already be. They were featured parity early in the game because otherwise people can't safely run. Yeah, makes sense. Okay, so to wrap up the release-specific topics, let's revisit a little bit the issue of performance. And I know that traditionally Neo4j has a policy of not releasing benchmarks that you do internally.
Starting point is 00:18:36 However, I wanted to ask you because, you know, there's some like pretty big claims in the draft release that I saw like, you know, up to 1,000 performance gain and so on. So let's see how's the best way to frame that question. So how can people know? I mean, what's the best way for people to, you know, to see with their own eyes, like, wow, this is, you know, vastly superior performance? Yeah, I mean, the best way is to build your system, or at least to build a representative stripe through your system to get that kind of confidence. But I realize that that
Starting point is 00:19:11 can be an expensive proposition. So let me tell you a little bit about the changed infrastructure that's in Neo4j. So between four and five, we have optimized storage formats, new indexes, new runtime for Cypher, as well as performance improvements to the existing runtimes. The runtime for Cypher, I think, is pretty interesting. There's a new runtime called the parallel runtime. And that actually came out of a collaborative EU project. So we took part in a Horizon 2020 project for folks outside the EU. The EU sponsors collaborative scientific research at a continental wide scale. Hilariously, of course, I sit here in the UK, which is no longer part of the EU for bizarre reasons.
Starting point is 00:19:58 But Neo4j being a Swedish company, we were able to participate and we participated with a number of universities uk greece and also we participated with a number of other hardware and database companies netherlands uh crete weirdly um a hotbed of uh silicon activity it turns out is crete and uh and through that project we were able to develop um under the leadership Dr. Pontus Melker, one of our query language experts, we were able to develop a parallel runtime for our cipher query language. Now, the output from that project was fairly rough around the edges, but we've then taken the subsequent time between the end of the particular project and the release of Neo4j 5 to be able to productionize that. So we now have a parallel runtime so that if you want to do single query,
Starting point is 00:20:49 large graph analytics, you're now able to do that and consume potentially all of the cores and all of the RAM that are in your server. And of course, if you've got other queries running alongside that, we make it sufficiently smooth so that if you like your big query
Starting point is 00:21:04 doesn't consume all of the resources. The queries are respectful given the runtime and they will only consume cores and RAM and so on if other transactions or other queries aren't vying for it. Now on top of that you're right there was a thousand X faster query performance. And I think that is that is true. But also, I think it's one of these things where we've managed to take something that was, I think, slower in Cypher than it should have been by a reasonable yardstick and be able to transform that into something that performs predictably well in Cypher. And I think because of that, you can see in certain edge cases, large performance increases. I think typically what you're going to see
Starting point is 00:21:50 for a kind of, you know, hopefully an unreal typical Cypher workload, you might see a several X performance, an order of magnitude performance increase, even given the fact that Cypher has improved the indexing, that the storage engine has improved underneath it. So that whole integratedher has improved, the indexing, the storage engine has improved underneath it. So that whole integrated stack has improved across the board. And you should, I think, reasonably see a good performance increase, even if you do nothing to your existing Cypher.
Starting point is 00:22:15 Provided it runs on Neo4j 5, it's going to run a bit faster. That's still pretty massive. So I guess probably the easiest way for people to check that would be to upgrade and check their performance before and after the fact. Absolutely. In fact, my team at Neo4j, we're very interested in performance. So we measure these things and we're comfortably happy with where Neo4j has gone so far. And if anything, my team are Neo4j's fiercest critics in terms of performance.
Starting point is 00:22:42 So if we're not unhappy, I think that's not such a bad outcome. Okay, well, interesting. All right, so I think we have just a few minutes left. So I thought it'd be interesting to pick your brain on where you think the market as a whole is going, because there's a number of things that are sort of converging. And well, first of all, in my opinion, it looks like the honeymoon period, let's say, of the graph scene is coming to an end. And that's sort of normal. You know, you have this hype cycles thing. And well, the peak of the inflated expectations looks like it's behind us. So people are having to come to terms with, you know, what graph
Starting point is 00:23:25 databases really are and what they can do and what they are not so good at and so on and so forth. So that's one. Second, we have a kind of what's the global thing really a change in economic climate, you know, downturn, recession, whatever it is you want to call it. So it used to be that, you know, tech companies in general were on top of the world. And now, well, not so much. There's been layoffs. And actually, some of the layoffs I've seen have impacted some vendors in the graph space as well. And so how do you think, so where do you think this is all going, basically?
Starting point is 00:24:00 And what's Neo4j's positioning into this whole scene? Yeah, I think they're two very good interlinked questions George so let me take the first one. I'm actually happy that we're over the hype phase because people started imagining all sorts of you know insane possibilities for graph databases which were not at all backed up by the computer science so we're now I think we're in a really good phase where people understand, increasingly understand what graph databases are good for, and that's helping the market, right?
Starting point is 00:24:31 No longer are people coming to, for example, Neo4j with something that to me, obviously looks like a Cassandra-shaped query, right? It's like, okay, look, people are really learning to differentiate how their data looks. Graphs are the kind of weird new boy on the block, if you like, and they're starting to understand graphs. So that hype is evaporating from the market is good because it's leaving substance. And so what we're seeing at
Starting point is 00:24:55 Neo4j is that when people recognize they have graph problems, which is increasingly common because modern data is sometimes very structured, very uniform, sometimes very sparse and very irregular, and that suits graphs very well. They come to Neo4j with realistic graph problems that we can help them solve, right? Which I think is a good sign of market maturation. So I don't mind that. I think we'll kind of,
Starting point is 00:25:19 if you think it's a garden hype cycle, we'll bump along the bottom for a while. But for us, that's not a bad signal because during that same bumping on the bottom, our metrics internally at Neo4j are all ticking upwards very nicely. So we do see, although we may be a trough of disillusionment or whatever trademark Gartner have for that, we may well be there. But also our other metrics are tracking in the right direction. In fact, then when I look at other things that Gartner are saying, Gartner are telling us this is like a gazillion dollar industry by 2025. Right. And although I don't yet have a gazillion dollars, I do think that the direction of travel that the analysts are saying is broadly correct. Right. I don't think they are stupidly optimistic or stupidly pessimistic. I also don't think they're probably accurate because who can forecast that,
Starting point is 00:26:06 especially given the current macro climate. But it's kind of trending in the right direction. So I feel there's market maturation going on. I do think that we're going to see a bit of a shakedown because of macro. At Neo4j, we're not immune to that. We are still hiring, for sure. We would never, unless things were disastrous stop hiring so we want to bring talent into neo4j you know if you happen to know any brilliant
Starting point is 00:26:31 database engineers send them our way we would love to chat to them so our plan is to continue to grow through you know possibly the next year or so of recession who knows right um because we think that the market, the total addressable market for graph databases is huge. They are a good general purpose platform. Many IT systems could be conveniently and humanely built on a graph. So we're feeling good about that. Now, you're right, there's a shakedown sector-wide. Some graph database players are having to, you know, shed jobs, restructure and all that kind of stuff and i think that that's sad for the industry i feel particularly for those people that are losing their jobs at this point but i don't think it's detrimental at a macro level to graph databases
Starting point is 00:27:17 as a whole right i think you know some companies have you've got money in the bank of plan some companies they've you know this thing has turned up this this downtown has happened and it's caught them on the hop but i think that's a company by company thing i don't think it's a systemic thing across the graph database industry certainly the metrics that we see and that we understand you know from from the industry at large including some of the web hyperscalers is that the interest in graph continues to grow as the as at the same time the understanding of graph matures. And I think that's quite a solid foundation for the next decade or so of growth in the industry.
Starting point is 00:27:52 Yeah, that's my understanding as well. And it's interesting to hear that. It sounds like we share that, but the difference is that in your case, you also have your internal metrics and well, even above metrics, you have something more valuable. You have, well, customer feedback and, well, revenue too. These are things that you can count on more than metrics or anything else.
Starting point is 00:28:16 Hopefully so, right? I mean, as long as the US dollar doesn't crash as a reserve currency, we'll be fine. Well, that's also an option that's not entirely off the table, but you know, in that case we'll have a bigger issue. Bigger issues to solve, you're right. Okay, well great, it's been a pleasure. We finally managed to have a conversation on record as well, you know, with one minor disruption, so it went reasonably well.
