Orchestrate all the Things - Graph, machine learning, hype and beyond: ArangoDB open source multi-model database releases version 3.7. Featuring CEO Claudius Weinberger, Head of Engineering/ML Jörg Schad

Episode Date: August 27, 2020

This episode is all about multi-model and graph databases. ArangoDB is a sui generis, multi-model open source database, designed from the ground up to be distributed. ArangoDB keeps up with the times and uses graph and machine learning as the entry points for its offering. ArangoDB CEO and co-founder Claudius Weinberger and ArangoDB Head of Engineering and Machine Learning Jörg Schad discuss multi-model, machine learning, hype, and the database market in general. Article published on ZDNet

Transcript
Starting point is 00:00:00 Welcome to the Orchestrate All the Things podcast. I'm George Anadiotis and we'll be connecting the dots together. This episode is all about multi-model and graph databases. The conversation today is with ArangoDB CEO and co-founder Claudius Weinberger and ArangoDB Head of Engineering and Machine Learning Jörg Schad. The occasion is the latest release, version 3.7, of ArangoDB, an open source multi-model database that supports graph. Since many of the new features in this
Starting point is 00:00:33 release are around graph, it's a focal point for the conversation. We also touch upon multi-model, machine learning, hype and the database market in general. I hope you will enjoy the podcast. If you like my work, you can follow Linked Data Orchestration on Twitter, LinkedIn and Facebook. Thank you, Jörg and Claudius, for taking the time to have this conversation today.
Starting point is 00:00:58 And the occasion is the release, the upcoming release of ArangoDB version 3.7, in which there are some interesting features, particularly concerning graph, which we're going to talk about as we progress with the conversation. However, I would like to start the conversation by asking you to say a few words about ArangoDB itself, because not everyone
Starting point is 00:01:25 who's listening is necessarily familiar with what you do. And I would also like you to say a few words about yourselves and your roles in Arango. So you can start in whatever order you'd like.
Starting point is 00:01:41 I believe Claudius is actually the best person to start. Okay, let's do this. So first of all, thanks for having us on the call today and taking the time for us. My name is Claudius Weinberger. I'm the CEO and co-founder of ArangoDB. Maybe let's start with a little bit of background. So Frank, our CTO and my co-founder, and I started to build ArangoDB, maybe a very early first prototype, in around 2013. And after we had done that and proved that our idea could work,
Starting point is 00:02:19 we took a little bit more time to think about whether it really would make sense to build another database. We have now been working together for 20 years; at that point it was already over 10 years, so we are not so young that we simply think, we can do it, let's do it. So it must make sense and there must be a market. And the main idea of ArangoDB, which is still valid today, is what we call the native multi-model approach. That means that we found a way to combine the JSON document data model, the graph model, and the key-value model really in one database core
Starting point is 00:03:01 with one query language. And the one query language is very key for that from our perspective, and we will maybe say a little bit more about that later. But it's a full graph database, and we'll also speak about that a little bit later. So that was really the idea. Around 2014-15, we started it as our own company. We got our first funding.
Starting point is 00:03:29 In 2018, we also added ArangoSearch. ArangoSearch is a full-text retrieval engine, which is also written in C++ and added to ArangoDB. There's a little bit of a story behind that; I'm happy to speak about that later. In 2019, we got our Series A funding and opened our office in the US; a little bit before that we made the flip to the US, so ArangoDB is a US company with a large subsidiary in Germany. And we hired our first chief revenue officer, Matt Ekstrom,
Starting point is 00:04:11 and also a new head of engineering, and that is Jörg. Maybe it's a good point, Jörg, for you to tell a little bit about yourself and the further history. Sure. I'm Jörg, and I'm the head of engineering and machine learning over here at ArangoDB. And yeah, I just joined last year, but I have actually been working with ArangoDB in kind of a partnership, also discussing cool ideas such as transactions and other things, probably for the past four years, right? If I recall that correctly, Claudius. As for my background, I've always been kind of switching back and forth between databases.
Starting point is 00:04:51 So my research, my PhD, that was mostly database systems and distributed data analytics. Then I moved over to large-scale infrastructure and container systems; I was one of the early Kubernetes contributors, and I worked on Apache Mesos. And this is where I got to know ArangoDB, as a partner.
Starting point is 00:05:14 And then I helped work out the different frameworks and helped out working on the operator. And yeah, then last year, when Claudius pitched that I might actually want to join, this was like the perfect combination of both those passions: large-scale infrastructure, especially machine learning infrastructure,
Starting point is 00:05:32 and databases. And this is why I'm pretty excited that I can be part of this journey here. Okay. Well, thank you both. That's a good intro, I think, for anyone who may be listening to get a good idea about the basics of ArangoDB, what your roles are there, and what you do. Now, I've been following ArangoDB for a while now and, you know, I've noticed a slight change of messaging, let's say,
Starting point is 00:06:08 around what you put forward to people in this last year. And to give a little bit of background about myself and how I got to know ArangoDB: for me personally, the touch point was the graph capabilities, because this is an area I specialize in. So I always keep an eye on all things graph and graph databases, and obviously this includes ArangoDB as well. So in the past, I had the feeling that the message was mostly around what Claudius mentioned, the multi-model capabilities.
Starting point is 00:06:47 And I've noticed that in the last year or so, there's quite some emphasis on graph. And so I wonder if you could share, well, first of all, whether my impression is indeed true. And if yes, I would like to hear a few words from you on what the rationale for this slight change of focus was. Do you notice increased demand for graph capabilities, or is it something that you personally believe is important, or something else maybe? I'm happy to get started and then, Claudius, feel free to jump in.
Starting point is 00:07:26 So I believe it's a combination of two factors. First of all, it's what you mentioned: we're seeing a lot of demand. A lot of users, both in the open source community as well as paying customers, are coming with graph use cases, which then expand into multi-model use cases later on. So I wouldn't say that we totally diverge from the multi-model idea. It's just the angle, or the perspective, from which we see it. We see it as graph and beyond, where graph is a central use case, but we firmly believe that we actually need more data models to support efficient and successful graph use cases.
Starting point is 00:08:13 So I would just call it a slight change of perspective, which comes both from customer demand and, of course, because it helps us to have an easier message out there and helps people identify more easily with one use case, one starting angle. Okay, thank you. Yeah, Jörg already said most of it. What we really see is a lot of graph use cases. But what I also strongly believe is that many people looking at graph databases have real-world use cases that are only partially graph, and they are primarily looking at graph databases as a first step.
Starting point is 00:09:02 But really the point, and what we will maybe also sharpen a little bit more in our new claim, is that you can really grow with your use case if you use ArangoDB. Okay, thanks. By the way, I wanted to ask if you have any quick comment on, well, the latest hype, let's say. As you know, it's around this time of year that Gartner releases their emerging tech hype cycle, and this one was just released a few days ago. A couple of things in it piqued my interest. First of all, we've been talking about graphs, and it seems that they have conflated graphs and ontologies into one category, which to me doesn't entirely make sense.
Starting point is 00:09:52 And the second is that, well, they've moved that from the peak of inflated expectations to the trough of disillusionment. And, you know, I've heard different people offering different comments on that. To me, it means a couple of things. First, they kind of had to move it to make room for new technologies; there's a bunch of
Starting point is 00:10:18 newly appearing technologies in this hype cycle. Then the second thing is that, well, you have to get through that stage to get to what they call the slope of enlightenment. So I wouldn't necessarily interpret that as a negative thing; it just means that adoption is moving on, basically. And I wonder if you have any thoughts on that. Yes, happy to get started again. So I believe, from a graph perspective first (happy to jump on the conflation with ontologies later in a follow-up), what we are seeing is that a lot of people started out using graph databases and then they often hit
Starting point is 00:11:05 scalability limits. And this is where we see a lot of people coming over, and this is a lot of the discussions we're having, whether you just follow our open source Slack channel or sit in on calls with prospects or customers: we see that they're actually a bit disillusioned about these scaling capabilities. But I believe this is actually where we often have a pretty fruitful conversation about how we can actually scale out graphs across a distributed cluster. From a use case perspective, I actually don't really see that valley right now; I'd rather say use cases are still increasing. There's still a lot of trial and error, where people try out different graph use cases, but by now people have actually identified a number of
Starting point is 00:11:57 successful use cases where graph really makes sense. And it's kind of an established pattern now; it has become more mature in terms of graph use cases. Thank you, Jörg. And adding to what you already said: if you look a little bit at how analysts look at new markets, and what their pattern is, it's completely normal. I think it's also a good sign that the whole graph story is moving on. There is a hype, and at some point the hype goes back a little bit. And on the other hand, it's true that some people have started to use graph for many use cases.
Starting point is 00:12:46 So I think the whole industry has to be a little bit careful with what we are promising, namely that we can do everything faster than other databases. I think that is not the main reason people should look at graphs, and it's unfortunately also not true for every use case, especially if you go into the scaling area. Okay, so I think this conversation that we've started around graphs and differentiation and scaling and so on may be a good stepping stone
Starting point is 00:13:19 to actually get into the specifics of both the new release and your approach regarding graphs, and how that connects to the multi-model approach that ArangoDB takes. I noticed that there are three new features specifically centered around graphs in the new release: satellite graphs, disjoint smart graphs, and parallel graph traversals. And so what I'd like to ask you is to introduce them briefly. And then, if you would like, based on that, offer your point of view on how they work, how they're connected to your query language, which you briefly touched on already, and where you see ArangoDB's differentiation in terms of graph capabilities and the connection
Starting point is 00:14:11 with multi-model. So if I had to combine all the different features you just mentioned, I would put them under a single theme: graph performance at scale. And actually, all of them came out of customer or user requirements. We've been in active conversation with those prospects, and again with the open source community, gathering feedback on what they actually need. So they're actually fitted to a number of use cases
Starting point is 00:14:41 we're seeing in that graph space. So let's, for example, start with satellite graphs. A pattern we have been seeing over and over again is that there's some very large collection of, for example, sensor readings or something similar, which simply needs to be sharded across multiple servers because it's too large for a single server. But then I have a small graph representing the metadata, for example, sensor locations or other metadata associated with that. And that's not necessarily super large; it's just the combination of the metadata and the individual sensor readings
Starting point is 00:15:23 that is something very large which needs to be sharded. And with satellite graphs, we then have an automatic approach that simply replicates the metadata across the different nodes. And that allows us, or our query optimizer actually, to perform query optimizations that push down all the computations to the individual nodes.
Starting point is 00:15:46 And hence, I guess this is a standard pattern, or this is the goal for many of the scale-out optimizations in graphs: that we keep computation local to single nodes and reduce the amount of cross-node communication, even though the data needs to be sharded across multiple nodes.
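To make the satellite idea concrete, here is a minimal sketch using the python-arango driver. It uses a satellite collection to illustrate the replication pattern Jörg describes (in 3.7, a whole graph can also be declared satellite through the graph API). Database names and credentials are invented, and "satellite" replication is an Enterprise Edition feature.

```python
from arango import ArangoClient

client = ArangoClient(hosts="http://localhost:8529")
db = client.db("iot", username="root", password="secret")  # hypothetical setup

# Large fact data: sharded across the cluster because no single server holds it.
readings = db.create_collection("readings", shard_count=9)

# Small metadata: replicated to every DB-Server ("satellite" replication),
# so joins against it never have to leave the node holding a readings shard.
sensors = db.create_collection("sensors", replication_factor="satellite")

# The optimizer can push this join down to each shard locally, with no
# cross-node traffic for the metadata side.
cursor = db.aql.execute(
    """
    FOR r IN readings
      FOR s IN sensors
        FILTER r.sensor == s._key AND s.location == @loc
        RETURN {value: r.value, sensor: s.name}
    """,
    bind_vars={"loc": "berlin"},
)
```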
Starting point is 00:16:47 Disjoint smart graphs actually go in a pretty similar direction. We have had smart graphs for a while, and smart graphs basically imply a smart sharding mechanism where, depending on how your data is set up, we try to shard in a way that the number of hops between nodes is, again, minimal. And with disjoint smart graphs, basically, if we can prove that the resulting subgraphs are actually subpartitions, so they are disjoint, we can then, again, have a number of optimizations from the query optimizer, invisibly to the user, which can push down a lot more computation to the individual servers, simply because we know that they're actually disjoint subparts and subpartitions of the graph. Parallel traversals are slightly different, but this is again leveraging the fact that data is lying on a number of different shards to start a number of traversals in parallel, and with that feature we actually allow you to specify the degree of parallelism. Probably in a follow-up version we will get to an automatic way of parallelizing that, but at ArangoDB, being also in enterprise software, we follow the approach of pushing things out in steps and first having people experience them.
Starting point is 00:17:52 And then we can actually identify the patterns and automatically parallelize AQL queries.
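As a rough illustration of both features, here is a hedged sketch, again with python-arango. Graph, collection, and attribute names are invented; the smart, disjoint, and parallelism options are Enterprise Edition features, and the exact driver flags vary by driver version.

```python
from arango import ArangoClient

client = ArangoClient(hosts="http://localhost:8529")
db = client.db("social", username="root", password="secret")  # hypothetical setup

# SmartGraph: vertices sharing the same value of the partition attribute
# ("region") are co-located on one shard, so most hops stay on one server.
# disjoint=True additionally asserts that no edges cross region boundaries,
# which unlocks further optimizer pushdowns.
graph = db.create_graph(
    "customers",
    edge_definitions=[{
        "edge_collection": "interactions",
        "from_vertex_collections": ["accounts"],
        "to_vertex_collections": ["accounts"],
    }],
    smart=True,
    smart_field="region",
    shard_count=9,
    disjoint=True,
)

# Parallel traversal: the parallelism option lets several threads work on
# one traversal at once; the result is the same with or without the hint.
cursor = db.aql.execute(
    """
    FOR v, e IN 1..3 OUTBOUND @start GRAPH 'customers'
      OPTIONS {parallelism: 4}
      RETURN v.name
    """,
    bind_vars={"start": "accounts/us:alice"},
)
```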
Starting point is 00:18:45 Yeah, that makes sense. Since you mentioned automating the parallel traversals, there's a similar question I wanted to ask about the smart graph features specifically. I was wondering whether the sharding is done automatically by the database or there has to be some user involvement, because typically that's an operation that requires some knowledge of either the schema, if there is one, or the distribution or statistical properties, because you have to decide in which way you want to split your graph, basically. It makes sense to put together nodes that are typically traversed together. So can you say a little bit about how this works? Yeah. Claudius, if you want to mention a little bit about the history of that feature, I'm happy for you to jump in. Of course, I found that really interesting. Yeah, then you start first; I will add something about the history. Ideally, of course, we get a partition key from the user, because that's going to make it a whole lot simpler for us,
Starting point is 00:19:28 because the user usually has more knowledge, more semantic knowledge. But in addition to that, we can basically also analyze the number of hops in between. So if we know the edges, we basically try to minimize the number of edges going between nodes. So it's, so to speak, an optimization problem. Okay.
Starting point is 00:20:10 When we started to build ArangoDB as a distributed system, shortly after the first release, it was clear that if you go in the direction of distributing a graph, then you run into exactly the issues you mentioned before. If you simply do a random distribution of your graph, which is still possible, in most cases the performance is not as expected for many people, and it can also have very different
Starting point is 00:20:39 performance results for the same traversal when starting from different starting nodes, not only because the graph is very different at that point in the graph, but also because you may have more network hops with one starting point than with another. So there were a lot of ideas, and we evaluated many of them, for doing optimizations at the database level. And yeah, there are statistics and such that we can use for that,
Starting point is 00:21:09 which we can even get better with over time. But it simply would mean that these optimizations would be ex post. That means you insert your data, you do your first graph traversal, and it is slow, getting faster over time. But what normal use cases very often have is that they insert new data, maybe a user just created something in the application, and then they run the graph traversals directly after that, and they need this graph traversal to be fast.
Starting point is 00:21:38 And that means every ex-post optimization you could do in a database would not fulfill this requirement. And so we moved quite early to an approach where we use the domain knowledge that most applications have about how their data is connected (for data that is connected in a way, but not completely connected, we now have disjoint smart graphs as a special case). But it already makes it extremely easy to go this way, because you only have to set the partition key,
Starting point is 00:22:10 or however you call it, and everything else, the query, the query optimization, the query distribution, and so on, happens in the query optimizer, completely invisible to the user. I think this is a very good compromise: it comes to a result that makes sense for real-world use cases on the one hand, and on the other hand it enables you to shard a graph.
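A quick sketch of what that user-side contract looks like in practice, continuing the hypothetical SmartGraph from the earlier sketch: the application only supplies the partition attribute, and ArangoDB derives the placement from it.

```python
from arango import ArangoClient

db = ArangoClient(hosts="http://localhost:8529").db(
    "social", username="root", password="secret")  # hypothetical setup

graph = db.graph("customers")

# The only user involvement: put the partition attribute on each vertex.
# Placement is automatic; the attribute value is encoded as a key prefix
# ("<region>:<key>"), e.g. "us:12345".
accounts = graph.vertex_collection("accounts")
alice = accounts.insert({"region": "us", "name": "Alice"})
bob = accounts.insert({"region": "us", "name": "Bob"})

# Same region means same shard, so this edge never crosses servers.
graph.edge_collection("interactions").insert({
    "_from": alice["_id"],
    "_to": bob["_id"],
    "type": "payment",
})
```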
Starting point is 00:22:40 And I believe that also goes back to your question: where's the difference between ArangoDB and others? Since ArangoDB has been designed from the ground up to be a distributed database, a distributed graph database as part of multi-model, the optimizer and basically the entire infrastructure take care of this and are written to be distributed. So it's not just the user interaction that Claudius just mentioned.
Starting point is 00:23:07 It's really kind of the internal core engine which has been written to be a distributed database. Okay. I guess this, well, this philosophy is also reflected beyond the query optimizer, also to the query language. And my feeling, my intuition is that maybe that's the reason behind, well, not only having your own query language,
Starting point is 00:23:37 but sticking with it as well. And there are certain benefits to that, I'm sure. On the other hand, this kind of creates a gap, let's say, for users who want interoperability with standards regarding the query language. Well, there's one of them that's already been around for a few years, SPARQL. There's another one that is emerging, GQL. And I know that there have been people from ArangoDB in the latest W3C workshop on interoperability.
Starting point is 00:24:15 and what your goals are, what your plans are in terms of supporting interoperability, not necessarily on the query language level, but at the very minimum, I would say, on the data format level. So being able to import and export using some standard there. Claudius, do you want to jump into the history? Otherwise, I can cover present and future plans. Happy to start.
Starting point is 00:24:42 Let me go a little bit into the history. So when we started ArangoDB, five years ago or a little bit more, there was not really a standard. There were many products out in the space, and there was already the stuff you mentioned: SPARQL, yes, but SPARQL
Starting point is 00:25:01 is not really a fit for the way we built the graph in ArangoDB. And so that was also part of our discussion at that time, where we sat together for quite a long time. The idea was not primarily to have our own query language, but it was important that the query language supports the graph, document, and key-value models at the same time, because we see a great benefit in that. And so there were two ways to go. One finding was that none of the graph query languages would fulfill this requirement, because they were primarily written, in most
Starting point is 00:25:45 cases at that time, only for the graph use case. And the other option was SQL, and SQL could not cover all this stuff. So we could either build our own SQL dialect, or build our own language. And as those of us who have been in the business for quite a long time know, if you call something SQL but it's not SQL in the end, it upsets people more than if you name it something of its own. So we developed AQL, which is our own query language. But now comes the point where that may change, and that is the point where maybe you can take over. So as you already mentioned, we are also part of the GQL committee and working on coming up with that standard,
Starting point is 00:26:33 which often is a very interesting political challenge more than just a technical challenge in itself. And for that, we are of course planning to support it and are already looking into the first prototypes. But I believe it's only going to cover part of the value proposition we are offering. So as mentioned earlier, graph is often kind of the entry vector for users getting to ArangoDB, so there it most certainly could be helpful. But overall, if you're leveraging the entire capability, so multi-model, or also ArangoSearch being the full-text search engine,
Starting point is 00:27:15 you actually need support for that in a query language as well. And this is where we see that AQL will still be around and will still be the standard for the multi-model aspect inside of ArangoDB.
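To make the multi-model point concrete, here is a hedged sketch of a single AQL query that touches all three models; the collection and graph names are invented for illustration.

```python
from arango import ArangoClient

db = ArangoClient(hosts="http://localhost:8529").db(
    "shop", username="root", password="secret")  # hypothetical setup

# One AQL statement: filter JSON documents, traverse a graph, and finish
# with a key-value style lookup via DOCUMENT(), all in one language.
cursor = db.aql.execute(
    """
    FOR order IN orders
      FILTER order.total > @min                        // document model
      FOR v IN 1..2 OUTBOUND order._id GRAPH 'supply'  // graph model
        RETURN {
          order: order._key,
          supplier: DOCUMENT('suppliers', v.supplierKey)  // key-value lookup
        }
    """,
    bind_vars={"min": 100},
)
```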
Starting point is 00:27:37 In terms of connectivity, I kind of interpreted that as, basically, what does our, let's call it, RDF ecosystem look like. There we are also working together with the community. There are a number of people doing things there, like imports and exports, and we're working together with that community as well to integrate, with an import and export focus, mostly, right now. Okay, thanks. So let's wrap up then.
Starting point is 00:28:13 I mean, we have focused, I would say, almost exclusively on graph, but you have to excuse my curiosity. And it's also a focal point for the new release, so I guess it's kind of warranted. But let's at least briefly touch upon some other features as well. So you already mentioned search. And I know that this is also something that has new features, new feature support in the upcoming release.
Starting point is 00:28:39 So to be honest with you, the thing that impressed me the most about ArangoDB's search is the fact that it exists, and that it exists as a standalone, integral part of the product. Because what many other databases do is basically integrate some off-the-shelf open source solution, like, I don't know, Lucene or Solr or something like that. So I was wondering what your rationale was for developing your own, and whether you find that making that choice pays off. I would definitely say so, yes.
Starting point is 00:29:14 So I believe this comes down to a multi-model database actually being more than the sum of its parts, right? Because the value proposition comes from the fact that I can combine a search directly with, for example, a follow-up graph traversal. I can use my full-text search to identify a number of documents and then actually leverage the metadata graph, for example,
Starting point is 00:29:39 to extract more information and to classify it. So I believe the main value proposition comes from it all being one unit together, not individual parts, because they can just interact with each other. And this is also why, from a historic perspective, where Claudius again can probably give more background than I can, it was integrated into ArangoDB itself. And this is where we also see the most value with customers and users right now.
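A sketch of the combination Jörg describes, with an invented ArangoSearch view and graph: a full-text SEARCH ranked by BM25 feeding straight into a graph traversal, in one query.

```python
from arango import ArangoClient

db = ArangoClient(hosts="http://localhost:8529").db(
    "library", username="root", password="secret")  # hypothetical setup

# Find the best-matching documents via the ArangoSearch view, then hop
# through the citation graph around each hit, without a second system.
cursor = db.aql.execute(
    """
    FOR doc IN articles_view
      SEARCH ANALYZER(doc.body IN TOKENS(@q, 'text_en'), 'text_en')
      SORT BM25(doc) DESC
      LIMIT 10
      FOR related IN 1..2 ANY doc._id GRAPH 'citations'
        RETURN DISTINCT related.title
    """,
    bind_vars={"q": "graph databases"},
)
```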
Starting point is 00:30:27 So I personally worked with Lucene for the first time 20 years ago, quite early. And of the products you mentioned, Solr in the end is also using Lucene, as do other open source products. So Lucene is more or less the standard library for full-text retrieval. But it unfortunately has one problem: it's written in Java, and ArangoDB is written in C++. To integrate a Java library into a product in the way Jörg mentioned, fully integrated, completely part of the query optimizer, would be a very big challenge. At some point, we found something else.
Starting point is 00:31:03 Just to mention, ArangoSearch is also based on a library, which is itself an open source product, called iResearch. It's written in C++ and was originally started by EMC in one of their R&D labs. After Dell bought EMC, they unfortunately decided to close this R&D lab, and so we hired the lead developers of this library.
Starting point is 00:31:37 It does not yet have the same traction as Lucene, but in the end, it's really a component that is not closed in any way. It's also open source. It's its own component, and it can also be used in other C++ or C products if people would like to do that. Okay, thanks. Thanks for the historical perspective. I think it explains a lot, I would say. So, the last point I wanted to go over, and let's wrap up with that, was around machine learning. And I think, again, this is something that has been developed,
Starting point is 00:32:08 or at least promoted, let's say, fairly recently. And the idea there seems to be to encourage developers to utilize ArangoDB as a common metadata layer through which they can sort of unify, let's say, their machine learning pipelines. That seems to make sense. But, you know, the question that popped up for me was,
Starting point is 00:32:39 well, okay, it does make sense, but, you know, why ArangoDB? Why not use any other database or any other data storage system for that role? So that's the first part of the question. And the second is, how much work is needed by the developer to get that metadata in there, or whether there are some kind of pre-made connectors that make things easier, possibly? Very good question. So this is actually part of the reason I joined ArangoDB as well. So as mentioned, I've been switching back and forth between databases
Starting point is 00:33:14 and large-scale infrastructure. And in past jobs, I've been building machine learning pipelines and machine learning infrastructure, mostly for finance and healthcare use cases. And one of the biggest challenges we saw there was audit trails: for CCPA, or GDPR over here in Europe, it was really necessary to have a full view of the entire pipeline.
Starting point is 00:33:38 And so one day we actually had to figure out what happens if a patient withdraws his consent for us to use his data in a data set, and just being able to identify the different models we had deployed in production was something very challenging, because we had to go through like five different metadata stores: the machine learning part, the data and feature transformation part. So it was very hard, and we started to look at having a common layer with all the metadata, where it basically would end up being one query. And if you just think about
Starting point is 00:34:13 modeling that in a relational system, this will end up with a lot of joins, and especially a lot of self-joins if you consider that a machine learning feature might be derived from another feature. So those became fairly ugly queries in the end, and also from a performance point of view and a data insertion point of view, it wasn't that nice. And this is where we started looking at graph, and then in particular multi-model databases. And the advantage here is that I can combine the flexibility of having no schema, because multi-model supports a document view.
Starting point is 00:34:50 So each metadata instance I can actually just store in a relatively free format, because the metadata I get really varies between different machine learning frameworks. And it also varies a lot between different company or team setups, in terms of what I can record. So that part is so unstructured that it's perfect for a document, for a JSON-like format. And then the second part coming in there is that I still need the structure of how things are connected.
Starting point is 00:35:20 And this is where the graph aspect comes in again. So in the end, looking up which models are impacted by, or derived from, this one data set is just a graph traversal. So it turned out to be a really easy model, both flexible and also very efficient in terms of formulating this query, like many others as well. So I think this was kind of the motivation to go first with a graph-like database, and then in particular the multi-model aspect
Starting point is 00:36:21 of storing a free, schemaless JSON format was kind of the key for ArangoDB here. And as I said, that was before I actually joined ArangoDB, in my old job.
Starting point is 00:36:21 We are also right now working on the Kubeflow integration where basically you can just drop in and there are kind of APIs so you don't have to do anything. Your metadata will get recorded. If you have a free setup a machine learning workflow, which actually many teams I know have because I just built it from scratch, then there is a simple Python API you can use to store and retrieve your metadata.
Starting point is 00:36:53 Okay. Okay, I see. Thanks. That's quite a comprehensive explanation for that. I was kind of assuming that this was the touch point for you to follow with Tharango because I knew that that was your own background. So thanks. Yeah, I think we covered all the points I had in my list.
Starting point is 00:37:16 So unless you want to add something, I'm good. Maybe just one other aspect which we're seeing currently a lot, and this is like besides the actual graph capabilities or crew capabilities, it's kind of the deployment setup where we see this trend to having much more support for Kubernetes or cloud-like environments. And from a database perspective, this is actually interesting because it gives us a much more dynamic infrastructure than just having fixed servers which are going to run for the next two years.
Starting point is 00:37:55 And this is another criteria or another feature set in 3.7 where we worked a lot on, it's called cluster scalability in such dynamic environments where we worked a lot on, it's called a cluster scalability in such dynamic environments where nodes are being swapped out. My Kubernetes is going to reschedule my pod and other things. So I think this is the other last interesting trend which has reflected quite a bit in 3.7.
Starting point is 00:38:21 Yeah, true. I also noticed that you have a Kubernetes operator just recently released and really quite early. So we developed the first framework for DCOS. Now we know all DCOS did not win the game, but we also started to build our Kubernetes operator as soon as the concept of Kubernetes operator was available, because we believe that is really key for a database, for a persistent service in such an environment. But we started it really, really early. I don't know how long we are already working on the operator, three years, three years, I don't know.
Starting point is 00:39:13 More three years, yeah. I hope you enjoyed the podcast. If you like my work, you can follow Link Data Orchestration on Twitter, LinkedIn, and Facebook.
