Orchestrate all the Things - Streaming graph analytics: From cybersecurity to the world with open source Quine and thatDot. Featuring thatDot CEO / Founder Ryan Wright

Episode Date: April 27, 2022

What do you get when you combine two of the most up-and-coming paradigms in data processing - streaming and graphs? A potential game-changer, and this is the bet first DARPA and now CrowdStrike Falcon Fund have taken on thatDot and its open source framework Quine. The CrowdStrike Falcon Fund is an investment vehicle managed by CrowdStrike, in partnership with Accel, that makes cross-stage private investments within cybersecurity and adjacent markets. DARPA is also known to have an interest in cybersecurity, and this is what motivated the decision to fund the development of a framework recently released by thatDot as an open source project dubbed Quine. Many solutions exist on the market both for streaming data processing as well as for graph analytics, oftentimes working in tandem. However, thatDot Founder and CEO Ryan Wright claims that Quine's technology is unique, enabling it to scale to orders of magnitude beyond what any other system today is capable of. We caught up with Wright to discuss the key premises behind Quine and thatDot, as well as the practical aspects of using Quine and the next steps in its evolution. Article published on VentureBeat

Transcript
Starting point is 00:00:00 Welcome to the Orchestrate All The Things podcast. I'm George Anadiotis and we'll be connecting the dots together. What do you get when you combine streaming and graphs? A potential game changer, and this is the bet first DARPA and now CrowdStrike Falcon Fund have taken on thatDot and its open source framework Quine. The CrowdStrike Falcon Fund is an investment vehicle managed by CrowdStrike, in partnership with Accel, that makes cross-stage private investments within cybersecurity and adjacent markets. DARPA is also known to have an interest in cybersecurity, and this is what motivated the decision to fund the development of a framework recently released by thatDot as an open-source project dubbed Quine. Many solutions exist in the market, both for streaming data processing as well as for graph analytics, oftentimes working in tandem. However, thatDot co-founder and CEO, Ryan Wright, claims that Quine's technology is unique, enabling it to scale to orders of magnitude beyond what any other system today is capable of.
Starting point is 00:01:02 We caught up with Wright to discuss the key premises behind Quine and thatDot, as well as the practical aspects of using Quine and the next steps in its evolution. I hope you will enjoy the podcast. If you like my work, you can follow Linked Data Orchestration on Twitter, LinkedIn, and Facebook. Nice to meet you, George. So my background is in the software engineering and research world. So I've been leading data science and data platform teams for, well, about 20 years, so longer than it's been called data science and data platform. So I've worked as a director of engineering.
Starting point is 00:01:36 I've led a number of research programs funded by DARPA, pushing the edge of computer science. And I've started several different companies along the way. So thatDot is the fourth company that I've started. Interesting. Okay, thank you. So I guess then the next step would be to try and explain Quine in a nutshell. And also, if you'd like to give a little bit of background on the company, thatDot, and obviously those two are connected. Yeah, absolutely. So thatDot is the company that has commercialized and just recently open sourced Quine, the streaming graph. So Quine is a new technology that has been in the works for about seven and a
Starting point is 00:02:27 half years at this point in time. So it is aimed squarely in between two existing worlds. So in between the world of streaming platforms and databases. So it turns out that modern data pipelines, what it takes to actually process huge volumes of data coming through the enterprise, it doesn't fit really well into either of these older paradigms. So event stream processing is very much on the rise and taking the world by storm as companies move from this old static batch processing workflow into more modern methods that try to achieve answers to questions and process data in real time as it flows through their systems. And so technologies like Kafka and Pulsar and all those stream processing technologies have gotten a lot of use and a lot of attention and have done great things in the enterprise world lately. So on the database side, we've had databases for 50 years, and they've all kind of
Starting point is 00:03:26 worked with this same model of that's where your data goes to stay. It sits there until you're ready to ask the question. And what we ran into was just problems repeating, doing the same thing over and over again in the enterprise world of trying to build new data pipelines that have to work with this old data processing and data storage model, which is really outdated and meant for earlier generations of technology and earlier generations of stream processing. So to bring new solutions to this space, we started the Quine project and then just recently open sourced it. Quine is aimed, as I mentioned, between these two worlds of stream processing and data storage and is what we call a streaming graph. So it's kind of like a graph database, but it's really meant for stream processing applications.
Starting point is 00:04:20 And where graph databases have been known to be among the slowest of the bunch in the data storage world, new technology and new innovations have meant that Quine can enter this space with pretty revolutionary capabilities in terms of speed and throughput with the expressivity of graph processing on top of a stream to produce something new and the kind of capabilities that had previously been impossible. Okay, thanks for the introduction to the technology. That was my impression of Quine as well. And to be honest with you, I had not heard of it before,
Starting point is 00:05:02 even though, as I was telling you earlier, I kind of have a soft spot for graph. And just looking at the technology, it looked to me something like, you know, Flink meets Neo4j. And I'm kind of wondering where exactly does it fit in the broader landscape and why the need for it basically. So if we look at things from the streaming side, well, most of those frameworks like Flink and Kafka and so on,
Starting point is 00:05:28 most of them also have some graph libraries or graph processing capabilities. And if you look at things from the graph database side of things, well, many of them also have things such as Kafka connectors and so on. So they also have ways of ingesting data in real time as well. So where does Quine sit in this landscape and what is the unique thing that it brings to the table? Yeah, so I'd say that's the right way to think about it is it's that Flink-meets-Neo4j
Starting point is 00:05:58 intersection where the two worlds of streaming and data processing come together, especially with a focus on graphs and what you can do with a graph. When you come at it through that lens, though, it's very natural to look at it and say, well, is it like this streaming system or is it like this database system? What we found in our research over the years is that there has been a missing piece of the puzzle, that there is an important capability that has been left by the wayside. And it's when putting together these systems as one, so stream processing and data storage, when you put them together and solve the respective challenges from a perspective that isn't one or the other, but it is both, then it brings new capabilities to the table. In our case, what we discovered is that the graph structure and processing data in a graph is actually the key to making it fast.
Starting point is 00:06:59 So I think there's this fairly well-established sentiment in the industry that if you're working with graphs, your option has been basically like Titan or JanusGraph or Neo4j, something in that space, and graphs will be slow and you just have to accept it. If you need graphs to be fast, then you have to go somewhere else and basically make your own and simulate your own graph. But what we discovered through years of research here is that using a graph on stream processing is actually the key to new kinds of capabilities, to achieving new rates of throughput and new expressivity. And that key is really about incremental computation, that the graph data structure tells you about what is the most efficient way to go query the data, to push that query through the graph instead of waiting and then going and polling the graph to ask over time, you know, just,
Starting point is 00:07:53 do you have my answer now? Do you have my answer now? Do you have my answer now? So the capability that we developed is called standing queries, where a query lives inside the graph, you drop it in and it automatically propagates through the graph. It means that answers come back to you. You don't have to go ask over and over and over again. You ask the query once, it lives as a standing query and the answers stream out to you. So this is an example of the kind of capability that is really new in this space. And it leads to just incredible leaps in throughput and performance for the whole system where previous graph technologies
Starting point is 00:08:30 could maybe run in an event stream processing system at a couple thousand events per second. Our customers have used Quine at over a million events per second, and that was really just an arbitrary stopping point. So that incremental computation of being able to use a graph inside of a stream has really been the linchpin for what has honestly been holding the industry back for the last 10 years. So bringing that into the space has really let us achieve some really remarkable new kinds of results. Okay.
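The contrast between polling a graph and dropping in a standing query can be illustrated with a small Python sketch. This is not Quine's implementation or API: the `StandingQuery` class, its `add_edge` method, and the two-hop pattern are hypothetical names chosen for illustration. The idea it shows is incremental computation: each new edge only checks the matches it could newly complete, and results are pushed to a callback instead of being re-queried.

```python
from collections import defaultdict


class StandingQuery:
    """Toy standing query: report each two-hop path a -> b -> c exactly once.

    Hypothetical illustration only; not how Quine is implemented.
    """

    def __init__(self, on_match):
        self.out_edges = defaultdict(set)  # node -> set of successors
        self.in_edges = defaultdict(set)   # node -> set of predecessors
        self.on_match = on_match           # callback that receives new matches
        self.seen = set()                  # matches already reported

    def add_edge(self, src, dst):
        """Ingest one edge and push out only the matches it newly completes."""
        if dst in self.out_edges[src]:
            return  # duplicate edge, nothing new can match
        self.out_edges[src].add(dst)
        self.in_edges[dst].add(src)
        # The new edge as the first hop: src -> dst -> c
        for c in list(self.out_edges[dst]):
            self._emit((src, dst, c))
        # The new edge as the second hop: a -> src -> dst
        for a in list(self.in_edges[src]):
            self._emit((a, src, dst))

    def _emit(self, path):
        if path not in self.seen:
            self.seen.add(path)
            self.on_match(path)


results = []
query = StandingQuery(results.append)
query.add_edge("alice", "laptop")   # no two-hop path exists yet
query.add_edge("laptop", "server")  # completes alice -> laptop -> server
query.add_edge("server", "db")      # completes laptop -> server -> db
```

The work done per event depends on the neighborhood of the new edge rather than on the size of the whole graph, which is the property the "ask once, answers stream out" description relies on.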
Starting point is 00:09:03 It all sounds both very interesting and exotic, to be honest with you. So I have a number of follow-up questions. I guess the most obvious one is, well, how about storage? I noticed that Quine seems to integrate with a number of storage systems ranging from S3 to Apache Cassandra and a few others that escape me, to be honest with you. So does that mean that you don't really have your own storage?
Starting point is 00:09:34 And in that scenario, is storage transient or is it permanent? Yeah, great questions. Our users have been finding a lot of different use cases for applying Quine in their environments. And that leads to needing different kinds of capabilities from the underlying storage layer. So we built Quine to be swappable so that under the hood, this streaming graph, which is fast and operates in memory, but is also stateful and stores data on disk, that it can do that data storage in several different ways, depending on what's needed for the deployment. So by default, Quine supports, like you mentioned, several different technologies. RocksDB is used by default for the local storage. So if you want to save data on the same machine that Quine is running on, RocksDB is a good choice.
Starting point is 00:10:23 Quine supports Cassandra as well. So a lot of the enterprise applications, enterprise deployments take that shape where instances of Quine persist their data to another Cassandra cluster. And that Cassandra cluster can be run and scaled in the enterprise environment in the typical fashion. And then Quine benefits from the just years and years and years of research and work spent on making Cassandra really robust and fast as well. So those swappable data storage layers give users configuration options for what
Starting point is 00:10:59 is best in your environment. And the fact that it is stateful lets us solve some of these really critical, previously impossible to solve challenges. For example, some of our users are in the cybersecurity industry. And this was actually DARPA's interest when they were funding the research where we developed Quine. The goal was to create new techniques and technologies for detecting advanced persistent threats. And the challenge with advanced persistent threats, where a very sophisticated attacker
Starting point is 00:11:30 gets into an enterprise environment and stays there quietly, what's hard about that is there's a huge volume of data all the time. Well, we've got tools that can process data, but to find the attacker, you have to take new data that just arrived. So about what the attacker is doing right now, and you have to combine it with data that might be weeks or months old. So it's very old. It's the needle in the haystack has to be joined in real time with the incoming needle in the event streaming haystack that just arrived. So the combination of fast stream processing, looking for complex expressive patterns over a stateful collection of data
Starting point is 00:12:12 that persists over a long period of time was just previously impossible. So we developed Quine in part to solve that problem and then to generalize to all sorts of other kinds of uses that take that similar shape of processing data in a stream where it is also stateful and needs to still be fast. Yeah, it started sounding exactly like you just described it basically. So a combination of scanning, let's say, incoming data for whatever pattern it is you're looking for,
Starting point is 00:12:46 and at the same time, having an overview of existing data, because that's the only way that you could do what you refer to as standing queries. And I'm wondering, how exactly do you achieve that? Do you use some sort of graph-oriented indexing or something else? Yeah, so the short answer is we achieve it with new technology that we've been building for quite some time. But like all new things, it is built out of old prior things. And so our system is built on top of Akka. Akka is an implementation of the actor model. So it's been used in large enterprise environments for 10 years and running.
Starting point is 00:13:35 So Akka is known and battle tested in this world. It gives us a new kind of capability because Quine's distinction is really that it combines the graph data model used by graph databases with this event stream processing system built on top of the actor model from 50 years ago. That actor model gives us new capabilities for then performing computation with each little piece of data inside the system. So it's fully asynchronous, it's distributed, and it runs in this graph-structured fashion that matches the graph-structured data model. So pairing those two things together, Quine is really the first to do that. And it gives us new capabilities because every node in the graph is now capable of executing arbitrary computation.
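A rough sketch of "every node is an actor" can be written in Python with `asyncio` standing in for Akka. It assumes nothing about Quine's internals: `NodeActor`, `ActorSystem`, and the `"reach"` message are hypothetical names, and the example runs in a single process rather than distributed. Each node owns private state and a mailbox, handles one message at a time, and reacts by sending messages to its neighbors.

```python
import asyncio


class NodeActor:
    """One graph node modelled as an actor: private state plus a mailbox."""

    def __init__(self, node_id, system):
        self.node_id = node_id
        self.system = system
        self.mailbox = asyncio.Queue()
        self.neighbors = []
        self.reached = False

    async def run(self):
        # Process messages one at a time, as an actor does.
        while True:
            message = await self.mailbox.get()
            self.handle(message)
            self.system.done()

    def handle(self, message):
        # On first contact, mark this node and forward along outgoing edges.
        if message == "reach" and not self.reached:
            self.reached = True
            for neighbor in self.neighbors:
                self.system.send(neighbor, "reach")


class ActorSystem:
    """Holds the actors and tracks in-flight messages to detect quiescence."""

    def __init__(self, edges):
        self.actors = {}
        self.pending = 0
        self.idle = asyncio.Event()
        for src, dst in edges:
            self.actor(src).neighbors.append(dst)
            self.actor(dst)

    def actor(self, node_id):
        if node_id not in self.actors:
            self.actors[node_id] = NodeActor(node_id, self)
        return self.actors[node_id]

    def send(self, node_id, message):
        self.pending += 1
        self.idle.clear()
        self.actors[node_id].mailbox.put_nowait(message)

    def done(self):
        self.pending -= 1
        if self.pending == 0:
            self.idle.set()

    async def propagate(self, start):
        tasks = [asyncio.create_task(a.run()) for a in self.actors.values()]
        self.send(start, "reach")
        await self.idle.wait()  # wait until no messages remain in flight
        for task in tasks:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)
        return sorted(n for n, a in self.actors.items() if a.reached)


def reachable(edges, start):
    """Run the actor system to completion; `start` must appear in `edges`."""
    async def main():
        return await ActorSystem(edges).propagate(start)
    return asyncio.run(main())
```

Calling `reachable([("a", "b"), ("b", "c"), ("d", "a")], "a")` floods a message from `a` along outgoing edges, with no central loop walking the graph: the traversal emerges from nodes reacting to messages.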
Starting point is 00:14:28 So it can go do whatever it needs to do in response to new data streaming in and new queries taking action throughout the graph. Yeah, I was kind of afraid you would say that. Because again, just looking around a little bit before the conversation, well, the actor pattern and Akka came up prominently, and it looked like your system was relying heavily on those. So the inevitable question, which I'm sure you've heard many, many times before, is, well, yes, fine, that all sounds great, but how easy is that really to use for the average developer? I mean, just in terms of background, I'm relatively familiar with Akka myself in the sense that, well, I've heard of it before,
Starting point is 00:15:19 Lightbend refers to it and uses it as well. But I'm pretty sure the vast majority of developers have not. So how does your system integrate in typical software development structure? Yeah, so to use Quine, you don't have to know anything at all about Akka. Our goal with Quine was to make it as easy and familiar as using something like a database. So really to use Quine, you describe it in database queries. Quine implements the Cypher language, which is very similar to SQL. So you can use Quine entirely just by expressing Cypher queries. Say, here's a pattern that I'd like to watch for in the data.
Starting point is 00:16:06 And you just drop that in as a database query, as a Cypher query, drop that into Quine. Quine will turn that into the streaming components that it needs to propagate throughout the graph automatically and stream results out. And really, there are two sides to using Quine. There's the ingesting data, what we call event-driven data. So data streams in from a source like Kafka or Redpanda or Kinesis. Data streams in and you use a query to build it into a graph structure. And the second piece is just use a second query to watch for whatever you'd like to find in that graph. And then every time a new one of those is produced, events stream out immediately. That's the data-driven events side.
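The two sides described here, one query that builds events into a graph and a second that watches the graph and streams results out, can be mimicked in plain Python. In Quine both steps would be expressed as Cypher queries; the functions below are stand-ins, and the event fields (`user`, `host`), the "three distinct hosts" pattern, and all function names are assumptions made up for this sketch.

```python
from collections import defaultdict


def ingest(event, graph):
    """Stand-in for an ingest query: turn one event into a user -> host edge.

    Returns the user whose neighborhood changed, or None if nothing changed.
    """
    user, host = event["user"], event["host"]
    if host in graph[user]:
        return None
    graph[user].add(host)
    return user


def watch(user, graph, alerted, min_hosts):
    """Stand-in for a standing query: fire once per user who has touched
    at least `min_hosts` distinct hosts."""
    if user is None or user in alerted:
        return None
    if len(graph[user]) >= min_hosts:
        alerted.add(user)
        return {"user": user, "hosts": sorted(graph[user])}
    return None


def run_pipeline(events, sink, min_hosts=3):
    """Stream events in, build the graph, and push each new match to `sink`."""
    graph = defaultdict(set)
    alerted = set()
    for event in events:
        changed = ingest(event, graph)
        result = watch(changed, graph, alerted, min_hosts)
        if result is not None:
            sink(result)  # e.g. publish to a topic, write a file, log it


events = [
    {"user": "alice", "host": "h1"},
    {"user": "bob", "host": "h1"},
    {"user": "alice", "host": "h2"},
    {"user": "alice", "host": "h2"},  # duplicate, changes nothing
    {"user": "alice", "host": "h3"},  # third distinct host: match streams out
    {"user": "alice", "host": "h4"},  # already reported, no second alert
]
alerts = []
run_pipeline(events, alerts.append)
```

Here `alerts.append` plays the role of the downstream system; the match is emitted at the moment the event that completes the pattern arrives, not on a later query.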
Starting point is 00:16:54 So the structure of the data itself drives events that flow out to the next system or to take whatever action is desired. And in terms of using it, we have a community of users who have already started to build what we call recipes. Recipes are packaged configurations of that model I just described of data streaming in, building a graph, monitoring that graph, and data streaming out. Recipes package up, for certain kinds of data, everything that you need to turn that into answers that just stream out. So it becomes as simple to run Quine in an environment like that as just running a single command line tool. So you just say start Quine with -r, to say use a recipe, and then the name of the recipe that you'd like to use. And it'll fetch it from the
Starting point is 00:17:44 community hosted collection of recipes, or you could pass it a file yourself for a recipe that you've been working on to describe. For example, here are my Apache logs, and I would like to ingest those and build a graph out of those logs and then monitor them for some kind of configuration errors or maybe cybersecurity vulnerabilities, monitor it for that kind of activity coming out of the Apache logs and stream those results out to a dashboard that we maintain. That whole process can then be accomplished with just a single command, to say, run Quine with this Apache log parsing recipe and then feed it your data and it streams results out.
Starting point is 00:18:25 Okay, yeah, I would say that I'm a bit relieved to know that people don't actually need to get into the weeds with Akka to use Quine. And yeah, definitely choosing Cypher as the query language interface sounds like a good decision. So is that the only requirement to start using the system? So anyone familiar with Cypher can just start hacking away?
Starting point is 00:18:56 Yeah, that's exactly right. If you're interested in just some of the recipes that are on there, you could even skip using Cypher because they're already baked into each recipe there. But if somebody comes along and says, I've got a new use case, or I've got my own data set that I'd like to interpret in a streaming fashion, then all they have to do is write a Cypher query to ingest the data and build it into a graph. And then maybe one more or several more, whatever is useful. One more Cypher query to stream results out, to say, watch for this pattern,
Starting point is 00:19:30 and every time there's a result, send it out to Slack or send it out to a Kafka topic, or write it to a file or log it to the console. Okay, I see. Yeah, that's an interesting approach. Earlier, you also mentioned the origins of the project and basically the reason, the motivation for DARPA to have funded this research. I was kind of guessing that cybersecurity would probably be the use case of interest for them. And it sounds like that was the case indeed.
Starting point is 00:20:07 And I wonder, well, first, whether this is actually in use in any other governmental organizations, be it DARPA or other organizations today, and what other use cases do people use Quine for today? Yeah, so it really is a general purpose tool. And we're seeing already dozens of teams from lots of different companies coming in different verticals with their goals and seeing some really strong results. So those verticals include cybersecurity, like we mentioned, which is a good strong application because so many cybersecurity
Starting point is 00:20:42 problems really look like graph problems about connecting users and computers and files and processes and all these different kinds of things together and doing so in real time on a high volume data. So cybersecurity is a very strong application for this. But in addition to cybersecurity, there are companies using it for blockchain analysis. Big banks and other fintech companies are using it. And log processing companies and CDN companies are using it to understand their own infrastructure better, whether that's by monitoring logs or by hooking it up to Kubernetes event streams and using it to build and understand a model of their system as it runs so they can see what's happening and even use it to interpret those logs and events
Starting point is 00:21:34 and trigger some action that comes out of it. Okay, I see. And, well, since we started talking about use cases, it's also a good segue to talk about the actual trigger for the conversation. We did things in a bit of an unorthodox way because the trigger for the conversation is the fact that you're raising some money basically. And so I was wondering if you'd like to share some key facts and metrics about thatDot, the company. So things like, well, how many people you are employing at the moment and obviously the specifics around the funding. And also I'm interested in the thinking
Starting point is 00:22:22 behind going open source or rather open core. Because as I understand, there is the open source project and then thatDot also offers some services, but also some value adds around the project. Yeah, absolutely. So we're still a young company. It's early days for us. We had a fortunate start in that the funding from DARPA and the research programs gave us the means and the perfect opportunity to go apply this to a really critical problem, especially in the cybersecurity industry. And so that has led to a number of different applications in the cybersecurity community throughout the world. And most recently, we're thrilled to have an investment from CrowdStrike coming in because there are so many use cases in this space. And CrowdStrike is the leader in the field and some
Starting point is 00:23:18 of the best thinking on the front edge of that space and aiming for how to do new things in their space. And so we're thrilled to see applications emerging all over the place in the cybersecurity world. But it is still early days for us as a company. We're still relatively small. We're bringing this technology to market now and looking ahead to what's next for us as a company right now. We're focused on the launch and the open source launch that we've just kicked off just recently.
Starting point is 00:23:51 But looking ahead at what comes next, we expect later on this year, we'll be raising a Series A. Okay, I see. And what about open sourcing the project? Would you like to elaborate a bit on the thinking behind that? And also what sets apart the open source version as opposed to the enterprise product that thatDot is bringing to the market? Yeah, absolutely. It's, you know, I come from an engineering background and working in software engineering,
Starting point is 00:24:29 you can't avoid using open source. And there's this very real sense that open source is everywhere. It's a key part of how we build all of the systems that run today. And so in a lot of ways, open sourcing a technology like this is just a natural response to that sort of community effort where I've benefited, we've all benefited from open
Starting point is 00:24:53 source projects in the past. And so a good thing for us to do building on top of those and trying to advance the state of the art is let's open source what we've built so that people can take it and use it in ways that even surprise us that we hadn't planned for and just help them achieve success, get the value that they want to out of that system and just make it open and free for everybody to use. And developers who want to understand the tools before they go use them and deploy them can go look through the code, understand the details, make contributions, kind of, you know, different developers take different tacks there. So we just wanted to enable all of them to do what comes pretty natural to software developers in this space is to go look at the code, to understand it,
Starting point is 00:25:41 to know that it'll be there and they can change it, they can tweak it if they need to. And then for thatDot as a company, the way we apply our open core model is to say, go ahead and open source that, or open source Quine, make it available to everyone. And for the companies who have a level of scale that they want to reach and they want help doing that, they can come to thatDot, the company, for that kind of support. So there's an enterprise version of Quine that is focused on those enterprise specific features around resilient clustering of the system and scaling it to arbitrarily large sizes of data volume so that you can get up to millions per second or beyond. So that use of scale really, you know, drives someone to the enterprise version of Quine.
Starting point is 00:26:36 But then lastly also, Quine is itself a platform for the next generation of AI that is just emerging and starting to leave the research labs. This generation is the graph AI generation. There's all sorts of new techniques just leaving the research lab around graph recommender systems, graph neural networks, graph anomaly detection. thatDot has built an instance of this, a new technique that we call novelty scoring. It is a new way to do anomaly detection, but by using a graph so that you can take the categorical data, which is anything that's not a number, use that directly, build it into a
Starting point is 00:27:22 graph, analyze that graph live in real time, and get extremely useful, informative scores without any noise coming out of the system to tell you how novel is every piece of data along the way. So enterprise users who have applications for either novelty detection or this upcoming generation of technologies for graph AI will find a healthy home here in Quine because it's the perfect platform on which to deploy these sorts of graph AI systems into the modern enterprise. Okay. It sounds like, well, as you also said yourself, thatDot is a relatively young company. So apparently, you have lots of work ahead of you. So let's wrap up by asking you to share what's your roadmap basically. And that's obviously connected to what you're going to be using
Starting point is 00:28:17 the funding for. Yeah, so we are focused right now on serving the open source community. So we just released Quine as an open source project. We're seeing great adoption and lots of exciting use cases coming out of that community. So we're focused right now really on serving that open source community, getting the word out, creating more resources for users to understand Quine, to find applications for it and go use it for free in their own environments. And then supporting the enterprise users who have special support needs or want to scale it beyond what they can do on their own with the open source application. They can rely on help from experts here at thatDot who built the system. And so that's really the path forward for us. And that includes evolving the
Starting point is 00:29:07 core software based on community input and what users want from it and helping to guide and shepherd and merge contributions from users anywhere in the world. And then also bringing that application to the enterprise, serving those enterprise needs that are a bit more specific. And that means adding certain kinds of functionality and features so that those enterprise users can use it seamlessly in their environment. That's really what's ahead for us. I hope you enjoyed the podcast. If you like my work, you can follow Linked Data Orchestration
Starting point is 00:29:43 on Twitter, LinkedIn, and Facebook.
