Orchestrate all the Things - Streaming graph analytics: From cybersecurity to the world with open source Quine and thatDot. Featuring thatDot CEO / Founder Ryan Wright
Episode Date: April 27, 2022
What do you get when you combine two of the most up-and-coming paradigms in data processing - streaming and graphs? A potential game-changer, and this is the bet first DARPA and now CrowdStrike Falcon Fund have taken on thatDot and its open source framework Quine. The CrowdStrike Falcon Fund is an investment vehicle managed by CrowdStrike, in partnership with Accel, that makes cross-stage private investments within cybersecurity and adjacent markets. DARPA is also known to have an interest in cybersecurity, and this is what motivated the decision to fund the development of a framework recently released by thatDot as an open source project dubbed Quine. Many solutions exist on the market both for streaming data processing as well as for graph analytics, oftentimes working in tandem. However, thatDot Founder and CEO Ryan Wright claims that Quine's technology is unique, enabling it to scale to orders of magnitude beyond what any other system today is capable of. We caught up with Wright to discuss the key premises behind Quine and thatDot, as well as the practical aspects of using Quine and the next steps in its evolution. Article published on VentureBeat
Transcript
Welcome to the Orchestrate All The Things podcast.
I'm George Anadiotis and we'll be connecting the dots together.
What do you get when you combine streaming and graphs?
A potential game changer, and this is the bet first DARPA and now CrowdStrike Falcon Fund have taken on thatDot and its open source framework Quine. The CrowdStrike Falcon Fund is an investment vehicle managed by CrowdStrike, in partnership with Accel, that makes cross-stage private investments within cybersecurity and adjacent markets.
DARPA is also known to have an interest in cybersecurity, and this is what motivated the decision to fund the development of a framework recently released by thatDot as an open source project dubbed Quine. Many solutions exist in the market, both for streaming data processing
as well as for graph analytics, oftentimes working in tandem.
However, thatDot founder and CEO, Ryan Wright, claims that Quine's technology is unique,
enabling it to scale to orders of magnitude beyond what any other system today is capable of.
We caught up with Wright to discuss the key premises behind Quine and thatDot, as well
as the practical aspects of using Quine and the next steps in its evolution.
I hope you will enjoy the podcast.
If you like my work, you can follow Linked Data Orchestration on Twitter, LinkedIn, and Facebook.
Nice to meet you, George.
So my background is in the software engineering and research world. So I've been leading data science and data platform teams for, well, about 20 years,
so longer than it's been called data science and data platform.
So I've worked as a director of engineering.
I've led a number of research programs funded by DARPA, pushing the edge of computer science.
And I've started several different companies
along the way. So thatDot is the fourth company that I've started.
Interesting. Okay, thank you. So I guess then the next step would be to try and explain
Quine in a nutshell. And also, if you'd like to give a little bit of background on
the company. So thatDot, and obviously those two are connected. Yeah, absolutely. So thatDot
is the company that has commercialized and just recently open sourced Quine, the streaming
graph. So Quine is a new technology that has been in the works for about seven and a
half years at this point in time. So it is aimed squarely in between two existing worlds. So in
between the world of streaming platforms and databases. So it turns out that modern data
pipelines, what it takes to actually process huge volumes of data coming through the enterprise,
it doesn't fit really well into either of these older paradigms. So event stream processing is very much on the rise and taking the world by storm as companies move from this old static
batch processing workflow into more modern methods that try to achieve answers to questions and process data in real
time as it flows through their systems. And so technologies like Kafka and Pulsar and all those
stream processing technologies have gotten a lot of use and a lot of attention and have done great
things in the enterprise world lately. So on the database side, we've had databases for 50 years, and they've all kind of
worked with this same model of that's where your data goes to stay. It sits there until you're
ready to ask the question. And what we ran into was just problems repeating, doing the same thing
over and over again in the enterprise world of trying to build new data pipelines that have to work with this old data processing and data storage model, which is really
outdated and meant for earlier generations of technology and earlier generations of stream
processing. So to bring new solutions to this space, we started the Quine project and then
just recently open sourced it. Quine is aimed, as I mentioned, between these two worlds of stream processing and data storage
and is what we call a streaming graph.
So it's kind of like a graph database, but it's really meant for stream processing applications.
And where graph databases have been known to be among the slowest of the bunch in the
data storage world, new technology and new innovations have meant that Quine can enter
this space with pretty revolutionary capabilities in terms of speed and throughput with the
expressivity of graph processing on top of a stream to produce something new and the kind
of capabilities that had previously been impossible.
Okay, thanks for the introduction to the technology.
That was my impression of Quine as well.
And to be honest with you, I had not heard of it before,
even though, as I was telling you earlier, I kind of have a soft spot for graph.
And just looking at the technology, it looked to me something like, you know,
Flink meets Neo4j.
And I'm kind of wondering where exactly does it fit
in the broader landscape
and why the need for it basically.
So if we look at things from the streaming side,
well, most of those frameworks like Flink and Kafka and so on,
most of them also have some graph libraries
or graph processing capabilities.
And if you look at things from the graph database side of things,
well, they also, many of them have things such as Kafka connectors and so on.
So they also have ways of ingesting data in real time as well.
So where does Quine sit in this landscape and what is the unique thing that it brings
to the table?
Yeah, so I'd say that's the right way to think about it, it's that Flink-meets-Neo4j
intersection, where the two worlds of streaming and data processing come together,
especially with a focus on graphs and what you can do with a graph.
When you come at it through that lens, though, it's very natural to look at it and say, well, is it like this streaming system or is it like this database system? And what we've found in our research over the years has been that there has been a missing piece of the puzzle,
that there is an important capability that has been left by the wayside. And it's when putting
together these systems as one, so stream processing and data storage, when you put them together
and solve the respective challenges from a perspective that isn't one or the other,
but it is both, then it brings new capabilities to the table. In our case, what we discovered is that
the graph structure and processing data in a graph is actually the key to making it fast.
So I think there's this fairly well-established sentiment in the industry that
if you're working with graphs,
your option has been basically like Titan or JanusGraph or Neo4j, something in that space,
and graphs will be slow and you just have to accept it. If you need graphs to be fast,
then you have to go somewhere else and basically make your own and simulate your own graph.
But what we discovered through years of research here is that using a graph on stream processing is actually the key to new kinds of capabilities, to achieving new rates of throughput and new expressivity.
And that key is really about incremental computation, that the graph data structure tells you about what is the most efficient way to go query the data, to push that query through the graph
instead of waiting and then going and pulling the graph to ask over time, you know, just,
do you have my answer now? Do you have my answer now? Do you have my answer now?
So the capability that we develop called standing queries, where a query lives inside the graph,
you drop it in and it automatically propagates through the graph. It means that answers come back to you. You don't have to go
ask over and over and over again. You ask the query once, it lives as a standing query and the
answers stream out to you. So this is an example of the kind of capability that is really new in
this space. And it leads to just incredible leaps in throughput
and performance for the whole system
where previous graph technologies
could maybe run in an event stream processing system
at a couple thousand events per second.
Our customers have used Quine
over a million events per second,
and that was really just an arbitrary stopping point.
So that incremental computation of being able to use a graph inside of a stream has really been the linchpin for what has honestly been holding the industry back for the last 10 years.
So bringing that into the space has really let us achieve some really remarkable new kinds of results.
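As a toy illustration of that contrast (repeatedly polling a store versus registering a question once and having matches pushed out), here is a minimal Python sketch. This is not how Quine is implemented; it only shows the general shape of the standing-query idea:

```python
# Toy contrast: classic polling vs a push-based "standing query".
# Illustrative only; this is not Quine's implementation.

class PollingStore:
    """Database model: data sits until you come back and ask a question."""
    def __init__(self):
        self.events = []

    def ingest(self, event):
        self.events.append(event)

    def query(self, predicate):
        # Every poll rescans all stored data: "do you have my answer now?"
        return [e for e in self.events if predicate(e)]

class StandingQueryStore:
    """Streaming model: register the question once; matches are pushed out."""
    def __init__(self):
        self.queries = []  # (predicate, callback) pairs

    def register(self, predicate, callback):
        self.queries.append((predicate, callback))

    def ingest(self, event):
        # Each new event is checked incrementally, once, at ingest time.
        for predicate, callback in self.queries:
            if predicate(event):
                callback(event)

# Usage: results stream out as data arrives, with no repeated polling.
matches = []
store = StandingQueryStore()
store.register(lambda e: e["severity"] == "high", matches.append)
for ev in [{"severity": "low"}, {"severity": "high"}]:
    store.ingest(ev)
```

The polling version does redundant work on every query; the standing version does a small incremental check per event, which is the intuition behind the throughput gains described above.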
Okay.
It all sounds very both interesting and exotic
to be honest with you.
So I have a number of follow-up questions.
I guess the most obvious one is, well, how about storage?
I noticed that Quine seems to integrate
with a number of storage systems ranging from S3
to Apache Cassandra and a few others that escape me at the moment, to be honest with you.
So does that mean that you don't really have your own storage?
And in that scenario, is storage transient or is it permanent?
Yeah, great questions.
Our users have been finding a lot of different use cases for applying Quine in their environments.
And that leads to needing different kinds of capabilities from the underlying storage layer.
So we built Quine to be swappable so that under the hood, this streaming graph, which is fast and operates in memory, but is also stateful and stores data on disk, that it can do that data storage in several different ways,
depending on what's needed for the deployment. So by default, Quine supports, like you mentioned,
several different technologies. RocksDB is used by default for the local storage. So if you want
to save data on the same machine that Quine is running on, RocksDB is a good choice.
Quine supports Cassandra as well. So a lot of the
enterprise applications, enterprise deployments take that shape where instances of Quine persist
their data to another Cassandra cluster. And that Cassandra cluster can be run and scaled in the
enterprise environment in the typical fashion. And then Quine benefits from the just years and years and years
of research and work spent on making Cassandra really robust
and fast as well.
So those swappable data storage layers
give users configuration options for what
is best in your environment.
And the fact that it is stateful lets
us solve some of these really
critical, previously impossible to solve challenges. For example, some of our users are in the
cybersecurity industry. And this was actually DARPA's interest when they were funding the
research where we developed Quine. The goal was to create new techniques and technologies for
detecting advanced persistent threats.
And the challenge with advanced persistent threats, where a very sophisticated attacker
gets into an enterprise environment and stays there quietly, what's hard about that is there's
a huge volume of data all the time. Well, we've got tools that can process data, but to find the
attacker, you have to take new data that just arrived. So about what the attacker
is doing right now, and you have to combine it with data that might be weeks or months old.
So it's very old. It's the needle in the haystack has to be joined in real time with the incoming
needle in the event streaming haystack that just arrived. So the combination of fast stream processing,
looking for complex expressive patterns
over a stateful collection of data
that persists over a long period of time
was just previously impossible.
So we developed Quine in part to solve that problem
and then to generalize to all sorts of other kinds of uses
that take that
similar shape of processing data in a stream where it is also stateful and needs to still be fast.
Yeah, it started sounding exactly like you just described it basically. So a combination of
scanning, let's say, incoming data for whatever pattern it is you're looking for,
and at the same time, having an overview of existing data, because that's the only way
that you could do what you refer to as standing queries.
And I'm wondering, how exactly do you achieve that? Do you use some sort of graph-oriented indexing or something else?
Yeah, so the short answer is we achieve it with new technology that we've been building for quite some time.
But like all new things, it is built out of old prior things.
And so our system is built on top of Akka.
Akka is an implementation of the actor model.
So it's been used in large enterprise environments for 10 years and running.
So Akka is known and battle tested in this world. And what we've built on top of it is a new kind of capability, because Quine's distinction is really that it combines the
graph data model used by graph databases with this event stream processing system built
on top of the actor model from 50 years ago.
That actor model gives us new capabilities for then performing computation with each
little piece of data inside the system. So it's fully asynchronous,
it's distributed, and it runs in this graph-structured fashion that matches the
graph-structured data model. So pairing those two things together, Quine is really the first to do
that. And it gives us new capabilities because every node in the graph is now capable of executing arbitrary computation.
So it can go do whatever it needs to do in response to new data streaming in and new queries taking action throughout the graph.
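The actor idea itself can be sketched in a few lines: each graph node owns its state and a mailbox, and reacts to messages independently. The Python below is only a toy illustration of that pattern, with hypothetical message kinds; Quine builds on Akka, which adds scheduling, distribution, and supervision on top of it.

```python
# Minimal sketch of the actor idea applied to graph nodes.
# Message kinds here are hypothetical; this is not Quine's internal API.
from collections import deque

class NodeActor:
    def __init__(self, node_id):
        self.node_id = node_id
        self.properties = {}
        self.edges = set()
        self.mailbox = deque()  # each actor processes its own messages

    def send(self, message):
        self.mailbox.append(message)

    def process_one(self):
        if not self.mailbox:
            return
        kind, payload = self.mailbox.popleft()
        # Each node decides for itself how to react to a message.
        if kind == "set_property":
            self.properties.update(payload)
        elif kind == "add_edge":
            self.edges.add(payload)

# Usage: messages drive per-node computation, asynchronously in a real
# actor system (here processed in a simple loop for clarity).
node = NodeActor("host-1")
node.send(("set_property", {"os": "linux"}))
node.send(("add_edge", "process-42"))
while node.mailbox:
    node.process_one()
```

Because every node is an independent unit of computation with its own mailbox, work can be spread across cores and machines without a global lock, which is the property the graph-structured processing described above relies on.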
Yeah, I was kind of afraid you would say that.
Because again, just looking around a little bit before the conversation,
well, the actor pattern and Akka came up prominently, and it looked like your system was relying heavily on those.
So the inevitable question, which I'm sure you've heard many, many times before, is, well, yes, fine, that all sounds great,
but how easy is that really to use for the average developer?
I mean, just in terms of background, I'm relatively familiar with Akka myself
in the sense that, well, I've heard of it before,
Lightbend is behind it and uses it as well.
But I'm pretty sure the vast majority of developers have not.
So how does your system integrate in typical software development structure?
Yeah, so to use Quine, you don't have to know anything at all about Akka.
Our goal with Quine was to make it as easy and familiar as using something
like a database. So really to use Quine, you describe it in database queries. Quine implements
the Cypher language, which is very similar to SQL. So you can use Quine entirely just by
expressing Cypher queries. Say, here's a pattern that I'd like to watch for in the data.
And you just drop that in as a database query, as a Cypher query, drop that into Quine. Quine
will turn that into the streaming components that it needs to propagate throughout the graph
automatically and stream results out. And it's really, there's two sides to it, to using Quine. There's the ingesting data,
what we call event-driven data. So data streams in from a source like Kafka or Redpanda or
Kinesis. Data streams in and you use a query to build it into a graph structure.
And the second piece is just use a second query to watch for whatever you'd like to find in that graph.
And then every time a new one of those is produced, events stream out immediately.
That's the data-driven events side.
So the structure of the data itself drives events that flow out to the next system or to take whatever action is desired.
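As a rough sketch of those two sides, the pair of queries below is illustrative only. The `$that` parameter (the incoming record) and the `idFrom` function for deterministic node ids follow my understanding of Quine's Cypher conventions, but the exact names and shapes should be checked against the project's documentation:

```cypher
// Ingest side (illustrative): build each incoming login event into the graph.
MATCH (u), (h)
WHERE id(u) = idFrom('user', $that.user)
  AND id(h) = idFrom('host', $that.host)
SET u.name = $that.user, h.name = $that.host
CREATE (u)-[:LOGGED_INTO]->(h)

// Standing-query side (illustrative): registered once; every new match
// streams out as soon as the graph produces it.
MATCH (u)-[:LOGGED_INTO]->(h)
WHERE h.name = 'sensitive-server'
RETURN u.name AS user
```

The first query shapes events into nodes and edges as they arrive; the second is the data-driven side, emitting a result each time the pattern newly matches.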
And in terms of using it, we have a community of users who
actually started to already build what we call recipes. Recipes are packaged configurations of
that model I just described of data streaming in, building a graph, monitoring that graph,
and data streaming out. Recipes package up for certain kinds of data, everything that you need to turn
that into answers that just stream out. So it becomes as simple to run Quine in an environment
like that as just running a single command line tool. So you just say start Quine dash R to say
use a recipe and then the name of the recipe that you'd like to use. And it'll fetch it from the
community hosted collection of recipes, or you could pass it a file yourself for a recipe that you've been
working on to describe. For example, here are my Apache logs, and I would like to ingest those and
build a graph out of those logs and then monitor them for some kind of configuration errors or
maybe cybersecurity vulnerabilities, monitor it for that kind
of activity coming out of the Apache logs and stream those results out to a dashboard
that we maintain.
That whole process can then be accomplished with just a single command, to say, run Quine with
this Apache log parsing recipe and then feed it your data and it streams results out.
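A recipe along the lines Wright describes might look roughly like the YAML below. The field names here are a sketch of the recipe idea, not the exact current schema, so check Quine's recipe documentation before relying on them:

```yaml
# Illustrative recipe sketch: ingest Apache access logs, flag server errors.
version: 1
title: apache-log-monitor
ingestStreams:
  - type: FileIngest
    path: /var/log/apache2/access.log
    format:
      type: CypherLine
      query: |-
        MATCH (req) WHERE id(req) = idFrom('request', $that)
        SET req.line = $that
standingQueries:
  - pattern:
      type: Cypher
      query: |-
        MATCH (req) WHERE req.line CONTAINS ' 500 '
        RETURN id(req) AS id
    outputs:
      log-errors:
        type: PrintToStandardOut
```

Running it would then be a single command along the lines of `java -jar quine.jar -r apache-log-monitor` (jar and recipe names illustrative), matching the run-Quine-with-a-recipe flow described above.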
Okay, yeah, I would say that I'm a bit relieved to know that people don't actually need
to get into the weeds with Akka to use Quine.
And yeah, definitely choosing Cypher
as the query language interface
sounds like a good decision.
So is that the only requirement to start using the system?
So anyone familiar with Cypher can just start hacking away?
Yeah, that's exactly right.
If you're interested in just some of the recipes that are on there, you could even skip using
Cypher because they're already baked into each recipe there. But if somebody comes along and says, I've got a new
use case, or I've got my own data set that I'd like to interpret in a streaming fashion,
then all they have to do is write a Cypher query to ingest the data and build it into a graph.
And then maybe one more or several more, whatever is useful.
One more Cypher query to stream results out,
to say, watch for this pattern,
and every time there's a result,
send it out to Slack or send it out to a Kafka topic,
or write it to a file or log it to the console.
Okay, I see.
Yeah, that's an interesting approach.
Earlier, you also mentioned the origins of the project and basically the reason, the motivation for DARPA to have funded this research.
I was kind of guessing that cybersecurity would probably be the use case of interest for them.
And it sounds like that was the case indeed.
And I wonder, well, first,
whether this is actually in use in any other governmental organizations,
be it DARPA or other organizations today,
and what other use cases do people use Quine for today?
Yeah, so it really is a general purpose tool. And we're seeing
already dozens of teams from lots of different companies coming in different verticals with
their goals and seeing some really strong results. So those verticals include cybersecurity,
like we mentioned, which is a good strong application because so many cybersecurity
problems really look like graph problems about
connecting users and computers and files and processes and all these different kinds of
things together and doing so in real time on high-volume data. So cybersecurity is a very strong
application for this. But in addition to cybersecurity, there are companies using it
for blockchain analysis. Big banks and other fintech companies are using it. And log processing
companies and CDN companies are using it to understand their own infrastructure better,
whether that's by monitoring logs or by hooking it up to Kubernetes event streams and using it to build and understand a model of their system as it runs
so they can see what's happening and even use it to interpret those logs and events
and trigger some action that comes out of it.
Okay, I see.
And, well, since we started talking about use cases, it's also a good segue to talk about the actual trigger for the conversation.
We did things in a bit of an unorthodox way, because the trigger for the conversation is the fact that you're raising some money, basically. And so I was wondering if you'd like to share some key facts and metrics about thatDot, the company.
So things like, well,
how many people you are employing at the moment
and obviously the specifics around the funding.
And also I'm interested in the thinking
behind going open source, or rather open core. Because as I understand it, there is the open source project, and then thatDot also offers some services, but also some value adds around the project.
Yeah, absolutely. So we're still a young company. It's early days for us. We had a fortunate
start in that the funding from DARPA and the research programs gave us the means and the
perfect opportunity to go apply this to a really critical problem, especially in the cybersecurity
industry. And so that has led to a number of different applications in the cybersecurity
community throughout the world.
And most recently, we're thrilled to have an investment from CrowdStrike coming in because
there are so many use cases in this space. And CrowdStrike is the leader in the field and some
of the best thinking on the front edge of that space and aiming for how to do new things in their space.
And so we're thrilled to see applications emerging all over the place in the cybersecurity
world.
But it is still early days for us as a company.
We're still relatively small.
We're bringing this technology to market now and looking ahead to what's next for us as
a company right now.
We're focused on the launch and the open source launch that we've just kicked off just recently.
But looking ahead at what comes next, we expect later on this year, we'll be raising a Series
A.
Okay, I see.
And what about the open sourcing, the project?
Would you like to elaborate a bit on the thinking behind that?
And also what sets apart the open source version as opposed to the enterprise product that thatDot is bringing to the market?
Yeah, absolutely. It's, you know, I come from an engineering background
and working in software engineering,
you can't avoid using open source.
And there's this very real sense
that open source is everywhere.
It's a key part of how we build all of the systems
that run today.
And so in a lot of ways,
open sourcing a technology like this is just a natural
response to that sort of community effort where I've benefited, we've all benefited from open
source projects in the past. And so a good thing for us to do building on top of those and trying
to advance the state of the art is let's open source what we've built so that people can take
it and use it in ways that even surprise us that we hadn't planned for and just help them achieve
success, get the value that they want to out of that system and just make it open and free for
everybody to use. And developers who want to understand the tools before they go use them
and deploy them can go look through the code, understand the details, make contributions, kind of, you know, different developers take
different tacks there. So we just wanted to enable all of them to do what comes pretty natural
to software developers in this space is to go look at the code, to understand it,
to know that it'll be there and they can change it, they can tweak it if they need to.
And then for thatDot as a company, the way we apply our open core model is to say, go ahead and
open source that, or open source Quine, make it available to everyone. And for the companies who
have a level of scale that they want to reach and they want help doing that, they can come to thatDot for that kind of support.
So there's an enterprise version of Quine that is focused on those enterprise specific features around resilient clustering of the system
and scaling it to arbitrarily large sizes of data volume so that you can get up to millions per second or beyond.
So that use of scale really, you know,
drives someone to the enterprise version of Quine.
But then lastly also,
Quine is itself a platform for the next generation of AI
that is just emerging and starting to leave the research labs.
This generation is the graph AI generation.
There's all sorts of new techniques just leaving the research lab around graph recommender systems, graph neural networks, graph anomaly detection.
thatDot has built an instance of this, a new technique that we call
novelty scoring. It is a new way to do anomaly detection, but by using a graph so that you can
take the categorical data, which is anything that's not a number, use that directly, build it into a
graph, analyze that graph live in real time, and get extremely useful,
informative scores without any noise coming out of the system to tell you how novel is every piece
of data along the way. So enterprise users who have applications for either novelty detection
or this upcoming generation of technologies for graph AI,
will find a healthy home here in Quine, because it's the perfect platform on which to deploy these sorts of graph AI systems into the modern enterprise.
Okay. It sounds like, well, as you also said yourself, thatDot is a relatively young company. So apparently, you have lots of work ahead of you. So let's wrap
up by asking you to share what's your roadmap, basically.
And that's obviously connected to what you're going to be using
the funding for.
Yeah, so we are focused right now on serving the open
source community. So we just released Quine as an open source project.
We're seeing great adoption and lots of exciting use cases coming out of that community.
So we're focused right now really on serving that open source community, getting the word out, creating more resources for users to understand Quine, to find applications for it and go use it for free in their own environments.
And then supporting the enterprise users who have special support needs or want to scale it beyond
what they can do on their own with the open source application. They can rely on help from
experts here at that dot who built the system. And so that's really the path forward for us. And that includes evolving the
core software based on community input and what users want from it and helping to guide and
shepherd and merge contributions from users anywhere in the world. And then also bringing
that application to the enterprise,
serving those enterprise needs that are a bit more specific.
And that means adding certain kinds of functionality and features so that those enterprise users can use it seamlessly in their environment.
That's really what's ahead for us.
I hope you enjoyed the podcast.
If you like my work, you can follow Linked Data Orchestration
on Twitter, LinkedIn, and Facebook.