The Data Stack Show - 113: What Is Streaming Graph? Featuring Ryan Wright of thatDot
Episode Date: November 16, 2022

Highlights from this week's conversation include:
- Ryan's background and career journey (2:49)
- Quine and where it came from (4:36)
- Graph databases 101 (7:17)
- Use cases for graph databases (13:44)
- Purposes for graphs (22:27)
- How to use Quine (31:49)
- Quine's performance and scale (43:06)
- Educating users about a new product (49:13)
- The team that would operationalize Quine (52:23)
- When graph will gain popularity (56:15)

Quine: https://quine.io/

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data. RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by Rudderstack, the CDP for developers.
You can learn more at rudderstack.com. Costas, we are going to talk today about Graph, which we haven't talked about
in quite some time. I think Neo4j was the last one that we talked about. So I love bringing up
subjects that we don't cover a whole lot. We're going to talk with Ryan from ThatDot. They're
the company behind an open source technology called Quine. And really my question actually, because he talks a little about graph, is just defining it and then understanding from Ryan where it fits in the stack.
It can be used for a number of different use cases, right?
I mean, literally software development and building actual graphs, you know, queries and insights and all that sort of stuff.
So that's what I'm going to ask.
How about you?
Yeah, you're right.
Like we haven't had that many opportunities to talk about like graph databases and
like graph databases have been around like for quite a while, but we don't hear about
them that much outside of like, okay, like
Neo4j, which is probably like the most recognized one.
So it will be super interesting to hear from Ryan, like what's
made him like start the project.
Why we need graph databases?
And what's new about like the system that he has built?
And as you said, like how it fits to the rest of like the data infrastructure out
there, because it sounds like graph databases have been like a little bit niche.
Yeah.
When it comes like to analytics, at least.
Yep.
Well, at the same time, like don't forget that we have stuff like GraphQL, which,
okay, it's not about analyzing data, but in the front end development space,
it has been like very well adopted.
So it would be interesting like to hear from him, like what's next and what's
new and exciting about graph databases.
I agree.
Well, let's dig in and talk with Ryan.
Ryan, welcome to the Data Stack Show.
We are so excited to chat with you today.
Thank you, Eric.
Great to be here.
All right.
Well, give us your background.
You have a super interesting background, you know, as a data practitioner, entrepreneur,
but tell us about your background and what led
you to thatDot. Yeah, sure. My career has steered in the direction of data platforms and
data science. So it's really as a software engineer, creating data pipelines and creating
machine learning tools and other analysis portions to help answer this question about, we've got this high-volume
data stream, what does it mean? So in my career, that has kind of been the arc that has been guiding
a lot of my technical work. And so that has led me personally through positions as
software engineer, principal software engineer, director of engineering. I've led research projects as well, focused on creating new
technologies, new capabilities.
So principal investigator on DARPA funded research projects and kind of the
constant thread through all that is this data question about here's a bunch of
data, what does it mean?
Absolutely.
And tell us a little bit about thatDot.
What is thatDot?
And you know, why'd you found it?
So thatDot is a young startup that we founded to commercialize
a technology called Quine.
So Quine is the world's first streaming graph.
It's an open source project that was just recently released and it's been getting great community feedback.
And thatDot is the company behind the commercial side.
So providing commercial support, some extra tools on top of it for managing and scaling it and just running it in large volumes in enterprise environments.
Very cool. And tell us, okay, tell us about Quine.
So it's recently released. Where did it come from?
And then it's an interesting name.
And there's a, you know, before the show, we chatted briefly about this, but tell us
about the name or how the name Quine came about.
Yeah.
So from a technical perspective, Quine is a streaming graph.
It kind of lives in between two worlds.
The world of databases and data
storage, especially something that looks and feels like a graph database, but
it's really aimed at this high volume data streaming use case and and like I
was describing just themes in my own background, when you put those two
together, the big question is here's a high volume stream, what does it mean?
That meaning question is underlying
my philosophical background. I did a bachelor's degree in philosophy and just could never shake
that bug afterwards because there's some really interesting, deep philosophical questions that
have really nice tie-ins to modern data problems and modern software engineering problems.
So there's this really old question in the history of philosophy about how does a word convey meaning?
Not what does it mean, but how does it go about doing that process of conveying meaning?
And there's a long history behind that question, a lot of deep thought on that question. And there's just a really striking parallel to the data question.
So if you've got a stream of data and you think about each record in your data stream as a word and put all those things together and you've got something a whole lot more meaningful, you've got a real nice comparison to this long running question that has a lot of deep thought behind it.
And so the name for this project was really this synthesis of trying to say, there's this
age old question about how words convey meaning.
There's this very modern urgent question about how a stream of data conveys meaning and what
it means.
If we put those together, we can kind of leverage some
thinking on both sides to do something new and really advance and move the ball forward in that
case. Super interesting. I love it. Well, let's actually go back to basics. So a lot of our
listeners I know are familiar with all sorts of data stores, but we actually haven't talked about graph databases on the show a whole lot as a subject. And I know you said that Quine is a
technology sort of, you know, looks and feels like a graph database, but sits in between two worlds.
So I want to dig into that. But could you just give us a 101 on, you know, graph databases,
you know, what are they, you know, what are the unique characteristics and generally like, how do you see them use?
Because they're not necessarily new as we were talking about before the show.
So, yeah, give us the graph database 101.
Yeah, absolutely.
So a graph, if you close your eyes and imagine a whole bunch of circles swimming around, some of them are connected with arrows.
That's a graph.
So the circles in the graph are nodes
and the arrows that connect them are edges.
It's a way to represent data,
especially when on each of those circles,
you can put properties,
which are just key-value pairs.
So imagine like a map or a Python dictionary
or something like that
on each one of those nodes in the graph.
And so what you say about a node is really two things.
A node has some collection of properties and it has a relationship to other nodes.
When you put that structure together, then you can represent literally everything that
you represent in other ways.
So in a relational database or NoSQL database or tree store or key-value store,
you can represent exactly that kind of data in that graph.
But then the graph gives you a different perspective on it, and it turns out it
gives you some superpowers for working with that data.
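To make that concrete, here is a minimal sketch in Cypher, the graph query language discussed later in this episode (the labels and property names are illustrative, not from the conversation):

```
// Two nodes (the circles), each carrying a map of key-value properties,
// joined by one edge (the arrow) that names their relationship.
CREATE (a:Person {name: 'Ada', role: 'engineer'})
CREATE (b:Person {name: 'Bob', role: 'analyst'})
CREATE (a)-[:WORKS_WITH]->(b)
```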
So the different perspective comes from the sense that in a lot of ways, graph data and graph
structures feel like there's something different.
But in truth, it's more.
So you have data in a relational table.
You have data in a key value store.
There's a relationship between those in that relational tables are a bit more expressive.
They're more powerful.
You can join tables together.
You can run queries across them.
You can talk about the relationship between values there.
So in the same way that there is a relationship
between key-value stores and relational tables,
there's that same relationship
between relational tables and a graph.
So they're actually on a progression that gets more and more expressive as you go further
down that list.
And so internally, we have this picture we talk about among our team.
We call it the kite.
It's just this kite-shaped relationship about different data stores, where if you start
with a single value down at the bottom
and you ask this question about when I have a whole bunch of those, what structure do you get?
You get a key value store type of structure. It's a list or a set, something simple like that.
But what do I have when I have a whole bunch of those? Well, then it's a table structure or it's
a tree structure, just depending on how you choose to represent it, those are equivalent.
Okay, but what do I have when I have a whole bunch of those?
Well, imagine a bunch of trees that overlap and intersect.
That's when you're working with a graph.
When those leaves in a tree, when one leaf shows up in multiple trees, then
you've intersected your trees and you're working with a graph.
So there's this progression through a lot of traditional data storage and data representation
technologies that gets more and more expressive as you move up that progression.
At the end of that process, you arrive at a graph.
And Quine got started in part because we kept that thinking going and asking that question to say, well, what structure do I have if I have more of those?
And when you get all the way to the end and you've got a graph and you say, well, here's this soup of nodes and edges, all interconnected, connected together.
I've got to treat that as one thing.
When I've got a bunch of those, what do I have?
The answer is a graph.
When you have a bunch of graphs, you have a graph.
And so the fact that there's this mathematical pattern, even, that walks
through the history of database evolution from simple stores, like key-value stores
to relational databases and tree-structured NoSQL data
stores all the way through to graph databases, that question of this mathematical
progression through those things, it gets to a graph at the end and it stops.
That was just so suggestive to us.
We thought we've got to explore this all the way.
And so that was the thing that first got me hooked on graph data
and graph structures. That alongside of working with some real practical problems around
working in startups at the time. And we've got configuration problems about it's hard to get
the product configured and out the door for all these different customers. Some things they have
in common, some things that they have to be separate.
Teasing all that apart and understanding how to define that customer configuration
really drove in the direction of saying, it's got to be a graph.
We've got all these trees that are overlapping each other.
So teasing apart kind of those practical questions also
then steered in this direction of a graph.
And so that's what initially led to some of the ideas behind Quine
and then dove in to explore it all the way.
Very cool. That's such a helpful analogy.
I love the kite analogy and then sort of the spectrum of expressiveness.
Now, some of the use cases for graph, I think,
are pretty obvious to anyone who's worked with data, right?
So identity graph, where you have, you know, nodes and edges, and, you know, you're sort of representing, you know, relationships and all that sort of stuff.
But where, so let's just, let's just say we have kind of, you know, like a standard, quote unquote, vanilla data store set up in our stack, right? So you have a warehouse and you're storing, you know,
structured data in there and it's driving analytics and, you know,
all the other stuff that you do with tables and joins, as you mentioned.
Let's say you have a data lake and, you know, you have, you know,
sort of like unstructured data storage there,
maybe that's serving data science purposes, you know, et cetera.
Where does graph fit in?
You know, for someone who's sort of working on a data team, like where does graph fit in? Because I think what's interesting to me,
at least when you think about graph, like there are use cases across the stack, right? I mean,
in software engineering, of course, you know, to represent relationships of users, say. But then also,
to your point, analytics and, you know, sort of discovering meaning from data as well. So
graph is kind of can be a utility player in the stack. So help us understand where something like
Quine or thatDot would fit. Yeah. So like we were talking about before, this kind of spectrum of complexity or expressivity,
you know, where you move from single value stores or key value stores all the way up through
relational and NoSQL up to graph structure. So there's one spectrum about complexity. So a
graph helps you answer more complex questions.
And what I've seen in the industry and what really led to Quine's creation is this other
spectrum about data volume, like the speed at which data has to get processed. So batch data
processing has been the standard forever and still is. Plenty of batch processing still happens.
But increasingly, the world is moving more and more towards streaming data problems, which means real time, one record at a time, as fast as you possibly can.
And it's never going to stop.
It's an infinite stream. So that second dimension about the spectrum from batch to streaming, when you plot those two things together, you've got simplicity versus complexity, say, on your X-axis.
And you've got speed or batch versus streaming on your Y-axis.
What we found is that, of course, what you want is the top right corner there. You want the complex answers in real time. But there's a trade-off. There's this barrier between what we're
working with in this realm and what we want to get to in that top corner. So you usually have
to trade off how much data you're looking at, how much you hold in memory, how fast you can store it
to disk and pull it back in again. A lot of just classic data processing and data storage questions
that usually force architectural decisions into a compromise.
So the compromise is either, well, we've got a complex question,
so we're going to use an expressive tool like a graph database
to be able to describe and draw that complexity of some
pattern that is at the heart of our critical use case, but if you want to do it fast,
then, well, you can't really do that.
Graph databases have been notoriously slow for decades.
So they're cool, but too slow.
So you get pushed in the direction of, well, let's stuff it into a key value
store and let's build a big microservice architecture.
Let's let that be the graph.
So as data flows through that architecture, it basically becomes this graph-structured system of processes, this custom, bespoke, one-off-for-my-company
data pipeline. And we're going to try to build that into the thing that can handle high
volumes of streaming data, so all the way at the top, with complex answers to our question,
so all the way to the side. And a lot of data engineers spend their life in that world, making those trade-offs.
So, you know, we need it to go faster.
We need to scale it.
We need it to be horizontally scalable, faster, faster, faster.
It never ends.
But we need to put all this data together.
We have to understand it in context.
We have to know how this piece of data relates to that piece of data and be able to look for patterns that are built up over time as that data streams in.
So that's really where graph has fallen short in a lot of ways.
For anybody who's working on data pipeline tools, a graph is the right representation for anything that starts to be complex, but they've been too slow to actually operationalize and productionize.
So they've kind of been relegated to smaller toy problems in this space.
But if you've got to build something real and big and significant, then you can't, they're
just kind of off the table and you're going to have to go build this complex microservice architecture to simulate a graph built out of a bunch of key value stores for speed.
Super interesting.
Okay, so this is, I've been hogging the mic from Kostas,
but one more question for me.
Could you give us, as a follow-on to that,
just a real-life example of, you know, a challenge, you know,
and a real data stream that a company has, and a problem they're trying to solve, and how
Quine fits into, you know, an example architecture? Yeah, we've been having them coming out of
our ears lately. Fraud tends to show up very often, very commonly
as some form of a graph problem. And it depends a little bit on what kind of fraud you're talking
about. So authentication fraud for logging into a website, you know, there's lots of good tools
and good products available for signing in. But when something doesn't quite go along the happen path,
you've got, say, bad guys out in the world
who are trying to guess the password
or they have a password list
and so they're going to do this distributed,
this authentication attack, basically.
Well, that comes all in as one big flood
and it just dumps on there.
That's pretty easy.
Rate limiting has that covered. You've got that problem solved. No problem. But attackers are smart and they're going to keep moving and getting smarter. So now they know, slow it down,
you know, and, and spread out your attack, have it come from multiple sources in multiple different
ways, multiple different regions, lots of IP addresses.
Each one of those factors then lets them hide
in the deluge of other data that's happening behind the scenes.
And so to find that kind of authentication fraud
of someone who's trying to gain control of, say,
a particular account with special privileges,
if they gain control, it's a big loss, big problem.
So attackers are motivated, but to detect and stop that means you have to start
assembling the graph about all these attempts that are happening over time,
where they came from, who they're targeting, and these failed attempts
that fail, fail, fail, fail, fail on this one particular account, and then they succeed.
And that pattern shows up in this graph structure as you start connecting in each of those authentication events.
You start connecting them together, and it forms this often beautiful looking graph that then just leaps off the page as obviously here's the pattern of a password
spraying attack coming from so many different angles, low and slow so that it's not triggering
the easy alarms.
But then when you piece together all those attempts, you can clearly distinguish them
from the real user and pull those out.
And if they gain access, you can find it, see it and stop it right away.
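As a rough sketch, the pattern Ryan describes could be expressed as a Cypher query along these lines (the labels, properties, and thresholds are invented for illustration, not taken from Quine):

```
// Many failed logins from many distinct IPs against one account,
// followed by a success: the low-and-slow password-spraying shape.
MATCH (ip:IP)-[:ATTEMPTED]->(login:Login {result: 'failure'})-[:TARGETS]->(acct:Account)
WITH acct, count(DISTINCT ip) AS sources, count(login) AS failures
WHERE sources > 20 AND failures > 100
MATCH (ok:Login {result: 'success'})-[:TARGETS]->(acct)
RETURN acct.id, sources, failures, ok.time
```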
So that's, that's one example of, you know, a fraud use case.
We've seen other kinds of fraud use cases as well, like transaction fraud.
So as money's being spent, financial institutions have to do something similar about considering other factors.
What's the geography?
What are the other most recent purchases? Who signed for it?
All sorts of different kinds of factors. What's the category of the purchase?
All these other sorts of factors that start relying on other values and other kinds of
information that isn't so readily usable with a lot of other tools. But when you put it into a graph, it draws this picture, which then leaps off the page
and tells this very clear story that right now, here's a case of someone who is using
a stolen credit card number to buy something.
Super helpful.
All right, Costas, please take it away.
I've been hogging the mic.
Thank you, Eric.
That was like super interesting, Ryan, to hear like all the stuff you
got to share about the ideas behind Quine.
I would like to start by asking you something about graphs. And I want your opinion on like,
we can approach graphs as a way to ask questions
and as a way to represent data, right?
But you don't necessarily need to have both.
Like you can have, let's say, a graph-based language
for creating the data and translate that into a relational
model at the end, on the back end, or even a key value store. We've seen quite a few companies
doing that with GraphQL and having something like Postgres behind. Correct?
Yeah.
And then obviously, you can have, let's say, use graphs as a way to represent
also data in a much lower level in terms of the database system that you have.
How do you see that?
And which part, for you, is the important one, and which part is, probably
both, I don't know, like recommended as part of Quine?
Yeah.
So, an interesting question, and there's probably
two sides to the answer.
I think the preview is graph applies to both, but it's in different ways, maybe significantly different ways.
In one sense, the way that you ask your question can be thought of as a graph.
Whether it's GraphQL or otherwise, there are query languages set up for graph queries. So you can use the Cypher query language,
the Gremlin query language. There's an initiative to try to create a standardized graph query
language because ways to express your problem very naturally fit into a graph because that's
really how humans think. A lot of what we think about, and just the way language works, the way
our mental model of the world works, is reflected in this node-edge-node
pattern repeated and connected, because that same pattern shows up in our language.
So subject, predicate, object: Ryan knows Costas.
I'm a node.
You're a node.
Our relationship is that edge that connects us.
So that's how you build the social graph.
But users get connected together by who knows whom.
And you can frame questions as a graph
in using graph query languages
or even just more naturally using natural language,
which is naturally graph shaped.
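That subject-predicate-object shape maps almost word for word onto a Cypher pattern, which is part of why these queries read so naturally. A sketch, with illustrative names:

```
// subject -[predicate]-> object: "Ryan knows Costas"
MATCH (ryan:Person {name: 'Ryan'})-[:KNOWS]->(costas:Person {name: 'Costas'})
RETURN ryan, costas

// And the social-graph question "who do my friends know?" is just one more hop:
MATCH (me:Person {name: 'Ryan'})-[:KNOWS]->()-[:KNOWS]->(foaf:Person)
RETURN DISTINCT foaf.name
```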
So on one side, there's a good argument and analogy for graph structured questions because
they fit the way that we talk.
On the other hand, when you go to write programs and write software, that's where the software
engineers need to turn it into something linear, some code that we can execute.
And there's other good reasons to then head in a different direction on that question
asking side.
So it's not exclusively that.
And even then, a lot of graph queries turn into something that looks like SQL, like in the Cypher language.
It's very SQL-like.
So you're writing out this linear progression of a query.
It's going to be a query graph, but you're expressing it in something that is this query
structure.
So the question side of it kind of comes in and out and it's maybe more
conceptual than it is literal for the graph view of the question asking side.
What we found is that really the secret sauce for Quine, what makes it
different and special, is in the graph runtime. You know, I mentioned before
that graph databases especially have been known to be really slow and lethargic in terms of what they can handle, and that
limits their application and how readily developers dive in and use them.
And I think because of that, there's developed this sense in the industry that
graphs themselves are slow; if you want it to be fast, you've got to use some other tools.
What we found through years worth of research is that
that's kind of an artifact of the graph database mentality,
that data is going to be primarily stored.
It's going to sit there at rest, and you're going to bring a query to it
and occasionally pull out an answer.
So that database mentality about static data stored on disk has been a limiting
concept for what graphs can be.
And Quine, the reason that lives in between two worlds, it lives in between
that database world and the stream processing world is because really at
its heart, we built
Quine asking, what if we didn't automatically adopt all the database assumptions?
We're going to have to confront the same fundamental challenges, but what
if we started from a different place?
Let's take that graph data model and let's marry it to a graph computational model.
So for us, that means an old idea from the 1970s: the actor model. So this asynchronous
message passing system that has really become the foundation of stream processing to build
reactive, resilient systems. It means having these distributable, scalable actor implementations that allow you to send messages as a way
of doing processing, and representing data in that system as well.
So Quine builds the graph where nodes get backed by actors under the hood.
So they're independent processes that can, you could have thousands, hundreds of thousands, millions of them running in a system.
You could have, you know, a lot of them moving.
You can distribute them across clusters.
And all communication happens through asynchronous message passing.
So with that as the framework, that much more stream-processing kind of approach lets us then incorporate
other stream processing considerations like back pressure into how queries are executed
and how data is processed.
So building a graph system that isn't automatically adopting the old school ideas of a database,
but is built for the modern world
of high volume streaming data
as the first class citizen
we're really trying to work with.
So on the back end,
what we found is that
when you bring that kind of new approach
to the runtime model,
it can unlock some pretty stunning efficiencies.
It gives us the opportunity to do graph computation at just the right moment.
The ideal moment when your data's in memory, it's already there.
And then it turns out the graph structure behaves like an index.
You've got a node here and you want to know, well, what data is related to this? What should I warm up in the cache so that it's ready to go for high volume data processing? Well, a very natural answer to that question is warm up the nodes that are connected to the one you're talking to. You know, the one you have an edge connecting to this node in question, warm up its neighbors.
That's super interesting.
Okay, so if we have, I mean, a key value store,
I think it's pretty clear what the data model is there.
You have keys and you have values, right?
Okay, things can get a little bit more complicated
if you also allow some nested data there.
But I think it's pretty clear to everyone's mind of how you model something like that,
how you create a key value store, how you design it.
Pretty much the same also with relational database.
You have the concept of a schema, you have the table, the table has columns, the column has like a specific type. And then you think in terms of like relationships and how one table relates to the other, blah, blah, blah, like all that stuff. Okay. So there is, I think, like pretty common knowledge of like how someone can design a database to drive an application or like do some analytics.
How does this work like with Quine?
Like let's say I want to use Quine and probably I already have some
data sources that they already have a schema, some of them might be relational,
some of them might be, you know, like, hierarchical, like, documents or whatever.
What do I have to do to define this graph, right?
Or can, like, Quine figure it out on its own?
Like, how does it work?
Like, how do I go from all this different data that I have out there to a consistently
updated graph on Quine, with specific semantics that I understand?
Yeah. So there's a two-step process to using Quine. And let me kind of preface
that by saying where this will fit is in the pipeline, in the data pipeline. So as data is
moving through your system, through your data pipeline, Quine plugs into Kafka on one side
and then plugs into Kafka on the other side. So it kind of lives in between two Kafka streams
as being really a graph ETL step. So to help take what's in that first stream, combine it,
understand it, express what you're looking for, and then stream out what will hopefully be a much lower volume, but higher value set of events coming out of Quine and into the
next Kafka topic.
So to use Quine is to first answer this question about where's my source of data, plug it
into a Kafka topic or a Kinesis topic or Pulsar, some streaming system that is going to
deliver this infinite stream of events, plug it into that and aim to stream out meaningful
kind of distillations of what's coming through your pipeline.
So the detection of a fraud scenario or an attack graph from a cybersecurity use case, or the root cause
analysis of your log processing or whatever the use case may be.
So those sources of data get built into a graph with the first step.
The first step is defining a query that takes every record and builds it into a small little
sub graph. So imagine one JSON object comes through that stream.
That JSON object has a handful of fields in it.
Maybe they're nested fields.
That's fine.
But the first step to using Quine is you write a Cypher query that says,
I'm going to take in that object and I'm going to build it into a small little subgraph.
It's like a little, you know, picture a paint splatter or something, you know,
there's a node in the middle and maybe a couple of things coming off of it.
Yeah.
That tends to be the shape that it gets built into because we can take that
JSON and we could just say, make one node out of it.
So take that JSON, make one node.
But the value in a graph is when you start
connecting it into other data. So the question is, we've got that JSON object stored as a node,
a disconnected node in a Quine graph, but what do we want to start intersecting it with?
Pull off some fields from that JSON object. If there's an IP address in there, pull that IP
address off and use it to create an edge to another node where that other node represents the IP address.
That way, if you end up with two JSON events somewhere in your stream that both refer to the same IP address, they both get connected to the same IP address node.
Maybe same thing with URL or username.
You can fit it into a hierarchy of timeframes,
so that there's this progression of events
that fit into this hierarchical time representation.
So that first step is to take that JSON object
and pull off fields or kind of build it into a subgraph
so that you can intersect it with other data.
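For a sense of what that first step looks like, here is a hedged sketch in the style of a Quine ingest query, where `$that` refers to the incoming record and `idFrom` derives a consistent node ID from values; the field names and labels are invented for the example:

```
// For a record like {"id": "...", "user": "...", "ip": "..."}: build one event
// node plus shared nodes for the values we want to intersect on, so any later
// event mentioning the same IP or user connects to the same node.
MATCH (event), (ip), (user)
WHERE id(event) = idFrom('event', $that.id)
  AND id(ip)    = idFrom('ip', $that.ip)
  AND id(user)  = idFrom('user', $that.user)
SET event = $that, event:Event,
    ip.address = $that.ip, ip:IP,
    user.name = $that.user, user:User
CREATE (event)-[:FROM_IP]->(ip),
       (event)-[:BY_USER]->(user)
```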
Yeah, again.
So if I understand correctly, like, Quine does not, like, store the data.
It processes input data that's coming from stream, and outputs the
results to another stream, right?
Is this correct?
Great question.
Almost.
Quine does store data, but it does so using existing storage tools.
So there's a persistence layer built into Quine where you can choose how it's going to store
data. So locally in
something like RocksDB
on your local network
or in a managed cluster using
Cassandra, there's
a pluggable interface to choose
any of several different storage technologies
for as Quine is building up
the graph, the value
comes from intersecting events that happen over time in your stream. It's been an unfortunate
limitation in stream processing to say that, well, if you want to join events, you know, through an
event stream, we'll have to hold them in memory. And so you'll have to set a retention window,
a time window for how much data you're willing to hold on to.
So if you're trying to match A and B,
you know, that A arrives first and you have to hold on to it and wait
and keep consuming the stream looking for B.
When B finally arrives, you can join it to A.
That's great.
What about when you have a C and a D and
an E and you need to join more things together, your problem gets a lot more complicated.
And if you have to hold all of that in memory in order to be able to join them together,
then you're forced to use expensive machines with huge amounts of RAM
and to set some artificial time window that says our RAM is limited.
We can only hold so much data there.
So I'm going to hold on to A for 30 seconds, a minute, 10 minutes, 30 minutes.
But if B doesn't arrive in that timeframe, I just let A go.
And if B arrives after that timeframe, I missed it.
We just missed that, you know, record. We missed that insight of what we were trying to understand because we had to
force this artificial time window which for operational reasons, you know,
was the state of the art.
So Quine tries to solve that by using this storage layer under the hood,
so that you can manage your data storage separately in kind of known fashion, known quantities.
You can use robust tools like Cassandra or ScyllaDB.
And you can set TTLs so that you can expire out old data.
Or you can hold on to it in whatever way works best for your application. But joining data over time in that stream in a way that is then robust, durable,
and can help overcome this unnatural time window limitation.
So that we can find A and B and C and D and E when they all arrive more than 30 minutes apart or whenever your time window is.
So this becomes really important for some use cases like detecting advanced persistent threats
in the cybersecurity world. So attackers who are deliberately spreading out their attack over a
long period of time so that they can hide in the high volume of data that's coming.
Yeah. So just to make sure that I understand correctly, you have Quine that,
let's say, calculates and updates a graph in memory, and it persists the state of this graph
to a storage layer that can be anything. I don't know, like a RocksDB or like Cassandra
or something like that.
So how do you, I mean,
how do you store a graph-like data structure
to a storage layer that obviously
has not been optimized for that, right?
Like it's something different.
It can be like a key-value store or it can be
a relational database.
So how do you do that?
Yeah, so one of the
interesting things that Quine does,
which I haven't seen in any other,
definitely in any other graph system
and a lot of other data systems too,
Quine uses a technique called
event sourcing to store data.
So what it actually stores on disk is not the materialized state of every node in the graph, but it stores the history of changes.
So when a new record streams in from Kafka and we've got to go create some structure in the data, that flows to a point in the graph.
A node in the graph handles that and says,
I need to set these properties. Well, those properties are already there. If they're already
set, then it's a no op because we don't have to store anything on disk if we're setting set foo
equal to bar, but foo is already equal to bar. So there's no change. There's nothing to do.
There's nothing to update. It's only when foo setting foo equal to baz, and it used to be equal to bar, then we've got an update.
That update gets saved to disk in the event sourcing fashion that then lets us write a very small amount to disk so that we can keep up with a high volume of data that's streaming through.
So we just write the changes. Many times those changes can be duplicates or no ops.
And so we can reduce what we save. And when data is stored, it gets saved in a fashion that
kind of resembles a write-ahead log. You know, it's a small little write that can be done very quickly in a simple structure
that looks like a key-value store. So we can take advantage of the high
throughput that you get with key-value stores like Cassandra and others to have real high volumes of data
moving through, but then building it together into a graph that gives you that expressivity for answering complex questions.
And that time dimension is another interesting angle here too.
Because we saved the log of changes, we can actually go rewind the entire graph if we need to and say,
here's a question I'd like to run this query and get an answer for the state right now, but also tell me what the answer would have been
10 minutes ago or a week ago or a month ago.
Same query, just add a timestamp,
and Quine will give you the answer
for what it used to be at that historical time.
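As a conceptual sketch of what that means (the journal entries in the comments are illustrative, and the way the historical timestamp is supplied is paraphrased from the conversation, not an exact API reference):

```
// What hits disk is a per-node journal of changes, not materialized state:
//   node-42: set foo = 'bar'   at t1
//   node-42: set foo = 'bar'   at t2   -- identical value, a no-op, nothing written
//   node-42: set foo = 'baz'   at t3   -- a real change, one small append
// Replaying the journal up to any timestamp reconstructs the node as of then,
// so the same query can be answered "as of" a past moment by supplying a
// timestamp alongside it.
MATCH (acct:Account {id: 'acct-42'})<-[:TARGETS]-(login:Login)
RETURN count(login)
```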
And so whenever you restart, like, Quine,
do you have to go and recreate the graph?
How does this work?
So the graph gets stored durably.
We have that log of changes for every node in the graph that's saved to disk.
And then for optimization, snapshots get saved as well.
Oh, okay.
So that kind of shrinks what has to be replayed.
But on demand, when a node is needed for processing in the stream that's flowing through, when it's needed, that node goes and wakes itself up.
So it has actors behind the scenes that can take independent action. So they go fetch their journal of changes and replay that journal or restore a snapshot
so that that node is ready to go
and handle the incoming message
for whatever time period it's meant to represent.
Mm-hmm.
That's awesome.
And can you share a little bit of information
about the performance and how well Quine scales?
Yeah.
So we have, we've been trying to find the limit and we haven't found it yet.
So in my experience, trying to, you know, I mentioned before I've led some
DARPA funded research projects that were aiming at building big graphs to
analyze for cybersecurity purposes.
And what we've kept finding was that when we used all the other graph systems out there,
that if you're trying to stream data into a graph database, you can run anywhere from maybe like 100 up to about 10,000 events per second, kind of max.
And, you know, there's been some iterations on graph databases, you know,
to get to that 10,000, but there's still kind of this limit at that level.
And it's, and the problem gets harder when you add in a combined read and write workload.
So we're not just writing the graph.
We're also trying to read out the results in real time
and publish them downstream.
So in our experience, we tried every graph system out there.
Some of them lose data along the way, which is just not good.
Others just kind of have this natural limitation on their throughput
that kind of caps out at 10,000-ish events per second, depending on the use case. When we were
deploying Quine in an enterprise environment, we had a customer come
to us and say, our problem begins at 200, 250,000 events per second.
So if you can show us something running that fast, then we can talk.
So we stood up Quine, ran it at 425,000 events per second on
a cluster that costs $13 an hour.
Oh, wow.
That's, that's impressive.
So since then we've kept going past that and we've done
over a million events per second.
And that's a million events ingested per second.
And what we haven't gotten to actually,
and apologies, this is kind of
a long way around
to answering your question.
We haven't gotten to the other fact
that to get data out of Quine,
you'd set a standing query to monitor that graph
looking for as complex a pattern as you'd like.
You know, three node hop, five, 10, 50.
It doesn't matter.
As large and complex a pattern as you like, you know, conditions and filters and whatever you need to express here.
Looking for a pattern in that graph where we're doing that reading and monitoring at the same time that we have this write-heavy
workload. All the numbers that I was talking about, getting up to a million events
per second and beyond, that's a million events per second ingested while simultaneously monitoring
the graph for complex patterns and streaming out results. So it turns into the equivalent of another,
depends on your pattern,
but if it's a five node pattern,
you're probably doing something like
another 5 million read queries per second.
The equivalent of that,
we're not actually doing that.
But the equivalent of that,
that sort of complex querying of your whole data set
in order to find every change, every update, every new instance of the complex pattern you're looking for.
So those stream out of Quine using something that we call standing queries.
It's like a database query.
And you just say, here's what I'm looking for.
Every time you find it, here's the action I want you to take.
Maybe publish it to Kafka or log it to, you know, standard out or save it
to disk or even use it to pull back into the graph and update something else.
So that standing query then, you know, triggers that output and kind of feeds it to the next
system.
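A simplified sketch of a standing query pattern, reusing the illustrative fraud model from earlier (in Quine, the pattern is registered once through its API together with an output action; the specifics here are invented):

```
// Evaluated incrementally as data streams in: every newly completed match
// triggers the configured action rather than waiting for someone to ask.
MATCH (ip:IP)-[:ATTEMPTED]->(login:Login {result: 'failure'})-[:TARGETS]->(acct:Account)
RETURN DISTINCT id(acct)
// On each new match, the attached output action runs: publish an alert record
// to the outgoing Kafka topic, log it, or write an enrichment back into the graph.
```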
And in terms of like query concurrency, I mean, how many standing queries can someone run on the graph to create value at the end? So we've deployed with hundreds of standing queries so far.
And I know this was something our team was working on recently because
there's some low hanging fruit to even carry that number up a lot higher.
Okay.
This is one of the reasons for open sourcing this project: to kind of
show off this interesting work that's been done here, and also just kind of, you know, get input and insight
from any community members who want to get their hands dirty and look at the code
and say, you know, hey, here's a little tweak that could, you know,
improve the efficiency over here.
Yeah.
That's amazing.
Okay.
One more question, which is a little bit less technical, but I think like it's,
it's quite interesting, like from a product perspective.
So you mentioned that, okay, like thinking in terms of the graphs and asking questions
in terms of the graphs, is like a very natural way for a human to do, right?
But at the same time, I mean, when it comes like to querying data, SQL has been around
like for like, since forever, everyone is trying to kill SQL, but you know, it's like a cockroach.
Like it never dies.
It's still there.
Like no one can kill it.
Like someone like a quite well known investor, like Martin Casado said once
that like you can bet as much as you want against JavaScript and SQL,
but at the end you are going to use them.
Yeah.
They have proved themselves like again and again and again,
like they cannot get killed.
So how do you get like people to start using like a different query anchorage?
And how do you educate people and how big of a problem it is also like when you're trying
like to introduce a new paradigm for like interacting with data?
Yeah, great question.
So SQL, like the reality is that SQL is literally the most known, most used language on earth.
It's even more so than JavaScript. Yeah.
Everybody knows SQL and it won't die.
It won't go away.
It's our declarative data language.
So it's kind of how you work with so many data tools. For us, what that means is maybe twofold. So on one hand, we've implemented Quine to use actually two
different graph query languages.
So Gremlin is an early graph query language.
There's some limited support in Quine for using Gremlin, but it's
kind of a tricky language.
Cypher is something that Neo4j created.
It's meant to be much more SQL-like.
You know, its syntax, its use of keywords.
You know, it looks and feels very SQL-like.
So it's not that different.
The main difference is that instead of your select statement in Cypher, you use
the match statement and then you draw
ASCII art.
Your ASCII art kind of draws pictures of nodes connected by edges, you know, and that gives
you that connection to a graph structure.
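For a sketch of that contrast (table and label names are illustrative): where SQL would express the relationship through joins, the Cypher MATCH clause literally draws it, with nodes in parentheses and edges as arrows:

```
// "Which people are friends?" drawn as a pattern rather than joined as tables.
MATCH (p:Person)-[:FRIEND_OF]->(f:Person)
RETURN p.name, f.name
```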
Our team has actually done some additional research, which maybe I shouldn't say this,
but we've done some work
toward actually eliminating
the need for graph query languages in general.
So the ability to use SQL directly on a graph,
there's no reason they can't compile down together
and our team has figured out how to do that.
So it's not something we have released yet,
but it's an interesting forthcoming update.
Awesome.
Looking forward to that.
All right.
I think I really monopolized the conversation here.
So, Eric, all yours.
Yeah.
I think that's just how the show works.
I monopolize, you monopolize.
Yeah, true.
And then the show's over.
Yeah.
Okay, Ryan, let's talk about... You know, what's interesting is
I'm thinking about the team
that would operationalize
Quine, right?
And, you know,
potentially it involves a lot of different,
you know, sort of a lot of different players,
you know, from different sides of the technical table. What does that team usually look like? You know, cause you're, you know, you have like sort of the core, you know, you're deploying
something, you're connecting it to, you know, existing pipeline infrastructure, streaming infrastructure, you are setting up, you know,
a key value store that, you know, that manages the log, right?
You're querying that, right?
So can you just explain the ecosystem of the team that would both implement and then operate
Quine on an ongoing basis?
Yeah, sure.
So what we've usually seen is really three roles.
And sometimes these three roles are occupied by a single person.
Sometimes they're each occupied by separate teams, just depending on the scale.
So one is the architect, you know, the person responsible for kind of the big picture.
How do these things relate?
How does it flow together?
You know, how does the system scale? How does it flow together? You know, how does the system scale?
How does it connect into the rest of the data pipeline? So the architect working at that level
to, you know, see the structure that Quine fits into. Next is the subject matter expert. So the person who says, here's my Kafka topic or my Kinesis topic.
It's got data in it. I know what that data means, and I want to turn it into this kind of an answer.
So that's the person who is writing the Cypher queries, understands how data gets built into a
graph, how that graph gets monitored with standing queries to turn it into a stream of answers coming out.
And then the third role is really the operator. So the operations team
or the data engineering team who is responsible for
standing up and keeping the system running over time in the long run.
And so those three different roles, as I mentioned, sometimes
they're occupied by one person and they wear three different hats in those roles.
But a lot of times we see, especially at larger companies working at scale, they tend to be separate concerns and occupied by separate people.
Yep, makes total sense.
And then, oh, sorry, go ahead. I was going to say the enterprise version of Quine
is something that the company behind Quine supports.
And that's aimed at trying to help address the concerns
that those teams have in a real high volume enterprise situation.
So, you know, you want to run a cluster,
you want that cluster to be scalable and resilient.
So running a cluster, members are going to die be scalable and resilient. So running a cluster,
members are going to die. That's just how it's going to happen. You need to have that system
running in a way that can be resilient to failure, can be scaled over time, can coordinate with and
scale with the data storage layer behind the scenes. So that's really where the commercial work for that dot tends to focus.
Got it. Very helpful.
And then last question here, because we're getting close to time,
but we talked a lot about sort of why graph, you know,
or why, you know, Quine, what problem set does it
fit, and talked through a couple examples. Let's talk a little bit about when. So some of the examples
that you mentioned, and even some of the, you know, real-world requirements, maybe especially some of the scale
requirements, 250,000 events per second, you know, sound like very enterprise level problems,
right? Like high scale enterprise level problems, which makes sense, right? A lot of sort of
emerging technology, you know, this is a pattern that we've seen over and over again, you know,
emerging open source technology will sort of be built to solve these like large scale problems
where, okay, well, graph had a, you know, a limit of 10,000
events per second. And so it really wasn't a viable solution for sort of these like high
scale systems. And so technology will emerge to solve this problem that existing technology
couldn't. Do you see that trickling down? And is there a use case, you know, for smaller companies
that aren't at large scale?
And so maybe that's the question, not necessarily why graph, like that makes sense, but maybe
when graph.
Yeah, great question.
It's, we've started seeing this and I think this is part of the reason why there's so
much buzz around the open source community is that even in a small scale, we've seen users taking Quine and using it as
part of their data pipeline to put some of the expensive interpretation step ahead of their
expensive tools. So you're streaming data and it's headed to Splunk. And, you know, maybe you don't have a million events per second, you know, going into Splunk
or else you'd be spending billions of dollars a month on your Splunk bill.
But, you know, you've got data flowing through and downstream, it's going to go through some
expensive analysis.
Maybe it's some complex machine learning that has to happen.
Maybe it's being loaded into Splunk or some expensive tools downstream. We've seen users put Quine into their pipeline to do some of the early processing
that can, you know, take some of those expensive downstream analysis steps
out, basically. Yeah.
Do them upstream so that you don't have to pay for it downstream.
And a lot of times,
those aren't millions events per second
kind of use cases.
You know, they're more reasonable scales
in the hundreds or thousands
or tens of thousands kind of scale
that let you put the pieces together,
understand what you're working with,
reduce your data,
and have the reduced amount of data
be more meaningful
so that you can either more effectively
or more cheaply and efficiently
do the downstream analysis that you need.
Yeah, super interesting.
And is that...
I know there are several factors there.
It's almost like a first-pass filter,
you know, or sort of, you know, it's like a pre-compute before it gets to the expensive
compute layer. And how much of that, it sounds like there's two sides of that, right? So you said you
were running, you know, 425,000 events per second at $13 an hour.
So there's an infrastructure architectural advantage where, because of where the system is stacked and the way that it's architected, it just sounds inherently that you're sort of realizing low cost at scale.
But on the other side,
that's due to the nature of graphs, right?
So like the graph,
you're really leveraging the power of graphs
as the pre-compute
and it just so happens
that the architecture runs very cheaply.
Is that the best way to think about it?
Yep. Yep.
Completely agree.
And that step to assemble your data, you know, like we described, it's like one record at a time, but you spread it out for the sake of connecting it. And once it's connected, you can see all these meaningful patterns of the kinds of things that we just naturally talk about that get built over time. And so that's why a lot of the times we see that the total
volume of data go down because you can have a lot of things come in, a lot of partial patterns get
built, but not necessarily be what you need to analyze. And so we like to say high volume in,
high value out. The goal is to reduce how much data comes out, but have it be more meaningful,
more understandable, more important. Yep. Makes total sense. All right. Well,
we're at time here, but before we hop off, if our listeners want to learn more about
Quine or thatDot, where should they go?
Quine.io is the home of the open source project.
So there's documentation on there.
There are getting-started tutorials.
You can download the executable.
It's packaged up as a jar or Docker image, a handful of different formats.
So go check out Quine, pull it down.
It's free.
See if it can help in your data pipeline.
If that helps solve an important problem for you, or if you have questions about it and you want to share your questions or share your story,
there's a Slack community linked from Quine.io that you can join and ask
questions or kind of share stories about what's going on. And if that helps solve an important problem
and you want some help in scaling that
or commercial support for it behind the scenes,
thatdot.com is the company behind it.
The creators of Quine and our team of excellent engineers
and others who are working towards supporting that.
Awesome.
Well, thank you so much for the time, Ryan.
I learned a ton and it was such a great conversation.
So thank you.
It was a pleasure to be here, Costas.
Eric, thank you very much.
That was a fascinating conversation, Costas.
I think, you know, I don't know if I have one major takeaway.
The volume in events per second, a million events per second is pretty wild.
So that's,
you know, that was certainly a takeaway. But I think the larger takeaway, we didn't necessarily discuss this explicitly, but what I'll be thinking about is we've talked about batch versus streaming
on the show, actually a good bit. And what strikes me about Quine is that it is fully adopting streaming architectures,
right? And it's assuming that the future will operate on primarily a streaming architecture,
which is pretty interesting. And that architectural decision for Quine, I think,
says a lot about the way that the people who created it, you know, see the future of data.
Yeah.
I think it's also like a more natural paradigm for the kind of stuff
that you're doing with graphs.
Because graphs naturally arise like in systems where you collect like events
and interactions in general.
And these tend to be like streaming in nature.
So I think like it makes a lot of sense.
I think what I found like extremely interesting with the conversation with
Ryan is that it's, it feels to me at least that they have managed with
Quine like to figure out exactly what's the right type of problem that makes a lot of sense to solve with both streaming and graph processing. And it's, let's say, like complementary, almost like adding like a streaming
layer, sorry, not a streaming, but a graph layer, on top of like whatever
you already have, rather than having, like, yet another monolithic database
system that's trying to do like everything around like graphs.
So that's what I found like extremely interesting.
And I think I'm, I'm obviously like very curious to see where Quine will get.
But this conversation made me like very excited about the progress of these projects.
And especially because of these like product, let's say, decisions that they've made.
Sure.
It's very cool.
All right.
Well, thanks for joining us on the show.
Subscribe if you haven't, tell a friend,
and we'll catch you on the next one.
We hope you enjoyed this episode of the Data Stack Show.
Be sure to subscribe on your favorite podcast app
to get notified about new episodes every week.
We'd also love your feedback.
You can email me, Eric Dodds, at eric@datastackshow.com. That's E-R-I-C at datastackshow.com.
The show is brought to you by Rudderstack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.