The Data Stack Show - 113: What Is Streaming Graph? Featuring Ryan Wright of thatDot

Episode Date: November 16, 2022

Highlights from this week's conversation include:

- Ryan's background and career journey (2:49)
- Quine and where it came from (4:36)
- Graph databases 101 (7:17)
- Use cases for graph databases (13:44)
- Purposes for graphs (22:27)
- How to use Quine (31:49)
- Quine's performance and scale (43:06)
- Educating users about a new product (49:13)
- The team that would optimize Quine (52:23)
- When graph will gain popularity (56:15)

Quine: https://quine.io/

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by Rudderstack, the CDP for developers. You can learn more at rudderstack.com. Costas, we are going to talk today about Graph, which we haven't talked about in quite some time. I think Neo4j was the last one that we talked about. So I love bringing up subjects that we don't cover a whole lot. We're going to talk with Ryan from ThatDot. They're the company behind an open source technology called Quine. And really my question actually, because he talks a little about graph, is just defining it and then understanding from Ryan where it fits in the stack.
Starting point is 00:00:56 It can be used for a number of different use cases, right? I mean, literally software development and building actual graphs, you know, queries and insights and all that sort of stuff. So that's what I'm going to ask. How about you? Yeah, you're right. Like we haven't had that many opportunities to talk about graph databases, and graph databases have been around for quite a while, but we don't hear about them that much outside of, like, okay, like
Starting point is 00:01:27 Neo4j, which is probably the most recognized one. So it will be super interesting to hear from Ryan what made him start the project. Why do we need graph databases? What's new about the system that he has built? And as you said, how it fits into the rest of the data infrastructure out there, because it sounds like graph databases have been a little bit niche.
Starting point is 00:02:00 When it comes to analytics, at least. Yep. Well, at the same time, like don't forget that we have stuff like GraphQL, which, okay, it's not about analyzing data, but in the front end development space, it has been very well adopted. So it would be interesting to hear from him what's next and what's new and exciting about graph databases. I agree.
Starting point is 00:02:26 Well, let's dig in and talk with Ryan. Ryan, welcome to the Data Stack Show. We are so excited to chat with you today. Thank you, Eric. Great to be here. All right. Well, give us your background. You have a super interesting background, you know, as a data practitioner, entrepreneur,
Starting point is 00:02:43 but tell us about your background and what led you to thatDot. Yeah, sure. My career has steered in the direction of data platforms and data science. So it's really as a software engineer, creating data pipelines and creating machine learning tools and other analysis portions to help answer this question about: we've got this high volume data stream, what does it mean? So in my career, that has kind of been the arc that has been guiding a lot of my technical work. And that has led me personally through positions as software engineer, principal software engineer, director of engineering. I've led research projects as well, focused on creating new technologies, new capabilities.
Starting point is 00:03:30 So principal investigator on DARPA funded research projects, and kind of the constant thread through all that is this data question about: here's a bunch of data, what does it mean? Absolutely. And tell us a little bit about thatDot. What is thatDot? And you know, why'd you found it? So thatDot is a young startup that we founded to commercialize
Starting point is 00:03:53 a technology called Quine. So Quine is the world's first streaming graph. It's an open source project that was just recently released and it's been getting great community feedback. And thatDot is the company behind the commercial side. So providing commercial support, some extra tools on top of it for managing and scaling it and just running it in large volumes in enterprise environments. Very cool. And tell us, okay, tell us about Quine. So it's recently released. Where did it come from? And then it's an interesting name.
Starting point is 00:04:27 And there's a, you know, before the show, we chatted briefly about this, but tell us about the name or how the name Quine came about. Yeah. So from a technical perspective, Quine is a streaming graph. It kind of lives in between two worlds. The world of databases and data storage, especially something that looks and feels like a graph database, but it's really aimed at this high volume data streaming use case and and like I
Starting point is 00:04:55 was describing just themes in my own background, when you put those two together, the big question is: here's a high volume stream, what does it mean? That meaning question is underlying my philosophical background. I did a bachelor's degree in philosophy and just could never shake that bug afterwards because there's some really interesting, deep philosophical questions that have really nice tie-ins to modern data problems and modern software engineering problems. So there's this really old question in the history of philosophy about how does a word convey meaning? Not what does it mean, but how does it go about doing that process of conveying meaning?
Starting point is 00:05:43 And there's a long history behind that question, a lot of deep thought on that question. And there's just a really striking parallel to the data question. So if you've got a stream of data and you think about each record in your data stream as a word and put all those things together and you've got something a whole lot more meaningful, you've got a real nice comparison to this long running question that has a lot of deep thought behind it. And so the name for this project was really this synthesis of trying to say, there's this age old question about how words convey meaning. There's this very modern urgent question about how a stream of data conveys meaning and what it means. If we put those together, we can kind of leverage some thinking on both sides to do something new and really advance and move the ball forward in that
Starting point is 00:06:30 case. Super interesting. I love it. Well, let's actually go back to basics. So a lot of our listeners I know are familiar with all sorts of data stores, but we actually haven't talked about graph databases on the show a whole lot as a subject. And I know you said that Quine is a technology that sort of, you know, looks and feels like a graph database, but sits in between two worlds. So I want to dig into that. But could you just give us a 101 on, you know, graph databases, you know, what are they, you know, what are the unique characteristics, and generally, how do you see them used? Because they're not necessarily new, as we were talking about before the show. So, yeah, give us the graph database 101. Yeah, absolutely.
Starting point is 00:07:15 So a graph, if you close your eyes and imagine a whole bunch of circles swimming around, some of them are connected with arrows. That's a graph. So the circles in the graph are nodes and the arrows that connect them are edges. It's a way to represent data, especially when on each of those circles, you can put properties, which are just key-value pairs.
Starting point is 00:07:38 So imagine like a map or a Python dictionary or something like that on each one of those nodes in the graph. And so what you say about a node is really two things. A node has some collection of properties and it has a relationship to other nodes.
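To make the nodes-edges-properties picture concrete, here is a minimal sketch in Cypher, the graph query language discussed later in this episode. The labels and property values are purely illustrative:

    // Two nodes, each carrying key-value properties, joined by one edge
    CREATE (a:Person {name: 'Alice', city: 'Athens'})
    CREATE (b:Person {name: 'Bob', city: 'Berlin'})
    CREATE (a)-[:KNOWS {since: 2020}]->(b)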
Starting point is 00:08:07 When you put that structure together, then you can represent literally everything that you represent in other ways. So in a relational database or NoSQL database or tree store or key value store, you can represent exactly that kind of data in that graph. But then the graph gives you a different perspective on it, and it turns out it gives you some superpowers for working with that data. So the different perspective comes from the sense that in a lot of ways, graph data and graph structures feel like there's something different. But in truth, it's more. So you have data in a relational table. You have data in a key value store.
Starting point is 00:08:37 There's a relationship between those, in that relational tables are a bit more expressive. They're more powerful. You can join tables together. You can run queries across them. You can talk about the relationship between values there. So in the same way that there is a relationship between key value stores and relational tables, there's that same relationship
Starting point is 00:09:00 between relational tables and a graph. So they're actually on a progression that gets more and more expressive as you go further down that list. And so internally, we have this picture we talk about among our team. We call it the kite. It's just this kite-shaped relationship about different data stores, where if you start with a single value down at the bottom and you ask this question about when I have a whole bunch of those, what structure do you get?
Starting point is 00:09:30 You get a key value store type of structure. It's a list or a set, something simple like that. But what do I have when I have a whole bunch of those? Well, then it's a table structure or it's a tree structure, just depending on how you choose to represent it; those are equivalent. Okay. But what do I have when I have a whole bunch of those? Well, imagine a bunch of trees that overlap and intersect. That's when you're working with a graph. When those leaves in a tree, when one leaf shows up in multiple trees, then
Starting point is 00:09:59 you've intersected your trees and you're working with a graph. So there's this progression through a lot of traditional data storage and data representation technologies that gets more and more expressive as you move up that progression. At the end of that process, you arrive at a graph. And Quine got started in part because we kept that thinking going and asking that question to say, well, what structure do I have if I have more of those? And when you get all the way to the end and you've got a graph and you say, well, here's this soup of nodes and edges, all interconnected. I've got to treat that as one thing. When I've got a bunch of those, what do I have?
Starting point is 00:10:45 The answer is a graph. When you have a bunch of graphs, you have a graph. And so the fact that there's this mathematical pattern, even, that walks through the history of database evolution from simple stores, like key-value stores, to relational databases and tree-structured NoSQL data stores, all the way through to graph databases, that question of this mathematical progression through those things, it gets to a graph at the end and it stops. That was just so suggestive to us.
Starting point is 00:11:18 We thought we've got to explore this all the way. And so that was the thing that first got me hooked on graph data and graph structures. That alongside of working with some real practical problems around working in startups at the time. And we've got configuration problems about it's hard to get the product configured and out the door for all these different customers. Some things they have in common, some things that they have to be separate. Teasing all that apart and understanding how to define that customer configuration really drove in the direction of saying, it's got to be a graph.
Starting point is 00:11:54 We've got all these trees that are overlapping each other. So teasing apart kind of those practical questions also then steered in this direction of a graph. And so that's what initially led to some of the ideas behind Quine and then dove in to explore it all the way. Very cool. That's such a helpful analogy. I love the kite analogy and then sort of the spectrum of expressiveness. Now, some of the use cases for graph, I think,
Starting point is 00:12:23 are pretty obvious to anyone who's worked with data, right? So identity graph, where you have, you know, nodes and edges, and, you know, you're sort of representing, you know, relationships and all that sort of stuff. But where, so let's just, let's just say we have kind of, you know, like a standard, quote unquote, vanilla data store set up in our stack, right? So you have a warehouse and you're storing, you know, structured data in there and it's driving analytics and, you know, all the other stuff that you do with tables and joins, as you mentioned. Let's say you have a data lake and, you know, you have, you know, sort of like unstructured data storage there, maybe that's serving data science purposes, you know, et cetera.
Starting point is 00:13:01 Where does graph fit in? You know, for someone who's sort of working on a data team, like where does graph fit in? Because I think what's interesting to me, at least when you think about graph, like there are use cases across the stack, right? I mean, in software engineering, of course, you know, to represent relationships of users, say. But then also, to your point, analytics and, you know, sort of discovering meaning from data as well. So graph kind of can be a utility player in the stack. So help us understand where something like Quine or thatDot would, you know, would fit. Yeah. So like we were talking about before, this kind of spectrum of complexity or expressivity, you know, where you move from single value stores or key value stores all the way up through
Starting point is 00:13:55 relational and NoSQL up to graph structures. So there's one spectrum about complexity, so a graph helps you answer more complex questions. And what I've seen in the industry and what really led to Quine's creation is this other spectrum about data volume, like the speed at which data has to get processed. So batch data processing has been the standard forever and still is. Plenty of batch processing still happens. But increasingly, the world is moving more and more towards streaming data problems, which means real time, one record at a time, as fast as you possibly can. And it's never going to stop. It's an infinite stream. So that second dimension about the spectrum from batch to streaming, when you plot those two things together, you've got simplicity versus complexity, say, on your X-axis.
Starting point is 00:14:53 And you've got speed, or batch versus streaming, on your Y-axis. What we found is that, of course, what you want is the top right corner there. You want complex answers in real time. But there's this barrier between what we're working with in this realm and what we want to get to in that top corner. So you usually have to trade off how much data you're looking at, how much you hold in memory, how fast you can store it to disk and pull it back in again. A lot of just classic data processing and data storage questions that usually force architectural decisions into a compromise. So the compromise is either, well, we've got a complex question, so we're going to use an expressive tool like a graph database
Starting point is 00:16:00 to be able to describe and draw that complexity of some, some pattern that is at the heart of our critical use case, but if you want to do it fast, then, well, you can't really do that. Graph databases have been notoriously slow for decades. So they're cool, but too slow. So you get pushed in the direction of, well, let's stuff it into a key value store and let's build a big microservice architecture. Let's let that be the graph.
Starting point is 00:16:31 So as data flows through that architecture, it basically becomes this graph structured system of processes as that data flows through, to build the data pipeline, to try to build this custom bespoke one-off for-my-company data pipeline. And we're going to try to build that into the thing that can handle high volumes of streaming data, so all the way at the top, with complex answers to our question, so all the way to the side. And a lot of data engineers spend their life in that world and making those trade-offs. So, you know, we need it to go faster. We need to scale it. We need it to be horizontally scalable, faster, faster, faster. It never ends.
Starting point is 00:17:16 But we need to put all this data together. We have to understand it in context. We have to know how this piece of data relates to that piece of data and be able to look for patterns that are built up over time as that data streams in. So that's really where graph has fallen short in a lot of ways. For anybody who's working on data pipeline tools, a graph is the right representation for anything that starts to be complex, but they've been too slow to actually operationalize and productionize. So they've kind of been relegated to smaller toy problems in this space. But if you've got to build something real and big and significant, then you can't, they're just kind of off the table and you're going to have to go build this complex microservice architecture to simulate a graph built out of a bunch of key value stores for speed.
Starting point is 00:18:09 Super interesting. Okay, so this is, I've been hogging the mic from Costas, but one more question for me. Could you give us, as a follow-on to that, just a real-life example of where you know of a challenge, you know, and a real data stream that a company has and a problem they're trying to solve, and how Quine fits into, you know, an example architecture? Yeah, we've been having them coming out of our ears lately. The, uh, fraud tends to show up very often, very commonly
Starting point is 00:18:47 as some form of a graph problem. And it depends a little bit on what kind of fraud you're talking about. So authentication fraud for logging into a website, you know, there's lots of good tools and good products available for signing in. But when something doesn't quite go along the happy path, you've got, say, bad guys out in the world who are trying to guess the password or they have a password list and so they're going to do this distributed, this authentication attack, basically.
Starting point is 00:19:20 Well, that comes all in as one big flood and it just dumps on there. That's pretty easy. Rate limiting has that covered. You've got that problem solved. No problem. But attackers are smart and they're going to keep moving and getting smarter. So now they know: slow it down, you know, and spread out your attack, have it come from multiple sources in multiple different ways, multiple different regions, lots of IP addresses. Each one of those factors then lets them hide in the deluge of other data that's happening behind the scenes.
Starting point is 00:19:53 And so to find that kind of authentication fraud of someone who's trying to gain control of, say, a particular account with special privileges, if they gain control, it's a big loss, big problem. So attackers are motivated, but to detect and stop that means you have to start assembling the graph about all these attempts that are happening over time, where they came from, who they're targeting, and these failed attempts that fail, fail, fail, fail, fail on this one particular account, and then they succeed.
Starting point is 00:20:27 And that pattern shows up in this graph structure as you start connecting in each of those authentication events. You start connecting them together, and it forms this often beautiful looking graph that then just leaps off the page as obviously here's the pattern of a password spraying attack coming from so many different angles, low and slow so that it's not triggering the easy alarms. But then when you piece together all those attempts, you can clearly distinguish them from the real user and pull those out. And if they gain access, you can find it, see it and stop it right away. So that's, that's one example of, you know, a fraud use case.
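As a rough illustration of the pattern Ryan describes, here is a hedged Cypher sketch. The node labels, property names, and thresholds are hypothetical, not Quine's actual schema; it simply shows how many failed attempts from many sources, followed by one success against the same account, become a single matchable shape:

    // Many failed logins from distinct IPs against one account, plus a later success
    MATCH (ip:IpAddress)<-[:FROM]-(fail:LoginAttempt {outcome: 'failure'})-[:TARGETS]->(acct:Account),
          (win:LoginAttempt {outcome: 'success'})-[:TARGETS]->(acct)
    WITH acct, win, count(DISTINCT ip) AS sources, count(fail) AS failures
    WHERE sources > 20 AND failures > 100
    RETURN acct, win, sources, failures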
Starting point is 00:21:13 We've seen other kinds of fraud use cases as well, like transaction fraud. So as money's being spent, financial institutions have to do something similar about considering other factors. What's the geography? What are the other most recent purchases? Who signed for it? All sorts of different kinds of factors. What's the category of the purchase? All these other sorts of factors that start relying on other values and other kinds of information that isn't so readily usable with a lot of other tools. But when you put it into a graph, it draws this picture, which then leaps off the page and tells this very clear story that right now, here's a case of someone who is using
Starting point is 00:21:56 a stolen credit card number to buy something. Super helpful. All right, Costas, please take it away. I've been hogging the mic. Thank you, Eric. That was super interesting, Ryan, to hear all the stuff you got to share about the ideas behind Quine.
Starting point is 00:22:19 I would like to start by asking you something about graphs. And I want your opinion on this: we can approach graphs as a way to ask questions and as a way to represent data, right? But you don't necessarily need to have both. Like you can have, let's say, a graph-based language for querying the data and translate that into a relational model on the back end, or even a key value store. We've seen quite a few companies doing that with GraphQL and having something like Postgres behind. Correct?
Starting point is 00:22:57 Yeah. And then obviously, you can, let's say, use graphs as a way to also represent data at a much lower level, in terms of the database system that you have. How do you see that? And which part, for you, is the important one? Or is it probably both? That's part of the why. Yeah. So, an interesting question, and there's probably
Starting point is 00:23:38 two sides to the answer. I think the preview is: graph applies to both, but in different ways, maybe significantly different ways. In one sense, the way that you ask your question can be thought of as a graph. Whether it's GraphQL or otherwise, there are query languages set up for graph queries: you can use the Cypher query language, the Gremlin query language, and there's an initiative to try to create a standardized graph query language, because ways to express your problem very naturally fit into a graph. Because that's really how humans think. A lot of what we think about, and just the way language works, the way our mental model of the world works, is reflected in this node-edge-node
Starting point is 00:24:34 pattern repeated and connected, because that same pattern shows up in our language. So subject, predicate, object: Ryan knows Costas. I'm a node. You're a node. Our relationship is that edge that connects us. So that's how you build the social graph. Users get connected together by who knows whom.
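In Cypher, that subject-predicate-object shape is nearly literal. A small illustrative query over a hypothetical social graph:

    // Subject-predicate-object: which people does Ryan know, and whom do they know?
    MATCH (ryan:Person {name: 'Ryan'})-[:KNOWS]->(friend:Person)-[:KNOWS]->(foaf:Person)
    RETURN friend.name, foaf.name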
Starting point is 00:25:00 And you can frame questions as a graph using graph query languages, or even just more naturally using natural language, which is naturally graph shaped. So on one side, there's a good argument and analogy for graph structured questions because they fit the way that we talk. On the other hand, when you go to write programs and write software, that's where the software engineers need to turn it into something linear, some code that we can execute. And there's other good reasons to then head in a different direction on that question
Starting point is 00:25:36 asking side. So it's not exclusively that. And even then, a lot of graph queries turn into something that looks like SQL, like in the Cypher language. It's very SQL-like. So you're writing out this linear progression of a query. It's going to query a graph, but you're expressing it in something that is this linear structure. So the question side of it kind of comes in and out and it's maybe more
Starting point is 00:26:05 conceptual than it is literal for the graph view of the question asking side. What we found is that really the secret sauce for Quine, and what makes it different and special, is in the graph runtime. You know, I mentioned before that graphs, and graph databases especially, have been known to be really slow and lethargic in terms of what they can handle, and that limits their application and how often developers dive in and use them. And I think because of that, there's developed this sense in the industry that graphs are slow. If you want it to be fast, you've got to use some other tools.
Starting point is 00:26:46 Graphs themselves are slow. What we found through years worth of research is that that's kind of an artifact of the graph database mentality, that data is going to be primarily stored. It's going to sit there at rest, and you're going to bring a query to it and occasionally pull out an answer. So that database mentality about static data stored on disk has been a limiting concept for what graphs can be.
Starting point is 00:27:15 And Quine, the reason that it lives in between two worlds, between that database world and the stream processing world, is because really at its heart, we built Quine asking: what if we didn't automatically adopt all the database assumptions? We're going to have to confront the same fundamental challenges, but what if we started from a different place? Let's take that graph data model and let's marry it to a graph computational model. So for us, that means an old idea from the 1970s, the actor model. So this asynchronous
Starting point is 00:27:48 message passing system that has really become the foundation of stream processing for building reactive, resilient systems means having these distributable, scalable actor implementations that allow you to send messages as a way of doing processing, and representing data in that system as well. So Quine builds the graph where nodes get backed by actors under the hood. So they're independent processes; you could have thousands, hundreds of thousands, millions of them running in a system. You could have, you know, a lot of them moving. You can distribute them across clusters. And all communication happens through asynchronous message passing.
Starting point is 00:28:37 So with that as the framework, that much more stream processing kind of approach lets us then incorporate other stream processing considerations, like back pressure, into how queries are executed and how data is processed. So building a graph system that isn't automatically adopting the old school ideas of a database, but is built for the modern world of high volume streaming data as the first class citizen we're really trying to work with.
Starting point is 00:29:11 So on the back end, what we found is that when you bring that kind of new approach to the runtime model, it can unlock some pretty stunning efficiencies. It gives us the opportunity to do graph computation at just the right moment. The ideal moment when your data's in memory, it's already there. And then it turns out the graph structure behaves like an index.
Starting point is 00:29:41 You've got a node here and you want to know, well, what data is related to this? What should I warm up in the cache so that it's ready to go for high volume data processing? Well, a very natural answer to that question is warm up the nodes that are connected to the one you're talking to. You know, the one you have an edge connecting to this node in question, warm up its neighbors. That's super interesting. Okay, so if we have, I mean, a key value store, I think it's pretty clear what the data model is there. You have keys and you have values, right? Okay, things can get a little bit more complicated if you also allow some nested data there. But I think it's pretty clear to everyone's mind of how you model something like that,
Starting point is 00:30:34 how you create a key value store, how you design it. Pretty much the same also with a relational database. You have the concept of a schema, you have the table, the table has columns, the column has a specific type. And then you think in terms of relationships and how one table relates to the other, blah, blah, blah, like all that stuff. Okay. So there is, I think, pretty common knowledge of how someone can design a database to drive an application or do some analytics. How does this work with Quine? Like let's say I want to use Quine and probably I already have some data sources that already have a schema, some of them might be relational, some of them might be, you know, like, hierarchical, like, documents or whatever. What do I have to do to define this graph, right?
Starting point is 00:31:33 Or can Quine figure it out on its own? Like, how does it work? How do I go from all this different data that I have out there to a consistently updated graph on Quine, with specific semantics that I understand? Yeah. So there's a two-step process to using Quine. And let me kind of preface that by saying where this will fit is in the pipeline, in the data pipeline. So as data is moving through your system, through your data pipeline, Quine plugs into Kafka on one side and then plugs into Kafka on the other side. So it kind of lives in between two Kafka streams
Starting point is 00:32:11 as being really a graph ETL step. So to help take what's in that first stream, combine it, understand it, express what you're looking for, and then stream out what will hopefully be a much lower volume, but higher value set of events coming out of Quine and into the next Kafka topic. So to use Quine is to first answer this question of about where's my source of data, plug it into a Kafka topic or a Kinesis topic or Pulsar, some streaming system that is going to deliver this infinite stream of events, plug it into that and aim to stream out meaningful kind of distillations of what's coming through your pipeline. So the detection of a fraud scenario or an attack graph from a cybersecurity use case, or the root cause
Starting point is 00:33:07 analysis of your log processing or whatever the use case may be. So those sources of data get built into a graph with the first step. The first step is defining a query that takes every record and builds it into a small little subgraph. So imagine one JSON object comes through that stream. That JSON object has a handful of fields in it. Maybe they're nested fields. That's fine. The first step in using Quine is you write a Cypher query that says,
Starting point is 00:33:39 I'm going to take in that object and I'm going to build it into the small little subgraph. It's like a little, you know, picture a paint splatter or something, you know, there's a node in the middle and maybe a couple of things coming off of it. Yeah. That tends to be the shape that it gets built into because we can take that JSON and we could just say, make one node out of it. So take that JSON, make one node. But the value in a graph is when you start
Starting point is 00:34:05 connecting it into other data. So the question is, we've got that JSON object stored as a node, a disconnected node in a Quine graph, but what do we want to start intersecting it with? Pull off some fields from that JSON object. If there's an IP address in there, pull that IP address off and use it to create an edge to another node where that other node represents the IP address. That way, if you end up with two JSON events somewhere in your stream that both refer to the same IP address, they both get connected to the same IP address node. Maybe same thing with URL or username. You can fit it into a hierarchy of timeframes, so that there's this progression of events
Starting point is 00:34:50 that fit into this hierarchical time representation. So that first step is to take that JSON object and pull off fields to build it into a subgraph so that you can intersect it with other data.
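A sketch of what that first step can look like. Quine's ingest queries are written in Cypher, with the incoming record bound to a parameter (commonly $that) and an idFrom-style function used to derive deterministic node IDs; treat the exact parameter name, the idFrom signature, and the event fields here as illustrative assumptions rather than a verbatim Quine API:

    // For each incoming JSON record ($that): one event node, linked to a shared IP node
    MATCH (event), (ip)
    WHERE id(event) = idFrom('event', $that.eventId)
      AND id(ip) = idFrom('ip', $that.srcIp)
    SET event = $that,
        ip.address = $that.srcIp
    CREATE (event)-[:FROM_IP]->(ip)

Because the derived IDs are deterministic, every record mentioning the same IP address lands on the same IP node, which is exactly how two events end up intersecting in the graph.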
Starting point is 00:35:29 So if I understand correctly, Quine does not store the data itself. It processes input data that's coming from a stream and outputs the results to another stream, right? Is this correct? Great question. Almost. Quine does store data, but it does so using existing storage tools. So there's a persistence layer built into Quine where you can choose how it's going to store data. So locally in something like RocksDB on your local network
Starting point is 00:35:54 or in a managed cluster using Cassandra, there's a pluggable interface to choose any of several different storage technologies. As Quine is building up the graph, the value comes from intersecting events that happen over time in your stream. It's been an unfortunate limitation in stream processing to say that, well, if you want to join events, you know, through an
Starting point is 00:36:19 event stream, you'll have to hold them in memory. And so you'll have to set a retention window, a time window, for how much data you're willing to hold on to. So if you're trying to match A and B, you know, A arrives first and you have to hold on to it and wait and keep consuming the stream looking for B. When B finally arrives, you can join it to A. That's great. What about when you have a C and a D and
Starting point is 00:36:46 an E and you need to join more things together, your problem gets a lot more complicated. And if you have to hold all of that in memory in order to be able to join them together, then you're forced to use expensive machines with huge amounts of RAM and to set some artificial time window that says our RAM is limited. We can only hold so much data there. So I'm going to hold on to A for 30 seconds, a minute, 10 minutes, 30 minutes. But if B doesn't arrive in that timeframe, I just let A go. And if B arrives after that timeframe, I missed it.
Starting point is 00:37:27 We just missed that, you know, record. We missed that insight into what we were trying to understand, because we had to force this artificial time window which, for operational reasons, you know, was the state of the art. So Quine tries to solve that by using this storage layer under the hood, so that you can manage your data storage separately, in known fashion, known quantities. You can use robust tools like Cassandra or ScyllaDB. And you can set TTLs so that you can expire out old data. Or you can hold on to it in whatever way works best for your application. But joining data over time in that stream in a way that is then robust, durable,
Starting point is 00:38:11 and can help overcome this unnatural time window limitation. So that we can find A and B and C and D and E when they all arrive more than 30 minutes apart or whatever your time window is. So this becomes really important for some use cases like detecting advanced persistent threats in the cybersecurity world. So attackers who are deliberately spreading out their attack over a long period of time so that they can hide in the high volume of data that's coming. Yeah. So just to make sure that I understand correctly, you have Quine that, let's say, calculates and updates a graph in memory, and it persists the state of this graph to a storage layer that can be anything. I don't know, like RocksDB or like Cassandra
Starting point is 00:39:06 or something like that. So how do you, I mean, how do you store a graph-like data structure to a storage layer that obviously has not been optimized for that, right? Like it's something different. It can be like a key-value store or it can be a relational database.
Starting point is 00:39:27 So how do you do that? Yeah, so one of the interesting things that Quine does, which I haven't seen in any other, definitely in any other graph system and a lot of other data systems too, Quine uses a technique called event sourcing to store data.
Starting point is 00:39:44 So what it actually stores on disk is not the materialized state of every node in the graph, but it stores the history of changes. So when a new record streams in from Kafka and we've got to go create some structure in the data, that flows to a point in the graph. A node in the graph handles that and says, I need to set these properties. Well, those properties are already there. If they're already set, then it's a no-op, because we don't have to store anything on disk if we're setting foo equal to bar, but foo is already equal to bar. So there's no change. There's nothing to do. There's nothing to update. It's only when setting foo equal to baz, when it used to be equal to bar, that we've got an update. That update gets saved to disk in the event sourcing fashion, which then lets us write a very small amount to disk so that we can keep up with a high volume of data that's streaming through.
Starting point is 00:40:46 So we just write the changes. Many times those changes can be duplicates or no-ops. And so we can reduce what we save. And when data is stored, it gets saved in a fashion that kind of resembles a write-ahead log. You know, it's a small little write that can be done very quickly in a simple structure that looks like a key value store. So we can take advantage of the high throughput that you get with key value stores like Cassandra and others to have real high volumes of data moving through, but then building it together into a graph that gives you that expressivity for answering complex questions. And that time dimension is another interesting angle here too. Because we saved the log of changes, we can actually go rewind the entire graph if we need to and say, here's a question: I'd like to run this query and get an answer for the state right now, but also tell me what the answer would have been 10 minutes ago or a week ago or a month ago. Same query, just add a timestamp, and Quine will give you the answer for what it used to be at that historical time.
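For instance, a query like the sketch below could be run once against the present and once against a past moment. In Quine the historical timestamp is supplied alongside the query rather than inside it (an at-time style parameter on the query API; the exact mechanism is an assumption here, so check the documentation):

    // The Cypher stays the same; only the externally supplied timestamp changes
    MATCH (acct:Account {id: 'acct-42'})<-[:TARGETS]-(attempt)
    RETURN count(attempt) AS attempts
    // ...run with atTime = now, then again with atTime = one week ago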
Starting point is 00:41:40 And so whenever you restart Quine, do you have to go and recreate the graph? How does this work?
Starting point is 00:42:15 So the graph gets stored durably. We have that log of changes for every node in the graph that's saved to disk. And then for optimization, snapshots get saved as well. Oh, okay. So that kind of shrinks what has to be replayed. But on demand, when a node is needed for processing in the stream that's flowing through, that node goes and wakes itself up. So it has actors behind the scenes that can take independent action. So they go fetch their journal of changes and replay that journal or restore a snapshot so that that node is ready to go
Starting point is 00:42:51 and handle the incoming message for whatever time period it's meant to represent. Mm-hmm. That's awesome. And can you share a little bit of information about the performance and how well Quine scales? Yeah. So we've been trying to find the limit and we haven't found it yet.
Starting point is 00:43:12 So in my experience, trying to, you know, I mentioned before I've led some DARPA funded research projects that were aiming at building big graphs to analyze for cybersecurity purposes. And what we've kept finding was that when we used all the other graph systems out there, that if you're trying to stream data into a graph database, that you can run anywhere from maybe like 100 to up to about 10,000 events per second, kind of max. And, you know, there's, there's been some iterations on graph databases, you know, to, to get, get to that 10,000, but there's still kind of this limit at that level. And it's, and the problem gets harder when you add in a combined read and write workload.
Starting point is 00:44:02 So we're not just writing the graph. We're also trying to read out the results in real time and publish them downstream. So in our experience, we tried every graph system out there. Some of them lose data along the way, which is just not good. Others just kind of have this natural limitation on their throughput that kind of caps out at 10,000-ish events per second; it depends on the use case. When we were deploying Quine in an enterprise environment, we had a customer come
Starting point is 00:44:33 to us and say, our problem begins at 200, 250,000 events per second. So if you can show us something running that fast, then we can talk. So we stood up Quine, ran it at 425,000 events per second on a cluster that costs $13 an hour. Oh, wow. That's impressive. So since then we've kept going past that and we've done over a million events per second.
Starting point is 00:45:03 And that's a million events ingested per second. And, apologies, this is kind of a long way around to answering your question, but we haven't gotten to the other fact that to get data out of Quine, you'd set a standing query to monitor that graph
Starting point is 00:45:22 looking for as complex a pattern as you'd like. You know, a three node hop, five, 10, 50. It doesn't matter. As large and complex a pattern as you like, you know, conditions and filters and whatever you need to express here. Looking for a pattern in that graph, we're doing that reading and monitoring at the same time that we have this write-heavy workload. All the numbers that I was talking about, getting up to a million events per second and beyond, that's a million events per second ingested while simultaneously monitoring the graph for complex patterns and streaming out results. So it turns into the equivalent of another,
Starting point is 00:46:08 depends on your pattern, but if it's a five node pattern, you're probably doing something like another 5 million read queries per second. The equivalent of that, we're not actually doing that. But the equivalent of that, that sort of complex querying of your whole data set
Starting point is 00:46:24 in order to find every change, every update, every new instance of the complex pattern you're looking for. So those stream out of Quine using something that we call standing queries. It's like a database query. And you just say, here's what I'm looking for. Every time you find it, here's the action I want you to take. So I've got to publish it to Kafka or log it to, you know, standard out or save it to disk or even use it to pull back into the graph and update something else. So that standing query then, you know, triggers that output and kind of feeds it to the next system.
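A hedged sketch of the shape of a standing query. The pattern itself is Cypher; in Quine it is registered together with an output action (publish to a Kafka topic, log it, write back into the graph), and the labels below are hypothetical:

    // Standing query pattern: emits a result each time a new match appears
    MATCH (fail:LoginAttempt {outcome: 'failure'})-[:TARGETS]->(acct:Account),
          (win:LoginAttempt {outcome: 'success'})-[:TARGETS]->(acct)
    RETURN DISTINCT id(acct) AS compromisedAccount
    // ...paired with an output action such as "publish each result to the next Kafka topic"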
Starting point is 00:47:01 And in terms of query concurrency, I mean, how many standing queries can someone run on the same graph to create value at the end? So we've deployed with hundreds of standing queries so far. And I know this was something our team was working on recently because there's some low hanging fruit to even carry that number up a lot higher. Okay. This is one of the reasons for open sourcing this project: to kind of show off this interesting work that's been done here and also just kind of, you know, get input and insight from any community members who want to get their hands dirty and look at the code
Starting point is 00:47:53 and say, you know, hey, here's a little tweak that we could, you know, make to improve the efficiency over here. Yeah. That's amazing. Okay. One more question, which is a little bit less technical, but I think it's quite interesting, like from a product perspective. So you mentioned that, okay, like thinking in terms of graphs and asking questions
Starting point is 00:48:14 in terms of graphs is a very natural way for a human to do it, right. But at the same time, I mean, when it comes to querying data, SQL has been around since forever. Everyone is trying to kill SQL, but you know, it's like a cockroach. Like it never dies. It's still there. Like no one can kill it. Like a quite well known investor, Martin Casado, said once that you can bet as much as you want against JavaScript and SQL,
Starting point is 00:48:45 but at the end you are going to lose. Yeah. They have proved themselves again and again and again; they cannot get killed. So how do you get people to start using a different query language? And how do you educate people? And how big of a problem is it also when you're trying to introduce a new paradigm for interacting with data? Yeah, great question.
Starting point is 00:49:11 So SQL, like the reality is that SQL is literally the most known, most used language on earth. It's even more so than JavaScript. Yeah. Everybody knows SQL and it won't die. It won't go away. It's our declarative data language. So it's kind of how you work with so many data tools. For us, what that means is maybe twofold. So on one hand, we've implemented Quine to use actually two different graph query languages. So Gremlin is an early graph query language.
Starting point is 00:49:55 There's some limited support in Quine for using Gremlin, but it's kind of a tricky language. Cypher is something that Neo4j created. It's meant to be much more SQL-like. You know, its syntax, its use of keywords. You know, it looks and feels very SQL-like. So it's not that different. The main difference is that instead of your SELECT statement, in Cypher you use
Starting point is 00:50:24 the MATCH statement, and then you draw ASCII art. Your ASCII art kind of draws pictures of nodes connected by edges, you know, and that gives you that connection to a graph structure. Our team has actually done some additional research, which maybe I shouldn't say this, but we've done some work toward actually eliminating the need for graph query languages in general.
Starting point is 00:50:50 So the ability to use SQL directly on a graph, there's no reason they can't compile down together and our team has figured out how to do that. So it's not something we have released yet, but it's an interesting forthcoming update. Awesome. Looking forward to that. All right.
Starting point is 00:51:11 I think I really monopolized the conversation here. So, Eric, all yours. Yeah. I think that's just how the show works. I monopolize, you monopolize. Yeah, true. And then the show's over. Yeah.
Starting point is 00:51:24 Okay, Ryan, let's talk about... You know, what's interesting is I'm thinking about the team that would operationalize Quine, right? And, you know, potentially it involves, you know, sort of a lot of different players,
Starting point is 00:51:49 you know, from different sides of the technical table. What does that team usually look like? You know, cause you're, you know, you have like sort of the core, you know, you're deploying something, you're connecting it to, you know, existing pipeline infrastructure, streaming infrastructure, you are setting up, you know, a key value store that, you know, that manages the log, right? You're querying that, right? So can you just explain the ecosystem of the team that would both implement and then operate Quine on an ongoing basis? Yeah, sure. So what we've usually seen is really three roles.
Starting point is 00:52:27 And sometimes these three roles are occupied by a single person. Sometimes they're each occupied by separate teams, just depending on the scale. So one is the architect, you know, the person responsible for kind of the big picture. How do these things relate? How does it flow together? You know, how does the system scale? How does it connect into the rest of the data pipeline? So the architect working at that level to, you know, see the structure that Quine fits into. Next is the subject matter expert. So the person who says, here's my Kafka topic or my Kinesis topic.
Starting point is 00:53:08 It's got data in it. I know what that data means, and I want to turn it into this kind of an answer. So that's the person who is writing the Cypher queries, understands how data gets built into a graph, how that graph gets monitored with standing queries to turn it into a stream of answers coming out. And then the third role is really the operator. So the operations team or the data engineering team who is responsible for standing up and keeping the system running over time in the long run. And so those three different roles, as I mentioned, sometimes they're occupied by one person and they wear three different hats in those roles.
Starting point is 00:53:49 But a lot of times we see, especially at larger companies working at scale, they tend to be separate concerns and occupied by separate people. Yep, makes total sense. And then, oh, sorry, go ahead. I was going to say the enterprise version of Quine is something that thatDot, the company behind Quine, supports. And that's aimed at trying to help address the concerns that those teams have in a real high volume enterprise situation. So, you know, you want to run a cluster, you want that cluster to be scalable and resilient.
Starting point is 00:54:24 So running a cluster, members are going to die. That's just how it's going to happen. You need to have that system running in a way that can be resilient to failure, can be scaled over time, can coordinate with and scale with the data storage layer behind the scenes. So that's really where the commercial work for thatDot tends to focus. Got it. Very helpful. And then last question here, because we're getting close to time, but we talked a lot about sort of why graph, you know, or why, you know, Quine, what problem set does it
Starting point is 00:55:07 fit, and talked through a couple examples. Let's talk a little bit about when. So some of the examples that you mentioned, and maybe especially some of the scale requirements, you know, 250,000 events per second, sound like very enterprise level problems, right? Like high scale enterprise level problems, which makes sense, right? A lot of sort of emerging technology, you know, this is a pattern that we've seen over and over again, you know, emerging open source technology will sort of be built to solve these large scale problems where, okay, well, graph had a, you know, a limit of 10,000 events per second. And so it really wasn't a viable solution for sort of these like high
Starting point is 00:55:51 scale systems. And so technology will emerge to solve this problem that existing technology couldn't. Do you see that trickling down? And is there a use case, you know, for smaller companies that aren't at large scale? And so maybe that's the question, not necessarily why graph, like that makes sense, but maybe when graph. Yeah, great question. It's, we've started seeing this and I think this is part of the reason why there's so much buzz around the open source community is that even in a small scale, we've seen users taking Quine and using it as
Starting point is 00:56:28 part of their data pipeline to put some of the expensive interpretation step ahead of their expensive tools. So you're streaming data and it's headed to Splunk. And, you know, maybe you don't have a million events per second, you know, going into Splunk, or else you'd be spending billions of dollars a month on your Splunk bill. But, you know, you've got data flowing through and downstream, it's going to go through some expensive analysis. Maybe it's some complex machine learning that has to happen. Maybe it's being loaded into Splunk or some expensive tools downstream. We've seen users put Quine into their pipeline to do some of the early processing that can, you know, take some of those expensive downstream analysis steps out, basically.
Starting point is 00:57:21 Yeah. Do them upstream so that you don't have to pay for it downstream. And a lot of times, those aren't millions of events per second kind of use cases. You know, they're at more reasonable scales, in the hundreds or thousands or tens of thousands kind of scale
Starting point is 00:57:40 that let you put the pieces together, understand what you're working with, reduce your data, and have the reduced amount of data be more meaningful so that you can either more effectively or more cheaply and efficiently do the downstream analysis that you need.
Starting point is 00:57:59 Yeah, super interesting. And is that... I know there are several factors there. It's almost like a first pass filter, you know, or sort of, you know, it's like a pre-compute before it gets to the expensive compute layer. And how much of that, it sounds like there's two sides of that, right? So you said you were running, you know, 450,000 events per second at $13 an hour. So there's an infrastructure, architectural advantage where, because of where the system is stacked and the way that it's architected, it just sounds inherently that you're sort of realizing low cost at scale.
Starting point is 00:58:44 But on the other side, that's due to the nature of graphs, right? So like the graph, you're really leveraging the power of graphs as the pre-compute and it just so happens that the architecture runs very cheaply. Is that the best way to think about it?
Starting point is 00:59:00 Yep. Yep. Completely agree. And that step to assemble your data, you know, like we described, it's like one record at a time, but you spread it out for the sake of connecting it. And once it's connected, you can see all these meaningful patterns of the kinds of things that we just naturally talk about that get built over time. And so that's why a lot of the time we see the total volume of data go down, because you can have a lot of things come in, a lot of partial patterns get built, but not necessarily be what you need to analyze. And so we like to say high volume in, high value out. The goal is to reduce how much data comes out, but have it be more meaningful, more understandable, more important. Yep. Makes total sense. All right. Well, we're at time here, but before we hop off, if our listeners want to learn more about
Starting point is 01:00:01 Quine or thatDot, where should they go? Quine.io is the home of the open source project. So there's documentation on there, getting started tutorials. You can download the executable. It's packaged up as a jar or Docker image, a handful of different formats. So go check out Quine, pull it down. It's free.
Starting point is 01:00:26 See if it can help in your data pipeline. If that helps solve an important problem for you, or if you have questions about it and you want to share your questions or share your story, there's a Slack community linked from Quine.io that you can join and ask questions or kind of share stories about what's going on. And if that helps solve an important problem and you want some help in scaling that, or commercial support for it behind the scenes, thatDot.com is the company behind it. The creators of Quine and our team of excellent engineers
Starting point is 01:00:59 and others who are working towards supporting that. Awesome. Well, thank you so much for the time, Ryan. I learned a ton and it was such a great conversation. So thank you. It was a pleasure to be here, Costas. Eric, thank you very much. That was a fascinating conversation, Costas.
Starting point is 01:01:15 I think, you know, I don't know if I have one major takeaway. The volume in events per second, a million events per second is pretty wild. So that's, you know, that was certainly a takeaway. But I think the larger takeaway, we didn't necessarily discuss this explicitly, but what I'll be thinking about is we've talked about batch versus streaming on the show, actually a good bit. And what strikes me about Quine is that it is fully adopting streaming architectures, right? And it's assuming that the future will operate on primarily a streaming architecture, which is pretty interesting. And that architectural decision for Quine, I think, says a lot about the way that the people who created it, you know, see the future of data.
Starting point is 01:02:08 Yeah. I think it's also a more natural paradigm for the kind of stuff that you're doing with graphs. Because graphs naturally arise in systems where you collect events and interactions in general. And these tend to be streaming in nature. So I think it makes a lot of sense. I think what I found extremely interesting with the conversation with
Starting point is 01:02:34 Ryan is that it feels to me, at least, that they have managed with Quine to figure out exactly the right type of problem that makes a lot of sense to solve with both streaming and graph processing. It's, let's say, complementary: almost like adding a streaming layer, sorry, not a streaming, but a graph layer on top of whatever you already have, rather than having yet another monolithic database system that's trying to do everything around graphs. So that's what I found extremely interesting. And I'm obviously very curious to see where Quine will get. But this conversation made me very excited about the progress of this project.
Starting point is 01:03:38 And especially because of this like product, let's say, decisions that they've made. Sure. It's very cool. All right. Well, thanks for joining us on the show. Subscribe if you haven't, tell a friend, and we'll catch you on the next one. We hope you enjoyed this episode of the Data Stack Show.
Starting point is 01:03:56 We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.
