The Data Stack Show - 55: Tables vs. Streams and Defining Real-Time with Pete Goddard of Deephaven Data Labs

Episode Date: September 29, 2021

Highlights from this week's conversation include:

- Pete's background in data engineering and capital market trading (2:10)
- Comparison of the tooling from 2012, when Deephaven started, with that of today (10:30)
- Taking a closer look at defining real-time data (19:47)
- Getting non-technical people, clients, and developers all on the same platform (36:11)
- Deephaven's incremental update model (40:25)
- Kafka, timely dataflow, and Deephaven (44:22)
- Use cases for Deephaven (51:52)
- Going to GitHub to try out Deephaven (1:02:43)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data. RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack, visit rudderstack.com.

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by Rudderstack, the CDP for developers. You can learn more at rudderstack.com. Welcome back to the show. Today, we're going to talk with Pete Goddard of Deephaven. Deephaven is a really cool data technology, real-time streaming. It's just really cool. I can't wait
Starting point is 00:00:40 to dig into the technical details. But my burning question, I'm really interested. So Deep Haven was built by a group of really smart people who came out of the finance industry. And I'm just really interested to hear about that. I mean, based on some of my past experience, like people from finance tend to be really good at technology if they're passionate about it in terms of software development and building things. And things get pretty complex in finance. So I'm just, I'm really curious to hear the backstory. How about you, Costas? Yeah, absolutely, Eric. That's one of the things that I definitely want to hear about. I think it's also like the first time that we have someone who has experience in the finance sector. So
Starting point is 00:01:22 that's going to be very interesting to learn a bit more about how technology looks in this sector. And yeah, I'm really interested to learn more details, technical details, because I know that the guys in this sector, they are like very obsessed with performance and optimizations. So I'm looking forward to see what the special shows are on Deep Haven. Awesome. Well, let's jump in and talk with Pete. Let's do it. Pete, welcome to the show. We're incredibly excited to learn about you and about Deep Haven and all sorts of awesome data stuff.
Starting point is 00:02:01 Hi, guys. Thanks for having me. This is a real treat. Cool. Well, let's give us just a brief background on who you are and where you came from. Sure. I'm an engineer from the University of Illinois. I bounced around a little bit in the Midwest and London. I went right from school to, believe it or not, a capital markets trading company in Chicago. This was at a time when it was a little bit unusual to do something like that. I traded derivatives for a bit. And then for since 2000, I've more or less lived at the intersection of trading and quantitative trading, algorithmic trading and system development. So think of risk management teams and quantitative teams and computer science teams all working together.
Starting point is 00:02:45 That's really where I've lived. And then I founded a hedge fund that was based on that around 2004, 2005, and spun a piece of technology out of there in late 2016 with some founding engineers. And that's what I'm here to talk to you about today. It's called Deep Haven. Very cool. And I want to hear about the origin story because I just loved reading about that before the show. But before we go there, what does Deep Haven do?
Starting point is 00:03:15 Deep Haven is two things. It's a query engine that's built from the ground up to be first class with real-time data. So real-time data, both on its own, as well as in conjunction with static data or historical data, things that one might consider batch data. So that's the first thing. It's really that data engine. But then Deep Haven also is a series of integrations and experiences that come together to create a framework that
Starting point is 00:03:48 make people productive with that engine. These things are important because as you move real-time data around, as you move updating data around, the infrastructure to support that in the world is not as robust as with static data. So we had to build a lot of that tooling ourselves. Very cool. And just for the listeners, when we talk about, and we can get into the product more deeply, but when we talk about real-time data, updating data, moving data around, one thing that's interesting is we think about just all the different context tooling, et cetera, on the show. A lot of tools will focus on a particular type of data. Is that true for Deep Haven? Or are you agnostic and it's just you'll move any type of data? Is there a particular
Starting point is 00:04:36 type of data that Deep Haven excels for? Sure. So we came from a capital markets background, as I suggested. That's sort of where I lived. We built this tool originally for ourselves, though certainly it has evolved a lot. It was always developed, and I think today it certainly stands as mostly agnostic about data and use cases. It can certainly support unstructured data, but it really excels at either semi-structured or structured data that is either updating or historical and static. So the way I think maybe easy ways to think about data in regards to that is where does it live and how can you get at it? Certainly you can get data into memory via all sorts of methods and we can talk to that, but there's known technologies and known formats for data. And for us, that means parquet certainly is important, all sorts of files, as you can imagine. And then the real-time stuff
Starting point is 00:05:40 is we're in the world of Kafka and Red Pandas and Solace and Chronicle queues and things of this nature. So when we think of data, we're thinking about those formats and how do we interact with them, not necessarily what are the specific use cases that business users and their developer and quantitative partners are delivering on top of it. Yeah, absolutely. Well, I want to dig into the technical stuff. I know Costas has tons of questions, but really quickly, I'd love a quick story on the background. So reading about your story and the birth of Deephaven, it was interesting to me. I used to work in the education space at a company that taught software development. And a trend that we noticed was that people who came from quantitative finance really
Starting point is 00:06:38 made incredible software developers. And maybe in another show, we can discuss the reasons behind that because you probably have way more insight into that than I do. But when I read the origin story and looked at the platform, I thought, oh, that makes total sense, actually. Yeah, of course, these are the people who could build extremely high-scale real- time or infrastructure to support like high scale real time and all the use cases surrounding that. So you just want to give us a brief background on how did you go from being in a hedge fund to being a software entrepreneur? Sure. Thanks so much. And it is interesting. And I think in regards to the people that you cited or the type of people that you
Starting point is 00:07:26 cited, I think I've just been very lucky, frankly, in that I've only worked in that space. And I think I've been blessed to have very high quality engineers and really good people to work with. So I certainly echo your sentiments on that one. In regards to the origin story, chronologically, it goes something like this. In 2004, 2005, I was approached by a couple of friend of mine that I, to say I respect it is probably an understatement. I admired them to start a trading company with them, a quantitative trading company, a market maker, actually, at its core. We spent five or six years really just in more or less high frequency trading, most of it in the derivative space, the option space, as opposed to stock or futures or something like that. In addressing problem, that problem set invariably you have to build a team
Starting point is 00:08:19 that is very good at a number of things, certainly quantitative modeling, some sort of machine learning or predictive sciences, let's say, as well as just system building, both from a fast software perspective, as well as from a good technology, think hardware and networking perspective. So we assembled a team and were really in that space for six years or so. It's an interesting industry to be in because it's a bit of an arms race where you think you've built something that's kind of interesting and has an advantage, but you know that your advantage won't last very long. So in 2012, the partners and I decided that we should diversify, that we wanted to get into more scalable strategies. We wanted to move towards different horizons of prediction and
Starting point is 00:09:15 different business evolutions, let's say. And so we looked at, we defined what we would need for that business. Just like anyone, just like anyone else. So I'm going to diversify. What do we have to do to do that? What are our advantages? How are we going to pursue this? And one of the keys that we thought about is data infrastructure. So really that was the point that led us to the development of Deep Haven right there
Starting point is 00:09:41 in early, right around 2012 is when it became, it first was seeded, let's say, as an idea. I can, if you'd like, I can tell you the criteria that we used at that time to define a system that we would want. And maybe we could discuss how you think about that criteria, what your reaction to that criteria might be both in 2012 and maybe even today. Yeah. criteria might be both in 2012 and maybe even today. Yeah, I'd love that because I mean, 2012 is, I love that that was the year where the idea was seeded because you have the birth of Redshift
Starting point is 00:10:14 means the very early innings for like modern cloud data warehouse. And then if you fast forward to modern day, the available data tooling is pretty significant. And so I'd love that comparison of what did you define then? And then what does it look like now? Yeah. And I feel so lucky to have the seat I'm in and to be able to witness this journey of the industry, both as an insider, at least a wannabe insider, as well as somebody who's just learning all the time. And as you know, as you're all aware, this space is moving very, very quickly. Yeah. So our criteria back in, at the time I'm CEO of a hedge fund, right?
Starting point is 00:10:59 A quantitative hedge fund stuff is we say, okay, we're going to do this business. We need a data system. And we really established what we thought were some pretty basic criteria. We said, okay, we want this data system to be a central source of truth. We want it to be one system. If it's going to be one system, we literally want every single person in the company to coalesce around the system. You can imagine we had lots of different people. Very, very few of them would consider themselves DBAs. Most of them would say they're, oh, I'm a developer, I'm a quant, or I'm a systems person, I'm a trader, I'm a portfolio
Starting point is 00:11:36 manager, I'm a compliance person, I'm in accounting. We said, well, it's going to be data. We want all of these people around it. That was first and foremost. The second criteria was that we wanted to support all use cases that weren't high frequency. So high frequency in the capital markets means at the time, let's just say it was certainly sub millisecond. Right now, it's depending on, it's much, much, it's many orders of magnitude faster than that. But at that time, let's say I said, we want it to support everything that's not high frequency. So 10 milliseconds and higher. So 10 milliseconds to the last century, any data that's within that span, we want this one system to be able to handle it.
Starting point is 00:12:21 And that was the second criteria. And the third is just simply, we wanted it to be fast and to perform it. And that was the second criteria. And the third is just simply, we wanted it to be fast and to perform well. And we wanted that to be the case for basic use cases, like, I don't know, looking up P&L or understanding position as well as sophisticated use cases, like using Python or Java to build a predictive model or do other neural network stuff. So those were the basic criteria we had. And so one, I can tell you what my reaction was then and what my reaction was now, but maybe you or Costas could volunteer, like, what do you think? We wanted a system that lots of people could use that was good with both real-time and historical data, and it was
Starting point is 00:13:02 pretty high performance how would you would how could you have assembled a system like that in 2012 or how could you do it today coastless you got any ideas you're the man on this stuff ah yeah yeah yeah i can i can i can i tell a few things eric do you want to give it a try or you want me to no i'm going to let you handle this because in 2012 you were working with whatever available data technology was there while i was still no i don't know if i was quite that deep into the stack at that point so take it away you can speak for both of us yeah yeah yeah absolutely i will say one thing though it is it is really interesting in 2021 to think one phrase caught me or one phrase, well, I mean, super interesting criteria, but saying we want everyone to have access to the
Starting point is 00:13:56 data, that sounds almost cliche in 2021, just because, I mean, A, it's been just abused in marketing speech for a very long time now, but it's easy for us to say like, oh yeah, I mean, with the data tooling available today, like, yeah, that's like, that's not uncommon. But back in 2012, it really was like, that was a pretty technical requirement, right? Not just necessarily like a cultural one. So anyways, that'll be my contribution. Yeah, my contribution, like I'm trying to time travel back to 2012 and how I would architect a system like that.
Starting point is 00:14:38 My first approach at least would be something like what was called, and I think it's still called like a Lambda architecture, right? So you pretty much have like a combination of a streaming system with a batching system. And you are talking pretty much for two systems, actually, not one. But it gives you, let's say, like the flexibility to implement both use cases, both for like the streaming data and the real-time data, let's say, and the batching data.
Starting point is 00:15:06 Now, when it comes to the batching side of things, probably should be, at that time, if I was, let's say, adventurous, it would probably be Spark. I think it was Spark that was introduced in 2009 or something like that. Yeah, Spark was young. Yeah, Spark was young. So that or something like that. Yeah, Spark was young. Yeah, Spark was young. So that would probably be that. Now, in terms of like the real-time part, if I remember correctly, Kafka was released in first time in 2011.
Starting point is 00:15:37 But I think that, how was the platform from Twitter called Storm? I think. Yeah. So maybe I would probably use something like Apache Storm for the processing. That's what I would do back then. Okay. Now, what would happen today?
Starting point is 00:15:57 That's another. That's what we're talking about. Yeah, exactly. But probably it would be something like that, like a Lambda architecture with these two main technologies as their pillars, one for the batching processing and one for the streaming processing. And of course, we are talking about two separated systems with whatever issues these can introduce in this architecture, right? You hit on actually one of the phrases that we'll use sometimes, and that is Lambda architecture. Yeah, that is very much what we delivered. The short answer of our experience in 2012 was that we looked out in the marketplace and we were thinking, I'm the CEO of a trading company. I'm thinking what vendor can provide this? Because I don't want to build this from scratch. I want it now, not in a couple of years or something like that. And I don't want stitching together a bunch of different technologies was not something that I was versed in at the time.
Starting point is 00:16:49 So we just found none. The reality is we surveyed the marketplace and we just found no solution. Some of the technologies you talked about seemed relevant, certainly, but no solution. And one thing to remember is when you talk about Kafka, at least certainly at that time, and largely Kafka is used as a phrase in a number of ways. A lot of time when you're talking about Kafka, you're talking about the PubSub system, not the data engine, right? That there are data engines on top of Kafka, but that's not identical. And Confluent has one, but it is not identically what we think about when we think of Kafka. So the short answer is we rolled our own. We built this system with a Lambda architecture. We had a couple of breakthroughs. We think one was an update model, which allowed for
Starting point is 00:17:38 incremental calculations rather than calculations on whole data sets. That is very, very empowering as you either from a compute perspective, or as you think of use cases that have complexity and a lot of pipelined logic, being able to incrementally see new data and then just do small computes instead of massive computes is really quite a big deal. So we had a breakthrough in regards to that architecture. And then consistent with that Lambda comment you made earlier, we created a unified model for handling historical static batch data, as well as for handling real-time streaming, updating, ticking data, such that users, whether they be people that are just writing table operations or users that are doing sophisticated things, combining that API with
Starting point is 00:18:32 other languages could really be blind to and agnostic about whether they are working on historical data or real-time data, which is huge for a company that's trying to have a very quick turnaround from explore to productionize to deploy, right? And that is really what defines a quantitative trading firm, being very good at finding new signals and then getting them out into the marketplace. Yeah, it makes a lot of sense. Let's go a little bit back. I know that you touched some of the stuff that I'm going to ask you,
Starting point is 00:19:06 but I want to ask these questions first because I have something very specific that I want to address with your product. I feel like you're setting me up here. I like it. Let's go. Yeah, yeah, yeah. So you mentioned time, okay, and real time. So real time, it's probably one of the most abused terms, especially in computer engineering and like in tech in general, mainly because it's very context specific, like how you interpret like real time, right? For example, and this is something that's like, let's ask, for example, Eric, right? Like in marketing, Eric, what would you consider real-time for you?
Starting point is 00:19:47 That's a great question. Not to complicate it, but I think it differs with B2B and B2C. I mean, in B2C, anything close to, well, I was talking with a company the other day, and they went from six hours to 15 minutes for real-time e-commerce analytics reporting on A-B tests for conversion rate optimization. And for them, that's real-time e-commerce analytics reporting on A-B tests for conversion rate optimization. And for them, that's real-time, right? And I think from that standpoint, it really is because you can't really act on the data faster than that, right? You can't look at results from a test and then deploy the next test in less than 15 minutes. And then I would say in a B2B
Starting point is 00:20:25 context, I mean, for a lot of companies, like getting your data updated daily is plenty sufficient. If you think about like leads or deals going through the pipeline. So yeah, the sub 15 minute, and of course, there are other use cases, right? Like there may be some recommendations engines that are running based on data science models for delivering a user experience where I would say that really, that has to be extremely like the lowest latency possible, but that's more about infrastructure delivering experience and less around like marketing analytics data. Maybe that was more than you were looking for, but there's much. at the problem set from two different perspectives, two different histories, really. But I think where I would hope we can get to by the end is perhaps we might both discover that
Starting point is 00:21:34 there's a new future where we become quite close to one another. And it's a future that presents both a lot of challenge, but also a lot of opportunity to companies like Rudderstack, but companies like Deephaven as well. And I think we can talk about that some more. But Kostas, you were about to ask a question. I didn't want to get in front of that, please. No, no, no, no, no. So, yeah, I wasn't actually what I was trying to do is like get different interpretations of what real time is. Yeah, it's a great question. And I'll answer the real time financial markets in just one moment.
Starting point is 00:22:10 But I think just I get this question when I now that I'm starting to talk to people outside of the capital markets, like their instinct is a little like yours, like, I'm not sure what real time means or how relevant it is. But when we think of real-time data, the first thing we think about is not latency. The first thing we think about is the fact that data is changing. We see data as a thing as in flux. And there, Eric, I imagine you would agree, if you just put a pin in when do people need it? If I say, is data the same or is data changing? You're probably content to say that it is changing.
Starting point is 00:22:51 So then I will just ask a fundamental question about your architecture. And that is, would you be open to an idea that perhaps your architecture should be designed with that fundamental in mind? Or do we have to insist that the future of data, even for lower latency, has to be that everything is a picture in time. Everything is static data where you have to look at only the whole universe of data all at once. We think fundamentally real-time data is first about just saying data is in a state of flux and you need to architect your system accordingly. And if you really believe that, you've narrowed your options quite a bit because most systems are not organized that way. And we think Deep Haven is organized uniquely that way. And I think that's one of those things that as you, Costas, we keep delaying your question. I'll stop after this.
Starting point is 00:23:48 I'll get to it. Yeah. But when you ask that question, it makes logical sense. But for me and my experience, it's one of those situations where if you're not asked the question directly, it's hard to imagine that world, if that makes sense, just because the world that most people I think have lived in when you think about real time is just related to latency, right? That's how you interact with, consume, plan, all that sort of stuff. And so it's one of those things where you almost, it's like, oh yeah, I guess I need to think about that. And even some ideas are going through my head around, wow, like I'd probably approach some things
Starting point is 00:24:34 pretty differently if that were the case, right? If latency weren't the first thing I thought about. Yep. And I think Eric, now that we are talking about this, I remember like at some point we were discussing another episode, where we ended up with the conclusion that batch and batch processing at the end is just an approximation of a stream of data, right? And at the end, everything is the same when it comes to data, exactly, because you have this dimension of time, right? Like everything is a snapshot in time. Yes. And I think you're really starting to speak my language here, Kostas, because we actually don't only think about this idea that data is changing, but we actually think time actually really matters.
Starting point is 00:25:16 And we think this dichotomy between relational use cases and time series use cases is a false dichotomy. We think there's all these dichotomies that will separate the rudder stacks of the world from the deep havens of the world. And we don't think that they're right. We don't think data scientists and developers are different from one another. We think that it's a spectrum and there's not two camps that they have a lot in common. We don't think applications and analytics are different. We think it's a spectrum and we architect systems so that we're really covering this
Starting point is 00:25:51 whole spectrum, this whole arguably continuum in these different dimensions. And when it comes to real-time data, to answer your question, in the capital markets, it's as weird of a question as it is between capital markets and other places because high frequency lives in single mic turnaround times and jitter within systems matter as much as performance of systems. And they'll use that as real time, whereas some portfolio manager might say he's an algorithmic trader and he's trading in real time and he's at 10 millisecond latency just with his signals. And you'll have an asset manager who rebalances his book based on factors in real time, and he's really doing it on a 15-minute cycle.
Starting point is 00:26:45 So there's nothing sacred here. I think we're respectful of this term meaning lots of things to different people, but we think anything below a few milliseconds, you're really not in a general form data system. And anything above that, we think you should be able to cover in a general form or general purpose data system. And Deep Haven certainly lives within that space all the way through any sort of static or historical data engine workload that you can imagine.
Starting point is 00:27:15 Yeah, I love that because at the end, you pretty much gave a quite similar answer, but it's one from their own domain. And that's lovely. At the end, when it comes to real time, the answer is depends. Yeah. But here's what's interesting. And I think this is maybe something I'd like to put a bow around, right?
Starting point is 00:27:36 One, I think you, I'm guessing, I know very little about this world, right? I'm not at all an accomplished computer scientist the way you are. But I know very little about your history and your workflows, but I expect you come mostly from an OLTP world originally, right? Where transactions are, when you think of database, at least over the longevity of your career, transactions are really fundamental to what you think about, right? And so it used to be just OLTP, then, oh, we need to do analytics on this. We need a second system that can do that, right? And now that's evolved into a variety of alternatives. But now, as you suggested, Kafka became a thing, right? This idea of event streams and maybe not even just Kafka or its competitors, but think of APIs to vendors or different things you want to do to scrape the world. Or now there's Twitter feeds. Oh, there's sensors on windmills that are out there.
Starting point is 00:28:37 There's all sorts of IoT. I want to analyze what the different telemetry is from my Apple Watch is telling me about health, the health of me as an individual, or maybe healthcare providers is consuming a lot of those telemetric feeds from a variety of people and aggregating them and denormalizing them and trying to create baselines for different people. Well, now we're moving into a pretty interesting space, right? We're moving into a space where classic transactional workloads, as they were handled earlier, are not at the center. That maybe that is not the fat part of the future when you think of the Gaussian of this distribution in say 2030. It certainly has a role. There's certainly some implementations where that acid is at the core of what's important about this data, but now there's many,
Starting point is 00:29:35 many more use cases that all of a sudden become interesting. Kafka suggests that they're pretty good at transactional loads. Certainly a lot of transactional loads that are local don't need as much heavy OLTP. And now you're in an environment, which is always what we thought from the get-go, which is, hey, I want data in one place. I want it to meet software. I want that software to certainly be great at table operations,
Starting point is 00:29:58 but I also want that software just to be Python, just to be Java, just to be C++, just to be name your language compiled down with the table operations and be performing. And that's how we think of the world, and that's where we think the world is going. And we built a system that we think serves a lot of those use cases today and is certainly skating in the direction of the puck.
Starting point is 00:30:22 Yeah, yeah, absolutely. I mean, I think this whole, let's say, approach of, as you say, like moving a little bit away from the transaction model with the database systems and start thinking of the data together with like the time dimension there. It's something that it's not just about data. It's pretty much about anything, to be honest.
Starting point is 00:30:41 You can see it even in software architecture, right? You start listening about event-driven architectures, about CQRS, about microservices, and about like data that is like immutable and you consume it and you can replicate the whole, let's say, history of what has happened like with this service. So the main core ideas,
Starting point is 00:31:03 they pretty much exist everywhere and we see them being applied almost everywhere and that's like super interesting like for someone who is trying to not focus only in one side of technology because okay like obviously each one of us is hyper focusing on something and we see like the world only from the lenses of this of our work but it's pretty much everywhere it's very interesting we had some episodes where we were talking with software engineers. The guys had pretty much nothing to do with data engineering or with data platforms.
Starting point is 00:31:33 And they were describing how they re-architect their products. And you would hear similar terms used. Events, event-driven, pub-sub, immutability, all these terms that we hear a lot about, like delivery semantics, like all that stuff that we hear from someone who works with Kafka, like with data systems, you would hear it also from people who are architecting actually like infrastructure for a product,
Starting point is 00:32:03 which I found like super, super super interesting that's one thing the other thing uh that i really found interesting what you said pete is about everyone should have access to that right it doesn't matter if you are like a data engineer if you are like a data scientist if you are a software engineer you need to have access to the data right and you you need to have access to this data using your own tools and environment. And that's very interesting because I remember I had a conversation with a customer a couple of days ago, and they were saying that they are trying to unify the platform that the data scientist team is using and the data analytics team is using, because they are using two completely different systems, right? The data analysts, they work on a data warehouse like Snowflake. And then you have
Starting point is 00:32:50 the data scientists who have a data lake and they are using Spark. And you're like, even like these two types of people that are very close, like both of them, they work like pretty much with the same data, right? And even then, today, if you go inside organizations, you will see them be completely siloed from each other, which is crazy. It's interesting because a few weeks ago, you guys had somebody on that was kind of an expert and he was talking about Snowflake versus Spark. And I think you guys did a pretty interesting comparison of those two, right? There's certainly, it's impossible in this industry not to consider them heavyweights. And yet you look at what either one of them
Starting point is 00:33:31 is able to provide, and we do not see them as sufficiently unifying. We think, I can tell you that emphatically, an organization, and an organization can be a company, but an organization could also mean a community. Like I think of like public policy here and how compelling it could be, or even things like trendy things. I talked to my kids about sports and data around sports and how interesting it could be to coalesce people around platforms where lots of people can get at the data. In the business use case, we know how businesses work, right?
Starting point is 00:34:10 Historically, it used to be like you'd have DBAs, you'd have developers, you'd have quants or data scientists. Those two used to be sort of the same thing. Now they're somewhat different. And then you had business people, whether it be analysts or managers or executives or salespeople. We know that in 2021, every one of those people cares about data in a lot of companies. Most companies, most white collar roles will have an interest in data. And we think it is exciting.
Starting point is 00:34:44 And soon it will be more than exciting, it will be a requirement to be able to coalesce all of those people around a single platform, and to have, and not just to be like, oh, it's there, but to give them tools that they love. So the data scientist needs to be able to use Python. They need to be able to use all those libraries. It needs to be empowering for them, right? And yet at the other extreme, you need to have a non-technical person that can write a functional script, can create dashboards for themselves, can share them out, can even launch applications with only a few clicks of a button, right? All of these are very modern ideas. And now they're somewhat penetrating the industry and many people are working on them.
Starting point is 00:35:28 From our point of view, Deep Haven really delivers them out of the box today. These things just work. So these are not new ideas. These are things that our customers very much rely on and things that we think our customers use to get them ahead of their competition and to really deliver alpha and differentiators to their business. Yeah. And I have to say that one of the things that I really love when I visit the Deep Haven webpage is that I can see the Pandas logo next to GFPC. Yeah. Okay. So you're talking about things I love now. So can we, Costa, should we get into the meat a little bit? Yeah, let's do that.
Starting point is 00:36:11 Okay. So one of the things that I think is important today are being more and more embraced in the marketplace and in the development community, many of them are trying to apply them to systems that were not designed accordingly, okay? So just saying, oh, we want to have an update. We want to get non-technical people, quants and developers around the same platform. Okay.
Starting point is 00:36:46 For a lot of systems, it's going to be very hard to do that because they have ANSI SQL and transactions at the core, right? So it's going to be hard for a non-technical person who's already said, I'm not learning SQL. How are you going to force them to do that? For a machine learning person who wants to, or a data scientist who wants to use their Python models and compile them with the table operations and bring the code to the data to service complex use cases, how are you going to do that in a client-server model, right? So the architecture matters, the infrastructure you've built really, really matters. And so we've embraced these from the ground up. Remember, we built something new from the ground up to service all of these use cases, both for static loads and for real time. moving data around both within a cluster that's running Deep Haven, but also agnostically to other systems. So you highlighted a couple of the trendy, but also we think really good open source projects that exist out there with gRPC and Apache Arrow, right? So in particular, we've put a full embrace around these for all of the communication
Starting point is 00:38:07 workloads that we talked about that I just mentioned a moment ago. But here's something to note, Costas, and that is that they really are organized to, or Arrow in particular is organized for static data, for batch data, right? And yet we've talked on and off here in the last many minutes about the importance of real-time data or the importance for dynamic and changing data, okay? So what we've done is we've written an extension of Arrow, or specifically Arrow Flight that goes across the network that will support moving this type of data between applications and between nodes in an agnostic way. And in particular, for DeepAven, allows the data engine to consume it across the network and do the type of computational workloads I've described so far. So we think the world is coalescing around these technologies.
Starting point is 00:39:03 We think Python really showed you something, right? That if you have a technology that is good enough at a lot of things and is easy to use, a lot of people will jump in. And I think that's something certainly that I've observed here in the last 10 years. And we think there are communities that are really forward-looking around streams and real-time data and the technologies around there, around data science and the data transportation
Starting point is 00:39:35 and in-memory and on-disk formats for that. And that's really a part of the conversation that Deep Haven's trying to enter. Yeah, and now you are touching also technology that I'm a big fan of and I'm looking forward to see what kind of impact it will have, because I think it will have a big impact and that's Arrow. But that's, I think, a discussion for another episode. Yeah.
Starting point is 00:40:00 Okay. So a bit more about the technology itself, right? Sure. So can you share with us, let's say, like the three, four, or I don't know how many they are, basic principles that the technology is built on? And that also differentiates it compared to other technologies out there. Sure, sure. So I think the first one I've already spoken of, but I'll just reiterate, and that's that we have an incremental update model, which means that we're doing a lot of work. There's a lot of data structuring
Starting point is 00:40:35 in the background in the system that's saying, okay, data just came in. What does that mean to the state of the objects that we're keeping, right? Fundamental to all this is that Deep Haven thinks about tables as a very important thing. I mean, by everything you've said, tables feel very natural to you and important to you. They're very natural and important to us. We think of tables like data frames, right? However, though we embrace tables, we think of streams just as table updates.
Starting point is 00:41:06 So anything in Deep Haven, we've unified this construct so that anything you can do on a stream, you can do on a table and vice versa. And you don't really have to be privy to that. There's data coming into node sources and you're doing stuff with it, whether that data is real-time or historical, whether it's a stream or a table classically, as others would think about it, you and Deep Haven get to remove yourself from that duality and just operate on your alpha. At its core, Deep Haven is a Java engine, but we've bound it tightly with CPython and NumPy through a bridge. The bridge is an open source project called JPy that we're working to support out there that allows bidirectional, it's a bidirectional bridge between Python and Java, which means that you can write, you can deliver Python to
Starting point is 00:42:00 deep pavement servers and it just works. And bringing code to data really enables some of the complex use cases I talked about before where you have a table operation and you're doing joins and aggregations and filters and sorts and all the things that you would typically think. You're decorating the data with new columns, but you're also delivering bespoke functions
Starting point is 00:42:24 and third-party libraries and all that is getting compiled down to Java, you're decorating the data with new columns, but you're also delivering bespoke functions and third-party libraries and all that is getting compiled down to Java, whether it was brought as Java or C++ through the JNI or Python through this JPI bridge. So, and at its lowest level, DeepHaven is array oriented such that this idea of moving between languages is cheap because we amortize the cost and performance is great because we don't work on record by record. We work on array by array. It's all a vectorized process essentially.
Starting point is 00:42:58 So these are some of the fundamentals of our design and our architecture and the system that is out there today on GitHub, as well as obviously the gRPC API we mentioned before. That's super interesting. And a follow-up question that has to do with how you relate or you compare with two specific technologies. One is Kafka. From my understanding, they operate very well with, you relate or you compare with two specific technologies one is kafka from the other hand they operate very well with but at the same time kafka at some point at least confluent they tried like to introduce some primitives around tables right so you had like traditionally the concept
Starting point is 00:43:41 of the mutable log that you can build a stream on top of that. And that's like the main with the topics. But then you also had the K tables and all that stuff that they introduced. So that's one thing, how you convert to that and like what are the differences there. And then another question is we had like the chance to discuss with the CEO of Materialize a few weeks ago. And again, there we have the interesting case of having a table that can be updated in real time, let's say by feeding it a stream. And this happens through technology that's called timely data flow, right? So what are the similarities and what are the differences between these two? Sure. So when we think of Kafka, again, I try and mentally divorce Kafka, the transport pub-sub system, arguably, KSQL DB or some other
Starting point is 00:44:35 confluent apparatus that is doing something on top of such a stream, right? So the contrasts are significant. I think most importantly, in regards to the real-time data, Deep Haven very, very much excels at joins in ways that KSQL really doesn't, right? So stream-stream versus stream-table versus table-table joins are different things. If you have stream-stream, then you need to have windowing functions, right? That's because they don't have this incremental update model, which means that, hey, I'm going to join two streams. I mean, you need to tell me how much of the streams I'm supposed to look at, and then I'm going to do some batch joins, right? We think that that is a different model and one that many of our deep
Starting point is 00:45:18 users for many of their use cases wouldn't be very happy about. There's other significant differences. We have ordering is a very important concept, which is one of the reasons that Deephaven can present itself today as a very compelling time series database. Kafka has no such thing. You and I just spent some time, or at least you were listening to me, as I spent some time talking about the value of bringing code to data and how that enables some of the very important use cases. Again, that doesn't exist in Kafka. And then I think the last thing is this idea, there's a very, very interesting idea where there's all this data in the world. And what I want to do is, or in my world, let's say, and what I want to do is I want to join it together. I want to do some stuff.
Starting point is 00:46:09 And then I want to just create derived streams, right? In DeepAven, this is a really compelling approach, right? So we have a functional language. Somebody names a table. Really, that could be a table or it could be a stream. You write a little thing. Oh, I'm just going to get the data. Then I just take that, let's call it table one. And then the next line, I just get to write table two equals table one dot where, and then I'm filtering it. Right. And then I can write, if I want to, I could just keep it super simple. Table three equals table two dot join with table ABC and do some decoration and do some aggregation, right? In our case, you're setting up a tree is the way I might
Starting point is 00:46:55 describe it, but you would know, Kostas, that it's really a graph, an acyclic graph, right? And the new data is propagating through that graph, right? Again, using our incremental update model at each node. And we're doing this in a very performant and lazy way so that you can grab intermediate results or the end result. This is, it's a very lightweight, easy, fast moving way for a user to take a bunch of, for example, Kafka streams, since you're asking me to compare to Kafka and generate derived streams without registering schemas or doing anything heavyweight, just in a very quick, fluid way. And we think that the workflows around that are a meaningful difference, not just what the engine
Starting point is 00:47:41 can do, but what a person, how quickly a person can move. And that type of accelerated capability really is important to business. I love the arc of this conversation because it touches on a theme that we've brought up on the show multiple times. And that is imagining this future world where the technical limitations around data are gone. And batch to streaming is one example of that, where data is happening in real time, in real life all the time. And batch actually is just a technical limitation, right? It was developed because it's really hard to move data in real time, right? Or historically has been in many ways. And it goes to what we were talking about with latency, right? Like people think about analytics or even data in terms of latency,
Starting point is 00:48:30 just because that's been our entire experience with data. And so I just love the way that you talk about deep haven in a way that you're thinking about that future, right? And trying to break down some of those limitations that have traditionally existed, which creates the opportunity for so many new use cases. It's nice of you to say that. What's been really interesting for me, because you have to remember, I'm not actually qualified in the way that you all at Rudderstack are. I focus primarily on capital market stuff for a long time. And then we created this piece of technology because we had business needs. If you look at some of the other systems that
Starting point is 00:49:13 have been developed, they grew out of smart engineers that left Oracle or people that were grounded in Cockroach and then they wanted to evolve the world. We came at it from, hey, we're business users. What do we want our system to do? And so we built a system from scratch. And that really isn't my point. My point is what's interesting to me as an observer of all this is that particularly over the last three or four years since we spun this out, and to be fair, we rebuilt it essentially to meet the marketplace, is a lot of the founding principles are really important to us are resonating in the community and other people are building solutions that just map to the way we think of the world and the architecture we've put together such that
Starting point is 00:50:05 we're very inspired by the data science community. We find it compelling that somebody like Wes McKinney has gone from pandas to arrow and has created a framework for data in memory and across the network that really makes sense to us and we can elaborate on. We love GRPC. We rely heavily on Envoy, which is a proxy server that was put out by Lyft. We think that this, it's just so exciting, frankly, as an outsider that kind of wants to be invited to the table to look at all this and be fascinated by the fact that our vision is shared by other people and immodestly stated, hey, guys, we might have done some work that's going to be helpful here and that we can bring to the picnic. And somebody's like, yeah, let's serve that up too.
Starting point is 00:50:59 So that's really where we stand. And it is an exciting future because it's all going to move quickly and in unpredictable ways, as we know. Yeah. If we think about going back to 2012 to today and just thinking about such a fun exercise to think about, how would you build this in 2012? But if we think about the next decade, it's going to be wild. One thing I want to do, though, is the product is so interesting. There are tons of use cases that I know have come to mind for a lot of our listeners. But I'd like to drive home what it looks like to use Deephaven on a practical day-to-day basis. I know you came from the financial markets, but a lot of your customers are outside of the financial markets. And so could you just give us a couple examples of some cool things that your customers are doing with the product?
Starting point is 00:51:52 Yeah, sure. I mean, so I think there are some industries that are ahead of other industries in regards to really caring about streams and pub subsystems and are dominated by that. And those are the ones that are, are for the most part, first movers with us. So capital markets certainly think crypto, but also not crypto, but blockchain, right? There's all sorts of data on the blockchain. I don't want to just build analytics. I want to build applications off of that. Sure. IoT, telemetry, gaming, energy and power, these types of things. So I think Deep Haven is this core engine, but it's also the framework I talked about, all those experiences and integrations.
Starting point is 00:52:34 So some of the classics, what people will typically use us for in the early going is, as I suggested earlier, hey, I have a lot of Kafka streams. I want to marry them to some CSVs or to some Parquet. I want to do that in an interactive way where I'm interrogating and exploring the data. We have a really, really easy to use, click a button and go console experience or what we call a code studio, what others might call a REPL. So people will investigate data. If you want to look at real-time data in a browser, you have a Kafka stream, I want to see it. Or, oh, I have a Parquet table and it has 2 billion, 3 billion records. I want to see it in a browser. If you want to do either of those two things, there is not an option.
Starting point is 00:53:23 The only option you have is deep haven. And within a few minutes, you can be going and seeing that stuff, seeing that stuff and touching it and interactively doing all the stuff you think, filter, plot right there from the UI. Not to mention now, all of a sudden, you're also in an environment where, okay, I want to do more sophisticated stuff. Like I want to join it. I want to aggregate it. I want to create new data here with different decorations. So bringing those types of things are classic. People do data science, right? So, oh, I want to bring this data and then I want to marry it to a pie torch or something like that. Or I want to do even just statistical modeling on
Starting point is 00:54:04 this, on the data, particularly as it relates to data that is changing, as I've suggested, that's a hard problem that, that for the most part we make easy. And then the last thing is oftentimes like a lot of what our users will do is they'll say, Oh, I have a non-classic source of data. Oh, I'm going to scrape a website. Here's one. Oh, I'm trying to get a sentiment indicator for earnings calls. This is what a capital markets customer did, for example. I'm going to scrape a website. I'm going to essentially inherit in real time the conference call of the CEOs of all these companies right after they have these earnings, their leadership team, they have the earnings call. I'm going to inherit that transcript.
Starting point is 00:54:52 I'm going to parse that transcript through classic Python libraries. And I'm going to establish a sentiment indicator that then I can process as a signal that combines with other signals I have that tell me whether I should buy or sell something. And I'm going to do that in real time as they're talking. That is a classic model. And you can do all of that in Deep Haven. So anything else, you could do something like that, but one, it would be delayed. Two, you would have to use a client server model where you're doing some table operations one place and you're pushing data around and you're copying it. You're moving it, right? You're transforming it. We very much just allow people, Hey, just push that code.
Starting point is 00:55:32 It's all going to get compiled together. You're going to get this answer. You're going to scrape the website right there. You're going to handle the objects that are the data right there. You're going to deliver the sophisticated Python libraries right there. Yeah. Can I ask you a question that straddles the fence between like entrepreneurship, like SaaS entrepreneurship and technology? Please. Yeah. I'm, I'm please. I'm not sure I'm qualified. I don't think I'm qualified to ask this. So I think we're both entering uncharted territory here, but you know, I was thinking you don't hear, I'm qualified to ask this. So I think we're both entering uncharted territory here.
Starting point is 00:56:05 Sure. You know, I was thinking, you don't hear, I'm sure there are a lot out there. So this is anecdotal, but when you think about SaaS focused on data, a lot of it's venture backed, which kind of makes sense, right? I mean, especially when you think about database products, which take years to sort of mature and come to market and become generally available. And that's challenging stuff, right? And so it makes sense to capitalize the business according to the go-to-market timeline. And it's pretty rare, at least in the examples that I'm thinking of, for a company to spin SaaS out of another business and for that to be a truly robust tool. And I don't know why this example came to mind, but it's like the glut of project management
Starting point is 00:56:59 tools. It's like a team's like, oh, we can't find a project management tool that works. So we're going to build our own. And it's like, yeah, the reality is like, they all have their limitations and you just have to figure out how to deal with it. Right. And I'm sure there's some success story of that happening well, but many, many more failures. Right. And so I was thinking about Deep Haven, if we compare it to the project management type thing, and the problem is very hard, right? I mean, real-time data in capital markets is sort of the tip of the spear when it comes to complexity, quantity, requirements for real-time, computation, variety of needs on processing or running your own custom code on the data. And so I'm interested to know, just for you as an entrepreneur, do you think the difficulty of the problem contributed to
Starting point is 00:57:53 the success of spinning a data SaaS product out of another company, which I think in itself, I need to say congratulations, because I think that's quite a feat. Well, there's a lot there. Let me try and unpack it. So first, I would say that I appreciate the congratulations. Our team would suggest that it is premature and we have a lot of work to do in servicing customers. So we hope to earn that congratulations over time, but I very much appreciate it. The second thing I would say is in regards to real-time and capital markets, I think when we spun it out, we believed quite heartily, as you suggest, that we were at the tip of the spear. We were handling quite reasonably with good performance and with extreme flexibility, I would immodestly state. Not some use cases, but a combination of use cases,
Starting point is 00:58:47 a portfolio of use cases that were challenging and that we were servicing them for really pretty demanding customers, internal customers, but customers. And we spun it out because we believed what I told you earlier, and that is, wow, this is a really interesting problem set. And we think the world is going there. We think that feeds were just a capital markets term. And now the whole world thinks about feeds. A Twitter feed is like a known thing. That's data. There's tons of, everyone knows you can, there's data, there's money to be made. There's problems to be solved. There's questions to be asked of all sorts of feeds. So, oh, the world's moving to feeds and Kafka became a thing. So we were like, oh my gosh, this is, this is really interesting that we felt expert
Starting point is 00:59:39 at a world that was about to grow. And everything we've seen in the last several years since we've spun it out and since we meaningfully re-architected the the the platform seems seems to reinforce that view so so so that so that was very exciting yeah no that's super interesting and again hearkening back to the 2012 and thinking about how common the word feed was, right? It is all over the place now. observation that, yeah, most stuff like this is championed from venture capitalists and largely on the coast. And I think we were lucky in that we had a series of investors that had seen the power of Deep Haven to really revolutionize an organization and to allow an organization to move quickly. And then we were lucky in finding a number of early customers, not a huge number, but a number of sophisticated customers who pointed
Starting point is 01:00:52 their development and data science teams at us and were greedy. What that means is they said, we need the platform to move in this direction. We need these features. So a lot of the ideas that I've expressed to you today are not even not mine. It's certainly not mine, not even my teams, but rather their reaction to a pretty, a small customer base, but a pretty sophisticated customer base that really had lots of options in regards to technologies they could choose, but we're choosing ours and asking it to evolve. So I think as we, again, we think the fat part of the curve, the belly of the bell curve in regards to data workloads, we think that real-time data is going to matter there. We think that coalescing a lot of different types of people around the platform is
Starting point is 01:01:43 going to matter there. We think bringing code to the data so you can handle complex use cases at the same time as you have real-time use cases. We think handling relational and time series stuff together to deliver analytics and machine learning on the same platform as applications, right? We think that that's where it's all going. And we've been fortunate to have a group of people as investors that have believed in that and a series of customers that have been very involved in trying to move us to the promised land. So cool. Well, we are close to the buzzer here, but before we hop off, Pete, your open source, which we didn't have time to talk about, sadly, because that's something, you know, I'm very passionate about. I know Costas is as well, but what if, if anyone in
Starting point is 01:02:38 the audience wants to try Deep Haven or explore, what's the best way to do that? Sure. So you could go to our GitHub, which is just Deep Haven on GitHub. There is, we hope, simple instructions for you to download our images and launch an instance and tutorials there that will hopefully show you, or not show you, but introduce you to the range of concepts that I've described here. We've both invested reasonably and are very committed to try and provide good support and are pretty dogmatic about, we have lots of fights internally about whether something is easy enough to use in regard. So we're trying to be vigilant about that. And, and we are very much accustomed to supporting customers. And we see all of these community users as customers of
Starting point is 01:03:34 ours and are dedicated to not just supporting their use cases, but really listening to where they think the product should move. Pete, this has been an amazing conversation on multiple levels, philosophical, technical, and even a little bit of business thrown in there, which I think makes for an awesome time. Well, it's been an indulgent experience for me. I learned a lot and I very much appreciate the time you guys spend with me. Great. Well, we'll check back in with you in another six months or so and see how things are going. Beautiful. This was great. Thanks, Eric. What a fascinating guy. I love how many of our guests have studied things that are astrophysicist-esque, if I may. And then Pete's story is really interesting because he went from there into the financial markets, which is interesting. I think my big
Starting point is 01:04:26 takeaway was Pete's challenge to me in terms of thinking about analytics through the lens of latency and pressing on that and asking why. And I just love that because I think it's expressive of a mindset that doesn't accept the currently available tech as status quo, which opens the possibility of imagining what can happen if you break down current barriers. So it just made me think a lot. And I'm sure I'll keep thinking about it the rest of this week. How about you, Kassas? Yeah, I think this part of the conversation was super interesting.
Starting point is 01:05:01 And I really enjoyed that, let's's say the reframing of the term real time from being around latency to being around something that changes right I think that was like the most interesting part one of the most interesting parts and the other thing that I would add there is something that it's it that has started emerging as a pattern to our episodes about the importance of streaming data. That's something that we discussed also today. And it seems that streaming data and data that they change often are becoming more and more important. And we build more and more technology around them. And what we saw today together with what we had like a couple of weeks ago, where we were discussing about materialize, I think we are going to see more and more technology and interesting
Starting point is 01:05:55 products coming out that will be dealing with streaming data and data that they change often. Absolutely. Well, join us for upcoming shows to dig more into streaming and meet other cool people working in data. We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, ericdodds, at eric at datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by Rudderstack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.
