The Data Stack Show - 186: Data Fusion and The Future Of Specialized Databases with Andrew Lamb of InfluxData

Episode Date: April 24, 2024

Highlights from this week's conversation include: The Evolution of Data Systems (0:47), The Role of Open Source Software (2:39), Challenges of Time Series Data (6:38), Architecting InfluxDB (9:34), High Cardinality Concepts (11:36), Trade-Offs in Time Series Databases (15:35), High Cardinality Data (18:24), Evolution to InfluxDB 3.0 (21:06), Modern Data Stack (23:04), Evolution of Database Systems (29:48), InfluxDB Re-Architecture (33:14), Building an Analytic System with Data Fusion (37:33), Challenges of Mapping Time Series Data into Relational Model (44:55), Adoption and Future of Data Fusion (46:51), Externalized Joins and Technical Challenges (51:11), Exciting Opportunities in Data Tooling (55:20), Emergence of New Architectures (56:35), Final thoughts and takeaways (57:47)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcript
Discussion (0)
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week, we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at rudderstack.com. Welcome back to the Data Stack Show. We're here with Andrew Lamb, who's a staff engineer at Influx. And Andrew, Kostas and I have a million questions for you that we're going to have to squeeze into an hour.
Starting point is 00:00:36 You worked on the guts of some pretty amazing database systems. But before we dig in, just give us a brief background on yourself. Hello. Yes, thank you. I'm Andrew. Obviously, I've worked on lots of low-level database systems. I started my career at Oracle for a while, then I worked on an embedded compiler for a while at a startup. And then I spent six years at a company called Vertica, which built one of the first of that wave of big distributed, shared-nothing, massively parallel databases. I then worked in various machine learning capacities at startups, which was fun, but not really related to data stuff so much.
Starting point is 00:01:15 And then for the last year, I've been working with Paul Dixon and the co-founder and CTO of Influx Data on the new storage engine for Influx DB 3.0. So I've been down working on building a new sort of analytic engine for focused on time series. And you've also been working a lot with Data Fusion, right? The open source projects. And there are like a lot of things we can chat about, as Eric mentioned. But something that I'm really interested to hear from like your experience and perspective, Andrew, because you've been in this space for a very long time, is that it feels like we are almost like at an inflation point when it comes like to data systems and how they are built, something that's probably happening for a long
Starting point is 00:01:53 time. But data systems are complex systems and very important systems. So there's always like, let's say, risk is not exactly like like, something that people want to take when it comes, like, to their data, right? So, like, the evolution usually is, like, a little bit slower compared, like, to other systems. But it seems like we are reaching that point right now where the way that we build these systems is going to, like, radically change. So I'd love to hear from you how we got to this point.
Starting point is 00:02:22 What are, let's say, the milestones that are leading to that, and what the future is going to be looking like based on what you've seen and what you're seeing out there. So that's the part that I'm really interested to hear. What about you? What are some things that you would like to talk about? Yeah, I would love to talk about that. And I'd love to talk about sort of the role open source software plays in that evolution and how we got there. That was something that Influx, I think, always as a company has understood and really valued is open source and how that's both evolving and then also how you leverage open source
Starting point is 00:02:55 to build the next generation products. I think it'll fit beautifully. And I think we can illustrate the story of sort of what we're doing with Influx DB 3.0 as part of that longer-term trend. And I think there's lots of interesting things to talk about there. Sounds great. What do you think, Eric? Let's go and do it. Well, I'm ready. I'm ready. We need to hit the ground running here because we have so much to cover. Yeah, let's do it. Okay. I'd love to just level set on what Influx is. I think it's very commonly known as a time series database,
Starting point is 00:03:26 but can you just explain what it is and sort of orient us to where it fits in the world of databases and what people use it for? Absolutely. Yeah. So I think InfluxDV is a part of a longer term trend in the database world where we went from super specialized sort of monolithic systems systems very small number of very expensive ones like oracles or db2 that kind of stuff and instead we've seen a proliferation of databases that are specialized for the particular use case you're using for using them for and so time series database as a category relatively new and i think the reason you end up with specialized categories is because if you focus on those particular use cases you can end up up building systems that are like 10x better than the general purpose one.
Starting point is 00:04:08 And I think that's approximately the bar you need in order to justify a new system. So time series databases in particular often deal with lots of really high volume denormalized data. So that means you're basically having to read CSV files that come in and you quickly, they quickly want to be able to query them like within milliseconds. Not, you know, like when I started in the industry too many years ago, like you did batch loading at night, right? Like you do it once a day, if you were lucky. Yeah. Right.
Starting point is 00:04:35 And now it's like maybe once an hour, maybe once every five minutes, but like times your day was like, it's like measured milliseconds. So that's really important. Probably being able to query super fast afterwards. And then I think also because this it's not part of an So that's really important being able to query super fast afterwards. And then I think also because it's not part of an application that's been set up before, you don't necessarily know what's coming from
Starting point is 00:04:51 the client, so you don't upfront specify what columns are there. So you basically have to figure it out. Data just flows in, you figure out what the database has to figure out what columns to add. And then of course lots of time series data, the most recent stuff is really important.
Starting point is 00:05:08 The stuff that's older is much less important, and making sure your system can efficiently manage a large amount of data, even though the only part of it is really important to deal with low latency. And there's a huge number of examples with open and closed source at this type of database. Obviously, Influx is one, but there's a whole bunch of examples with open and closed source at this type of database. Like obviously Influx is one,
Starting point is 00:05:25 but there's a whole bunch of ones like Graphite and Prometheus and the whole math of them, right? There's a whole bunch of closed source ones, both at the big internet companies, like stuff like Gorilla or Monarch or Gorilla from Facebook or Monarch from Google. And there's like, the dog has versions of this, right?
Starting point is 00:05:41 That are all proprietary to their own server. So it's a whole category. And Naflux is the biggest open source one. So I want to dig into something you mentioned. So this evolution, we'll dig into this specifically, like this evolution of purpose-built databases all combined together. But in terms of time series specifically, is the need for those, you said it's relatively new, is that need driven
Starting point is 00:06:09 both by the opportunity that's created because there's technology that can solve some of those problems that were probably difficult to solve before from a latency standpoint? And then how much does just the availability and volume increase in volume of data that we are able to collect and observe? How do those two factors influence the somewhat recent importance of time series databases? Yeah, I mean, I think you hit it. I mean, you understand, right? As we had more and more systems that operated, you need to actually be able to observe them and figure out what they were doing, right? Which actually turned out to be a major data challenge themselves. But it's a very different data set than like your bank account records,
Starting point is 00:06:53 which is what the traditional transaction system databases were designed for. And so I think the workload characteristics are quite different. Can we walk through just one? I'm going to throw an example out and you tell me if it's a good one or bad one, but I'm just going to throw a use case out that I think would make a lot of sense for InfluxDB and would love for you to speak to like, well, what are the unique challenges for that? I'm thinking about like IoT devices. So, you know, think about a factory that's wired up with potentially hundreds or thousands of different types of sensors. You mentioned actually the incoming data can change.
Starting point is 00:07:34 And when you think about sensors, installing new sensors, collecting different types of data, that is probably changing intentionally a lot as you measure different things and install new equipment. Am I thinking correctly about a use case where you know yeah yeah no and that's another good example of like you know in a factory the rate of change of the software you have installed on your robots is probably different than the rate of change of software that's deployed in some cloud service right like it doesn't get deployed every minute right because you've got like yeah yeah mess around the robot so that's all the more reason why like why the data format that comes off those is relatively simple, and you need to get back in
Starting point is 00:08:08 to handle that. But yeah, time series or IoT is definitely the classic thing. And the idea is you dump your data in a time series database, and now you can do things like do predictive analytics or pull the past history of those machines or the sensors or whatever, and you build models
Starting point is 00:08:24 to predict when they're going to fail or when they need maintenance or, you know, variety of other, you know, machine learning problems, which is now of course, the hotness, right. But like to drive any of that stuff, you need data to actually to train the models on. Yeah, totally. That makes total sense. But we'll switching gears just a little bit. So the use case makes total sense. You've spent a lot of time, though, sort of building the guts of this system, right? And in fact, sort of helped transform the system into the latest iteration of InfluxDB. And I love talking about this because that's unique to, you know, say like, you know, you know, ingesting data or, you know, a dedicated analytics tool or, you know, some of these other things that are maybe like moving data or, you know, operating on data after it has been loaded into some sort of database system. So what types of things have you had to think about
Starting point is 00:09:25 as you've architected this system and interested specifically in the unique nature of time series data? Yeah, so maybe I should start by just blaring a little bit about, like, from my, so I don't exactly have an academic database background, but it's been a lot of time studying them. And so I'll just describe a little bit, like basically all the original time series databases all have approximately the same property, or approximately all have the same architecture, which is they all are driven by
Starting point is 00:09:53 a data structure called the LSM tree, right? Which is basically just a big, it's like a key value store effectively, but it's a way to quickly ingest data. And so it helps you with the really high volume ingest rate, and it lets you sort of move data off the end, right? As the data gets cold, it's easy to drop it off the end. But it also requires you effectively to map the data that comes in into like a key value source. You got to basically map the identifier of the time series into a key.
Starting point is 00:10:21 And then in order to do any kind of querying, you need to basically be able to map your queries to ranges on that key, which I'm not sure I'm doing a great job explaining that, but that's like the high level. Yeah. Totally checking. Yeah. So what that means,
Starting point is 00:10:35 the types. So what that means is that architecture is phenomenal at queries that like you can ingest data super fast. That's great. You can also look up individual time series, right? Like if you know exactly the robot you care about, it basically calculate the key range that you care about and you can immediately fetch. And it's like, it's hard to imagine a better architecture to do that.
Starting point is 00:10:55 Sure. I think tends to come in several ways. Like, so the first one is if you want to do a query, that's not just like, I already know how to, which, which robot I care about, right. Or I want to go do something like look across all the robots. Then it's not just like I already know which robot I care about or I want to go do something like look across all the robots, then it's typically very challenging because you basically have to go walk through all the data to do that. Likewise, if you have a large number of individual things like lots of little tiny robots
Starting point is 00:11:17 or containers or something, which we would call high cardinality at the time, that's typically very challenging with the traditional architecture because you have the same problem of the index you need to figure out exactly where to look in the LSM tree becomes huge and becomes overbearing. Andrew, can you expand a little bit more on the high cardinality?
Starting point is 00:11:42 I think it's one of the things that it comes up like very often when we are talking about time series data. I think people have the concept of what cardinality is and what high cardinality would mean, but what does it specifically mean when it comes to time series databases and why it's so important? Yeah. So cardinality just means the number of distinct values in some ways, right? That's the academic term. And so even if you have a million rows, if there's only seven distinct values, for example, there's only seven distinct, you only have data in seven distinct AWS regions, well, that's a different data shape than you've got a million rows and they've all got a million individual values, like individual IDs or something, right?
Starting point is 00:12:26 Unique IDs. And though you might imagine it's the way you store that is different. So the way it's really in particular important for time series is because the reason it's called a time series database is there's a notion of a time series, which tends to be like set of time, like timestamps, right? And then measurements of something values yeah at that time but it's not just values and times that kind of hang out in the ether they're attached to something which identifies like where they came from right like what is it a measurement of it's the measurement of pressure of the robot over time or something right but not just the robot like this particular robot in this factory on this floor whatever and so for a time series database they're typically the classic ones are organized so that they store all the data for the
Starting point is 00:13:09 time series together and the queries then ask for individual series right so like i want a dashboard that shows me for this robot what its current pressure value is or something in order to build a database that answers that kind of query really quick, you need some kind of way to quickly figure out from the robot identifier, whatever you have, where it's stored in the robot id and the time stamp and so then if you want to do a range of like tell me something about the robot for the last 15 days the keys that have that data are like all next to each other in the key space and so your database just can go look up one place and just read all the data contiguously and so so that's good right so that's great now you asked what's the problem with the high cardinality? It's that I've been hand-waving and saying there's some index structure that lets you quickly figure out where in the key space the particular robot is.
Starting point is 00:14:14 So in Influx, that thing is called the TSI index. Other systems have the same notion that it's called something different. But it basically lets you quickly figure out where in the key range the thing you care about is. And the problem is typically those indexes are the same order of magnitude size as the number of distinct values in your keys. So if you have a thousand robots, you've got a thousand entries
Starting point is 00:14:35 in this file, it's fine. You've got a million robots or ten million robots or a hundred million robots. The index size just keeps growing bigger. And so then managing that index becomes, that becomes the problem. And that's exactly, I mean, so that's why most time series databases will tell you don't put high cardinality data in. So it's not that they can't deal with large numbers of values. They can't deal with large
Starting point is 00:15:02 numbers of individual identifiers for the time series themselves. And I would assume there's always some trade-off there, right? That's why we have these problems, that we have to trade something for something else. So what do we have, let's say, in a time series database, if we would like to also support high cardinality without even having to care about, right? What we would have to trade off? Right. I think what you have to trade off is the absolute minimum latency to request a single
Starting point is 00:15:41 time series. The way the traditional time series databases are structured, it's hard for me to imagine some way that you could do better. If what you want is just the particular time series for some particular range, you've got that basically in some memory structure and you've got a single place where you can look at it and figure
Starting point is 00:16:00 out what range you need to scan. That's pretty fast. And it's hard to, i can't you know maybe engineering wise you can eat some more percentage out of that but that's not like some fundamental architectural thing like you look up where you want it's all one contiguous thing and you just go find it so so that's like the thing that that they're phenomenal at and so you're going to have to trade off like super low latency. I think you'd still be very good, right? We talk all about this. But I don't think you're ever going to
Starting point is 00:16:31 be able to do quite as good. Yeah, and so that's like the... So your high cardinality is really an impact on latency, right? As you're operating over the data. That's the trade-off for the end user?
Starting point is 00:16:49 Well, the trade-off for the user is typically if you try to put high cardinality data into these systems, it will tend to cause real trouble. Like the maintenance of the index will just basically dominate. Yeah. The other alternative is you don't put that in and you... But instead, you now need to, like, every query, instead of knowing exactly where to look, it has to look
Starting point is 00:17:09 at a much broader swath, so it just won't work every query. Yep, yep. Sorry to interrupt, Kostas. No, go ahead. I interrupted you in the first place, so please... No, no, no. You were asking some questions. Well, actually, just to continue on that.
Starting point is 00:17:26 So when we think about this trade-off, especially as it relates to cardinality, like what are the, because I mean, you don't necessarily always have control over the actual nature of your, I guess, physical data, if you want to say that, right? And so is the response to this to say, or like, when you talk about sort of best uses for a time series database, like, is that limiting it to certain, you know, so time series data that has lower cardinality or how, like, is a data, if I'm choosing a specific tool to manage time series data how do i need to think about where to employ that and where because of cardinality issues it might cause you know cause issues for me yeah
Starting point is 00:18:13 so just to be clear i want to start saying like i'm not i haven't been down the trenches deploying times for database to people so i can give you a high level answer that but not like born out of experience so i think typically what will happen is, you know, where does cardinality data, high cardinality data come from? Right. If you, if what you're doing is tracking like servers in your server farm, like you'd only have so many servers and probably most times you would do it in your server farm. If what you're tracking is like individual Kubernetes containers, right.
Starting point is 00:18:42 Of which every time, you know, your microservice has however many of them, right? And they continually get created and destroyed, right? Like that starts looking like a high carnality column. And so I think if you want to use one of these sort of traditional time series databases, you have to be careful when you are deciding what to send into the database and what to store and how to identify the things you care about.
Starting point is 00:19:03 You know, like probably you don't send the individual container ID, right? Like you probably have some sort of service identifier instead, which is low cardinality. So I think it's not going to use traditional database, time-series databases for those use cases, but it means that you have to be much more careful when you stick data in that you don't, you know, inadvertently include a high cardinality column
Starting point is 00:19:23 in something that's going to cause trouble. Yeah. There's plenty of people who are able to work, you know, inadvertently include a high cardinality column in something that's going to cause trouble. Yeah. There's plenty of people who are able to work, you know, make that kind of change. One other question that I have based on, this is great, by the way, I don't know if we've talked about time series in this step about a use for time series data being the need to query that data in sort of an immediate, you know, for the queries needing to be more immediate, if you will. In terms of thinking about when data goes cold or when you need to offload it, that's obviously a decision for the end user to decide what their thresholds are for that. But generally, if we go back to the IoT device question, in that case, what would you see for InfluxDB users? When are they offloading the old data? Are there use cases around the historical analytics around that,
Starting point is 00:20:22 as opposed to getting a ping for potential device failure because an ML model is running on, you know, the data in near real time. Yeah. So I think you've hit on the head there with like, the reason there's, well, let's see, let's take a step back. So like the traditional architecture, you have a time series database that ingests the data quickly and keeps it available for some window of time. I think typically seven days or 30 days, like either of those would be common times, maybe two months, but probably not that much longer. And then if you want to do longer term analytic queries, you tend to have to basically have a second system, right?
Starting point is 00:20:56 So rather than the time series being directed to both the time series data and this other system, right, that then uses, that's probably some big distributed analytic database or something, right? Yeah, just doing classic, you know, look back analytics. Yeah other system, right. That then uses it. That's probably some big distributed analytic database or something. Right. Yeah. Yeah. Just doing classic, you know, look back analytics. Yeah. Yeah.
Starting point is 00:21:09 Yeah. And where the late, where you're much less late, late fee sensitive, less latency sensitive. So, I mean, obviously that that's one mission of inflexdb 3.0 is that, you know, it's really inconvenient to have two systems, right. If you can have one system or even better, if you have a system that you don't have to, you, the developer don't have to direct the data streams into two systems, right? If you can have one system or even better if you have a system that you don't have to, you the developer don't have to direct the
Starting point is 00:21:27 data streams into two places, right? You only have one ingest pipeline and then maybe I'll talk a little bit about the new version later. But you directed it into your time series database and eventually ends up as Parquet files on object store, still with all the low latency love you enjoy, you need. But at that point, your existing
Starting point is 00:21:44 analytic systems, which probably also deal with parquet on object store a lot of these days sort of have a unified view the other thing that i wanted to mention about the the cost thing is is i think most customers aren't like oh i only want seven days of data right like they they want more they don't want they want much more than 30 or whatever they're typically driven by the cost right they just sure certain amount of cost or capacity and the current time series technology required basically fast disks and in memory you know like large amounts of memory and so sure that tends to limit the scope but i think there was a real need to like hey i don't want to have to put like it like it doesn't make sense right if some of your data is really important to be hot but like i'd love to keep
Starting point is 00:22:28 three years worth of data but i but like my expectations on query latency on that is much lower than if well you know it's okay if it takes longer than like the stuff that came in 30 minutes ago being able to do that in a cost-effective way was another driver for InflexDB 3.0. I was just going to say, I want to dig into Inflex 3.0 before we get there, and maybe this is a good bridge, but can we zoom out just a little bit and talk? Because you mentioned having to do those in two separate systems. I mean, there's a lot of overhead involved in that. But zooming out just a little bit, that really gets at this question of what you said at the beginning, right? Where there's this trend of sort of breaking apart the monolithic systems into these various database systems, tons of really good open source tooling out there today, sort of enterprise grade,
Starting point is 00:23:22 I guess you could say. How do you think about architecting a system where you have multiple databases? I mean, in some sense, like having a giant monolith is, you know, it certainly decreases the complexity, even though there's tons of trade-offs there. But how do you think about that? I mean, we could call it modern data stack. We could call it, you know, sort of compartmentalizing these pieces using best of breed tooling. But help us think about that? I mean, we could call it modern data stack, we could call it, you know, sort of compartmentalizing these pieces using best-of-breed tooling, but help us think through that.
Starting point is 00:23:49 So, traditionally, right, since the database was tightly bound to the storage system, like, the database was the source of data, and so if you wanted multiple databases in your system, you'd actually have to, like, orchestrate the data flow between them, right? That's what ETL is all about, right? Like, you take the data from somewhere, you extract it,
Starting point is 00:24:06 you transform it, you put it somewhere else, right? So that's the, and so I think as you pointed out, the problem with doing that is now you want specialized tools for the specialized job. You have copies of your data and data pipelines everywhere. And it's, you know, not just ingest, but like, then once inside you're calculating whatever you need to calculate. So I think what's happened over the last five years,
Starting point is 00:24:24 and it's only going to happen more, is that the source of truth for analytics specifically is moving to Parquet files on object store. I had Ryan Blue from Tabular here the other week. That's solving a problem that comes out of putting a whole bunch of Parquet file on object store. But the economics and the compatibility of your data is Parquet files on object store but the economics and the compatibility of your data is parquet files and object stories i think compelling enough that's where everyone's going to move to like it's basically it's not you know like having an infinite ftp server in the sky sorry if i'm dating myself right but like s3 is basically you know it's like it's an incident and super cheap oftentimes we like we have terabytes of stuff that are like, oh, we probably don't even need that, but like, hey, it's cost a month, who cares, right?
Starting point is 00:25:07 Like, you know. Yeah. And so I think that's going to become the source of truth, right? And then what you're going to have is a bunch of specialized engines that are operating and are specialized for whatever your particular use case is that will read data out of there and potentially write it back. And obviously, InfoEasy Vue 3 is one of those. But I think basically any database system
Starting point is 00:25:30 that's architected relatively recently, like that for analytics especially, that's the trend. And I think that's only going to accelerate. And I think that is really good because now you have your data in one place, maybe depending on how you feel about Amazon or the other cloud providers.
Starting point is 00:25:46 But now your tools will now be able to talk to it in the same place rather than have to have end-to-end point connections. Right, like daisy-chaining all these different systems. Yeah, Costas, I know you have a ton of questions around this architecture and have actually thought a ton about it yourself. So I would love to hear your thoughts as well
Starting point is 00:26:03 on the market moving towards this. Yeah, 100%. I think it's kind of like a natural evolution. I think in the same way that if we think about how we were building applications 15, 20 years ago, the monolithic way of doing it in the early 2000s would never be able to scale to where the industry is today or the market is today. So we had to become much more modular. At the beginning, we were talking about, okay, we have the backend, the frontend, and this actually evolved to a much more granular architecture at the end. That's for the applications, right? We kind of need something similar
Starting point is 00:26:51 for databases too. If we want to build, say, products around data, we can't afford for the teams to continue... What they were saying, that know, like to continue, like, you know, what they were saying that to build like a database company, you need like 10 years just to go to markets before you go to market because you have like to build first.
Starting point is 00:27:15 Like in 10 years, it's like a completely different world out there. That's why we have so many databases that have died. Like people are like trying to go and do that, and then the time you're ready to go out, you just cannot go to market anymore. So if we want to have a similar way of building and creating value off the top of data, we need to replicate these kinds of patterns
Starting point is 00:27:40 that software engineering has done like other systems but now doing it like on data on databases practically like it doesn't matter like at the end if it's like a distributed processing system or whatever like at the end they all share the same let's say principles behind them of like how they they are like architect. So it's inevitable to happen. And there's no way that we can, whatever we say about AI or whatever is going to happen in the next 10 years, without strong foundations on the infrastructure that manages, creates, and exposes this data, nothing will happen. It's just how this industry works, how the world works at the end, physics. Right. So that's, I think, inevitable. The question is how fast it will happen and what
Starting point is 00:28:37 direction. And that's what I would like to talk a little bit with Andrew because Andrew is bringing like a very unique experience here, in my opinion. First because, okay, he's been like building database systems for a very long time, so he has seen like the old monoliths to whatever like the state of the art is like today, but also has experienced, let's say, what it takes to leave the monolith behind and build something that adopts these new patterns. So, Andrew, I'd like first to ask you, what do you think are the main milestones in these past 10-15 years that got us where we are today,
Starting point is 00:29:27 if you had to pick two or three. And I think you mentioned something already, which is the separation of storage with compute, in a sense, right? Which I think was what started everything in a way. But there are others too. It's not just like that separation. So tell us in your mind
Starting point is 00:29:49 what you would put there as like the top three most important changes that happened. Yeah, well, so take what I say with a grain of salt because it's very focused on analytics and it's also colored by my background. But I think the first thing
Starting point is 00:30:03 that was really important was like, you know, the late late whatever it is 20 2010s was parallel databases actually became a thing right that they weren't some crazy hardware you know i don't know if you guys remember but like earlier you want to do like four cores i mean you had to buy like some fancy tandem like specialized hardware thing right to actually run multiple cores in your database it was crazy and so there was a raft of companies, including Vertica, that were part of this wave that they went from single-node machines to actually had commercial
Starting point is 00:30:30 offerings that were parallel databases. So I think that was one. They actually figured out parallel databases. Second one was columnar storage, as you put it, for analytics, right? Before, again, it's actually often a lot of the same companies, but it sounds stupid, right? Like, well, traditional the same sort of, it's actually often a lot of the same companies,
Starting point is 00:30:45 but it sounds stupid, right? Like, well, traditional databases store stuff in rows. Now we're going to store stuff in columns and that's going to be earth shattering. Like it doesn't sound all that amazing, but it ends up being very important when your workload shifts from like uploading, load heavy, right?
Starting point is 00:30:59 Like transaction systems where the, you have lots of little thing, lots of little transactions, but each one touches like a small amount of the data. When you switch to when you really care about analytics or really analyzing large amounts of data, the workload pattern is very different. The queries that come in, there's many fewer per second, but each one touches a much broader swath of the data.
Starting point is 00:31:17 And so organizing your data in columns as a column store really makes that significantly faster to do and much more efficient. And then the third one, since I get three, is the separation of storage and compute. Like you said, that I think is often referred to in the database term as a disaggregated database. And the distinction is like Vertica and similar,
Starting point is 00:31:40 like Power Excel, your early versions of Redshift were like this too, were what's called shared nothing architectures, which basically meant you had a compute node attached to disk and the distributed database
Starting point is 00:31:51 was sharded effectively internally, but then the individual nodes are responsible for individual parts of the data, which is great because that meant you guaranteed to have co-located compute.
Starting point is 00:32:01 The problem is if you ever wanted to change, like basically, if you wanted to have elasticity and change the resources you gave to the system, you had to move all the data too, because the data was tightly bound to the compute. And so that was definitely a challenge of Vertico's dealing with that scalability. But I think that's why disaggregated databases became a thing. Snowflake now seems obvious, but I remember reading the Snowflake paper like that. They wrote a paper about their
Starting point is 00:32:27 architecture back then. It sounded crazy, right? Like, we were building distributed databases for these hard, really high-end enterprise-grade machines. And they're like, we're going to build a distributed database on a bunch of crappy VMs from who knows where, and we're going to build it on
Starting point is 00:32:43 top of this object storage thing, which has hundreds of times slower than these fast disks that we're using in the shared nothing architecture. But I think they really, looking back on it, it's clearly prescient. They were basically the first major commercial success
Starting point is 00:32:59 to have that architecture. And it just seems obvious now, but that was really revolutionary, I think, at the time. Yeah, yeah. It sounded like a crazy idea, like how are you ever going to build a database like that? Yeah, 100%.
Starting point is 00:33:13 Okay, so let's talk a little bit about the evolution of Influx, because from what I know, you started with different, let's say, core technology there. And then at some point, you re-architected and moved to some different technologies with what will get us also talking about data fusion here. So tell us a little bit about the story. What was how, first of all, Influx looked like? What triggered the need to investigate moving to something else,
Starting point is 00:33:49 and what, at the end, made you move into using something like Data Fusion. Yeah, so I can talk a little bit about the motivation for Influx, the product. I think Paul is probably the best person to talk about it, but he's spoken about it publicly, and it's not surprising when you think about it. A bunch of the challenges we talked about earlier with time series database, like high cardinality and having bigger retention intervals and doing sort of more analytics style queries, all those were long-term asks from influx data customers, right?
Starting point is 00:34:18 So Paul was looking for technology that would allow basically to satisfy them. And I think also, as I understand it, the market's effectively becoming commoditized, right? There's a huge number of, you know, Prometheus came out and a whole bunch of other open source various products that they're basically influx clones. I'm sure that everyone would take offense at that, right? I'm not trying to offend, but like basically the high level architecture is the same, right?
Starting point is 00:34:43 You got this in memory tree and you have some index and maybe it's better but like at the core there it's all getting commoditized so i think paul is looking for like how do we where do we bring the next bring the architecture to the next level and to your point earlier though like building a database as he know because he just did it for like eight or ten years right like it's a long-term experience right it's much longer than you ever want and it takes hundreds of millions of dollars i have some slides somewhere that i you know you go through and just look at like pick your database company and go look at how much money they raised and it's probably you know hundreds of millions of dollars and it's not that all that money gets spent on engineering of course but a substantial amount
Starting point is 00:35:21 does so i think when paul was casting around like, how am I going to build a new database engine? The only practical way to do it probably is find things you can reuse so you don't have to build it all from scratch. And I think he, early on, even in 2020, he wrote a blog post that said Apache Arrow, Parquet, Data Fusion, Flight, they're game changers from implementing databases. You know, at that point, I was like stuck in some engineering management nonsense. So like, I just basically took it by faith. Paul said, hey,
Starting point is 00:35:54 these are some great technologies. I said, as long as you let me code, and I'll code it, I'll do it in Rust, and I'll do whatever these technologies you want. I just need to get back in engineering. But I think he was very, like, he was very prescient.
Starting point is 00:36:06 Like, I think he identified those early on. And then, you know, we've invested to help those ecosystems drive forward. But I think those are like the core, some of the core technologies. And the reason they're core is because if you're going to go build an analytic system, now it's been done for like 20 years. So, like, we've had two or three waves of commercialization and that was built after like 10 years of industrial of academic research so basically the point is like people have built the same things over and over again and now i think we
Starting point is 00:36:37 are finally at the point where like the patterns are clear and so rebuilding it again like you don't need to rebuild it again you can actually build it and the apis are the same so like things that are really like basically common to every olap engine and i'm talking everything like vertica had a snowflake has it dot db has it you know data fusion has it like you need some way to represent data in columns both in memory on disk because that's so much more efficient for the analytics workload. And so that turns into Apache Arrow is basically a way to store memory, store data in columns on memory, right? And ways to encode it. And Parquet is basically the same thing, but for disk.
Starting point is 00:37:19 I'm obviously skipping a bunch of the details, but between those two technologies, now you have the in-memory processing component, you've got got the storage component and then you're going to build one of these database systems you typically need a network component right because it's either distributed system or you've got a client server so you need somebody to send it over the network that's what arrow flight is and then often right you don't want to well i'll tell you like building a something that can run sql or some language like sq SQL is a non-trivial endeavor. But just like building a programming language is a non-trivial endeavor, but you can still do it.
Starting point is 00:37:54 But basically, you can do it these days because you don't have to build the whole thing from the front end to the intermediate representation to all the optimizations to the code generator. Because basically, LLVM is a technology for compilers that does all that. I think that's why you end up with systems like Rust or Swift or Julia. The reason they can be even made at all these days is because they don't have to go reinvent all the lower level stuff. Got to the same point with SQL engines where you don't need to go reinvent the whole SQL engine from the bottom up because it's basically all... We understand how to do it.
Starting point is 00:38:25 The patterns are there. Arrow is basically what you want. Obviously there's things about Parquet you could improve, but it's pretty good. And so with that type of building blocks, so I should say Data Fusion is a Apache project for that query engine, right? So it will let you run SQL queries and a whole bunch of other stuff, which we can talk about at length, but I don't want to take over the whole talk. Yeah, so would you say that data fusion is the equivalent of LLVM, but for database systems?
Starting point is 00:38:57 Yes, that's how I like to think about it. Okay, okay. I think that's a great metaphor to use to give a high-level view of what Data Fusion is trying to do. So, all right, question now. I understand the need of going from the pure time series workloads to a more analytical capability, adding more analytical capabilities. So it makes sense to go and look into a system that
Starting point is 00:39:26 is built primarily for analytical workloads and data fusion. If I'm not wrong, like, when Andy started building it, that's what he had in his mind, right? He didn't have time-series data in his mind. Which means that somehow these things need to be breached, right? You still need to work with time-series data. You can't abandon what you were doing. So how do you take Data Fusion, which is like a system for analytical workloads, and you turn it into the core of a time series database
Starting point is 00:40:05 that does everything that a time series database does and has to do, but at the same time also has, let's say, analytical capabilities there. Yeah, so you're absolutely right that there's a bunch of stuff that's time series specific that's not in Data Fusion, right? So in fact, in some ways, it's a great story because you get Data fusion not for free get like you spend a little bit of time but you know data fusion is there and you can work with a bunch of people make it better and then like the vast majority of it influxes time in terms of like where engineers spend their time is on time series
Starting point is 00:40:39 specific things right like so there's a lot of effort that went into the component that ingests data really quickly turns it from whatever this like this line protocol csv kind of stuff is into in-memory formats right someone's got to parse that quickly got to stick it in you've got to handle persist like getting that into parquet files quickly and there's a system in the background to like merge them together and whatnot so there's a huge amount of engineering effort that went into that. And then the systems aspect of knowing who has what data and where it is and all that sort of stuff.
Starting point is 00:41:11 So where Influx spent a lot of its engineering time is on those time series specific features, right? And then as long as data fusion is fast enough, it works. And so it's kind of cool.
Starting point is 00:41:24 Like early versions of our InfluxDB 3.0 products only did SQL because we just ran it through data fusion. However, InfluxDB also has a specialized query language called InfluxQL, which if you ever worked with time series, you'll know that SQL is miserable for a lot of different types of queries. So InfluxQL is actually not much nicer query language. So that's another example.
Starting point is 00:41:46 I guess it's very time-specific. But actually what we did was we don't have our own whole... Now InfoSQL supports, or 3.0 supports InfoSQL. We don't have our own whole query engine for InfoSQL. There's just a front end, right, that translates InfoSQL into the same intermediate representation that DataView, which are called logical plans. And then the whole thing, then the rest of it
Starting point is 00:42:06 is shared with the SQL engine. So maybe those are some specific examples of like, when we built the time series database, it didn't just magically come out of data fusion. We had to build the time series pieces around it. But we don't have to then go build the basic
Starting point is 00:42:22 vectorized analytic engine, right? If you want to hear a list of the type you know you need like it needs to be fast it needs to be multi-threaded you need to make sure it's streaming you need to make sure that it knows how to aggregate stuff really fast you have to do it quickly with multiple columns and single columns and all the different data types and oh there's you know five different ways of representing intervals and you know yep yep yep and yeah yeah no mixed total sense i think that's like one of the my opinion like also the kind of like traps that people get into when they are building like new
Starting point is 00:42:55 data systems is sql is the standard but at the same time like it's not great for many things and when you get like into specialized things then you're like, okay, oh shit, let's reinvent and build a new query language, right? And the problem, and I think that's what is great with what you are saying about data fusion, is that now it's much easier for someone to be like, hey, I can have the standard SQL that I support here, which makes it easy for anyone to be like, hey, I can have the standard SQL that I support here, which makes
Starting point is 00:43:25 it easy for anyone to go and work without having to do the heavy lifting of learning a new language. But at the same time, I can also add in my product new dialects or new seductive sugar in the existing SQL or whatever without having an overly complicated system there. And that's the equivalent of, let's say, what the user interface is in a way for applications, the graphical user interface. But in our world, the equivalent of that is the syntax that we provide to the user. And I think a big problem in the past, actually, because of the monolithic architecture, right? Like adding something new there pretty much had to go through the whole monolith. So that was a big no to go in that. And I think that's a lot of the value that's,
Starting point is 00:44:15 let's say, for the user, for the developer experience is broad because of how these systems work. And what Influx does is, I think, an excellent example of that. Yeah, we have SQL, and we also have the specialized syntax that we need to go and easily work with time series data and be more efficient and more productive, which is amazing, right? What's that, if you have an anecdote,
Starting point is 00:44:40 in a way, what's from data fusion and its focus in the analytical workloads made your life harder when you had to work with time series data, if there was something, right? The first major challenge is that the time series data model, at least as shown by... I'm going to talk all about Lion Protocol, which is the influx data version of it. But basically every system I know about that does time series, especially for
Starting point is 00:45:10 metrics, basically has the same data model. They call things differently, but it's basically the same. So you basically have some identifier that's a string, right? That identifies your time series, and then you have a value and you have the time series. And so Data Fusion is very much a relational engine, right? And so is Apache Arrow. Yes, they have structured types, but at the end of the day, it's basically tables of rows, right? And so
Starting point is 00:45:35 the very first challenge is you've got to map the influx data model back into a relational model. So we did, right? But that definitely is non-trivial. And it ends up like, you know, there's nulls and new columns can appear
Starting point is 00:45:51 and dealing with backfills and what the updated semantics are and stuff like that. There's quite a lot of logic that we had to work out as opposed to just we had Parquet files on disk array and we just run them, right? Like it's a lot more complicated.
Starting point is 00:46:04 We actually get this other thing in, we get a lot of work to figure out exactly how to map them into the relational model. Yeah. That's very interesting. Probably a topic on its own for a whole episode I'd like to talk about. But let's switch gears a little bit
Starting point is 00:46:22 because I want to talk a little bit more about DataFusion and the current state of the project and what's next. So you had, you engaged in using DataFusion, you're at a pretty mature
Starting point is 00:46:39 point right now, I guess, with how things look like using DataFusion. It's also how a project has gained much more traction. There are many people using it and many contributors back. What's next? Because, okay, a system like LLVM takes a lot of hundreds or, I don't know, thousands of engineering years to build and maintain. And that's probably true also for something like Data Fusion, right? What's next there for the project? What do you see being the part of the project that needs more love?
Starting point is 00:47:14 And where the opportunities are, in your opinion? I'll talk about the technical challenges and opportunities in a second. I just wanted to, at a project level, where I see the project first. So I think now we're at the point where there's early adopters like Influx and there are a couple other early adopter companies that built products on top of Data Fusion. So Influx, DV is one of them
Starting point is 00:47:36 obviously, but there's GrepTime, there's a company called CoreLogix. And then there's a whole, there's another so the early adopters maybe have been doing this for four years. Then there's people like maybe there's another, so the early adopters maybe have been doing this for four years. Then there's people like maybe in the last year or two have basically decided to build products on top of it.
Starting point is 00:47:52 And there's a whole new raft of them, like Sineda and Arroyo. And there's probably a whole bunch of other ones. I can't, like C-File, I guess, was actually one of the early adopters. But there's a bunch of other sort of startups now that are starting there because they want to build some cool analytic thing, right? And they realize
Starting point is 00:48:07 building a whole new SQL engine is probably not where they can afford to spend their time. So that's sort of where the adoption trend is. And I think we're just going to see more and more of that. This is obviously people I know. It's an open source project. Anyone can use this. I'm sure people use it that I don't know about. 100%. And then
Starting point is 00:48:23 the project has been part of this Apache Arrow governance thing, but we're in the final phases. I expect it to be finalized in the next week or two. It'll become its own Apache top-level project, which for those of you who aren't deep down in the guts of Apache governance, which you don't need to be. But it's just basically a recognition that the community is now big enough and self-sustaining enough that it needs its own place rather than being an arrow so arrow did it
Starting point is 00:48:48 was wonderful to be incubated there but it will be its own project very soon i think that'll help uh drive its growth as well so yeah so now technically yes i think from my perspective theta fusion now has basically very competitive performance. We just wrote a paper in Sigma, if anyone cares. And our results, if you look at it, querying per K files is we're basically better than DuckDB in some cases, not as good as some cases, but basically approximately equivalent. And so that's actually pretty good. I mean, the DuckDB guys are very smart. That's like the basic state of the art integrate OLAP engine these days so i feel very good about the the core performance and it has the basic so basically from like a if you're going to build a an analytic system today you need like basic sql you need all the
Starting point is 00:49:35 time like places that are hard to do that we've done in data features like like you need the basics of sql you need the basics of all the timestamp functions, which if you've never built a database, you don't understand. It doesn't seem like it should be that big a deal, but actually getting the time and time arithmetic right is just a major undertaking. But we've done that in Data Fusion. Implementing window functions is also another one that takes a long time that you might not even appreciate that those are a thing in SQL and so you have to implement them. Data fusion has that. And then you need a fast aggregation,
Starting point is 00:50:11 you need fast sorting, you need fast joins. And then not just do you fast, you need to be able to handle those three major operations if your data size doesn't fit in memory. So data fusion does all that, except the one thing it doesn't do is it doesn't do externalized joins yet. So if you want to go build a system on top of Data Fusion that was going to compete with other, like basically going off the complicated enterprise market,
Starting point is 00:50:36 like some enterprise data warehouse workload where they've got 300 joins in their queries, I'm not sure Data Fusion will do a great job. Actually, having spent four years of my life on the Vertica optimizer, I'm not really sure that any database system can do a great job on a query with 300 joins. Yeah. In terms of a sophisticated optimizer, it doesn't
Starting point is 00:50:53 have one of those. It has a fine one, right? It does fine, but it's not going to work great with hundreds of joins. Yeah. And it doesn't have externalized joins. Can you expand a little more on that? Because our audience probably doesn't know, first of all, what externalized joins means.
Starting point is 00:51:11 Yeah, so, yes. What this really means is: can you do these operations when the data doesn't fit in memory? Maybe I'll start with a simpler one, where you just want to sort the data... actually, let's do grouping, right? Let's say you're trying to group on all your different distinct user IDs or something,
Starting point is 00:51:25 and you're just calculating something like who clicked the most, or how many dollars, but you have a huge number of individual user IDs. The way a database will typically execute a query like that is: you read the rows in, you figure out the user ID, and you have a hash table
Starting point is 00:51:39 where you're keeping track of the current values for each one of the different users, right? So as the data flows in, you find the right place in the hash table, update the aggregate, and go on. That's called hash aggregation, and it's basically what every engine does.
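As a rough illustration of the hash aggregation described here, a toy sketch in Rust: the (user_id, dollars) rows are made-up example data, and a real engine works on columnar batches and vectorizes this loop, but the shape is the same.

```rust
use std::collections::HashMap;

// One pass over the rows, keeping a running aggregate per group key
// (here, a per-user dollar total) in a hash table.
fn hash_aggregate(rows: impl Iterator<Item = (u64, f64)>) -> HashMap<u64, f64> {
    let mut totals: HashMap<u64, f64> = HashMap::new();
    for (user_id, dollars) in rows {
        // Find the right slot in the hash table and update the aggregate.
        *totals.entry(user_id).or_insert(0.0) += dollars;
    }
    totals
}

fn main() {
    let rows: Vec<(u64, f64)> = vec![(1, 9.99), (2, 4.50), (1, 20.00)];
    for (user, total) in hash_aggregate(rows.into_iter()) {
        println!("user {user}: {total:.2}");
    }
}
```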
Starting point is 00:51:53 The problem is that if you have a huge number of distinct users, you might not be able to fit the hash table in memory. By the way, there are also all sorts of ways of making that fast, which is another whole fascinating discussion, but I'm just talking about externalizing right now. So you don't typically do it one row at a time. Anyway, I digress.
Starting point is 00:52:09 So the problem is: if your hash table doesn't fit in memory, what do you do, right? The first database system you write probably just crashes and gets killed by Kubernetes, because it keeps allocating more memory until the operating system kills it. The second thing you do is add a memory limit that at least tracks how big the hash table is; when it gets too big, you error the query, so that's better. And then the third, more sophisticated thing to do is to take the hash table's state and dump it to disk somehow,
Starting point is 00:52:36 writing it to some persistent storage, and do that a few times depending on how much data is coming in. So you write a bunch of little files to disk, then you read them back in, merge them, and compute the final result. That process of taking state out of main memory and putting it into some other secondary storage that you have more of is typically called externalizing. Okay, yeah, so stuff like spilling data to the hard drive so that the process doesn't get killed when it runs out of memory. Right. So Data Fusion does that for sorts and it does it for grouping, but it doesn't do it for joins yet.
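A toy sketch of that externalizing (spilling) idea, not DataFusion's actual implementation: aggregate until a memory budget is hit, dump the hash table's state to a file, and merge the spill files with the remaining in-memory state at the end. The file format, the tiny group limit, and the sum aggregate are arbitrary choices for illustration, and this only works as written for aggregates that can be merged by re-aggregating, like sums and counts.

```rust
use std::collections::HashMap;
use std::fs::File;
use std::io::{BufRead, BufReader, BufWriter, Write};

const MAX_GROUPS_IN_MEMORY: usize = 2; // tiny budget so the example actually spills

// Dump the hash table's state to secondary storage and empty it.
fn spill(table: &mut HashMap<u64, f64>, path: &str) -> std::io::Result<()> {
    let mut w = BufWriter::new(File::create(path)?);
    for (k, v) in table.drain() {
        writeln!(w, "{k} {v}")?;
    }
    Ok(())
}

fn externalized_aggregate(rows: Vec<(u64, f64)>) -> std::io::Result<HashMap<u64, f64>> {
    let mut table: HashMap<u64, f64> = HashMap::new();
    let mut spill_files = Vec::new();

    for (user_id, dollars) in rows {
        *table.entry(user_id).or_insert(0.0) += dollars;
        if table.len() > MAX_GROUPS_IN_MEMORY {
            // Hash table got too big: externalize its state to a little file.
            let path = format!("spill_{}.txt", spill_files.len());
            spill(&mut table, &path)?;
            spill_files.push(path);
        }
    }

    // Merge phase: read the little files back in and fold them into the
    // partial results still sitting in memory.
    let mut result = table;
    for path in spill_files {
        for line in BufReader::new(File::open(&path)?).lines() {
            let line = line?;
            let mut parts = line.split_whitespace();
            let k: u64 = parts.next().unwrap().parse().unwrap();
            let v: f64 = parts.next().unwrap().parse().unwrap();
            *result.entry(k).or_insert(0.0) += v;
        }
    }
    Ok(result)
}

fn main() -> std::io::Result<()> {
    let rows: Vec<(u64, f64)> = vec![(1, 1.0), (2, 2.0), (3, 3.0), (1, 4.0), (4, 5.0)];
    for (user, total) in externalized_aggregate(rows)? {
        println!("user {user}: {total}");
    }
    Ok(())
}
```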
Starting point is 00:53:11 Okay. So if anyone out there would like to contribute... If you want to do it, you know, that's a whole thing, yes. I wouldn't recommend it;
Starting point is 00:53:20 it's probably not an afternoon project, but there are certainly other people who are interested. I joke about it being an afternoon project, but it still amazes me. Data Fusion is the first open source project I've actually worked on. You know,
Starting point is 00:53:36 I've used open source software for a long time, but actually running one of these things, which is what we're doing, is different. I don't really know why people show up with contributions. Sometimes
Starting point is 00:53:45 it's clear: someone is working on it at their job and needs a feature. That's probably what happened in some cases. But I'm pretty confident a bunch of people just find it interesting. It's one of the other beautiful things about working on the internet: there are lots of people out there, a certain number of them find database internals interesting, and they show up. I'm pretty sure the first version of the window functions was written by a guy where I don't think it had anything to do with his job; he just thought it was an interesting challenge. I think he's a manager at B&B or something, not obviously using Data Fusion at all; he just basically pounded out the first version of the window functions. It's still amazing to me;
Starting point is 00:54:24 you know he did a great job. Hopefully he had a good time. I think he had a good time. I had a good time. We got a feature out of it. It's a really cool little vibe. That's amazing. All right. We are close to the end here and I have to give
Starting point is 00:54:40 the microphone back to Eric. I think we need multiple more of these recordings, Andrew. There are just so many things to talk about, and it's such a joy to listen to you. There's a lot to learn. So I'm looking forward to doing that in the future
Starting point is 00:54:58 again. Eric, all yours again. Andrew, I'm interested to know, we've gone into such depth on several subjects here, which has been so wonderful. Outside of the world of databases and time series and the subjects that we've talked about, in the broader landscape of data tooling,
Starting point is 00:55:17 what types of things are most exciting to you? I think the broader story of Parquet files on object store, data lakes, data tables, whatever people eventually decide to call that. I think that is an amazing opportunity, because now, you know,
Starting point is 00:55:35 it's not going to make things simpler. What it's going to do is let you make all sorts of even more specialized tools for whatever your particular workload is. So I think that is very exciting. And, you know, I'm also super biased,
Starting point is 00:55:47 but I think the idea of Data Fusion is super exciting too, because, as Costas was talking about before, previously if you hated SQL and had a better idea of how to do it, not only did you have to be really good at figuring out what the UX of that new tool should be, you also had to have the ability
Starting point is 00:56:04 to go do all this low-level database stuff that we were just talking about. Some people have that, like the guy behind Polars, Ritchie, actually, I think has that, but it's very rare. So I think having something like Data Fusion will let people innovate in the query language space, or whatever space, because basically it lowers the cost, so you can have a great product without having to invest all this time in one of these analytic engines. So I think that's a super exciting area as well, to see what people build with it. Yeah, I agree. It is really neat to see, you know, architectures and technology emerge that encourage all sorts of creativity by,
Starting point is 00:56:47 you know, unbounding people from what were traditionally pretty severe limitations, except for those rare people, like you said, who can manage all of these different things to create something purely novel. I think that's the most positive way to look at it. Another, more boring, economic way to put it is that it's just commoditizing these OLAP engines. That's another summary of basically the same thing, but I think it undersells the value, because it's not just that the cost comes down. What it really means is that a whole bunch of things become feasible that weren't previously. I think that's where the realization will happen.
Starting point is 00:57:24 Yeah. Very cool. The removal of constraints. Well, it's certainly going to be an exciting couple of years. Andrew, this has been an incredible conversation. We've learned a ton. I think the audience has learned a ton. And yeah, we'd love to have you back on sometime to cover all the things
Starting point is 00:57:40 that we didn't get to cover in this episode. Sounds great. I'd love to. Thank you very much. We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback.
Starting point is 00:57:56 You can email me, Eric Dodds, at eric at datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.
