The Data Stack Show - 129: Databases, Data Warehouses, and Time Series Data with David Kohn of Timescale
Episode Date: March 8, 2023

Highlights from this week's conversation include:
- David's background and journey to Timescale (2:12)
- What are time series databases? (14:13)
- How Timescale would have impacted David's trajectory early in his career (17:51)
- Innovation in PostgreSQL (21:02)
- Why does Timescale build their time series databases differently? (27:08)
- The challenges of building a new database on top of old software (32:22)
- Writing outside of SQL and Timescale's secret sauce (37:47)
- The importance of the developer experience in Timescale (54:08)
- How does someone know when they need to implement time series functionality? (56:51)
- Final thoughts and takeaways (1:04:57)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Discussion (0)
The Data Stack Show is brought to you by Rudderstack.
As we said last week, March is data transformation month and there's a competition.
If you submit a transformation to our source library and win, you get a $1,000 cash prize.
You can catch all the details on Rudderstack's Twitter.
Good luck.
Welcome back to the Data Stack Show.
Today, we're going to talk to David Kohn from Timescale.
Kostas, I love that we have so many different types
of databases on the show.
We've talked about some time series flavored stuff, but I don't know if we've actually talked with a specific time series database technology yet on the show, which is pretty exciting. I think what's even more exciting is that Timescale is built on Postgres. And so we can kind of dig into some of that stuff there. But I guess I'm most excited about getting a 101 on time series data and time series databases. I think so much of that happens in sort of your vanilla, you know, relational warehouse, right? A lot of analysts are building time series type stuff, which works fine.
So why does this, you know, why do we have time series specific technology?
So that's what I'm going to ask about.
How about you?
Yeah.
Well, so many things to ask, actually.
First of all, I'd love to learn more about the choice of Postgres.
Postgres is one of these database systems that has been the foundation of a lot of innovation.
For example, Redshift was based on Postgres
and many other systems. So I'd love to hear what's so special about Postgres that makes it a platform
for innovation and database systems and why they chose to go after this direction instead of building a storage and query engine
from scratch like most of their competitors are doing.
So yeah, I think we're going to have a lot to talk about there.
And it's going to be super interesting.
All right, well, let's dig in.
David, welcome to the Data Stack Show.
So excited to talk about lots of data things, but specifically time series data things.
Cool. Glad to be here.
Okay, you did not start in computer science or sort of the world of data in terms of technology, as we talk about on the show,
you started in environmental engineering. So can you tell us how you stumbled into the world
of databases? I guess it wasn't a stumble. It was an intentional step, let's say.
Yeah. So I was working at a little battery manufacturing startup in New York City.
Actually, they were based out of an old brewery in Harlem.
And we were actually beginning to produce some of the batteries there and had a test facility there.
It was actually really nice.
I was living in New York at the time.
I could walk to work.
It was great.
No one in New York has that. But I sort of felt like we needed something that would work better to deal with the battery data that we'd been gathering: the test cell data, like what these things were made of, as well as the battery cycling data. So we had hundreds of batteries that were being cycled, tested in different ways, trying various types of tests to see how they would age, all of that sort of stuff. And I was like, okay, well, someone needs to work on this. I wasn't an electrochemist, or I wasn't an electrochemist PhD, I should say, which the other electrochemists who were working there were. I'd studied some in undergrad and knew enough to be dangerous, but I wasn't that good at it. But I did have a bit of a computer background from my environmental engineering days that they didn't have. So I was like, okay, well, maybe I can help build a database to help us track this better and get people the things that they need to evaluate these cells more quickly, so they're not spending all of their time figuring out where various Excel spreadsheets are and then trying to write analyses on them and not doing it great. So I started looking into databases and, you know, honestly, the first one that I started with was FileMaker Pro, you know, woohoo.
I was using MATLAB to connect to it because, hey, engineering, you know, the hard science
side of engineering doesn't necessarily know the good technology.
And we were connecting to an Oracle database that the battery tester ran in its backend. That was okay-ish. But, you know, there were problems accessing it. We didn't have full control. There were all sorts of things that we wanted to do, where it was like, well, we really need our own thing. And FileMaker pretty quickly showed itself to not be the right thing.
Can I ask a question about FileMaker, though? Because FileMaker is totally an origin story for many things. It's interesting how FileMaker is, in and of itself, sort of a first step, a catalyst almost. Is there anything you remember about... I know you hit the limitations, but is there anything you remember about FileMaker where you're like, this is pretty nice? Because for what it is, or for what it was, I guess is probably a better way to say it. It had some cool stuff.
Yeah, I mean, it certainly didn't scale all that well, but it does feel a little bit like the gateway drug to some extent, where it's like, well, I'm used to working with spreadsheets, but this can do a bit more, and it also holds your hand a bit more. And so I think that honestly I felt a little bit the same with MySQL. So the next step was MySQL for a little bit, and then I moved on to Postgres pretty quickly after that. MySQL has a little bit more of the hand holding, but Postgres is more powerful in my mind, and it has a lot more constraints and other sorts of stuff that I think are really good, but it's a little bit less easy to get started, et cetera, et cetera. And so I sort of made my way down that pathway over like six, eight, ten months, and then just spent a lot of time reading the Postgres documentation. And so I've sort of made a bit of a career of being able to read the Postgres documentation better than other people, which, you know, works, I guess.
Hey, look, that's great. It's a free education that drives the career. I love it.
Yeah. But I think the Postgres documentation, if you're willing to dive into it, it's pretty technical. You have to really get in there, but it's really good. It does describe a lot of stuff really well.
So I quickly got into using things in ways that they weren't really meant to be used. Maybe that's part of why I like engineering in general: that's the creative side of it, where you figure out how to turn this thing and shove it in there and it works. And so I always sort of found myself doing that. And I mean, even trying to deal with time series in Postgres at the beginning, it wasn't great at it. And so I started futzing around with arrays and all sorts of other stuff that Postgres has.
Oh, wow.
Yeah. And then I went to a Postgres conference, and that led me to my next job in databases, because I started talking to someone about TOAST and arrays and all sorts of other stuff and what I was thinking about doing there, and it was like, you should come work for us. And that's how I ended up at Moat, which is an ad analytics company.
They had switched, you know, their backend from MySQL to Postgres and had all sorts of other data warehouse-y type things that they were working with. But there were a few people there who were really into Postgres and thought it could do more. And the flexibility that Postgres allows in terms of data types, and some of the other interesting stuff that it could do, allowed us to build some really interesting stuff. So we ended up
sort of taking out Vertica and Redshift from our backend, at least our customer facing backends,
and replacing them with some like a couple different actually like Postgres backends
that were pretty highly tuned for our specific use case. Like we had essentially a data warehouse,
but for some reason our app always wanted all the columns
and we had like 120 columns.
And our app always wanted all of them.
So you throw that at a column store
and also you're like trying to serve a front end off of this,
which means like a column store is usually made
for like relatively large columnar scans
and relatively few clients
and now you're trying to run a front end off of it, where you have a fair number of clients and you're always getting all the columns, and the column store is just like, nope, sorry, that's not what I was built for.
Well, that's kind of why I said, when you said before that you started to go down this path of trying to do time series-esque stuff, it's like, okay, well, things are going to get pretty gnarly with vanilla Postgres pretty quickly. But it sounds like you actually did some cool stuff.
Yeah, so at Moat, basically, I helped build a data warehouse replacement that stored all of its columns in a composite type array and would give you all of the columns back, but it compressed them a bit and used a lot less IO in order to do it. And we ended up getting rid of Vertica and Redshift and just sticking with that. And another one that we built, which another person mostly worked on, was basically Athena light, where we actually had compressed SQLite files in S3. And then we would kick off all of these Lambda jobs to go and grab SQLite files and pull them up and run some queries on them. And I mean, that was where I started using Timescale.
So I started using Timescale at like 0.5.
Oh, wow.
As you know, mostly because it was doing partition management.
And I had already started doing partitioning and all sorts of stuff to handle that data
warehouse that I was talking about.
How did you find Timescale, by the way? Because, well, okay, so was partitioning sort of the entry point, where you were doing partitioning and you were trying to figure out a better way to do that?
Yeah, so I gave a talk actually at PGConf about the data warehouse that I'd built, and I met the Timescale folks there. But I was like, oh, I should start using this, because it's a pain in the ass to do all of the partition setup manually. And they were doing it more automatically. And there were a lot of reasons why that was really nice.
And there were a few like key things
that really made that really good for us.
So I started using it very early.
And, you know, it wasn't as full featured as it is now, and we weren't taking advantage of all of it, I'm sure. But it was really helpful for what we were trying to do there. So yeah, it was all sort of the partition management type stuff that we needed to do.
Yeah, totally. And with Moat Analytics, can you explain what it was? Was it a tool where people would run their own analytics? Was it for companies who needed to serve analytics to their customers? What type of analytics was it?
Yeah, so it's ad analytics. So yeah, advertising. Okay, got it. Yeah, advertising
analytics. And so there was a whole side that was doing JavaScript and pixels and all of that stuff.
And there was a whole real-time processing thing that we had.
And then what I was mostly working on...
So you're ingesting that into Postgres.
Yeah. So once we had, essentially, the daily aggregates from that real-time pipeline, we were ingesting that into Postgres for all of our customers across all of the segments and other sorts of... we had slicers and roll-ups. I don't remember exactly which one was which, but, you know, a few different ways that we could slice and dice data. And we had a whole front end based on that, and that front end was what we were powering.
We had a whole thing also that was quite fun
where we actually took a lot of the PHP
that was powering that front end.
And we took a lot of the calculations out of PHP
and moved them into these entirely ephemeral Postgres instances.
So we basically...
Because PHP would, you know, some of these calculations,
it was like, I want all the memory in the universe.
And then it would die because PHP.
And so we basically said, well, Postgres could do all these calculations really nicely. And so we just built this thing where, okay, we have these instances, and they're basically stateless.
They would actually reach out to the other Postgres backends
and other types of backends that we would use via foreign data wrappers.
And then they would basically just serve as the computation engine for a backend.
And all the PHP was then doing, basically, was writing SQL queries
that would run, it would create some, like, temporary tables.
It wasn't exactly temporary tables, but basically temporary tables, put the data in and then run some queries on that.
And then, you know, those would go away after some time.
Yeah, yeah.
And that worked really nicely and really helped us.
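For a picture of that pattern, here is a minimal sketch of the foreign-data-wrapper side, using postgres_fdw, the standard Postgres extension he is referring to; the server, table, and column names are hypothetical stand-ins, not Moat's actual schema:

```sql
-- A stateless "compute" instance reaching out to another Postgres
-- backend via a foreign data wrapper, then staging data in a temp table.
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER metrics_backend
  FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (host 'metrics.internal', dbname 'metrics');

CREATE USER MAPPING FOR CURRENT_USER
  SERVER metrics_backend
  OPTIONS (user 'app', password 'secret');

-- Pull the remote table definition into this ephemeral instance.
IMPORT FOREIGN SCHEMA public LIMIT TO (daily_aggregates)
  FROM SERVER metrics_backend INTO public;

-- Stage just the slice the calculation needs, then compute locally.
CREATE TEMP TABLE work AS
  SELECT * FROM daily_aggregates
  WHERE day >= now() - interval '30 days';

SELECT day, sum(impressions) AS impressions
FROM work GROUP BY day ORDER BY day;
```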
I mean, that's really kind of like headless flavored architecture, which is super interesting.
Yeah.
Very cool.
Wow, that is fascinating.
Let's zoom
out and just talk about
time series for a minute.
You talked about partitioning, right?
That's one
component of this, but
generally people know and love Postgres, right?
You obviously, you know, you've studied the documentation
and built a successful career off of that.
And Postgres is, it's awesome.
Can you just give us a quick 101 on time series databases?
And then, of course, we have to talk about why Postgres. Actually,
Kostas, you probably need to take the why time series on Postgres, which I can't wait to hear
about. But can you just give us the 101? So time series database, how is it different? Why is it
different? Why do you use it, et cetera? Yeah. So I ended up leaving Moat and joining
Timescale about five years ago, partially because I saw the challenges there, right?
So I think the challenges and also the opportunity, right?
So the big thing with time series for me is that you're usually doing something that
traditional databases weren't really built for.
So you're usually very write-heavy, and they're usually inserts. They're not updates. So a traditional transactional OLTP type workload is much more oriented towards updates than it is to lots of inserts. You're usually doing some sort of repeating value, or you're tracking states, like event data, something like that. So everything is associated with a timestamp. And for us, when we talk about this, it's not, you know, write-only or whatever, it's mostly. So every once in a while you want to go back and update something, but it's a more rare occurrence. And so those are two of the defining characteristics, where
you have this repeated measurement, something that you're measuring over time. The classic examples are of course, things like sensors, any sort of sensor where
you just ask, okay, what was the temperature at this time? What was the temperature at this time? What was the temperature?
Yeah, or the battery...
Exactly. The battery example, where we had temperature sensors too, but also voltage, current, power, whatever else we were measuring. But also the observability use case in computer science, and, I mean, some things that are more typical data warehouse-y type bits, but where you're really working on mostly more recent data, I would say; that would be more time series-ish. Though I think that maybe distinguishing between data warehouse and time series data is slightly harder in some ways. I think one of the things that...
That's a great statement.
Yeah. So I think that often time series and data warehouse type data have a lot in common,
right? There's a lot to do with more columnar oriented data. With a data warehouse, you're often not doing updates; you're instead tracking updates and then treating them as inserts in the data warehouse. So you have a point in time where a value changed, and that's treated as an event, essentially. And so it looks a little bit more like a time series. And I think that
those lines are blurry and it's something that we're trying to figure out where do we fall in
that? But I think in general, with time series you tend to be read heavy in relatively recent periods; a lot of the type of workload you're doing is where you have more clients, where it's like, I have dashboards for all of my clients to see what their sensors were doing.
They're mostly looking at the most recent data, but sometimes they want to go back and dive deep
into a certain period. Whereas more data warehouse workloads, I would say,
you usually have fewer clients. They want to scan most of the data to do a big machine learning job on basically everything.
And that's the sort of distinction that I would make.
But there is, I think, a little bit more overlap there, if that makes sense.
Yeah.
Yeah. So let's go back to the Moat example, right, where you adopted Timescale for, you know, partitioning purposes. But if you had to start over at Moat and rebuild everything, how would you build it today?
I think that some of the choices that we made were actually pretty good for what we were doing.
Because a lot of our queries just wanted everything, we actually made some good choices, and some of the things that we would do would actually be somewhat similar right now, assuming that constraint is still there. I mean, even with our compression in Timescale: we have implemented compression now on top of Postgres, and that could work nicely, except that it's still column oriented to some extent. And if you're really trying to get everything on every query, we still have some overhead to do the projection.
Sure.
And bring all the rows back.
So, you know, you'd have to think a little bit about that.
You could still use composite types in our compression.
There are various things that you could do. But we had a lot of things where we sort of self-implemented something like the continuous aggregates that we have now, so I would definitely use those. That'd be nice, to have different levels of roll-ups of this data over time. And that's one of the things that we do to help make some of those queries that might want to scan backwards in time, or do something that's a little more maybe data warehouse-y to some extent, where you're taking something and you want to, say, compare it to a baseline over a year. I want to know this temperature: how does it compare to the other temperatures over the course of a year?
You know, the battery voltage at this hour: how do I do that in a reasonably efficient manner without having to rescan all of the individual points? That's a big question that's really important to try to answer, and continuous aggregates and things like that really do help do that. And that was a big problem for us. That was something the data science team would do once a month or something, right? And then they'd publish the baselines, and it wasn't all that flexible. It was hard to do. There were things where it was just difficult. And so I think that there are things there, with some of the stuff that we have now, that we could make a lot easier.
With that said, I mean, there's also other stuff that I think would be really interesting for some
of the real-time stuff. I heard a little bit of the stuff you talked about with Materialize a little while ago.
Yeah.
And that sounded interesting to me for some of the stuff where we had a whole in-memory, crazy thing for doing the real-time stuff back there. And that seemed like it could be an interesting type of system.
Yeah, totally. I mean, they run on Postgres natively.
Okay, one more question
before I hand it off to Kostas.
This is, it's so interesting to me
to hear about what you were doing
with Postgres.
Looking back on that now, were you doing some things with Postgres that were maybe more rare? And did you find... it's a little bit of a leading question, because you probably were... did you find other people in the community who thought Postgres is actually a good way to do some of these things? I mean, of course, you were stretching it to some of its limits without something like Timescale, before you adopted that. But what was that like, to sort of be breaking new ground? I don't know the right term to use, but doing some things with Postgres that, you know, were not really conventional.
Yeah. So I think, I mean, one of
the things that we were doing was packing rows into arrays. And that's definitely a characteristic of time series data, I'd say. At some point, when you have billions of rows, you just have to, in order to make sense of them. There's just no choice. So one of the things:
we were using TOAST, which is my favorite acronym in Postgres, at least, and maybe in
this world, which is the oversized
attribute storage technique, which is basically what Postgres does when you can't fit a row onto
a page. It takes it and it chunks it up and it stores it in another table. It does that for you
automatically. You don't have to do anything. It just, it says, I can't fit this on a page.
I have to store it somewhere else. So we were using that essentially to our advantage, to say, okay, we're usually going to access these rows together. I don't want them randomly distributed. I don't want Postgres's row overhead and some of the indexing stuff. I actually want to be able to index to a group of rows. So we're going to store them in an array and let TOAST do its thing, and then take that out. And that was not at all what Postgres was built for; they were just trying to figure out, how do we deal with this problem where sometimes rows are too big for a page, right? That's not what they were trying to build, but it turns out it solved our problem really nicely. And some of the folks in the community were like, okay, cool, that's great.
Yeah, right. And it's actually some of the basis... I mean, we have more advanced techniques for how we compress the data and store it out, but it's still using the TOAST mechanism to store it out of line. So in Timescale we use a modified version of that, one that's more performant and written in C rather than pure PL/pgSQL, and all sorts of nice stuff. But in the end, that fundamental thing that Postgres built, which is, how do I store a larger block of data out of line from the main table, and this sort of insight of, okay, I want to take that and have it correspond to a set of something rather than to an individual row: that's still the fundamental basis for some of what we're doing there.
And it also means that the main table is very small, right? So if I store a little bit of metadata about what's in that row, then I can avoid going and looking in TOAST for that section. And I've basically implemented a skip list like what's common in a columnar database, but I've done it inside of Postgres.
Lickety split.
Yeah.
Right?
And so, you know, that's the sort of stuff
that we were doing a little bit of at Moat
and we've done a lot more of at Timescale.
Yeah.
And it's just taking what was built for something else and repurposing it a little bit and flipping it on its head.
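To picture the trick, here is a hypothetical sketch of the array-packing pattern; the type and table are invented for illustration, but the mechanics are the ones described above: a composite-type array that Postgres TOASTs out of line, plus small per-batch metadata in the main table for skipping:

```sql
-- One row holds a whole batch of readings; the large array is moved
-- out of line (and compressed) by TOAST automatically.
CREATE TYPE reading AS (ts timestamptz, temperature float8, voltage float8);

CREATE TABLE readings_packed (
    device_id int,
    -- small metadata in the main table lets queries skip batches
    -- without ever touching the TOASTed array
    min_ts    timestamptz,
    max_ts    timestamptz,
    readings  reading[]
);

-- Unpack on read:
SELECT device_id, (r).ts, (r).temperature
FROM readings_packed, unnest(readings) AS r
WHERE device_id = 42
  AND max_ts >= '2023-01-01' AND min_ts < '2023-01-02';
```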
And it's one of the things that I really do like about Postgres is that they have all these extension hooks and other sorts of things.
I mean, even just the way that it was originally built to accept different data types so you could define your own so that it was more extensible.
Like, that's been built into Postgres from its beginnings.
And so, yes, it's something new, but it's also part of what they imagined, just in a slightly different way. I think we're taking it a little further than maybe most people have.
But it is something that was imagined from the beginning of this project 30 years ago,
that they wanted it to be more extensible, wanted it to be something that was more powerful,
that you could build your own thing on top of it and customize it more. And it would be more
flexible to work with different kinds of data. And so that has been there. That has been a
through line of what Postgres has done for its entire history from the days back at Berkeley, I guess.
I love it. I love it. Spoken like someone who has been a diligent student of the Postgres docs.
No, no, super helpful. Okay, Kostas, I've been monopolizing here, so please,
please jump in. I could keep going, but you're probably going to take a couple of questions
off the tip of my tongue.
Yeah, yeah.
Thank you, Eric.
So, David, I was listening to what you were saying about, like,
building a time series database, right, on top of Postgres.
And what I find super interesting here is that it almost sounds a little bit contradictory in a way, because we make the claim that, okay, time series data is special enough to require its own, let's say, way of working with it. But at the same time, you decide to use a technology that initially was not built for that to do it, right? And I'd love to hear the why first, and then we can talk about the how. But why not go and do what, let's say, all of the time series databases out there have done, which is to create their own DBMS from scratch, rather than build on top of Postgres? Why did you choose that?
Yeah, so I think a few things, knowing a little bit about the history. So I joined Timescale when it was about 11 people, and it had been something else before that. And Timescale actually came out of one of our very early employees saying, maybe I should just build some of this on Postgres, because we tried some time series databases and they weren't really meeting our needs very well. And so that was the early history of Timescale: this isn't working. Why isn't it working? Well, I actually want some of the things that come with a standard database approach. Like, I want to be able to use SQL. It turns out it's pretty powerful. There's a lot of stuff it can do. And I kind of know how to use it.
I want to be able to do some of the things like a join to a reference table that has the information about the sensors that I'm using, right?
I want to just have that there.
So, some of what we wanted to address: if your OLTP workloads and all of the information about what the time series are live somewhere else, then it actually becomes really hard to use your time series database effectively, because you can't do something like a join. You can't ask the questions about all of this other
information that's out there very easily. You have to then essentially write your own application to do that. So those two things living side by side in some way or another was one of the big reasons to do that. There's also all sorts of stuff around atomicity and consistency. It turns out, when you write your own engine, you have to make some hard choices. And often in time series engines, it's like, well, you can't actually go back and rewrite history: we can't do out-of-order inserts. And in time series use cases, out-of-order inserts are rare, they're infrequent compared to the overall number of inserts usually, but they can still be really important.
Yeah. Right. So there were some trade-offs that we were making where we're like, well,
we get this for free, right? We get the drivers for free, which I think is, you know, can be good and bad to some extent. And I think that there are trade-offs in all of these approaches. And like,
there are times when Postgres has held us back to some extent,
right?
Like, it is a bear.
Yeah.
You can't work with a 30-year-old C code base with however many million lines of code in it without having to figure out all of the interactions and the weird shit that you have to deal with. But it also means that you get 30 years of them seeing all of these use cases in the wild. That weird shit, there's a reason for it; someone needed to do that. And so it gets you somewhere in terms of what the real world use case for a lot of this stuff is. And figuring out where you can deviate from that and where you shouldn't has, I think, been one of the big challenges for us.
But I think there's a lot of value there. It also means that you're less likely to lose data, which is important. Not always the most important thing in some time series use cases, where it's like, a duplicate is a duplicate. But you already have various knobs that you can twiddle in Postgres where you can say, well, I'm okay if I lose a little bit of data here. There are things that you could do. But I think having some of that consistency built in really helps you get started and build something out that is easier to work with. So I don't know if I actually answered the question there, if that was helpful, but yeah.
Oh, 100%.
I mean, I think the history of database systems is like a history of trade-offs in general.
I mean, if you take it from an abstract point of view, you can define a database system that is universal, right? At the end, it's all about storing and managing state, and allowing the user to perform some computation on top of that, right? Things start getting interesting when we need to start making trade-offs. And these trade-offs are what is causing all these flavors.
So definitely, I think if there's something that has been proven,
is that there's no one database system that can manage all workloads at all times.
But a follow-up question to that.
Okay, you answered the why,
but let's talk about the how, right?
How do you take something as complex as a database system that has been built for 30 years, more than that, and build on top of it a completely different database system? What I would like to hear, I think, is both what Timescale has done to make this happen, but also what Postgres offers that allows Timescale to do that, right?
So there are a few things that I think are key.
And Postgres has had an extension system for a long time.
And that extension system has sort of, I'd say, two different sorts of interfaces that
the extensions can use.
So there's one set of interfaces that is basically entirely in SQL. And it's basically like being able to create types, being able to create
functions, being able to create aggregates, being able to create operators. And actually,
the part of the product at timescale that I work on, the toolkit team is written in Rust,
and it mostly interfaces with Postgres at that level. So types,
functions, operators, all mostly in SQL. Very little work behind the scenes on the C code.
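As a toy illustration of that SQL-level surface, here is a custom aggregate defined entirely in SQL, with no C code; the aggregate itself is made up for the example, not one of Timescale's:

```sql
-- A "product" aggregate built from a plain SQL transition function.
CREATE FUNCTION product_step(state numeric, val numeric)
RETURNS numeric LANGUAGE sql IMMUTABLE AS
$$ SELECT state * val $$;

CREATE AGGREGATE product (numeric) (
    sfunc    = product_step,
    stype    = numeric,
    initcond = '1'
);

-- Usable anywhere a built-in aggregate is, e.g.:
-- SELECT product(growth_factor) FROM daily_growth;
```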
And then there's another set of hooks that Postgres provides that are essentially places in
the actual C code where it allows an extension to take over control and provides an API inside
the C layer, where you can sort of take control and do something different. And the main timescale
DB extension works more heavily on that side. And that can be hard. There were some clear things that we wanted to do to improve the partition management stuff, which Postgres has gotten much better at over time.
So when we started, it was on 9.6, and declarative partitioning wasn't even really a thing in Postgres.
So we were still working with the old inheritance-based partitioning system.
And over time, Postgres has gotten a lot better there. We've also gotten a lot better and some of the hooks are a little easier to work with. But then there's all this stuff around compression
and continuous aggregates and the other sorts of hooks that we need to look into to then take over
processing and certain types of just optimizations
that we do that are more specific to time series data, the way that we order the partitioning
and can do various types of constraint exclusion or partition exclusion inside the planner and
inside the optimizer. We're also working on some things around aggregation and making that faster at execution time. So there are various things that you can do and various ways that you can hook into Postgres. It actually offers a lot of these hooks, which I think are really helpful. And I think that we've also helped drive the community forward, right, in some of these senses. It's a great way for the Postgres community to have innovation happen, and to let them at least see that there's something going on here that they need to respond to. But, you know, they are often a little bit slower, just because they're a huge project that has 30 years of history and all that stuff. They release once a year with a new version, so some things, if you want to contribute back upstream, you might not see for two or three years, which is a long time. And so we now have a team that's contributing back upstream
more, which I think is great. It's really good for us to be giving back in that way. But you see that it just takes more time. And of course, there's a lot of benefit to that for us too, because there are a lot more people working on Postgres than are working on Timescale still, right? That's going to be the way it is for a while, probably, and so we get the benefit of all of that work because we're an extension. It takes some time also for us to make sure that we're compatible with the next version and all that stuff, right? So those are the trade-offs that we see. We very much wanted to be an extension, not a fork. There have been a lot of successful projects that have forked Postgres in the past, of course; I think Vertica is one of them, and Redshift's another one. But we didn't want to go down that route. We wanted to make sure that we stayed an extension, so that we could keep benefiting from the community as well as giving back. Because a lot of those forked projects started trying to unfork at some point. They're like, oh my god, all these things that we want have gotten into mainstream Postgres, and now we're screwed because we're a fork. So...
Makes sense. Okay, you mentioned there are two extension interfaces, in a way, and you're working on one of them, the one that has to do with extending SQL.
And you also mentioned that you write in Rust there. So can you get a little bit more into
the details of that? First of all, why do you need a language outside of SQL to do that?
What's the reason for that? And how does it work with something like Rust?
Because I would assume like, you know,
the system was built a lot earlier than Rust itself, right?
So how does this work?
Yeah.
So Postgres can work with lots of different languages. It can sort of hand control over to them. As long as you stay within some of its memory allocation type stuff, it works fine. And it even has some procedural languages that you can use within Postgres through the sort of SQL interface. Most of why we're writing in something like Rust is because we are writing new types of, say, aggregates or functions where we want a bit more control than SQL gives us.
I mean, SQL is Turing complete, so it's possible, but it's not necessarily efficient to write it that way. We want to have more control over what's going on in the innards. And so a lot of the stuff that we do on the toolkit team is writing things around data sketches, like HyperLogLog, and some of the frequency-type counting, like count-min sketch. We do things with counters and other sorts of things that just make analysis easier. So when we write a custom aggregate or a custom type, we're giving you something that you can then use inside of SQL. And it makes the interaction with SQL and with the rest of the database easier somehow, to solve a particular problem. That's really what we're trying to do there.
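As a concrete taste of those sketch aggregates, here is roughly how the HyperLogLog support is used from SQL; the function names follow the timescaledb_toolkit extension, though exact signatures vary by version, and the page_views table is hypothetical:

```sql
CREATE EXTENSION IF NOT EXISTS timescaledb_toolkit;

-- Approximate distinct visitors per day with a HyperLogLog sketch,
-- instead of an exact (and expensive) COUNT(DISTINCT ...).
SELECT time_bucket('1 day', ts) AS day,
       distinct_count(hyperloglog(32768, visitor_id)) AS approx_visitors
FROM page_views
GROUP BY day
ORDER BY day;
```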
So why did we choose Rust? Essentially because it has some nice qualities around memory safety and other bits that we thought would be good to take advantage of, as well as some nice libraries that we've used to solve some of these problems.
Yeah, 100%. So, okay, I think I get an idea of one part of what TimescaleDB is adding on top of Postgres.
What other secret sauce
is in there, right? What are
the constructs that Timescale
introduces that they are not native
to Postgres, right?
So one construct I talked about a bit is our concept of a hypertable, which technically comes from the way that we do some of the partition management. It's a way of treating a set of partitions more like a normal table. The partitions are created automatically as data comes in. And so that's one thing that we add on. And the name hypertable actually comes from the concept of a hypercube and multi-dimensional spaces, and how that hypercube sort of gets segmented into individual partitions. Then a hypertable has something like compression, which really reduces the size of the time series data set. It takes advantage of some of the qualities of time series data, reduces the size, and can actually increase performance on a number of types of queries, especially where you're IO bound, which is often the case. So you can compress a hypertable by adding a compression policy. You can also add things like continuous aggregates, which are essentially materialized views that have some sort of aggregation involved.
Like we said, an aggregate is something that we often do with time series data.
We want to have some summary statistics of some sort that we're adding on there that allow you to better understand what's going on. So continuous
aggregates are basically an incrementally updated materialized view that also keeps track of any inserts to the underlying hypertable. And it'll go back and do essentially eventually consistent updates to that, to make sure that it's going to give you the right state.
So for out-of-order inserts or updates, it'll deal with those, as well as doing real-time queries. Basically, the way that the continuous aggregate works is that some of it is materialized and some of it, the query happens in real time. So for the most recent data the queries may happen in real time, and those get put together in a nice way.
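Tying those three constructs together, here is a minimal sketch using the standard TimescaleDB calls; the conditions table is a made-up example:

```sql
CREATE TABLE conditions (
    ts        timestamptz NOT NULL,
    device_id int,
    temp      float8
);
-- Turn it into a hypertable: partitions ("chunks") are created
-- automatically as data arrives.
SELECT create_hypertable('conditions', 'ts');

-- Compression: older chunks get compressed, columnar-style storage.
ALTER TABLE conditions SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'device_id'
);
SELECT add_compression_policy('conditions', INTERVAL '7 days');

-- Continuous aggregate: an incrementally maintained materialized rollup.
CREATE MATERIALIZED VIEW conditions_hourly
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', ts) AS bucket,
       device_id,
       avg(temp) AS avg_temp
FROM conditions
GROUP BY bucket, device_id;
```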
And then something like the hyperfunctions that we work on
offer specific time series types
of functionality that then interact with these other things in nice ways. So for instance,
one function that we've been working on, which we released a while ago, is actually an aggregate called counter_agg, and it's about dealing with resetting counters. Now, where does that come up?
That might come up in an observability setting where you're tracking the number of visitors
to a website.
And sometimes that counter sort of rolls over.
It might come up in an electricity meter setting where your utility, you have all these
electricity meters.
Some of them might reset every once in a while.
But you're also trying to figure out how you calculate the actual energy used in a
period when what you're sending is the total amount of energy used over all time, essentially.
And that's not necessarily the easiest thing to deal with in SQL.
So that's the type of stuff that we write a function in Rust with the right types of stuff
and interactions with the Postgres aggregate system, which has its own whole set of things that it does, right? So around parallel workers and around all sorts of other
fun things that Postgres has, it has a whole aggregate framework where it allows you to create
your own aggregates and deal with parallelism and all sorts of other fun stuff. But so we make that,
we make something that's easier to deal with, that gives you a way to interact with it
in a nice way and define that query in a way that is more understandable,
more repeatable and solves for the edge cases.
That's like, okay, I have some resets of the energy meter.
I don't want to have negative 10 million kilowatt hours used when my meter rolls over for some
reason. That's a bad thing. My billing
system is going to be completely broken. We also, you know, we interpolate to the edges of buckets
so that you can say during this 15 minute period, even though I'm measuring every two minutes
and that every two minutes isn't exact, right? I want to interpolate between the points that I
have, so that I get the usage that I can really attribute to that 15-minute period.
So that all happens inside these aggregates in a way that then means you don't have to do it with like six different window functions and four sub queries.
And, you know, you can at least hopefully make, you know, keep it down to two.
And then we interact with things like gap filling.
So what do I do when I'm missing buckets?
I want data every 15 minutes, but sometimes the network was down and like,
what happens there? How do I fill in for that period that I missed?
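To make that concrete, here is a hedged sketch of both pieces; counter_agg and its delta accessor come from the timescaledb_toolkit extension and time_bucket_gapfill from TimescaleDB itself, though exact signatures vary by version, and the tables are hypothetical:

```sql
-- Energy used per 15 minutes from an ever-increasing (but occasionally
-- resetting) meter counter: counter_agg handles the resets.
SELECT time_bucket('15 minutes', ts) AS bucket,
       delta(counter_agg(ts, energy_total)) AS energy_used
FROM meter_readings
GROUP BY bucket;

-- Filling in buckets a network outage left empty, by interpolation
-- (gapfill requires an explicit time range in the WHERE clause):
SELECT time_bucket_gapfill('15 minutes', ts) AS bucket,
       interpolate(avg(temp)) AS temp
FROM conditions
WHERE ts > now() - interval '1 day' AND ts < now()
GROUP BY bucket;
```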
So those are the sorts of problems that we help people who are working with time series figure out how to solve, working within that SQL structure, and even extending a little bit what the syntax can do. I mean, there's only so much in terms of extending the syntax that we can do, because we can't modify the parser; that is a limitation of the Postgres extension system. It's also like, at what point are you no longer SQL?
But like there is a lot of stuff
that we can do even just with types
and other sorts of stuff.
So we're even talking about, how do I give people access to, for instance, a data frame like syntax that I could write in SQL, and have the database figure out what to do? It would actually still be SQL that I've written, but it's now going to be implemented almost as a data frame, so that people who are more familiar with that have a way to get started, or a way to do some of these calculations that otherwise would be really hard. So that's another thing we're considering and thinking about how we can do; we're not quite there yet. But yeah.
Yeah.
That's a great point. I hear you talking, and I keep thinking about, let's say, the interface with the user and the experience the user has, right? And you said it many times; again, that's something I find super interesting. SQL is something everybody knows, or can start working with, but then you hit some limitations on how you express things, and things start getting ugly, and suddenly you need to add stuff there. So I'd like to hear from you, because you're one of the... there aren't that many people out there, to be honest, who are working on trying to build experiences around a data system and a database. And I'd love to hear from you how you do that. I think there's a lot of information and knowledge out there about user experience, like how to design for an optimal experience on a web application. But here we are talking about data systems. And actually, it's not even the system, it's the data, right? It's the interface to the data. The user at the end doesn't really care what the engine is, right? It cares about the data. So how do you think about that, and how do you design and build experiences in a system like TimescaleDB?
Yeah.
The answer is
it's tough.
You know, that's a "to be determined" or something.
But yeah, I mean, so it is a difficult thing, I think.
You have to figure out how you make something simple enough that someone can use it, while also powerful enough to really make it easier to work with.
And thinking about how you conceive of that distinction
and what you're going to leave to the user,
what you're going to have the database do,
what you actually want to do somewhere maybe outside of the database,
like in a driver or something like that,
that's something that we're beginning to think more about: which parts of the system should be optimized for ease of use, versus which parts of the system should be optimized for flexibility and sort of power and expressiveness, so that you can then build something on top of it that's going to be more performant or easier to use or whatever else, right? So I think that there's a balance that we're still figuring out. One thing we do is release things experimentally before we release them fully, so that we can get some feedback from someone. Often they're coming from some sort of user request, like, I have this problem, how do I solve it? Well, here's this horrible SQL query, and it's not very performant, and whatever else. Okay, how could we make this better? And we give it to them; we're like, okay, here's an experimental version, try it out. And we try to get some feedback and see, okay, were those the right parameters? And some of the things that are harder to change, we try to iterate through one or two releases and then actually release properly. So that's one thing that we've tried to do, and it has helped, I think. We've definitely gotten feedback from early users and then gotten more users as we release it more fully. But people who have those problems, who have written that SQL query, they're often willing to try something else, because they've seen that it's not great.
Right. And then I think it really does help to have a pretty deep understanding of what's going on in the database backend, so that you know a bit more where the leverage points are that you can hit that really are going to impact performance.
And this is one of the things that I think is still a struggle for us.
I think, you know, something we're trying to figure out if it's the right approach is this: sometimes people want their SQL to just be generic and work on any SQL database. But you're going to get a lot better performance in many cases if you take the time to learn the tool that you're working with. If you're working with Timescale, there are all of these functions that we have that really make that analysis easier, if you can write less generic SQL, something that works with the stuff that we're good at, because we give that to you. It means you're less portable, but honestly, who changes their database? Once you're in production, that's hard no matter what. So I think sometimes people have gotten obsessed with that in a way that's bad. And so one thing that I have as an agenda is: learn your tools, write stuff in the way that takes advantage of the things they do well, and you'll be much happier with the results.
But also, we're trying to figure out how we reach people better, how we can interact with them more, and how to have more of the types of documentation that reaches them where they need it. I mean, one of the things that we're finding, with something like this counter_agg, right: the two use cases that I mentioned, the electricity meter use case and the website observability type of use case, they have the same sort of problem, and we have some functions that are going to make that much easier to work with. But if I'm the electricity meter person, I don't use the same words to refer to the problem as the person who's on the observability, web side of things. Depending on who you are, you're going to talk about a counter that's measuring web traffic or measuring network traffic, these are counters, or something like that. In the electricity sector, it's meters, and the problem of resets is a little more rare, but it's really important. Some of the terms and the problems that they're trying to solve are just subtly different, and that makes it hard for people to reach it. So the strategy that we're trying out now, and we'll see if it works, is that we're going to write more, shorter content focused on specific sectors, just problem, solution; problem, solution; problem, solution. Give people more recipes, so that if I have this problem, I can find an answer to it. That's one of the things we're trying to figure out. And that almost becomes UI.
Yeah.
Right.
Like documentation and examples and articles,
that's your UI as much as anything else.
Yeah.
Yeah.
Right.
And that's,
it's a really,
it is a different thing.
It's hard to figure out how to do.
I'm not sure that we're doing it right.
We're trying.
If you have feedback, if you have thoughts,
please let us know.
So, yeah.
100%.
All right, so one last question from me,
and then I'll give the mic back to Eric.
So how important do you think the user or developer experience is for the success of a system like Timescale, for a database system?
I think it's pretty darn important, because I think that in many companies now, the developers and the users drive adoption, especially around databases; they drive adoption as much as anything, as much as anyone else. Performance matters a lot, but also, a lot of things are good enough for a lot of use cases. And so then what you're trying to figure out is, am I being productive?
Yeah.
Is this saving me time? Can I write less application code because I can do more in the database? That's really valuable. Sometimes that also massively improves performance, like if I don't have to send data over the wire. I mean, this is some of the stuff that we've built around largest-triangle-three-buckets. I don't know if you know this; it's a visualization algorithm that does downsampling of data, essentially, for graphing, to maintain the shape of a curve. You can do that on the database side. Then I can hand it off, and the client doesn't have to do as much. So there are things you can do to make it easier for something to work in the database that then really improve performance down the rest of the stack.
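For a sense of what that looks like in practice, here is the usage pattern for the toolkit's lttb function; the metrics table and point budget are made up, and the call shape can vary by toolkit version:

```sql
-- Downsample a week of raw points to ~500 points that preserve the
-- visual shape of the curve, entirely on the database side.
SELECT time, value
FROM unnest((
    SELECT lttb(ts, reading, 500)
    FROM metrics
    WHERE ts > now() - interval '7 days'
));
```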
And so I think figuring out how to do that is something that is a challenge for database
folks and something that I think if we can get right, can be a real competitive advantage.
It's like, how do I make this UI easy enough for people to use, but also powerful enough that people want to use it
for more things and move more calculations closer to the data? Because that's going to have
compounding effects in terms of performance and in terms of all the other stuff that they need to do
with the rest of their time. Makes total sense. Eric, the stage is yours again.
Time for one more question.
I love that Brooks isn't here and we can push it a little bit.
So great.
Okay.
You know, people who are technical and like solving problems like to push their tools to the limit, right? And you're exhibit A of that with Postgres, you know, sort of pre-Timescale, which I love.
How does someone know when they need to implement time series specific functionality?
Because there are certainly companies who can probably do some things at a smaller scale than you were facing at Moat, with Postgres and some of those approaches, or with a data warehouse, right? And that's fine. But like you said before, you can do a lot of things that eventually don't end up scaling, because you don't know that a time series solution exists. So how do you know when you need to actually adopt a time series database?
Yeah. I mean, I think there's a couple different reasons that I would think about
or things that I might see that would help me think that it's time.
Actually, let's talk about two different things here. So one is the UI.
Yep. Right. So if you're having trouble doing the thing that you want to do in bog-standard SQL, if you're finding that you're writing this thing and the SQL just seems excessively hard, there might be a reason to just adopt something like Timescale that makes that query easier for you to write. That might be it, right? Because we're going to make that query easier and more performant. Then, when you're seeing things like, I have materialized views that are taking a long time to update, because I need to cache some of this stuff because I want to serve my UI better.
And I'm doing aggregates in them.
That's a great reason.
And that's more somewhere between scale and user interface.
So we help tweak that interface in a way that's like, okay, that helps me solve this problem.
I need to not have my materialized view recalculate everything from the beginning of time.
Every time I refresh it, that seems silly, and it's taking a long time.
I'm like, that's another good reason.
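That case maps directly onto a continuous aggregate refresh policy; here is a sketch with the standard TimescaleDB call, reusing the hypothetical conditions_hourly view from earlier:

```sql
-- Refresh incrementally on a schedule instead of recalculating the
-- whole view from the beginning of time.
SELECT add_continuous_aggregate_policy('conditions_hourly',
    start_offset      => INTERVAL '3 days',    -- how far back to re-check
    end_offset        => INTERVAL '1 hour',    -- leave the hot edge to real-time
    schedule_interval => INTERVAL '30 minutes');
```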
And then if you're seeing high ingest rates where the ingest is slowing down, or if you're seeing really high disk usage. I mean, one of the things that ends up holding a lot of systems with time series back is just that disk usage. It's why compression is so important for us. We're seeing in some cases 10 or 20, sometimes even a bit more, times less space used. So if you're now fitting your data into 10% of the space, that has an impact on the rest of your system as well, in terms of where your IO bandwidth can go and what it can be used for. Usually database systems aren't CPU bound; they're usually IO bound. And if you're finding that your OLTP type workloads are struggling because of the IO boundedness of the stuff that you're writing for logs and, you know, all of the time series analytics data that you're writing there, it might be time to think about it. And it might be okay to just move the time series part over to start, and maybe connect back with an FDW to the OLTP data that you have in the other database. Maybe that's where you start, and maybe eventually you have them all side by side nicely there, or maybe you keep it that way, and that's fine too. So those are some of the things that I would start looking at. And when you start seeing weird locking issues. I mean, a lot of the stuff that we've
had to deal with is like how do we do this in as lock free away as we can how do we do this
operation so that we're not taking a random lock on this part but that is going to mean that my
insert volume has to drop because i have this lock that's sequenced into my scheduling in a way that
like oh this is weird what's going on yep so there's a way that like, oh, this is weird. What's going on?
Yep. So there's a lot of stuff around that we try to manage and help make easier.
I think also even just dropping data. It's really simple in some ways: we have these functions called drop_chunks, where chunks are the parts of a hypertable, the partitions of a hypertable, that's what we call them. And it's like, I just want to drop data that's older than two years. In Postgres, that's a delete, and then you need to vacuum, and you might not reclaim space, and it starts slowing down inserts, and there's all this stuff. So as you start getting into those types of problems, those are some of the things that you do want to start thinking a little bit ahead on. When you have those data retention type policies, that's another thing where I think it really helps to have something like a time series database to work on.
Postgres doesn't think about it in the same way, because that's not the operation that you would normally do in an OLTP-type context, but it is really important in a time series context. At some point, things need to fall off the back.
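In code, the retention story he describes is roughly this, assuming the same hypothetical readings hypertable; dropping a chunk is a fast metadata operation rather than a delete plus vacuum:

```sql
-- One-off: drop whole chunks older than two years.
SELECT drop_chunks('readings', INTERVAL '2 years');

-- Or install a background policy that does it on a schedule.
SELECT add_retention_policy('readings', INTERVAL '2 years');
```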
So for those types of things, it's just a more time-specific thing. If you have time data and it's getting bigger, and you're trying to do these more time-oriented operations, we've thought about those things in a way that Postgres, which is much more general purpose, just isn't going to do as much of. That's just how it works.
Yeah, I love it. We had a guest recently say, you know, when we think about sort of the advancement of technology, well, does that mean people are worse at managing databases? And it's like, well, no, it just means that they can focus on problems that are harder, as opposed to trying to solve difficult problems with an instrument that's not quite built for that specific purpose.
Yeah.
And more specialized, right?
You know, the goal is that you can solve the problems that are specific to your business and that you really need to solve, right? You know more about your customers than
we're ever going to. We want to give you the tools to help you solve those problems
in a simple way, so that you're focused on that, not on this weird, how do I drop these things and then have to vacuum and figure it out. Your customers don't care about that, but they see the results if you don't do it right.
Yep.
So yeah, I love it. Okay. So listeners who are running Postgres and want to dip their toe in the water here, where do they go? What do they do?
Yeah, timescale.com is a great place to start.
Even if you're not running Postgres. We actually see a fair number of people on MySQL and things like that, where they're like, oh, I'm adding partitions in MySQL. Yeah, totally. And partitions in MySQL get weird. Sometimes they're not managed transactionally, so bad things can happen. And migrating from MySQL to Postgres is a little hard, but not crazy, usually.
So definitely timescale.com.
We have a cloud offering
that you should definitely take a look at, especially if you're not into running Postgres yourself.
And we have a blog and docs, and I have a YouTube series that you can look at that gives you an introduction to some of this stuff. I think it's called Foundations of Postgres and Timescale; it's on the Timescale YouTube channel.
Yeah.
So those are some of the places that you can look to get started. But I'd say, you know, give it a try. We have a free trial period, as almost everyone seems to these days. Give it a shot and see if it helps you.
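For readers who want to try it, the first steps look roughly like this: a sketch assuming the timescaledb extension is installed, with a hypothetical table:

```sql
CREATE EXTENSION IF NOT EXISTS timescaledb;

-- A hypothetical metrics table...
CREATE TABLE readings (
  time      TIMESTAMPTZ NOT NULL,
  device_id TEXT,
  value     DOUBLE PRECISION
);

-- ...turned into a hypertable, automatically partitioned by time.
SELECT create_hypertable('readings', 'time');
```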
Love it.
David,
thank you so much for the time today.
I mean, so many learnings, and it was actually fun to just dig into some Postgres guts, too. So thank you so much, and we'd love to have you back on the show sometime in the future.
It was really fun. Good to meet you guys.
Kostas, we just chatted with David Cohn from TimescaleDB.
Fascinating technology.
I think one of the big takeaways that I had
was how extensible really good technology is.
You know, he talked about Postgres
sort of originating 30 years ago, right? You know, we sort of tend to talk about the last decade of change in data technology, but a lot of the things that make Timescale possible in terms of dealing with time series data were conceived of as functionality at a base level in Postgres 30 years ago.
years ago. Which is really cool. And I also think it just speaks to, A, not only David's ability in terms of how far he pushed it, which, you know, like, it's amazing how far he pushed Postgres.
Yeah.
You know, without additional technology on top of it.
But also, just to the wisdom of the people who architected the original system.
So that's what I'm going to take away.
I mean, it really is.
It is pretty incredible.
Yep. Yep. 100%.
I mean, there were so many lessons learned from this conversation with David. First of all, it was amazing to hear about all the different ways that you can work with something like Postgres, and the tools that it provides to extend and build on top of it, to the point where you pretty much end up building almost a completely different database system at the end. So that was super, super interesting, and it gave a lot of food for thought, and a few things about Postgres to go and check.
But what I also enjoyed, because it's one of these rare occasions where we have someone who is really deep into the database technology but at the same time into the product side of things, was the conversation around how important the developer experience is for database systems and their success, and how it becomes more and more important these days. He said something that was super interesting to hear: when performance is good enough, the question turns into how I can become more productive, and that's where the experience becomes super important. Yeah. So that's just, let's say, part of the conversation that we had, but I think it's a great indication of the whole conversation, and hopefully we will have him back soon.
I agree.
All right.
Well, thank you for listening.
Subscribe if you haven't, tell a friend, and we will catch you on the next one.
We hope you enjoyed this episode of the Data Stack Show.
Be sure to subscribe on your favorite podcast app to get notified about new episodes every week.
We'd also love your feedback.
You can email me, Eric Dodds, at eric@datastackshow.com.
That's E-R-I-C at datastackshow.com.
The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com. Thank you.