The Data Stack Show - 129: Databases, Data Warehouses, and Time Series Data with David Kohn of Timescale
Episode Date: March 8, 2023

Highlights from this week's conversation include:
- David's background and journey to Timescale (2:12)
- What are time series databases? (14:13)
- How Timescale would have impacted David's trajectory early in his career (17:51)
- Innovation in PostgreSQL (21:02)
- Why does Timescale build their time series databases differently? (27:08)
- The challenges of building a new database on top of old software (32:22)
- Writing outside of SQL and Timescale's secret sauce (37:47)
- The importance of the developer experience in Timescale (54:08)
- How does someone know when they need to implement time series functionality? (56:51)
- Final thoughts and takeaways (1:04:57)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Discussion (0)
The Data Stack Show is brought to you by Rudderstack.
As we said last week, March is data transformation month and there's a competition.
If you submit a transformation to our source library and win, you get a $1,000 cash prize.
You can catch all the details on Rudderstack's Twitter.
Good luck.
Welcome back to the Data Stack Show.
Today, we're going to talk to David Kohn from Timescale.
Kostas, I love that we have so many different types
of databases on the show.
We've talked about some time series flavored stuff, but I don't know if we've actually talked with a specific time series database technology yet on the show, which is pretty exciting. I think what's even more exciting is that Timescale is built on Postgres. And so we can kind of dig into some of that stuff there. But I guess I'm most excited about getting a 101 on time series data and time series databases. I think so much of that happens in sort of your vanilla, you know, relational warehouse, right? A lot of analysts are building time series type stuff, which works fine.
So why does this, you know, why do we have time series specific technology?
So that's what I'm going to ask about.
How about you?
Yeah.
Well, so many things to ask, actually.
First of all, I'd love to learn more about the choice of Postgres.
Postgres is one of these database systems that has been the foundation of a lot of innovation.
For example, Redshift was based on Postgres
and many other systems. So I'd love to hear what's so special about Postgres that makes it a platform
for innovation and database systems and why they chose to go after this direction instead of building a storage and query engine
from scratch like most of their competitors are doing.
So yeah, I think we're going to have a lot to talk about there.
And it's going to be super interesting.
All right, well, let's dig in.
David, welcome to the Data Stack Show.
So excited to talk about lots of data things, but specifically time series data things.
Cool. Glad to be here.
Okay, you did not start in computer science or sort of the world of data in terms of technology, as we talk about on the show,
you started in environmental engineering. So can you tell us how you stumbled into the world
of databases? I guess it wasn't a stumble. It was an intentional step, let's say.
Yeah. So I was working at a little battery manufacturing startup in New York City.
Actually, they were based out of an old brewery in Harlem.
And we were actually beginning to produce some of the batteries there and had a test facility there.
It was actually really nice.
I was living in New York at the time.
I could walk to work.
It was great.
No one in New York has that. But I sort of felt like we needed something that would work better to deal with the battery data that we'd been gathering: the test cell data, like what these things were made of, as well as the battery cycling data. So we had hundreds of batteries that were being cycled, tested in different ways, trying various types of tests to see how they would age, all of that sort of stuff. And I was like, okay, well, someone needs to work on this. I wasn't an electrochemist, or I wasn't an electrochemist PhD, I should say, which the other electrochemists who were working there were. I'd studied some in undergrad and knew enough to be dangerous, but I wasn't that good at it. But I did have a bit of a computer background from my environmental engineering days that they didn't have. So I was like, okay, well, maybe I can help build a database to help us track this better and get people the things that they need to evaluate these cells more quickly, so they're not spending all of their time figuring out where various Excel spreadsheets are and then trying to write analyses on them and not doing it great. So I started looking into databases and, you know, honestly, the first one that I started with was FileMaker Pro, you know, woohoo.
I was using MATLAB to connect to it because, hey, engineering, you know, the hard science
side of engineering doesn't necessarily know the good technology.
And we were connecting to an Oracle database that the battery tester ran in its backend. That was okay-ish. But, you know, there were problems accessing it. We didn't have full control. There were all sorts of things that we wanted to do, where it was like, well, we really need our own thing. And FileMaker pretty quickly showed itself to not be the right thing.
Can I ask a question about FileMaker, though? Because FileMaker is totally an origin story for many things. It's interesting how FileMaker is, in and of itself, sort of a first step, a catalyst almost. Is there anything you remember about... I know you hit the limitations, but is there anything you remember about FileMaker where you're like, this is pretty nice? Because for what it is, or for what it was, I guess is probably a better way to say it. It had some cool stuff.
Yeah, I mean, it certainly didn't scale all that well, but it does feel a little bit like the gateway drug to some extent, where it's like, well, I'm used to working with spreadsheets, but this can do a bit more, and it also holds your hand a bit more. And so I think that honestly I felt a little bit the same with MySQL. So the next step was MySQL for a little bit, and then I moved on to Postgres pretty quickly after that. MySQL has a little bit more of the hand holding, but Postgres is more powerful in my mind, and it has a lot more constraints and other sorts of stuff that I think are really good, but it's a little bit less easy to get started, et cetera, et cetera. And so I sort of made my way down that pathway over like six, eight, ten months, and then just spent a lot of time reading the Postgres documentation. And so I've sort of made a bit of a career of being able to read the Postgres documentation better than other people, which, you know, works, I guess.
Hey, look, that's great. It's a free education that drives the career. I love it.
Yeah. But I think the Postgres documentation, if you're willing to dive into it, it's pretty technical. You have to really get in there, but it's really good. It does describe a lot of stuff really well.
So I quickly got into using things in ways that they weren't really meant to be used. Maybe that's part of why I like engineering in general: that's the creative side of it, where you figure out how to turn this thing and shove it in there and it works. And so I always sort of found myself doing that. And I mean, even trying to deal with time series in Postgres at the beginning, it wasn't great at it. And so I started futzing around with arrays and all sorts of other stuff that Postgres has.
Oh, wow.
Yeah. And then I went to a Postgres conference, and that led me to my next job in databases, because I started talking to someone about TOAST and arrays and all sorts of other stuff and what I was thinking about doing there, and it was like, you should come work for us. And that's how I ended up at Moat, which is an ad analytics company.
They had switched, you know, their backend from MySQL to Postgres and had all sorts of other data warehouse-y type things that they were working with. But there were a few people there who were really into Postgres and thought it could do more. And the flexibility that Postgres allows in terms of data types, and some of the other interesting stuff that it could do, allowed us to build some really interesting stuff. So we ended up
sort of taking out Vertica and Redshift from our backend, at least our customer facing backends,
and replacing them with some like a couple different actually like Postgres backends
that were pretty highly tuned for our specific use case. Like we had essentially a data warehouse,
but for some reason our app always wanted all the columns
and we had like 120 columns.
And our app always wanted all of them.
So you throw that at a column store
and also you're like trying to serve a front end off of this,
which means like a column store is usually made
for like relatively large columnar scans
and relatively few clients
and now you're trying to run a front end off of it, where you have a fair number of clients and you're always getting all the columns, and the column store is just like, nope, sorry, that's not what I was built for.
Well, that's kind of why I said, when you said before that you started to go down this path of trying to do time series-esque stuff, it's like, okay, well, things are going to get pretty gnarly with vanilla Postgres pretty quickly. But it sounds like you actually did some cool stuff.
Yeah, so at Moat, basically, I helped build a data warehouse replacement that stored all of its columns in a composite type array and would give you all of the columns back, but it compressed them a bit and used a lot less IO in order to do it. And we ended up getting rid of Vertica and Redshift and just sticking with that. And another one that we built, which another person mostly worked on, was basically Athena light, where we actually had compressed SQLite files in S3. And then we would kick off all of these Lambda jobs to go and grab SQLite files and pull them up and run some queries on them. And I mean, that was where I started using Timescale.
So I started using Timescale at like 0.5.
Oh, wow.
As you know, mostly because it was doing partition management.
And I had already started doing partitioning and all sorts of stuff to handle that data
warehouse that I was talking about.
How did you find Timescale, by the way? Because, well, okay, so was partitioning sort of the entry point, where you were doing partitioning and you were trying to figure out a better way to do that?
Yeah, so I gave a talk actually at PGConf about the data warehouse that I'd built, and I met the Timescale folks there. But I was like, oh, I should start using this, because it's a pain in the ass to do all of the partition setup manually. And they were doing it more automatically. And there were a lot of reasons why that was really nice.
And there were a few like key things
that really made that really good for us.
So I started using it very early.
And, you know, it wasn't as full featured as it is now, and we weren't taking advantage of all of it, I'm sure. But it was really helpful for what we were trying to do there. So yeah, it was all sort of the partition management type stuff that we needed to do.
Yeah, totally. And with Moat Analytics, can you explain what it was? Was it a tool where people would run their own analytics? Was it for companies who needed to serve analytics to their customers? What type of analytics was it?
Yeah, so it's ad analytics. So yeah, advertising. Okay, got it. Yeah, advertising
analytics. And so there was a whole side that was doing JavaScript and pixels and all of that stuff.
And there was a whole real-time processing thing that we had.
And then what I was mostly working on...
So you're ingesting that into Postgres.
Yeah. So once we had, essentially, the daily aggregates from that real-time pipeline, we were ingesting that into Postgres for all of our customers across all of the segments and other sorts of... we had slicers and roll-ups. I don't remember exactly which one was which, but, you know, a few different ways that we could slice and dice data. And we had a whole front end based on that, and that front end was what we were powering.
We had a whole thing also that was quite fun
where we actually took a lot of the PHP
that was powering that front end.
And we took a lot of the calculations out of PHP
and moved them into these entirely ephemeral Postgres instances.
So we basically...
Because PHP would, you know, some of these calculations,
it was like, I want all the memory in the universe.
And then it would die because PHP.
And so we basically said, well, Postgres could do all these calculations really nicely. And so we just built this thing where, okay, we have these instances, and they're basically stateless.
They would actually reach out to the other Postgres backends
and other types of backends that we would use via foreign data wrappers.
And then they would basically just serve as the computation engine for a backend.
And all the PHP was then doing, basically, was writing SQL queries
that would run, it would create some, like, temporary tables.
It wasn't exactly temporary tables, but basically temporary tables, put the data in and then run some queries on that.
And then, you know, those would go away after some time.
Yeah, yeah.
And that worked really nicely and really helped us.
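For a picture of that pattern, here is a minimal sketch of the foreign-data-wrapper side, using postgres_fdw, the standard Postgres extension he is referring to; the server, table, and column names are hypothetical stand-ins, not Moat's actual schema:

```sql
-- A stateless "compute" instance reaching out to another Postgres
-- backend via a foreign data wrapper, then staging data in a temp table.
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER metrics_backend
  FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (host 'metrics.internal', dbname 'metrics');

CREATE USER MAPPING FOR CURRENT_USER
  SERVER metrics_backend
  OPTIONS (user 'app', password 'secret');

-- Pull the remote table definition into this ephemeral instance.
IMPORT FOREIGN SCHEMA public LIMIT TO (daily_aggregates)
  FROM SERVER metrics_backend INTO public;

-- Stage just the slice the calculation needs, then compute locally.
CREATE TEMP TABLE work AS
  SELECT * FROM daily_aggregates
  WHERE day >= now() - interval '30 days';

SELECT day, sum(impressions) AS impressions
FROM work GROUP BY day ORDER BY day;
```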
I mean, that's really kind of like headless flavored architecture, which is super interesting.
Yeah.
Very cool.
Wow, that is fascinating.
Let's zoom
out and just talk about
time series for a minute.
You talked about partitioning, right?
That's one
component of this, but
generally people know and love Postgres, right?
You obviously, you know, you've studied the documentation
and built a successful career off of that.
And Postgres is, it's awesome.
Can you just give us a quick 101 on time series databases?
And then, of course, we have to talk about why Postgres. Actually,
Kostas, you probably need to take the why time series on Postgres, which I can't wait to hear
about. But can you just give us the 101? So time series database, how is it different? Why is it
different? Why do you use it, et cetera? Yeah. So I ended up leaving Moat and joining
Timescale about five years ago, partially because I saw the challenges there, right?
So I think the challenges and also the opportunity, right?
So the big thing with time series for me is that you're usually doing something that
traditional databases weren't really built for.
So you're usually very write-heavy, and they're usually inserts. They're not updates. So a traditional transactional OLTP type workload is much more oriented towards updates than it is to lots of inserts. You're usually doing some sort of repeating value, or you're tracking states, like event data, something like that. So everything is associated with a timestamp. And for us, when we talk about this, it's not, you know, write-only or whatever, it's mostly. So every once in a while you want to go back and update something, but it's a more rare occurrence. And so those are two of the defining characteristics, where
you have this repeated measurement, something that you're measuring over time. The classic examples are of course, things like sensors, any sort of sensor where
you just ask, okay, what was the temperature at this time? What was the temperature at this time? What was the temperature?
Yeah, or the battery...
Exactly. The battery example, where we had temperature sensors too, but also voltage, current, power, whatever else we were measuring. But also the observability use case in computer science, and, I mean, some things that are more typical data warehouse-y type bits, but where you're really working on mostly more recent data, I would say; that would be more time series-ish. Though I think that maybe distinguishing between data warehouse and time series data is slightly harder in some ways. I think one of the things that...
That's a great statement.
Yeah. So I think that often time series and data warehouse type data have a lot in common,
right? There's a lot to do with more columnar oriented data. With a data warehouse, you're often not doing updates; you're instead tracking updates and then treating them as inserts in the data warehouse. So you have a point in time where a value changed, and that's treated as an event, essentially. And so it looks a little bit more like a time series. And I think that
those lines are blurry and it's something that we're trying to figure out where do we fall in
that? But I think in general, with time series you tend to be read heavy in relatively recent periods; a lot of the type of workload you're doing is where you have more clients, where it's like, I have dashboards for all of my clients to see what their sensors were doing.
They're mostly looking at the most recent data, but sometimes they want to go back and dive deep
into a certain period. Whereas more data warehouse workloads, I would say,
you usually have fewer clients. They want to scan most of the data to do a big machine learning job on basically everything.
And that's the sort of distinction that I would make.
But there is, I think, a little bit more overlap there, if that makes sense.
Yeah.
Yeah. So let's go back to the Moat example, right, where you adopted Timescale for, you know, partitioning purposes. But if you had to start over at Moat and rebuild everything, how would you build it today?
I think that some of the choices that we made were actually pretty good for what we were doing.
Because a lot of our queries just wanted everything, we actually made some good choices, and some of the things that we would do would actually be somewhat similar right now, assuming that constraint is still there. I mean, even with our compression in Timescale: we have implemented compression now on top of Postgres, and that could work nicely, except that it's still column oriented to some extent. And if you're really trying to get everything on every query, we still have some overhead to do the projection.
Sure.
And bring all the rows back.
So, you know, you'd have to think a little bit about that.
You could still use composite types in our compression.
There are various things that you could do. But we had a lot of things where we sort of self-implemented something like the continuous aggregates that we have now, so I would definitely use those. That'd be nice, to have different levels of roll-ups of this data over time. And that's one of the things that we do to help make some of those queries that might want to scan backwards in time, or do something that's a little more maybe data warehouse-y to some extent, where you're taking something and you want to, say, compare it to a baseline over a year. I want to know this temperature: how does it compare to the other temperatures over the course of a year?
You know, the battery voltage at this hour: how do I do that in a reasonably efficient manner without having to rescan all of the individual points? That's a big question that's really important to try to answer, and continuous aggregates and things like that really do help do that. And that was a big problem for us. That was something the data science team would do once a month or something, right? And then they'd publish the baselines, and it wasn't all that flexible. It was hard to do. There were things where it was just difficult. And so I think that there are things there, with some of the stuff that we have now, that we could make a lot easier.
With that said, I mean, there's also other stuff that I think would be really interesting for some
of the real-time stuff. I heard a little bit of the stuff you talked about with Materialize a little while ago.
Yeah.
And that sounded interesting to me for some of the stuff where we had a whole in-memory, crazy thing for doing the real-time stuff back there. And that seemed like it could be an interesting type of system.
Yeah, totally. I mean, they run on Postgres natively.
Okay, one more question
before I hand it off to Kostas.
This is, it's so interesting to me
to hear about what you were doing
with Postgres.
Looking back on that now, were you doing some things with Postgres that were maybe more rare? And did you find... it's a little bit of a leading question, because you probably were... did you find other people in the community who thought Postgres is actually a good way to do some of these things? I mean, of course, you were stretching it to some of its limits without something like Timescale, before you adopted that. But what was that like, to sort of be breaking new ground? I don't know the right term to use, but doing some things with Postgres that, you know, were not really conventional.
Yeah. So I think, I mean, one of
the things that we were doing was packing rows into arrays. And that's definitely a characteristic of time series data, I'd say. At some point, when you have billions of rows, you just have to, in order to make sense of them. There's just no choice. So one of the things:
we were using TOAST, which is my favorite acronym in Postgres, at least, and maybe in
this world, which is the oversized
attribute storage technique, which is basically what Postgres does when you can't fit a row onto
a page. It takes it and it chunks it up and it stores it in another table. It does that for you
automatically. You don't have to do anything. It just, it says, I can't fit this on a page.
I have to store it somewhere else. So we were using that essentially to our advantage, to say, okay, we're usually going to access these rows together. I don't want them randomly distributed. I don't want Postgres's row overhead and some of the indexing stuff. I actually want to be able to index to a group of rows. So we're going to store them in an array and let TOAST do its thing, and then take that out. And that was not at all what Postgres was built for; they were just trying to figure out, how do we deal with this problem where sometimes rows are too big for a page, right? That's not what they were trying to build, but it turns out it solved our problem really nicely. And some of the folks in the community were like, okay, cool, that's great.
Yeah, right. And it's actually some of the basis... I mean, we have more advanced techniques for how we compress the data and store it out, but it's still using the TOAST mechanism to store it out of line. So in Timescale we use a modified version of that, one that's more performant and written in C rather than pure PL/pgSQL, and all sorts of nice stuff. But in the end, that fundamental thing that Postgres built, which is, how do I store a larger block of data out of line from the main table, and this sort of insight of, okay, I want to take that and have it correspond to a set of something rather than to an individual row: that's still the fundamental basis for some of what we're doing there.
And it also means that the main table is very small, right? So if I store a little bit of metadata about what's in that row, then I can avoid going and looking in TOAST for that section. And I've basically implemented a skip list like what's common in a columnar database, but I've done it inside of Postgres.
Lickety split.
Yeah.
Right?
And so, you know, that's the sort of stuff
that we were doing a little bit of at Moat
and we've done a lot more of at Timescale.
Yeah.
And it's just taking what was built for something else and repurposing it a little bit and flipping it on its head.
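To picture the trick, here is a hypothetical sketch of the array-packing pattern; the type and table are invented for illustration, but the mechanics are the ones described above: a composite-type array that Postgres TOASTs out of line, plus small per-batch metadata in the main table for skipping:

```sql
-- One row holds a whole batch of readings; the large array is moved
-- out of line (and compressed) by TOAST automatically.
CREATE TYPE reading AS (ts timestamptz, temperature float8, voltage float8);

CREATE TABLE readings_packed (
    device_id int,
    -- small metadata in the main table lets queries skip batches
    -- without ever touching the TOASTed array
    min_ts    timestamptz,
    max_ts    timestamptz,
    readings  reading[]
);

-- Unpack on read:
SELECT device_id, (r).ts, (r).temperature
FROM readings_packed, unnest(readings) AS r
WHERE device_id = 42
  AND max_ts >= '2023-01-01' AND min_ts < '2023-01-02';
```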
And it's one of the things that I really do like about Postgres is that they have all these extension hooks and other sorts of things.
I mean, even just the way that it was originally built to accept different data types so you could define your own so that it was more extensible.
Like, that's been built into Postgres from its beginnings.
And so, yes, it's something new, but it's also part of what they imagined, just in a slightly different way. I think we're taking it a little further than maybe most people have.
But it is something that was imagined from the beginning of this project 30 years ago,
that they wanted it to be more extensible, wanted it to be something that was more powerful,
that you could build your own thing on top of it and customize it more. And it would be more
flexible to work with different kinds of data. And so that has been there. That has been a
through line of what Postgres has done for its entire history from the days back at Berkeley, I guess.
I love it. I love it. Spoken like someone who has been a diligent student of the Postgres docs.
No, no, super helpful. Okay, Kostas, I've been monopolizing here, so please,
please jump in. I could keep going, but you're probably going to take a couple of questions
off the tip of my tongue.
Yeah, yeah.
Thank you, Eric.
So, David, I was listening to what you were saying about, like,
building a time series database, right, on top of Postgres.
And what I find super interesting here is that it almost sounds a little bit contradictory in a way, because we make the claim that, okay, time series data is special enough to require its own, let's say, way of working with it. But at the same time, you decide to use a technology that initially was not built for that to do it, right? And I'd love to hear the why first, and then we can talk about the how. But why not go and do what, let's say, all of the time series databases out there have done, which is to create their own DBMS from scratch, rather than build on top of Postgres? Why did you choose that?
Yeah, so I think a few things, knowing a little bit about the history. So I joined Timescale when it was about 11 people, and it had been something else before that. And Timescale actually came out of one of our very early employees saying, maybe I should just build some of this on Postgres, because we tried some time series databases and they weren't really meeting our needs very well. And so that was the early history of Timescale: this isn't working. Why isn't it working? Well, I actually want some of the things that come with a standard database approach. Like, I want to be able to use SQL. It turns out it's pretty powerful. There's a lot of stuff it can do. And I kind of know how to use it.
I want to be able to do some of the things like a join to a reference table that has the information about the sensors that I'm using, right?
I want to just have that there.
So, some of what we wanted to address: if your OLTP workloads and all of the information about what the time series are live somewhere else, then it actually becomes really hard to use your time series database effectively, because you can't do something like a join. You can't ask the questions about all of this other
information that's out there very easily. You have to then essentially write your own application to do that. So those two things living side by side in some way or another was one of the big reasons to do that. There's also all sorts of stuff around atomicity and consistency. It turns out, when you write your own engine, you have to make some hard choices. And often in time series engines, it's like, well, you can't actually go back and rewrite history: we can't do out-of-order inserts. And in time series use cases, out-of-order inserts are rare, they're infrequent compared to the overall number of inserts usually, but they can still be really important.
Yeah. Right. So there were some trade-offs that we were making where we're like, well,
we get this for free, right? We get the drivers for free, which I think is, you know, can be good and bad to some extent. And I think that there are trade-offs in all of these approaches. And like,
there are times when Postgres has held us back to some extent,
right?
Like, it is a bear.
Yeah.
You can't work with a 30-year-old C code base with however many million lines of code in it without having to figure out all of the interactions and the weird shit that you have to deal with. But it also means that you get 30 years of them seeing all of these use cases in the wild. That weird shit, there's a reason for it; someone needed to do that. And so it gets you somewhere in terms of what the real world use case for a lot of this stuff is. And figuring out where you can deviate from that and where you shouldn't has, I think, been one of the big challenges for us.
But I think there's a lot of value there. It also means that you're less likely to lose data, which is important. Not always the most important thing in some time series use cases, where it's like, a duplicate is a duplicate. But you already have various knobs that you can twiddle in Postgres where you can say, well, I'm okay if I lose a little bit of data here. There are things that you could do. But I think having some of that consistency built in really helps you get started and build something out that is easier to work with. So I don't know if I actually answered the question there, if that was helpful, but yeah.
Oh, 100%.
I mean, I think the history of database systems is like a history of trade-offs in general.
I mean, if you take it from an abstract point of view, you can define a database system that is universal, right? At the end, it's all about storing and managing state, and allowing the user to perform some computation on top of that, right? Things start getting interesting when we need to start making trade-offs. And these trade-offs are what is causing all these flavors.
So definitely, I think if there's something that has been proven,
is that there's no one database system that can manage all workloads at all times.
But a follow-up question to that.
Okay, you answered the why,
but let's talk about the how, right?
How do you take something as complex as a database system that has been built for 30 years, more than that, and build on top of it a completely different database system? What I would like to hear, I think, is both what Timescale has done to make this happen, but also what Postgres offers that allows Timescale to do that, right?
So there are a few things that I think are key.
And Postgres has had an extension system for a long time.
And that extension system has sort of, I'd say, two different sorts of interfaces that
the extensions can use.
So there's one set of interfaces that is basically entirely in SQL. And it's basically like being able to create types, being able to create
functions, being able to create aggregates, being able to create operators. And actually,
the part of the product at timescale that I work on, the toolkit team is written in Rust,
and it mostly interfaces with Postgres at that level. So types,
functions, operators, all mostly in SQL. Very little work behind the scenes on the C code.
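As a toy illustration of that SQL-level surface, here is a custom aggregate defined entirely in SQL, with no C code; the aggregate itself is made up for the example, not one of Timescale's:

```sql
-- A "product" aggregate built from a plain SQL transition function.
CREATE FUNCTION product_step(state numeric, val numeric)
RETURNS numeric LANGUAGE sql IMMUTABLE AS
$$ SELECT state * val $$;

CREATE AGGREGATE product (numeric) (
    sfunc    = product_step,
    stype    = numeric,
    initcond = '1'
);

-- Usable anywhere a built-in aggregate is, e.g.:
-- SELECT product(growth_factor) FROM daily_growth;
```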
And then there's another set of hooks that Postgres provides that are essentially places in
the actual C code where it allows an extension to take over control and provides an API inside
the C layer, where you can sort of take control and do something different. And the main timescale
DB extension works more heavily on that side. And that can be hard. There were some clear things that we wanted to do to improve the partition management stuff, which Postgres has gotten much better at over time.
So when we started, it was on 9.6, and declarative partitioning wasn't even really a thing in Postgres.
So we were still working with the old inheritance-based partitioning system.
And over time, Postgres has gotten a lot better there. We've also gotten a lot better and some of the hooks are a little easier to work with. But then there's all this stuff around compression
and continuous aggregates and the other sorts of hooks that we need to look into to then take over
processing and certain types of just optimizations
that we do that are more specific to time series data, the way that we order the partitioning
and can do various types of constraint exclusion or partition exclusion inside the planner and
inside the optimizer. We're also working on some things around aggregation and making that faster at execution time. So there are various things that you can do and various ways that you can hook into Postgres. It actually offers a lot of these hooks, which I think are really helpful. And I think that we've also helped drive the community forward, right, in some of these senses. It's a great way for the Postgres community to have innovation happen, and to let them at least see that there's something going on here that they need to respond to. But, you know, they are often a little bit slower, just because they're a huge project that has 30 years of history and all that stuff. They release once a year with a new version, so some things, if you want to contribute back upstream, you might not see for two or three years, which is a long time. And so we now have a team that's contributing back upstream
more, which I think is great. It's really good for us to be giving back in that way. But you see that it just takes more time. And of course, there's a lot of benefit to that for us too, because there are a lot more people working on Postgres than are working on Timescale still, right? That's going to be the way it is for a while, probably, and so we get the benefit of all of that work because we're an extension. It takes some time also for us to make sure that we're compatible with the next version and all that stuff, right? So those are the trade-offs that we see. We very much wanted to be an extension, not a fork. There have been a lot of successful projects that have forked Postgres in the past, of course; I think Vertica is one of them, and Redshift's another one. But we didn't want to go down that route. We wanted to make sure that we stayed an extension, so that we could keep benefiting from the community as well as giving back. Because a lot of those forked projects started trying to unfork at some point. They're like, oh my god, all these things that we want have gotten into mainstream Postgres, and now we're screwed because we're a fork. So...
Makes sense. Okay, you mentioned there are two extension interfaces, in a way, and you're working on one of them, the one that has to do with extending SQL.
And you also mentioned that you write in Rust there. So can you get a little bit more into
the details of that? First of all, why do you need a language outside of SQL to do that?
What's the reason for that? And how does it work with something like Rust?
Because I would assume like, you know,
the system was built a lot earlier than Rust itself, right?
So how does this work?
Yeah.
So Postgres can work with lots of different languages. It can sort of hand control over to them. As long as you stay within some of its memory allocation type stuff, it works fine. And it even has some procedural languages that you can use within Postgres through the sort of SQL interface. Most of why we're writing in something like Rust is because we are writing new types of, say, aggregates or functions where we want a bit more control than SQL gives us.
I mean, SQL is Turing complete, so it's possible, but it's not necessarily efficient to write it that way. We want to have more control over what's going on in the innards. And so a lot of the stuff that we do on the toolkit team is writing things around data sketches, like HyperLogLog, and some of the frequency-type counting, like count-min sketch. We do things with counters and other sorts of things that just make analysis easier. So when we write a custom aggregate or a custom type, we're giving you something that you can then use inside of SQL. And it makes the interaction with SQL and with the rest of the database easier somehow, to solve a particular problem. That's really what we're trying to do there.
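As a concrete taste of those sketch aggregates, here is roughly how the HyperLogLog support is used from SQL; the function names follow the timescaledb_toolkit extension, though exact signatures vary by version, and the page_views table is hypothetical:

```sql
CREATE EXTENSION IF NOT EXISTS timescaledb_toolkit;

-- Approximate distinct visitors per day with a HyperLogLog sketch,
-- instead of an exact (and expensive) COUNT(DISTINCT ...).
SELECT time_bucket('1 day', ts) AS day,
       distinct_count(hyperloglog(32768, visitor_id)) AS approx_visitors
FROM page_views
GROUP BY day
ORDER BY day;
```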
So why did we choose Rust? Essentially because it has some nice qualities around memory safety and other bits that we thought would be good to take advantage of, as well as some nice libraries that we've used to solve some of these problems.
Yeah, 100%. So, okay, I think I get an idea of one part of what TimescaleDB is adding on top of Postgres.
What other secret sauce
is in there, right? What are
the constructs that Timescale
introduces that they are not native
to Postgres, right?
So one construct I talked about a bit is our concept of a hypertable, which technically comes from the way that we do some of the partition management. It's a way of treating a set of partitions more like a normal table. The partitions are created automatically as data comes in. And so that's one thing that we add on. And the name hypertable actually comes from the concept of a hypercube and multi-dimensional spaces, and how that hypercube sort of gets segmented into individual partitions. Then a hypertable has something like compression, which really reduces the size of the time series data set. It takes advantage of some of the qualities of time series data, reduces the size, and can actually increase performance on a number of types of queries, especially where you're IO bound, which is often the case. So you can compress a hypertable by adding a compression policy. You can also add things like continuous aggregates, which are essentially materialized views that have some sort of aggregation involved.
Like we said, an aggregate is something that we often do with time series data.
We want to have some summary statistics of some sort that we're adding on there that allow you to better understand what's going on. So continuous
aggregates are basically an incrementally updated materialized view that also keeps track of any inserts to the underlying hypertable. And it'll go back and do essentially eventually consistent updates to that, to make sure that it's going to give you the right state.
So for out-of-order inserts or updates, it'll deal with those, as well as doing real-time queries. Basically, the way that the continuous aggregate works is that some of it is materialized and some of it, the query happens in real time. So for the most recent data the queries may happen in real time, and those get put together in a nice way.
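Tying those three constructs together, here is a minimal sketch using the standard TimescaleDB calls; the conditions table is a made-up example:

```sql
CREATE TABLE conditions (
    ts        timestamptz NOT NULL,
    device_id int,
    temp      float8
);
-- Turn it into a hypertable: partitions ("chunks") are created
-- automatically as data arrives.
SELECT create_hypertable('conditions', 'ts');

-- Compression: older chunks get compressed, columnar-style storage.
ALTER TABLE conditions SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'device_id'
);
SELECT add_compression_policy('conditions', INTERVAL '7 days');

-- Continuous aggregate: an incrementally maintained materialized rollup.
CREATE MATERIALIZED VIEW conditions_hourly
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', ts) AS bucket,
       device_id,
       avg(temp) AS avg_temp
FROM conditions
GROUP BY bucket, device_id;
```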
And then something like the hyperfunctions that we work on
offer specific time series types
of functionality that then interact with these other things in nice ways. So for instance,
one function that we've been working on, which we released a while ago, is actually an aggregate called counter_agg, and it's about dealing with resetting counters. Now, where does that come up?
That might come up in an observability setting where you're tracking the number of visitors
to a website.
And sometimes that counter sort of rolls over.
It might come up in an electricity meter setting where your utility, you have all these
electricity meters.
Some of them might reset every once in a while.
But you're also trying to figure out how you calculate the actual energy used in a
period when what you're sending is the total amount of energy used over all time, essentially.
And that's not necessarily the easiest thing to deal with in SQL.
So that's the type of stuff that we write a function in Rust with the right types of stuff
and interactions with the Postgres aggregate system, which has its own whole set of things that it does, right? So around parallel workers and around all sorts of other
fun things that Postgres has, it has a whole aggregate framework where it allows you to create
your own aggregates and deal with parallelism and all sorts of other fun stuff. But so we make that,
we make something that's easier to deal with, that gives you a way to interact with it
in a nice way and define that query in a way that is more understandable,
more repeatable and solves for the edge cases.
That's like, okay, I have some resets of the energy meter.
I don't want to have negative 10 million kilowatt hours used when my meter rolls over for some
reason. That's a bad thing. My billing
system is going to be completely broken. We also, you know, we interpolate to the edges of buckets
so that you can say during this 15 minute period, even though I'm measuring every two minutes
and that every two minutes isn't exact, right? I want to interpolate between the points that I
have, so that I get the usage that I can really attribute to that 15-minute period.
So that all happens inside these aggregates in a way that then means you don't have to do it with like six different window functions and four sub queries.
And, you know, you can at least hopefully make, you know, keep it down to two.
And then we interact with things like gap filling.
So what do I do when I'm missing buckets?
I want data every 15 minutes, but sometimes the network was down and like,
what happens there? How do I fill in for that period that I missed?
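To make that concrete, here is a hedged sketch of both pieces; counter_agg and its delta accessor come from the timescaledb_toolkit extension and time_bucket_gapfill from TimescaleDB itself, though exact signatures vary by version, and the tables are hypothetical:

```sql
-- Energy used per 15 minutes from an ever-increasing (but occasionally
-- resetting) meter counter: counter_agg handles the resets.
SELECT time_bucket('15 minutes', ts) AS bucket,
       delta(counter_agg(ts, energy_total)) AS energy_used
FROM meter_readings
GROUP BY bucket;

-- Filling in buckets a network outage left empty, by interpolation
-- (gapfill requires an explicit time range in the WHERE clause):
SELECT time_bucket_gapfill('15 minutes', ts) AS bucket,
       interpolate(avg(temp)) AS temp
FROM conditions
WHERE ts > now() - interval '1 day' AND ts < now()
GROUP BY bucket;
```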
So those are the sorts of problems that we help people who are working with time series figure out how to solve, working within that SQL structure, and even extending a little bit what the syntax can do. I mean, there's only so much in terms of extending the syntax that we can do, because we can't modify the parser; that is a limitation of the Postgres extension system. It's also like, at what point are you no longer SQL?
But like there is a lot of stuff
that we can do even just with types
and other sorts of stuff.
So we're even talking about, how do I give people access to, for instance, a data frame like syntax that I could write in SQL, and have the database figure out what to do? It would actually still be SQL that I've written, but it's now going to be implemented almost as a data frame, so that people who are more familiar with that have a way to get started, or a way to do some of these calculations that otherwise would be really hard. So that's another thing we're considering and thinking about how we can do; we're not quite there yet. But yeah.
Yeah.
That's a great point. I hear you talking, and I keep thinking about, let's say, the interface with the user and the experience the user has, right? And you said it many times; again, that's something I find super interesting. SQL is something everybody knows, or can start working with, but then you hit some limitations on how you express things, and things start getting ugly, and suddenly you need to add stuff there. So I'd like to hear from you, because you're one of the... there aren't that many people out there, to be honest, who are working on trying to build experiences around a data system and a database. And I'd love to hear from you how you do that. I think there's a lot of information and knowledge out there about user experience, like how to design for an optimal experience on a web application. But here we are talking about data systems. And actually, it's not even the system, it's the data, right? It's the interface to the data. The user at the end doesn't really care what the engine is, right? It cares about the data. So how do you think about that, and how do you design and build experiences in a system like TimescaleDB?
Yeah.
The answer is
it's tough.
You know, that's a "to be determined" or something.
But yeah, I mean, so it is a difficult thing, I think.
You have to figure out how you make something simple enough that someone can use it, while also powerful enough to really make it easier to work with.
And thinking about how you conceive of that distinction
and what you're going to leave to the user,
what you're going to have the database do,
what you actually want to do somewhere maybe outside of the database,
like in a driver or something like that,
that's something that we're beginning to think more about: which parts of the system should be optimized for ease of use, versus which parts of the system should be optimized for flexibility and sort of power and expressiveness, so that you can then build something on top of it that's going to be more performant or easier to use or whatever else, right? So I think that there's a balance that we're still figuring out. One thing we do is release things experimentally before we release them fully, so that we can get some feedback from someone. Often they're coming from some sort of user request, like, I have this problem, how do I solve it? Well, here's this horrible SQL query, and it's not very performant, and whatever else. Okay, how could we make this better? And we give it to them; we're like, okay, here's an experimental version, try it out. And we try to get some feedback and see, okay, were those the right parameters? And some of the things that are harder to change, we try to iterate through one or two releases and then actually release properly. So that's one thing that we've tried to do, and it has helped, I think. We've definitely gotten feedback from early users and then gotten more users as we release it more fully. But people who have those problems, who have written that SQL query, they're often willing to try something else, because they've seen that it's not great.
Right. And then I think it really does help to have a pretty deep understanding of what's going on in the database backend, so that you know a bit more where the leverage points are that you can hit that really are going to impact performance.
And this is one of the things that I think is still a struggle for us.
I think, you know, something we're trying to figure out if it's the right approach is this: sometimes people want their SQL to just be generic and work on any SQL database. But you're going to get a lot better performance in many cases if you take the time to learn the tool that you're working with. If you're working with Timescale, there are all of these functions that we have that really make that analysis easier, if you can write less generic SQL, something that works with the stuff that we're good at, because we give that to you. It means you're less portable, but honestly, who changes their database? Once you're in production, that's hard no matter what. So I think sometimes people have gotten obsessed with that in a way that's bad. And so one thing that I have as an agenda is: learn your tools, write stuff in the way that takes advantage of the things they do well, and you'll be much happier with the results.
But also, we're trying to figure out how we reach people better, how we can interact with them more, and how to have more of the types of documentation that reaches them where they need it. I mean, one of the things that we're finding, with something like this counter_agg, right: the two use cases that I mentioned, the electricity meter use case and the website observability type of use case, they have the same sort of problem, and we have some functions that are going to make that much easier to work with. But if I'm the electricity meter person, I don't use the same words to refer to the problem as the person who's on the observability, web side of things. Depending on who you are, you're going to talk about a counter that's measuring web traffic or measuring network traffic, these are counters, or something like that. In the electricity sector, it's meters, and the problem of resets is a little more rare, but it's really important. Some of the terms and the problems that they're trying to solve are just subtly different, and that makes it hard for people to reach it. So the strategy that we're trying out now, and we'll see if it works, is that we're going to write more, shorter content focused on specific sectors, just problem, solution; problem, solution; problem, solution. Give people more recipes, so that if I have this problem, I can find an answer to it. That's one of the things we're trying to figure out. And that almost becomes UI.
Yeah.
Right.
Like documentation and examples and articles,
that's your UI as much as anything else.
Yeah.
Yeah.
Right.
And that's,
it's a really,
it is a different thing.
It's hard to figure out how to do.
I'm not sure that we're doing it right.
We're trying.
If you have feedback, if you have thoughts,
please let us know.
So, yeah.
100%.
All right, so one last question from me,
and then I'll give the mic back to Eric.
So how important do you think the user or developer experience is for the success of a system like Timescale, for a database system?
I think it's pretty darn important, because I think that in many companies now, the developers and the users drive adoption, especially around databases; they drive adoption as much as anything, as much as anyone else. Performance matters a lot, but also, a lot of things are good enough for a lot of use cases. And so then what you're trying to figure out is, am I being productive?
Yeah.
Is this saving me time? Can I write less application code because I can do more in the database? That's really valuable. Sometimes that also massively improves performance, like if I don't have to send data over the wire. I mean, this is some of the stuff that we've built around largest-triangle-three-buckets. I don't know if you know this; it's a visualization algorithm that does downsampling of data, essentially, for graphing, to maintain the shape of a curve. You can do that on the database side. Then I can hand it off, and the client doesn't have to do as much. So there are things you can do to make it easier for something to work in the database that then really improve performance down the rest of the stack.
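For a sense of what that looks like in practice, here is the usage pattern for the toolkit's lttb function; the metrics table and point budget are made up, and the call shape can vary by toolkit version:

```sql
-- Downsample a week of raw points to ~500 points that preserve the
-- visual shape of the curve, entirely on the database side.
SELECT time, value
FROM unnest((
    SELECT lttb(ts, reading, 500)
    FROM metrics
    WHERE ts > now() - interval '7 days'
));
```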
And so I think figuring out how to do that is something that is a challenge for database
folks and something that I think if we can get right, can be a real competitive advantage.
It's like, how do I make this UI easy enough for people to use, but also powerful enough that people want to use it
for more things and move more calculations closer to the data? Because that's going to have
compounding effects in terms of performance and in terms of all the other stuff that they need to do
with the rest of their time. Makes total sense. Eric, the stage is yours again.
Time for one more question.
I love that Brooks isn't here and we can push it a little bit.
So great.
Okay.
You know, people who are technical and like solving problems like to push their tools to the limit, right? And you're exhibit A of that with Postgres, you know, sort of pre-Timescale, which I love.
How does someone know when they need to implement time series specific functionality?
Because there are certainly companies who can probably do some things at a smaller scale than you were facing at Moat, with Postgres and some of those approaches, or with a data warehouse, right? And that's fine. But like you said before, you can do a lot of things that eventually don't end up scaling, because you don't know that a time series solution exists. So how do you know when you need to actually adopt a time series database?
Yeah. I mean, I think there's a couple different reasons that I would think about
or things that I might see that would help me think that it's time.
Actually, let's talk about two different things here. So one is the UI.
Yep. Right. So if you're having trouble doing the thing that you want to do in bog-standard SQL, if you're finding that you're writing this thing and the SQL just seems excessively hard, there might be a reason to just adopt something like Timescale that makes that query easier for you to write. That might be it, right? Because we're going to make that query easier and more performant. Then, when you're seeing things like, I have materialized views that are taking a long time to update, because I need to cache some of this stuff because I want to serve my UI better.
And I'm doing aggregates in them.
That's a great reason.
And that's more somewhere between scale and user interface.
So we help tweak that interface in a way that's like, okay, that helps me solve this problem.
I need to not have my materialized view recalculate everything from the beginning of time.
Every time I refresh it, that seems silly, and it's taking a long time.
I'm like, that's another good reason.
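That case maps directly onto a continuous aggregate refresh policy; here is a sketch with the standard TimescaleDB call, reusing the hypothetical conditions_hourly view from earlier:

```sql
-- Refresh incrementally on a schedule instead of recalculating the
-- whole view from the beginning of time.
SELECT add_continuous_aggregate_policy('conditions_hourly',
    start_offset      => INTERVAL '3 days',    -- how far back to re-check
    end_offset        => INTERVAL '1 hour',    -- leave the hot edge to real-time
    schedule_interval => INTERVAL '30 minutes');
```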
And then if you're seeing high ingest rates where the ingest is slowing down, or if you're seeing really high disk usage. I mean, one of the things that ends up holding a lot of systems with time series back is just that disk usage. It's why compression is so important for us. We're seeing in some cases 10 or 20, sometimes even a bit more, times less space used. So if you're now fitting your data into 10% of the space, that has an impact on the rest of your system as well, in terms of where your IO bandwidth can go and what it can be used for. Usually database systems aren't CPU bound; they're usually IO bound. And if you're finding that your OLTP type workloads are struggling because of the IO boundedness of the stuff that you're writing for logs and, you know, all of the time series analytics data that you're writing there, it might be time to think about it. And it might be okay to just move the time series part over to start, and maybe connect back with an FDW to the OLTP data that you have in the other database. Maybe that's where you start, and maybe eventually you have them all side by side nicely there, or maybe you keep it that way, and that's fine too. So those are some of the things that I would start looking at. And when you start seeing weird locking issues. I mean, a lot of the stuff that we've
had to deal with is like how do we do this in as lock free away as we can how do we do this
operation so that we're not taking a random lock on this part but that is going to mean that my
insert volume has to drop because i have this lock that's sequenced into my scheduling in a way that
like oh this is weird what's going on yep so there's a way that like, oh, this is weird. What's going on?
Yep. So there's a lot of stuff around that we try to manage and help make easier.
I think also even just dropping data. It's really simple in some ways: we have these functions called drop_chunks, where chunks are the parts of a hypertable, the partitions of a hypertable, that's what we call them. And it's like, I just want to drop data that's older than two years. In Postgres, that's a delete, and then you need to vacuum, and you might not reclaim space, and it starts slowing down inserts, and there's all this stuff. So as you start getting into those types of problems, those are some of the things that you do want to start thinking a little bit ahead on. When you have those data retention type policies, that's another thing where I think it really helps to have something like a time series database to work on.
Postgres doesn't think about it in the same way, because that's not the operation that you would normally do in an OLTP-type context, but it is really important in a time series context. At some point, things need to fall off the back.
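In code, the retention story he describes is roughly this, assuming the same hypothetical readings hypertable; dropping a chunk is a fast metadata operation rather than a delete plus vacuum:

```sql
-- One-off: drop whole chunks older than two years.
SELECT drop_chunks('readings', INTERVAL '2 years');

-- Or install a background policy that does it on a schedule.
SELECT add_retention_policy('readings', INTERVAL '2 years');
```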
So for those types of things, it's just a more time-specific thing. If you have time data and it's getting bigger, and you're trying to do these more time-oriented operations, we've thought about those things in a way that Postgres, which is much more general purpose, just isn't going to do as much of. That's just how it works.
Yeah, I love it. We had a guest recently say, you know, when we think about sort of the advancement of technology, well, does that mean people are worse at managing databases? And it's like, well, no, it just means that they can focus on problems that are harder, as opposed to trying to solve difficult problems with an instrument that's not quite built for that specific purpose.
Yeah.
And more specialized, right?
You know, the goal is that you can solve the problems that are specific to your business and that you really need to solve, right? You know more about your customers than
we're ever going to. We want to give you the tools to help you solve those problems
in a simple way, so that you're focused on that, not on this weird, how do I drop these things and then have to vacuum and figure it out. Your customers don't care about that, but they see the results if you don't do it right.
Yep.
So yeah, I love it. Okay. So listeners who are running Postgres and want to dip their toe in the water here, where do they go? What do they do?
Yeah, timescale.com is a great place to start.
Even if you're not running Postgres. We actually see a fair number of people on MySQL and things like that, where they're like, oh, I'm adding partitions in MySQL. Yeah, totally. And partitions in MySQL get weird. Sometimes they're not managed transactionally, so bad things can happen. And migrating from MySQL to Postgres is a little hard, but not crazy, usually.
So definitely timescale.com.
We have a cloud offering
that you should definitely take a look at, especially if you're not into running Postgres yourself.
And we have a blog and docs, and I have a YouTube series that you can look at that gives you an introduction to some of this stuff. I think it's called Foundations of Postgres and Timescale; it's on the Timescale YouTube channel.
Yeah.
So those are some of the places that you can look to get started. But I'd say, you know, give it a try. We have a free trial period, as almost everyone seems to these days. Give it a shot and see if it helps you.
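For readers who want to try it, the first steps look roughly like this: a sketch assuming the timescaledb extension is installed, with a hypothetical table:

```sql
CREATE EXTENSION IF NOT EXISTS timescaledb;

-- A hypothetical metrics table...
CREATE TABLE readings (
  time      TIMESTAMPTZ NOT NULL,
  device_id TEXT,
  value     DOUBLE PRECISION
);

-- ...turned into a hypertable, automatically partitioned by time.
SELECT create_hypertable('readings', 'time');
```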
Love it.
David,
thank you so much for the time today.
I mean, so many learnings, and it was actually fun to just dig into some Postgres guts, too. So thank you so much, and we'd love to have you back on the show sometime in the future.
It was really fun. Good to meet you guys.
Kostas, we just chatted with David Cohn from TimescaleDB.
Fascinating technology.
I think one of the big takeaways that I had
was how extensible really good technology is.
You know, he talked about Postgres
sort of originating 30 years ago, right? You know, we sort of tend to talk about the last decade of change in data technology, but a lot of the things that make Timescale possible in terms of dealing with time series data were conceived of as functionality at a base level in Postgres 30 years ago.
years ago. Which is really cool. And I also think it just speaks to, A, not only David's ability in terms of how far he pushed it, which, you know, like, it's amazing how far he pushed Postgres.
Yeah.
You know, without additional technology on top of it.
But also, just to the wisdom of the people who architected the original system.
So that's what I'm going to take away.
I mean, it really is.
It is pretty incredible.
Yep. Yep. 100%.
I mean, there were so many lessons learned from this conversation with David. First of all, it was amazing to hear about all the different ways that you can work with something like Postgres, and the tools that it provides to extend and build on top of it, to the point where you pretty much end up building almost a completely different database system at the end. So that was super, super interesting, and it gave a lot of food for thought, and a few things about Postgres to go and check.
But what I also enjoyed, because it's one of these rare occasions where we have someone who is really deep into the database technology but at the same time into the product side of things, was the conversation around how important the developer experience is for database systems and their success, and how it becomes more and more important these days. He said something that was super interesting to hear: when performance is good enough, the question turns into how I can become more productive, and that's where the experience becomes super important. Yeah. So that's just, let's say, part of the conversation that we had, but I think it's a great indication of the whole conversation, and hopefully we will have him back soon.
I agree.
All right.
Well, thank you for listening.
Subscribe if you haven't, tell a friend, and we will catch you on the next one.
We hope you enjoyed this episode of the Data Stack Show.
Be sure to subscribe on your favorite podcast app to get notified about new episodes every week.
We'd also love your feedback.
You can email me, Eric Dodds, at eric@datastackshow.com.
That's E-R-I-C at datastackshow.com.
The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com. Thank you.