Postgres FM - pg_duckdb
Episode Date: January 3, 2025

Michael and Nikolay are joined by Joe Sciarrino and Jelte Fennema-Nio to discuss pg_duckdb — what it is, how it started, what early users are using it for, and what they're working on next.

Here are some links to things they mentioned:
Joe Sciarrino https://postgres.fm/people/joe-sciarrino
Jelte Fennema-Nio https://postgres.fm/people/jelte-fennema-nio
pg_duckdb https://github.com/duckdb/pg_duckdb
Hydra https://www.hydra.so
MotherDuck https://motherduck.com
The problems and benefits of an elephant with a beak (lightning talk by Jelte) https://www.youtube.com/watch?v=ogvbKE4fw9A&list=PLF36ND7b_WU4QL6bA28NrzBOevqUYiPYq&t=1073s
pg_duckdb announcement post (by Jordan and Brett from MotherDuck) https://motherduck.com/blog/pg_duckdb-postgresql-extension-for-duckdb-motherduck
pg_duckdb 0.2 release https://github.com/duckdb/pg_duckdb/releases/tag/v0.2.0

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai

With special thanks to:
Jessie Draws for the elephant artwork
Transcript
Hello and welcome to PostgresFM, a weekly show about all things PostgreSQL.
I am Michael, founder of pgMustard, and I'm joined as usual by Nikolay, founder of Postgres.ai.
Hey, Nikolay.
Hi Michael. How are you?
No, not falling for it.
Today we are delighted to be joined by two guests.
Firstly, Joe from Hydra. Welcome Joe.
Thank you.
And also Jelte from MotherDuck. Welcome, Jelte.
Hi. Hi all.
Thanks so much both for joining us.
We are going to be discussing a new-ish project called PGDuckDB,
which very interestingly is a cross-organisation collaboration.
So, Jelte, I saw the recording of your excellent lightning talk
introducing pg_duckdb at PGConf EU.
Nice job.
Perhaps you could start us off
with how you would describe pg_duckdb
to a Postgres-savvy audience
who may or may not know what DuckDB is.
All right, so it's an extension for Postgres,
and it embeds DuckDB.
And DuckDB is an in-process analytics database,
like a SQL database.
So it's a bit like SQLite, but more for analytics,
like big queries with lots of aggregates
and lots of joins,
and less like SQLite or Postgres in general
for smaller data,
like getting a hundred rows or something like that
with index scans.
Indexes are not something that's commonly used
in the DuckDB land.
It's like, oh, we just scan the whole table
and we compute stuff very fast on it.
Yeah, and that is then embedded into Postgres
so that you can use that for some of your
heavier queries that don't rely on indexes.
Nice, and I'm sure we're going to come back
to indexes in a bit once we dive into the details.
But Joe, I think the origin story of this project is super interesting.
I'd be fascinated.
I was reading the MotherDuck blog post and it mentioned that Hydra kicked off the effort.
So I'd love for you to share a little bit more about that if you can.
Sure, sure.
So our team had been working on a crazy weekend project that we initially were calling pg_quack.
It was called pg_quack because, well, ducks go quack, but also because we didn't necessarily expect this to work.
We kind of were writing it off as a quacky project.
So it could do a little of the kinds of analytics queries and aggregates that Jelte mentioned.
And we had this aha moment that we could actually take pg_quack pretty seriously and do it for real in collaboration with folks that know a whole lot about DuckDB.
You know, we're Postgres guys by background mostly. So not to get
too far ahead, but we decided to build pg_duckdb for real as a production-grade extension together
with the folks over at DuckDB Labs, the creators of DuckDB, which is Hannes and Mark, as well as the folks at
MotherDuck, like Jelte. So we've been working together with them on it,
oh, for at least the last six, seven months now.
Yeah, super cool. And I saw a few other organizations mentioned as well.
How do they all fit in? Yeah, so this is an open source, it's an MIT licensed extension.
So for folks that want to participate or can see pg_duckdb being really helpful in their use cases,
they can either tell us directly what they'd like to add or open a PR or open issues. I mean,
certainly there are a good number of those as the project is currently in beta. So be gentle with it, but also tell us the things that could be improved on.
One of the things that was surprising to me was the immediate reaction that we saw to
how DuckDB could be useful to accelerating analytics queries in Postgres. And there are
a lot of organizations that went, oh yes, okay, we want to support this, we want to get involved. So that's probably some of the names you were seeing.
Yeah, I think I saw, was it Neon and Microsoft, maybe some others as well?
Yeah, there's a whole range of folks that are interested in, you know, moving this forward,
and they're providing either direct or indirect support for the project.
Yes. Jelte, from your side, what are you... actually, I'd be interested in both of your thoughts on this, but what are you seeing early users most excited about? Like, what features are
they most keen to see added or improved?
I think one of the big things is that you can sort of read Parquet files or
CSV files from blob storage, like from S3 or from Azure or from all the big clouds,
basically, and Cloudflare as well. And I think that's a big feature that people are very
interested in. And I mean, there's so many open issues of like people wanting a specific thing
that it's hard to sort of pick one that's like, oh, that's the thing people want. But I mean,
one that has come up a few times is support for spatial stuff,
like PostGIS things. And that works slightly differently in DuckDB. So we haven't started on
that. We're first sort of building up the base a bit more,
but that's definitely one of the features that comes up.
People wanting to do spatial analytics
and to store spatial data, basically.
Yeah.
But that's not supported right now.
Yeah.
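As a rough sketch of the blob-storage reading Jelte just described, a pg_duckdb query might look something like the following. The bucket path, file name, and column list are invented for illustration, and the exact read_parquet signature has changed between releases (the column-definition list is discussed later in the episode), so check the pg_duckdb README for the syntax of your version:

-- Hypothetical example: aggregate over a Parquet file sitting in S3.
SELECT country, count(*) AS orders, sum(amount) AS revenue
FROM read_parquet('s3://my-bucket/events/2024.parquet')
     AS (country text, amount numeric)
GROUP BY country;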
So one of the biggest features is that data can be...
I mean, it's not a feature.
It's like a requirement here. Data is on object storage like S3, right, in Parquet format. I remember some attempts to have access to Parquet files on
object storage as an extension before pg_duckdb, right? And it could also be done in an FDW way,
a foreign data wrapper way. I think it was just some SQL interface over those files, and that's it.
But DuckDB brings something else.
I've never tried it myself,
but I've heard many opinions about very good performance, right?
Yeah, yeah.
So that's one of the big differences
between pg_duckdb and some of the other
foreign data wrapper-based
ones is that we sort of flip it around a bit. Instead of saying, oh, this table is stored
somewhere else, we'll read that using DuckDB or something else,
we're saying, okay, we push the whole query into DuckDB, and then let DuckDB read the Postgres data.
So Postgres sort of becomes a foreign data wrapper
kind of inside DuckDB.
Does that make sense?
It's more like a query stealing approach.
It's what Mario from Hydra,
one of the developers there, called it.
Instead of letting the Postgres execution engine
do a lot of things, it was like, no, we don't want the Postgres execution engine
to do a lot of things,
because the DuckDB execution engine
is much better for this analytic stuff,
like the sorting and the grouping
and the joining and stuff like that.
We want to do as much as possible inside DuckDB.
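For queries that touch only regular Postgres tables, pg_duckdb has a switch to hand the whole query to DuckDB in the way Jelte describes. The setting name below is written from memory of the pg_duckdb docs and may differ in your version, and the table and query are hypothetical, so treat this as a sketch:

-- Assumption: duckdb.force_execution routes queries over plain Postgres
-- tables through the DuckDB execution engine instead of the Postgres executor.
SET duckdb.force_execution = true;
SELECT customer_id, sum(total)   -- hypothetical analytics query
FROM orders                      -- an ordinary Postgres heap table
GROUP BY customer_id;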
Yeah, that's interesting.
Yeah, and also, you know,
kind of back to Michael's question a little bit too. A really common pattern for a lot of application developers is to have Postgres, but also to have, you know, 2.6 terabytes of event data
that they really shouldn't necessarily be storing in Postgres,
but that they could offload into Parquet format in S3.
But that data still needs to be accessible by Postgres,
because one of the important things is that pg_duckdb can perform the join
between a heap table in Postgres
and the Parquet files in object storage, so that there's no siloing or
need for ETL processes between Postgres, external analytics databases, and S3.
So, you know, how do you smooth over the differences between columnar format and
row-based format? I think pg_duckdb goes a long way to do that, because ultimately, when you're
developing applications, thinking about table format is pretty low level. So if you can avoid
that as much as possible, that's probably just easier.
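The join Joe mentions, between a live Postgres heap table and archived Parquet files in object storage, could look roughly like this. The table names, bucket path, and column list are all hypothetical, and the column-definition list reflects the read_parquet syntax of the releases available at the time of recording:

-- Hypothetical join: recent orders stay in a Postgres heap table,
-- archived order events live as Parquet files in S3.
SELECT o.customer_id, count(e.event_id) AS archived_events
FROM orders o
JOIN read_parquet('s3://my-bucket/archive/order_events.parquet')
     AS (event_id bigint, customer_id bigint) e
  ON e.customer_id = o.customer_id
GROUP BY o.customer_id;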
Yeah, but here I'm confused a little bit,
because this thought is,
let's have tiering of storage.
And for example, I remember several cases,
big companies where this was needed
and the decision was made
for scaling purposes of the main clusters,
we need to offload archive data somewhere else, maybe to a different cluster
or somewhere else. Maybe sometimes we just delete it. It could be, for example, e-commerce and
old orders, order history exceeding, for example, a couple of years. We don't need that.
But offloading this data to S3, this sounds great. And I remember there's also the pg_tier extension
from Tembo. Again, I haven't tried it yet;
it's on my to-do list. But offloading that, I'm thinking, if I also get column storage for this kind of data, is it right? Because I still need only one row in these cases, right?
In some cases, I do need aggregates. I don't know, time series I would love to offload to S3
in column store format, but
not in these cases when I just need one row.
Yeah.
I mean, I think it's different needs for different users.
If you are a, you know, application developer just trying to build something quickly, then,
you know, if we can tell you, hey, dump a lot of your event data into Parquet format
in S3 or archive it out of
Postgres on some increment every seven days, every 30 days or something, then that's easy enough.
And then maybe the thinking about columnar versus row should subside. But yeah, I mean,
data tiering and data lakes, you know, they're really not that different.
It's just a function of what the user interacts with.
Right, but my concern is if I need the row store, like I need to find all columns for one row to present it.
For example, again, orders, right?
I need one order.
If we pull it from the archive, I expect not-so-good performance.
But if it's column store, I'd be concerned about the performance of collecting all the columns versus row storage, right? Let me formulate the question
differently. Let's talk about the use cases where this particular pattern applies, when we need
to offload a lot of data to S3 and benefit from column-store analytical queries like aggregates. What kind of data could it be
in an existing Postgres database?
It could be SaaS or something.
I don't know.
What kind of data is beneficial
for us to offload to this?
A very common case is time series data.
I think that's by far
one of the most common cases
where it's like, okay,
at least the old data we store there. I mean, you're not going to look up a single row in some series of events,
or even if you do, if it's like a week old, it's not so terrible, you don't look that up too
often. So it's like, okay, it's fine if that takes a little bit longer, because you're fetching a bit
more data than you actually need. If that means that you can do these aggregates much quicker, I think that's one of the most
common cases, I would say.
Yeah, that makes sense. And that means that, for example, Timescale Cloud, they
have bottomless Timescale, offloading some data transparently, so users don't see it, but
some archive data goes to S3. But this is not open source.
pg_duckdb is open source, which I like.
You mentioned MIT license, that's great.
But I see some kind of, not competition,
but alternative here.
I can choose between going to Timescale Cloud if I need this, or just use it if I can, right?
So, I mean, if I can install the extension,
which is not available in managed services yet, right?
But yeah, it makes sense to me.
Indeed, especially if data is old
and I expect it to go to object storage,
usually I'm not interested in a single row,
I'm interested in aggregates, right?
Yeah, this makes sense for me.
Cool.
That's a good question, though. Are there any, I'm
sure you're talking to lots, but are there any Postgres managed services that already offer it
in like a preview or developer mode? I mean, Hydra is offering it, but other than that?
No, not yet, but yeah, we're definitely in touch with a bunch of them. It's a project that many people are interested in. Yeah.
Yes.
Yeah.
So Hydra currently has a closed beta of a fully managed Postgres service
that has the pg_duckdb extension installed.
Nice.
Yeah.
Let's talk about some numbers,
if you can pull them from the top of your head.
For example, if I need
to aggregate over many millions of rows which
are sparsely distributed. We know, in the worst case, if I touch, I don't know,
one million rows where each one is stored in a separate buffer, then very roughly, in the worst
case, it's not hits, it's reads: I need to get from disk one million buffers, I mean, pages, right? And in the
absolutely worst case, you can imagine it may be like one millisecond for each, if latency
is bad, and we do it without parallelization. That's a thousand seconds. It's terrible, right? And this is what I can
see in Postgres in some cases. What should I expect here if I need, for example, to calculate
a sum over rows, like one million rows stored in this format? Is it seconds, minutes?
You mean if it's sort of... How many rows?
One million rows and some simple aggregates like count, sum or something.
What do you see?
I would say it's generally within a second, I would say, or within a few seconds at least.
Including data transfer from S3, everything, right?
Yeah, yes.
So because it's in Parquet format, you only get the columns that you actually requested.
So even if it's a lot of data in total,
that's often just because you have
a bunch of columns.
So you only
grab the data from S3 that you actually need,
and that is then sort of
streamed through the DuckDB engine
to get these aggregates.
Yeah, that's super cool.
But I mean,
like with all these questions, it always depends a bit on what you're doing exactly,
but for instance,
there's
the ClickBench results;
ClickHouse has a benchmark,
and there pg_duckdb
is quite, among the Postgres-compatible
ones,
I don't think it's at the top. The one with
MotherDuck support is almost at the top, I think, but the one reading from Parquet is a bit lower,
but still much higher than tuned Postgres. I don't know by heart,
but I think it's at least like 10 times faster than the tuned Postgres version, but
maybe even more.
Yeah, that's great.
And I can imagine this opens possibilities to stop saying,
don't do analytics in Postgres, right?
I mean, if we...
So this can be a transaction which does aggregate in Postgres.
If it's within seconds,
we don't have negative consequences of long-running transactions anymore, right?
It's not only about user experience.
It's also about how Postgres performs vacuuming and so on,
this xmin horizon problem and so on, right?
Yeah, yeah.
Yeah, I mean, pg_duckdb was really kicked off
to try and solve the known limitations of analytics in Postgres.
You know, not even at huge scales,
Postgres can be fairly inefficient for doing analytical processing.
So, you know, DuckDB, that's what it excels at.
We're trying to get the project to the point where
the user doesn't necessarily need to know very much about DuckDB at all.
Since these are Postgres users, the service should look and feel exactly like Postgres
for the most part. And then we make those differences clear. If you were to run an
explain plan, it will show you exactly where and when DuckDB is taking over for that execution,
if you want to look a little bit deeper.
Would that be like a custom scan node, or how would it show?
Yeah, that's a custom scan node.
Basically, it's always like it either takes over completely
or it doesn't take over at all.
I mean, that's sort of one of the things we'll probably want to add
at some point, to say, oh, this part actually needs to be executed in Postgres, because DuckDB doesn't know how to do it.
Right now it's either sort of everything or nothing. If you want to do something that
DuckDB doesn't support but you need DuckDB execution, it's just like, okay, this query is not
supported at the moment. But yeah, it's one big custom scan node, basically, that does
the DuckDB execution.
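To see whether DuckDB took over, you can simply run EXPLAIN; when pg_duckdb handles the query, the plan is essentially a single custom scan node wrapping the DuckDB plan. The node name and plan shape below are approximate, written from memory of pg_duckdb's output, and the file and column are hypothetical:

-- Sketch: EXPLAIN on a query that pg_duckdb takes over entirely.
EXPLAIN SELECT count(*)
FROM read_parquet('s3://my-bucket/events.parquet') AS (id bigint);
-- Expected shape (approximate):
--   Custom Scan (DuckDBScan)
--     ... DuckDB plan: aggregate over the Parquet scan ...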
And do you see situations when you need to tune something? There are some tuning capabilities inside DuckDB,
right? But if we see only one node in the plan, we cannot do it from there. What's the
workflow to understand how efficient it is?
So, I mean, at the moment,
the main things you can tune in pg_duckdb
are the number of threads that DuckDB uses
and the amount of memory that it's allowed to use.
Because, I mean,
because you're running it inside Postgres.
Normally DuckDB really likes to take all the resources
of the system and go as fast as it can,
so that your laptop tries to fly away
and overheats.
That's what it wants to do.
But, I mean, if you run it inside Postgres, that's probably not what you want
because you also want to run some other queries.
So that's why we have some limits.
So I think by default it will try to use as many threads as are available,
just so that it completes
quickly, but at least the memory
it tries to limit. But you can lower
both or increase both.
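Those two knobs are exposed as ordinary Postgres settings. The setting names below are assumptions from memory and may not match your pg_duckdb version exactly, so it's safest to list the extension's GUCs first and then adjust the ones you find:

-- List the settings pg_duckdb actually registers in your version.
SELECT name, setting FROM pg_settings WHERE name LIKE 'duckdb.%';
-- Then cap DuckDB's resources (names assumed, values illustrative):
SET duckdb.max_threads_per_postgres_scan = 4;
SET duckdb.max_memory = '2GB';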
And we can run this query
on Postgres standby nodes, right?
On replicas. Yes,
you should be able to. I don't think
I've properly tried it yet, but I
think that should be fine.
And if it doesn't work, please open a PR or an issue.
I'm already thinking about a common approach.
The traditional approach right now, I would say,
is to have an OLTP Postgres database for everything,
but for analytics, we sometimes offload data to some other system.
Or, like with Hydra, it's a different table with column storage,
as I remember. It's also additional effort to keep data synchronized there. Okay,
maybe also here it makes sense in some cases to have the original table
stored in Postgres in row storage, but with
projections, some columns only, going to this additional table, which is stored in this
format in S3, Parquet, processed by DuckDB, right? And this is also an interesting question, do you
see it? But I wanted to ask a question. So if we do this, we can run analytical
queries right inside, and I already see that it makes sense, if it's just selects, to do it on replicas,
not on primaries. We have only one single primary in a single cluster, so it's not good to
take memory for this and affect OLTP workload processing. In this case, it's great if it's fast, because
with hot_standby_feedback on, the xmin horizon is almost not affected, great.
But there I probably would like to give all memory,
all CPU power to this processing,
maybe on a dedicated replica.
Since I'm new to this, sorry if my question is not right,
but what about compression?
So do I have any capabilities
to choose compression method and ratio and so on
in case of Parquet?
Well, if you were to copy the heap table
to Parquet format using pg_duckdb, then, okay, you know, I think you'd see, what,
five times compression.
Depending on the data, right?
At worst, depending on the data, five times; sometimes you see ten times compression.
But this can also be the case for the other columnar formats as well. pg_duckdb supports the Iceberg reader as well as the Delta table reader,
so you could use Delta tables or Iceberg files in object storage as well.
All of those will have better compression than just standard Postgres heap tables.
Yeah, this is what I expected.
As I understand, there are no controls.
You just get it.
That's it.
For now, you can...
It might be that with the COPY...
I mean, we sort of expose all the COPY options
that DuckDB exposes.
So I think they might expose
a compression option
where you can say
which compression type
you want to use
to write the Parquet file.
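Since pg_duckdb routes COPY to an s3:// or .parquet destination through DuckDB, DuckDB's Parquet writer options should be usable there. The destination path and table are made up, and the exact set of accepted options depends on the pg_duckdb and DuckDB versions, so this is only a sketch:

-- Hypothetical: archive old orders to a compressed Parquet file in S3.
COPY (SELECT * FROM orders WHERE created_at < now() - interval '2 years')
TO 's3://my-bucket/archive/orders_old.parquet'
WITH (FORMAT parquet, COMPRESSION zstd);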
Yeah, because usually
it's a trade-off
between CPU utilization
and the size.
Yeah, yeah.
This discussion of CPU
triggers this question in my head.
I know column storage is beating Postgres completely in terms of compression. I remember
a funny case when I said that one billion rows is one terabyte, but the VictoriaMetrics
CTO said, how come, it's just one gigabyte? And we could not understand each other, right?
Because in Postgres, it's indeed one terabyte. Yeah, and we have five x, you said that's
the lowest, that's great. And I think it's also cost saving, and it's not only disk saving
if you offload to S3, but also memory, right? If we have one primary and
multiple standby nodes, in many cases the shared buffer pool and the page cache
state is mirrored between standby nodes. And if it holds data
which we don't need often, it's not good in terms of cache
efficiency, right? If you offload it, you need less memory to process your workload. And this
is probably a huge cost saving as well, right?
If you offload the compute to standby nodes, you mean?
Or are you offloading the...
In general, we take some data, we offload it.
Like, if it's archive data, we don't need it often,
and it also has some patterns
where columnar storage makes sense.
We offload it to S3 using this approach.
And if it's processed by DuckDB,
well, it still needs memory, right?
It still needs memory if we often execute some aggregates. So we consume memory on the Postgres standby anyway, right?
Yes.
Yeah, but even then the memory can still be less
than it would be on Postgres tables
because also in memory sometimes it's compressed. Like, if there are many values that are the same, sometimes
it will just be stored as the count of values and the actual value, just like,
repeat n times, basically. So that also helps the compute; that's one of the tricks that
these columnar storage engines generally use to improve the compute and also the memory usage.
Right, and again, if you do it in row storage, we pull all those blocks with all columns, and it's very inefficient in terms of memory consumption.
Yeah, and they all have the exact same value, basically.
Yeah, yeah, yeah. So this makes sense completely to me.
You know, the common pattern for so many SaaS companies is that really only the last 30 days of data or so is being updated very frequently.
But then after that, the data maybe shouldn't be deleted, but it should still be accessible. And, you know, it can live in object storage, which is two cents per gigabyte per month, as opposed to on disk in Postgres, where it can be 10
cents, 15 cents per gigabyte per month for some services. So not only is it cheaper from the
perspective of raw cost per gigabyte, but the compression, as you mentioned just earlier, is also quite good.
So you can substantially archive data and still have it accessible to the application.
And pg_duckdb also supports caching, so you would be able to cache the Parquet files from S3 to where Postgres is,
so that you can avoid going over the network to connect to S3.
Because that would really be one of the major costs, going from Postgres over to S3 to retrieve those files.
Are you talking about caching in memory or on volumes, on disk, like an EBS volume or something?
On disk, on disk, like an EBS volume or something. On-disk caching.
Basically, it downloads the file,
and then it's like, oh, this one,
we're going to use it a lot of times, it doesn't change.
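pg_duckdb exposes this on-disk caching through a helper function, as best I recall; the function name and signature below are an assumption and may differ by version, and the URL is hypothetical:

-- Assumed helper: cache a remote Parquet file on local disk so that later
-- queries don't have to fetch it from S3 again.
SELECT duckdb.cache('s3://my-bucket/archive/orders_old.parquet', 'parquet');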
Yeah, that's
interesting. Also, I noticed
that, of course, writes are
supported, right? So we
can update this data,
insert, update,
and so on.
And in this case, I'm very curious
about the performance of writes as well.
How does it compare?
What's the sense of it?
I think writes are, I mean,
you can write out a Parquet file,
but I don't think we currently support updating values
in a Parquet file.
What we do support is writing via COPY:
you put in a query and you put all the results in one big Parquet file.
That's the main writing to blob storage that we support.
What we also support is writing to MotherDuck, which is like a managed DuckDB. I mean, it's the company I work for. It's like we have
DuckDB in the cloud, basically. So there we do support inserting into tables that live
there and updating them, sort of more similar to how you would insert into
a Postgres table. Does that make sense?
Yeah, makes sense. An additional
question: if I insert data,
can I do it inside
a complex transaction
which also inserts data into a
regular table in Postgres?
So right now
you cannot write to both a
Postgres table and a
MotherDuck table in the same transaction, specifically because
of what happens when a crash occurs. It kind of works;
it's just that when a crash happens, maybe you only wrote to one of the places.
So that's something we want to protect against.
Because 2PC would protect here.
Yes, 2PC would protect here, but 2PC is not trivial to implement.
And it's super slow as well. It's not fun to use either.
So it's definitely something we're considering for a future version,
but for now it's like, okay, you can write to MotherDuck and Postgres; you can write to
MotherDuck in a transaction, you can do multiple writes, but you cannot write to Postgres in that same transaction.
Will it just produce an error, or do you just mean it's not good?
Yeah, right now it'll produce an error; it's a protection.
Maybe we'll lift it; we have a setting to say, okay,
you know what you're doing, or you don't care about slight inconsistencies in case of a crash. Because it will still roll
back; it's sort of the happy rollback path. If the query fails or something,
it will still roll back, that still works. But in actual crash cases, where
Postgres really goes away without doing its normal cleanup,
yeah, you have some race conditions
where you then write only to MotherDuck
and not to Postgres.
Mm-hmm.
One core feature we haven't talked about yet
is creating temporary tables within Postgres
using the DuckDB
columnar format. When would it
make sense to do that versus, like, caching a Parquet file
locally? Or, yeah, when are you seeing people use
that?
I don't know if we've seen many people use that. One of the main
reasons it's implemented
is because we wanted to add MotherDuck support,
and this was sort of just really easy to also do;
I mean, it was like, oh, this we sort of get for free.
So it wasn't necessarily like,
oh, this is a feature we super need,
but it was like, oh, this is like 10 lines of code extra,
and now we have temporary table support.
And it also makes the testing a bit easier,
because for the MotherDuck testing,
you need an account and you need everything.
So you can have these DuckDB tables, and you can have them locally.
Cool.
Yeah, but one of the things it could be useful for is, like,
ETL kinds of jobs, where
you create some temporary tables,
you do one query,
you create a temporary table to do some local transforms,
you create another temporary table
to do some more local transforms,
and then you store it somewhere else.
So it's also not
completely useless or something.
It's just that it wasn't our main, like, oh, we need this.
Yeah, so that's it.
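Those DuckDB-backed temporary tables are created with a Postgres access method clause; the USING duckdb spelling below matches the pg_duckdb docs as I remember them, and the table and transform are invented for illustration:

-- Hypothetical ETL-style use of a DuckDB temporary table.
CREATE TEMP TABLE tmp_daily_totals USING duckdb AS
SELECT customer_id,
       date_trunc('day', created_at) AS day,
       sum(total) AS total
FROM orders
GROUP BY customer_id, date_trunc('day', created_at);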
I think that's what I find.
I found this project quite confusing to get,
and not just this one, like all DuckDB Postgres-related projects.
I find it hard to get my head around,
partly because it feels like a bag of features. Firstly, I can copy to Parquet format, and then I can offload queries, like I can query through
Postgres to Parquet format, but I can also execute a query that's completely local to Postgres using
the DuckDB engine. That's like a totally different feature, but it's also the same extension. It took me a while to get
my head around that these are different use cases, but all
provided by the same extension, if that makes any sense.
Yeah, that's definitely... I understand
that it's a bit confusing. One reason, I guess,
is because DuckDB can do so many things,
and you kind of want to be able to combine
all of these things. So it's like,
you want to join
a Parquet file with...
You want to read the Parquet file and join it
with some Postgres table to, for instance,
sort of hydrate the user IDs with
the recently
updated user information,
so that the username is
actually what it is now and not what it was two months ago when the record was
created in the Parquet file. And that sort of means that pg_duckdb needs to be
able to read Postgres tables. But then, I mean, if you want, you
could also just make it read Postgres tables only, that's also fine. So I think both
Hydra and MotherDuck,
we had some idea about
what features we think are important.
And you sort of
start a bit from there,
like, oh, these
are the use cases we want to support.
And then, how can we
put all those use cases
together,
you know, in a way that
works nicely?
And then you also end up with some other use cases you didn't even think about.
Yeah, it's a really good point.
I've seen Postgres itself described by many people,
even myself, as a multi-tool type of database.
Adding DuckDB into Postgres makes it probably even more of a multi-tool database,
where the average user is not going to use every feature, nor should they try to. But if there's
something there that's useful to them that just kind of comes along with DuckDB, certainly there's
no reason for us to try and remove it or disable it. You kind of get all these cool DuckDB features along with
Postgres, as a complete kind of all-in-one multi-tool type database. In fact,
I've kind of started to use the term multi-store a little bit; I think SingleStore
and Unistore were taken. But really, what's happening is you have two different databases
kind of working together, and the data can also be stored in multiple formats, both in,
you know, heap as well as Parquet. So, I mean, that could be another way to think about it.
Yeah. Because of time, I'm starting to think about wrapping up. Do you guys want to talk a little bit
about what you're working on at the moment,
or what your plans are for the future?
Sure.
So I think there's a...
I mean, we released the latest version a week ago or something.
Yeah.
And the things we're working on now,
we're working basically on two big things.
One is redoing the whole way we read from Postgres.
Because before we were doing low-level reads of the actual bytes
and re-implementing the heap reader, in a sense.
But that means you don't get index support out of the box,
and then you also need to implement the index reader and stuff like that. So now we're trying instead
to build a Postgres query again and then give it to the Postgres planner, and
then make Postgres figure it out. And that's working very well, because
then you get foreign data wrapper support, sort of; you can call foreign data
wrappers from Postgres from within DuckDB again. So it's definitely sort of all the way down.
So that's one big feature.
And the other is one thing people have been running into
in the first two versions,
is that to be able to read from Parquet,
like you use the DuckDB functions to read Parquet
files and read Iceberg files, Iceberg catalogs, you need to know what the columns are.
And right now you need to specify those columns and the types in your query, which is a bit
of a hassle.
So I'm working on a way to make it so that you don't have to do that anymore.
Because Postgres wants to be very strict about these things.
That's one of the big differences between DuckDB and Postgres.
Postgres is very strict about certain things.
And DuckDB is like, oh, you typed this.
You probably meant to do this thing.
This kind of makes sense.
I can understand what you're doing, what you're trying to do here.
Let me give you the sensible results.
And Postgres is much more like, oh, this is
not exactly how I expected it.
So, that's sort of
where there's a bit of trouble.
So, I'm trying to make the UX
there a lot nicer.
I have a way now
that's working quite well, but it needs a bit
more polish.
Probably after
Christmas or after New Year's, I'll also create a PR
to make that easier to work with. So you could just do select star from read_csv or read_parquet and
not have to specify the columns. That's sort of the user experience
we want. But right now you need to do select star from read_parquet and then specify all
the columns that you actually want to read from the Parquet file,
which kind of defeats the purpose of the star.
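For reference, the contrast Jelte describes looks roughly like this; the file path and columns are invented, the first form reflects the syntax of the releases available at the time of recording, and the second form is the planned user experience rather than something that worked then:

-- Today (approximate syntax): the columns and types must be spelled out.
SELECT *
FROM read_parquet('s3://my-bucket/events.parquet')
     AS (event_id bigint, created_at timestamptz, amount numeric);

-- The goal described above: plain select star, no column definition list.
-- SELECT * FROM read_parquet('s3://my-bucket/events.parquet');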
Makes perfect sense. And how about from your side, Joe?
Yeah, I think Jelte covered it well.
As far as, you know, the way that we scan Postgres tables, that should be merged in, I think, relatively soon.
And then we'll let it bake before the next official release.
As Jelte mentioned, we released the last version just a week ago, so we have a pretty good
clip at this point of shipping new releases.
If there's something that folks want to see, certainly head on over to the repo and open
an issue or let us know.
And then as far as Hydra goes, as I mentioned, we have a managed cloud Postgres service
that's in beta that has pg_duckdb integrated.
So we're looking to roll that out into public beta
so people can begin to use pg_duckdb
without having to self-manage it.
Today, you can obviously take pg_duckdb
and put it on an EC2 box with Postgres
or whatever you want to do.
So yeah, looking forward to shipping that out.
Yeah, that's great.
I just thought 20 years ago,
MySQL was dominating the open-source database field,
and they had several engines. I remember you needed to pick: one was not crashing, another had full-text search, I mean MyISAM and InnoDB.
It was a trade-off. And now, like 20 years later, we talk about various engines, basically cloud-friendly ones, because you can offload to object storage.
And it's amazing that we can combine these in Postgres right now, and it looks like it's only starting, right?
So a lot of stuff can be added, developed, and so on.
It's only the beginning as I see it.
Yeah, it's pretty interesting to see the data lake come to Postgres.
It furthers your point, which is that many of the layers of modern databases are being pulled apart.
There's multiple engines.
There's object storage.
There's compute with Postgres.
So I think that trend is continuing to play out.
As far as the average app developer is concerned,
like I was indicating before,
I don't think they necessarily need to know
or care so much about the differences
between a database versus a data lake.
So hopefully the interface being Postgres
helps bridge the gap of using something really cool
like DuckDB, but, you know,
together with what you're used to using in Postgres.
Yeah, that's cool.
Nice one.
Well, thank you so much both for joining.
Thank you, Nikolay, as always.
And catch you soon.
Yeah, thank you, guys.
Thank you.
Thank you.