Postgres FM - pg_duckdb

Episode Date: January 3, 2025

Michael and Nikolay are joined by Joe Sciarrino and Jelte Fennema-Nio to discuss pg_duckdb — what it is, how it started, what early users are using it for, and what they're working on next.

Here are some links to things they mentioned:

Joe Sciarrino https://postgres.fm/people/joe-sciarrino
Jelte Fennema-Nio https://postgres.fm/people/jelte-fennema-nio
pg_duckdb https://github.com/duckdb/pg_duckdb
Hydra https://www.hydra.so
MotherDuck https://motherduck.com
The problems and benefits of an elephant with a beak (lightning talk by Jelte) https://www.youtube.com/watch?v=ogvbKE4fw9A&list=PLF36ND7b_WU4QL6bA28NrzBOevqUYiPYq&t=1073s
pg_duckdb announcement post (by Jordan and Brett from MotherDuck) https://motherduck.com/blog/pg_duckdb-postgresql-extension-for-duckdb-motherduck
pg_duckdb 0.2 release https://github.com/duckdb/pg_duckdb/releases/tag/v0.2.0

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai

With special thanks to Jessie Draws for the elephant artwork.

Transcript
Starting point is 00:00:00 Hello and welcome to PostgresFM, a weekly show about all things PostgreSQL. I am Michael, founder of pgMustard, and I'm joined as usual by Nikolay, founder of Postgres.ai. Hey Nikolay. Hi Michael. How are you? No, not falling for it. Today we are delighted to be joined by two guests. Firstly, Joe from Hydra. Welcome Joe. Thank you.
Starting point is 00:00:19 And also Jelte from MotherDuck. Welcome Jelte. Hi. Hi all. Thanks so much both for joining us. We are going to be discussing a new-ish project called pg_duckdb, which very interestingly is a cross-organisation collaboration. So, Jelte, I saw the recording of your excellent lightning talk introducing pg_duckdb at PGConf EU. Nice job.
Starting point is 00:00:43 Perhaps you could start us off with how you describe pg_duckdb to a Postgres-savvy audience who may or may not know what DuckDB is. All right, so it's an extension for Postgres, and it embeds DuckDB. And DuckDB is an in-process analytics database, like a SQL database.
Starting point is 00:01:03 So it's a bit like SQLite, but then more for analytics, like big queries with lots of aggregates and lots of joins, and less like SQLite or like Postgres in general for smaller lookups, like getting a hundred rows or something like that with index scans. Like indexes are not something that's commonly used
Starting point is 00:01:24 in the DuckDB land. It's like, oh, we just scan the whole table and we compute stuff very fast on it. Yeah, and that is then embedded into Postgres so that you can use that for some of your heavier analytical queries that don't rely on indexes. Nice, and I'm sure we're going to come back to indexes in a bit once we dive into the details.
Starting point is 00:01:44 But Joe, I think the origin story of this project is super interesting. I'd be fascinated. I was reading the MotherDuck blog post and it mentioned that Hydra kicked off the effort. So I'd love for you to share a little bit more about that if you can. Sure, sure. So our team had been working on a crazy weekend project that we initially were calling pg_quack. It was called pg_quack because, well, ducks go quack, but also because we didn't necessarily expect this to work. We kind of were writing it off as a quacky project.
Starting point is 00:02:17 So it was a little experiment around the kinds of analytics queries and aggregates like Jelte mentioned. And we had this aha moment that we could actually take pg_quack pretty seriously and do it for real in collaboration with folks that know a whole lot about DuckDB. You know, we're Postgres guys by background mostly. So not to get too far ahead, but we decided to build pg_duckdb for real as a production grade extension together with the folks over at DuckDB Labs, the creators of DuckDB, which is Hannes and Mark, as well as the folks at MotherDuck, like Jelte. So we've been working together with them on it, oh, for at least the last six, seven months now. Yeah, super cool. And I saw a few other organizations mentioned as well.
Starting point is 00:03:21 How do they all fit in? Yeah, so this is an open source, MIT licensed extension. So for folks that want to participate or can see pg_duckdb being really helpful in their use cases, they can either tell us directly what they'd like to add or open a PR or open issues. I mean, certainly there are a good number of those as the project is currently in beta. So be gentle with it, but also tell us the things that could be improved on. One of the things that was surprising to me was the immediate reaction that we saw to how DuckDB could be useful for accelerating analytics queries in Postgres. And there are a lot of organizations that went, oh yes, okay, we want to support this, we want to get involved. So that's probably some of the names you were seeing.
Starting point is 00:04:09 Yeah, I think I saw, was it Neon and Microsoft, maybe some others as well? Yeah, there's a whole range of folks that are interested in us, you know, moving this forward, and they're providing either direct or indirect support for the project. Yes. Jelte, from your side, what are you, actually I'd be interested in both of your thoughts on this, but what are you seeing early users most excited about? Like what features are they most keen to see added or improved? I think one of the big things is that you can sort of read Parquet files or CSV files from blob storage, like from S3 or from Azure or from all the big clouds, basically, and Cloudflare as well. And I think that's a big feature that people are very interested in. And I mean, there's so many open issues of people wanting a specific thing
Starting point is 00:05:05 that it's hard to sort of pick one that's like, oh, that's the thing people want. But I mean, one that has come up a few times is support for spatial stuff. Like, so PostGIS things. And that works slightly differently in DuckDB. So we haven't started on that. We're first sort of building out the base a bit more, but that's definitely one of the features that comes up. Like, people wanting to do spatial analytics on spatial data, basically. Yeah.
Starting point is 00:05:36 But that's not supported right now. Yeah. So one of the biggest features is that data can be... I mean, it's not a feature. It's like a requirement here. Data is on object storage like S3, right? In Parquet format. I remember some attempts to have access to Parquet files on object storage as an extension before pg_duckdb, right? And it could also be done in the FDW way, the foreign data wrapper way. I think that was just some SQL interface over those files, and that's it. But DuckDB is bringing something else.
Starting point is 00:06:30 I've never tried it myself, but I've heard many opinions about very good performance, right? Yeah, yeah. So that's one of the big differences between pg_duckdb and some of the other foreign data wrapper-based ones, is that we sort of flip it around a bit. Instead of saying, oh, this table is stored somewhere else, we'll read that using DuckDB or something else. Instead,
Starting point is 00:06:59 we're saying like, okay, we push the whole query, we put it in DuckDB, and then let DuckDB read the Postgres data. So the Postgres sort of becomes a foreign data wrapper kind of inside DuckDB. Does that make sense? It's more like a query stealing approach. It's what Mario from Hydra, one of the developers there, called it. Instead of letting the Postgres execution engine
Starting point is 00:07:24 do a lot of things, it was like, no, we don't want the Postgres execution engine to do a lot of things, because the DuckDB execution engine is much better for this analytic stuff, like the sorting and the grouping and the joining. We want to do as much as possible inside DuckDB.
Starting point is 00:07:39 Yeah, that's interesting. Yeah, and also, you know, kind of back to Michael's question a little bit too. A really common pattern for a lot of application developers is to have Postgres, but also to have, you know, 2.6 terabytes of event data that they really shouldn't necessarily be storing in Postgres, but they could offload into Parquet format in S3. But that data still needs to be accessible by Postgres, because one of the important things is that pg_duckdb can perform the join between a heap table in Postgres
Starting point is 00:08:26 together with the Parquet files in object storage so that there's not this siloing or needing for ETL processes between Postgres and external analytics databases and S3. So, you know, how do you smooth over the differences between columnar format and row-based format? And I think pgDuckDB goes a long way to do that because ultimately when you're developing applications, thinking about table format is pretty low level. So if you can avoid that as much as possible, that's probably just easier. Yeah, but here I'm confused a little bit because this thought is like, let's have tiering of storage.
Starting point is 00:09:11 Yeah, but here I'm confused a little bit, because this thought is like, let's have tiering of storage. And for example, I remember several cases, big companies where this was needed, and the decision was made that for scaling purposes of the main clusters, we need to offload archive data somewhere else, maybe to a different cluster or somewhere else. Maybe sometimes we just delete it. It could be, for example, e-commerce and old orders, order history exceeding, for example, a couple of years. We don't need that. But offloading this data to S3, this sounds great. And I remember there's also the extension
Starting point is 00:09:41 PGT from Tembo. Again, I haven't tried it yet. It's in my to-do list. But offloading that, I'm thinking, if I also receive column storage for this kind of data, is it right? Because I still need only one row in these cases, right? In some cases, I do need aggregates. I don't know, time series, I would love to offload to a three in column store format, but not in these cases when I just need one row. Yeah. I mean, I think it's different needs for different users. If you are a, you know, application developer just trying to build something quickly, then,
Starting point is 00:10:19 you know, if we can tell you, hey, dump a lot of your event data into Parquet format in S3 or archive it out of Postgres on some increment every seven days, every 30 days or something, then that's easy enough. And then maybe the thinking about columnar versus row should subside. But yeah, I mean, data tiering and data lake is not as different as, you know, they're really not that different. It's just a function of what the user interacts with. Right, but my concern is if I need the row store, like I need to find all columns for one row to present it. For example, again, orders, right?
Starting point is 00:10:57 I need one order. If we pull it from archive, I expect not good performance. But if it's column store, I will be concerned about performance to collect all the rows versus row storage, right? Let's formulate the question differently. Let's talk about the use cases where this particular pattern, when we need to upload a lot of data to S3 and benefit from column store analytical queries like aggregates, what kind of data it could be in existing Postgres database? It can be SAS or something. I don't know.
Starting point is 00:11:31 What kind of data is beneficial for us to offload to this? A very common case is time series data. I think that's by far one of the most common cases where it's like, okay, at least the old data we store in, I mean you're not going to look up a single row in some series of events or even if you do, if it's like your week old it's not so terrible that you don't look that up too
Starting point is 00:11:56 often so it's like okay it's fine if that takes a little bit longer because you're fetching a bit too much more data that you actually need. If that means that you can do these aggregates much quicker i think that's that's one of the most common cases i would say yeah that makes sense and that means that for example timescale cloud they have it bottomless timescale of loading some data transparently so we don't they don't see it but it goes some archive data goes to s3, but this is not open source. PGWDB is open source, which I like. You mentioned MIT license, that's great. But I see some kind of, not competition,
Starting point is 00:12:33 but alternative here. I can choose between going to Timescale Cloud if I need this, or just use it if I can, right? So, I mean, if I can install extension, which is not available in managed service yet, right? But yeah, it makes sense to me. Indeed, especially if data is old and I expect it to go to object storage, I definitely not,
Starting point is 00:12:57 usually I'm not interested in single row, I'm interested in aggregates, right? Yeah, this makes sense for me. Cool. That's a good question though are there any i'm sure you're talking to lots but are there any postgres managed services that already offer it in like a preview or developer mode i mean hydra is offering it but other than that yes no not yet but yeah we're definitely in box with with a bunch of them yeah It's a project that many people are interested in. Yeah.
Starting point is 00:13:25 Yes. Yeah. So Hydra currently has a closed beta of a fully managed Postgres service that has pg.db extension installed. Nice. Yeah. Let's talk about some numbers. If you can,
Starting point is 00:13:38 if you can pull them from top of your head, for example, if I need to, I need to aggregate over many million rows which are sparsely distributed, we know in the worst case, for example, if I touch, I don't know, one million rows which each one is stored in separate buffer in the worst case, for example, very roughly, I need to touch, in worst case, it's not hits, it's reads. I need to get from disk 1 million buffers, I mean, pages, right? And in the worst case, it can be absolutely worst. You can imagine it may be like 1 millisecond for each, if latency
Starting point is 00:14:17 is bad, and we do it without parallelization. It's 1 million seconds. It's terrible, right? And this is what I can see in Postgres in some cases. What should I expect here if I need, for example, to calculate sum of rows, like 1 million rows stored in this format? What, like, is it seconds, minutes? You mean if it's sort of block How many rows? One million rows and some simple aggregates like count, summon or something. What do you see? I would say it's generally within a second, I would say, or within a few seconds at least. Including data transfer from S3, everything, right? Yeah, yes.
Starting point is 00:15:02 So because it's in per cave format, you only get the columns that you actually requested. So even if it's a lot of data, you generally just because you have a bunch of columns. So you only grab the data from S3 that you actually need and that is then sort of streamed through the.tb engine
Starting point is 00:15:20 to get these aggregates. Yeah, that's super cool. But I mean, like with all these questions, it always depends a bit on what you're doing exactly, but for instance, there's Clickbench results, Clickhouse has a benchmark,
Starting point is 00:15:38 and their DigitWP is quite, from the Postgres compatible ones, I don't think it's at the top the one with mother duck support is almost at the top i think but the one with frog parquet is is a bit lower but still like much higher than tuned postgres i think that's like it's uh i don't know by heart but it's i think it's at least like 10 times faster than than the tuned postgres version but maybe even more yeah that, that's great.
Starting point is 00:16:05 And I can imagine this opens possibilities to stop saying, don't do analytics in Postgres, right? I mean, if we... So this can be a transaction which does aggregate in Postgres. If it's within seconds, we don't have negative consequences of long-running transactions anymore, right? It's not only about user experience. It's also about how Postgres performs vacuuming still and so on,
Starting point is 00:16:32 this XminHorizon problem and so on, right? Yeah, yeah. Yeah, I mean, pgDuckDB was really kicked off to try and solve the known limitations of analytics in Postgres. You know, not even at huge scales, Postgres can be fairly inefficient for doing analytical processing. So, you know, DuckDB, that's what it excels at. We're trying to get the project to the point where
Starting point is 00:17:00 the user doesn't necessarily need to know very much about DuckDB at all. Since these are Postgres users, the service should look and feel exactly like Postgres for the most part. And then we make those differences clear. If you were to run an explain plan, it will show you exactly where and when DuckDB is taking over for that execution, if you want to look a little bit deeper. Would that be like a custom scan node, or how would it show? Yeah, that's a custom scan node. Basically, it's always like it either takes completely over
Starting point is 00:17:35 or it doesn't take over at all. I mean, that's sort of one of the things we'll probably want to add at some point, like to say, like, oh, this part actually needs be executed in a postcode because that it doesn't know how to do it right right now it's it's either sort of everything or nothing if if you want to do something that ductv doesn't support the film need ductv execution and just like okay this query it's it's not supported at the moment yeah but yeah it's a it's it's one big custom scan node basically that does like the ductv execution and do you see situations when you need to tune something like there are some tuning capabilities inside duck db
Starting point is 00:18:11 right but if we see only one node in the plan like we cannot do it from there like what's the what's the workflow to understand like how efficient it is so So, I mean, at the moment, the main things you can tune in BG DuckDB are the amount of threads that DuckDB uses and the amount of memory that it's allowed to use. Because it's, I mean, because you're running it inside post, normally DuckDB really likes to take all the resources
Starting point is 00:18:39 of the system and go as fast as it can, so that your laptop tries to fly away and overheats. That's what it wants to do. But, I mean, if you run it inside Postgres, that's probably not what you want because you also want to run some other queries. So that's why we have some limits. So I think by default it will try to use as many threads as are available
Starting point is 00:19:04 just so that it completes quickly, but at least the memory is trying to limit. But you can lower both or increase both. And we can run this query on Postgres standby nodes, right? On replicas. Yes, you should be able to. I don't think
Starting point is 00:19:19 I've perfectly tried yet, but I think that should be fine. And if it doesn't work, please open a BR or like an issue. I'm already thinking about like common approach, like I would say traditional approach right now is to have LTP database Postgres for everything, but for analytics, we offload data to some another system sometimes. Or like with Hydra, it's a different table with column storage
Starting point is 00:19:47 as remember it's it's it's also like additional effort to maintain data synchronized here if some data okay again maybe maybe also here it maybe makes sense in some cases to have original table stored in postgres and row storage but this table maybe with maybe projections some columns only go to this additional table which is stored in this format in s3, parquet, duckdb, processed right and this is also interesting question do you see do you see it but I wanted to ask question. So if we do this, we can run analytical queries right inside and I already see that it makes sense if it just selects to do it on replicas, not to like primaries. We have only one single primary in the single cluster, so it's not good to
Starting point is 00:20:38 take memory for this and so on and affect all TPE workload processing. In this case, it's great that if it's fast because host and buy feedback is on, Xmin Horizon is almost not affected, great. But there I probably would like to give all memory, all CPU power for this processing, maybe it's dedicated replica. Since I'm new to this, sorry if my question is not right, but what about compression? So do I have any capabilities
Starting point is 00:21:12 to choose compression method and ratio and so on in case of Parquet? Well, if you were to copy the heap table to Parquet format using pg.db, then, okay, you know, I think you'd see what, five times compression. Depending on data, right? At worst, depending on the data, five times, sometimes you see 10 times compression. But this can also be the case for the other columnar formats as well. pg.pp supports the iceberg reader as well as the delta table reader.
Starting point is 00:21:50 You could use delta tables or iceberg files in object storage as well. All of those will have better compression than just standard Postgres heap tables. Yeah, this is what I expected. As I understand, there are no controls. You just get it. That's it. For now, you can... It might be that the copy...
Starting point is 00:22:17 I mean, we sort of expose all the copy options that DuckTV exposes. So I think they might expose like a compression option that you can say like which compression type you want to use to write the Parquet file.
Starting point is 00:22:33 Yeah, because usually it's a trade-off between CPU utilization and the size. Yeah, yeah. This discussion of CPU triggers this question in my head and I know column storage is like beating postgres like completely in terms of compression i remember
Starting point is 00:22:51 a funny case when i said like one billion rows is one terabyte but victoria victoria matrix cto said how come it's just one gigabyte? And we could not understand each other, right? Because in Postgres, it's indeed one terabyte. Yeah, but we have five, five x, you said it's the lowest, that's great. And I think it also cost saving and, and it's not only disk saving in terms of if you're offloaded to s3, but also memory, right? If we have one primary multiple standby nodes, we like many cases, the shared buffer pool and the page cache status mirror between standby nodes. And if it says data, which we don't need often, it's not good in terms of cache
Starting point is 00:23:35 efficiencies, right? If you're floated, if you're floated to this, you like less memory to process your workload. And this is probably a huge cost saving as well, right? If you offload the computer to standby notes, you mean? Or are you offloading the- In general, we take some data, we offload it. Like if it's archive data, we don't need it often. And it also have some patterns
Starting point is 00:24:01 where common storage makes sense. We offload it to S3 using this approach. And if it's processed by DuckDB, well, it still needs memory, right? It still needs memory if we often execute some aggregates. So we consume memory on Postgres standby anyway, right? Yes. Yeah, but even then the memory can still be less than it would be on Postgres tables
Starting point is 00:24:24 because also in memory sometimes it's compressed. Like if there's many values that are the same, sometimes it will just be stored as like the amount of values and the actual value. It's just like repeats, repeat n times basically. So that also makes the compute, that's one of the tricks that this column of storage engine is generally used to use to improve this compute and also the memory usage. Right, and again, if you do it in row storage, we pull all those blocks with all columns and very inefficient in terms of memory consumption. Yeah, and they all have the exact same value, basically. Yeah, yeah, yeah. So this makes sense completely to me. You know, the common pattern for so many SaaS companies is that really only the last 30 days of data or so is being updated very frequently.
Starting point is 00:25:26 But then after that, the data maybe shouldn't be deleted, but it should be accessed. And it, you know, it can live in object storage, which is two cents per gigabyte per month, as opposed to on disk and Postgres, it's can be 10 cents, 15 cents per gigabyte per month for some services. So not only is it cheaper from a cost perspective of the the raw gigabyte per cost, but also the compression, as you mentioned just earlier, is quite good. So you can substantially archive data, still have it accessible to the application. And pgDuckDB also does support caching. So you would be able to cache the Parquet files from S3 to where Postgres is. So that you can avoid running over the network to connecting to S3. Because that would be really one of the major costs is running over from Postgres to S3 to retrieve those files. Are you talking about caching in memory or on volumes, on disk, like a BS volume or something?
Starting point is 00:26:23 On disk, on disk, like a BIS volume or something. On disk caching. Basically, it downloads the file, and then it's like, oh, this one, we're going to use it a lot of times, it doesn't change. Yeah, that's interesting. Also, I noticed that, of course, writes are supported, right? So, we
Starting point is 00:26:40 can update this data, insert, update, and so on. And in this case, right? Insert, update, and so on. And in this case, I'm very curious about performance of writes as well. Is it like, how to compare it? What's the sense of it? I think writes are, I mean,
Starting point is 00:26:59 you can store in a parquet file, but I don't think we currently support updating values in a Parquet file. We do support the real life you copy. You put a query and you put all the results in one big Parquet file. Or that's the main writing to Blobsource that we support. What we do support is also writing to Mother mother duck, which is like a managed duck DB. That's I mean, it's the company I work for. It's like we have we have like WP the clouds basically. So there there we do support inserting into tables that live
Starting point is 00:27:36 there and updating them like like sort of more similar to how you wouldn't insert it to Postgres table. Does that make sense? Yeah, makes sense. And additional question, if I insert data and can I do it inside like complex transaction which probably inserts data to regular table in Postgres?
Starting point is 00:27:58 So right now you cannot write to both Postgres table and a Mother Dev table in the same transaction, specifically because when a crash happens, it kind of works. It's just that when a crash happens, maybe you only wrote to one of the places. So that's something we want to protect against. Because 2PC would protect here. Yes, 2PC would protect here, but 2PC is not trivial to implement.
Starting point is 00:28:24 And it's super slow as well. It's not fun to use as well. Yes, so there's definitely something you're considering for a future version, but for now it's like, okay, you can write to MotherDuck and Postgres. You can write to MotherDuck in a transaction, you can do multiple writes, but you cannot write to Postgres in that same transaction. Will it just produce an error or just you mean it's not good? Yeah, no, because right now it'll produce an error, like it will say like oh it's not, it's a protection, yeah, against like maybe we'll lift, like we have a setting to say like okay, you know what you're doing or you don't care about slight inconsistencies in case of crash because it's it will still roll
Starting point is 00:29:09 back like it's sort of the like the happy rollback path it will or like the query fails or something it will still roll back that that would that would that still works but actual crash cases where pulse curves really really goes away without away without doing its normal cleanup. Yeah, you have some race conditions where then you write only to MotherDuck and not to Postgres. Mm-hmm. One core feature we haven't talked about yet
Starting point is 00:29:37 is creating temporary tables within Postgres using the Duck using the duck DB columnar format that see like what when would when would it make sense to do that versus like caching a parquet file locally? Or like what's Yeah, when are you seeing people use that? I don't know if we've seen many people use that. I like the main reason. The one of the main reasons it implemented
Starting point is 00:30:05 is because we wanted to add MotherDuck support. And this was sort of just really easy to also, I mean, it was like, oh, this we sort of get for free. So that was, it wasn't necessarily like, oh, this is a feature we super need, but it was like, oh, this is like 10 lines of code extra, and now we have temporary table support.
Starting point is 00:30:28 And it also makes the testing a bit easier because then you don't have to, like, for the mother-of-the-duck testing, you need, like, I mean, you need an account and you need everything. So you can have these DuckDB tables, you can have them locally. Cool. Yeah, but one of the things that could be useful for is, like, for, like, ETL kind of jobs where you have you create some temporary tables
Starting point is 00:30:49 you do one query, you do one query you create a temporary table to do some local transforms you create another temporary table to do some local transforms and then you store somewhere else. So it's also not completely useless or something. It's just that it wasn't our main, like, oh, we need this.
Starting point is 00:31:08 Yeah, so that's it. I think that's what I find. I found this project quite confusing to get. And not just this one, like all DuckDB Postgres-related projects. I find it hard to get my head around, partly because it feels like a bag of features like firstly i can copy to parquet format and then i can offload queries like i can query through postgres to parquet format but i can also execute a query that's completely local to postgres using the wdb engine like that's like a totally different feature, but it's also the same extension. Like, it took me a while to get
Starting point is 00:31:47 my head around that these like different use cases, but all provided by the same extension, if that makes any sense. Yeah, that's definitely that's definitely, I understand the fact that it's a bit confusing. One reason, I guess, is because DuckDB can do so many things. So it's like, okay, and you kind of want to be able to combine all of these things. So it's like, you want to join
Starting point is 00:32:10 a Parquet file with... You want to read the Parquet file and join it with some Posker's table to, for instance, sort of hydrate the user IDs with the recently updated user information so that the username is actually what it is now and not what it was like two months ago when when the record was
Starting point is 00:32:28 created in the.pk file and that sort of means that DGDuckDVD needs to be able to read BOSGIRS tables but then I mean you know I mean if you want you could also just make it read BOSGIRS tables also fine so it's I think both I mean both Hydra and Maledoc, we had some sort of, we had some idea about what features we think are important. And I mean,
Starting point is 00:32:52 and sort of, you start a bit from there and like, oh, this is, these are the use cases we want to support. And then how can we pollute that all, all those use cases sort of together, you know, in a way that
Starting point is 00:33:02 that works nice. And then you also end up with some other use cases you didn't even think about. Yeah, it's a really good point. I've seen Postgres itself described by many people, even myself, as a multi-tool type of database. With adding DuckDB into Postgres makes it probably even more of a multi-tool database where the average user is not going to use every feature, nor should they try to. But if there's something there that's useful to them that just kind of comes along with DuckDB, certainly there's
Starting point is 00:33:38 no reason for us to try and remove it or disable it. you you kind of get all these duck db cool features along with now postgres just as in a complete kind of all-in-one multi-tool type database in fact i've kind of started to use the term like multi-store a little bit um i think single store and unistore was taken so you know but really what's what's happening is you have two different databases kind of working working together and the data is also can be stored in multiple formats both in you know heap as well as parquet so i mean that could be another way to think about it yeah yes because of time starting to think about wrapping up do you guys want to talk a little bit about what you're working on at the moment
Starting point is 00:34:26 or what your plans are for the future? Sure. So I think there's a... I mean, we released the latest version a week ago or something. Yeah. And the things we're working on now, we're working basically on two big things. One is redoing the whole way we read from Postgres.
Starting point is 00:34:47 Because before we were doing low-level reads of the actual bytes and re-implementing the heap reader, in a sense. But that means you don't get index support at MFP, then you also need to implement the index reader and stuff like that so so now we're trying to more to build a query again to do to do that like to build a Postgres query again and then give it to the Postgres planner and then make Postgres figure out and that that's that's working very well because then then you get foreign data wrapper support sort of you can call for data wrappers from Postgres from within duck DB again. So it's definitely sort of all the way down.
Starting point is 00:35:31 So that's one big feature. And the other is one thing people have been running into, sort of people are running into in the first two versions, is that to be able to read from Parquet, like you use the DuckDB functions to read Parquet files and read Iceberg files, Iceberg catalogs, you need to know what the columns are. And right now you need to specify those columns and the types in your query, which is a bit of a hassle.
Starting point is 00:35:59 So I'm working on a way to make it so that you don't have to do that anymore. Because Postgres wants to be very strict about these things. That's one of the big differences between DuckTV and Postgres. Postgres is very strict about certain things. And DuckTV is like, oh, you typed this. You probably meant to do this thing. This kind of makes sense. I can understand what you're doing, what you're trying to do here.
Starting point is 00:36:21 Let me give you the sensible results. And Postgres is much more like, oh, this is not exactly how I expected it. So, that's sort of where there's a bit of trouble. So, I'm trying to make the UX there a lot nicer. I have a way now
Starting point is 00:36:37 that's working quite well, but it needs a bit more polish. Probably after Christmas or after New Year's, I'll also create a BR to get that easier to work with. So you could just do select star from read CSV and don't have to read Parquet and don't have to specify the columns. That's sort of the user experience we want. But right now you need to do select star from read Parquet and then specify all the columns that you actually want to read from the parquet file,
Starting point is 00:37:05 which kind of defeats the purpose of the star. Makes perfect sense. And how about from your side, Joe? Yeah, I think Yelta covered it well. As far as, you know, the way that Postgres scan tables, you know, that should be merged in, I think, relatively soon. And then it will let it bake before the next official release. As Yelta mentioned, we released the past version just a week ago, so we have a pretty good clip at this point of shipping new releases. If there's something that folks want to see, certainly head on over to the repo and open
Starting point is 00:37:41 an issue or let us know. And then as far as Hydra goes, as I mentioned, we have a managed cloud Postgres service that's in beta that has integrated pgDuckDB. So we're looking to roll that up out into public beta so people can begin to use pgDuckDB without having to self-manage it. Today, you can obviously take pgDuckDB and put it on an EC2 box with Postgres
Starting point is 00:38:08 or whatever you want to do. So yeah, looking forward to shipping that out. Yeah, that's great. I just thought 20 years ago, MySQL was dominating the open-source database field, and they have several engines. I remember you needed to pick up like one was not crashing, another was full-text search, I mean MySum and InnoDB. It was a trade-off. And now, like 20 years later, we talk about various engines, basically, cloud-friendly, because you can upload it to object storage. And it's amazing that we can combine these in Postgres right now, and it looks like it's only starting, right?
Starting point is 00:39:00 So a lot of stuff can be added, developed, and so on. It's only the beginning as I see it. Yeah, it's pretty interesting to see Data Lake come to Postgres. It goes to further your point, which is many of the layers of the modern databases are being pulled apart. There's multiple engines. There's object storage. There's compute with Postgres. So I think that trend is continuing to play out.
Starting point is 00:39:30 As far as the average app developer is concerned, like I was indicating before, I don't think they necessarily need to know or care so much about the differences between a database versus a data lake. So hopefully the interface being Postgres helps bridge the gap of using something really cool like DuckDB, but, you know,
Starting point is 00:39:52 together with what you're used to using in Postgres. Yeah, that's cool. Nice one. Well, thank you so much both for joining. Thank you, Nikolai, as always. And catch you soon. Yeah, thank you, guys. Thank you.
Starting point is 00:40:06 Thank you.
