The Data Stack Show - 249: Quacking Through Data: DuckDB's Emerging Ecosystem

Episode Date: June 18, 2025

This week on The Data Stack Show, John Wessel and Matt Kelliher-Gibson dive into the recent Duck Lake announcement, exploring the evolving landscape of data analytics technologies. They discuss DuckDB's role as a lightweight, local analytics database and its potential as a caching layer for open table formats like Iceberg. The conversation also highlights the current state of data storage standards, focusing on agreements around Parquet and Iceberg, while noting the ongoing complexity in catalog management. Key takeaways include the importance of local compute solutions, the early stage of open table formats, and the potential for simplified data infrastructure that can provide faster, more cost-effective analytics workflows. The episode underscores the ongoing innovation in data technologies and the need for more streamlined, flexible data management solutions. Don't miss it!

Highlights from this week's conversation include:

Discussion on Duck Lake Announcement (1:41)
Compatibility with Apache Iceberg (4:05)
Use Cases for DuckDB (6:23)
Concerns About Data Management (10:01)
Introduction to Data Formats (11:40)
Catalog Space Challenges (13:13)
Metadata Orchestration (14:54)
Simplicity in Data Management (15:25)
SQL Demo Discussion (17:26)
Wrap-Up and Final Thoughts (18:44)

The Data Stack Show is a weekly podcast powered by RudderStack, customer data infrastructure that enables you to deliver real-time customer event data everywhere it's needed to power smarter decisions and better customer experiences. Each week, we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcript
Starting point is 00:00:00 Hi, I'm Eric Dodds. And I'm John Wessel. Welcome to the Data Stack Show. The Data Stack Show is a podcast where we talk about the technical, business, and human challenges involved in data work. Join our casual conversations with innovators and data professionals to learn about new data technologies and how data teams are run at top companies. Before we dig into today's episode,
Starting point is 00:00:30 we want to give a huge thanks to our presenting sponsor, RutterSack. They give us the equipment and time to do this show week in, week out, and provide you the valuable content. RutterSack provides customer data infrastructure and is used by the world's most innovative companies to collect, transform, and deliver their event data wherever it's needed, all
Starting point is 00:00:49 in real time. You can learn more at ruddersack.com. Welcome back to the Data Stack Show. We've got another special episode here with Matt, the cynical data guy. Welcome back to the show, Matt. Ah, so I see everyone's canceled. Big trivia on the show. Matt. Ah, so I see everyone's canceled. Make sure you're on the show.
Starting point is 00:01:06 We have a fun topic today. We normally at least have a segment here reacting to posts, talking about current events. We're going to kind of bypass that today and we're gonna talk about one of my favorite topics, DuckDB and a recent announcement. And then we're gonna zoom out a little bit and talk about the ecosystem.
Starting point is 00:01:25 Basically, this is a chance for John to teach me what Duckly. Yes, live, I'm gonna do it live. All right, this will be fun. Matt, tell me about the... This is not fair. But Matt, what have you heard about the announcement and then we'll nerd out on that.
Starting point is 00:01:41 So what I ever heard, what I read was that the DuckDB is coming out of this thing called Duck Lake, which is supposed to be a lake house format, open table format, and it's going to be unlike Iceberg, you don't need a separate catalog. So I'm assuming it's going to stick it in SQL tables or something like that. Otherwise it's I'm still not exactly sure where DuckDB fits into everything with this, so you're gonna have to explain that to me. I haven't had a chance to work with it.
Starting point is 00:02:10 Now, Matt, you haven't had a chance to work with it. And we're recording this on 5.27, and this post is on 5.27, so it's been out for at least 30 minutes. Yes. And you're not an expert yet. What? I know, I'm disappointed.
Starting point is 00:02:29 All right. So nobody's actually an expert on this, right? At this point. Anyways, I think we'll zoom out a little bit. So what I think one of the interesting things here is you have active development out there on technologies like DuckDB that are kind of the SQL light of analytics databases. Lightweight, open source, all in the one package. And by the way, if you pay attention, it's crazy to see how many, just like SQLite did, how many apps or have duck DB embedded in them
Starting point is 00:02:57 like at this point. So that's the one component here. The other component here is Iceberg, which we have to talk about where it's like, okay, we've got this new, cool, standard format that essentially sits on top of Parquet for analytics workloads, and you can bring your own catalog, you can say, hey, Databricks, go look at the Iceberg table. Hey, Snowflake, go look at the Iceberg table, you know, or whatever, whatever tool, and, you know, BI tools are getting direct connections to Iceberg or sometimes you're going through,
Starting point is 00:03:28 I guess usually you have to go through some kind of compute layer. But point being Iceberg is this modern data stack evolution. We've talked about this before, but it seems to be Iceberg and then like question mark. People are like, I don't know what goes alongside this. I don't know if we just like copy and paste most of the modern data stack over here.
Starting point is 00:03:46 And then iceberg is something to do with it. Or there's some AI thing that like is a major component. Once again, we see the underpants known problem pop up. Iceberg, step two questions, step three, big profits. It's like, how are we going to use this? I don't know, but we think it has to do with iceberg. All right, so back to the Duck Lake announcement. And I'm actually going to start from the end,
Starting point is 00:04:11 which I think is always fun. So essentially, I think in a lot of people's minds, it's like, wait a second, I thought Iceberg was the next thing, like, what is Duck Lake? I want to read it so I don't mess it up here. But essentially, okay, you ready? Yep. So the question is like compatibility. All right. So the data and
Starting point is 00:04:28 the positional delete files that Duck Lake writes to storage are fully compatible with Apache Iceberg, allowing for metadata only migrations. So that's one component here. And the other component, which I think is interesting, the availability of the Duck Lake extension augments and does not replace
Starting point is 00:04:44 DuckDB's existing and continuing support for Iceberg and Delta and the associated catalogs. Duck Lake is well positioned to serve as a local cache or acceleration feature for these formats. Okay. I only have one question for you. Okay. What is Duck Lake?
Starting point is 00:05:01 Great question. What does that mean? Help me here. I know. I think what that means is, hey, we've got this new cool thing. We don't want to compete with you, Icebreak. We want to still support that. Competing with you.
Starting point is 00:05:17 We're totally competing with you. I don't know. I don't think that's actually true. Do you think the distinction here is like look like the well-known DuckDB like problem they need to solve long term is like what if I run out of memory? Yeah. Essentially because just like SQLite it's like alright you can only like cram so much RAM still even today you can cram a lot but so much RAM into a machine. So my take on this is they're trying to solve that problem.
Starting point is 00:05:45 And like specifically they're well positioned to serve as local cache acceleration feature. So they're gonna be here, I don't know, like a Redis, like think about Redis like conceptually on a stack. And then you're still gonna have your underlying formats of like Parquet and Iceberg. I think that's the thought literally released today. So I will not claim to.
Starting point is 00:06:06 Okay. Though fully what they're thinking. Let me step back a second here. Cause I'm still, like I said, I've not had a lot of time with Duck in DuckDB in general. I have not had the opportunity to spend much time with it. So DuckDB, where, what is the use case for that? Yeah.
Starting point is 00:06:24 Great question. I feel like I'm missing where this is going. Yeah, great question. So I think, let me come in from kind of an analytics angle and it's essentially like, hey, I've got this really powerful MacBook Pro with a ton of RAM in it and I'm like doing this analytics project and I am tired of making server calls
Starting point is 00:06:49 for every single thing. Like I can run Duck locally, I can have a full, like, mostly fully functional SQL, there's limitations. But I can run all these SQL commands and it can be local and it can be super fast. That's one. So it's basically utilizing your computer as like compute.
Starting point is 00:07:06 Yeah, right, right. So that's one, if you're running local. Two, there's companies like Mother, DocFitter, like taking this, making it a SaaS product and handling the like, just like most SaaS products, the complexity of like managing the compute, all the things for you. I think a third interesting one,
Starting point is 00:07:22 that actually at data council, there's a neat presentation about this of like, hey, okay, what if we can, what if we're, say we're actually writing a query and we can like auto sample the data for you. And while you're writing the query, you don't actually need hundreds of thousands of results every time.
Starting point is 00:07:41 Maybe we can speed up your workflow that way. So I think that's an interesting use case. And then a third, that's very interesting, is all these companies making BI tools or tools that have a component of BI. Like, hey, let's just like, we have to get this like really neat fast store we can bake into it.
Starting point is 00:07:58 And I don't know all the like technical details of it, but there's some neat stuff you can do with the browser and having DuckDB like fuel your like end browser experience with like a BI tool, essentially. Okay, so it feels a little bit like we're talking about, it's this way of, it's like another form of compute almost that makes for, and in one hand, it's like kind of local development-ish type idea.
Starting point is 00:08:22 I'm not having to hit like Snowflake every time and work with them as long as I can get to where the data is. And then possibly some type of cache layer or some type of web app or something like that. Yeah, I think it's very simplistic. And we've had people from TuckTV and I think Mother Duck both on the show. We really need to have them back on the show.
Starting point is 00:08:45 But those are the two practical applications that I'm seeing is that the cache layer problem and the local dev problem or like a CI to CD pipeline problem, there's some neat workflows where like, hey, I need to run this pipeline. Like I can actually like use Duck to like test the pipeline instead of having to like hit my production snowflake saves me some money. Yeah. And you know, it's fast as well.
Starting point is 00:09:11 I'm seeing this as partially a stem away from the I have to pay for every time I want to do anything with my data. Right. Exactly. But yeah, back to the duck, I think it'll be interesting to see, one, are there going to be more people that come out with this similar type, because everybody's like iceberg, iceberg. Is there going to be more in this space where it's like, all right, we're a local cash or accelerator on top of iceberg or we're like other alternative local things like you know how much of it gets built around iceberg as the given of like hey icebergs gonna be here we're gonna build around it right and how much of it gets built
Starting point is 00:09:54 kind of alongside with a little bit of a hedge and how much of it's like hey we're gonna be direct competition I've seen zero direct competition essentially so far well I could see a spot where because one of the things with iceberg having had to dive deep into this for some stuff professionally essentially so far. Well I could see a spot where because one of the things with Iceberg having to dive deep into this for some stuff professionally is most of the biggest gains come on very large scales because that's what it was designed for. It was designed for like terabytes of data. So I could see there being something where if you can come in there with a more local version or something like that, where for smaller, smaller sizes, you can get better efficiencies out of that because that is something that, you know, you think
Starting point is 00:10:33 iceberg, it's going to be great. And then you get into it and you're like, Oh, look, the overhead involved with it causes it to actually be three times slower than if I just used Snowflake. Right. Like that, like if you don't, if you don't know, you're not optimizing it correctly. And if you're not, your data isn't at the right point, you're not configuring it right, it will be slower. I mean, it will be slower on smaller data sets a lot anyways.
Starting point is 00:10:57 So something that could help with that problem too, I could see as a compliment rather than a. Right. So I do wonder what that looks like as a, well, we've got our catalog in this in SQL and you've got it in file and I can migrate it, but now am I having to keep those two synced up and I can see some issues with it. Well, and the thing that's most interesting to me in this whole world is the standards
Starting point is 00:11:24 adoption is what I would call it. So, okay. We've pretty much agreed on for all this daylight stuff. We've pretty much agreed on parquet. And I don't- Better for worse. Yeah, for better for worse. Sure.
Starting point is 00:11:36 Pretty much agreed on parquet. And then on top of that, I said like, that's the core. That's your like, I mean, CSV parquet. People are like, all right, well Parquet is better than CSV for this. There's other options. It's not the only option, but for whatever reason, it seems like it's got the vast majority of adoption. Okay, so step up from that.
Starting point is 00:11:53 Like, all right, what are we gonna do next? How are we gonna like store metadata around it? How are we gonna do like the table concept, database table concept, iceberg, at least from open standards, like, all right. I think people are like, we're doing iceberg. Yeah. Okay, so we got two open standards, all right, I think people are like we're doing iceberg Yeah, okay, so we got two open standards per K iceberg great Then from here is a mess of like people
Starting point is 00:12:12 With catalogs. It's like everybody has their own like additional catalog and like people are scrambling in the catalog space Yeah, okay cool. And hey, that's where the permissions are set like that. There's money there I get it and access control and all this this stuff that you have to have as a company and you have to pay money for, or you pay money for. The other given for all of this, which is easy to skip over, is the storage layer, is we've also kind of agreed on like S3, S3 equivalent. Yes. It's some type of blob type of storage.
Starting point is 00:12:41 But we already agreed on in the modern data stack, we already decided like Snowflake, it's all S3 backed or Azure Blob or whatever, but it's like that Blob object storage, like whatever brand or variety you want, object storage. So we've agreed on a lot. We've agreed on like this storage layer, the underlying file format, the, call it table format and with metadata with let's say iceberg. the underlying file format, the, call it table format and with metadata with let's say iceberg.
Starting point is 00:13:07 And then we've not agreed at all on like catalogs or we just, we've kind of agreed there's gonna be a billion different catalogs. What we need is the Kubernetes or catalogs. I mean, yeah. But then this is what I'm interested though in the Lake House thing is like, okay, what else? We know there's no agreement on catalogs.
Starting point is 00:13:23 Well, there's high agreement here but then there's that middle space. And these guys like with the Duck Lake seem to be in that middle space is like, okay, what else? We know there's no agreement on catalogs. Well, there's high agreement here, but then there's that middle space. And these guys like with the duck lake seem to be in that middle space of like, hey, we're like compatible, but we're like kind of off to the side here. And like, we can do some like caching and stuff. We're totally not inbred at all. We just want to be a little fish that like goes by the sharp and you know, just picks up on our leftovers there. Yeah. I mean, I don't know. For positioning, I don't know. I think it's a real space. And I think it's like there's a use case, for sure
Starting point is 00:13:51 a use case here. It's a neat use case with thinking about having that kind of like caching layer. And again, like the, oh, well, I'm sorry, we missed one of the most important layers, the compute layer. Yeah. That is also in that like catalog and compute are in the like fight over, you know,
Starting point is 00:14:10 a bunch of different solutions are gonna fight over that forever. But I think for a certain extent, like the compute, by design, it's supposed to be more diverse because it can get certain use cases and stuff like that. The catalog is the one that's still kind of this weird and more for this. Yeah. That's the one that I still kind of this weird morph. Yeah.
Starting point is 00:14:25 That's the one that I feel like if you're going to consolidate on any of these, it's going to be kind of that catalog space. Right. Or it's going to be, like I said, like a Kubernetes state thing where it's like, hey, yeah, there's 12 of these and we're just going to abstract over it. You don't have to think about it. Yeah. And I honestly think that is more likely to happen. There's the abstraction layer,
Starting point is 00:14:45 because just with the governance and security and all of this stuff that's going to be built into, you know, these various things, I think that is most likely to happen. And then I would then just allow you to migrate your catalog if you need to, if you need to be multi-cloud, it's not going to matter. It has all the same advantages.
Starting point is 00:15:01 It's kind of that like metadata orchestration type idea. Yep. Yeah, and we're like metadata orchestration type idea. Yeah, and we're still super in all of this. I think that's the thing it's easy to forget about a lot of this stuff with open table formats and everything is like, we're very early in all this. There's still a lot of runway to smooth out these problems. It took a while for a lot of cloud stuff
Starting point is 00:15:22 to kind of smooth out a lot of rough edges. Right. Well, and the other thing I do like about this, though, is the simplicity piece here, because it is still painful to work through, like, all right, what catalog am I going to use? How do I set up the catalog? Especially if it's not, like, quite compatible with your, like, standard stack. Like, that's still painful. So I definitely see the use case here of, like, okay, I got everything in Parquet or Parquet Iceberg. Like, all right, I just have this thing and I don't have to think about the catalog
Starting point is 00:15:50 separate from the compute, separate from, you know. So that I see the value. I wonder, I could see others following suit, having this same concept. We're gonna take a quick break from the episode to talk about our sponsor, Rudder Stack. Now I could say a bunch of nice things as if I found a fancy new tool, but John has been implementing RutterStack for over half a decade.
Starting point is 00:16:12 John, you work with customer event data every day, and you know how hard it can be to make sure that data is clean and then to stream it everywhere it needs to go. Yeah, Eric. As you know, customer data can get messy. And if you've ever seen a tag manager, you know how messy it can get. where it needs to go. you have implemented the longest-running over all the years and with so many RutterStack customers including your data infrastructure tools, head over to ruddersac.com to learn more. Now the most important DUP-related question. Yeah.
Starting point is 00:17:28 Have you seen... It was on LinkedIn the other day. I didn't flag it for you, but it's a video of this guy doing a demo. It's with DUPDB, I think, or something. And it's like you talk, and then it quacks and writes SQL. Have you seen this?
Starting point is 00:17:41 No, okay. I don't even really understand what was going on with it. It was that a guy was showing it to Zach with Vasilim, the data engineer. Oh, okay. Yeah, yeah. And I just saw the clip of it and the guy's talking and then the thing goes quacking, right? SQL as it's quacking.
Starting point is 00:17:59 It's just like a CLI tool. I'm imagining like Calce, if you know that from like... Well, no, there was a UI there because it had basically, you know, where you could see the voice kind of going... Okay, yeah. And then off to the side, it was writing the sequel there. I don't know what it was writing for sequel or how it determined it. All I know is it was quacking and writing at the same time. This is all just duck moving going on right here.
Starting point is 00:18:19 Wow. All right, that sounds like something we're gonna have to link at the show notes if we can find it again In the show notes we can't put anything in the show notes. Yeah. Yeah, does he? Yeah I must have to go back and look at this. It feels like something you just saved like Don't ask me a question. There'll be any shit. No, you're always welcome to ask questions Cool. Well, I think this wraps our little segment Well, we will have to get some experts on here to actually do a deep dive But I wanted to just call this out because we saw it today and it looks pretty neat.
Starting point is 00:18:47 You'll have to have me on so I can sit there and just go, yeah, but what is that, baby? What is the data like? I'd be like, forky ask a question right there. Yeah, perfect. All right, that is it for our segment here. Matt, thanks for being here. Stay cynical.
Starting point is 00:19:03 All right, see for being here.
