Disseminate: The Computer Science Research Podcast - Haralampos Gavriilidis | SheetReader: Efficient spreadsheet parsing

Episode Date: April 17, 2025

In this episode of the DuckDB in Research series, Harry Gavriilidis (PhD student at TU Berlin) joins us to discuss SheetReader — a high-performance spreadsheet parser that dramatically outpaces traditional tools in both speed and memory efficiency. By taking advantage of the standardized structure of spreadsheet files and bypassing generic XML parsers, SheetReader delivers fast and lightweight parsing, even on large files. Now available as a DuckDB extension, it enables users to query spreadsheets directly with SQL and integrate them seamlessly into broader analytical workflows. Harry shares insights into the development process, performance benchmarks, and the surprisingly complex world of spreadsheet parsing. He also discusses community feedback, feature requests (like detecting multiple tables or parsing colored rows), and future plans — including tighter integration with DuckDB and support for Arrow. The conversation wraps up with a look at Harry’s broader research on composable database systems and data interoperability, highlighting how tools like DuckDB are reshaping modern data analysis.

Transcript
Starting point is 00:00:00 Disseminate the Computer Science Research Podcast, the DuckDB in Research series. Hello and welcome to another episode in the DuckDB in Research series. Today I'm going to be joined by Harry Gavriilidis, who is a PhD student at the Technical University of Berlin. And I should say, Harry's been on the show before, so there's a quick plug for Harry's previous episode: you should go and check it out, where we talked about in situ cross-database query processing. It's episode 27. So you should go and find that as well. If you like what we're talking about today, I'm sure you're going to like that as well. But yeah, welcome back to the show, Harry. Yeah. Thanks for having me, Jack. It's always fun talking with you. Cool. So today we're going to be talking about
Starting point is 00:00:40 SheetReader, efficient specialized spreadsheet parsing, which was some work that you published in the Journal of Information Systems. And then you have taken this and written an extension for DuckDB as well, bringing the power of SheetReader to DuckDB. Cool. So let's start off and yeah, give us the TLDR on what SheetReader is. Yeah. SheetReader is actually what the name suggests. So we actually parse spreadsheets very fast.
Starting point is 00:01:11 That's what we do. Nice. Cool. I mean, that sounds, that's exactly what it says on the tin, right? Cool. So yeah, I guess kind of following on from that then, the kind of obvious question is why is reading spreadsheets efficiently a challenge? And yeah, how does SheetReader come in and address these challenges
Starting point is 00:01:29 and improve over the existing state of the art? Right. Yeah. So a very good question to start off. So let me start by telling you what spreadsheets actually are. Spreadsheets essentially are zipped XML files, nothing more. And the reason existing spreadsheet parsers are so slow and consume a lot of memory is because they rely on generic XML parsers, right? And so the idea is we decompress the documents and then we use these generic parsers to parse the XML content. Now, why this can be slow is because we have two main XML parsing approaches. There is the DOM approach, where we actually parse the whole XML and generate the equivalent tree in memory. And then from there, we can extract the elements that we want.
Starting point is 00:02:27 Of course, you can imagine this is a time-consuming process and also consumes a lot of memory, because we have to first decompress, then generate the tree, and then go into the tree and look for the elements we want. Then there is the SAX parsing approach, which is kind of event-driven: the parser runs through the document and generates events for specific tag types, and depending on that, we can get the elements that we want. And we analyzed the performance behavior, and for the DOM approach, we could see that memory consumption is actually three times the original spreadsheet size, which is quite a lot. And this is not what we want when we do spreadsheet parsing on commodity machines. So that's why existing spreadsheet parsers are so slow: essentially, they rely on generic XML parsers, which do not really leverage the spreadsheet structure. And we can come back to that later.
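To make the contrast concrete, here is a minimal sketch using Python's standard library parsers (not the specific libraries discussed in the episode): the DOM route materializes the whole tree before you can touch a cell, while the SAX route streams events and keeps only the values you ask for. The tiny worksheet snippet is invented for illustration.

```python
# DOM vs SAX on a toy worksheet fragment, using only the standard library.
import xml.dom.minidom
import xml.sax

sheet_xml = b"""<worksheet><sheetData>
  <row r="1"><c r="A1"><v>42</v></c><c r="B1"><v>7</v></c></row>
</sheetData></worksheet>"""

# DOM: build the entire tree in memory, then navigate it.
dom = xml.dom.minidom.parseString(sheet_xml)
dom_values = [v.firstChild.data for v in dom.getElementsByTagName("v")]

# SAX: stream events and react only to the tags we care about.
class CellHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.in_value = False
        self.values = []
    def startElement(self, name, attrs):
        self.in_value = (name == "v")
    def endElement(self, name):
        self.in_value = False
    def characters(self, content):
        if self.in_value:
            self.values.append(content)

handler = CellHandler()
xml.sax.parseString(sheet_xml, handler)
print(dom_values, handler.values)  # ['42', '7'] ['42', '7']
```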
Starting point is 00:03:26 Yeah, sure, we can do a little technical deep dive into SheetReader in a moment. I guess my next question is then: you have this standalone project, SheetReader.
Starting point is 00:03:51 What was the motivation for bringing that to DuckDB and writing the extension for it? Right. Yeah. So it's funny how the story with SheetReader actually started. So I remember Felix, who was the main collaborator on this project. He was a student back then and he came into our office and was like, oh, you know, I have this problem at work. We have these data scientists and they try to load their spreadsheets in R and you know, we have this little spreadsheet of one or two gigabytes and yeah, their laptops are freezing while
Starting point is 00:04:26 loading the spreadsheets. I was like, okay, this is bad. It's not big data, right? Why should this be happening? So then we actually started the project developing an R package, actually. This was our first and most famous package in the beginning. And then we decided to also go further, because it's not only R users who are parsing spreadsheets. We have also a wider audience. And then we continued and extracted our main module, which is built in C++. So we kind of, yeah, isolated that one so that we can have multiple bindings. And then we continued with Python. So we also have a Python binding, which can be used to load
Starting point is 00:05:17 your spreadsheets into data frames. We also have Postgres binding. I mean, it's a foreign data wrapper that lets you use... Yeah, essentially you can register a spreadsheet as a table. And we also have the equivalent for DuckDB. And why this is really cool is because now you can have SQL on spreadsheets, which is not supported out of the box,
Starting point is 00:05:40 by most spreadsheet tools like Excel, right? And on the other hand, what's also really cool is that you can then interleave your spreadsheet data with other tables that you might have, or other formats. And DuckDB nowadays is really used for any kind of analytics, also like EDA: if I want to quickly look into some files or quickly do some analysis, I might want to integrate different types of data.
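As a rough illustration of what "SQL on spreadsheets" looks like in practice, here is a sketch with DuckDB's Python API; the extension name, the sheetreader() table function, and the file and column names are assumptions for illustration rather than a definitive API reference.

```python
import duckdb

con = duckdb.connect()
# Assumed installation path: the SheetReader community extension.
con.install_extension("sheetreader", repository="community")
con.load_extension("sheetreader")

# Query a spreadsheet directly and combine it with another data source.
con.sql("""
    SELECT o.customer_id, sum(o.amount) AS total
    FROM sheetreader('orders.xlsx') AS o
    JOIN read_csv('customers.csv') AS c ON o.customer_id = c.customer_id
    GROUP BY o.customer_id
""").show()
```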
Starting point is 00:06:08 This is why we thought it's really cool to also have a DuckDB extension. I must say the DuckDB extension ecosystem is also a really nice effort from the DuckDB folks. For us, it was pretty straightforward to integrate SheetReader into DuckDB. Nice. Yeah. It's funny. You mentioned there how you originally started doing it as an R package, and from my own experience, I totally get that, the painful experience of trying to use spreadsheets in R. I think it was probably three or four years ago I tried to do this, and honestly, it was so painful. So yeah, I can definitely see why this is a game changer and the need for it was there, basically. Yeah, that's cool.
Starting point is 00:06:54 Yeah, it was amazing, I mean, not amazing, but for us to see that spreadsheet parsing performance is so bad. I mean, we compared it to CSV parsing, which of course is totally different, right? It's a different format, totally different, but still, the difference in performance is huge, right? So existing approaches were really not suitable. And this is, I think, a nice gap that we fill with SheetReader. Yeah, it's amazing the sort of bad performance people will live with, right? But yeah, you obviously said that you have these various different sorts of bindings and different ways you can use it. You can use it with Python, R, and obviously DuckDB.
Starting point is 00:07:37 Do you have any sort of gauge as to which, across the various different ways you can use SheetReader, is the most popular? Right, this is difficult to say. It's difficult to track package users, right? But last time I checked, our package had around 8,000 downloads. So, I mean, it's not like a huge number, but you see people are actually using it. And then for Python, I don't have numbers yet because we just published this one. Postgres, we cannot track it. And DuckDB, they also recently added like this
Starting point is 00:08:22 package statistics, or like extension statistics, on the website. And there I can just see weekly stats, and it's around 400 to 500 downloads per week, which is also not too bad. And to be honest, we also haven't done a great job advertising it. And I think the podcast now is a great chance. Yeah, hopefully so. If we can see a spike, we should do a change point analysis. After the podcast gets released, if we see a spike in downloads, right?
Starting point is 00:08:50 It should be pretty visible. We had this actually when we first published the extension on the DuckDB extension ecosystem. Someone posted it on Twitter and suddenly, yeah, we saw very many downloads. It's encouraging to see that people actually use it. Yeah. Yeah. I mean, it's one of those things where you do research, and it's so much more rewarding when you can see your research and your work having impact.
Starting point is 00:09:21 Right. So it incentivizes you to kind of keep going, that's for sure. Cool. So let's talk a little bit more in detail then about SheetReader and have a little bit of a deep dive into the actual technical components of it. And we can start off, if you can give us a breakdown of the key components in there. Right.
Starting point is 00:09:45 Yeah. So, I mean, from a technical perspective, we implement SheetReader in C++, and then we provide it as a shared library. So, you know, people can just use the shared library and create their own bindings. And then the main approach of SheetReader is to leverage the specialized spreadsheet XML structure. Spreadsheets are standardized. For example, for Excel there is this standard that defines how an Excel document should look. So essentially, we can leverage this structure to only parse what we need. So we don't need to rely on generic XML parsers.
Starting point is 00:10:36 We know exactly how spreadsheet documents look, so we can directly go to those places and get the elements that we need. And then we also apply some other optimization techniques, like parallelism or efficient conversion of the different values. So all in all, it's leveraging the spreadsheet structure.
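For anyone who hasn't looked inside an .xlsx, the fixed layout Harry is exploiting is easy to see with a couple of lines of Python; the file name is a placeholder, and the worksheet path shown is the typical location rather than a guarantee.

```python
import zipfile

# Any workbook will do; 'example.xlsx' is just a placeholder path.
with zipfile.ZipFile("example.xlsx") as zf:
    print(zf.namelist())   # typically xl/workbook.xml, xl/worksheets/sheet1.xml,
                           # xl/sharedStrings.xml, ...
    # Because the standard fixes where cell data lives, a specialized parser can
    # jump straight to the worksheet part instead of interpreting arbitrary XML.
    sheet_xml = zf.read("xl/worksheets/sheet1.xml")
    print(sheet_xml[:200])
```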
Starting point is 00:11:10 Nice. That's the secret sauce that makes this thing go a lot faster then. I mean, I guess the question that naturally follows from that is: how much faster? I guess you've benchmarked this tool against the various ones. What's the speed-up we're talking here? Right. So yeah, we actually have two metrics. We are interested in runtime performance and memory consumption. And so for example, in R, we are an order of magnitude faster than openxlsx and like three or four times faster than readxl.
Starting point is 00:11:35 These are two state-of-the-art spreadsheet parsers in R. But then the interesting thing is that the faster library in R consumes a lot more memory. So for example, when we look at memory usage for this file, which is, what is it, like 800 megabytes or something, readxl consumes 12 gigabytes of memory for parsing it. Yeah, that's a lot. Then openxlsx, which was the slower library, consumes around five gigabytes of memory. And SheetReader actually stays at one gigabyte of memory. So that's also a big difference in memory consumption, right? And while this
Starting point is 00:12:20 might not have a large impact on, I don't know, cloud environments or whatever, usually you will parse spreadsheets on your laptop, right? And laptops, I mean, nowadays you also get more RAM, but still, it's good when you can save some. And as you heard, these differences are really enormous. And then I also have some numbers here from DuckDB. So DuckDB also has spreadsheet support, I mean, it's in the spatial extension. This is also very funny, actually.
Starting point is 00:12:53 So both in Postgres and in DuckDB, they use the spatial extension. And I think underneath, they use GDAL for spreadsheet parsing. I don't know how, historically, spreadsheet parsing got into the spatial extensions. I guess spatial analysts somehow use spreadsheets. Here we have some numbers from the TPC-H benchmark. With the spatial extension, loading, let's say, the orders table at scale factor one in DuckDB takes around one minute, while with SheetReader, with one thread, it takes under 20 seconds.
Starting point is 00:13:34 And if we increase the thread count, we can even reach around six seconds. So you see, we can also reach an order of magnitude faster spreadsheet parsing in DuckDB itself. Yeah, I mean, if that's not a message to listeners to go and never use anything other than SheetReader again, then I don't know what is, right? It's pretty clear. Of course, we don't have so many features yet. So for example, we just read spreadsheets, right?
Starting point is 00:14:09 So the spatial extension can do many more things. But yeah, we have actually received many requests over GitHub since the extension got published. So yeah, we're also trying to incorporate many more features now. Yeah, we can probably talk about that in a second. Actually, I just kind of wanted to ask you about the performance. I mean, you said that the primary target use case here is people running on their local machine, right?
Starting point is 00:14:37 Their laptops, whatever, because that's typically where most people are using spreadsheets, right? I'd have thought, anyway. So I guess there is a sort of natural upper limit to how big the actual spreadsheet will be. But did you do any sort of performance measurements, measuring whether the performance was invariant to the size of the file, or what that sort of scaling performance was like for SheetReader? I mean, for openxlsx, the one that loads it all into memory and creates this DOM, that's going to quickly exceed, like, I'm going to blow up my laptop pretty quickly, right? But yeah, how does SheetReader perform at that sort of scale, I guess? Yeah, so spreadsheets have this limit, as you say,
Starting point is 00:15:18 and we also have some experiments where we actually scale the number of rows and columns in spreadsheets. SheetReader actually scales pretty well. We have almost linear scaling. SheetReader can handle spreadsheets of any size, let's say. And we scale linearly both in terms of runtime and memory usage. Awesome. Cool. That's great. Let's chat about that, then. I like to ask about impact and feedback from users.
Starting point is 00:15:58 And you said that since it kind of went public, you've had a lot of requests on GitHub. And there are, as you admitted there, certain features missing, so maybe it's not a fair fight to totally trash some of the other ways of doing this yet. But yeah, tell us about what these requests are, what the missing features are, and kind of where you see this going longer term. Yeah, so it was also really nice at DuckCon this year in Amsterdam, where I also had the chance of showcasing SheetReader. So you know, it's crazy.
Starting point is 00:16:33 So you say, okay, you know, I do this spreadsheet parsing library and I put it out there and, you know, people will just use it and read their spreadsheets, and what more is there to do? And then people, you know, they come with all sorts of crazy things, which are really interesting. So for example, one thing is loading online spreadsheets, right? Then another crazy, but very interesting thing is, of course, I mean, SheetReader deals with structured spreadsheets, right?
Starting point is 00:17:06 So we assume that you just have a table in there, which is clearly defined, and we parse it. But then, you know, there are users that have multiple tables in their spreadsheet, right? And I'm not talking about multiple sheets in your spreadsheet. So you can have multiple sheets in your spreadsheet, but you can have also like within one sheet, you can have multiple tables at different points.
Starting point is 00:17:32 Right. And then how do you deal with that? So right now we don't, but there has actually been some ongoing research on detecting those tables. I think there are some papers on that. So that's a really interesting direction. So for example, one could think of actually detecting those many different tables
Starting point is 00:17:55 and then outputting as many tables or data frames or whatever you want. Another thing could be letting users specify their table ranges, right? Because you have this location information within the spreadsheet. So you can say, these are the bounding boxes of my three tables and I want them parsed. This we could also do. Then another crazy thing was like, oh, you know, we have these users that color some of their lines and they only want the red lines parsed. Yeah, of course we don't do that right now, but this is also something interesting.
Starting point is 00:18:38 Actually, this is also how we can gain some performance, because we ignore all of the irrelevant things. So when we actually parse those chunks, we really just go and take the cell values, and we throw away everything else. But it would also be very interesting to see how we can include more information, like, you know, coloring or even formulas, right? Sometimes you have a simple formula that maybe, I don't know, multiplies the cell by some value or something. It would also be interesting to see if we can deal with formulas in spreadsheets. What else? Of course, different data types. Spreadsheets are pretty limited in data types. You just have numbers, text, dates, and Booleans. I think these are the main types. And then many, many times users want to map them to different types in their tables.
Starting point is 00:19:58 So yeah, for example, one feature request was like, okay, I want all my columns as strings. So yeah. No, that's cool. I mean, as you enumerated those different things, I was thinking, yeah, I have multiple tables in my sheets. My go-to normally is, rather than open another sheet, I'll probably just put another table in the same sheet. So I'm kind of part of the problem; do you want to try and do mine? I realize the way I do spreadsheets, I'm a bit more on the extreme end of the spectrum, like I do weird things in my spreadsheets. But also colouring as well. Yeah, like I love a bit of formatting in there.
Starting point is 00:20:25 So yeah, I guess having some sort of structured way to deal with formatting, and then parsing that and actually using it in some useful way, would be a really cool feature to add in as well. Where I wanted to go with that, because you mentioned this earlier on, is being able to write back to the spreadsheet. Is there any sort of work on that that you're thinking about doing? So yeah, not yet.
Starting point is 00:20:48 Maybe we would then have to rename our project to SheetReader and Writer. Yeah, which, I guess, ruins the name; or maybe you'd just need a separate SheetWriter or something else. Yeah, I hadn't thought of that. No, I mean, it's a very valid question. So, we haven't done any work on that. I think this is not too challenging, and DuckDB's spatial extension works pretty well for that.
Starting point is 00:21:13 So we actually used it for the benchmarks, right? We generated the TPC-H tables with DuckDB, which also has very nice built-in support for the TPC-H benchmark, and then we used the spatial extension to write those. And this is actually very interesting, because we wanted to have the lineitem table, which is like 6 million rows at scale factor 1, I think. And with Excel, we can't use conventional spreadsheet tools for that because it exceeds the maximum number of rows. But then with DuckDB, we could just do it. I think the challenge is more in reading rather than writing, because essentially you just have to write this document and then just zip it. I haven't looked into it, but I think that should be quite straightforward.
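For reference, here is roughly how that benchmark input can be produced with DuckDB's tpch and spatial extensions; the COPY options shown are how the GDAL-based writer is commonly invoked, but they may vary between DuckDB versions, so treat this as a sketch rather than the authors' exact benchmark script.

```python
import duckdb

con = duckdb.connect("tpch.duckdb")

# Generate TPC-H data with DuckDB's built-in tpch extension.
con.execute("INSTALL tpch; LOAD tpch;")
con.execute("CALL dbgen(sf = 1);")

# Write a table out as .xlsx via the spatial extension's GDAL writer.
con.execute("INSTALL spatial; LOAD spatial;")
con.execute("COPY orders TO 'orders.xlsx' WITH (FORMAT GDAL, DRIVER 'xlsx');")
```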
Starting point is 00:22:01 Nice. Yeah. My next question, Harry, is about, so obviously the way the SheetReader project was started, it was a standalone thing, and then you've taken it to various different ecosystems. Are there any opportunities you think, sort of further down the line, to be more tightly coupled with DuckDB, for example, and sort of, yeah, space for co-design that might lead to further performance improvements or feature availability or anything like that? Absolutely.
Starting point is 00:22:43 So, and now I can link back to the feature requests that you said before and I forgot about it. So there was this person that wanted to use the spreadsheet parser in Wasm. So in the browser. Okay. Yeah. And I think right now it's not really possible. So, but if we would have a deeper integration in DuckDB,
Starting point is 00:23:06 I think this would be possible. And another thing is that right now we have this intermediate structure. So we take the spreadsheet content and we put it into our intermediate structure, and from there on, the bindings can actually pull it and construct their internal data structures. So for example in DuckDB, in the extension we actually have a table function, and you know, we have to fill those batches, and how we do it is we read from our structure, fill the batches, and give them to DuckDB. And this could be avoided if we had a deeper integration, right?
Starting point is 00:23:48 So, you could directly read the spreadsheet into those DuckDB batches without having this intermediate structure. Yeah. Nice. It's interesting, though; just thinking from a maintainability point of view going forward, that would make it harder, because you're essentially forking the existing project, I guess, so you would probably have maintainability challenges in the future, right?
Starting point is 00:24:15 I guess. I mean, I don't know how easily that same thing would work for, say, Postgres or something like that. It would be very coupled towards DuckDB, I guess, which would be good, but then challenges as well, I guess. Right. But I think this is the trade-off between, you know, abstraction and performance. But what we also want to look into is actually providing Arrow batches, so that you don't need to convert anything, because nowadays most systems, or analytic systems at least, can work more efficiently with Arrow data.
Starting point is 00:24:49 So they have kind of native Arrow I/O support, and then by providing our data as Arrow, or even using Arrow ourselves as the intermediate, I think we could more easily be integrated into existing systems while keeping some of the abstractions. Yeah. Yeah. That sounds like a good sort of middle ground. Yeah. Nice.
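DuckDB can already scan Arrow objects in place from Python, which gives a feel for the kind of hand-off Harry is describing; this small example uses pyarrow directly rather than SheetReader's (still planned) Arrow output.

```python
import duckdb
import pyarrow as pa

# An Arrow table standing in for data a parser might hand over without copying.
arrow_table = pa.table({"sku": ["a", "b", "b"], "qty": [1, 2, 3]})

# DuckDB's replacement scan picks up the Python variable by name.
duckdb.sql("SELECT sku, sum(qty) AS total FROM arrow_table GROUP BY sku").show()
```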
Starting point is 00:25:13 Yeah. So my next sort of set of questions, Harry, is going to move the focus a little bit more onto the DuckDB experience. And my first question then would be, what has been your overall experience of working with DuckDB and integrating SheetReader into DuckDB? Talk to me about that whole experience. Right. When we saw this DuckDB extension ecosystem, we were quite excited to see how difficult it would be to integrate it. And then we collaborated with some students at TU Berlin, in particular Jonas Ben.
Starting point is 00:25:56 He was the main driver of this integration effort. And I hope this reflects his feelings, but I think it was pretty straightforward. There was this nice DuckDB extension template which we could reuse, and then I think it was quite straightforward to integrate our extension, or SheetReader, in there. And because we are also in C++, and DuckDB extensions are currently also offered in C++, like the template is also offered in C++, the integration was quite smooth. I think there were a couple of issues, but the DuckDB folks were also supportive with regards to that.
Starting point is 00:26:52 Then how to put it into this repository so that it's available for everyone. I would say the experience was quite smooth. I'm really glad it worked so smoothly. Yeah, I think for Postgres it was a bit more difficult, let's say, even though Postgres has been around for a much longer time, and foreign data wrappers, or their extension ecosystem, have been around for quite a lot longer there. It was in C, and you know, you had to deal with some of the internals of Postgres, which are not so straightforward.
Starting point is 00:27:29 And, you know, as they say, the Postgres documentation is the code. And yeah, for DuckDB, this was not the case. So I think you didn't need to mess with too many DuckDB internals. You just, you know, register your function, you provide your data in the batches, and that's it. Nice. Yeah. It sounds like a really positive experience then, Harry. Yeah.
Starting point is 00:27:55 I kind of, my next question, because obviously you're the expert on spreadsheets now, Harry, you're my go-to guy. If you've ever got any spreadsheet problems, I'm going to come to you. So I kind of want to get your opinion on where do you think, like what the future's like for spreadsheets in the sort of the wider data analytics sort of landscape kind of going forward, but then also as well where you think DuckDB fits into that landscape
Starting point is 00:28:17 over the next few years, and sort of, if you put your goggles on and think about the future, what do you see the world looking like for spreadsheets and for DuckDB over the next three to five years? Yeah. Yeah. Good question. So with regard to spreadsheets, I mean, I don't want to
Starting point is 00:28:38 claim that we have reached the fastest spreadsheet parsing possible, but I think we're not too far away. So with regards to performance, I'm not sure there are too many more things that can be done. But one thing I didn't mention before is that we cannot parallelize the decompression step, because of the specific compression algorithm that the standard specifies, right? So this is a standard, you cannot change it, you need to follow the standard. One change in the standard that would make spreadsheet interoperability more performant is to change the compression algorithm of the standard so that we can parallelize even the decompression step. Because right now, SheetReader has two modes: we either decompress the whole document into memory,
Starting point is 00:29:45 and then we parallelize the parsing. So this is what we call the consecutive mode. And then we also have the interleaved parsing, which is designed for memory-constrained environments where I can define a budget. Then I decompress only as much as the memory I have, and then I parallelize the parsing.
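A conceptual sketch of the two modes, not SheetReader's actual implementation: consecutive decompresses the whole worksheet first and then parses chunks in parallel, while interleaved alternates between decompressing a bounded window and parsing it. The cell-counting stand-in ignores chunk-boundary handling for brevity.

```python
import zipfile
from concurrent.futures import ThreadPoolExecutor

SHEET = "xl/worksheets/sheet1.xml"   # typical worksheet location in an .xlsx

def parse_chunk(chunk: bytes) -> int:
    # Stand-in for real cell extraction: count closing cell tags.
    return chunk.count(b"</c>")

def split(data: bytes, parts: int) -> list[bytes]:
    step = max(1, len(data) // parts)
    return [data[i:i + step] for i in range(0, len(data), step)]

def consecutive_mode(path: str, threads: int = 4) -> int:
    """Decompress the whole sheet into memory, then parse chunks in parallel."""
    with zipfile.ZipFile(path) as zf:
        data = zf.read(SHEET)                     # full decompression up front
    with ThreadPoolExecutor(threads) as pool:
        return sum(pool.map(parse_chunk, split(data, threads)))

def interleaved_mode(path: str, budget: int = 64 << 20, threads: int = 4) -> int:
    """Decompress only up to a memory budget at a time, parse, then continue."""
    cells = 0
    with zipfile.ZipFile(path) as zf, zf.open(SHEET) as stream, \
         ThreadPoolExecutor(threads) as pool:
        while window := stream.read(budget):      # bounded decompressed window
            cells += sum(pool.map(parse_chunk, split(window, threads)))
    return cells
```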
Starting point is 00:30:10 So in the end, if I could also parallelize the decompression step, this would allow me to be even faster. So maybe this is something that the standard designers could think about. I'm not involved, but this was just, you know, an experiment that we did: we decompressed all the content and then we compressed it with a different compression algorithm, and we saw even better performance. So this is one thing. And actually, what's funny is that, you know, we talk about big data all the time, you know, scalable computing. Of course there are these use cases
Starting point is 00:30:45 there, right? So big tech has a lot of data. But then when you talk to, if I may say, normal people, they have all kinds of data spread around, and many of them are spreadsheets. Many people in finance, for example, or in healthcare, use these spreadsheets as their standard tool. Then, of course, one dataset alone doesn't really help you when it comes to analyzing, or like now with machine learning, where you want to combine all your different data and train AI models or whatever. And I think there is a lot of information in spreadsheets that can also be leveraged. And spreadsheets, I don't think they're going away, like SQL systems and relational databases are not going away.
Starting point is 00:31:37 I think they will be there. And yeah, DuckDB is a great tool. So they have very many integrations. And I really love this about DuckDB. So you can read Parquet files, CSV files, you know, you can read from S3, you can read from local storage, you can read from FTP, you can read from wherever. And you can, you can use it for different purposes for ETL, for pre-processing, for just, you know, analyzing quickly some data on your laptop, but also in the
Starting point is 00:32:06 cloud. And I think our idea with providing this spreadsheet parser to the DuckDB community was that, yeah, it's now easier to integrate your spreadsheet with the rest of your data in an efficient manner. Yeah. Yeah. Awesome. Going back to your point there about spreadsheets, it's amazing how many sort of multimillion, even probably billion dollar companies run literally on spreadsheets.
Starting point is 00:32:36 Like I have friends who are working at, I don't know, some company X, and you see what their day-to-day is, they'll be in finance and just sending spreadsheets to each other constantly. That's how the whole thing runs. And it's like, this seems so flimsy. But yeah, anyway, no, they're everywhere. They're ubiquitous. They aren't going anywhere.
Starting point is 00:32:53 Um, so yeah, at least now we can read them a lot faster. Yeah. Awesome. So yeah, I guess this is where I always get to: where can people go and try out SheetReader? I guess they can just go on the DuckDB extension page, right, and find it there, or does it have an actual home page, like a website that links to all the different bindings it has?
Starting point is 00:33:17 Yeah. Well, we're on GitHub, but you can find us, of course, as you mentioned on the DuckDB extension page. But I also have a website of like my general project where SheetReader is also linked. It's polydbms.org. And there you can find the SheetReader core repository and then all our bindings for R, for Python, for Postgres, and of course for DuckDB. Awesome stuff. Kind of on that there, are there any other projects that you're working on at the moment that we should be keeping an eye out on, maybe DuckDB related or not?
Starting point is 00:33:52 Yeah, actually, so the SheetReader project is quite old, and in the meantime, we have been working on several different things. One of them, which I think is pretty interesting and where DuckDB is also involved, is the concept of composable database management systems, right? Have you heard of the idea? Yeah, go for it. The idea is that nowadays you have these monolithic systems, right? So you have Postgres or Spark or whatever for data processing. Let's start from monolithic database systems like Postgres. So you have your query interface, you have your optimizer, you have your processing engine, and you have your storage
Starting point is 00:34:41 layer all in one system. And this started kind of being decomposed in the MapReduce era, where you said, I will decouple my storage from my compute. And then MapReduce happened, Spark happened, and this is the principle that Snowflake and modern data warehouses are also building on. And now people try to go one step further and say, okay, I also want to decouple my optimizer from my execution engine. And then we have projects like, for example, Substrait, which is kind of this intermediate representation standard, I may call it like this, which allows you...
Starting point is 00:35:20 So the idea is that I can use different optimizers, or like, I can compose a system and choose between different optimizers and different execution engines. A project that we're working on lately, we're actually showing it at EDBT this year as a demo. There we built a simple prototype which allows you to compose those systems on the fly. So you can say, you know, I want to use DuckDB's optimizer and the DataFusion engine. And then the main idea there is to check, you know, the different combinations for different queries and see if I can achieve better performance with those different combinations. And how this works, for example, is I give my SQL query to DuckDB
Starting point is 00:36:07 and DuckDB has a Substrait integration and it will produce a Substrait plan for me. And then if I have other execution engines that can consume Substrait plans, yeah, I can feed it into there and execute my query. So, we have a set of different optimizers and a set of different execution engines, among them also DuckDB, which has an experimental Substrait integration, but it's enough for us to play around with. That's pretty interesting.
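To give a flavour of that workflow, here is a hedged sketch using DuckDB's Substrait extension from Python; depending on your DuckDB version the extension may need to come from the community repository, and the exact call signatures should be checked against the current docs.

```python
import duckdb

con = duckdb.connect()
con.install_extension("substrait", repository="community")  # assumed location
con.load_extension("substrait")

con.execute("CREATE TABLE items AS SELECT range AS quantity FROM range(100)")

# Ask DuckDB (acting as the optimizer/front end) for a Substrait plan...
plan = con.execute(
    "CALL get_substrait('SELECT count(*) FROM items WHERE quantity > 42')"
).fetchone()[0]

# ...which another Substrait-capable engine could consume; DuckDB itself can too.
print(con.execute("CALL from_substrait(?)", [plan]).fetchall())
```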
Starting point is 00:36:41 Another project that I've been working on, and will present at SIGMOD this year, is called XDBC. So you see the connection: the previous project was XDB, now it's XDBC, for cross-database connectivity. And it's also about interoperability. So it's about transferring data between different data systems, and there we haven't looked into DuckDB yet, but it would be great as a source or target. So the main idea is, how do I, let's say, transfer data between different systems? For example, if I have a Postgres database and a Pandas target, and I want to load my data from the database into Pandas, I would use maybe something like SQLAlchemy or turbodbc,
Starting point is 00:37:33 or for example, the DuckDB reader. They also have a specialized reader for Postgres. But then this is like a pretty tightly coupled pipeline. So I will connect to a server and fetch my batches. But what we also observe is that we have different environments. So your server might be located in the cloud or somewhere on premises, and then your client might also be located in the cloud or on a laptop, and the network might be different. And such a tightly coupled approach doesn't give you many optimization opportunities. So what we do is we also decompose this pipeline so that we can scale the different steps of the transfer pipeline independently. So I can use different levels of parallelism for each step,
Starting point is 00:38:20 for example reading, deserializing, compressing, or sending over the network, or what type of compression I will use. So yeah, and we see good performance gains from this approach. If you're interested, you can probably soon see the paper in the SIGMOD proceedings. That's a good sequel to the last time you came on the podcast, right? Because we chatted about XDB then. So it'd be good to see how XDB is going. So yeah, on the Substrait project you were talking about, I find it fascinating
Starting point is 00:38:56 to sort of see how you can decompose these monolithic systems into sort of functionality, into the little components, and then you can try the different combinations of them. It's a very appealing idea. I guess you kind of need, on some level, a coordinator outside of that, that's managing this and measuring it and saying, okay, well, actually, for this given query, using this optimizer and this storage engine is better, and maybe that changes over time, right? So you might want to swap these things out and see how that compares to a sort of
Starting point is 00:39:26 monolithic approach, right, where it's tightly coupled, and see how the performance differs, if at all. So yeah, those are two really interesting projects, I'm sure. Yeah, I look forward to learning more about them. Yeah, yeah. Cool. Cool. Yeah, I guess that brings us to the end then, Harry.
Starting point is 00:39:42 So I just have one last question, and that is the usual: what's the one thing you want the listener to get from this podcast today? Yeah. So spreadsheets are not going anywhere. And existing spreadsheet parsers are inherently slow because they rely on generic XML parsing techniques. With SheetReader, you can parse your spreadsheets really fast and without consuming a lot of memory, because of our specialized spreadsheet parsing routines. You can also do more stuff, like, for example, combining your spreadsheets with tables in your databases, or writing SQL directly on spreadsheets through our extensions,
Starting point is 00:40:29 like, for example, the DuckDB extension. Fantastic. Yeah, you've been told, listeners: don't waste time reading your spreadsheets slowly anymore. Use SheetReader, you can do it miles faster. And I love that you're able to query them with SQL as well, that's a fantastic feature. So yeah, it's been a lovely chat, Harry. Thank you very much for coming on the show. As I said in the beginning, it's always a pleasure with you, Jack.
Starting point is 00:40:51 Thanks for having me.
