Disseminate: The Computer Science Research Podcast - Mateusz Gienieczko | AnyBlox: A Framework for Self-Decoding Datasets | #69

Episode Date: March 5, 2026

In this episode of Disseminate: The Computer Science Research Podcast, host Dr. Jack Waudby is joined by Mateusz Gienieczko, PhD researcher at TU Munich and co-author of the VLDB Best Paper Award winning paper AnyBlox. They dive deep into a fundamental problem in modern data systems: why cutting-edge data encodings and file formats rarely make it from research into real-world systems, and how AnyBlox proposes a radical solution. Mateusz explains the core idea of self-decoding data, where datasets ship with their own portable, sandboxed decoders, allowing any database system to read any encoding safely and efficiently. Built on WebAssembly, AnyBlox bridges the long-standing gap between database research and practice without sacrificing performance, portability, or security. This episode is essential listening for database researchers, data engineers, system builders, and industry practitioners interested in the future of data formats, analytics performance, and making research matter in practice.

Links:
Paper: https://www.vldb.org/pvldb/vol18/p4017-gienieczko.pdf
GitHub: https://github.com/AnyBlox
Mat's Homepage: https://v0ldek.com/

Hosted on Acast. See acast.com/privacy for more information.

Transcript
Starting point is 00:00:00 Disseminate, the Computer Science Research Podcast. Hey everyone, welcome to Disseminate, the Computer Science Research Podcast. I'm your host, Jack Waudby. This is the first episode of '26, so a very belated happy new year to all of you computer science research enthusiasts out there. In today's show, we are going to be talking about a paper that was published at VLDB last year that actually won the best research paper award, and that is AnyBlox: A Framework for Self-Decoding Datasets. And I'm really pleased to say I've got one of the authors of that paper with me today,
Starting point is 00:00:35 Mateusz Gienieczko. So welcome, Matt. Hello. Cool. So we always start off, as is customary on the podcast, with you telling the listeners a little bit about your journey and how you kind of arrived at where you are today. And I did do a little bit of research, and I know that you wrote your first line of code in 2013, and to quote from your homepage, this made a lot of people very angry
Starting point is 00:00:58 and it's been widely regarded as a bad move. So, yeah, maybe take us through your story, Matt. Yeah, so, yeah, I started coding in high school, that was around that time. I did my bachelor's and master's in Warsaw in computer science. My master's thesis was rsonpath, which is a blazing fast JSONPath engine. It was published at ASPLOS '23. The title of the paper is Supporting Descendants in SIMD-Accelerated JSONPath. It was a fun project at an intersection of, like, some theory of automata,
Starting point is 00:01:40 which always fascinated me, like computation theory, and, you know, low-level tinkering. Because with SIMD, you have to use low-level code, which I didn't do much of before. I had industry experience as well, mostly doing .NET, like, backend development, cloud stuff, talking to databases, but not writing databases.
Starting point is 00:02:02 I went to work at Microsoft for a year in Dublin, in identity. It was not fun. And I started looking for a PhD because I thought that I should try an academic
Starting point is 00:02:22 career before the industry sucks out the rest of my soul. And I had the opportunity to work at TUM under, well, it's complicated actually. Like,
Starting point is 00:02:38 our labs are pretty intertwined between Professor Neumann, Professor Kemper and Professor Leis, but I'm supervised by Professor Giceva at the chair for database systems and data processing, so that's what I do. Yeah, AnyBlox was my first paper.
Starting point is 00:02:56 And now, Jesus, it's been two years already of the PhD, this is my third year. Time flies, man. Yeah, time goes quick. You'll soon be graduating. Yeah, but I guess the plan is long-term then to stay in academia, right?
Starting point is 00:03:10 And that's the career, like, to pursue. Industry wasn't for you. That's how I feel right now. Of course, that might change, the landscape and, like, the job market evolve so quickly nowadays that it's really hard to make long-term plans, but I feel much better doing what I'm doing now
Starting point is 00:03:29 than I did at any company that I worked at, so this is much more exciting. Fantastic, yeah, well I'd say your research career definitely has started strong, because this is a very good paper, and obviously winning the VLDB best paper award straight away. It's a good start.
Starting point is 00:03:45 So yeah, things are starting well, so hopefully that will continue for you. So let's speak about that then. So yeah. I joked at the conference that, well, now it's all downhill from here, right? It can only go down. There'll be a test of time award, surely, right?
Starting point is 00:03:58 In 10 years' time as well, Matt. So, yeah, we've got that to look forward to at least, yeah. Hopefully. But yeah. Cool. Yeah, so let's talk about AnyBlox then. So first of all, actually, let's give the listeners some context on the kind of domain that we're talking about here a little bit.
Starting point is 00:04:13 So this is all about, well, there's a lot of encoding innovations that happen in the research world that very rarely find their way into industry, into being adopted in practical settings. So why is this? And yeah, just really set the context for us for what we're going to be talking about today and where AnyBlox fits into this whole landscape
Starting point is 00:04:37 of computing and, yeah, encoders and decoders and things. Yeah, so as you said, if you look at research at SIGMOD, for example, there's a lot of encodings and a lot of new data formats. In practice, no one uses them. In practice, just everyone uses Parquet, basically. It's like, you know, everything else is like an outlier percentage-wise. Yeah, Parquet is popular, but it's very hard to extend with those new encodings. There is this, let's say, semi-famous blog post from DuckDB,
Starting point is 00:05:19 that they cannot enable those better encodings for Parquet by default when DuckDB writes data, because there are still some readers out there that don't support them, and you want to be able to support everything, because if there is one thing that you hate as a user, it is dealing with compatibility problems. Anyone who had to manage any kind of dependencies
Starting point is 00:05:44 of packages knows that very well. So it's hard to add new stuff, and then if you create a new data format, it's kind of hard to break through, because if you create a new data format and it's supported nowhere, basically, unless you have your own database, you have to convince some big system like DuckDB
Starting point is 00:06:07 or, I don't know, Spark, which would be like the main platform for data science, to support your format. They're not going to do that because, why would they, no one uses it. But of course, no one is going to use it if it's not supported anywhere. So you're in this kind of a loop. Classic catch-22, right?
Starting point is 00:06:27 Exactly. So, yeah, so Parquet is there, but a lot of people will tell you that it sucks and needs to be updated. People create new file formats, but they're not really used. And there is like a systemic issue here that we kind of identify as an instance of this N times M problem, which I know from compilers,
Starting point is 00:06:57 right? It's like this idea that you have N different dialects of a language and then M different architectures or platforms that you want to target, and if you do this naively, you have to do N times M work, because for every language you need a separate
Starting point is 00:07:13 compiler for every platform, and that sucks and is unsustainable, and all of those problems. Here you have a similar problem, where you have N systems, N different databases, and then you have M different data formats
Starting point is 00:07:29 that you would like to support, because nowadays people just put all of their data into a data lake or into a Google Drive, and they are just expected to be able to read it with whatever they want to use. So again, you have this total maintenance and development effort of N times M,
Starting point is 00:07:47 which is not sustainable. So we look at how those problems are solved in compilers, for example. Well, people invented intermediate representations, I'm not going to take credit for that. So we basically put a layer of abstraction in between. We say that from those languages, we create an intermediate thing, and from this intermediate thing, we compile to M different targets, and then instead of N times M you have N plus M.
Starting point is 00:08:16 So we were wondering if we can do something similar here. And in particular, the paper, more than being about the end product, is like an exploration of what shape this abstraction should have and what the desired properties of it are, so that it works and is nice. And our answer is AnyBlox. We believe, and our paper's thesis is, that this is the way to go. Because, yeah, of course, layers of abstraction introduce overheads, and in the world of databases and, like, you know, VLDB and SIGMOD,
Starting point is 00:08:58 we mostly care about performance a lot. So, yeah. That's nice, yeah, it's a good summary of the need for AnyBlox, right? And, yeah, I guess from that then, we should probably talk about, you can maybe give us, I mean, I kind of always like to ask
Starting point is 00:09:16 for the sales pitch, like, okay, you've got me in an elevator here. Or maybe, the question is, how would you explain it to your grandma, maybe? So yeah, imagine I'm your grandma, and give me, like, a brief, this is what we do, this is what it is basically, this is what AnyBlox is. I mean, you gave a good description there before, but, like, yeah, concise. How would you describe it? I don't think my grandma knows what Parquet is.
Starting point is 00:09:40 So that might be a little hard. But the idea that we bring is that instead of the systems being in charge of how data is read, we should invert this, and it's the data that should tell you how to read itself. Or as we say, your data should decode itself. Instead of disseminating just... Very nice.
Starting point is 00:10:06 Instead of disseminating just your file in some exotic encoding, you also distribute the decoder, and then, as long as we have the proper abstraction in place, it should be possible for any system to read this, and AnyBlox basically achieves this. The sales pitch that I would give is, like, look, we put in some work, and now, as in this figure in the paper, in all of those various systems,
Starting point is 00:10:37 Umbra, DuckDB, Spark, DataFusion, whatever you want, you can read all of those different formats, some of which you've never heard of, and you don't need any support from either side, really. You don't have to ask the data system maintainers for permission, and the only thing that you need to understand is how the system works or how the data format works, and that's it. There are, of course, complications. There's a lot of details hidden in this, but this is the pitch.
Starting point is 00:11:06 That's the core idea. That's very nice. Yeah, it's the idea of self-decoding data, I think you phrase it in your paper, which is really cool. We're going to talk about it in some more depth. So let's get into the meat of things. So obviously the end result was AnyBlox, but there must have been a journey from going from what was
Starting point is 00:11:24 kind of the state of the art, and what people do and how people have solved this, or what the N times M problem kind of looks like today in the world, to getting to this end result. So you approach this by defining four properties that any sort of system or thing like AnyBlox would need to satisfy, and so, yeah, tell us what these four properties are and then we can run through the approaches and how they compare against these four properties. Yes, so there's a question about what you want from this abstraction, like, how do we evaluate it, right? Because, you know, I'm kind of learning how to
Starting point is 00:12:06 properly do research but the way you're supposed to do the science is like, okay, like how do we even test that we got what we wanted? So the properties we described is, first of all, portability. This is like the whole thing, right? Like we want to be able to use the same solution across all the different systems and the cost of integrating it shouldn't be too big. Because first of all, if we just allow you to do it only on some architectures or only with some kinds of systems, no one's going to use it.
Starting point is 00:12:40 If it requires you rewriting your whole database stack, no one's going to use it. So how hard is it to do that? The second thing is security. This is something that an astute listener, when they hear we should distribute code alongside data, immediately thinks of. It's like, what, do you want me to just execute random code on my machine? So the solution has to be somehow robust to bad decoders, not necessarily even malicious decoders, but just bad code.
Starting point is 00:13:14 Then performance, again, I think it's self-explanatory, you don't want to pay a lot for this abstraction. And finally, this is, I think, the only non-obvious one is extensibility, which is, like,
Starting point is 00:13:30 the other side of this N times M problem, right? We have portability here for the systems. And for formats, how hard is it to actually make the format conform to the abstraction? Because you could put so many restrictions on what an appropriate decoder is that some of the existing data formats just wouldn't be able to be written as AnyBlox, for example, or it would be extremely hard.
Starting point is 00:13:56 You would have to hire a PhD student to do it, which would also be bad. And then we kind of look at the landscape of what is out there, what people are looking into, and we come to the conclusion that none of the current directions are suitable. They all suffer in at least one of those dimensions, and we need something different. Nice, yeah. Let's run through those then. So the first one is the native code approach. So tell us what that actually is and then why it's bad relative to, I mean,
Starting point is 00:14:33 it doesn't perform poorly on all of these measures, right, but on pretty much most of them. So yeah, tell us how it does. So this is kind of a solution. It's like the baseline that we have, which is how data formats evolve now, which is someone, you know, someone at DuckDB just goes and writes the code that reads, I don't know, ORC, and pushes it to the main branch,
Starting point is 00:14:56 and that's it. It just runs natively in the database. As another example, we use, like, JSON. This is, somehow, through my lifetime, like a shift that has happened, where people kind of embraced JSON in databases, right?
Starting point is 00:15:17 When I was starting out, it was like this format on the web, and then people started adding support for it in different databases. Now you can read JSON files basically in anything, so like in Postgres, in MySQL. Yeah, so this is of course not really an abstraction, but it's a sensible baseline, because for performance, this is the best you can get, because we're assuming that you're taking someone
Starting point is 00:15:46 very smart that understands the system in and out, and they can write the most efficient code for that particular decoding, and that's the best performance you can get. But of course, this runs straight into the N times M problem. It is not portable nor extensible, because a decoder that you write for one host cannot really be reused in another. For some hosts, of course, this is not even a possible thing. Because for DuckDB, for example, it's at least possible that you can go and propose a change.
Starting point is 00:16:24 But if you want to use, I don't know, SQL Server, realistically you would have to be, like, a big corporation that can pay a lot of money to Microsoft to make that feature happen. For extensibility, it's the same, because I'm looking at it from a user perspective. Like, how hard is it
Starting point is 00:16:48 for you as a system developer to implement this into a system? Or how hard is it for you as an encoding specialist to develop this as a decoder? And of course, if you wrote, I don't know, BtrBlocks, and you wanted to put it into Postgres without
Starting point is 00:17:04 having knowledge of the Postgres code, it's basically impossible. You have to learn everything. And then the last one is security, for which we actually got a bit of pushback from reviewers, because I marked it as weak. So we have the scale from very weak, weak, average, good, very good. And the pushback was, what do you mean weak?
Starting point is 00:17:27 Like, you have an expert developer put code into the database. It gets through code review and tests, because of course we all have high standards, so there's never going to be any issue, because we never write bugs, of course. And that's not true. I kind of expected that people might not like this argument,
Starting point is 00:17:52 but I put a citation. What I did is a little bit of research into this JSON integration that I talked about, into Postgres and various other database systems, and I just looked at when this was introduced, and then tried to find issues related to JSON and JSON support. There are quite a few, but the one that I cited that was the most worrying was that in Postgres, five years after introducing support for the JSON type, they had a high-severity CVE related to that code, which had been untouched for five years.
Starting point is 00:18:28 So my argument here is that you are very much increasing the surface area of possible security bugs by introducing maybe complex decoders. And because there are so many of them, we're talking about possibly thousands of lines of code. So that's not secure, also because it executes at the same privilege level as the rest of the database. So if you're writing a database in C++ and something goes wrong in the decoder, you're risking very serious bugs. And finally, one other thing is that
Starting point is 00:19:06 some decoders require external libraries. So the example that I came across is that, for example, if you want to use Roaring Bitmaps in Rust, the default implementation is just a thin wrapper over a C library that implements Roaring, which has some other static dependencies. So we're also increasing the security risk of your whole dependency chain, right?
Starting point is 00:19:35 Which, as we know, supply chain attacks are also very hotly contested security battles right now. Yeah, so native suffers from all those shortcomings, and it's kind of a baseline for us. Like, how far can we push, how much better can we get across those three dimensions, without sacrificing too much of the performance in the process? Yeah, nice. Essentially, on the security thing, I think you say in the paper it becomes a statistical inevitability in the end, right? Yes. So that's a good way to put it, because it's N times M, right? You know, there's just so much code that something is
Starting point is 00:20:15 going to go wrong. You might win the lottery and it might not be your system, but you only have to get unlucky once and then you have a bad time, right? So yeah, it's better to be safe than sorry. Cool, yeah. You say that another type of approach is extensions. So where do extensions fit into this and how do they compare against these
Starting point is 00:20:33 four principles or pillars that we want to achieve? So extensions are a powerful mechanism. It's one of the reasons why DuckDB is so nice and so powerful, that you can just write your own code and load it as a module and extend the database with
Starting point is 00:20:51 something that the developers didn't think about. Now this alleviates some of the issues. Mainly extensibility, like for you as a format author or decoder or encoding author, it's much easier to integrate, because
Starting point is 00:21:07 you probably just need to wrap your code into the extension API of Postgres or DuckDB, or SQL Server, which I think also has an extension mechanism with C# code, and that's it.
Starting point is 00:21:24 It works. It's also going to be pretty fast because presumably it is in the same language and also kind of in the same level of operation. As long as the extensibility API of the system is sensible and can take advantage of all of the performance features of the system. It's going to be fast. But of course, for portability,
Starting point is 00:21:50 it's still bad. You can realistically reuse an extension between two systems that are written in the same language, maybe, and have similar APIs. Like the changes probably won't be that big. You still have to maintain both, though. But the killer here is security, of course, because you are not going to convince someone, hopefully,
Starting point is 00:22:14 like, hopefully they are smart enough to refuse, when you send them data and you then send them, you know, a compiled extension, a shared library, and say, just load this into Postgres at the same privilege level as the rest of your database and run this code on your machine. That's, of course, insecure. So that's not going to happen, even for normal users. But of course, some enterprise settings have extremely strict requirements
Starting point is 00:22:49 when it comes to code auditing and things like that. So that is just straight-up impossible. I was going to say, I guess the next approach then, isolated extensions, is tackling that security aspect a little bit nicer, but that then comes with other tradeoffs, right? It trades off some of the other pillars.
Starting point is 00:23:07 So yeah, give us a quick rundown of isolated extensions, which is the last approach that people actually do take, right, for this. Yeah. Yes. So you can have this idea of having an extension, but instead of putting it into the same domain, like in Postgres or DuckDB where you just load the module, we host it somewhere in a Docker container,
Starting point is 00:23:29 and then whether it runs on localhost or somewhere in the cloud doesn't really matter. And you containerize it, you expose some API for, like, here's data to decode, please give me back the tuples, and you run with that. This has very strong security guarantees. It depends on the isolation mechanism you choose, but of course you can run it on a completely separate machine, basically,
Starting point is 00:23:56 and then you're in full control of what can happen with that code, what capabilities it has, so security is very strong here. It's also much easier to actually write, because, in the examples that I looked at, this is provided by Snowflake and by AWS's Redshift database. You can host,
Starting point is 00:24:27 like, a Lambda function and have it call into that. It's easier to write because you just get input and you're supposed to produce tuples. Anyone can do that if you've already written your decoder. For portability, it's kind of good, like, as long
Starting point is 00:24:43 as your system supports this approach, this framework of decoding, it's quite easy to move, because it's probably just adapting whatever, usually this happens via HTTP, so whatever REST requests are sent, and it's fine. The minus of this is performance. The one thing that you're losing is that the code is isolated from the host,
Starting point is 00:25:12 so there is an overhead to whatever system you're going to use to call it, probably some REST or, in the best case, some remote procedure call. But the real killer of performance is data transfer, because you necessarily need to copy all the data into the box, into the isolation layer, and then take the decoded data out. And to evaluate how feasible this is, I did a very quick and dirty experiment of, like, okay, I host just a Docker container on the same localhost, so there's no network overhead, it just runs some very simple code, and I try to send data there and get it back,
Starting point is 00:25:58 and the data transfer is immediately the bottleneck. Lightweight decoders are very, very fast, they can decode multiple gigabytes per second, and just the data copy is not going to be fast enough. So to alleviate that, you would have to start breaking down barriers somehow, like sharing the memory somehow,
Starting point is 00:26:22 and then you're of course losing the isolation layers. So it's a good approach, just not for big data, right? If you just care about getting some data and being secure and not really fast, then sure, but this is not why we're here. Cool, yeah.
Starting point is 00:26:41 There's no free lunch, right? There's always tradeoffs, right, as we always see. So, yeah, I guess with that then, I know there is another possible approach around static verification, but I'd like to maybe discuss that a little bit later on. I'd like us to get into AnyBlox now. So given what we just spoke about on these four pillars,
Starting point is 00:27:00 and we want to design something that ticks pretty much all of the boxes, walk us through how AnyBlox does that, walk us through the design and how it works in practice. So I think an important part of the story behind this paper is that the idea of sending code alongside the data
Starting point is 00:27:21 is not really a clever idea. Like it doesn't take like an expert to come up with that. You can come up with that yourself, anyone can come up with that. The problem is, that when you start to implement that in practice, suddenly it turns out it's not that easy, like just saying, oh, let's just distribute code and data.
Starting point is 00:27:43 that when you start to implement it in practice, suddenly it turns out it's not that easy, like just saying, oh, let's just distribute code and data. The obvious issue being, like, you're immediately opening yourself up to remote code execution, because someone just sends you code that does something bad to your database. There are a lot of small little design choices that you have to make along the way, and it's not obvious what the correct one is and how it's going to actually impact the four properties that we talked about. So naively, to execute some code that decodes the data, you would do something similar to the isolated extensions, right? You would put up some box, some sandbox,
Starting point is 00:28:20 copy the data in, run the decoder, copy the data out. And as long as you keep the sandbox secure, right, you don't give too many capabilities to that environment, that's fine. But those copies are very expensive, as we said. Most lightweight decoders can decode data faster than memory bandwidth. So avoiding both of those copies becomes the hard issue. Yes, so in the case of WebAssembly, this is what we chose.
Starting point is 00:28:59 We looked at other ways of expressing the code, right? We wanted it to be portable, so it has to be some language-agnostic bytecode or something similar to that. We looked at WebAssembly. For WebAssembly, the good thing is that the sandbox, first of all, already exists, like, there are WebAssembly runtimes that already exist and do all the things that you would want to do. They isolate,
Starting point is 00:29:32 you don't get access to the network, to the file system, to system calls from inside the isolation layer. So you get kind of security by default from the WebAssembly specification. And then a WebAssembly environment has linear memory, that's like the big thing.
Starting point is 00:29:51 If you don't know what WebAssembly is: people came up with it to make web browsers faster, because they wanted to do more complicated stuff than normal JavaScript would reasonably support with the performance of a scripted, interpreted language. So they introduced this WebAssembly that can be JIT-compiled to native code and then executed. But of course, we're talking about the browser, so it has to be sandboxed and isolated from the rest of the system. And then of course, as it usually is with engineers, you give them a toy that's supposed to run in the browser, they take it out of the browser and try to implement all kinds of crazy stuff outside, which is kind of what we did as well. And I'm not saying that it's bad.
Starting point is 00:30:40 So to achieve our goals here, we can use the way WebAssembly is usually implemented, which is that you create a memory map that is 4 GiB in size, which is as much as you can address with 32 bits in WebAssembly. And this is a virtual memory map, right? So there's no actual physical memory that's backing it. And you use the usual operating system paging mechanism to say that all of those pages are protected. They cannot be touched.
Starting point is 00:31:16 Touching them is a segfault. And then you have a very easy bump-allocator method, where when WebAssembly asks you for memory, you just say, okay, so you want five pages, I just mark those five pages as now being read-write. The rest stays the same. And you can avoid all bounds checks, because now any access out of bounds immediately runs into your memory protection and you get a SIGSEGV. So this is how most WebAssembly runtimes do this, including the one that we chose, which was Wasmtime.
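To make that mechanism a bit more concrete, here is a minimal sketch of the trick in Rust, assuming the libc crate on a Unix-like system. This is not Wasmtime's actual implementation, just an illustration of reserving an inaccessible 4 GiB region and then committing pages on demand, bump-allocator style.

```rust
// Sketch only: requires the `libc` crate and a Unix-like OS.
use std::ptr;

const WASM_PAGE: usize = 64 * 1024;  // WebAssembly page size
const RESERVATION: usize = 4 << 30;  // 4 GiB: the full 32-bit address space

struct LinearMemory {
    base: *mut u8,
    committed: usize, // bytes currently readable/writable (bump-allocated prefix)
}

impl LinearMemory {
    fn reserve() -> LinearMemory {
        // Reserve 4 GiB of virtual memory with PROT_NONE: no physical memory
        // backs it yet, and any access traps with SIGSEGV.
        let base = unsafe {
            libc::mmap(
                ptr::null_mut(),
                RESERVATION,
                libc::PROT_NONE,
                libc::MAP_PRIVATE | libc::MAP_ANONYMOUS,
                -1,
                0,
            )
        };
        assert_ne!(base, libc::MAP_FAILED, "mmap reservation failed");
        LinearMemory { base: base as *mut u8, committed: 0 }
    }

    // The equivalent of memory.grow: flip the next `pages` pages to
    // read-write and return the previous size in pages.
    fn grow(&mut self, pages: usize) -> usize {
        let bytes = pages * WASM_PAGE;
        let rc = unsafe {
            libc::mprotect(
                self.base.add(self.committed) as *mut libc::c_void,
                bytes,
                libc::PROT_READ | libc::PROT_WRITE,
            )
        };
        assert_eq!(rc, 0, "mprotect failed");
        let old_pages = self.committed / WASM_PAGE;
        self.committed += bytes;
        old_pages
    }
}
```

Any access past the committed prefix lands on a PROT_NONE page and faults, which the runtime can turn into a WebAssembly trap, so no explicit bounds checks are needed on the hot path.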
Starting point is 00:31:42 And the idea that we have is that instead of doing hard copies, we do a trick with memory maps. We call these data hooks, where we say, okay, the database system, whatever the host is, wants me to decode some data, so it gives me a file descriptor.
Starting point is 00:32:14 I don't care what exactly this is. This can be a file on disk. This can be a file in memory, like one created with memfd. And I memory-map it to a fixed spot in this linear memory. This is a very cheap operation. No actual copy happens. This is just telling the operating system to do, like, a mapping between pages. We set that to be read-only, because you don't want the decoder to be able to modify data
Starting point is 00:32:43 that is managed by your database. And that's it. We just tell the decoder where it is, and it can access it transparently, as if it was copied there. It has no idea that actually, underneath, this is just, like, a page mapping. When it accesses it,
Starting point is 00:32:59 a page fault happens and all the usual operating system shenanigans happen to make sure that the data is there. And this completely saves you the copy on input. If you already have the data somewhere in memory, it will just be there.
Starting point is 00:33:21 This is just a page mapping. If it's on disk, you have to read it anyway, but it will be read only once. So this is the copy on input.
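As a rough illustration of the data hook idea described above, here is a sketch with the libc crate and a hypothetical helper name, not the actual AnyBlox code: the host maps the file descriptor it was handed, read-only, into a fixed, page-aligned offset of the already-reserved linear memory, so the decoder sees the bytes in place without any copy.

```rust
// Sketch of a "data hook": map a read-only view of the data file into the
// sandbox's linear memory at a fixed offset. `install_data_hook` is a
// hypothetical helper, not the real AnyBlox API. Requires the `libc` crate.
use std::os::unix::io::RawFd;

/// `linear_base` is the start of the 4 GiB reservation backing the Wasm
/// linear memory; `offset` is where inside it the data should appear.
/// Both `offset` and `len` are assumed to be page-aligned here.
unsafe fn install_data_hook(
    linear_base: *mut u8,
    offset: usize,
    fd: RawFd,   // a file on disk, or an in-memory file from memfd_create
    len: usize,
) -> *const u8 {
    let target = linear_base.add(offset) as *mut libc::c_void;
    // MAP_FIXED replaces the pages at `target` with a read-only, shared view
    // of the file. No bytes are copied; the kernel just rewires page tables,
    // and pages are faulted in lazily when the decoder first touches them.
    let mapped = libc::mmap(
        target,
        len,
        libc::PROT_READ, // read-only: the decoder must not modify host data
        libc::MAP_SHARED | libc::MAP_FIXED,
        fd,
        0,
    );
    assert_ne!(mapped, libc::MAP_FAILED, "data hook mapping failed");
    mapped as *const u8
}
```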
Starting point is 00:33:39 And then for the copy on output, there are two things here. One thing is, when you say you output tuples, what does that mean? As input you get some well-defined format that you want to decode, but what is the output? Every system has a different internal representation of their tuples. So we chose Apache Arrow as the output format. This is quite natural if you've heard about Arrow. It is also the default format that is already used by some existing database systems like DataFusion. Arrow is basically a specification of how data is represented in memory: how is a 32-bit integer represented, how is a string represented and laid out for columnar data.
Starting point is 00:34:23 So we chose that, and then we say that a key feature of Arrow is that it's zero copy. This is not a strict term, and it's sometimes confusing what it actually means. Basically, it means that if you have an array of integers, for example, in Arrow, and you want to use it from inside your system, you can just use it without copying the entire thing,
Starting point is 00:34:56 because it's already laid out like an int32, or i32 in whatever programming language you're using, as long as you're on the same endianness as the thing that produced the data.
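A tiny illustration of what zero copy means here, using plain Rust slices rather than the arrow crate: an Arrow-style int32 column is just a contiguous buffer of values plus a validity bitmap, so a consumer on the same endianness can view it in place instead of deserializing it.

```rust
/// Illustrative only: view a buffer that already holds Arrow-style i32
/// values as `&[i32]` without copying. Assumes matching endianness and a
/// suitably aligned buffer (Arrow buffers are 64-byte aligned by convention).
fn view_int32_column(values: &[u8]) -> &[i32] {
    assert!(values.len() % 4 == 0);
    assert!(values.as_ptr() as usize % std::mem::align_of::<i32>() == 0);
    unsafe {
        std::slice::from_raw_parts(values.as_ptr() as *const i32, values.len() / 4)
    }
}

/// Arrow marks NULLs in a separate validity bitmap: bit i set means row i is valid.
fn is_valid(validity: &[u8], row: usize) -> bool {
    validity[row / 8] & (1 << (row % 8)) != 0
}
```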
Starting point is 00:35:43 Yes, so our decoders produce Apache Arrow, and to avoid copying on output, we say that we want pipelined execution. This is the way most database systems that care about performance execute stuff, which is they will ask for some batch of tuples. This can be one tuple, but usually it's a little bit more. They take them and then they push them through the pipeline up to a materialization point where something happens: it's put into a hash table, it's copied somewhere anyway. So we say that as long as you use the decoder in this manner, you ask for a batch, you push it to some point, and then you either use this data somehow or you materialize it explicitly, there will be no copy. The decoder produces it into its linear memory slice, and the host can just access it.
Starting point is 00:36:07 So it's fine. The tradeoff there is that you cannot rely on the decoder leaving that data alone when it's called to decode another batch. So you have to decode one batch, do something with the data, and you're responsible for... It's basically transient, right? You have to be aware that if you call the decoder again on the same memory map, then the data can go away, be corrupted, whatever can happen to it. So you have to be aware of the lifetime. But again, we implemented this into many different data systems, and in none of them was this a problem,
Starting point is 00:36:46 because all of them follow the same principle of pipeline execution and this saves you the copy on output this is especially important for string data where a lot of the decoders are just going to be pointing to data that already exists in the buffer in the input
Starting point is 00:37:07 and you just put a pointer there, and the host just translates the pointer from a Wasm-style offset to its own, and that's it. You don't have to touch the string data, you don't have to copy it at all. So it saves quite a lot.
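Putting the last two points together, a host-side decode loop might look roughly like the sketch below. The type and function names are made up for illustration and are not the real AnyBlox interface; the point is the contract: consume or materialize each batch before asking for the next one, and translate string offsets from the sandbox's linear memory into host pointers without copying.

```rust
// Hypothetical host-side decode loop; names like `decode_batch` are made up
// for illustration and are not the real AnyBlox ABI.
struct Decoder {
    linear_base: *const u8, // start of the sandbox's linear memory in the host's address space
}

/// A decoded batch lives inside the sandbox's linear memory and is only valid
/// until the next `decode_batch` call (it is transient).
struct Batch<'a> {
    rows: usize,
    string_offsets: &'a [u32], // Wasm-relative offsets into the linear memory
    string_lengths: &'a [u32],
}

impl Decoder {
    fn decode_batch(&mut self, _max_rows: usize) -> Option<Batch<'_>> {
        unimplemented!("sketch only: this would call into the Wasm module")
    }
}

/// Translate a Wasm-relative offset into a host string view, without copying.
unsafe fn string_at<'a>(linear_base: *const u8, batch: &Batch<'a>, row: usize) -> &'a str {
    let start = linear_base.add(batch.string_offsets[row] as usize);
    let len = batch.string_lengths[row] as usize;
    std::str::from_utf8_unchecked(std::slice::from_raw_parts(start, len))
}

fn run_pipeline(mut dec: Decoder) {
    let base = dec.linear_base; // copy the raw pointer so no borrow is held on `dec`
    // Pipelined execution: each batch is fully consumed (filtered, aggregated,
    // or copied into a hash table) before the next one is requested, because
    // the next call may overwrite this batch's memory.
    while let Some(batch) = dec.decode_batch(2048) {
        for row in 0..batch.rows {
            let s = unsafe { string_at(base, &batch, row) };
            if s.starts_with("http") { /* feed the operator above */ }
        }
        // `batch` is dropped here; only now may the decoder be called again.
    }
}
```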
Starting point is 00:37:22 You save a lot of money, I guess; time and space is money as well. Yeah, it's very true. Yeah, yeah. Cool, that sounds awesome. So let's talk about evaluation then, and how close AnyBlox actually gets to that sort of theoretical limit,
Starting point is 00:37:40 and how well it does on the other stuff as well. So yeah, I guess this is maybe quite a difficult thing to evaluate, like a non-standard thing to evaluate. So how did you actually approach the evaluation, and what were the key findings? So we wanted to evaluate all of the properties. Of course, the security property kind of evaluates by itself,
Starting point is 00:38:06 right? We're not going to claim that WebAssembly is the perfect solution, because of course you are just relying on the security you inherit from the sandbox. Yeah, if the sandbox is broken, your stuff is also broken. But you have to make some assumptions, right? Yeah, yeah. So yeah, you have to make some assumptions. Yes. But as long as the sandbox is correct and implements the WebAssembly specification, this is also secure, and it's secure in all the meanings you can think of,
Starting point is 00:38:43 because security can also mean: can someone read the data that they are supplied, right? Because if you have an extension in your database, there's no guarantee that somewhere in there is not someone that is just sending your data over the wire
Starting point is 00:38:59 to themselves, or doing something similar with it, exposing your PII that way. Here, of course, the sandbox doesn't have the network. This is just guaranteed. It's impossible to express a network call in the WebAssembly bytecode, so it's not possible to do.
Starting point is 00:39:19 Your memory, your data, is secure because it's protected by the memory protection mechanisms of the operating system. So really the only thing that can happen from the sandbox is that there's, like, a very critical vulnerability and it accesses something from outside
Starting point is 00:39:43 of the memory slice. This is like the main vulnerability. Okay. And for the other things, so portability is very important. We implemented AnyBlox into a variety of different systems, trying to cover a lot of the differences that we could think of. So Umbra and DuckDB were the first ones. Umbra is a complicated, compiling database
Starting point is 00:39:59 developed at TUM. DuckDB, of course, well known, has a different paradigm, they use vectorized execution; both are in C++. Then we implemented it into Spark, which is completely different, in Scala. It has, like, a hybrid execution model, because it compiles some expressions,
Starting point is 00:40:22 but also uses, like, vectorized stuff, and is highly distributed. And then finally into DataFusion, which is also vectorized but uses Arrow natively and is written in Rust, which AnyBlox is also written in. So we wanted to see how easy it is to do it natively. And there's, like, this table where I put how much time it took. It's called person-days, but, to lift the curtain, that's how long it took me.
Starting point is 00:40:52 It's Matt-days, yeah. And how many lines of code. So the hardest was, well, you would expect it to be Umbra, right? Not surprising, because again, it's a complicated compiled database, and I had never seen any of those codebases in my life before I started doing it, and it just took me 10 days, so someone who knew Umbra could probably push it out much faster. The hardest was Spark, but that's squarely because Spark is just so bad to work with. Like, the code doesn't have any documentation,
Starting point is 00:41:28 and, I don't know, people on Stack Overflow also don't know how to implement custom data sources to do stuff like this. It was just very hard. I had to understand Scala code. I did maybe, you know, a tiny bit of Scala programming in college,
Starting point is 00:41:48 but that's it. And DataFusion was just, like... Yeah, so we proved that portability works. And then for the other side, for extensibility, to show that we solved the N times M problem, we took various different encodings, so some of them are just raw encodings like run-length
Starting point is 00:42:07 and FSST directly, just, you know, the kernel of the decoder that we implemented. We also took Vortex, which was like the ultimate test, really, which is a full file format written in Rust, very good, very fast. It's really like an alternative to Parquet. And we wanted to see if we can just, you know, compile that into WebAssembly, put it into a decoder and get away with it.
Starting point is 00:42:42 We can. And then another one is ROOT, which is a very nice, compelling story of people at CERN doing high-energy physics research. They have their own custom data format for holding that particle data. It's an insane format that was developed over, like, a few decades. I don't remember exactly when, it was like the 80s or the 90s. They developed their own proper custom data format in C++. It's crazy. Like, they have a whole file system abstraction, and they basically reinvented Parquet from first principles.
Starting point is 00:43:22 Okay. Cool. They have a REPL in C++ that you can use to talk to that thing. It's crazy. And I found a paper where they were looking recently into, like, can we just use normal tools, like, can we just put our data into BigQuery and ask in SQL something about physics? The answer was that this was not that easy. So we thought it's a very compelling story: here I have no idea how this format works, I would never know. This was actually done by Maxi, he spent like two days just rewriting a subset of ROOT into Rust, put this into a decoder, and now we can read this high-energy physics data. And, you know, if you run this through their tools that they use day-to-day in C++, first of all, you have to write some C++ code yourself, which, poor data scientists. And it doesn't have any parallelism by itself,
Starting point is 00:44:34 so you have to do parallelism by yourself in C++. Whereas here we just put it into Umbra, we write a very simple SQL query, and you get, I think it was like 30 times better performance or something like that, like 6 gigabytes compared to 300 megabytes natively from ROOT. So there, you can use normal tools now. And I think that, to me, that's like the most compelling story
Starting point is 00:45:11 of AnyBlox: that I have no idea how this file format works, I barely know how Umbra works, like, the basics, right, and here I can use this data with this database system. It magically works. Yeah. And then the rest of the evaluation, of course, is just performance. So for performance, we wanted to focus on analytical workloads first. So how does this work in an actual use case and an actual system when you run full queries?
Starting point is 00:45:44 We look at DuckDB and Spark as, like, the two most distinct systems that we could. And we chose TPC-H, the standard analytical benchmark, and then also ClickBench, because ClickBench has a lot of string data and we wanted to test, like, FSST and encodings that Vortex uses, and TPC-H mostly has numerical stuff. So we do that and we just check the native system: native DuckDB, how does it do, versus DuckDB but using Parquet, so we encode the data into Parquet and then read it using the DuckDB Parquet reader, versus our AnyBlox Parquet reader, versus our AnyBlox integration with Vortex. So the story, like the takeaway from this, is: despite the fact that we are not native, right, you have this sandboxing, by using a new cutting-edge encoding for your data and the decoder in WebAssembly, you can already outperform native Parquet, both in time and in compression. So Vortex compresses better and it's much faster to execute, even through AnyBlox. So you can, like, reap the rewards of the encoding research today, without having to go through, you know, the rigmarole of implementing it natively into, like, DuckDB, or writing your own extension or anything like that. And that's great. We also show on the charts an important thing: that, for example, in TPC-H, the Parquet workload spends most of its time decoding data, and then the rest is, you know, useful work of actually running the analytical operators, whereas for Vortex it's much better, it's like half of that time. So there's a lot to be gained in those encodings, for sure. I mean, the whole thing just sounds like a complete win on all fronts, essentially. So that kind of leads me into the sort of next topic I want to talk on, which is about impact. The first thing is, in the short term, how much impact have you seen AnyBlox have, and has it got any traction beyond the people you've communicated with, other people working on high-energy particle physics or anything like that? How is that going? How are we going on that sort of journey
Starting point is 00:48:00 of actually making this thing kind of the industry standard slash go-to thing for this problem? So, you know, it's quite new. The conference was in September. Talking there, a lot of people were interested. People from Vortex specifically are very interested, because they took the story of the paper to mean,
Starting point is 00:48:22 like, what's the best way to fix Parquet? Use AnyBlox to put Vortex into Parquet, and then everything is fixed. Right, okay. Which is, of course, a very nice story for them. Yeah. People from, like, Lance were also interested in how to use this. I think, like, it's new,
Starting point is 00:48:43 but the potential for the impact is very, very big on, like, three different fronts, I would say. One is future-proofing file formats. Very shortly after this was published, a very similar work was published at SIGMOD, which is called F3, which stands for something like future file format, something like that. And they also use WebAssembly decoders in their thing.
Starting point is 00:49:10 The main distinction is that theirs is like a full file format. Ours is not; ours is a framework. You can put it into anything: you can put it into your file format, you can use it just raw with some data, you can put it into your data management system. Theirs is like a full file format, with a similar idea.
Starting point is 00:49:31 So it's definitely, like, a push from our community to get those future-proof file formats. I think the way to have the biggest impact would be to actually get this into Parquet. This would require a new version of Parquet, and we already know how adoption goes, but this would mean that
Starting point is 00:49:54 Parquet actually becomes extensible and future-proof, because then you can expose any encoding as an AnyBlox encoding. We have an experimental version of that where you basically just, I'm not going to explain how Parquet works, but if you know how Parquet works, you just put a page at the start that is the decoder, and then the rest of the pages are, like, opaque data,
Starting point is 00:50:18 and you can use AnyBlox inside the Parquet reader to read through that, and voila, you have your Vortex in Parquet. Great. Yeah. So that's, you know, if you get this into Parquet,
Starting point is 00:50:32 you get this into every data management system ever at some point. So this would be a very high impact. Another thing that I believe this can have impact on is, like, data science. So there's a lot of consumers out there that we probably don't know that well by just going to conferences, which are dominated by, you know, enterprise use cases. But like this high-energy physics stuff, I know that people that do bioinformatics have a load of their own different formats.
Starting point is 00:51:06 And it would be really nice if, for example, we could release an extension for DuckDB that incorporates AnyBlox, and then data scientists can just do their thing. Because from their perspective, the ideal workflow is that they launch a Jupyter notebook, they import DuckDB, and they download whatever data is given on, like, a file share that they have and read it through DuckDB.
Starting point is 00:51:34 And they don't care what the format is. They want the data, they want to plot some graph, look at some stuff and do science. So that would be cool if we allowed them to do that. Currently, of course, it's hard. You can't load ROOT into DuckDB, for example.
Starting point is 00:51:53 I know that the genomics people have their own issues with databases. So yeah, then the last thing, looking forward to the future, is that in general, my area of research is this umbrella of future-proof data systems on modern hardware. And the question is, like, what does it mean to be future-proof? And then how do you achieve that? So this is tackling this from, like, the storage layer. Ideally, you could evolve the storage formats and encodings
Starting point is 00:52:29 completely independently of your data systems. If you make AnyBlox the standard that everyone introducing a new file format or new encoding targets, then any system that already exists that has AnyBlox can read everything. And moreover, if you develop a new system, you don't have to spend time implementing readers, you just implement AnyBlox and you can read anything, and you're happy with that. So this is something that I would like to
Starting point is 00:52:57 I would like to see from systems like LingoDB, for example, which we're also developing at TUM, just looking to be future-proof and open and flexible. Why spend our time, or, you know, master's students' time, implementing Parquet into something if we can just do AnyBlox and forget about it? Yeah. So yeah. The question of how much impact we will have is basically how much effort is put into this. So the project is open source on GitHub.
Starting point is 00:53:30 And, unfortunately I have to say, it's research code, it's not the highest quality. I'm sure it's just fine, Matt. Yeah. Don't worry about that. You should see some of the code that gets written in enterprise software. So yeah.
Starting point is 00:53:48 I worked at Microsoft. I know. My standards are very high. So for my standards, it's, like, fine. Okay, but it could be better. It could be packaged better. It needs to be packaged better, into, like, a DuckDB extension. This is something that I'll be looking
Starting point is 00:54:05 into doing, but of course, if anyone is interested, it would also be very nice to see people implementing this into different systems and creating decoders in
Starting point is 00:54:21 WebAssembly. It's very easy, because basically everything compiles to WebAssembly nowadays. As I said, for Vortex I just took the Rust codebase and I just wrote, you know, target wasm-something-something.
Starting point is 00:54:39 And then I had to remove, like, a subdirectory that was dealing with input/output because it was incompatible. But I already complained about this to the Vortex people, and I think now the codebase just compiles to WebAssembly. That's what they told me. So that's good. You get that file and that's it. You have the decoder.
Starting point is 00:54:56 You don't have to do anything, because our, like, this data hook mechanism means that you're just accessing memory transparently. You don't have to worry about anything. You don't know if your decoder is running in AnyBlox or natively. So, yeah, like, I have big hopes for this project, and I hope we'll manage to get a big impact, because the potential and, like, the community buy-in is there. You just need to harness that.
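For a feel of what targeting WebAssembly looks like on the decoder side, here is a toy run-length decoder in Rust that could be built with cargo build --target wasm32-unknown-unknown --release. The exported name and signature are hypothetical, not the real AnyBlox ABI; a real decoder would follow the interface the framework defines.

```rust
// Toy decoder compiled to WebAssembly. The export name and signature are
// hypothetical, not the real AnyBlox ABI.

/// Decode a run-length-encoded i32 column stored as (value, run_length) pairs.
/// `input_ptr` points into the read-only data-hook region of linear memory;
/// `output_ptr` points into the decoder's own writable memory, where the host
/// expects Arrow-style i32 values. Returns the number of rows produced.
#[no_mangle]
pub unsafe extern "C" fn decode_batch(
    input_ptr: *const i32,
    input_pairs: usize,
    output_ptr: *mut i32,
    output_capacity: usize,
) -> usize {
    let input = core::slice::from_raw_parts(input_ptr, input_pairs * 2);
    let output = core::slice::from_raw_parts_mut(output_ptr, output_capacity);
    let mut rows = 0;
    for pair in input.chunks_exact(2) {
        let (value, run) = (pair[0], pair[1] as usize);
        if rows + run > output_capacity {
            break; // batch full; a real decoder would remember where it stopped
        }
        output[rows..rows + run].fill(value);
        rows += run;
    }
    rows
}
```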
Starting point is 00:55:25 Yeah, well, fingers crossed it all turns out. Well, in the interest of time, I'm going to skip over the surprises question, and I know we touched on the future one as well, just because I actually need to leave the house in about 20 minutes, so I need to wrap things up, unfortunately. Sorry to... I talk so much. No, no, it's all great. It's all great stuff. It's good. It's going to come out fantastic. It's really clear, insightful and well-explained. So it's great stuff. But yeah, with that, let's move on to the last word then.
Starting point is 00:55:51 So what do you want, firstly, practitioners, people out there in industry, to take away from this podcast today and your new work on AnyBlox? And also researchers as well, what would you want them to hopefully get from this podcast? Oh, it's a hard question. Well, it might sound a little rich coming from me, but I would say that people should stop inventing file formats, maybe.
Starting point is 00:56:27 We have a lot, right? We had FastLanes come out. We had F3. There was one more that I forgot. Lance, right? And our work is explicitly not a file format. If there is one thing that I want you to take away, it is that AnyBlox is not a file format;
Starting point is 00:56:49 people already started saying that, and I hate it, because very explicitly, this is not a file format. This is like a framework, a method of exposing encodings. We don't care about the physical layout of your data. You can put this into anything.
Starting point is 00:57:05 We explicitly require very little metadata: the only thing that you need to tell us is what the output tuple shape is, and how many rows there are; that's all we need to know. You can put this into Parquet, you can embed your own statistics.
Starting point is 00:57:23 That's the other paper from SIGMOD, F3, that I talked about, which also uses WebAssembly. You could just replace their mechanism for WebAssembly execution with AnyBlox, and it's fine. We're not a file format. We're much more flexible,
Starting point is 00:57:40 which I think is a big plus, because it means that you can actually, like, push this through. You don't have to fight an uphill battle of, everyone now use my file format for stuff; you just have to be like, hey, implement this encoding, and you're fine. Yes, so that's, like, the one thing. Good message, yeah. You've been told, folks, remember:
Starting point is 00:58:01 AnyBlox is not a file format. It's not a file format. Yeah, and I think for future research, I think we should, first of all, all be a little bit more focused on making stuff open and accessible to other people.
Starting point is 00:58:20 Because, you know, this also kind of speaks to impact. At VLDB itself, there was a lot of talk about, like, yeah, we do all of this research, but is this actually being used by people? And this is one of these things
Starting point is 00:58:34 that keep coming up: all those encodings, all those file formats, and then most often your code just remains somewhere on GitHub. It's referenced from a paper, and that's it. No one ever uses it. So we should be, like, thinking about how we can actually bring those benefits
Starting point is 00:58:53 to people. For example, you have this large community of data scientists that as the root example and the paper that they cite show are eager to like use our stuff. For me, it's more idealistic.
Starting point is 00:59:09 I would just like to help them and allow them to use our stuff. If you want a more corporate pitch, then, like, those people want to use BigQuery and give you their money. But we have to allow them to give you money by
Starting point is 00:59:23 making this stuff actually open and accessible. Yes. And for future research, for AnyBlox, I think there are a few limitations of this that are, like, obvious and weren't really
Starting point is 00:59:39 addressed, mostly because of time constraints. The major one is filter pushdown. It would be nice if the decoder, for example, was able to take some simple filter expressions, like, you know, eliminate nulls, or eliminate this
Starting point is 00:59:55 column when it's below three or something. Just filter pushdown would give a lot; we point out in the analysis of the experiments, like, specific queries where we feel that all of the performance that we lose is due to filter pushdown. So there's, like, low-hanging fruit there. The larger limitation, and, like, the future direction for research that I personally am interested in, is: WebAssembly is great, but as I said, we're kind of using a tool that's different,
Starting point is 01:00:32 that was made for a different purpose, and we kind of, like, ham-fisted it in here. And it has one big limitation, which is that it can only run on a CPU. You have to call it as a function. It runs in the sandbox and it returns. And that's all you get.
Starting point is 01:00:49 A lot of people are now looking into both compression and decompression with accelerators or with GPUs. And AnyBlox, in particular WebAssembly, I feel like is not capable of that. You can't run WebAssembly on the GPU.
Starting point is 01:01:07 I mean, you can run it, but you can't have the sandbox. The sandbox is very CPU-centric and requires CPU facilities. So this is, like, how far can you push future-proofness? I like to think that one day someone is going to come up with a chip that we have no idea what it looks like. Some future processing unit. You don't know what it is. You don't know its architecture.
Starting point is 01:01:32 Can we actually design a system that would run on that, even though we don't know too much about what it looks like currently? How far can you push this abstraction? We have some ideas. They are mostly, you know, ditch WebAssembly and maybe talk about decoders at a more abstract level.
Starting point is 01:01:49 Maybe we need a specific dialect, like a DSL or something that then would be compiled. I feel like there's a lot of good research that could be done there, especially now that there's a push for like, as I said, GPU and accelerators,
Starting point is 01:02:04 heterogeneous hardware and stuff like that. Yeah, so I think that's it. Cool. That's a great message to end on there, Matt. Thank you very much for taking the time to talk with us today. It's been a very, very insightful chat. I'm sure the listeners out there will really have enjoyed it. Yeah, best of luck with the rest of your PhD
Starting point is 01:02:25 and the rest of your academic career as well. We'll be keeping an eye on you for sure. I'm sure there's plenty of good research to come. Yeah. I'll see you at the next best paper award. Exactly. There we go. There we go.
Starting point is 01:02:39 I expect one every time now, Matt. That's it. Yeah. Cool. We'll end things.
