Disseminate: The Computer Science Research Podcast - Mateusz Gienieczko | AnyBlox: A Framework for Self-Decoding Datasets | #69
Episode Date: March 5, 2026

In this episode of Disseminate: The Computer Science Research Podcast, host Dr. Jack Waudby is joined by Mateusz Gienieczko, PhD researcher at TU Munich and co-author of the VLDB Best Paper Award-winning paper AnyBlox. They dive deep into a fundamental problem in modern data systems: why cutting-edge data encodings and file formats rarely make it from research into real-world systems, and how AnyBlox proposes a radical solution. Mateusz explains the core idea of self-decoding data, where datasets ship with their own portable, sandboxed decoders, allowing any database system to read any encoding safely and efficiently. Built on WebAssembly, AnyBlox bridges the long-standing gap between database research and practice without sacrificing performance, portability, or security. This episode is essential listening for database researchers, data engineers, system builders, and industry practitioners interested in the future of data formats, analytics performance, and making research matter in practice.

Links:
Paper: https://www.vldb.org/pvldb/vol18/p4017-gienieczko.pdf
GitHub: https://github.com/AnyBlox
Mat's Homepage: https://v0ldek.com/

Hosted on Acast. See acast.com/privacy for more information.
Transcript
Disseminate the Computer Science Research Podcast.
Hey everyone, welcome to Disseminate the Computer Science Research Podcast.
I'm your host, Jack Waudby.
This is the first episode of '26, so a very belated happy new year to all of you computer science research enthusiasts out there.
In today's show, we are going to be talking about a paper that was published at VLDB last year that actually won the Best Research Paper Award, and that is AnyBlox: A Framework for Self-Decoding Datasets.
And I'm really pleased to say I've got one of the authors of that paper with me today,
Mateusz Gienieczko.
So welcome, Matt.
Hello.
Cool.
So we always start off, as is custom on the podcast, with you telling the listeners a little bit
about your journey and how you kind of arrived at where you are today.
And I did do a little bit of research, and I know that you wrote your first line of code in 2013
and to quote from your homepage, this made a lot of people very angry
and it's been widely regarded as a bad move.
So, yeah, maybe take us through your story, Matt.
Yeah, so, yeah, I started coding in high school, that was around that time.
I did my bachelor's and master's in Warsaw in computer science.
My master's thesis was rsonpath, which is the blazing-fast JSONPath engine.
It was published at ASPLOS '23.
The title of the paper is Supporting Descendants in SIMD-Accelerated JSONPath.
It was a fun project with an intersection of like some theory of automata,
which always fascinated me, like computation theory,
and, you know, low-level tinkering.
Because with SIMD, you have to use low-level code,
which I didn't do much before.
I had industry experience as well, mostly doing .NET, like, backend development, cloud stuff, talking to databases, but not writing databases.
I went to work at Microsoft for a year, in Dublin, in identity. It was not fun.
And I started looking for a PhD
because I thought that
I should try with an academic
career before
the industry sucks out the rest of
my soul. And I had the opportunity to work at TUM under... well, it's complicated actually. Like, our labs are pretty intertwined between Professor Neumann, Professor Kemper and Professor Leis, but now I'm supervised by Professor Giceva at the databases and data processing chair, so that's what I do. Yeah,
AnyBlox was my first paper. Now I'm... Jesus, it was already two years of the PhD; this is my third year.
Time flies, man.
Yeah, time goes quick.
Soon you'll be graduating.
Yeah, but I guess the plan is, long term then, to stay in academia, right? And that's the career, like, to pursue. Industry wasn't for you.
That's how I feel right now.
Of course, that might change,
the landscape and, like,
the job market evolves so quickly nowadays
that it's really hard to make long-term plans.
but I feel much better doing what I'm doing now
than I did at any company that I worked at, so this is much more exciting.
Fantastic, yeah. Well, I'd say your research career definitely has started strong, because this is a very good paper, and obviously winning a research award, the VLDB Best Paper, straight away. It's a good start. So yeah, things are starting well, so hopefully that will continue for you.
So let's speak about that then.
So yeah.
I joked at the conference that, well, now it's all downhill from here, right? It can only go down.
There'll be a test of time award, surely, right?
In 10 years time as well, Matt.
So, yeah, we've got that to look forward to at least, yeah.
Hopefully.
But yeah.
Cool.
Yeah, so let's talk about AnyBlox then. So first of all, actually, let's give the listeners some context and kind of the domain that we're talking about here a little bit. So this is all about, well, there's a lot of encoding innovation that happens, or has happened, in the research world, that very rarely finds its way into industry, into being adopted in practical settings.
So why is this?
And yeah, just really set the context for us
for what we're going to be talking about today
and where AnyBlox fits into this whole landscape of computing and, yeah, encoders and decoders and things.
Yeah, so as I said, if you look at research at SIGMOD, for example,
there's a lot of encodings and a lot of new data formats.
in practice, no one uses them.
In practice, just everyone uses Parquet, basically.
It's like, you know, everything else is like an outlier percentage-wise.
Yeah, Parquet is popular, but it's very hard to extend with those new encodings.
This problem is, let's say, semi-famous from a blog post from DuckDB: they cannot enable those better encodings for Parquet by default when DuckDB writes data, because there are still some readers out there that don't support them,
and you want to be able to support everything
because if there is one thing that you hate as a user, it's dealing with compatibility problems. Anyone who has had to manage any kind of package dependencies knows that very well. So it's hard to add new stuff.
And then if you create a new data format, it's kind of hard to break through, because if you create a new data format, it's supported nowhere, basically. Unless you have your own database, you have to convince some big system like DuckDB or, I don't know, Spark, which would be like the main platform for data science, to support your format. They're not going to do that, because why would they, no one uses it. But of course, no one is going to use it if it's not supported anywhere.
So you're in this kind of a loop.
Classic catch-22, right?
Classic catch-22, right?
Exactly.
So, yeah, so Parquet is there, but a lot of people will tell you that it sucks and needs to be updated.
People create new file formats, but they're not really used.
And there is like a systemic issue here that we kind of identify as an instance of this N times M problem, which is... I know it from compilers, right? It's like this idea that you have N different dialects of a language and then M different architectures or platforms that you want to target, and if you do this naively, you have to do N times M work, because for every language you need a separate compiler for every platform, and that sucks and is unsustainable.
Here you have a similar problem, where you have N systems, N different databases, and then you have M different data formats that you would like to support, because nowadays people just put all of their data into a data lake or into a Google Drive, and they are just expected to be able to read it with whatever they want to use. So again, you have this total maintenance and development effort of N times M, which is not sustainable.
So we looked at how those problems are solved in compilers, for example. Well, people invented LLVM. I'm not going to take credit for that. So we basically put a layer of abstraction in between. We say that from those languages, we create an intermediate thing, and from this intermediate thing, we compile to M different targets, and then instead of N times M you have N plus M.
So we are wondering if we can do something similar here.
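To make the arithmetic concrete, here is a toy Python sketch of the N times M versus N plus M argument. The system and format names are just illustrative examples I picked, not a list from the paper, and "abstraction" stands in for the intermediate layer:

```python
def direct_integrations(systems, formats):
    # Without an abstraction, every (system, format) pair needs its own reader.
    return [(s, f) for s in systems for f in formats]

def via_abstraction(systems, formats):
    # With a shared layer, each side only targets the middle: N + M integrations.
    return [(s, "abstraction") for s in systems] + [("abstraction", f) for f in formats]

systems = ["DuckDB", "Spark", "Umbra", "DataFusion"]          # N = 4
formats = ["Parquet", "ORC", "BtrBlocks", "FastLanes", "CSV"]  # M = 5

assert len(direct_integrations(systems, formats)) == 4 * 5  # 20 readers to maintain
assert len(via_abstraction(systems, formats)) == 4 + 5      # only 9 integrations
```

The gap widens quickly: with 20 systems and 20 formats, that is 400 readers versus 40 integrations.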
And in particular, like, the paper, more than being about the end product, is an exploration of what shape this abstraction should have, and what its desired properties are so that it works and is nice.
And our answer is AnyBlox. We believe that our paper substantiates that this is the way to go,
because, yeah, of course, layers of abstraction introduce overheads, and in the world of databases and, like, you know, VLDB and SIGMOD, we care about performance a lot.
So, yeah.
That's nice, yeah, it's a good summary of the need for AnyBlox, right? And, yeah, I guess from that then, we should probably talk about... you can maybe give us, I mean, I kind of always like to ask for, like, the sales pitch: okay, you've got me in an elevator here. Or maybe the question is like, how would you explain it to your grandma? So yeah, imagine I'm your grandma and, like, give me a brief, this is what we do, this is what AnyBlox is, basically. I mean, you gave a good description there before, but, like, yeah, concise. How would you describe it?
I don't think my grandma knows what Parquet is.
So that might be a little hard.
But the idea that we bring is that
instead of the systems being in charge of how data is read,
we should invert this,
and it's the data that should tell you how to read itself.
Or as we say, your data should decode itself.
Instead of disseminating just...
Very nice.
Instead of disseminating just your file in some exotic encoding, you also distribute the decoder, and then, as long as we have the proper abstraction in place, it should be possible for any system to read it. And AnyBlox basically achieves this.
The sales pitch that I would give is like: look, we put in some work, and now, and this is just like this figure in the paper, in all of those various systems, Umbra, DuckDB, Spark, DataFusion, whatever you want, you can read all of those different formats, some of which you have never heard of, and you don't need any support from either side, really.
You don't have to ask the data system maintainers for permission,
and the only thing that you need to understand is how the system works
or how the data format works, and that's it.
There are, of course, complications.
There's a lot of details hidden in this, but this is the pitch.
That's the core idea.
That's very nice.
Yeah, it's the idea of self-decoding data, as I think you phrase it in your paper, which is really cool. We're going to talk about it some more, in some depth.
So let's get into the meat of things.
And so obviously the end result was AnyBlox, but there must have been a journey, from kind of what the state of the art was and how people have solved this, or what the N times M problem kind of looks like today in the world, to get to this end result. So you approached this by defining four properties that any sort of system or thing like AnyBlox would need to satisfy. So yeah, tell us what these four properties are, and then we can run through the approaches and how they compare against these four properties.

Yes, so there's a question about what you want from this abstraction. Like, how do we evaluate it, right? Because, you know, I'm kind of learning how to properly do research, but the way you're supposed to do the science is like, okay, how do we even test that we got what we wanted?
So the properties we described are, first of all, portability.
This is like the whole thing, right?
Like we want to be able to use the same solution across all the different systems
and the cost of integrating it shouldn't be too big.
Because first of all, if we just allow you to do it only on some architectures
or only with some kinds of systems, no one's going to use it.
If it requires you rewriting your whole database stack, no one's going to use it.
So how hard is it to do that?
The second thing is security.
This is something that an astute listener when they hear we should distribute code alongside data immediately thinks.
It's like, what do you want me to just execute random code on my machine?
So the solution has to be somehow robust for bad decoders,
not necessarily even malicious
decodes, but just bad code.
Then performance,
again, I think it's self-explanatory,
you don't want to pay a lot for this
abstraction.
And finally, this is, I think, the only non-obvious one, extensibility, which is, like, the other side of this N times M problem, right? We have portability here for the systems. And for formats: how hard is it to actually make the format conform to the abstraction? Because you could put so many restrictions on what an appropriate decoder is that some of the existing data formats just wouldn't be able to be ported to AnyBlox,
for example, or it would be extremely hard.
You would have to hire a PhD student to do it, which would also be bad.
And then we kind of look at the landscape of what is out there, what are people looking into,
and we come into the conclusion that none of the current directions are suitable.
They all suffer in at least one of those dimensions and we need something different.
Nice, yeah.
Let's run through those then.
So the first one is the native code approach.
So tell us kind of what that actually is, and then why it's bad relative to... I mean, it doesn't perform poorly on all of these measures, right, but on pretty much most of them.
So yeah, tell us how it does.
So this is kind of a solution... it's like the baseline that we have, which is how data formats evolve now, which is, you know, someone at DuckDB just goes and writes the code that reads, I don't know, ORC, and pushes it to the main branch, and that's it. It just runs natively in the database.
As another example, we use, like, JSON. This is somehow, like, through my lifetime... this is like a shift that has happened, where people kind of embraced JSON in databases, right? When I was starting out, it was like this format on the web, and then people started adding support for it into different databases. Now you can read JSON files in basically anything, like in Postgres, in MySQL.
Yeah, so this is of course not really an abstraction, but it's a sensible baseline, because for performance, this is the best you can get: we're assuming that you're taking someone very smart that understands the system in and out, and they can write the most efficient code for that particular decoding, and that's the best performance you can get. But of course, this runs straight into the N times M problem. It is not portable nor extensible, because a decoder that you write for one host cannot really be reused in another.
For some hosts, of course, this is not even a possibility. For DuckDB, for example, it's at least possible that you can go and propose a change. But if you want to use, I don't know, SQL Server, realistically you would have to be like a big corporation that can pay a lot of money to Microsoft to make that feature.
For extensibility, it's the same, because... I'm looking at it from a user perspective. Like, how hard is it for you as a system developer to implement this into a system? Or how hard is it for you as an encoding specialist to develop this as a decoder? And of course, if you wrote, I don't know, BtrBlocks and you wanted to put it into Postgres without having knowledge of the Postgres code, it's basically impossible. You have to learn everything.
And then the last one is security,
for which we actually got a bit of pushback from reviewers,
because I marked it as weak.
So we have the scale from very weak, weak, average, good, very good.
And the pushback was, what do you mean weak?
Like, you have an expert developer put code into the database.
It gets through code review and tests, because of course we all have high standards, so there's never going to be any issue, there are never any bugs, because we never write bugs, of course.
And that's not true.
I kind of expected that people might not like this argument,
but I put in a citation. What I did is a little bit of research into this JSON integration that I talked about, into Postgres and various other database systems, and I just looked at: when was this introduced, and then tried to find issues related to JSON and JSON support. There are quite a few, but the one that I cited that was the most worrying was that in Postgres, five years after introducing support for the JSON type, they had a high-severity CVE related to that code, which had been untouched for five years.
So my argument here is that you are very much increasing the surface area of possible security bugs
by introducing many, possibly complex, decoders. And because there are so many of them, we're talking about possibly thousands of lines of code. And it's also not secure because it executes at the same privilege level as the rest of the database.
So if you're writing a database in C++ and something goes wrong in the decoder, you're risking very serious bugs.
And finally, one other thing is that some decoders require external libraries. So the example that I came across is that, for example, if you want to use Roaring Bitmaps in Rust, the default implementation is just a thin wrapper over a C library that implements Roaring, which has some other static dependencies.
So we're also increasing the security risk of your whole dependency chain, right?
Which, as we know, supply chain attacks are also very hotly contested security battles right now.
Yeah, so native suffers from all those shortcomings, and it's kind of a baseline for us. Like, how far can we push? How much better can we get across those three dimensions without sacrificing too much of the performance in the process?

Yeah, nice. Essentially, like, that kind of security thing, I think you say in the paper it becomes a statistical inevitability in the end, right? And that's a good way to put it, because it's N times M, right? You know, there's just so much code that something is going to go wrong. You might win the lottery and it might not be your system, but you only have to get unlucky once, and then you have a bad time, right? So, yeah, it's better to be safe than sorry.
Cool, yeah.
You say that another type of approach is extensions. So where do extensions fit into this, and how do they compare against these four principles or pillars that we want to achieve?
So extensions are a powerful mechanism.
One of the reasons why DuckDB is so nice and so powerful is that you can just write your own code and load it as a module, and extend the database with something that the developers didn't think about.
Now this alleviates some of the issues. Mainly extensibility: like, for you as a format author or decoder or encoding author, it's much easier to integrate, because you probably just need to wrap your code into the extension API of Postgres or DuckDB, or SQL Server, which I think also has an extension mechanism with C# code, and that's it. It works.
It's also going to be pretty fast, because presumably it is in the same language and also kind of at the same level of operation. As long as the extensibility API of the system is sensible and can take advantage of all of the performance features of the system, it's going to be fast.
But of course, for portability,
it's still bad.
You can realistically reuse an extension
between two systems that are written in the same language,
maybe, and have similar APIs.
Like the changes probably won't be that big.
You still have to maintain both, though.
But the killer here is security, of course, because you are not going to convince someone, hopefully, like, hopefully they are smart enough to refuse, when you send them data and you then send them, you know, a compiled extension, a shared library, and say: just load this into Postgres at the same privilege level as the rest of your database and run this code on your machine. That's, of course, insecure. So that's why it's not going to happen, even for normal users. But of course, some enterprise settings have extremely strict requirements when it comes to code auditing and things like that. So there it's just straight up impossible. No one is going to do that.
I guess the next approach then, isolated extensions, is tackling that security aspect a little bit more nicely, but that then comes with other tradeoffs, right? It trades off some of the other pillars. So yeah, give us a quick rundown of isolated extensions, which is the last approach people actually do take, right?
Yeah.
Yes.
So you can have this idea of having an extension, but instead of putting it into the same trust domain, like in Postgres or DuckDB where you just load the module, we host it somewhere in a Docker container, and then whether it runs on localhost or somewhere in the cloud doesn't really matter. You containerize it, you expose some API for, like, here's data to decode, please give me back the tuples, and you run with that.
This has very strong security guarantees. It depends on the isolation mechanism you choose, but of course you can run it on a completely separate machine, basically, and then you're in full control of what can happen with that code, what capabilities it has. So security is very strong here.
It's also much easier to actually write. In the examples that I looked at, this is provided by Snowflake and the AWS Redshift database: you can host, like, a Lambda function and have the system call into that. It's easier to write because you just get input and you're supposed to produce tuples. Anyone can do that if you've already written your decoder.
For portability, it's kind of good. Like, as long as your system supports this approach, this framework of decoding, it's quite easy to move, because it's probably just adapting whatever requests are sent, usually this happens via HTTP, so whatever REST requests are sent, and it's fine.
The minus of this is performance. The one thing that you're losing is that the code is isolated from the host, so there is an overhead to whatever mechanism you're going to use to call it, probably some REST call or, in the best case, some remote procedure call. But the real killer of performance is data transfer, because you necessarily need to copy all the data into the box, into the isolation layer, and then take the decoded data out.
And to evaluate how feasible this is, I did a very quick and dirty experiment of, like, okay, I host just a Docker container on the same localhost, so there's no network overhead. It just runs some very simple code; I try to send data there and get it back, and the data transfer is immediately the bottleneck.
Lightweight decoders are very, very fast, so they can decode multiple gigabytes per second, and just the data copy is not going to be fast enough. So to alleviate that, you would have to start breaking down barriers somehow, like sharing the memory somehow, and then you're of course losing the isolation layers. So it's a good approach, but not for big data, right? If you just care about getting some data and being secure, and not really fast, then sure, but this is not why we're here.
Cool, yeah.
There's no free lunch, right?
There are always tradeoffs, right, as we always see. So, yeah, I guess with that then, I know there is another possible approach around static verification, but I'd like to maybe discuss that a little bit later on. I'd like us to get into AnyBlox now. So given what we just spoke about on these four pillars, and that we want to design something that ticks pretty much all of the boxes, walk us through how AnyBlox does that, walk us through the design and how it works in practice.
So I think an important part of the story
behind this paper is that the idea
of sending code alongside the data
is not really a clever idea.
Like it doesn't take like an expert
to come up with that.
You can come up with that yourself,
anyone can come up with that.
The problem is that when you start to implement that in practice, suddenly it turns out it's not that easy,
like just saying, oh, let's just distribute code and data.
The obvious issue being, like, you're immediately opening up to remote code execution,
because someone just sends you code that does something bad to your database.
There's like a lot of small little design choices that you have to make along the way,
and it's not obvious what the correct one is and how it's going to actually impact the four properties that we talked about.
So naively, to execute some code that decodes the data, you would do something similar to the isolated extensions,
right, you would put up some box, some sandbox,
copy the data in, run the decoder, copy the data out.
And as long as you keep the sandbox, like, secure, right?
you don't give too many capabilities to that environment,
that's fine.
But those copies are very expensive, as we said.
Most lightweight decoders can decode data faster than memory bandwidth.
So avoiding both of those becomes a hard issue.
Yes, so WebAssembly is what we chose.
We looked at other ways of expressing the code, right?
We wanted it to be portable, so it has to be some language-agnostic bytecode or something similar to that.
We looked at WebAssembly.
For WebAssembly, the good thing is that the sandbox, first of all, already exists. Like, there are WebAssembly runtimes that already exist and do all the things that you would want to do. They isolate: you don't get access to the network, to the file system, to system calls from inside the isolation layer.
So you get kind of security by default
by the WebAssembly specification.
And then a WebAssembly environment has linear memory, that's like the big thing. If you don't know what WebAssembly is: people came up with it to make web browsers faster, because they wanted to do more complicated stuff than normal JavaScript would reasonably support with the performance of a scripted, interpreted language. So they introduced this WebAssembly that can be JIT-compiled to native code and then executed.
But of course, we're talking about the browser, so it has to be sandboxed and isolated from the rest of the system.
And then of course, as it usually is with engineers, you give them a toy that's supposed to run in the browser.
They take it out of the browser and try to implement all kinds of crazy stuff outside, which is kind of what we did as well.
And I'm not saying that it's bad.
So to achieve our goals here, we can use the way WebAssembly is usually implemented, which is that you create a memory map that is 4 GiB in size, which is as much as you can address with 32 bits in WebAssembly. And this is a virtual memory map, right? So there's no actual physical memory that's backing it. And you use the usual operating system paging mechanism to say that all of those pages are protected. They cannot be touched; touching them is a segfault. And then you have a very easy bump-allocator method
where when WebAssembly asks you for memory,
you just say, okay, so you want five pages.
I just tell those five pages that now they are read-write.
The rest stays the same.
And you can avoid all bounds checks, because now any access out of bounds immediately runs into your memory protection and you get a segfault. So this is how most WebAssembly runtimes do this, including the one that we chose, which was Wasmtime.
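The guard-page trick described above can be modeled in a few lines. This is a simplified Python sketch of the idea, not Wasmtime's actual implementation: reserve the whole 32-bit space virtually, commit pages only on `memory.grow`, and treat any touch beyond the bump pointer as a trap (which in a real runtime the MMU delivers for free, with no explicit bounds check):

```python
PAGE = 64 * 1024  # a WebAssembly page is 64 KiB

class LinearMemory:
    """Toy model of a guard-page-backed linear memory."""
    def __init__(self):
        # Reserve the full 32-bit address space (4 GiB) virtually;
        # in a real runtime every page starts out protected (a guard page).
        self.reserved_pages = (4 * 2**30) // PAGE
        self.committed = 0  # bump pointer, counted in pages

    def grow(self, pages):
        # memory.grow: flip the next `pages` guard pages to read-write.
        if self.committed + pages > self.reserved_pages:
            raise MemoryError("out of 32-bit address space")
        self.committed += pages
        return self.committed

    def access(self, addr):
        # A real runtime emits no bounds-check branch: the MMU traps instead.
        if not (0 <= addr < self.committed * PAGE):
            raise RuntimeError("trap: touched a guard page")
        return True

mem = LinearMemory()
mem.grow(5)                       # "you want five pages"
assert mem.access(5 * PAGE - 1)   # last committed byte: fine
try:
    mem.access(5 * PAGE)          # one byte past the end: would segfault
    assert False
except RuntimeError:
    pass
```

The point of the model is that `access` never consults a per-access length check in real runtimes; the page protection does the work.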
And the idea that we have is that instead of doing hard copies,
we do a trick with memory maps.
We call this data hooks,
where we say: okay, the database system, whatever the host is that runs the code, just gives me a file descriptor. I don't care what exactly this is. This can be a file on disk. This can be a file in memory, like one created with memfd. And I memory-map it to a fixed spot in this linear memory.
This is a very cheap operation.
No actual copy happens.
This is just telling the operating system to do like a mapping between pages.
We set that to be read-only, because you don't want the decoder to be able to modify data
that is managed by your database.
And that's it.
We just tell the decoder where it is,
and it can access it transparently,
as if it was copied there.
It has no idea that underneath, this is actually just a page mapping. When it accesses it, a page fault happens, and all the usual operating system shenanigans happen to make sure that the data is there. And this completely saves you the copy on input.
If you already have the data
somewhere in memory,
it will just
be there.
This is just a page mapping.
If it's on disk, you have to read it anyway,
but it will be only read once.
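The copy-on-input idea can be sketched with ordinary OS primitives. This Python toy uses my own naming, not the AnyBlox data-hook API: the host hands over a file descriptor, and instead of copying the bytes in, we map it read-only, so reads go through page mappings:

```python
import mmap
import os
import tempfile

# Write a stand-in "encoded block" to disk.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\x2a\x00\x00\x00" * 1024)
    path = f.name

# The host gives us a file descriptor; the hook maps it read-only.
fd = os.open(path, os.O_RDONLY)
view = mmap.mmap(fd, 0, access=mmap.ACCESS_READ)  # page mapping, no byte copy

assert view[:4] == b"\x2a\x00\x00\x00"  # the decoder reads through the mapping
try:
    view[0] = 0  # read-only: the decoder cannot modify the host's data
    assert False
except TypeError:
    pass

view.close()
os.close(fd)
os.unlink(path)
```

The same mechanism works whether the descriptor refers to a real file or an in-memory one (e.g. created via memfd on Linux); either way, no copy of the payload is made up front.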
So this is the copy on input; and then for the copy on output, there are two things here. One thing is: when you say output tuples, what does that mean? As input you get some well-defined format that you want to decode, but what is the output?
Every system has a different internal representation of their tuples.
So we chose Apache Arrow as the output format. This is quite natural if you've heard about Arrow. It is also the default format that is already used by some existing database systems, like DataFusion. It is basically a specification of how data is represented in memory: how a 32-bit integer is represented, how a string is represented and laid out, for columnar data. So we chose that, and then we say that a key feature of Arrow is that it's zero-copy. This is not a strict term; it's sometimes confusing what it actually means.
Basically, it means that if you have an array of integers, for example, in Arrow, and you want to use it from inside your system, you can just use it without copying the entire thing, because it's already laid out the way an int32 or i32 in whatever programming language you're using is laid out, as long as you're on the same endianness as the thing that produced the data. Yes, so our decoders produce Apache Arrow, and to avoid copying on output, we say that we want pipeline execution.
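As a small illustration of that zero-copy idea (a Python sketch, not Arrow's actual library code; it assumes a little-endian host, matching the endianness caveat above): reinterpreting a raw buffer as an int32 array produces a view over the same bytes, not a new array:

```python
import struct
import sys

# A raw buffer in an Arrow-style layout for an int32 column.
raw = struct.pack("<4i", 10, 20, 30, 40)  # 16 little-endian bytes

# Reinterpret the bytes as integers: a new *view*, not a copy.
ints = memoryview(raw).cast("i")

assert sys.byteorder == "little"   # precondition for the reinterpretation
assert list(ints) == [10, 20, 30, 40]
assert ints.obj is raw             # still backed by the very same bytes
```

A host on the same endianness as the producer can consume such a buffer directly; only a mismatched endianness would force an actual conversion pass.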
So this is the way most database systems that care about performance execute stuff, which is: they will ask for some batch of tuples. This can be one tuple, but usually it's a little bit more. They take them and then they push them through the pipeline
up to a materialization point where something happens.
It's put into a hash table.
It's copied somewhere anyway.
So we say that as long as you use the decoder in this manner,
you ask for a batch, you push it to some point,
and then you either use this data somehow,
or you materialize it explicitly.
There will be no copy.
The decoder produces it into its linear memory slice,
and the host can just access it.
So it's fine.
The tradeoff there is that you cannot rely on the decoder leaving that data alone when it's called to decode another batch. So you have to decode one batch, do something with the data, and you're responsible for...
It's basically transient, right?
You have to be aware that if you call the decoder again on the same memory map, then the data can go away, be corrupted, whatever can happen to it.
so you have to be aware of the lifetime
but again we implemented this into many different
data systems and in none of them was this a problem
because all of them follow the same principle
of pipeline execution
and this saves you the copy on output
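The pipeline contract described above can be sketched in a few lines of Rust. This is a toy simulation, not the real AnyBlox interface (all names here are invented for illustration): the decoder reuses a single output buffer, standing in for its linear-memory slice, so each batch is only valid until the next `decode_batch` call, exactly the transience discussed above.

```rust
// Toy sketch of the pipeline-execution contract; not the real AnyBlox API.
// The decoder writes each batch into one reused buffer ("linear memory"),
// so a batch must be fully consumed before the next call overwrites it.
struct Decoder {
    input: Vec<i32>,
    pos: usize,
    out: Vec<i32>, // reused output slice: previous batch data does not survive
}

impl Decoder {
    fn decode_batch(&mut self, n: usize) -> &[i32] {
        self.out.clear(); // invalidates the previous batch: data is transient
        let end = (self.pos + n).min(self.input.len());
        self.out.extend_from_slice(&self.input[self.pos..end]);
        self.pos = end;
        &self.out
    }
}

fn main() {
    let mut dec = Decoder { input: (0..10).collect(), pos: 0, out: Vec::new() };
    let mut sums = Vec::new();
    // Pipeline-style consumption: push each batch up to a materialization
    // point (here, a running sum) before asking for the next one.
    loop {
        let batch = dec.decode_batch(4);
        if batch.is_empty() {
            break;
        }
        sums.push(batch.iter().sum::<i32>());
    }
    println!("{:?}", sums); // [6, 22, 17]
}
```

As long as the host follows this pattern, no copy of the decoded data is ever made.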
this is especially important for string data
where a lot of the decoders
are just going to be
pointing to data that already exists
in the buffer in the input
and you just put a pointer
there and the host just
translates the pointer from
Wasm-style offset
to its own and that's it
you don't have to touch the string data
you don't have to copy it at all, so it saves quite a lot of...
You'd say a lot of money, I guess. Time and space is money as well.
Yeah, it's very true.
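The string case above can be illustrated with a small Rust sketch. This is a simulation with invented names, not the real AnyBlox mechanism: the decoder hands back an (offset, length) pair into its linear memory, and the host reinterprets those bytes in place rather than copying them.

```rust
// Illustrative sketch: translating a Wasm-style (offset, len) pair from the
// sandbox's linear memory into a borrowed &str on the host, with no copy.
// `view_str` and the memory contents are invented for this example.
fn view_str(linear_memory: &[u8], offset: usize, len: usize) -> Option<&str> {
    let bytes = linear_memory.get(offset..offset + len)?; // bounds-checked
    std::str::from_utf8(bytes).ok() // reinterpret in place, no allocation
}

fn main() {
    // Pretend this Vec is the decoder's linear memory holding the raw input;
    // the decoder "returned" (offset = 6, len = 7) instead of a copied string.
    let mem: Vec<u8> = b"hello,anyblox".to_vec();
    let s = view_str(&mem, 6, 7).unwrap();
    println!("{}", s); // anyblox
}
```

The bounds check mirrors what the host must do anyway: an offset from the sandbox is only trusted after it is validated against the memory slice.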
cool that sounds awesome
so like
let's talk about evaluation then and how close
any blocks actually gets to
that sort of theoretical limit
and how well it does on the other stuff as well
so yeah, I guess this is maybe quite a difficult thing
to evaluate like a non-standard thing to evaluate
so how did you actually approach the evaluation
and yeah what were the key findings
So we wanted to evaluate all of the properties.
Of course, the security property kind of evaluates by itself,
right? We're not going to claim that WebAssembly is the perfect solution, because of course you are just
relying on the security you inherit from the sandbox. If the sandbox is broken, your stuff is also
broken. But you have to make some assumptions, right?
Yeah, yeah. So yeah, you have to make some
assumptions. Yes. But as long as the sandbox is correct and implements the WebAssembly
specification, this is also
secure, and in all the senses you can think of.
Because security can also mean:
can someone read
data that they aren't supposed to, right? Because
if you have an
extension in your database, there's no
guarantee that somewhere in there
is not someone that is just sending
your data over the wire
to themselves.
Or
doing something similar with that.
exposing your PII that way.
Here, of course, the sandbox doesn't have the network.
This is just guaranteed.
It's impossible to express a network call in the WebAssembly bytecode,
so it's not possible to do.
Your memory, your data, is secured by the operating system,
because it's protected by the operating system's memory-protection mechanisms.
So really the only thing that can happen from the sandbox is that
there's like a very critical vulnerability and it accesses something from outside
of the memory slice.
This is like the main vulnerability.
Okay. And for the other
things, so portability is very
important. We implemented AnyBlox
in a variety of different
systems, trying to cover
a lot of the differences that we could think of.
So Umbra and DuckDB
were the first ones.
Umbra is a complicated compiling database
developed at TUM.
DuckDB, of course, well known,
has a different paradigm:
they use vectorized execution. Both in C++.
Then we implemented it into Spark,
which is completely different in Scala.
It has like a hybrid execution model
because it compiles some expressions,
but also uses like vectorized stuff and is highly distributed.
And then finally into DataFusion, which is also vectorized
but uses Arrow natively and is written in Rust,
which AnyBlox is also written in.
So we wanted to see how easy it is to do it natively.
And there's this table where I put how much time it took.
It's called person-days, but, to lift the curtain,
that's how long it took me.
It's Matt-days, yeah.
And how many lines of code.
So the hardest was Umbra. Well, you would expect it to be Umbra,
not surprising, because again it's a complicated
compiled database, and I had
never seen any of those
codebases in my life
before I started doing it,
and it just took me 10 days,
so someone who knew Umbra could probably do it much faster.
Actually, the hardest was Spark,
but that's squarely because Spark
is just so bad to work with
like the code
doesn't have any documentation
and
I don't know,
people on Stack Overflow also don't know how to implement
custom data sources to do stuff like this.
It was just very hard.
I had to understand Scala code.
I did maybe, you know,
a tiny bit of Scala programming in college,
but that's it.
And DataFusion was just easy.
So, yeah, we proved that portability works.
And then for the other side,
for extensibility,
to show that we solved the N times M problem,
we took various different encodings.
So some of them are just raw encodings, like run-length
and FSST, directly, just, you know, the kernel of the decoder
that we implemented. We also took Vortex, which was like the
ultimate test, really, which is
a full file format written in Rust,
very good, very fast.
It's really like an alternative to
Parquet.
And we wanted to see if we can just, you know, compile that into WebAssembly, put it into a decoder and get away with it.
We can.
And then another one is ROOT, which is a very nice, compelling story of people at CERN doing high-energy physics research.
They have their own custom data format for holding that particle data.
It's an insane format that was developed over, I don't remember exactly when,
like the 80s or the 90s.
They developed their own proper custom data format in C++.
It's crazy.
Like, they have a whole file system abstraction, and they basically reinvented Parquet from first principles.
Okay.
Cool.
They have a REPL in C++ that you can use to talk to that thing.
it's crazy
And I found a paper where they were recently looking into: can we just use normal tools? Can we just put our data into BigQuery and ask something about physics in SQL?
The answer was that it was not that easy.
So we thought it's a very compelling story: here, I have no idea how this format works, I would never know.
This was actually done by Maxi. He spent like two days
just rewriting a subset of ROOT into Rust, put this into a decoder,
and now we can read this high-energy physics data.
And, you know, if you run this through the tools that they use day-to-day
in C++, first of all, you have to write some C++ code yourself, which, poor data scientists.
And it doesn't have any parallelism by itself.
so you have to do parallelism by yourself in C++.
Whereas here we just put it into Umbra, we write a very simple SQL query,
and you get, I think it was like 30 times better performance or something like that,
like 6 gigabytes per second compared to 300 megabytes per second natively from ROOT.
So there, you can use normal tools now.
And I think that, to me, that's like the most compelling story
of AnyBlox: that I have no idea how this file format works.
I barely know how Umbra works, like the basics, right?
Here I can use this data with this database system.
It magically works.
Yeah.
And then the rest of the evaluations, of course, just performance.
So for performance, we wanted to focus on analytical workloads first.
So how does this work in an actual use case
and an actual system when you run full queries?
We look at DuckDB and Spark
as, like, the two most distinct systems that we could.
And we chose TPC-H, the standard analytical benchmark,
and then also ClickBench, because ClickBench has a lot of string data
and we wanted to test, like, FSST and the encodings
that Vortex uses.
And TPC-H mostly has numerical stuff.
So we do that and we just check the native system.
Native DuckDB, how does it do?
Versus DuckDB but using Parquet,
so we encode the data into Parquet and then read it using the DuckDB Parquet reader,
versus our AnyBlox Parquet reader, versus our AnyBlox integration with Vortex.
So the story, like the takeaway from this is,
despite the fact
that we are not native, right?
You have this sandboxing.
By using
a new, cutting-edge
encoding for your data
and the decoder in WebAssembly,
you can already outperform native
Parquet, both in time and in compression.
So vortex compresses better
and it's much faster to execute
even through any blocks.
So you can, like, reap
the rewards of the
encoding research today,
without having to go through, you know, the rigmarole of implementing it natively into, like, DuckDB, or writing your own extension or anything like that.
And that's great.
We also show on the charts an important thing: that, for example, in TPC-H, the Parquet workload spends most of its time decoding data.
And then the rest is, you know, useful work of actually running the analytical operators.
whereas for Vortex,
it's much better,
it's like half of that time.
So there's a lot to be gained in those encodings.
for sure I mean the whole thing
just sounds like a complete win on all fronts essentially
so that kind of leads me into my
the sort of next topic I want to talk on
that is about impact
The first thing is, in the short term,
how much impact
have you seen AnyBlox have,
and has it got any traction
beyond the people you've communicated with? Other people working on high-energy particle physics,
or anything like that?
How is that going?
How are we going on that sort of journey
of actually making this thing,
kind of the industry standard slash go-to thing
for this problem?
So, you know, it's quite new.
The conference was in September.
Talking there, a lot of people were interested.
People from Vortex specifically are very interested,
because they took the story of the paper to mean,
like, what's the best way to fix Parquet?
Use AnyBlox to put Vortex into Parquet,
and then everything is fixed.
Right, okay.
Which is, of course, a very nice story for them.
Yeah.
People from, like, Lance were also interested in how to use this.
I think, like, it's new,
but the potential for impact is very, very big on,
like, three different fronts, I would say.
One for future-proofing file formats.
Very shortly after this was published,
a very similar work was published at SIGMOD,
which is called F3,
which stands for something like future-proof file format.
And they also use WebAssembly decoders in their thing.
The main distinction is that theirs is like a full file format,
whereas ours is a framework:
you can put it into anything,
you can put it into your file format,
you can use it just raw with some data,
you can put it into your data management system.
This is like a full file format,
with a similar idea.
So definitely,
it's definitely like a push from our community
to get those future-proof file formats.
I think the way to have the biggest impact
would be to actually get this into Parquet.
This would require a new version of Parquet,
and we already know how adoption goes,
but this would mean that
Parquet actually becomes extensible and future-proof,
because then you can expose any encoding
as an AnyBlox encoding. We have an experimental version of that,
where you basically just...
I'm not going to explain how Parquet works,
but if you know how Parquet works: you just put a page
at the start that is the decoder,
and then the rest of the pages is, like, opaque data,
and you can use
AnyBlox inside the Parquet reader
to read through that, and voilà,
you have your Vortex in Parquet.
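As a rough illustration of that experimental idea, here is a conceptual Rust sketch. This is NOT the real Parquet or AnyBlox layout (the struct and field names are invented): it only shows the shape of a column chunk whose first page carries the decoder and whose remaining pages are opaque encoded bytes.

```rust
// Conceptual sketch only: not the real Parquet or AnyBlox on-disk layout.
// It illustrates the idea described above: the first "page" carries the
// decoder itself, and the remaining pages are opaque encoded data that only
// that decoder understands.
struct SelfDecodingChunk {
    decoder_page: Vec<u8>,      // the AnyBlox decoder, shipped with the data
    opaque_pages: Vec<Vec<u8>>, // e.g. Vortex-encoded bytes, opaque to the reader
}

impl SelfDecodingChunk {
    // A classic reader would decode pages itself; here it would instead hand
    // each opaque page to the embedded (sandboxed) decoder.
    fn total_encoded_bytes(&self) -> usize {
        self.opaque_pages.iter().map(|p| p.len()).sum()
    }
}

fn main() {
    let chunk = SelfDecodingChunk {
        decoder_page: vec![0x00, 0x61, 0x73, 0x6d], // "\0asm": Wasm magic bytes
        opaque_pages: vec![vec![1, 2, 3], vec![4, 5]],
    };
    println!(
        "decoder: {} bytes, data: {} bytes in {} pages",
        chunk.decoder_page.len(),
        chunk.total_encoded_bytes(),
        chunk.opaque_pages.len()
    );
}
```

The point of the design is that the reader needs no knowledge of the encoding: everything after the decoder page is a black box until the decoder interprets it.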
Great.
Yeah.
So that's, you know,
if you get this into Parquet,
you get this into every data management system ever,
at some point.
So this would be very high impact.
Another thing that I believe
this can have impact on is like data science.
So there's a lot of consumers out there that we probably don't know that well just by going to conferences, which are dominated by,
you know, enterprise uses.
But like this high energy physics stuff, I know that people that do bioinformatics have a load of their own different formats.
And it would be really nice.
Like, for example, DuckDB: we could release an extension for DuckDB that incorporates AnyBlox,
and then data scientists can just do their thing.
Because from their perspective,
the ideal workflow is that they launch a Jupyter notebook,
they import DuckDB,
and they download whatever data is given on, like, a file share that they have,
and read it through DuckDB.
And they don't care what the format is.
They want the data,
they want to plot some graph,
look at some stuff and do science.
So that would be cool, if we allowed them to do that.
Currently, of course, it's hard.
You can't load ROOT into DuckDB, for example.
I know that the genomics people have their own issues with databases.
So yeah, then the last thing, looking forward to the future, is that in general, my area of research is this umbrella of future-proof data
systems on modern hardware.
And the question is, like,
what does it mean to be future-proof?
And then how do you achieve that?
So this is tackling this from like the storage layer.
Ideally, you could evolve the storage formats and encodings
completely independently of your data systems.
If you make AnyBlox the standard that everyone introducing a new file format
or new encoding targets, then any system that already exists that has AnyBlox
can read everything and moreover if you develop a new system
you don't have to spend time
developing and implementing readers. You just implement
AnyBlox, and you read anything, and you're happy with that.
so this is something that I would like to
I would like to see from systems like LingoDB, for example,
which we're also developing at TUM, which is looking to be
future-proof and open and flexible.
Why spend our, or, you know, master's students' time implementing Parquet into something if we can just do AnyBlox and forget about it?
Yeah.
So yeah.
The question of how much impact we will have is basically how much effort is put into this.
So the project is open source on GitHub.
And, unfortunately, I have to say that it's research code.
It's not the highest quality.
I'm sure it's just fine, Matt.
Yeah.
Don't worry about that.
You should see some of the code that gets written
in enterprise software.
So yeah.
I worked at Microsoft.
I know.
My standards are very high.
So for my standards, it's like, fine.
Okay, but it could be better.
It could be packaged better.
It needs to be packaged better, into, like, a DuckDB extension.
This is something that I'll be looking
into doing, but of course,
if anyone is interested
it would also be very
nice to see
people implementing
this into different
systems and
creating decoders in
WebAssembly. It's very easy,
because basically
everything compiles
to WebAssembly nowadays.
As I said, for Vortex
I just took the Rust
codebase and I just wrote,
you know, target wasm-something-something.
And then I had to remove like a sub-directory that was dealing with input output because it was
incompatible.
But I already complained about this to Vortex people.
And I think now the codebase just compiles to WebAssembly.
That's what they told me.
So that's good.
You get that file and that's it.
You have the decoder.
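For flavor, the build step described above is roughly the following. The exact target, flags, and crate layout here are illustrative assumptions, not the real AnyBlox build setup (check the AnyBlox repository for that):

```shell
# Illustrative only: compiling a Rust decoder crate to portable WebAssembly.
# The real AnyBlox build may differ; these are the generic Rust-to-Wasm steps.
#
#   rustup target add wasm32-unknown-unknown
#   cargo build --release --target wasm32-unknown-unknown
#
# The artifact then lands under a path like:
echo "target/wasm32-unknown-unknown/release/<crate_name>.wasm"
```

The `wasm32-unknown-unknown` target makes no OS assumptions, which fits a sandboxed decoder that is only allowed to touch its own linear memory.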
You don't have to do anything, because our, like, data-hook mechanism means that you're just
accessing memory transparently.
You don't have to worry about anything.
You don't know if your decoder is running in AnyBlox or natively.
So, yeah, like, I have big hopes for this project,
and I hope we'll manage to get a big impact
because the potential and, like, the community buy-in is there.
So you just need to harness that.
Yeah, well, fingers crossed
it all turns out.
Well, in the interest of time, I'm going to skip over the surprises question.
I know we touched on the future one as well,
just because I actually
need to leave the house in about 20 minutes.
So I need to wrap things up, unfortunately.
Sorry to...
I talk so much.
No, no, it's all great.
It's all great stuff.
It's good.
It's going to come out fantastic.
It's really clear, insightful and well-explained.
So it's great stuff.
But yeah, with that, let's move on to the last word then.
So what do you want, firstly, practitioners,
people out there in industry, to take away from this podcast today
and your new work on AnyBlox?
And also researchers as well, what would you want them to hopefully get
from this podcast?
Oh, it's a hard question.
Well, it might sound a little rich coming from me,
but I would say that people should stop inventing file formats, maybe.
We have a lot, right?
We had FastLanes come out.
We had F3.
There was one more that I forgot.
Lance, right?
and our work is explicitly not a file format.
If there is one thing that I want you to take out
is that any blocks is not a file format,
people already started saying that,
and I hate it,
because very explicitly,
this is not a file format.
This is like a framework,
a method of exposing encodings.
We don't care about the physical layout of your data.
You can put this into anything.
We explicitly require very little
metadata; the only things that you need to tell us
are what the output
tuple shape is, obviously,
and how many rows there are.
That's all we need to know.
You can put this into parquet, you can embed
your own statistics.
That's the other paper
from SIGMOD, F3, that I talked about, that also
uses WebAssembly. You could just
replace their mechanism for WebAssembly
execution with AnyBlox,
and it's fine. We're not a file format.
We're much more flexible.
which I think is a big plus,
because it means that you can actually, like, push this through.
You don't have to fight an uphill battle
of "everyone, now use my file format for stuff".
You just have to be like,
hey,
implement this encoding, and you're fine.
Yes, so that's like the one thing.
Good message. Yeah, you've been told, folks, remember:
AnyBlox is not a file format.
It's not a file format.
Yeah, and I think
for future research,
I think we should,
first of all,
we should all be a little bit more focused on
making stuff open and accessible to other people.
Because, you know,
this also kind of speaks into impact.
At VLDB itself,
there was a lot of talk about,
like, yeah,
we do all of this research,
but is this actually being used by people?
And this is one of those things
that keeps coming up:
all those encodings, all those file formats, and then
most often your code just
remains somewhere on GitHub. It's referenced
from a paper, and that's it.
No one ever uses it.
So we should be like
thinking about how we can actually bring those benefits
to people.
For example, you have this
large community of data scientists that
as the ROOT example
and the paper that they cite show,
are eager to, like, use
our stuff.
For me, it's more idealistic.
I would just like to help them
and allow them to
use our stuff.
If you want a more corporate pitch,
then, like, those people
want to use BigQuery and give you their money.
But we have to allow them
to give you money by
making this stuff actually open
and accessible.
Yes.
And for future research,
for AnyBlox,
I think there are a few
limitations of this that are, like, obvious
and weren't really
addressed, mostly because of time
constraints. The major one is filter pushdown.
It would be nice if the
decoder, for example,
was able to take some simple filter expressions,
like, you know,
eliminate nulls, or eliminate this
column when it's below three or something.
Just filter pushdown would give a lot.
We point out in the analysis of the
experiments specific queries where
we feel that all of the performance that we lose is due to missing filter pushdown.
So there's like low-hanging fruit.
The large limitation and like future direction for research that I personally am interested in is
WebAssembly is great, but as I said, we're kind of using a tool
that was made for a different purpose, and we kind of, like, ham-fisted
it in here.
And it has one big limitation,
which is that it can only run
on a CPU.
You have to call it as a function.
It runs in the sandbox and it returns.
And that's all you get.
A lot of people are now looking into
both compression and decompression
with accelerators or with GPUs.
And any blocks,
in particular web assembly,
I feel like it's not
capable. You can't run
WebAssembly on the GPU.
I mean, you can run it, but you can't have the sandbox.
The sandbox is very CPU-centric and requires CPU facilities.
So this is like, how far away can you push future-proofness?
I like to think that one day someone is going to come up with a chip that we have no
idea what it looks like.
Some future processing unit.
You don't know what it is.
You don't know its architecture.
Can we actually design a system that would run
on that, even though we don't know what it looks like
currently?
How far can you push this abstraction?
We have some ideas.
They are mostly, you know,
ditch web assembly and maybe talk about
decoders on the more abstract level.
Maybe we need a specific dialect,
like a DSL or something that
then would be compiled.
I feel like there's a lot of
good research that could be done there, especially
now that there's a push
for like, as I said, GPU
and accelerators,
heterogeneous hardware and stuff like that.
Yeah, so I think that's it.
Cool. That's a great message to end on there, Matt.
Thank you very much for taking the time to talk with us today.
It's been a very, very insightful chat.
I'm sure the listeners out there
will really have enjoyed it.
Yeah, best of luck with the rest of your PhD
and the rest of your academic career as well.
We'll be keeping an eye on you for sure.
I'm sure there's plenty of good research to come.
Yeah.
I'll see you on the next best paper award.
Exactly.
There we go.
There we go.
I expect one every time now, Matt.
That's it.
Yeah.
Cool.
We'll end things there.
