The Data Stack Show - 166: Data Processing Fundamentals and Building a Unified Execution Engine Featuring Pedro Pedreira of Meta
Episode Date: November 29, 2023

Highlights from this week's conversation include:
- The concept of composable at a lower level of data infrastructure (1:28)
- New architectures and components that allow developers to build databases (3:44)
- Pedro's background and experience in data infrastructure (6:18)
- The spectrum of latency and analytics (12:59)
- Different query engines for different use cases (16:32)
- Vectorized vs code gen data processing (19:33)
- Vectorization and code generation (21:21)
- Examples of vectorized engines (24:33)
- Rewriting the execution engine in C++ (27:22)
- Different organization of Presto and Spark (33:17)
- Arrow and its extensions (37:15)
- The similarities between analytics and ML (44:33)
- Offline feature engineering and data preprocessing for training (48:00)
- Dialect and semantic differences in using Velox for different engines (50:01)
- The convergence of dialects (52:23)
- Challenges of Substrait and semantics (53:18)
- Future plans for Velox (58:09)
- The discussion on evolving Parquet (1:03:38)
- The integration of the relational model and the tensor model (1:07:29)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Discussion (0)
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by Rudderstack, the CDP for developers.
You can learn more at rudderstack.com. Costas,
this week's conversation is
with Pedro from Meta
and wow,
what a lot to talk about.
Velox, of course,
is a huge topic
and
his work on that and
usage of that inside of Meta,
which is fascinating.
It's an execution engine that does a ton of stuff inside of Meta.
But Pedro's really an expert in so many different things.
Databases, you know, sort of architecture of data infrastructure.
So much to talk about. The thing that I'm really interested in is this concept of composable has
been a marketing term in the data space as it relates to sort of, let's say, like higher level
vendors that you would purchase to, you know, sort of handle data in an ingestion pipeline or an egress pipeline, right?
But Pedro has a really interesting perspective on this concept of composable at a lower level of data infrastructure.
And the execution engine really sort of cuts through some of the marketing noise that the vendors are creating with higher level tooling and helps us understand, at the infrastructure level, what does composable mean?
So I think you put it in the right way, because I think the difference here when we are talking about composability
is on what level of abstraction we're talking about
when it comes to composability.
Like the vendors you're talking about, I think they are talking more about
like composability of like, let's say features in a way
or like functionality like the user wants.
Composability when it comes like to what we were discussing with Pedro,
is a little bit more fundamental and has to do more with how software
systems are architected.
And, okay, I'm sure people that listen to us, they know the value of being
able to build a system, a software system that has some kind of separation of concerns
between its modules.
So you have much more agility and flexibility in building, updating,
having people that are dedicated to different areas and in general,
have something that scales much easier in terms of building.
We are not talking about processing scalability.
Now, traditionally, this was never, like, the case with database systems, though, right?
Like, database systems were, like, kind of like big monoliths in a way, and, like, for
some very good reasons.
Like, it has a lot to do with, like, how hard it is, like, to build such a system.
So, when we're talking about, like, the composability that we'll be discussing with Pedro, it's more about that.
Like how there are some new architectures and some new components coming out that actually allow a developer to pick different libraries and build databases, right?
Or like data processing systems, let's say, in general. And that's a very important
thing in the industry, because traditionally, building database systems has been extremely
hard, exactly because there was almost zero concept of reusability of libraries or software
or whatever. You pretty much had to do everything from scratch. And that made the whole process of
building these systems really hard and very risky, also from a venture
point of view.
So we are entering a new era when it comes to these systems where with technologies like
Arrow, for example, or Velox, we start seeing, let's say, some fundamental components
that you find in every system out there
provided as like a library in a way
that you can take and like integrate
and build your own system, right?
So this is like the things
that we are going to like to talk about
when we're talking about composability with him.
But there's much, much more.
Actually, we're going to talk a lot
about like some very basic and important concepts when it comes to data processing.
And by the way, Pedro has been working for 10 years in Meta's data infrastructure.
So he has seen a lot in these past 10 years.
There are so many things that have changed and were built, so we're going to talk a lot about the
evolution: things that 10 years ago were innovative and today need to be rethought.
And that's a hint, because he's going to announce some very interesting things about
some changes and some updates in some very important systems out there. So Velox is a very interesting project, but it also involves some amazingly experienced people.
We start with Pedro today.
Everyone should listen; you are going to enjoy it and take a glimpse of the future of what is
coming.
And hopefully we are going to have him back and also more people related to this technology
to talk more about that stuff in the future.
All right, well, let's dig in and chat with Pedro.
Let's do it.
Hello, everyone, again, to another episode of the Data Stack Show.
I'm here today with a very special guest, Pedro from Meta.
And we are going to be talking about some very interesting things
that have to do with databases.
But first of all, Pedro, welcome.
It's really nice to have you here.
And tell us a few things about yourself.
Yeah, definitely.
Yeah.
I think first of all, I'm super excited to be here.
Thanks for the invitation.
Just introducing myself, I'm Pedro Pedreira.
I'm from Brazil.
I've been working at, well, formerly Facebook, now Meta for exactly 10 years now.
It's been my, we call this a metaversary.
So my 10th metaversary was about a couple of weeks ago.
So yeah, I've been always focused on data infrastructure.
I'm a software engineer.
been developing systems, query processing, like all sorts of things around data. I spent
about five to six years working on a project called Cubrick, which was something that I
started inside Meta. It was something based on some of the ideas I was researching in my PhD.
So the idea was more creating a new database,
very focused on very low latency analytic queries.
So we have a special way to index the data
and partition data in kind of very small containers.
And it could use that to speed up your queries,
especially if it was kind of highly filtered queries.
So I spent, like I said,
about maybe five to six years working on that.
It was really cool taking the things I was researching and turning that into an actual product inside Meta.
It was something that at some point was running many queries.
And then there was this other thing that was being developed in parallel called Presto, which was also a really great piece of technology.
It was making some good progress.
And at some point we started seeing that they're kind of getting closer and closer to the point where we eventually,
we brought the teams together and we started merging the technology. And then it was when
some of the ideas around Velox and actually unifying the execution engines, all those
different compute engines into a library started. And that's how we kind of, you know, we started
looking into Velox. And I've been doing that for, I think, the last three years or so.
So I've been really focused on Velox, but inside Meta, I also work pretty
closely with some of the other compute engine teams, so I work very closely
with the Presto team, with the Spark team, Graph Analytics, but this is
kind of my world, like software engineering around all those analytic
engines, but I also work very closely with the machine learning infrastructure,
people, real-time data, like all things kind of related to this area.
That's my thing.
Okay.
That's super cool.
Okay.
Before we start like talking about more, let's say like recent things and more like, how
to say that, like Silicon Valley kind of things. I'd like to ask you about like growing up like in Brazil, right?
And deciding like to get into databases.
And the reason I'm asking is because, okay, I also come like from,
like I wasn't born and like raised in the United States.
I was also like in Greece.
I went like to like a technical school there
and there were like many different options
like to go and focus on.
And what brought you,
let's say when you were like in Brazil studying
to focus on databases?
Like what was the thing?
Like it wasn't like,
I remember like most of my friends,
like for example,
we were like at electrical and computer engineering school.
Everyone wanted to build, at the beginning, a 3D engine.
Playing Quake back then.
I'm a little bit old.
And we all wanted to do something like that.
Databases were also something that would come later.
It was a very specific small group of people getting into that stuff.
So tell me about that. How you got exposed to that and got excited about it.
Yeah, definitely.
Good question.
I think for me, when I was in college, I started looking into operating systems.
So it's really going down into Linux kernel, understanding how all those things work, memory
management, how to manage resources, things like that.
And then I spent some time looking at distributed systems as well.
So I think the first thing that really caught my eye on databases is just, I think, the
kind of the breadth of things that you get to interact with, right?
Because if you're building databases, I mean, you need to understand operating systems.
You need to be a great software engineer, but there's also compilers, languages, kind of optimizers.
So I think there's just so many things that you need to understand if you actually want to go down the database route.
It was just something that was fascinating to me.
And I started a little bit more on doing research on transaction processing.
And then somehow I kind of gravitated to more of this analytics side, of
multi-dimensional indexing on databases. I think the more I started learning how databases work,
I think the more I started, like I said, just getting fascinated by how complex those things
are. And I think that's the part that was really interesting because you can go as deep as you
want in compilers inside databases, or on the hardware side, going down to how hardware works and how
prefetching and caching work. Like, all those things, you know, you
need to have a really good understanding of those to build a database.
I think that was just something that always caught my eye.
That's how I got into that.
Yeah.
Oh yeah.
That makes a lot of sense.
Okay.
And another question that I will ask in an attempt to clarify
things for our audience, because it is one of those things where
the semantics vary depending on who you talk with.
And that has to do with like latency when it comes like to interacting with data.
Right.
So we have streaming processing.
We have what some people call real-time databases, we have interactive analytics, we have many different, let's say,
terms that are used for different systems. It all has to do with the latency of the query.
And it's very interesting because you mentioned Presto, and Presto is also a
system that if you compare it with something like Spark, for example, right?
It's much more interactive.
Like it is, I mean, not how Spark can be like today, but if we take it like when these things
like started, but the whole idea was that whatever we design here, like the system we
are designing, like one, let's say guiding design here, like the system we are designing, like
one guiding principle is that these queries that we are running, they need to respond
in a timely manner, right?
We are not talking about queries that they are going to run for hours.
But then you're mentioning Cubrick, and you're talking about even lower latency. Can you help us understand a little bit how time and latency relate to databases
and try to create some kind of categories there that are as distinct as possible and
like to communicate this with our audience out there?
Interesting.
Yeah, I think, let me take a stab at that.
I think like you're saying, if we're looking at more analytic queries, there's definitely
a spectrum that goes to interactive analytics.
Sometimes they call that last mile analytics, which are things that you need to serve with
really low latency.
Imagine things serving user-facing dashboards that have really high QPS and you're expecting
very low latency.
And building systems focus on that,
like you need to take different trade-offs.
You cannot expect to scan petabytes of data for those queries.
There's probably some pre-computation,
some caching required,
like even how you propagate your query
between workers,
it should be very,
kind of really fine-tuned.
So I think we usually call that
there's this kind of last mile analytics that relies
on a lot of caching and pre-computation.
Then there's more interactive analytics.
Sometimes they call them maybe exploration or ad hoc, which are in some ways you're doing
some exploration of data.
So you know your data set, maybe you're kind of trying to join different tables to get
some insight, but it's not something that you already found and already built a dashboard for,
with a few queries that you want to serve with really low latency.
There's this part in between that is a little more exploratory, but it still needs to be executed with low latency because there's a human expecting to see that result.
So it goes from this serving in a way to this exploration. And then on the other side,
there's more of those kind of really
kind of higher latency batch queries,
which is just, okay, I need to prepare my data.
So I need to join this, I don't know,
10 petabyte table with this,
another five petabyte table, generate some tables.
And then those tables are going to be used
to serve something else.
So usually, and this is only talking about analytics,
but it goes from this side of serving dashboards, your kind of human
exploration, a little more ad hoc to kind of large batch processing and ETL.
And I think, like I said, at least inside Meta, a lot of the batch workloads
are just running inside Spark, right?
Because it's a system that is really good at handling large queries.
There's a good support for query restartability, and it's a very reliable system.
And as you move to the more interactive part, you go to systems like Presto, and
we also have things that are focused to kind of higher QPS and serving really low
latency queries.
I think that's the spectrum.
Again, that's only talking about analytics.
So if you talk about kind of transactional workload, if you're talking about ML or
real-time data,
like there's another world,
but that would be maybe
a high-level classification.
No, that makes total sense.
And like, just like to help our people
like map these categories,
these three categories
with like some like products out there, right?
So you have systems like Pinot
or Druid or ClickHouse.
And I would assume, and correct me if I'm wrong,
these are like the systems that are closer
to what Cubrick was supposed to be doing.
Correct.
Then you have the systems like Trino,
like Presto or like Snowflake
that are more for longer running queries,
but still like queries where there is a human waiting
in front of the screen, right? And then you have all the batch stuff that might take hours,
in some cases maybe even days, and there you have Spark. And now, okay,
I'm going like to ask a very naive question, but I have like a good reason to ask it like that.
Why do we need different query engines, and why can't we have just one
that covers all the different use cases? Yeah, I think that's a good question. It
goes down to that whole one size does not fit all discussion started, I think, maybe at this point,
20 years ago-ish. But yeah, I think essentially depending on the capabilities,
depending on what you need,
what are the capabilities you need from your engine,
you need to assemble your query engine in different ways.
So for example, like if you need something with very low latency,
you probably want to propagate your query to less servers.
You want to have things maybe pre-calculated.
Maybe a good example is if your query is really short-lived
and something fails in the middle of your query,
it's probably cheaper to just restart your query
than trying to just kind of save intermediate state, right?
While this is very different from if you're running a query
that takes, I don't know, an entire day and something failed,
you need to have ways to restart just that part of your query.
So different systems, depending on the capability, depending on your requirements, they need
to be kind of organized and assembled in different ways.
There are always discussions about HTAP systems and systems that could magically kind of cover
most of that.
There's some projects that are even kind of successful in the industry, but I think at
least for us at scale, we don't see anything that can kind of work at our scale and still fit all those requirements.
I think the very interesting question, and that's how we got to Velox, is that you do
have all those different engines, but if you look at them, how different they actually
are, right?
Well, the first good example, and where we started, is if you
look at Presto and Spark. Like you mentioned, Presto is definitely
more focused to lower latency queries, more ad hoc, while Spark is just for those really
large batches.
But if you look at the code that actually execute the database operations, they are
the same.
There's nothing different about the way you execute hash joins or the way you execute
expression.
Of course, one of them is vectorized.
The other one is a little kind of more closer to code gen,
but essentially kind of the semantic of those operations
are always the same.
So I think there was a lot of questions that we,
or maybe the discussion we brought was that,
yes, the engine needs to be different
because the way you organize or query execution
requirements are different,
but the code you execute can be shared,
or at least a lot of the code can be shared.
And that's how we got to Velox.
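To make the idea concrete, here is a minimal sketch (in C++, and not the actual Velox API) of what "sharing the code that executes database operations" can look like: the operators are written once against a small push/pull interface, while each engine keeps its own scheduling, fault tolerance, and memory management around them.

    #include <cstdint>
    #include <utility>
    #include <vector>

    // Toy column batch; a real engine would use columnar vectors with encodings.
    struct RowBatch {
      std::vector<std::vector<int64_t>> columns;
    };

    // Minimal operator contract: a hash join, filter, or aggregation written once
    // against this interface could be driven by a long-running, latency-oriented
    // worker (Presto-style) or by a task launched per query stage (Spark-style).
    struct Operator {
      virtual ~Operator() = default;
      virtual void addInput(RowBatch batch) = 0;  // engine pushes input batches
      virtual bool hasOutput() const = 0;         // engine polls for results
      virtual RowBatch getOutput() = 0;           // engine pulls output batches
    };

    // Example: a trivial pass-through operator implemented once, reusable anywhere.
    struct PassThrough : Operator {
      void addInput(RowBatch batch) override { pending_.push_back(std::move(batch)); }
      bool hasOutput() const override { return !pending_.empty(); }
      RowBatch getOutput() override {
        RowBatch out = std::move(pending_.back());
        pending_.pop_back();
        return out;
      }
      std::vector<RowBatch> pending_;
    };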
All right.
That makes a lot of sense.
You mentioned an interesting distinction in terminology here.
You mentioned vectorized versus code gen, right?
Tell us a little bit more about that.
And the reason I'm saying that is because I'm aware that there are two distinct,
let's say, main categories of how an execution engine is operating or architected.
But I'm not sure that many people know about that. They think of vectors and might think
of many different things, or code gen with, again, many different things.
So let's get a little bit more into that. And that is also going to give us, let's say, the stepping stone to get into Velox, right? So what's the difference between the two?
Yeah, I think it's essentially how you process your data. I think usually, when we talk about vectorization, the assumption is that
you take entire batches of data. So you're not executing operations against one single row, right? So you have a batch of, say, 10,000 rows,
and then you apply one operation at a time over this entire batch. And then, I mean, there are a
lot of trade-offs, but it's usually a little better suited to CPUs because, well, the prefetching is
better, your memory locality is better. So usually executing those operations against the batch,
like you can amortize a lot of this cost.
I think where things get tricky is that if you cannot batch records
or create batches that are large enough, then in a lot of cases you have a bunch of
operations that you want to execute against a single record, so a lot
of the overhead you have with vectorization really adds up.
So I think what we see is that if you're looking at analytics, we're usually processing a lot of data.
Vectorization might be a better candidate just because you always have large batches.
So essentially the overhead of finding what's the operation you need to execute on that batch, it gets amortized because the batch is large enough. But in cases where your batches are smaller, and I think there are
probably some disagreements in the community on where exactly
this line is drawn. But usually, in real life, where things
are a little more transactional,
right,
so you have lots of operations that you want to apply over
one record or a few records,
then code gen might be a better solution, because you essentially take
all the code and,
at execution time, you generate some assembly code that can execute those operations
more efficiently.
And when we started those things, there was a lot of discussion on what would be the best
kind of methodology to use for Velox,
how all the algorithms and operators are created.
We have been experimenting with CodeGen as well.
One of the tricky parts of CodeGen is always that there's some delay and there's some overhead in generating this code at execution time, so it's always like, you
know, you have some gains, but then, does the compilation time really offset that?
And given at least our experience, like usually vectorization
still was a clear winner.
Another discussion there is always just the kind of complexity aspect of kind of generating code at runtime.
It might get things kind of trickier to debug.
It may make things harder to develop.
There was some recent work in the community that actually suggests that code gen might not be as terrible as it seems,
like how to make it easier to develop code-gen-based engines.
But I think our experience with that for analytics
vectorization was still the clear winner.
I think even if you look at the industry,
most of the engines focused on analytics,
they do vectorization, which is this idea
of just executing the same operation
over large batches of data.
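As a rough illustration of that batch-at-a-time idea (a toy example, not Velox code): the row-at-a-time path pays interpretation overhead on every value, while the vectorized path applies one operation to a whole batch in a tight loop that the CPU can prefetch and the compiler can often auto-vectorize.

    #include <cstdint>
    #include <vector>

    // Row-at-a-time: the "which operation am I running?" overhead (dispatch,
    // branches) is paid once per row.
    int64_t evalOneRow(int64_t a, int64_t b) {
      return a * 2 + b;
    }

    // Vectorized: the same expression is evaluated over an entire batch, so the
    // dispatch overhead is amortized across, say, 10,000 rows, and the tight loop
    // is friendly to prefetching, caching, and SIMD.
    std::vector<int64_t> evalBatch(const std::vector<int64_t>& a,
                                   const std::vector<int64_t>& b) {
      std::vector<int64_t> out(a.size());
      for (size_t i = 0; i < a.size(); ++i) {
        out[i] = a[i] * 2 + b[i];
      }
      return out;
    }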
Yeah, that makes a lot of sense.
And I'd like to make something clear here,
because I think people might also think
that when we're talking about vectorization,
it's always connected to specific instruction sets
that some CPUs have, like AVX512 or something like that.
But actually, vectorization is much more fundamental than that.
Yes, you can have acceleration because of these instruction sets,
but you get benefits
anyway even without the instruction sets. It's a whole architecture. It's not just
having special hardware that can accelerate some specific things.
Correct? I'm not confusing people more, but there's a lot of confusion out there,
and I would like to make sure that we make things clear.
No, exactly.
I think that's a great point.
When we talk about vectorization, that doesn't necessarily mean SIMD instructions.
I think the point is just designing your database operations in a way that SIMD can be leveraged to execute those things efficiently. There's always a discussion on whether you do this explicitly, or whether you let the compiler implicitly discover that a pattern can be translated to SIMD instructions. But basically what you said, I think it's maybe a little higher level.
It's more, for example, if you're executing hash joins and you need to look up in hash tables: whether you take one record and then you do the entire lookup in your hash table, or whether you first calculate all the hashes and then you kind of prefetch things.
It's more at that kind of higher level, how you organize the operations you're executing. SIMD is part of that, but it's not only that.
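A hypothetical sketch of that hash-lookup example: rather than hashing and probing one record at a time, a vectorized engine can hash the whole batch in one pass and probe in a second pass. Real engines use custom hash tables and issue memory prefetches between the passes; std::unordered_map here is only a stand-in and cannot reuse the precomputed hashes.

    #include <cstdint>
    #include <functional>
    #include <unordered_map>
    #include <vector>

    // Pass 1: compute all hashes for the batch; pass 2: probe the table.
    // A real join would reuse the precomputed hashes and prefetch the target
    // buckets between the two passes.
    std::vector<const int64_t*> probeBatch(
        const std::unordered_map<uint64_t, int64_t>& table,
        const std::vector<uint64_t>& keys) {
      std::vector<size_t> hashes(keys.size());
      for (size_t i = 0; i < keys.size(); ++i) {
        hashes[i] = std::hash<uint64_t>{}(keys[i]);
      }
      std::vector<const int64_t*> out(keys.size(), nullptr);
      for (size_t i = 0; i < keys.size(); ++i) {
        auto it = table.find(keys[i]);
        if (it != table.end()) {
          out[i] = &it->second;
        }
      }
      return out;
    }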
Yeah, yeah, 100%. And okay, before we get more into Velox, can you give us an example of a database? I mean, Velox obviously
is vectorized, but another one that is, let's say, not a library but a whole database
that is vectorized, so people can relate to that.
And on the other hand, again, a popular system that is code gen and people might not know
that it is code gen. Yeah. I think Presto is a good example of a vectorized engine.
I think that the vectorization that Presto does and Velox does is not exactly the same,
but I think the paradigm is the same.
Photon, at least based on the public information
they have in their paper, it seems like everything is vectorized.
I have the impression that Snowflake is also vectorized
just based on the people who created them,
the system they worked before.
But I think that there's not as much information.
But I would say that most of the analytics engines are focused on, are essentially vectorized.
Most of the systems essentially coming from kind of the German universities, that's
kind of a big shop
of code gen.
So the Umbra system, I think, is maybe the newest system they have been working on, which
is all based on code gen.
Before that, they had the Hyper system, which I think at some point was acquired by either Salesforce or SAP.
So that was also like one of the kind of really successful code gen systems.
I think that those maybe are the best examples.
I know that Redshift does some code gen as well.
But I think that's kind of maybe the first one that comes to mind.
Yeah, I think DuckDB is also vectorized.
DuckDB for sure is vectorized, yeah.
Yeah, and I had the impression that Spark, I mean, not with Photon,
it was code gen.
But I might be wrong here, but I had the impression that it was still... I mean, if you don't use Photon, it is code gen, but...
Yeah, I think you're right. I'm just not sure, because I know that we do have some internal changes to Spark as well.
I don't know if the code gen is something that we added or we improved. But definitely the default kind of Java, Scala eval
is not vectorized, it's based on code gen.
Yeah, yeah. Which kind of makes
sense because vectorization, I think in the JVM
space is a little bit more challenging
from what it seems like in some cases.
And a quick question here
to get into Velox. So Velox is,
by the way written in C++,
right? So
it diverges a lot from the paradigm of these systems like Spark and Presto
that were JVM-based systems, right?
Why is that?
That's the first question.
And the second question is, how do you bridge the two?
Because it sounds interesting
to go and be able to take Presto
or take Spark and change the execution engine
and put there something that is written in C++
and have the gains that you might have.
But how trivial is it to do that, right?
Yeah, good question.
So when we started Velox,
I think there was maybe two different discussions.
Like one of them is what you mentioned,
just how we would write that and which language.
So that actually started inside the Presto team inside Meta.
So it was a couple of people
just looking at accelerating Presto, right?
So what are the optimizations you can do on the Java stack?
And then I think at some point,
like this idea came up of
if we actually rewrite those things in C++,
like would they be more efficient?
So at that time,
there was some smaller benchmarks
or micro benchmarks,
just kind of compare how efficiently,
like if you just look at the very hot loops,
and I think one of the things
or the areas where we started
was essentially table scans,
because that's where most of the CPU was going.
So if you look at the main loops inside table scan, and you just kind of map those things
into C++, or what we call native code, how much faster that would be.
I think we got some good data that at least for those smaller micro benchmarks, there
was some performance gain, I think it was a few X from 4X to 10X, depending on what
you do.
So there was a good performance gain by just kind of moving things
from Java to C++.
I know the Java lovers will probably yell at me.
Of course, there's a lot of kind of things that you can stretch
the JVM to do things even more efficient, but then you kind of start
losing a lot of kind of the good benefits that the JVM and Java,
some of the guarantees that the language gives you.
So you can kind of stretch Java in awkward ways,
and then it can bridge some of that.
But then I think it just came up that writing those things in C++
in the first place was just kind of a better idea performance-wise.
So that was one aspect of the discussion.
The second one was that we didn't want to do,
or when we started writing some of that,
the idea is that we didn't want to do that on a per-engine basis.
So I think usually if you're looking at just Presto or just Spark, it's one discussion.
For us, we have 10 to 20 different engines, right?
So you don't want to have to do that on a per-engine basis.
So I think that's when the idea of like, if we need to rewrite those things from Java to C++, that's a lot of work.
Like it's almost 10 years of work.
But if we want to do that, we want to do that once,
and then reuse it in all the different engines. Not just in Presto and Spark,
but we're using that in stream processing.
We want to reuse that in machine learning for processing, for training,
for serving and inference.
We want to reuse that for data ingestion.
We want to reuse that in as many vertical or different use cases as we can. And then creating that in C++ just makes this a lot
easier. At least for us, the whole ML stack is all C++, so it's not like you could reuse a Java
library in that world anyway. So I think just C++ was a better choice at that point.
I think to your second question about how we plug those things together,
it's interesting.
It also depends on exactly how you're integrating
or maybe how deep you want the integration with the engine to be.
Specifically for Presto, there was a discussion
on whether you just replace the eval code by C++,
and then in that world, you need to have some sort of inter-process communication
or you need to use JNI.
And then there was questions about, you know, how much complexity would JNI add?
There's also discussions about memory management,
like how much memory you give to the C++ side of things
and how much memory the JVM would need.
So what we ended up deciding is that specifically for Presto,
we removed Java
completely from the workers.
So essentially, Presto has this
two-tiered architecture where you have
one or a few coordinators that take a
query, parse a query, optimize it,
generate query fragments,
but then the execution of those fragments
they're done
in worker nodes.
So what we do is that on those worker nodes,
we completely replace the Java process.
So there's no Java running that.
Most of the execution of the operation,
they're done by Velox,
but there's also kind of a thin C++ layer
that communicates,
essentially implements the HTTP methods
that communicate with coordinator.
So that's the project we call Prestissimo.
So Prestissimo is essentially
this thin layer around Velox. That's how we do it specifically inside Presto.
That's how we go between languages, from Java to C++. We do that via kind of HTTP REST interfaces.
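A conceptual sketch of that split, with illustrative endpoint and type names rather than the exact Presto protocol: the Java coordinator keeps parsing, optimization, and fragment generation, and the C++ worker exposes the HTTP task interface the coordinator already speaks, handing execution to Velox.

    #include <cstdint>
    #include <string>

    struct TaskUpdateRequest { std::string serializedPlanFragment; };  // from coordinator
    struct ResultsPage       { std::string serializedRows; bool done = false; };

    // Thin C++ layer of a Prestissimo-style worker (names are illustrative).
    class WorkerHttpService {
     public:
      // e.g. POST /v1/task/{taskId}: create or update a task from a plan
      // fragment, then start a native (Velox) task to execute it.
      void onTaskUpdate(const std::string& taskId, const TaskUpdateRequest& req) {
        (void)taskId; (void)req;  // would deserialize the fragment and launch execution
      }

      // e.g. GET /v1/task/{taskId}/results/{bufferId}: stream pages of results
      // back to the coordinator or to downstream stages.
      ResultsPage onGetResults(const std::string& taskId, int64_t sequence) {
        (void)taskId; (void)sequence;
        return ResultsPage{};  // would pull output batches from the running task
      }
    };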
And is there, like with Spark, for example,
and like, okay, the reason I'm asking is,
seeing also like the project Gluten,
which is like attaching,
not the, I mean, Gluten is like more generic,
but tries like to create like a layer there
to put like different execution engines like on Spark.
And I also had the impression that at least the Photon folks
were talking about JNI, and also the context switching that happens there,
and what kind of overhead this also adds, and do you
gain at the end or do you lose?
Like, so from what you've seen so far, because okay, like traditionally
Java to communicate with other systems, you
use JNI, right?
Does it make sense to do that with a system that's, okay, I mean, you put all this effort
to write in C++ at the end to go and add benefits in performance.
You don't really want to lose any of that benefits, right?
So based on your experience, what have you learned around that?
Yeah, I think that's a good question.
I think for the JNI discussion was less about performance.
I think the assumption is that you cross the API boundary very rarely.
So I think the idea is that you would go from the Java world into C++
and cross the JNI boundary just once.
And then all the heavy lifting operation would happen just inside C++.
So we wouldn't transfer...
Another way of saying that is that we wouldn't transfer data back and forth between C++ and Java.
So all the heavy operations would be on the C++ process.
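A hedged sketch of what "crossing the JNI boundary just once" can look like; the class, method, and task names are made up for illustration, not Gluten's or Velox's actual interface. The JVM hands over a serialized plan a single time and gets back an opaque handle, and the per-batch heavy lifting then stays entirely on the C++ side.

    #include <jni.h>
    #include <cstdint>
    #include <utility>
    #include <vector>

    // Hypothetical stand-in for a native execution task.
    struct NativeTask {
      explicit NativeTask(std::vector<uint8_t> plan) : plan_(std::move(plan)) {}
      bool next() { return false; }  // would produce the next output batch natively
      std::vector<uint8_t> plan_;
    };

    // Called once per task: copy the serialized plan across JNI and return an
    // opaque handle; no row data crosses the boundary after this point.
    extern "C" JNIEXPORT jlong JNICALL
    Java_example_NativeEngine_createTask(JNIEnv* env, jobject, jbyteArray plan) {
      jsize len = env->GetArrayLength(plan);
      std::vector<uint8_t> bytes(static_cast<size_t>(len));
      env->GetByteArrayRegion(plan, 0, len, reinterpret_cast<jbyte*>(bytes.data()));
      return reinterpret_cast<jlong>(new NativeTask(std::move(bytes)));
    }

    // Called per batch, but only a small status flag crosses back to Java.
    extern "C" JNIEXPORT jboolean JNICALL
    Java_example_NativeEngine_hasNext(JNIEnv*, jobject, jlong handle) {
      return reinterpret_cast<NativeTask*>(handle)->next() ? JNI_TRUE : JNI_FALSE;
    }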
I think the discussion was a little less about just performance and more about simplifying
the stack and resource management. Another big difference is just how Presto and
Spark are organized. Presto is essentially a service. So we have a few workers and those
workers are always running. There's a single process and you share this process among different
queries. So in a way, you want to be very conscious about how much memory you have on the box, how much memory that process is able to use. And then inside this process, you share
memory and memory pools among different queries. While on the Spark side, it's different. Spark
is just like, it's more like a batch job dispatching system. So when you execute a query,
you allocate containers and different pools. And then at least how we use that internally is that you do that.
And that lives while your query is executing.
At that point, you kind of remove that.
Right.
So I think this being able to efficiently share resources and share memory, like it's
a lot more important on the Presto side than it is on the Spark side.
With that said, on the Spark side, we also went through a few
iterations on how those things would be integrated.
There was a first try internally, and I think that was at least two years ago, doing something
similar to Gluten, but instead of doing via JNI, just using a separate process.
So essentially Spark, the Spark driver would create a new C++ process and then communicate
using some sort of IDL, some sort of inter-process communication.
We went from that mode to a completely different project that we can maybe talk in a little more details.
But essentially, the idea was taking the same Presto frontend and the same Presto worker code and executing them on top of the Spark
runtime, which is something we call Presto on Spark. So I think that was kind of our strategy
because the integration work would be a little less, right? So we just needed to integrate
Velox into Presto. And then if we're using the Presto engine on top of Spark, then the
integration cost would be easier. So that's how we use things internally.
And then on the community, as you mentioned,
like Intel started and created this project called Gluten,
which is a little, in a way, similar to Photon in the sense that it uses JNI to communicate with Velox.
And then the API is based on Substrait and Arrow.
So I think that one is closer, but we don't use that internally,
but we've been working closely with Intel.
And I think in a way,
it's a good way to look at Gluten
as sort of like being a JNI wrapper around Velox
that you can integrate that into Spark.
But I think the project is generic enough
that we can reuse that in other Java engines as well,
like Flink, like even Trino and anything that uses Java.
Yeah, that's interesting.
Okay, what about Arrow, though?
Because one of the promises of Arrow is that we agree upon a format on how to represent things in memory.
And this can be shared among like different like processes and like all
these things, which obviously like, okay, sounds like very interesting
because when you're working with a lot of data, like the last thing that
you want to do is like copy data around, right?
So that's always like a big overhead.
So where does like Arrow fit?
I know that Velox, for example, is inspired from Arrow,
the way that the layout in memory is designed.
But it's not Arrow.
It's not exactly like implementing Arrow.
So how do you see the projects there, let's say, work together?
And what's your prediction for the future in terms of the promises of Arrow and the
role of Arrow in the industry?
Yeah, that's a good question.
There's some background to that as well.
Yeah, I think, like I said, Arrow is this columnar layout
that is meant to, you know,
it's a standard in the industry.
And then I think the whole idea
is that systems would reuse that layout
and then you can move data,
zero copy across engines.
So when we started Velox,
I think, of course,
the idea was just kind of
reusing that same layout
because, well, it makes,
it would make it a lot easier
to integrate with different engines.
What we started seeing
is that there were a few different places where we saw that if we made some small changes to that layout, we could implement execution primitives more efficiently.
So then there was a lot of discussion on whether we would stick to the layout at the cost of maybe losing performance in some cases, or whether we would extend Arrow in a way
and implement kind of our own specific memory layout
in some cases, just so we could execute
some operations more efficiently.
And I think because for Velox we really care
about efficiency, we decided to go with kind of extensions.
So that's how we kind of created
what we usually call Velox vectors.
It's essentially the same thing as Arrow in most cases.
The only difference is how we represent strings is slightly different,
and how we represent kind of vectors and, sorry, arrays and maps.
There's some differences on how we do that.
And we also add some more encodings that were not,
at least at the time, were not available in Arrow,
like a constant encoding, RLE, and frame of reference. But then we decided to extend that and in a way it deviates from the Arrow standard.
So that's how Velox started, and all data represented in memory follows the Velox
vector model. Then, because many engines integrating with Velox
would like to produce and consume Arrow, there's a layer that can do this conversion, which is,
like I said, in most cases, zero copy.
But if you hit one of those cases where we represent things differently, then it incurs
an actual copy.
So this is how the project started.
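For a flavor of the kind of layout difference being described, here is a simplified sketch of the string case; the byte layouts are approximate and not meant to match Velox or Arrow exactly. Classic Arrow varchar stores one big data buffer plus an offsets buffer, while a Velox-style string view keeps every element as a fixed 16-byte entry, with short strings inlined and a prefix cached for cheap comparisons.

    #include <algorithm>
    #include <cstdint>
    #include <cstring>
    #include <string>
    #include <vector>

    // Classic Arrow-style varchar column: element i lives at
    // data[offsets[i] .. offsets[i+1]) in one contiguous buffer.
    struct OffsetStringColumn {
      std::vector<int32_t> offsets;  // size N + 1
      std::vector<char> data;
    };

    // Velox-style fixed-width "string view": short strings are fully inlined,
    // long strings keep a 4-byte prefix plus a pointer to the payload.
    struct StringView16 {
      uint32_t size = 0;
      char prefix[4] = {};
      union {
        char inlined[8];
        const char* data;
      };
    };

    StringView16 makeView(const std::string& s) {
      StringView16 v;
      v.size = static_cast<uint32_t>(s.size());
      std::memcpy(v.prefix, s.data(), std::min<size_t>(4, s.size()));
      if (s.size() <= 12) {
        if (s.size() > 4) std::memcpy(v.inlined, s.data() + 4, s.size() - 4);
      } else {
        v.data = s.data();  // references the caller's buffer, no copy
      }
      return v;
    }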
Because Velox was gaining adoption, was gaining visibility, we also started working with the Arrow community just to kind of understand those differences and see if that's something that the Arrow community would be interested in extending the standard as well.
We started working with a company called Voltron Data that was also interested in Velox, and they do a lot of good work in the Arrow community.
So what we ended up doing was actually proposing to the Arrow community to
add those Velox extensions to the Arrow specification. I think at some point we were
able to convince them that those things are actually useful, and not just for Velox,
that some of the other compute engines also do things in a similar way. So some of those
changes are already adopted and already available in the new version of Arrow. There's one last
difference that we're still having some discussion, but hopefully
we'll make it to the next version of Arrow.
So essentially, I think long story short, like we, when we started, it was sort of
a fork or like, we called it an Arrow++ version, but then we were working with
the community to, again, to kind of contribute those changes and then hopefully
align, and then the idea is, of course,
that we can ship data back and forth with Arrow
without any performance loss,
making everything zero copy.
Okay, that's awesome.
So you mentioned a couple of folks out there
who are using Velox.
And who else is using Velox outside of Meta?
It is like a pretty mature open source project right now.
If someone goes there, you can see a lot of contributions happening
from people also outside of meta.
So what's the current status of the open source project?
Yeah, that's a good question.
I think that's one area that we are actually really happy on
how much attention and how much visibility we're getting.
I think part of our assumption on open sourcing Velox is that, because in a way,
the audience is very limited. It's not something that end users would download and use. In a way,
it's very different from DuckDB. DuckDB must have millions of users because data scientists,
data engineers, they actually use that. I think for us, our audience is just developers
who actually write compute engine.
So in a way, I think the group of people
who would be interested on this would be very limited.
But regardless, we saw very kind of quick growth
on just attention and different companies in the industry
were just interested in the project
and wanted to both
start using benchmarking, comparing with version they had internally and then eventually help
us.
There's a list of companies that one of the first ones that joined us really early on
and has been helping specifically from the Presto perspective was Ahana.
So they have a team of people working with us really closely.
They recently got acquired by IBM.
So now IBM, Ahana, they're one of the companies who've been helping us develop and
use Velox, specifically in the Presto context, for a while. Intel was another company that
joined us very early on. They're interested a little more in the Spark perspective, and I think,
of course on the part of making sure that this library runs efficiently on Intel hardware.
So they have a lot of work on the Spark side as well.
So they're the company who created Gluten.
Specifically for Gluten, there's a lot of other companies who joined us afterward and helped develop and are using that internally.
So we heard from Intel, Alibaba, Tencent, Pinterest, a large number of companies who have been looking into that as well.
We've also been chatting with Microsoft about Gluten and some other usages of
Velox there. I think we probably heard and saw people from most tech companies at this point,
with very few exceptions, like some of them already actively using that internally and
helping contributing code, like some of them are just asking questions and trying to benchmark
things. But I think
most of the large tech companies,
we heard them in
some shape or form,
which was really interesting. It was really
validating to our work.
Yeah, 100%. I think it's very rewarding
also for someone
who's leading a project like this.
So, two questions that I have to ask them because I don't want to forget them.
One, the first one: you mentioned at some point that Velox is being used not just
for analytical workloads, but for streaming, and it's used for ML.
And I'd like to ask you,
let's take ML as an example, just because it's the hot thing right now. It helps with marketing.
So, what's the difference?
And the reason I'm asking what's the difference is because
I also try to understand and I want to learn what is also like they have like in common, right?
Because if you can share such complicated components like Velox, which is the execution engine that does the processing of the data, right?
Between ML and analytics, it means that they are different, but there are also some things that are common.
So I'd love to hear that from you.
So please help us learn.
Yeah, no, definitely.
Yeah, I think, like I said, when we started Velox, we started learning and talking to
many other teams.
And I think that's what we started realizing that actually the similarities
between engines developed
for all those different use cases,
they're much higher than what we expect.
And I think the assumption was
that they are just different things
that do various things
in their very particular ways.
But what we found was just,
it was kind of basically the same thing
developed by different groups of people.
And you can see that within analytics,
but you see that even more kind of
evident between analytics and ML just happens that all the kind of data
management systems and analytics were developed by database people and all
the kind of data part of ML was developed by ML engineers.
So of course, there's a lot of things that are very particular to you, to ML,
like how you do the training, like all the bulk synchronous processing you do on that, like all the hardware acceleration.
But there's also the data part on ML.
So there was maybe two main use cases that we've been looking at.
One of them is just what they call data preprocessing.
Because basically, when you want to start your training process, you need to feed lots
of data.
And usually, this data is stored in some sort of storage.
Like for us, we pull data from the same warehouse.
So there's this logic where it's usually not as, of course,
not nearly as complex as a large analytic query.
It's not like you're doing aggregations or you're joining different data sets.
But it's still like you're doing a table scan, you're decoding files.
Like sometimes I think we don't get to do a lot of filtering,
but there's always some expression evaluation. You execute some UDFs. So this part is actually very similar to...
Essentially it's the same. There's nothing different about how you decode those files,
how you open Parquet or ORC files and how you execute expressions. The UDF APIs that machine
learning users use and all the types we provide to them.
So all that part was very similar.
But it was also a discussion that how much we could reuse.
And then what we ended up seeing is that it's just kind of more of the same.
So we had a different library that was also a C++ library
that is heavily used for this specific use case,
like for training when you're pulling this data
and feeding this data into the model.
We had a C++ library internally, and we saw that the implementation of those things, it's
very similar.
So this is one of the areas where we started working with Velox.
The second part was around feature engineering, which is essentially how do you take your
raw data and transform into features.
And then there's a few different parts of that. Like how do you do that on offline systems, essentially you should have
your internal warehouse tables with the raw data, and how do you
run ETL processes that actually generate features that can be used for training.
We also do a similar thing on the real time side.
So there's also real time system that does feature engineering.
And there's also the part of serving features.
So essentially, when a user needs to see an ad,
like when you get the input data,
you need to transform that data really fast
and really low latency to transform that into features
so it can actually do the inference on the model.
So again, those last mile transformations
they need to do from the raw data into features,
again, it's essentially expression evaluation, sometimes a UDF.
So that's the part we're converging.
And again, I think the more we look into that, the more we
saw that it's just essentially more of the
same, just implemented by different
groups of people with different semantics, different
bugs, but they're the same.
And what's the difference?
Just because I'm really curious here.
What's the difference between the offline feature engineering and the data
pre-processing for training that you mentioned because okay, what I understood
by the difference is that data pre-processing is much more let's say like
doing like serialization, deserialization, like more that doesn't necessarily require
like typical like ETL stuff like joins
and things like this, while the feature engineering might be like more complicated, like the type
of processing there.
Yeah, I think on the feature engineering side, you can imagine that the offline, it's essentially
it runs on top of traditional SQL analytic computations.
So basically, you're doing large SQL transformations.
Sometimes you're joining things or executing expression,
but basically you're taking all this raw data
and creating features.
Data preprocessing path is more,
you already have those features created.
You're pulling this data and feeding this
into the PyTorch model or whatever model
you're trying to train.
So in a way, you can see the slightly different parts
of this pipeline.
But also maybe the complexity
and the type of operations you execute,
they're also different, right?
But when you're creating those features,
the type of operations you can run,
they might be a lot more complex.
There's a lot more to it.
There's also shipping data,
different places, doing shuffles
and moving lots of data to different places.
While on the training side, you want to
very efficiently just pull this data
and feed it to the trainers
very fast.
Makes sense.
You mentioned UDFs and some other stuff.
There's a question that I wanted
to ask about Velox.
If someone goes to the documentation of Velox,
they will see that there are two groups of functions there,
one for Presto and one for Spark, right?
And you will see the same functions in a way,
like implemented like twice in a way, right?
Why is this happening?
Like why you need something like that?
Yeah, that's a very good question.
And I think that's where we get to the whole kind of
dialect and semantic discussion. So basically when we created Velox, there was this discussion of
which specific dialect we should follow, right? Because it's not just that the code base are the
same, but for example, what are the data types you support? And when there's an overflow,
what's your behavior? Because different engines, they do that in different ways.
So if you follow the behavior in one engine
and you try to integrate ballots in a different engine,
then you're essentially changing the semantics.
So how exactly you handle operations,
what are the functions you provide,
what are the data types you provide was a big thing.
And what we ended up deciding was that
because we need to use ballELOX across different engines,
VELOX needs to somehow be kind of dialect agnostic.
So what we ended up doing is that the core library or the core part of VELOX,
we tried to make it as kind of engine independent as we can be.
And then you have extensions that you can add to follow one specific dialect.
So I think what you mentioned is that if you're using Velox for Presto, then you're going to use the Velox framework and you want to add all the
Presto functions. And that would be a Velox version that follows the Presto dialect. But
if you try to use that in Spark, that won't work, because Spark is expecting different functions,
different semantics. So there is also a set of Spark functions that you can use. So if you're
compiling Velox for Gluten, for example, you're going to use Velox with the Spark SQL functions.
But the other interesting part is that it lets you kind of mix and match those different things.
So you can potentially have the Spark version of Velox running within Gluten, but also
exposing Presto functions for compatibility. So it kind of enables you to do this sort of thing.
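A toy sketch of that "dialect as a function package" idea. This registry is not the Velox API; Velox ships separate Presto and Spark function packages that an embedding engine registers, optionally under a prefix so both can coexist in one process.

    #include <cmath>
    #include <functional>
    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    using ScalarFn = std::function<double(const std::vector<double>&)>;
    std::map<std::string, ScalarFn> registry;

    // Hypothetical "Presto dialect" package.
    void registerPrestoFunctions(const std::string& prefix = "") {
      registry[prefix + "truncate"] = [](const std::vector<double>& args) {
        return std::trunc(args[0]);  // truncate toward zero
      };
    }

    // Hypothetical "Spark dialect" package with its own semantics; the prefix
    // lets both dialects live side by side.
    void registerSparkFunctions(const std::string& prefix = "") {
      registry[prefix + "round"] = [](const std::vector<double>& args) {
        return std::floor(args[0] + 0.5);  // half-up rounding
      };
    }

    int main() {
      registerPrestoFunctions();         // e.g. a Prestissimo-style worker
      registerSparkFunctions("spark_");  // e.g. a Gluten/Spark build, prefixed
      std::cout << registry["truncate"]({2.7}) << " "
                << registry["spark_round"]({2.7}) << "\n";  // prints: 2 3
    }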
I think that's one of the parts that we saw is it's actually pretty tricky, right? Because the code is the same, but each
engine has its own quirks and its own SQL dialect, and no two databases are exactly the
same. Even if all of them support SQL and all of them claim to comply with the SQL standard,
they're never exactly compatible. They're always different.
So there's also discussion.
What we ended up doing internally inside Meta,
because that's also a problem that we really care, right? So if you're a data user and you need to interact
with five or six different systems,
and each system has a different kind of SQL dialect
with different quirks and different semantics,
it makes our life a lot harder.
So as a separate goal, we also have this idea of converging dialects.
So it would be a lot easier if we had one SQL dialect
that people could use to express their computation,
and that would work just on top of all those different systems,
but not just even for SQL, but potentially if you're creating
your feature engineering pipelines, or if you're doing, you know, if you start training your model, like ideally if the functions you expose,
the data types and semantics were exactly the same as your Presto queries or Spark queries,
like it just makes the user's life a lot more efficient, right?
So I think this is something that doesn't necessarily come from Velox, but Velox allows
you to do that because you have the same library, the same functions,
the same semantic being used
in all those different engines.
We can make the user experience
more aligned.
And this is what makes data users
and data scientists
a lot more efficient.
So that's another thing.
It's a lot harder than just Velox
because, well,
we need to make changes on engines.
There's all sorts of concerns
with backwards compatibility,
but it's something that long-term we're also looking at.
Yeah, I think that's also like my...
How to say that?
Like my...
Not concern, but something that always confused me a little bit
with Substrait.
Because I was like, okay, yeah, sure.
You can take a plan and you can serialize it in Substrait
and then, let's say have like a plugin on Spark
that gets that and like executes it.
But at the end, like the semantics of Spark
might be slightly like different
than like Presto, let's say, for example.
So does Substrait, at the end,
provide the interoperability layer that we would like to have?
Or is it limited, at the end, by how semantics are implemented in each system?
Because, okay, Substrait cannot go and implement that stuff.
It's not its purpose, right?
So what do you think about that?
Because it's like, depending on who I talk with, it's very interesting.
If I talk like with researchers and databases, they're like, doesn't make sense.
If you get to the point of using Substrait, why not just implement the dialect
you want from scratch anyway, if you want to do it right?
Right.
And then you see like people in the industry, they're like much more pragmatic.
I mean, yeah, like they try like to use it.
Right.
But what's your, what's your take on that?
Like specifically for Substrate.
And I understand that like, okay, I don't want you like to be,
it might also be like a little bit political in a way
because like there are like communities that are involved,
but I think it is constructive,
like for the communities and the people involved
like to like share opinions
because these are like hard problems at the end.
There's no easy solution.
Yeah, no, that's a super hard problem.
And I think in a way that's also, as you rightly pointed out, it's sort of a controversial
discussion.
It depends on who you ask.
Different people might have different views on that.
Yeah, because of the work on Velox and because of the work with Arrow, essentially it's the same community, right? So we've been talking to the Substrait community as well about how they're planning to handle that. I think the problem space essentially is that if you want to create a unified IR for systems, which is at least my understanding of what Substrait is meant to do, you need to be able to express the computation without leaving room for interpretation. It needs to be something that captures every single detail of your execution.
And maybe a good example is what I mentioned before. If you have overflows, what's the behavior? Do you throw exceptions, or do you wrap around? This is something that needs to be captured by the IR. Other things as well: if you have arrays, do indices start at zero or at one? So those are just some toy examples.
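To make those toy examples concrete, here is a purely hypothetical sketch, not actual Substrait or Velox code, of the kind of semantic knobs a unified IR would have to carry with every expression so a consuming engine can reproduce the producing engine's behavior exactly.

// Hypothetical illustration only; none of these types exist in Substrait or
// Velox. It just shows the level of detail an IR would need to pin down.
#include <cstdint>

// What happens when integer arithmetic overflows?
enum class OverflowBehavior : uint8_t {
  kThrow,       // fail the query, Presto-style
  kWrapAround,  // silently wrap, Spark-style in non-ANSI mode
};

// Where do array subscripts start?
enum class ArrayIndexBase : uint8_t { kZero, kOne };

// Semantics that would have to travel with every function call in the plan
// for a different engine to execute it with bug-for-bug fidelity.
struct FunctionSemantics {
  OverflowBehavior overflow = OverflowBehavior::kThrow;
  ArrayIndexBase indexBase = ArrayIndexBase::kOne;
  bool nullOnDivisionByZero = false;          // or should it throw?
  bool countStringLengthInCodePoints = true;  // vs. counting bytes
};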
But what we started seeing in practice, as we implement some of those functions, is that there are so many really complicated details. Essentially every single function has all sorts of corner-case behavior. There are bugs. So if you want a bug-for-bug compatible implementation of different things, it's really hard. My understanding is that the Substrait project's perspective is that they're trying to capture all those details inside the IR itself. So essentially, of course, they don't have anything around execution, but the IR would have enough information for you to implement things in a way that the execution is 100% correct. There are questions on whether this is doable or not. There's also the part that, even if you capture every single detail, it is only as useful as the engine is actually able to provide all those different knobs. I think it is an open question.
My personal perspective is that getting an IR that was generated by one system and executing it on a different system, beyond maybe a Hello World example, where I actually have a complicated query with large expressions, functions, UDFs, I hardly see that really working in practice.
But let's see.
I think we have this discussion with the Substrait community
and they have people actually trying to enumerate
all those differences.
Let's see. That's still an open question.
Personally, still a little bit skeptical,
but let's see.
No, it makes sense.
Let's talk a little bit about what's next for Velox.
I mean, it is in a mature state. It's being used by Meta, at the scale that Meta has, and by other companies outside Meta, and it's actively developed, right?
So what's next? What are some exciting things about Velox that you would like to share with the audience out there?
Yeah, no, that's another interesting part. There are a few things that come to mind: things we're working on right now, things that have just started, and what we're looking at in the future. One part, maybe more tactical, is just the open source part.
I think there's a lot of discussion, just because of how fast the project is going and how much interest we're getting from the community, on how we evolve the open source mechanics, the open source governance, how we formalize a lot of those things. This is something we've been actively working on, not just inside Meta but also with external partners: discussions on how the community operates, decision-making, essentially open source governance. It's never an easy problem. So that's the short term, something we're looking at, of course, on top of just continuing to add features, optimizations, and the things we've been doing for a while.
The second part, which we've started exploring and where we're seeing some early results that are super promising, is this idea that now that you have all the different engines using Velox, the same unified library, it gives you a really nice framework to evolve what you have underneath, right? So essentially now, if your hardware is evolving, it's not just
SIMD, but if you have other
sort of accelerators,
you have GPUs, FPGAs,
or essentially if you want
to optimize your code
to make better use
of hardware as hardware evolves,
you couldn't do that before
because you had to do that
once per engine.
But now that we unify that
and you have a single library,
this framework is really compelling.
So we started investigating what it would take, essentially, what frameworks and what APIs we need to provide in Velox to be able to leverage GPUs, FPGAs, and other accelerators. So there's a lot of exciting work in that direction. We've been partnering with some hardware vendors as well. I think, of course, for hardware vendors it's also a very compelling platform: if I just integrate my accelerator into Velox, suddenly I can accelerate Spark queries, Presto queries, stream processing, and whatnot. So we're seeing a lot of interest in the project from the hardware vendor perspective.
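As a rough, purely hypothetical illustration of why a single shared library is such an attractive integration point for accelerator vendors (this is not Velox's actual extension API, just the shape of the idea): one offload hook implemented once, picked up by every engine built on the library.

// Hypothetical sketch only; Velox's real extension points look different.
#include <memory>
#include <vector>

// Stand-ins for the shared library's plan and operator abstractions.
struct PlanNode {};
struct Operator { virtual ~Operator() = default; };

// A hardware vendor implements this once against the shared library...
struct AcceleratorAdapter {
  virtual ~AcceleratorAdapter() = default;
  // Return a GPU/FPGA implementation for this node, or nullptr to fall back
  // to the default CPU operator.
  virtual std::unique_ptr<Operator> tryOffload(const PlanNode& node) = 0;
};

// ...and every engine on top (Presto, Spark via Gluten, stream processing)
// picks up the acceleration without any per-engine integration work.
std::vector<std::unique_ptr<AcceleratorAdapter>>& adapters() {
  static std::vector<std::unique_ptr<AcceleratorAdapter>> registry;
  return registry;
}

void registerAdapter(std::unique_ptr<AcceleratorAdapter> adapter) {
  adapters().push_back(std::move(adapter));
}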
We have some really nice initial results on just how much of your query plan you can delegate and execute efficiently on, for example, GPUs. We've also been working with companies that rely on accelerators based on FPGAs for now, and we're seeing a lot of good traction and initial results. So that's one area we're really investing in. Of course, we care about this at Meta, but it's also something where we see a lot of interest from the community, both from hardware vendors and from cloud vendors who want to enable different types of hardware. So this is another area we've been working on. Lastly, there's also a discussion we started, which is related to Velox but maybe plugs in a few different ways. It's just
the evolution of file formats. I think there's a lot of discussions
in the research community. I think even at VLDB a couple of months ago, there were two papers comparing Parquet and ORC and pointing out some of the flaws. So there's a discussion on, can we do better? What are some of the flaws? Of course, not just flaws in the design; the use cases we have and the access patterns to those files are also evolving. And those files, again, both of them were developed 10 years ago.
But there's a lot of discussion on how much or what we could change.
Internally inside Meta, we have a project called Alpha, which is a new file format. It's a little more focused on ML workloads.
So we also have been partnering with some other people from both academia and the industry
and trying to understand, is there anything better?
What comes after Parquet?
Like I mentioned, it's a little orthogonal to Velox, but it's related in the sense that Velox will be able to encode and decode those files as well.
Yeah, a hundred percent. I mean, at the end of the day, a big part of executing analytical workloads is reading from these files, right? And usually you read a lot of data. And I remember, because that paper you mentioned was very interesting, it was very interesting to see how much room is left in terms of being able to saturate S3, for example, when you are reading data from there.
So there's a lot that can be done there, for sure. But Parquet has been around for a long time; it's very foundational to whatever we do out there. So how do you see this happening, actually? How can we move into updating Parquet in a way that is not going to break things, and in some cases also, let's say, allow use cases that might be almost, how to say it, not compatible with the initial idea of Parquet?
And I have something very specific in mind about that, because you mentioned ML. I think one of the interesting things with ML when it comes to storage is that the columnar idea, where you take the whole column and operate on it, still applies, but there are also cases where you need more than that. You might need point queries too, or you might need different points in time for the features that you have, for example. All these things in ML, and AI as an extension, are important, but the formats were not designed for that stuff; those use cases didn't exist 12 or 13 years ago. So how do you do that? How do you iterate on these things when they're so foundational, right?
Yeah, no, I think absolutely.
That's the interesting question. There are basically two ways of going about it. We can try to evolve Parquet and just make backward-compatible changes.
Personally, I'm still skeptical of that.
And there was also Parquet v2, which is still an interesting process. But I think a lot
of the changes that we think that we might need to make, I'm not sure if we can make them in a
backwards compatible way. So I think there's a discussion of whether we can actually extend
things and make this kind of somehow Parquet compatible, or if we would need a completely
different file format. I think, going back to the discussion about ML, the first thing we saw that just breaks is that in a lot of those ML workloads you have thousands to tens of thousands of features, right? So if you want to map those things to a Parquet file, there are all sorts of inefficiencies, because it was not designed to have that many streams and columns. So this is one of the things we've been looking at. And like I mentioned, the Alpha file format we have addresses this specific pain point. But there are also discussions on
hardware acceleration. If we're assuming that GPUs are becoming even more pervasive, a lot of this computation will be offloaded to GPUs at some point. It means that table scans will be delegated to GPUs, and then your Parquet decoding will also be executed on GPUs. Another thing we've been seeing is that Parquet was also not designed with that use case in mind, and there are a lot of inefficiencies that make it harder, or at least not as efficient as it could be, when you map things to GPUs. And even for things like SIMD, there are a lot of encoding algorithms in Parquet that are not easily parallelizable.
So there's a lot of discussion on, if we had to design a new file format given those new requirements, supporting many different streams, but also encodings that can be executed efficiently, not just SIMD-vectorized but also hardware accelerated, what would that format be? And I think what we're seeing is that it could be fairly different from Parquet.
This is still an open question, like super
interesting research problem, let's
see. But
I think that's another very interesting area, and we see a lot of interest from basically all the partners and all the different companies we talk to; they all have this problem. Yeah, Parquet
is great, but we can
probably do better at this point.
Alright, one last question for you, because we kept going back to new hardware and new workloads. One of the things I find very interesting, that I'm thinking a lot about, and that I'd love to hear how you think about, is the following.
When it comes to working with data, we have two very distinct, let's say, models of computation, right? We have the relational model that we are using, all these operators that we have in the database systems, including Velox, right? Which is a huge foundation of what we have achieved as an industry, from when it was introduced, like 40 or 50 years ago, until today. And then we have all this new stuff with AI and ML, and of course we are also using different hardware there, and we have the tensor model of processing data, right?
And I mean, okay,
both are processing data,
but they are not exactly the same thing.
But we need both in a way.
We need somehow to bring
these two together.
Do you see a path
towards that without
having to abandon one of the
two? Or how
do you think about this problem?
Yeah, that's an excellent question. I don't know if I have a very good answer, but I think
what I would say is that I think if you look at the operations, they're all data operations.
Some of them were implemented by database people, like vectorization, joins, and expression evaluation, and a lot of them are just matrix multiplications and things implemented by AI people. So in a way, philosophically, you're getting data, shuffling it around, and executing transformations, right? I think it's just that the data layout of those things is all very different. I think one way to think about this is that all the data we have in the warehouse, and all the kind of database computations, run on top of Arrow, essentially, while the ML step runs on top of tensors.
So I think the way we're thinking about this, for now at least, is that it's less about unifying those two worlds and more about having efficient ways to pull this Arrow data and transform it into tensors. There's also a discussion about whether we should add support for tensors in Velox as well, so it could actually support tensors as a data type, and then you could potentially have tensors exposed inside Spark or inside Presto as data types as well.
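As a small illustration of what "pulling Arrow data and transforming it into tensors" can mean, here is a minimal sketch using stand-in types rather than the real Arrow or PyTorch C++ APIs: a fixed-width, null-free column is already a contiguous buffer that a tensor can view without copying.

#include <cstddef>
#include <vector>

// Stand-in for a fixed-width, null-free Arrow-style column: one float per
// row, stored contiguously.
struct FloatColumn {
  std::vector<float> values;
};

// Stand-in for a 2-D tensor view: a shape plus a pointer into existing memory.
struct TensorView {
  const float* data;
  std::size_t rows;
  std::size_t cols;
};

// Zero-copy "conversion": reinterpret N columnar values as an N x 1 tensor.
// Real Arrow-to-PyTorch pipelines do essentially this when the layout allows
// it, and fall back to copying when nulls or encodings get in the way.
TensorView asTensor(const FloatColumn& column) {
  return TensorView{column.values.data(), column.values.size(), 1};
}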
So how much and how deep this integration goes, I think is still an interesting
open question. I think if we were to redesign and rewrite the entire stack from scratch,
I think probably things would look very different. But just the fact that we have those two very mature worlds, one coming from PyTorch and another one coming from analytics, essentially databases, I think makes it harder. I know there's some very interesting work from Microsoft actually doing the opposite, trying to reuse ML infrastructure
to run databases.
And I know that they actually have been getting some pretty interesting results.
I don't know.
Let's see how all of this unfolds. But I think for us, at least having good ways to interoperate data between those two worlds, getting data from the analytics world, transforming it into tensors, and handing it to PyTorch, that's probably what we care about most in the short term.
But I do see that there's more potential for integration on that.
I think, like I said, I think it's just things that were created by different people and
ended up being unnecessarily siloed.
100%. Okay.
Pedro, unfortunately, we don't have more time.
I'd love to continue.
I think we could be talking for another hour at least.
Hopefully, we'll have the opportunity to have you again in the future.
There are so many things happening with Velox. And I think Velox as an open source project gives a very interesting view of what is happening in the industry right now. That's why I'm so excited when I have people like you on. It's not just the technology; it's also getting a glimpse of what is coming in general. Right. So hopefully we will have you again in the future.
Thank you so much. I've learned a lot, and I'm sure that our audience is going to learn a lot too. And if someone wants to know more about Velox, or wants to play around with Velox, or, let's say, wants to get their hands on it, what should they do?
Yeah, so first of all, thanks for the invitation. It was a great chat, a lot of fun to just nerd out about databases. I'm super excited to be here. If anyone in the audience is interested in any of that, about Velox, we usually just ask folks to reach out on the GitHub repo. Just go there, create a GitHub issue, reach out to some of us. There's also a Slack channel that we use to communicate with the community, so just send me or someone from the community an email and we can get you added. Just reach out if you have any questions, if you're planning to use it, or if you want to learn more about something. We're always pretty responsive on GitHub and Slack, so just send us a message or send me an email.
Thanks again, Costas, for having me. It was a lot of fun.
Yeah, likewise. We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week.
We'd also love your feedback. You can email me, Eric Dodds, at eric at datastackshow.com.
That's E-R-I-C at datastackshow.com. The show is brought to you by Rudderstack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.