The Data Stack Show - 166: Data Processing Fundamentals and Building a Unified Execution Engine Featuring Pedro Pedreira of Meta
Episode Date: November 29, 2023

Highlights from this week's conversation include:
- The concept of composable at a lower level of data infrastructure (1:28)
- New architectures and components that allow developers to build databases (3:44)
- Pedro's background and experience in data infrastructure (6:18)
- The spectrum of latency and analytics (12:59)
- Different query engines for different use cases (16:32)
- Vectorized vs code gen data processing (19:33)
- Vectorization and code generation (21:21)
- Examples of vectorized engines (24:33)
- Rewriting the execution engine in C++ (27:22)
- Different organization of Presto and Spark (33:17)
- Arrow and its extensions (37:15)
- The similarities between analytics and ML (44:33)
- Offline feature engineering and data preprocessing for training (48:00)
- Dialect and semantic differences in using Velox for different engines (50:01)
- The convergence of dialects (52:23)
- Challenges of Substrait and semantics (53:18)
- Future plans for Velox (58:09)
- The discussion on evolving Parquet (1:03:38)
- The integration of the relational model and the tensor model (1:07:29)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Discussion (0)
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by Rudderstack, the CDP for developers.
You can learn more at rudderstack.com. Costas,
this week's conversation is
with Pedro from Meta
and wow,
what a lot to talk about.
Velox, of course,
is a huge topic
and
his work on that and
usage of that inside of Meta,
which is fascinating.
It's an execution engine that does a ton of stuff inside of Meta.
But Pedro's really an expert in so many different things.
Databases, you know, sort of architecture of data infrastructure.
So much to talk about. The thing that I'm really interested in is this concept of composable has
been a marketing term in the data space as it relates to sort of, let's say, like higher level
vendors that you would purchase to, you know, sort of handle data in an ingestion pipeline or an egress pipeline, right?
But Pedro has a really interesting perspective on this concept of composable at a lower level of data infrastructure.
And the execution engine really sort of cuts through some of the marketing noise that the vendors are creating with higher level tooling and helps us understand, at the infrastructure level, what does composable mean?
So I think you put it in the right way, because I think the difference here when we are talking about composability
is on what level of abstraction we're talking about
when it comes to composability.
Like the vendors you're talking about, I think they are talking more about
like composability of like, let's say features in a way
or like functionality like the user wants.
Composability when it comes like to what we were discussing with Pedro,
is a little bit more fundamental and has to do more with how software
systems are architected.
And, okay, I'm sure people that listen to us, they know the value of being
able to build a system, a software system that has some kind of separation of concerns
between its modules.
So you have much more agility and flexibility in building, updating,
having people that are dedicated to different areas and in general,
have something that scales much easier in terms of building.
We are not talking about processing scalability.
Now, traditionally, this was never, like, the case with database systems, though, right?
Like, database systems were, like, kind of like big monoliths in a way, and, like, for
some very good reasons.
Like, it has a lot to do with, like, how hard it is, like, to build such a system.
So, when we're talking about, like, the composability that we'll be discussing with Pedro, it's more about that.
Like how there are some new architectures and some new components coming out that actually allow a developer to pick different libraries and build databases, right?
Or like data processing systems, let's say, in general. And that's a very important
thing in the industry, because traditionally, building database systems has been extremely
hard, exactly because there was almost zero concept of reusability of libraries or software
or whatever. You pretty much had to do everything from scratch. And that made the whole process of
building these systems really hard and very risky, also from a venture
point of view.
So we are entering a new era when it comes to these systems where with technologies like
Arrow, for example, or Velox, we start seeing, let's say, some fundamental components
that you find in every system out there
provided as like a library in a way
that you can take and like integrate
and build your own system, right?
So this is like the things
that we are going to like to talk about
when we're talking about composability with him.
But there's much, much more.
Actually, we're going to talk a lot
about like some very basic and important concepts when it comes to data processing.
And by the way, Pedro has been working for 10 years in Meta's data infrastructure.
So he has seen a lot in these past 10 years.
There are so many things that have changed and were built, so we're going to talk a lot about the
evolution: things that 10 years ago were innovative and today need to be rethought.
And that's a hint, because he's going to announce some very interesting things about
some changes and some updates in some very important systems out there. So Velox is a very interesting project, but it also involves some amazingly experienced people.
We start with Pedro today.
Everyone should listen; you are going to enjoy it and take a glimpse of the future of what is
coming.
And hopefully we are going to have him back and also more people related to this technology
to talk more about that stuff in the future.
All right, well, let's dig in and chat with Pedro.
Let's do it.
Hello, everyone, again, to another episode of the Data Stack Show.
I'm here today with a very special guest, Pedro from Meta.
And we are going to be talking about some very interesting things
that have to do with databases.
But first of all, Pedro, welcome.
It's really nice to have you here.
And tell us a few things about yourself.
Yeah, definitely.
Yeah.
I think first of all, I'm super excited to be here.
Thanks for the invitation.
Just introducing myself, I'm Pedro Pedreira.
I'm from Brazil.
I've been working at, well, formerly Facebook, now Meta for exactly 10 years now.
It's been my, we call this a metaversary.
So my 10th metaversary was about a couple of weeks ago.
So yeah, I've been always focused on data infrastructure.
I'm a software engineer.
been developing systems, query processing, like all sorts of things around data. I spent
about five to six years working on a project called Cubrick, which was something that I
started inside Meta. It was something based on some of the ideas I was researching in my PhD.
So the idea was more creating a new database,
very focused on very low latency analytic queries.
So we have a special way to index the data
and partition data in kind of very small containers.
And it could use that to speed up your queries,
especially if it was kind of highly filtered queries.
So I spent, like I said,
about maybe five to six years working on that.
It was really cool taking the things I was researching and turning that into an actual product inside Meta.
It was something that at some point was running many queries.
And then there was this other thing that was being developed in parallel called Presto, which was also a really great piece of technology.
It was making some good progress.
And at some point we started seeing that they're kind of getting closer and closer to the point where we eventually,
we brought the teams together and we started merging the technology. And then it was when
some of the ideas around Velox and actually unifying the execution engines, all those
different compute engines into a library started. And that's how we kind of, you know, we started
looking into Velox. And I've been doing that for, I think, the last three years or so.
So I've been really focused on Velox, but inside Meta, I also work pretty
closely with some of the other compute engine teams, so I work very closely
with the Presto team, with the Spark team, Graph Analytics, but this is
kind of my world, like software engineering around all those analytic
engines, but I also work very closely with the machine learning infrastructure,
people, real-time data, like all things kind of related to this area.
That's my thing.
Okay.
That's super cool.
Okay.
Before we start like talking about more, let's say like recent things and more like, how
to say that, like Silicon Valley kind of things. I'd like to ask you about like growing up like in Brazil, right?
And deciding like to get into databases.
And the reason I'm asking is because, okay, I also come like from,
like I wasn't born and like raised in the United States.
I was also like in Greece.
I went like to like a technical school there
and there were like many different options
like to go and focus on.
And what brought you,
let's say when you were like in Brazil studying
to focus on databases?
Like what was the thing?
Like it wasn't like,
I remember like most of my friends,
like for example,
we were like at electrical and computer engineering school.
Everyone wanted to build, at the beginning, a 3D engine.
Playing Quake back then.
I'm a little bit old.
And we all wanted to do something like that.
Databases were also something that would come later.
It was a very specific small group of people getting into that stuff.
So tell me about that. How you got exposed to that and got excited about it.
Yeah, definitely.
Good question.
I think for me, when I was in college, I started looking into operating systems.
So it's really going down into Linux kernel, understanding how all those things work, memory
management, how to manage resources, things like that.
And then I spent some time looking at distributed systems as well.
So I think the first thing that really caught my eye on databases is just, I think, the
kind of the breadth of things that you get to interact with, right?
Because if you're building databases, I mean, you need to understand operating systems.
You need to be a great software engineer, but there's also compilers, languages, kind of optimizers.
So I think there's just so many things that you need to understand if you actually want to go down the database route.
It was just something that was fascinating to me.
And I started a little bit more on doing research on transaction processing.
And then somehow I kind of gravitated to more of this analytics side, of
multi-dimensional indexing on databases. I think the more I started learning how databases work,
I think the more I started, like I said, just getting fascinated by how complex those things
are. And I think that's the part that was really interesting because you can go as deep as you
want in compilers inside databases, or on the hardware side, going down to how hardware works and how
prefetching and caching work. Like, all those things, you know, you
need to have a really good understanding of those to build a database.
I think that was just something that always caught my eye.
That's how I got into that.
Yeah.
Oh yeah.
That makes a lot of sense.
Okay.
And another question that I will ask in an attempt to clarify
things for our audience, because it is one of those things where
the semantics vary depending on who you talk with.
And that has to do with like latency when it comes like to interacting with data.
Right.
So we have streaming processing.
We have what some people call real-time databases, we have interactive analytics, we have many different, let's say,
terms that are used for different systems. It all has to do with the latency of the query.
And it's very interesting because you mentioned Presto, and Presto is also a
system that if you compare it with something like Spark, for example, right?
It's much more interactive.
Like it is, I mean, not how Spark can be like today, but if we take it like when these things
like started, but the whole idea was that whatever we design here, like the system we
are designing, like one, let's say guiding design here, like the system we are designing, like
one guiding principle is that these queries that we are running, they need to respond
in a timely manner, right?
We are not talking about queries that they are going to run for hours.
But then you're mentioning Cubrick, and you're talking about even lower latency. Can you help us understand a little bit how time and latency relate to databases
and try to create some kind of categories there that are as distinct as possible and
like to communicate this with our audience out there?
Interesting.
Yeah, I think, let me take a stab at that.
I think like you're saying, if we're looking at more analytic queries, there's definitely
a spectrum that goes to interactive analytics.
Sometimes they call that last mile analytics, which are things that you need to serve with
really low latency.
Imagine things serving user-facing dashboards that have really high QPS and you're expecting
very low latency.
And building systems focus on that,
like you need to take different trade-offs.
You cannot expect to scan petabytes of data for those queries.
There's probably some pre-computation,
some caching required,
like even how you propagate your query
between workers,
it should be very,
kind of really fine-tuned.
So I think we usually call that
there's this kind of last mile analytics that relies
on a lot of caching and pre-computation.
Then there's more interactive analytics.
Sometimes they call them maybe exploration or ad hoc, which are in some ways you're doing
some exploration of data.
So you know your data set, maybe you're kind of trying to join different tables to get
some insight, but it's not something that you already found and already built a dashboard for,
with a few queries that you want to serve with really low latency.
There's this part in between that is a little more exploratory, but it still needs to be executed with low latency because there's a human expecting to see that result.
So it goes from this serving in a way to this exploration. And then on the other side,
there's more of those kind of really
kind of higher latency batch queries,
which is just, okay, I need to prepare my data.
So I need to join this, I don't know,
10 petabyte table with this,
another five petabyte table, generate some tables.
And then those tables are going to be used
to serve something else.
So usually, and this is only talking about analytics,
but it goes from this side of serving dashboards, your kind of human
exploration, a little more ad hoc to kind of large batch processing and ETL.
And I think, like I said, at least inside Meta, a lot of the batch workloads
are just running inside Spark, right?
Because it's a system that is really good at handling large queries.
There's a good support for query restartability, and it's a very reliable system.
And as you move to the more interactive part, you go to systems like Presto, and
we also have things that are focused to kind of higher QPS and serving really low
latency queries.
I think that's the spectrum.
Again, that's only talking about analytics.
So if you talk about kind of transactional workload, if you're talking about ML or
real-time data,
like there's another world,
but that would be maybe
a high-level classification.
No, that makes total sense.
And like, just like to help our people
like map these categories,
these three categories
with like some like products out there, right?
So you have systems like Pinot
or Druid or ClickHouse.
And I would assume, and correct me if I'm wrong,
these are like the systems that are closer
to what Cubrick was supposed to be doing.
Correct.
Then you have the systems like Trino,
like Presto or like Snowflake
that are more for longer running queries,
but still like queries where there is a human waiting
in front of the screen, right? And then you have all the batch stuff that might take hours,
in some cases maybe even days, and there you have Spark. And now, okay,
I'm going like to ask a very naive question, but I have like a good reason to ask it like that.
Why do we need different query engines, and why can't we have just one
that covers all the different use cases? Yeah, I think that's a good question. It
goes down to that whole one size does not fit all discussion started, I think, maybe at this point,
20 years ago-ish. But yeah, I think essentially depending on the capabilities,
depending on what you need,
what are the capabilities you need from your engine,
you need to assemble your query engine in different ways.
So for example, like if you need something with very low latency,
you probably want to propagate your query to less servers.
You want to have things maybe pre-calculated.
Maybe a good example is if your query is really short-lived
and something fails in the middle of your query,
it's probably cheaper to just restart your query
than trying to just kind of save intermediate state, right?
While this is very different from if you're running a query
that takes, I don't know, an entire day and something failed,
you need to have ways to restart just that part of your query.
So different systems, depending on the capability, depending on your requirements, they need
to be kind of organized and assembled in different ways.
There are always discussions about HTAP systems and systems that could magically kind of cover
most of that.
There's some projects that are even kind of successful in the industry, but I think at
least for us at scale, we don't see anything that can kind of work at our scale and still fit all those requirements.
I think the very interesting question, and that's how we got to Velox, is that you do
have all those different engines, but if you look at them, how different they actually
are, right?
Well, the first good example, and where we started, is if you
look at Presto and Spark. Like you mentioned, Presto is definitely
more focused to lower latency queries, more ad hoc, while Spark is just for those really
large batches.
But if you look at the code that actually execute the database operations, they are
the same.
There's nothing different about the way you execute hash joins or the way you execute
expression.
Of course, one of them is vectorized.
The other one is a little kind of more closer to code gen,
but essentially kind of the semantic of those operations
are always the same.
So I think there was a lot of questions that we,
or maybe the discussion we brought was that,
yes, the engine needs to be different
because the way you organize or query execution
requirements are different,
but the code you execute can be shared,
or at least a lot of the code can be shared.
And that's how we got to Velox.
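To make the idea concrete, here is a minimal sketch (in C++, and not the actual Velox API) of what "sharing the code that executes database operations" can look like: the operators are written once against a small push/pull interface, while each engine keeps its own scheduling, fault tolerance, and memory management around them.

    #include <cstdint>
    #include <utility>
    #include <vector>

    // Toy column batch; a real engine would use columnar vectors with encodings.
    struct RowBatch {
      std::vector<std::vector<int64_t>> columns;
    };

    // Minimal operator contract: a hash join, filter, or aggregation written once
    // against this interface could be driven by a long-running, latency-oriented
    // worker (Presto-style) or by a task launched per query stage (Spark-style).
    struct Operator {
      virtual ~Operator() = default;
      virtual void addInput(RowBatch batch) = 0;  // engine pushes input batches
      virtual bool hasOutput() const = 0;         // engine polls for results
      virtual RowBatch getOutput() = 0;           // engine pulls output batches
    };

    // Example: a trivial pass-through operator implemented once, reusable anywhere.
    struct PassThrough : Operator {
      void addInput(RowBatch batch) override { pending_.push_back(std::move(batch)); }
      bool hasOutput() const override { return !pending_.empty(); }
      RowBatch getOutput() override {
        RowBatch out = std::move(pending_.back());
        pending_.pop_back();
        return out;
      }
      std::vector<RowBatch> pending_;
    };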
All right.
That makes a lot of sense.
You mentioned an interesting distinction in terminology here.
You mentioned vectorized versus code gen, right?
Tell us a little bit more about that.
And the reason I'm saying that is because I'm aware that there are two distinct,
let's say, main categories of how an execution engine is operating or architected.
But I'm not sure that many people know about that. They think of vectors and might think
of many different things, or code gen with, again, many different things.
So let's get a little bit more into that. And that is also going to give us, let's say, the stepping stone to get into Velox, right? So what's the difference between the two?
Yeah, I think it's essentially how you process your data. I think usually, when we talk about vectorization, the assumption is that
you take entire batches of data. So you're not executing operations against one single row, right? So you have a batch of, say, 10,000 rows,
and then you apply one operation at a time over this entire batch. And then, I mean, there are a
lot of trade-offs, but it's usually a little better suited to CPUs because, well, the prefetching is
better, your memory locality is better. So usually executing those operations against the batch,
like you can amortize a lot of this cost.
I think where things get tricky is that if you cannot batch records
or create batches that are large enough, then in a lot of cases you have a bunch of
operations that you want to execute against a single record, so a lot
of the overhead you have with vectorization really adds up.
So I think what we see is that if you're looking at analytics, we're usually processing a lot of data.
Vectorization might be a better candidate just because you always have large batches.
So essentially the overhead of finding what's the operation you need to execute on that batch, it gets amortized because the batch is large enough. But in cases where your batches are smaller, and I think there are
probably some disagreements in the community on where exactly
this line is drawn. But usually, in real life, where things
are a little more transactional,
right,
so you have lots of operations that you want to apply over
one record or a few records,
then code gen might be a better solution, because you essentially take
all the code and,
at execution time, you generate some assembly code that can execute those operations
more efficiently.
And when we started those things, there was a lot of discussion on what would be the best
kind of methodology to use for Velox,
how all the algorithms and operators are created.
We have been experimenting with CodeGen as well.
One of the tricky parts of CodeGen is always that there's some delay and there's some overhead in generating this code at execution time, so it's always like, you
know, you have some gains, but then, does the compilation time really offset that?
And given at least our experience, like usually vectorization
still was a clear winner.
Another discussion there is always just the kind of complexity aspect of kind of generating code at runtime.
It might get things kind of trickier to debug.
It may make things harder to develop.
There was some recent work in the community that actually suggests that code gen might not be as terrible as it seems,
like how to make it easier to develop code-gen-based engines.
But I think our experience with that for analytics
vectorization was still the clear winner.
I think even if you look at the industry,
most of the engines focused on analytics,
they do vectorization, which is this idea
of just executing the same operation
over large batches of data.
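As a rough illustration of that batch-at-a-time idea (a toy example, not Velox code): the row-at-a-time path pays interpretation overhead on every value, while the vectorized path applies one operation to a whole batch in a tight loop that the CPU can prefetch and the compiler can often auto-vectorize.

    #include <cstdint>
    #include <vector>

    // Row-at-a-time: the "which operation am I running?" overhead (dispatch,
    // branches) is paid once per row.
    int64_t evalOneRow(int64_t a, int64_t b) {
      return a * 2 + b;
    }

    // Vectorized: the same expression is evaluated over an entire batch, so the
    // dispatch overhead is amortized across, say, 10,000 rows, and the tight loop
    // is friendly to prefetching, caching, and SIMD.
    std::vector<int64_t> evalBatch(const std::vector<int64_t>& a,
                                   const std::vector<int64_t>& b) {
      std::vector<int64_t> out(a.size());
      for (size_t i = 0; i < a.size(); ++i) {
        out[i] = a[i] * 2 + b[i];
      }
      return out;
    }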
Yeah, that makes a lot of sense.
And I'd like to make something clear here,
because I think people might also think
that when we're talking about vectorization,
it's always connected to specific instruction sets
that some CPUs have, like AVX512 or something like that.
But actually, vectorization is much more fundamental than that.
Yes, you can have acceleration because of these instruction sets,
but you get benefits
anyway even without the instruction sets. It's a whole architecture. It's not just
having special hardware that can accelerate some specific things.
Correct? I'm not confusing people more, but there's a lot of confusion out there,
and I would like to make sure that we make things clear.
No, exactly.
I think that's a great point.
When we talk about vectorization, that doesn't necessarily mean SIMD instructions.
I think the point is just designing your database operations in a way that SIMD can be leveraged to execute those things efficiently. There's always a discussion on whether you do this explicitly, or whether you let the compiler implicitly discover that a pattern can be translated to SIMD instructions. But basically what you said, I think it's maybe a little higher level.
It's more, for example, if you're executing hash joins and you need to look up in hash tables: whether you take one record and then you do the entire lookup in your hash table, or whether you first calculate all the hashes and then you kind of prefetch things.
It's more at that kind of higher level, how you organize the operations you're executing. SIMD is part of that, but it's not only that.
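A hypothetical sketch of that hash-lookup example: rather than hashing and probing one record at a time, a vectorized engine can hash the whole batch in one pass and probe in a second pass. Real engines use custom hash tables and issue memory prefetches between the passes; std::unordered_map here is only a stand-in and cannot reuse the precomputed hashes.

    #include <cstdint>
    #include <functional>
    #include <unordered_map>
    #include <vector>

    // Pass 1: compute all hashes for the batch; pass 2: probe the table.
    // A real join would reuse the precomputed hashes and prefetch the target
    // buckets between the two passes.
    std::vector<const int64_t*> probeBatch(
        const std::unordered_map<uint64_t, int64_t>& table,
        const std::vector<uint64_t>& keys) {
      std::vector<size_t> hashes(keys.size());
      for (size_t i = 0; i < keys.size(); ++i) {
        hashes[i] = std::hash<uint64_t>{}(keys[i]);
      }
      std::vector<const int64_t*> out(keys.size(), nullptr);
      for (size_t i = 0; i < keys.size(); ++i) {
        auto it = table.find(keys[i]);
        if (it != table.end()) {
          out[i] = &it->second;
        }
      }
      return out;
    }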
Yeah, yeah, 100%. And okay, before we get more into Velox, can you give us an example of a database? I mean, Velox obviously
is vectorized, but another one that is, let's say, not a library but a whole database
that is vectorized, so people can relate to that.
And on the other hand, again, a popular system that is code gen and people might not know
that it is code gen. Yeah. I think Presto is a good example of a vectorized engine.
I think that the vectorization that Presto does and Velox does is not exactly the same,
but I think the paradigm is the same.
Photon, at least based on the public information
they have in their paper, it seems like everything is vectorized.
I have the impression that Snowflake is also vectorized
just based on the people who created them,
the system they worked before.
But I think that there's not as much information.
But I would say that most of the analytics engines are focused on, are essentially vectorized.
Most of the systems essentially coming from kind of the German universities, that's
kind of a big shop
of code gen.
So the Umbra system, I think, is maybe the newest system they have been working on, which
is all based on code gen.
Before that, they had the Hyper system, which I think at some point was acquired by either Salesforce or SAP.
So that was also like one of the kind of really successful code gen systems.
I think that those maybe are the best examples.
I know that Redshift does some code gen as well.
But I think that's kind of maybe the first one that comes to mind.
Yeah, I think DuckDB is also vectorized.
DuckDB for sure is vectorized, yeah.
Yeah, and I had the impression that Spark, I mean, not with Photon,
it was code gen.
But I might be wrong here, but I had the impression that it was still... I mean, if you don't use Photon, it is code gen, but...
Yeah, I think you're right. I'm just not sure, because I know that we do have some internal changes to Spark as well.
I don't know if the code gen is something that we added or we improved. But definitely the default kind of Java, Scala eval
is not vectorized, it's based on code gen.
Yeah, yeah. Which kind of makes
sense because vectorization, I think in the JVM
space is a little bit more challenging
from what it seems like in some cases.
And a quick question here
to get into Velox. So Velox is,
by the way written in C++,
right? So
it diverges a lot from the paradigm of these systems like Spark and Presto
that were JVM-based systems, right?
Why is that?
That's the first question.
And the second question is, how do you bridge the two?
Because it sounds interesting
to go and be able to take Presto
or take Spark and change the execution engine
and put there something that is written in C++
and have the gains that you might have.
But how trivial is it to do that, right?
Yeah, good question.
So when we started Velox,
I think there was maybe two different discussions.
Like one of them is what you mentioned,
just how we would write that and which language.
So that actually started inside the Presto team inside Meta.
So it was a couple of people
just looking at accelerating Presto, right?
So what are the optimizations you can do on the Java stack?
And then I think at some point,
like this idea came up of
if we actually rewrite those things in C++,
like would they be more efficient?
So at that time,
there was some smaller benchmarks
or micro benchmarks,
just kind of compare how efficiently,
like if you just look at the very hot loops,
and I think one of the things
or the areas where we started
was essentially table scans,
because that's where most of the CPU was going.
So if you look at the main loops inside table scan, and you just kind of map those things
into C++, or what we call native code, how much faster that would be.
I think we got some good data that at least for those smaller micro benchmarks, there
was some performance gain, I think it was a few X from 4X to 10X, depending on what
you do.
So there was a good performance gain by just kind of moving things
from Java to C++.
I know the Java lovers will probably yell at me.
Of course, there's a lot of kind of things that you can stretch
the JVM to do things even more efficient, but then you kind of start
losing a lot of kind of the good benefits that the JVM and Java,
some of the guarantees that the language gives you.
So you can kind of stretch Java in awkward ways,
and then it can bridge some of that.
But then I think it just came up that writing those things in C++
in the first place was just kind of a better idea performance-wise.
So that was one aspect of the discussion.
The second one was that we didn't want to do,
or when we started writing some of that,
the idea is that we didn't want to do that on a per-engine basis.
So I think usually if you're looking at just Presto or just Spark, it's one discussion.
For us, we have 10 to 20 different engines, right?
So you don't want to have to do that on a per-engine basis.
So I think that's when the idea of like, if we need to rewrite those things from Java to C++, that's a lot of work.
Like it's almost 10 years of work.
But if we want to do that, we want to do that once,
and then reuse it in all the different engines. Not just in Presto and Spark,
but we're using that in stream processing.
We want to reuse that in machine learning for processing, for training,
for serving and inference.
We want to reuse that for data ingestion.
We want to reuse that in as many vertical or different use cases as we can. And then creating that in C++ just makes this a lot
easier. At least for us, the whole ML stack is all C++, so it's not like you could reuse a Java
library in that world anyway. So I think just C++ was a better choice at that point.
I think to your second question about how we plug those things together,
it's interesting.
It also depends on exactly how you're integrating
or maybe how deep you want the integration with the engine to be.
Specifically for Presto, there was a discussion
on whether you just replace the eval code by C++,
and then in that world, you need to have some sort of inter-process communication
or you need to use JNI.
And then there was questions about, you know, how much complexity would JNI add?
There's also discussions about memory management,
like how much memory you give to the C++ side of things
and how much memory the JVM would need.
So what we ended up deciding is that specifically for Presto,
we removed Java
completely from the workers.
So essentially, Presto has this
two-tiered architecture where you have
one or a few coordinators that take a
query, parse a query, optimize it,
generate query fragments,
but then the execution of those fragments
they're done
in worker nodes.
So what we do is that on those worker nodes,
we completely replace the Java process.
So there's no Java running that.
Most of the execution of the operation,
they're done by Velox,
but there's also kind of a thin C++ layer
that communicates,
essentially implements the HTTP methods
that communicate with coordinator.
So that's the project we call Prestissimo.
So Prestissimo is essentially
this thin layer around Velox. That's how we do it specifically inside Presto.
That's how we go between languages, from Java to C++. We do that via kind of HTTP REST interfaces.
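A conceptual sketch of that split, with illustrative endpoint and type names rather than the exact Presto protocol: the Java coordinator keeps parsing, optimization, and fragment generation, and the C++ worker exposes the HTTP task interface the coordinator already speaks, handing execution to Velox.

    #include <cstdint>
    #include <string>

    struct TaskUpdateRequest { std::string serializedPlanFragment; };  // from coordinator
    struct ResultsPage       { std::string serializedRows; bool done = false; };

    // Thin C++ layer of a Prestissimo-style worker (names are illustrative).
    class WorkerHttpService {
     public:
      // e.g. POST /v1/task/{taskId}: create or update a task from a plan
      // fragment, then start a native (Velox) task to execute it.
      void onTaskUpdate(const std::string& taskId, const TaskUpdateRequest& req) {
        (void)taskId; (void)req;  // would deserialize the fragment and launch execution
      }

      // e.g. GET /v1/task/{taskId}/results/{bufferId}: stream pages of results
      // back to the coordinator or to downstream stages.
      ResultsPage onGetResults(const std::string& taskId, int64_t sequence) {
        (void)taskId; (void)sequence;
        return ResultsPage{};  // would pull output batches from the running task
      }
    };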
And is there, like with Spark, for example,
and like, okay, the reason I'm asking is,
seeing also like the project Gluten,
which is like attaching,
not the, I mean, Gluten is like more generic,
but tries like to create like a layer there
to put like different execution engines like on Spark.
And I also had the impression that at least the Photon folks
were talking about JNI, and also the context switching that happens there,
and what kind of overhead this also adds, and do you
gain at the end or do you lose?
Like, so from what you've seen so far, because okay, like traditionally
Java to communicate with other systems, you
use JNI, right?
Does it make sense to do that with a system that's, okay, I mean, you put all this effort
to write in C++ at the end to go and add benefits in performance.
You don't really want to lose any of that benefits, right?
So based on your experience, what have you learned around that?
Yeah, I think that's a good question.
I think for the JNI discussion was less about performance.
I think the assumption is that you cross the API boundary very rarely.
So I think the idea is that you would go from the Java world into C++
and cross the JNI boundary just once.
And then all the heavy lifting operation would happen just inside C++.
So we wouldn't transfer...
Another way of saying that is that we wouldn't transfer data back and forth between C++ and Java.
So all the heavy operations would be on the C++ process.
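A hedged sketch of what "crossing the JNI boundary just once" can look like; the class, method, and task names are made up for illustration, not Gluten's or Velox's actual interface. The JVM hands over a serialized plan a single time and gets back an opaque handle, and the per-batch heavy lifting then stays entirely on the C++ side.

    #include <jni.h>
    #include <cstdint>
    #include <utility>
    #include <vector>

    // Hypothetical stand-in for a native execution task.
    struct NativeTask {
      explicit NativeTask(std::vector<uint8_t> plan) : plan_(std::move(plan)) {}
      bool next() { return false; }  // would produce the next output batch natively
      std::vector<uint8_t> plan_;
    };

    // Called once per task: copy the serialized plan across JNI and return an
    // opaque handle; no row data crosses the boundary after this point.
    extern "C" JNIEXPORT jlong JNICALL
    Java_example_NativeEngine_createTask(JNIEnv* env, jobject, jbyteArray plan) {
      jsize len = env->GetArrayLength(plan);
      std::vector<uint8_t> bytes(static_cast<size_t>(len));
      env->GetByteArrayRegion(plan, 0, len, reinterpret_cast<jbyte*>(bytes.data()));
      return reinterpret_cast<jlong>(new NativeTask(std::move(bytes)));
    }

    // Called per batch, but only a small status flag crosses back to Java.
    extern "C" JNIEXPORT jboolean JNICALL
    Java_example_NativeEngine_hasNext(JNIEnv*, jobject, jlong handle) {
      return reinterpret_cast<NativeTask*>(handle)->next() ? JNI_TRUE : JNI_FALSE;
    }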
I think the discussion was a little less about just performance and more about simplifying
the stack and resource management. Another big difference is just how Presto and
Spark are organized. Presto is essentially a service. So we have a few workers and those
workers are always running. There's a single process and you share this process among different
queries. So in a way, you want to be very conscious about how much memory you have on the box, how much memory that process is able to use. And then inside this process, you share
memory and memory pools among different queries. While on the Spark side, it's different. Spark
is just like, it's more like a batch job dispatching system. So when you execute a query,
you allocate containers and different pools. And then at least how we use that internally is that you do that.
And that lives while your query is executing.
At that point, you kind of remove that.
Right.
So I think this being able to efficiently share resources and share memory, like it's
a lot more important on the Presto side than it is on the Spark side.
With that said, on the Spark side, we also went through a few
iterations on how those things would be integrated.
There was a first try internally, and I think that was at least two years ago, doing something
similar to Gluten, but instead of doing via JNI, just using a separate process.
So essentially Spark, the Spark driver would create a new C++ process and then communicate
using some sort of IDL, some sort of inter-process communication.
We went from that mode to a completely different project that we can maybe talk in a little more details.
But essentially, the idea was taking the same Presto frontend and the same Presto worker code and executing them on top of the Spark
runtime, which is something we call Presto on Spark. So I think that was kind of our strategy
because the integration work would be a little less, right? So we just needed to integrate
Velox into Presto. And then if we're using the Presto engine on top of Spark, then the
integration cost would be easier. So that's how we use things internally.
And then on the community, as you mentioned,
like Intel started and created this project called Gluten,
which is a little, in a way, similar to Photon in the sense that it uses JNI to communicate with Velox.
And then the API is based on Substrait and Arrow.
So I think that one is closer, but we don't use that internally,
but we've been working closely with Intel.
And I think in a way,
it's a good way to look at Gluten
as sort of like being a JNI wrapper around Velox
that you can integrate that into Spark.
But I think the project is generic enough
that we can reuse that in other Java engines as well,
like Flink, like even Trino and anything that uses Java.
Yeah, that's interesting.
Okay, what about Arrow, though?
Because one of the promises of Arrow is that we agree upon a format on how to represent things in memory.
And this can be shared among like different like processes and like all
these things, which obviously like, okay, sounds like very interesting
because when you're working with a lot of data, like the last thing that
you want to do is like copy data around, right?
So that's always like a big overhead.
So where does like Arrow fit?
I know that Velox, for example, is inspired from Arrow,
the way that the layout in memory is designed.
But it's not Arrow.
It's not exactly like implementing Arrow.
So how do you see the projects there, let's say, work together?
And what's your prediction for the future in terms of the promises of Arrow and the
role of Arrow in the industry?
Yeah, that's a good question.
There's some background to that as well.
Yeah, I think, like I said, Arrow is this columnar layout
that is meant to, you know,
it's a standard in the industry.
And then I think the whole idea
is that systems would reuse that layout
and then you can move data,
zero copy across engines.
So when we started Velox,
I think, of course,
the idea was just kind of
reusing that same layout
because, well, it makes,
it would make it a lot easier
to integrate with different engines.
What we started seeing
is that there were a few different places where we saw that if we made some small changes to that layout, we could implement execution primitives more efficiently.
So then there was a lot of discussion on whether we would stick to the layout at the cost of maybe losing performance in some cases, or whether we would extend Arrow in a way
and implement kind of our own specific memory layout
in some cases, just so we could execute
some operations more efficiently.
And I think because for Velox we really care
about efficiency, we decided to go with kind of extensions.
So that's how we kind of created
what we usually call Velox vectors.
It's essentially the same thing as Arrow in most cases.
The only difference is how we represent strings is slightly different,
and how we represent kind of vectors and, sorry, arrays and maps.
There's some differences on how we do that.
And we also add some more encodings that were not,
at least at the time, were not available in Arrow,
like a constant encoding, RLE, and frame of reference. But then we decided to extend that and in a way it deviates from the Arrow standard.
So that's how Velox started, and all data represented in memory follows the Velox
vector model. Then, because many engines integrating with Velox
would like to produce and consume Arrow, there's a layer that can do this conversion, which is,
like I said, in most cases, zero copy.
But if you hit one of those cases where we represent things differently, then it incurs
an actual copy.
So this is how the project started.
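For a flavor of the kind of layout difference being described, here is a simplified sketch of the string case; the byte layouts are approximate and not meant to match Velox or Arrow exactly. Classic Arrow varchar stores one big data buffer plus an offsets buffer, while a Velox-style string view keeps every element as a fixed 16-byte entry, with short strings inlined and a prefix cached for cheap comparisons.

    #include <algorithm>
    #include <cstdint>
    #include <cstring>
    #include <string>
    #include <vector>

    // Classic Arrow-style varchar column: element i lives at
    // data[offsets[i] .. offsets[i+1]) in one contiguous buffer.
    struct OffsetStringColumn {
      std::vector<int32_t> offsets;  // size N + 1
      std::vector<char> data;
    };

    // Velox-style fixed-width "string view": short strings are fully inlined,
    // long strings keep a 4-byte prefix plus a pointer to the payload.
    struct StringView16 {
      uint32_t size = 0;
      char prefix[4] = {};
      union {
        char inlined[8];
        const char* data;
      };
    };

    StringView16 makeView(const std::string& s) {
      StringView16 v;
      v.size = static_cast<uint32_t>(s.size());
      std::memcpy(v.prefix, s.data(), std::min<size_t>(4, s.size()));
      if (s.size() <= 12) {
        if (s.size() > 4) std::memcpy(v.inlined, s.data() + 4, s.size() - 4);
      } else {
        v.data = s.data();  // references the caller's buffer, no copy
      }
      return v;
    }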
Because Velox was gaining adoption, was gaining visibility, we also started working with the Arrow community just to kind of understand those differences and see if that's something that the Arrow community would be interested in extending the standard as well.
We started working with a company called Voltron Data that was also interested in Velox, and they do a lot of good work in the Arrow community.
So what we ended up doing was actually proposing to the Arrow community to
add those Velox extensions to the Arrow specification. I think at some point we were
able to convince them that those things are actually useful, and not just for Velox,
that some of the other compute engines also do things in a similar way. So some of those
changes are already adopted and already available in the new version of Arrow. There's one last
difference that we're still having some discussion, but hopefully
we'll make it to the next version of Arrow.
So essentially, I think long story short, like we, when we started, it was sort of
a fork or like, we called it an Arrow++ version, but then we were working with
the community to, again, to kind of contribute those changes and then hopefully
align, and then the idea is, of course,
that we can ship data back and forth with Arrow
without any performance loss,
making everything zero copy.
Okay, that's awesome.
So you mentioned a couple of folks out there
who are using Velox.
And who else is using Velox outside of Meta?
It is like a pretty mature open source project right now.
If someone goes there, you can see a lot of contributions happening
from people also outside of meta.
So what's the current status of the open source project?
Yeah, that's a good question.
I think that's one area that we are actually really happy on
how much attention and how much visibility we're getting.
I think part of our assumption on open sourcing Velox is that, because in a way,
the audience is very limited. It's not something that end users would download and use. In a way,
it's very different from DuckDB. DuckDB must have millions of users because data scientists,
data engineers, they actually use that. I think for us, our audience is just developers
who actually write compute engine.
So in a way, I think the group of people
who would be interested on this would be very limited.
But regardless, we saw very kind of quick growth
on just attention and different companies in the industry
were just interested in the project
and wanted to both
start using benchmarking, comparing with version they had internally and then eventually help
us.
There's a list of companies that one of the first ones that joined us really early on
and has been helping specifically from the Presto perspective was Ahana.
So they have a team of people working with us really closely.
They recently got acquired by IBM.
So now IBM, Ahana, they're one of the companies who've been helping us develop and
use Velox, specifically in the Presto context, for a while. Intel was another company that
joined us very early on. They're interested a little more in the Spark perspective, and I think,
of course on the part of making sure that this library runs efficiently on Intel hardware.
So they have a lot of work on the Spark side as well.
So they're the company who created Gluten.
Specifically for Gluten, there's a lot of other companies who joined us afterward and helped develop and are using that internally.
So we heard from Intel, Alibaba, Tencent, Pinterest, a large number of companies who have been looking into that as well.
We've also been chatting with Microsoft about Gluten and some other usages of
Velox there. I think we probably heard and saw people from most tech companies at this point,
with very few exceptions, like some of them already actively using that internally and
helping contributing code, like some of them are just asking questions and trying to benchmark
things. But I think
most of the large tech companies,
we heard them in
some shape or form,
which was really interesting. It was really
validating to our work.
Yeah, 100%. I think it's very rewarding
also for someone
who's leading a project like this.
So, two questions that I have to ask them because I don't want to forget them.
One, the first one: you mentioned at some point that Velox is being used not just
for analytical workloads, but for streaming, and it's used for ML.
And I'd like to ask you,
let's take ML as an example, just because it's the hot thing right now. It helps with marketing.
So, what's the difference?
And the reason I'm asking what's the difference is because
I also try to understand and I want to learn what is also like they have like in common, right?
Because if you can share such complicated components like Velox, which is the execution engine that does the processing of the data, right?
Between ML and analytics, it means that they are different, but there are also some things that are common.
So I'd love to hear that from you.
So please help us learn.
Yeah, no, definitely.
Yeah, I think, like I said, when we started Velox, we started learning and talking to
many other teams.
And I think that's what we started realizing that actually the similarities
between engines developed
for all those different use cases,
they're much higher than what we expect.
And I think the assumption was
that they are just different things
that do various things
in their very particular ways.
But what we found was just,
it was kind of basically the same thing
developed by different groups of people.
And you can see that within analytics,
but you see that even more kind of
evident between analytics and ML just happens that all the kind of data
management systems and analytics were developed by database people and all
the kind of data part of ML was developed by ML engineers.
So of course, there's a lot of things that are very particular to you, to ML,
like how you do the training, like all the bulk synchronous processing you do on that, like all the hardware acceleration.
But there's also the data part on ML.
So there was maybe two main use cases that we've been looking at.
One of them is just what they call data preprocessing.
Because basically, when you want to start your training process, you need to feed lots
of data.
And usually, this data is stored in some sort of storage.
Like for us, we pull data from the same warehouse.
So there's this logic where it's usually not as, of course,
not nearly as complex as a large analytic query.
It's not like you're doing aggregations or you're joining different data sets.
But it's still like you're doing a table scan, you're decoding files.
Like sometimes I think we don't get to do a lot of filtering,
but there's always some expression evaluation. You execute some UDFs. So this part is actually very similar to...
Essentially it's the same. There's nothing different about how you decode those files,
how you open Parquet or ORC files and how you execute expressions. The UDF APIs that machine
learning users use and all the types we provide to them.
So all that part was very similar.
But it was also a discussion that how much we could reuse.
And then what we ended up seeing is that it's just kind of more of the same.
So we had a different library that was also a C++ library
that is heavily used for this specific use case,
like for training when you're pulling this data
and feeding this data into the model.
We had a C++ library internally, and we saw that the implementation of those things, it's
very similar.
So this is one of the areas where we started working with Velox.
The second part was around feature engineering, which is essentially how do you take your
raw data and transform into features.
And then there's a few different parts of that. Like how do you do that on offline systems, essentially you should have
your internal warehouse tables with the raw data, and how do you
run ETL processes that actually generate features that can be used for training.
We also do a similar thing on the real time side.
So there's also real time system that does feature engineering.
And there's also the part of serving features.
So essentially, when a user needs to see an ad,
like when you get the input data,
you need to transform that data really fast
and really low latency to transform that into features
so it can actually do the inference on the model.
So again, those last mile transformations
they need to do from the raw data into features,
again, it's essentially expression evaluation, sometimes a UDF.
So that's the part we're converging.
And again, I think the more we look into that, the more we
saw that it's just essentially more of the
same, just implemented by different
groups of people with different semantics, different
bugs, but they're the same.
And what's the difference?
Just because I'm really curious here.
What's the difference between the offline feature engineering and the data
pre-processing for training that you mentioned because okay, what I understood
by the difference is that data pre-processing is much more let's say like
doing like serialization, deserialization, like more that doesn't necessarily require
like typical like ETL stuff like joins
and things like this, while the feature engineering might be like more complicated, like the type
of processing there.
Yeah, I think on the feature engineering side, you can imagine that the offline, it's essentially
it runs on top of traditional SQL analytic computations.
So basically, you're doing large SQL transformations.
Sometimes you're joining things or executing expression,
but basically you're taking all this raw data
and creating features.
Data preprocessing path is more,
you already have those features created.
You're pulling this data and feeding this
into the PyTorch model or whatever model
you're trying to train.
So in a way, you can see the slightly different parts
of this pipeline.
But also maybe the complexity
and the type of operations you execute,
they're also different, right?
But when you're creating those features,
the type of operations you can run,
they might be a lot more complex.
There's a lot more to it.
There's also shipping data,
different places, doing shuffles
and moving lots of data to different places.
While on the training side, you want to
very efficiently just pull this data
and feed it to the trainers
very fast.
Makes sense.
You mentioned UDFs and some other stuff.
There's a question that I wanted
to ask about Velox.
If someone goes to the documentation of Velox,
they will see that there are two groups of functions there,
one for Presto and one for Spark, right?
And you will see the same functions in a way,
like implemented like twice in a way, right?
Why is this happening?
Like why you need something like that?
Yeah, that's a very good question.
And I think that's where we get to the whole kind of
dialect and semantic discussion. So basically when we created Velox, there was this discussion of
which specific dialect we should follow, right? Because it's not just that the code base are the
same, but for example, what are the data types you support? And when there's an overflow,
what's your behavior? Because different engines, they do that in different ways.
So if you follow the behavior in one engine
and you try to integrate ballots in a different engine,
then you're essentially changing the semantics.
So how exactly you handle operations,
what are the functions you provide,
what are the data types you provide was a big thing.
And what we ended up deciding was that
because we need to use ballELOX across different engines,
VELOX needs to somehow be kind of dialect agnostic.
So what we ended up doing is that the core library or the core part of VELOX,
we tried to make it as kind of engine independent as we can be.
And then you have extensions that you can add to follow one specific dialect.
So I think what you mentioned is that if you're using Velox for Presto, then you're going to use the Velox framework and you want to add all the
Presto functions. And that would be a Velox version that follows the Presto dialect. But
if you try to use that in Spark, that won't work, because Spark is expecting different functions,
different semantics. So there is also a set of Spark functions that you can use. So if you're
compiling Velox for Gluten, for example, you're going to use Velox with the Spark SQL functions.
But the other interesting part is that it lets you kind of mix and match those different things.
So you can potentially have the Spark version of Velox running within Gluten, but also
exposing Presto functions for compatibility. So it kind of enables you to do this sort of thing.
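A toy sketch of that "dialect as a function package" idea. This registry is not the Velox API; Velox ships separate Presto and Spark function packages that an embedding engine registers, optionally under a prefix so both can coexist in one process.

    #include <cmath>
    #include <functional>
    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    using ScalarFn = std::function<double(const std::vector<double>&)>;
    std::map<std::string, ScalarFn> registry;

    // Hypothetical "Presto dialect" package.
    void registerPrestoFunctions(const std::string& prefix = "") {
      registry[prefix + "truncate"] = [](const std::vector<double>& args) {
        return std::trunc(args[0]);  // truncate toward zero
      };
    }

    // Hypothetical "Spark dialect" package with its own semantics; the prefix
    // lets both dialects live side by side.
    void registerSparkFunctions(const std::string& prefix = "") {
      registry[prefix + "round"] = [](const std::vector<double>& args) {
        return std::floor(args[0] + 0.5);  // half-up rounding
      };
    }

    int main() {
      registerPrestoFunctions();         // e.g. a Prestissimo-style worker
      registerSparkFunctions("spark_");  // e.g. a Gluten/Spark build, prefixed
      std::cout << registry["truncate"]({2.7}) << " "
                << registry["spark_round"]({2.7}) << "\n";  // prints: 2 3
    }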
I think that's one of the parts that we saw is it's actually pretty tricky, right? Because the code is the same, but each
engine has its own quirks and its own SQL dialect, and no two databases are exactly the
same. Even if all of them support SQL and all of them claim to comply with the SQL standard,
they're never exactly compatible. They're always different.
So there's also discussion.
What we ended up doing internally inside Meta,
because that's also a problem that we really care, right? So if you're a data user and you need to interact
with five or six different systems,
and each system has a different kind of SQL dialect
with different quirks and different semantics,
it makes our life a lot harder.
So as a separate goal, we also have this idea of converging dialects.
So it would be a lot easier if we had one SQL dialect
that people could use to express their computation,
and that would work just on top of all those different systems,
but not just even for SQL, but potentially if you're creating
your feature engineering pipelines, or if you're doing, you know, if you start training your model, like ideally if the functions you expose,
the data types and semantics were exactly the same as your Presto queries or Spark queries,
like it just makes the user's life a lot more efficient, right?
So I think this is something that doesn't necessarily come from Velox, but Velox allows
you to do that because you have the same library, the same functions,
the same semantic being used
in all those different engines.
We can make the user experience
more aligned.
And this is what makes data users
and data scientists
a lot more efficient.
So that's another thing.
It's a lot harder than just Velox
because, well,
we need to make changes on engines.
There's all sorts of concerns
with backwards compatibility,
but it's something that long-term we're also looking at.
Yeah, I think that's also like my...
How to say that?
Like my...
Not concern, but something that always confused me a little bit
with Substrait.
Because I was like, okay, yeah, sure.
You can take a plan and you can serialize it in Substrait
and then, let's say have like a plugin on Spark
that gets that and like executes it.
But at the end, like the semantics of Spark
might be slightly like different
than like Presto, let's say, for example.
So does Substrait, at the end,
provide the interoperability layer that we would like to have?
Or is it limited, at the end, by how semantics are implemented in each system?
Because, okay, Substrait cannot go and implement that stuff.
It's not its purpose, right?
So what do you think about that?
Because it's like, depending on who I talk with, it's very interesting.
If I talk like with researchers and databases, they're like, doesn't make sense.
If you get to the point of using Substrait, why not just implement the dialect
you want from scratch anyway, if you want to do it right?
Right.
And then you see like people in the industry, they're like much more pragmatic.
I mean, yeah, like they try like to use it.
Right.
But what's your, what's your take on that?
Like specifically for Substrate.
And I understand that like, okay, I don't want you like to be,
it might also be like a little bit political in a way
because like there are like communities that are involved,
but I think it is constructive,
like for the communities and the people involved
like to like share opinions
because these are like hard problems at the end.
There's no easy solution.
Yeah, no, that's a super hard problem.
And I think in a way that's also, as you rightly pointed out, it's sort of a controversial
discussion.
It depends on who you ask.
Different people might have different views on that.
Yeah, because of the work on Velox and because of the work with Arrow, essentially it's the same community, right? So we've been talking to the Substrait community as well about how they're planning to handle that. I think the problem space essentially is that if you want to create a unified IR for systems, which is at least my understanding of what Substrait is meant to do, you need to be able to express the computation without leaving room for interpretation. It needs to be something that captures every single detail of your execution.
And maybe a good example is what I mentioned before. If you have overflows, what's the behavior? Do you throw exceptions, or do you wrap around? This is something that needs to be captured by the IR. Other things as well: if you have arrays, do indices start at zero or at one? So those are just some toy examples.
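To make those toy examples concrete, here is a purely hypothetical sketch, not actual Substrait or Velox code, of the kind of semantic knobs a unified IR would have to carry with every expression so a consuming engine can reproduce the producing engine's behavior exactly.

// Hypothetical illustration only; none of these types exist in Substrait or
// Velox. It just shows the level of detail an IR would need to pin down.
#include <cstdint>

// What happens when integer arithmetic overflows?
enum class OverflowBehavior : uint8_t {
  kThrow,       // fail the query, Presto-style
  kWrapAround,  // silently wrap, Spark-style in non-ANSI mode
};

// Where do array subscripts start?
enum class ArrayIndexBase : uint8_t { kZero, kOne };

// Semantics that would have to travel with every function call in the plan
// for a different engine to execute it with bug-for-bug fidelity.
struct FunctionSemantics {
  OverflowBehavior overflow = OverflowBehavior::kThrow;
  ArrayIndexBase indexBase = ArrayIndexBase::kOne;
  bool nullOnDivisionByZero = false;          // or should it throw?
  bool countStringLengthInCodePoints = true;  // vs. counting bytes
};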
But what we started seeing in practice, as we implement some of those functions, is that there are so many really complicated details. Essentially every single function has all sorts of corner-case behavior. There are bugs. So if you want a bug-for-bug compatible implementation of different things, it's really hard. My understanding is that the Substrait project's perspective is that they're trying to capture all those details inside the IR itself. So essentially, of course, they don't have anything around execution, but the IR would have enough information for you to implement things in a way that the execution is 100% correct. There are questions on whether this is doable or not. There's also the part that, even if you capture every single detail, it is only as useful as the engine is actually able to provide all those different knobs. I think it is an open question.
My personal perspective is that getting an IR that was generated by one system and executing it on a different system, beyond maybe a Hello World example, where I actually have a complicated query with large expressions, functions, UDFs, I hardly see that really working in practice.
But let's see.
I think we have this discussion with the Substrait community
and they have people actually trying to enumerate
all those differences.
Let's see. That's still an open question.
Personally, still a little bit skeptical,
but let's see.
No, it makes sense.
Let's talk a little bit about what's next for Velox.
I mean, it is in a mature state. It's being used by Meta, at the scale that Meta has, and by other companies outside Meta, and it's actively developed, right?
So what's next? What are some exciting things about Velox that you would like to share with the audience out there?
Yeah, no, that's another interesting part. There are a few things that come to mind: things we're working on right now, things that have just started, and what we're looking at in the future. One part, maybe more tactical, is just the open source part.
I think there's a lot of discussion, just because of how fast the project is going and how much interest we're getting from the community, on how we evolve the open source mechanics, the open source governance, how we formalize a lot of those things. This is something we've been actively working on, not just inside Meta but also with external partners: discussions on how the community operates, decision-making, essentially open source governance. It's never an easy problem. So that's the short term, something we're looking at, of course, on top of just continuing to add features, optimizations, and the things we've been doing for a while.
The second part, which we've started exploring and where we're seeing some early results that are super promising, is this idea that now that you have all the different engines using Velox, the same unified library, it gives you a really nice framework to evolve what you have underneath, right? So essentially now, if your hardware is evolving, it's not just
SIMD, but if you have other
sort of accelerators,
you have GPUs, FPGAs,
or essentially if you want
to optimize your code
to make better use
of hardware as hardware evolves,
you couldn't do that before
because you had to do that
once per engine.
But now that we unify that
and you have a single library,
this framework is really compelling.
So we started investigating what it would take, essentially, what frameworks and what APIs we need to provide in Velox to be able to leverage GPUs, FPGAs, and other accelerators. So there's a lot of exciting work in that direction. We've been partnering with some hardware vendors as well. I think, of course, for hardware vendors it's also a very compelling platform: if I just integrate my accelerator into Velox, suddenly I can accelerate Spark queries, Presto queries, stream processing, and whatnot. So we're seeing a lot of interest in the project from the hardware vendor perspective.
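As a rough, purely hypothetical illustration of why a single shared library is such an attractive integration point for accelerator vendors (this is not Velox's actual extension API, just the shape of the idea): one offload hook implemented once, picked up by every engine built on the library.

// Hypothetical sketch only; Velox's real extension points look different.
#include <memory>
#include <vector>

// Stand-ins for the shared library's plan and operator abstractions.
struct PlanNode {};
struct Operator { virtual ~Operator() = default; };

// A hardware vendor implements this once against the shared library...
struct AcceleratorAdapter {
  virtual ~AcceleratorAdapter() = default;
  // Return a GPU/FPGA implementation for this node, or nullptr to fall back
  // to the default CPU operator.
  virtual std::unique_ptr<Operator> tryOffload(const PlanNode& node) = 0;
};

// ...and every engine on top (Presto, Spark via Gluten, stream processing)
// picks up the acceleration without any per-engine integration work.
std::vector<std::unique_ptr<AcceleratorAdapter>>& adapters() {
  static std::vector<std::unique_ptr<AcceleratorAdapter>> registry;
  return registry;
}

void registerAdapter(std::unique_ptr<AcceleratorAdapter> adapter) {
  adapters().push_back(std::move(adapter));
}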
We have some really nice initial results on just how much of your query plan you can delegate and execute efficiently on, for example, GPUs. We've also been working with companies that rely on accelerators based on FPGAs for now, and we're seeing a lot of good traction and initial results. So that's one area we're really investing in. Of course, we care about this at Meta, but it's also something where we see a lot of interest from the community, both from hardware vendors and from cloud vendors who want to enable different types of hardware. So this is another area we've been working on. Lastly, there's also a discussion we started, which is related to Velox but maybe plugs in a few different ways. It's just
the evolution of file formats. I think there's a lot of discussions
in the research community. I think even at VLDB a couple of months ago, there were two papers comparing Parquet and ORC and pointing out some of the flaws. So there's a discussion on, can we do better? What are some of the flaws? Of course, not just flaws in the design; the use cases we have and the access patterns to those files are also evolving. And those files, again, both of them were developed 10 years ago.
But there's a lot of discussion on how much or what we could change.
Internally inside Meta, we have a project called Alpha, which is a new file format. It's a little more focused on ML workloads.
So we also have been partnering with some other people from both academia and the industry
and trying to understand, is there anything better?
What comes after Parquet?
Like I mentioned, it's a little orthogonal to Velox, but it's related in the sense that Velox will be able to encode and decode those files as well.
Yeah, a hundred percent. I mean, at the end of the day, a big part of executing analytical workloads is reading from these files, right? And usually you read a lot of data. And I remember, because that paper you mentioned was very interesting, it was very interesting to see how much room is left in terms of being able to saturate S3, for example, when you are reading data from there.
So there's a lot that can be done there, for sure. But Parquet has been around for a long time; it's very foundational to whatever we do out there. So how do you see this happening, actually? How can we move into updating Parquet in a way that is not going to break things, and in some cases also, let's say, allow use cases that might be almost, how to say it, not compatible with the initial idea of Parquet?
And I have something very specific in mind about that, because you mentioned ML. I think one of the interesting things with ML when it comes to storage is that the columnar idea, where you take the whole column and operate on it, still applies, but there are also cases where you need more than that. You might need point queries too, or you might need different points in time for the features that you have, for example. All these things in ML, and AI as an extension, are important, but the formats were not designed for that stuff; those use cases didn't exist 12 or 13 years ago. So how do you do that? How do you iterate on these things when they're so foundational, right?
Yeah, no, I think absolutely.
That's the interesting question. There are basically two ways of going about it. We can try to evolve Parquet and just make backward-compatible changes.
Personally, I'm still skeptical of that.
And there was also Parquet v2, which is still an interesting process. But I think a lot
of the changes that we think that we might need to make, I'm not sure if we can make them in a
backwards compatible way. So I think there's a discussion of whether we can actually extend
things and make this kind of somehow Parquet compatible, or if we would need a completely
different file format. I think, going back to the discussion about ML, the first thing we saw that just breaks is that in a lot of those ML workloads you have thousands to tens of thousands of features, right? So if you want to map those things to a Parquet file, there are all sorts of inefficiencies, because it was not designed to have that many streams and columns. So this is one of the things we've been looking at. And like I mentioned, the Alpha file format we have addresses this specific pain point. But there are also discussions on
hardware acceleration. If we're assuming that GPUs are becoming even more pervasive, a lot of this computation will be offloaded to GPUs at some point. It means that table scans will be delegated to GPUs, and then your Parquet decoding will also be executed on GPUs. Another thing we've been seeing is that Parquet was also not designed with that use case in mind, and there are a lot of inefficiencies that make it harder, or at least not as efficient as it could be, when you map things to GPUs. And even for things like SIMD, there are a lot of encoding algorithms in Parquet that are not easily parallelizable.
So there's a lot of discussion on, if we had to design a new file format given those new requirements, supporting many different streams, but also encodings that can be executed efficiently, not just SIMD-vectorized but also hardware accelerated, what would that format be? And I think what we're seeing is that it could be fairly different from Parquet.
This is still an open question, like super
interesting research problem, let's
see. But
I think that's another very interesting area, and we see a lot of interest from basically all the partners and all the different companies we talk to; they all have this problem. Yeah, Parquet
is great, but we can
probably do better at this point.
Alright, one last question for you, because we kept going back to new hardware and new workloads. One of the things I find very interesting, that I'm thinking a lot about, and that I'd love to hear how you think about, is the following.
When it comes to working with data, we have two very distinct, let's say, models of computation, right? We have the relational model that we are using, all these operators that we have in the database systems, including Velox, right? Which is a huge foundation of what we have achieved as an industry, from when it was introduced, like 40 or 50 years ago, until today. And then we have all this new stuff with AI and ML, and of course we are also using different hardware there, and we have the tensor model of processing data, right?
And I mean, okay,
both are processing data,
but they are not exactly the same thing.
But we need both in a way.
We need somehow to bring
these two together.
Do you see a path
towards that without
having to abandon one of the
two? Or how
do you think about this problem?
Yeah, that's an excellent question. I don't know if I have a very good answer, but I think
what I would say is that I think if you look at the operations, they're all data operations.
Some of them were implemented by database people, like vectorization, joins, and expression evaluation, and a lot of them are just matrix multiplications and things implemented by AI people. So in a way, philosophically, you're getting data, shuffling it around, and executing transformations, right? I think it's just that the data layout of those things is all very different. I think one way to think about this is that all the data we have in the warehouse, and all the kind of database computations, run on top of Arrow, essentially, while the ML step runs on top of tensors.
So I think the way we're thinking about this, for now at least, is that it's less about unifying those two worlds and more about having efficient ways to pull this Arrow data and transform it into tensors. There's also a discussion about whether we should add support for tensors in Velox as well, so it could actually support tensors as a data type, and then you could potentially have tensors exposed inside Spark or inside Presto as data types as well.
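As a small illustration of what "pulling Arrow data and transforming it into tensors" can mean, here is a minimal sketch using stand-in types rather than the real Arrow or PyTorch C++ APIs: a fixed-width, null-free column is already a contiguous buffer that a tensor can view without copying.

#include <cstddef>
#include <vector>

// Stand-in for a fixed-width, null-free Arrow-style column: one float per
// row, stored contiguously.
struct FloatColumn {
  std::vector<float> values;
};

// Stand-in for a 2-D tensor view: a shape plus a pointer into existing memory.
struct TensorView {
  const float* data;
  std::size_t rows;
  std::size_t cols;
};

// Zero-copy "conversion": reinterpret N columnar values as an N x 1 tensor.
// Real Arrow-to-PyTorch pipelines do essentially this when the layout allows
// it, and fall back to copying when nulls or encodings get in the way.
TensorView asTensor(const FloatColumn& column) {
  return TensorView{column.values.data(), column.values.size(), 1};
}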
So how much and how deep this integration goes, I think is still an interesting
open question. I think if we were to redesign and rewrite the entire stack from scratch,
I think probably things would look very different. But just the fact that we have those two very mature worlds, one coming from PyTorch and another one coming from analytics, essentially databases, I think makes it harder. I know there's some very interesting work from Microsoft actually doing the opposite, trying to reuse ML infrastructure
to run databases.
And I know that they actually have been getting some pretty interesting results.
I don't know.
Let's see how all of this unfolds. But I think for us, at least having good ways to interoperate data between those two worlds, getting data from the analytics world, transforming it into tensors, and handing it to PyTorch, that's probably what we care about most in the short term.
But I do see that there's more potential for integration on that.
I think, like I said, I think it's just things that were created by different people and
ended up being unnecessarily siloed.
100%. Okay.
Pedro, unfortunately, we don't have more time.
I'd love to continue.
I think we could be talking for another hour at least.
Hopefully, we'll have the opportunity to have you again in the future.
There are so many things happening with Velox. And I think Velox as an open source project gives a very interesting view of what is happening in the industry right now. That's why I'm so excited when I have people like you on. It's not just the technology; it's also getting a glimpse of what is coming in general. Right. So hopefully we will have you again in the future.
Thank you so much. I've learned a lot, and I'm sure that our audience is going to learn a lot too. And if someone wants to know more about Velox, or wants to play around with Velox, or, let's say, wants to get their hands on it, what should they do?
Yeah, so first of all, thanks for the invitation. It was a great chat, a lot of fun to just nerd out about databases. I'm super excited to be here. If anyone in the audience is interested in any of that, about Velox, we usually just ask folks to reach out on the GitHub repo. Just go there, create a GitHub issue, reach out to some of us. There's also a Slack channel that we use to communicate with the community, so just send me or someone from the community an email and we can get you added. Just reach out if you have any questions, if you're planning to use it, or if you want to learn more about something. We're always pretty responsive on GitHub and Slack, so just send us a message or send me an email.
Thanks again, Costas, for having me. It was a lot of fun.
Yeah, likewise. We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week.
We'd also love your feedback. You can email me, Eric Dodds, at eric at datastackshow.com.
That's E-R-I-C at datastackshow.com. The show is brought to you by Rudderstack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.