The Infra Pod - Rust is for big data! Chat with Andy Grove
Episode Date: December 18, 2023. Ian and Tim sat down with Andy Grove (creator of DataFusion, Apache Arrow PMC chair) to talk about his original mission to build the big data ecosystem in Rust, and how that has evolved until now. ...
Transcript
Welcome back to yet another Infra Deep Dive.
This is Tim from Essence VC, and let's go, Ian.
And this is Ian from Snyk.
And I am so excited today to be joined by Andy Grove,
who is the creator of the Data Fusion Rust library.
Andy, actually, why don't you just introduce yourself?
Absolutely. And thanks for having me on. I'm happy to be here.
So I try and keep this fairly short, but I've been a software developer for an embarrassing number of decades in the industry.
And I guess it got kind of interesting 10 to 15 years ago.
I moved from the UK to the US to join a startup and started building some data infra. I guess we started out with a database sharding solution. We had some technical success, but not a ton of
commercial success with that. And then the company pivoted and we started building a distributed SQL
query engine specifically for the reinsurance industry, which is a little bit weird. And I
started out in C++ way back, jumped onto the Java bandwagon before 1.0
because I didn't have to worry about segfaults anymore.
That was cool.
And I could kind of be more productive.
So I did JVM for a very long time.
Developers tend to build things in the languages that they know,
not necessarily the best languages for a project.
So I was building this distributed SQL query engine in Java.
It was really interesting, learning a lot. And this was around the same time that Apache Spark was becoming popular, and our main client and investor at the time decided to switch to Apache Spark, which was a sensible choice for them. So the work we had been doing kind of became redundant, and basically that was kind of the end of that company. So I was left in a situation where
I'd been learning about building query engines, and it's really fascinating to me, and I wanted
to carry on learning. And at the same time, I'd started to learn Rust. So just for fun,
I started building some query engine tech in Rust in my spare time. Eventually, I got to a point
where I decided that I was going to try and build something like Apache Spark in Rust.
So that's kind of where that all started.
Amazing. And so basically your career has been like rebuilding like data processing engines
over and over and over again.
What was it about Rust that you said, ah, I'm going to go rebuild a massive system like Apache Spark?
Why was rebuilding Apache Spark in Rust interesting?
What do you think the advantages were of Rust for this job type?
So to me, Rust kind of represented the best of both worlds of C++ and Java.
The performance is very similar to C++.
And then, like Java, you don't have to worry about memory safety,
but in Java that's thanks to the garbage collector, which has pros and cons.
And Rust seems like an ideal compromise between those.
So there's no GC, obviously.
But the compiler saves you from making terrible mistakes
that end up in segmentation faults.
And that seems just really interesting to me.
And it seemed ideal for the intense parts of data processing systems.
It's really good to have a native language.
And I just didn't want to go back to doing C++.
I kind of resisted that.
So I was kind of very stubborn and just waited a long time
until a better language was invented.
And when Rust came along, I thought, ah, this is my chance
to go and build something more efficient and more scalable.
And with distributed systems, it's kind of interesting.
Like the moment you have too much data to process
in a single node and you go distributed,
you have a ton of overhead.
So the more you can do in a single node,
the longer you can kind of put off that point where you have to go distributed,
or you can at least reduce the number of nodes you have, which cuts down on some of the overhead.
So I was excited about the memory efficiency of Rust compared to Java. And that's something
that I saw kind of bear out in some of my early experiments in this area.
So a lot of it sounds like the JVM at scale has huge amounts of overhead from the garbage collector and just running the JVM in general. With Rust, I can build a more memory-efficient version of an engine, have more control over it, and have the safety aspect of not dealing with segfault hell.
And plus a new language that was interesting and fun. On the problem domain: when you first started building, and today, actually, in general, how well is Rust suited to solving these data problem domains?
Often, if you were to ask someone
about building a data pipeline
or data scientists spend a lot of time
using Python tools,
and we have Kafka
as one of the best examples
of a thing in the JVM.
What is your answer to those questions?
Yeah, it's certainly challenging.
The main problem was just the lack of maturity
of the whole ecosystem around Rust.
There was no FlatBuffers implementation in Rust yet, and Protocol Buffers support wasn't quite there.
That was really a big issue
in some of the things I was trying to do initially,
like building a distributed system.
And it got to a certain point where I realized this project was just too
ambitious, too hard, and it's kind of hard to build a community around it.
So I kind of pared back the scope a bit, and instead of going straight for distributed, I focused on making it an in-process query engine, which, you know, got rid of a lot of requirements around things like serialization formats.
That's probably the major issue. There were some annoyances as well back then because Rust
was earlier on and the language kept changing.
So you'd get the next nightly release and your project wouldn't compile anymore.
That was fun. I think you are either a core contributor
or were at one point a core contributor of the Arrow implementation
for Rust. Can you explain to our listeners, what is Apache Arrow?
Why is it amazing?
Then can you talk a little about why Arrow helps building these data tools?
Yeah, absolutely.
So my initial excitement was around Rust the language, and that's all very cool.
But as I started sharing details of the project on Reddit, lots of smart people gave me feedback.
And one of the things that multiple people told me is that nobody should be building
row-based query engines in 2018.
You should be doing columnar processing.
I was aware of the concept, but I hadn't used it, and I didn't really get it at the time.
But I started doing some experiments with columnar.
I saw some big speedups right away.
And people told me to go check out Arrow, which I did.
But there wasn't a Rust implementation. I was just using Rust's Vec type to hold arrays of data.
And yeah, I mean, that was showing some benefits.
But Arrow definitely appealed to me.
So I started building a Rust implementation of Arrow.
And I have to really caveat that.
And the initial thing I built and donated was a very small subset of Arrow.
And what exists today is massively better, thanks to the huge community behind it. I just kind of got the ball rolling at least and created
like an MVP of a Rust implementation of Arrow. Going back to the question you asked I guess,
so what is Arrow? Arrow really started out as a memory specification for columnar data and the
benefit of having a specification for a data format is that you can kind of share data between different languages, even between different processes, without the cost of serialization.
When you're using things like Apache Spark, and then you want to kind of go to Python, you've got to take the data from Spark formats, rewrite it in some other formats, and that's like incredibly expensive and wasteful. So with Arrow, there's an opportunity now. It can be kind of retrofitted
to existing systems, but we're seeing like a whole new ecosystem built that's Arrow native.
And that gives you that kind of seamless interop between different languages. That's pretty
exciting. And that's just the memory formats. Then there are implementations of Arrow, which
provide basically compute kernels. So this is just code that you can run against your data.
So maybe you have two arrays of numerics.
You want to add them together just to take a trivial example.
Or maybe you want to do some kind of an aggregate.
There's a bunch of highly optimized kernels there in multiple languages now.
For a long time, it was primarily Java and C++.
But Rust now is another very active implementation.
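To make the compute kernel idea concrete, here is a minimal sketch using the Rust arrow crate to add two numeric arrays element-wise. Treat it as illustrative: the exact module path for the arithmetic kernels has moved between arrow-rs releases, so the import shown here matches older versions and may need adjusting.

```rust
use arrow::array::Int32Array;
// NOTE: the kernel location has shifted across arrow-rs releases
// (e.g. compute::kernels::arithmetic vs compute::kernels::numeric),
// so this import is illustrative rather than version-exact.
use arrow::compute::kernels::arithmetic::add;

fn main() {
    // Two columns of numerics, stored contiguously in Arrow's columnar layout.
    let a = Int32Array::from(vec![1, 2, 3, 4]);
    let b = Int32Array::from(vec![10, 20, 30, 40]);

    // A vectorized compute kernel: it operates on whole arrays at once
    // instead of looping row by row, which is where the columnar speedups come from.
    let sum = add(&a, &b).expect("arrays have the same length");

    assert_eq!(sum.value(2), 33);
}
```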
Rust is a super interesting ecosystem.
One of the core questions I have is
today, there's a lot of people building
data tools in Rust. What do you attribute that rise to?
You're one of the very first people. I mean, you started
for example, the Arrow project.
I remember looking at the Rust ecosystem, seeing DataFusion
three years ago and being like,
this is really interesting. So what is it
about Rust today and why
are we seeing so many more people building on Rust?
And if you have some examples of popular projects
or pieces of software of people that have built stuff in Rust,
that'd be great too.
Yeah, sure.
I mean, I guess when I started, Rust was maybe controversial.
It certainly wasn't widely popular.
But now, so many of the large companies are backing Rust.
I think Microsoft just came out with an announcement.
They're investing heavily.
Amazon are building a bunch of infrastructure in Rust.
So I think Rust has just matured to the point where it has that kind of universal acceptance
that this is a real thing that companies can afford to kind of take a bet on,
which maybe was seen as risky early on.
And yeah, there are lots of projects being built in Rust.
Polars is probably one of the most well-known ones, which is essentially like a Rust-native Pandas replacement. It uses its own
implementation of Arrow, not the official
Rust Arrow library. That may be
changing over time, I'm not sure.
InfluxData, with InfluxDB, is maybe the most well-known one as well. They decided to completely
rebuild their core database engine in Rust.
And rather than start from scratch and have to build a whole new engine,
which is a lot of work, they decided to base it on Data Fusion.
That gave them a huge head start, and they've been really great
at contributing back and helping make Data Fusion more mature,
which has been really cool.
You have a super fascinating journey, because I'm reading the blog post from 2018, Rust is for big data, right? I think you even attributed to that post, in a lot of the recent posts, that this is where it started. Right. And it all started with, I don't want the JVM. Like basically, I want to explore how to use this newer, much more efficient language. And that journey led down to, you know, Data Fusion, Arrow, Ballista, and now you're back working on Spark again, using RAPIDS. So I want to maybe talk about this sort of Ballista, Data Fusion era. You started it as a side project wanting to use Rust to build for big data, and you went down the path of wanting to build a new Spark. Yeah.
What is the biggest learning for you trying to rebuild Spark from scratch using Rust?
What was like the biggest challenge and what are surprising things you learned in this journey?
It was really interesting for sure.
And like, it's a lot of work to build a distributed query engine.
It wasn't like I was trying to make something that has the same maturity as Spark,
which has been around for like a decade
with more than a thousand contributors.
You know, mine was definitely like a toy project.
It took me a long time to really build anything.
I mean, I was working on this in my spare time,
took a break from the project when I got frustrated,
then I had to come back a few months later
and have another go.
So it took a very long time.
Momentum was kind of slow,
but after like a few years working on this, I got to the point where I could run the TPC-H queries.
That was a really big milestone for me. But I think one thing I learned is that Spark is a really mature product, and it was really challenging for me to even match the performance of Spark for a long time, even though I was using things like Arrow and Rust. A lot of the code in Spark has been heavily optimized over the years.
It doesn't really matter which language you're using.
You always have to go through that work
of finding the best data structures and algorithms
and fine-tuning things to get really great results.
We just assume Rust will just give you efficiencies.
But like you mentioned, so much of the performance of Spark is beyond just the pure language. Of course, Spark has been developed for so long now that they've optimized it with more efficient Java and JVM techniques. What is the hardest part when it comes to even getting close to that performance?
Is it just the algorithm side? Like, how do I actually get better ways of doing, you know, SQL and some algorithm-type stuff?
Yeah, I think there's two main areas. One is classic query optimizations to your query plan. So imagine doing a join between two tables, and you have some filter conditions to filter some rows out. You know, it's better to filter the rows out before you do the join, otherwise you're producing all this data and then filtering it down. So all query engines have basic optimization rules, like pushing filters through joins, just to take a simple example. But then there are more advanced optimizations that use statistics.
So there are different join algorithms, but a very common one is a hash join, where essentially you load one side of the join into memory in a hash table, then you stream the other side and do the lookups in the hash table. But you really want to put the smaller side of the join in the hash table,
not the larger side, because of the memory constraints.
So you need rules to figure out, like, okay,
which side of this join is going to be larger.
And that's not always simple.
You can't just look at the number of rows because there are filter conditions.
So you have to predict how selective those filters are.
Is this filter going to reduce the table by 90% or not at all?
So that's the side of it where there's been a ton of research
into that literally over decades.
So having all of those optimizations to produce a good plan
is one side of it.
Then there's the actual implementation code for those algorithms,
just trying to make those as efficient as possible.
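As a rough illustration of the hash join Andy describes (a toy sketch, not any particular engine's implementation), here is plain Rust that builds a hash table from the smaller input and streams the larger one through it. Real engines work on columnar batches, handle nulls and spilling, and use statistics and selectivity estimates to choose the build side.

```rust
use std::collections::HashMap;

/// Toy hash join on integer keys: build a hash table from the smaller input,
/// then stream the larger input and probe the table.
fn hash_join(left: &[(i64, String)], right: &[(i64, String)]) -> Vec<(String, String)> {
    // Pick the smaller side as the build side to keep the in-memory hash table
    // small -- the decision a cost-based optimizer makes using row counts and
    // estimated filter selectivity.
    let (build, probe, build_is_left) = if left.len() <= right.len() {
        (left, right, true)
    } else {
        (right, left, false)
    };

    // Build phase: key -> all payloads with that key.
    let mut table: HashMap<i64, Vec<&String>> = HashMap::new();
    for (key, payload) in build {
        table.entry(*key).or_default().push(payload);
    }

    // Probe phase: stream the larger side and look up matches.
    let mut out = Vec::new();
    for (key, probe_payload) in probe {
        if let Some(matches) = table.get(key) {
            for build_payload in matches {
                if build_is_left {
                    out.push(((*build_payload).clone(), probe_payload.clone()));
                } else {
                    out.push((probe_payload.clone(), (*build_payload).clone()));
                }
            }
        }
    }
    out
}
```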
I guess I'm very curious, what is Data Fusion now?
It seems like it's now sort of like a mainstream project per se,
but it's quite unknown what the adoption is, what the actual state of it is.
Can you maybe talk about what Data Fusion is?
Absolutely.
So Data Fusion and Ballista are both part of the Apache Arrow project now. I'm still involved a little bit, but not actively coding right now. So what is Data Fusion today? Data Fusion is a great foundation for building new query engines or data systems. Data Fusion is very composable, it's very extensible. The design of it is totally separate modules, or crates: you have a SQL parser, you have a query planner, a query optimizer, there's the execution engine, and you've got your extension points for plugging in user-defined code, user-defined functions. So if you wanted to build a new query engine today, maybe with some proprietary format, you can be up and running in a few days, because it's like a toolkit.
You take the bits you need,
you plug in the code that's special
to your application or file formats,
and you've got a great start.
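For a flavor of what that toolkit looks like when embedded, here is a minimal, hedged sketch of using Data Fusion as a library. The table name and file are made up, and the exact API surface (SessionContext, CsvReadOptions, and so on) has shifted across DataFusion releases; a custom or proprietary format would instead plug in through DataFusion's TableProvider extension point.

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // A SessionContext bundles the SQL parser, planner, optimizer and
    // execution engine; each piece can also be used or replaced separately.
    let ctx = SessionContext::new();

    // Register an existing file as a table (file and table names are hypothetical).
    ctx.register_csv("orders", "orders.csv", CsvReadOptions::new()).await?;

    // Plan, optimize and execute a SQL query; results come back as Arrow record batches.
    let df = ctx
        .sql(
            "SELECT customer_id, SUM(amount) AS total \
             FROM orders GROUP BY customer_id ORDER BY total DESC LIMIT 10",
        )
        .await?;
    df.show().await?;

    Ok(())
}
```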
There's another project that I drew inspiration from,
which is Apache Calcite.
Calcite was used as the foundation
for quite a few kind of big data query engines
over the years.
But Calcite is JVM-based, not Rust.
Data Fusion is a very active project.
Every time I go and look, there's just so much activity.
It's kind of hard to keep up with.
And there are many, many companies now building new platforms
on top of Data Fusion.
And the best-known being InfluxDB.
Yeah.
There's a company called Synnada doing some interesting things
on there as well.
If you go to the Data Fusion repo in Apache Arrow, there is a list of projects building on top of Data Fusion.
And yeah, it's a pretty long list these days.
It feels like the project's got that kind of critical mass of momentum where it's going to be around for a while and people are depending on it, investing in it.
And if you were to start a new project or a new company today, you're going to go solve some data problem, where would you land? Is your
perspective that Rust is ready for primetime to go build? If you were going
to build a Kafka replacement, would you start with Rust? If you were going to go build a better
Apache Spark, would you start with Rust? Is that where you think the future is?
Yeah, great question. So in terms of languages, yeah, I'm
still a big fan of Rust. I think Rust is a great solution for building data systems. However, within any system, there are portions where performance is really critical, like actually processing these joins, and then there are things like a scheduler that's keeping track of what's happening and the different processes running on the network, where performance matters less.
And maybe it was overkill just to try and write everything in Rust.
There was one project a few years ago, it's no longer active, called Blaze, which basically provided a Spark accelerator using Data Fusion.
It used Spark for query planning and orchestration, but then delegated down to Rust code to actually do the query execution.
And they were seeing some pretty reasonable results, like 2x on queries.
So with hindsight, that would have been maybe a smarter strategy for me to try and make all this,
like get momentum sooner.
If people could just keep running Spark, but slowly replace bits of it with Rust,
that might have been a better path to adoption rather than just throw this away and start again
with this product that has 1% of the features.
Out of curiosity, and I know this might be sensitive, but I imagine you could probably work on Data Fusion full time, but now you're working on the RAPIDS Accelerator for Spark.
Was there a reason you wanted to work on that?
Oh yeah, no, it's a really easy question to answer.
NVIDIA kind of noticed what I was doing with Data Fusion,
saw that I had skills around Arrow and query engines,
and they called me and said, would you like to come work for us?
And I said, yes, and that was basically it.
So NVIDIA is a great company to work for.
And so the work I'm doing there, we're accelerating Apache Spark using the cuDF GPU DataFrame library. So essentially delegating the actual physical execution down to the GPU, which can produce some pretty great results.
It's interesting. Like Data Fusion, I was building query engines as a hobby. There have been times when I've been doing aspects of it as my job, but then it stopped being my job and it became my hobby for a while, and now it's my job again. And now that I'm doing it as my job, I'm less inclined to spend my weekends continuing to do similar types of work on Data Fusion, if that makes sense.
Yeah, that's really interesting.
Like GPU with Spark is not new,
but it never really gained wide adoption per se.
It's always some interesting prototype.
And let me ask, like, why would somebody want to use RAPIDS, and to accelerate what kind of workloads on Spark? And where do you find the trade-offs? Because obviously GPUs are not that cheap or easy to come by, so the running cost will be much higher. Where do you see people adopting RAPIDS for Spark acceleration? And how do you usually see people actually use it in practice?
Sure.
And I think that on the NVIDIA website,
we have some case studies of people using this,
and that's probably a great place to go
to look at kind of numbers and cost savings.
But the one nice thing about the solution
is that there are no code changes required.
You literally drop in a new jar and some configs,
and now your SQL ETL jobs are accelerated on the GPU.
If that works for you in your use case and you see great results,
then it's kind of a no-brainer.
It can vary depending on your exact use case
and what functionality you're using as to how good the performance improvements are.
If companies have GPUs, they're running Spark,
they can drop this in and make it go faster.
It's just like an easy thing to do for cost savings
or getting the jobs to run faster.
Are there certain types of workloads
where it makes sense to do this,
where it doesn't make sense for others?
Like what are some circumstances
where like using a GPU with Spark makes a lot of sense?
What are some circumstances where it's like,
you shouldn't bother?
It's not performant enough to justify the additional cost of the GPU in the cloud or, you know, of buying a GPU?
I'm not sure I can give a great answer to that.
I mean, I was kind of surprised when I first got involved in the project
because I didn't realize that GPUs would be good at the type of operations
that happen within like a SQL engine.
But it just turns out that GPUs have just like got so many cores,
even if you're doing something that's not as efficient as it would be on the CPU,
just the fact that you have so many cores means it can go faster anyway.
And, for sure, there are certain operations that, you know, are really well
optimized for GPU, some others maybe not so much yet, but it's always an ongoing effort
to keep optimizing kind of more and more of the different operations and expressions that
people typically use in Spark.
Cool.
So we're going to jump into what we call the spicy future section.
What we usually talk about is the future. Where do you see the whole data infrastructure space going? I think particularly given your involvement writing sort of the new data infrastructure in Rust, and now even getting into accelerators. Curious what you see the next five years will look like.
Do you see the ecosystem change?
Do you see more data infra work built on new languages?
And maybe just give us your hot take.
So I'm not sure these are really like my hot takes.
Other people have formed these thoughts as well.
It's not just me.
There's a definite trend towards composable data systems
and composable query engines.
So it's no longer the case where you necessarily have to have
like one thing end-to-end.
There's some interesting projects.
Data Fusion is one that's very extensible.
Meta are building Velox, which is a C++ query engine.
It doesn't have a front-end.
It doesn't have a SQL parser or query optimizations.
It's just the execution.
And then there are projects like Substrait, which is kind of an intermediate representation of query plans. So theoretically, you could maybe take Apache Calcite or Data Fusion to do your SQL parsing and query planning, go through Substrait into Velox for execution. So you can start mixing and matching different
things. Voltron Data, they're producing a lot of good content in this area.
They've been investing in Ibis, which is a Python front end for queries that can plug into different backends. That's one area that looks pretty interesting.
I'm keen to hear more about Wasm in this space. I haven't really heard a lot about that,
but it seems a bit wasteful that Velox and Data Fusion, Databricks with their Photon engine, lots of people are building these query engines. It'd be great if we could share more work across
them. But we've got Rust and C++ and Java for these different things. I don't know,
it'd be interesting to see if there's some WASM or if users could write their user defined functions
in WASM so they can run on any of these platforms, that'd be kind of an interesting area. And I also think one of the big challenges, like moving data around. So like these
days, compute and storage is very separate. And it'd be great if we could push filters, predicates
down from query engines into storage. Again, maybe with Wasm and Substrait, there's like a universal way of being more specific
with storage about what data you actually want to retrieve.
That's one whole area that is kind of interesting to me.
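As a sketch of the Wasm UDF idea (an assumption about how it could look, not an API that any of these engines ships as discussed here), a Rust-based engine could host user functions compiled to WebAssembly with a runtime like wasmtime. The module file name and exported function below are hypothetical.

```rust
use wasmtime::{Engine, Instance, Module, Store};

fn main() -> anyhow::Result<()> {
    // Compiler/runtime for WebAssembly modules.
    let engine = Engine::default();

    // A user-defined function compiled to Wasm from any source language
    // (Rust, C, AssemblyScript, ...). The file name is made up for illustration.
    let module = Module::from_file(&engine, "udf.wasm")?;

    let mut store = Store::new(&engine, ());
    let instance = Instance::new(&mut store, &module, &[])?;

    // Look up a typed export: here, an i64 -> i64 scalar function the host
    // engine could call per value (or, more realistically, per batch).
    let udf = instance.get_typed_func::<i64, i64>(&mut store, "add_one")?;

    // The host stays the same no matter what language the UDF was written in --
    // that's the portability argument.
    let result = udf.call(&mut store, 41)?;
    println!("udf(41) = {result}");
    Ok(())
}
```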
It's super interesting. Wasm is a shiny
opportunity for interoperability. It's interesting
how it has so much opportunity to reinvent so many
parts of the stack, specifically in data land, like the ability
to move the UDF around
between systems,
or even just to move a UDF
that can be written in any language
and compile it to Wasm
and then send to the system.
I think, like, from my perspective,
that would really democratize access
to these systems.
Like, one of the biggest challenges
that we have to get any engineer
to work with some of these tools
is with Spark as a great example.
It's like, well, you've got the JVM or you've got PySpark.
And PySpark is not Python, right?
And the JVM is not Python.
It's a very different system.
So if you're a JavaScript developer, a TypeScript developer,
or somebody that's building websites,
you do not have anywhere near the skill set to even pick some of these things up.
Now, you could argue and say that those types of people will never have big data problems.
But there's lots of organizations that have big data problems and their engineering skills
that they're equipped with is, I'm a TypeScript engineer, or I write some Go, and now I need
to go solve a data problem.
When you pick up those tools, they don't look anything like what you know. And if you're a VP of eng or a CTO, your choices are basically, oh, I've got to go hire data engineers, people who are used to working on these large data ecosystems, who are used to these things, these very specific skill sets.
So I think Wasm is really interesting because it would enable interoperability. Assuming you had, like, a serverless platform, and I could write the functions in, you know, my TypeScript, and then it compiled to Wasm and I could push it, think of how that changes the way we structure engineering organizations, and also think about how that opens up more people to move up and down the stack in a way they couldn't before, right? Because I think everyone can understand the complexities of, like, how much throughput I'm trying to put through this thing, right? That thought process moves between the different systems, the different modalities, but the specifics of the JVM or PySpark don't move as easily.
So we've spent a lot of time talking about how important GPUs are to enable LLMs, how LLMs are going to change the world. Well, what about GPUs in the data ecosystem, right? What about databases with GPUs? Where do you see the future of the GPU in the context of
Yeah, that's a great question.
Obviously, I work for NVIDIA, so I'm very biased on this, and I have to be careful what I say.
But I mean, seeing the work that the Rapids team does and the Spark Rapids team,
there's some incredible engineering going into building data technology.
So I do believe that GPUs will continue to play an important role in this space.
Some of the speedups on GPU, I mean, again, it really depends on the use case.
And the NVIDIA blog website has all the information, but there's some incredible speed ups on particular operations on the GPU.
So I think for the larger companies that have the true big data and have the budget for GPUs, I think it's a very compelling solution.
And then there's the other end, like the people using Polars and things like that.
GPUs can still play a really good role there too.
I mean, RAPIDS has cuDF,
which can be used pretty much
as a drop-in replacement for Pandas.
So it's in a similar space as Polars.
So yeah, I highly recommend people try that out as well
if they haven't used that yet.
So I want to go back and touch
on your composable data systems.
It's really interesting, right?
Because like I said, we see more projects
that are able to be more composable or extensible.
Looking at all the traditional databases,
there's always been something somewhat composable. There's plugins, there's extensions you can do.
That's the extent of what you could do for the most part.
Now you look at the more traditional, even like data warehouses,
they allow you to do like Snowpark, right?
You can write Python code and stuff like that.
But we're talking much deeper now, right?
Now building any data systems or database query engine type things,
I can grab a couple of projects and be able to create a new data system faster.
One advantage I can see having a composable data system is, hey, me building a new database,
I have some particular requirements.
I don't have to go from scratch, right?
We have storage engines.
We have that queries parser.
We have this kind of thing.
I think Facebook has another project too, right?
There's a bunch of these things to go faster. But there's also another
angle of what composable data systems are. From a user perspective, I should be using something
that's more composable because maybe there's some motivation for me to not be locked into
one particular implementation. Maybe I do want it extended. I'm curious about your thoughts,
because I'm trying to figure out why do we want composable data
systems in the first place? Is it just
faster implementation for new ones?
Is there any benefits for the user side?
Or do you see something else? Why
is the world moving towards that?
And what's pushing this this way?
Yeah, so
I think there's maybe two use cases that come
to mind. One is what you kind of touched on already,
which is people building new systems from scratch.
I want to get that headstart,
not having to build all the components.
But another area is where you have
kind of integration requirements
between different languages or tech stacks.
So maybe you already have an investment
in a bunch of Java code
that's integrated with Apache Calcite for query planning
and you're happy with that,
but you want the faster execution. So if you can bridge your JVM query planning with like Velox or Data
Fusion for execution, that could be a quick win if you're moving away from some other kind of JVM
based execution engine. That also ties into this whole concept of Spark accelerators. And I mentioned
the Blaze project. So if you're doing something like that, if you have Spark or some other query engine,
and you're just trying to accelerate certain components of it,
that's where the composability can really come in,
just to swap out certain pieces.
Writing databases is always hard, right?
There's not that many developers in this world
that can really competently build a production database.
People can learn, but it takes years, right?
I do wonder, making databases more composable
and easier to build
doesn't mean everybody would just go ahead
and just do them.
Do you believe there's going to be more people
building their own data systems
and capable of leveraging this?
So like, hey, maybe the enterprise is like,
instead of just buying products,
we should actually come in and use these and piece together our Legos here.
Do you see that happening way more often? And is lowering the barrier to entry of building a
system going to enable this too? I think people love building databases and data systems. And
yeah, having these composable systems will maybe encourage more people to do that. Whether that's
a good idea or not, I'm not sure. I don't think I appreciated when I started on this just how much work it is to have a production quality system.
Like, you look at Apache Spark, there are so many different operators
and expressions to support. And you get into details of any of these expressions and things
like type coercion and casting between types. There's like so much technical detail. It's
a lot to build from scratch. So having these composable systems hopefully takes care of a lot of that stuff for you so you can
just focus on building the parts that are unique to your business.
So yeah, I think there will be more of that. In one role I had
a few roles ago, we had a situation where
we had a Spark cluster. For certain queries, it was fine. Other queries, it didn't make so much sense.
And we built a query gateway.
So as queries came in, we could inspect them,
see what type of queries they were,
and then make a decision like,
do we just delegate to Spark for this, because we really need a cluster since it's kind of computationally expensive? Or is this just a trivial query that we can just process ourselves in memory?
And that was kind of interesting.
So having these kind of building blocks will allow people to build those kind of infrastructure more easily.
You know, things like that can add a lot of value to an organization.
And it's not building a complete database.
It's just more query intelligence and routing.
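As a rough sketch of that kind of query gateway (not the system Andy built), one way to do the routing in Rust is to plan the SQL locally with DataFusion and inspect the logical plan. The "heavy query" heuristic here is made up purely for illustration, and method names like into_optimized_plan vary across DataFusion versions.

```rust
use datafusion::logical_expr::LogicalPlan;
use datafusion::prelude::*;

/// Decide whether to run a query in-process or hand it off to a cluster engine,
/// based on what the optimized logical plan contains.
async fn route(ctx: &SessionContext, sql: &str) -> datafusion::error::Result<&'static str> {
    // Parse, plan and optimize the query without executing it
    // (assumes the referenced tables are already registered in ctx).
    let plan = ctx.sql(sql).await?.into_optimized_plan()?;

    // Illustrative heuristic: treat joins and aggregations as "heavy".
    fn is_heavy(plan: &LogicalPlan) -> bool {
        matches!(plan, LogicalPlan::Join(_) | LogicalPlan::Aggregate(_))
            || plan.inputs().iter().any(|child| is_heavy(child))
    }

    Ok(if is_heavy(&plan) {
        "delegate to the Spark cluster"
    } else {
        "execute in-process"
    })
}
```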
I'm curious to get your take on interplay of Rust and LLMs and data systems and how these all interplay.
One good example of a database system that everyone's talking about now that no one talked about two years ago is the Vector Store.
I'm kind of curious to get your spicy take on where you think the future of Vector Stores are going and how that interplays if you have an opinion on it.
But more importantly, I'm really interested to get your opinion and perspective on the future of LLMs and data infrastructure and where they fit or maybe don't fit.
So I don't know too much about vector stores, vector databases.
The one area where I do see a natural fit is more in the front end.
There's already projects out there doing this, but like who wants to write complex SQL when you can just ask a question?
I see that being very powerful. In terms of more backend infrastructure things, I guess I don't have enough knowledge around LLMs for that.
Yeah, we're all figuring it out.
So no worries on that.
Going back to one of the questions, because you actually touched on a really interesting point, right?
I think composable data systems allow more possible layers to exist in production stacks.
You talk about the proxying layer, and we actually see a lot of different proxies being created these days.
But most of the people building them are large enterprises or people that have a dedicated team. They're the only ones that can do them. You need specific knowledge of how things work.
I think we alluded to there being more people building this. What kind of patterns do you see people building, what new layers?
Is it proxying just for cost reasons?
Do you see other things or other layers
people will be building for production
that this allows them to do?
Like, I'm just curious, what have you been observing?
Like, oh, wow, that's not just a database anymore.
We see people doing more of these kind of things.
Sure.
And honestly, this is kind of a weird, frustrating thing for me: I don't really get to see much of that.
There are lots of people using Data Fusion and contributing,
but I don't really know what they're doing with it.
And I'm sure some of it is like stealth startups.
But yeah, there are probably people out there
building these cool kind of gateway, router-type solutions.
I don't really have insights into what people are doing with that these days,
because I'm very much focused on just kind of building these foundational components.
One of the, certainly the biggest frustrations I've had in constructing data infrastructure and thinking about how you secure it is, like, mesh now. There's been a lot of talk in the data world about the concept of data mesh, the sort of inversion of control, specifically in online data systems: less about the warehouse, more about, I've got, you know, a database that's producing a CDC stream into a Kafka, and that's being produced into some other streaming system, then some other streaming system, then back into a database to then be served to some customer someplace, right? Like the actual online loop,
if you will. And my frustration has been like building stuff in that space and making it
composable is almost impossible. And there are companies, some that we've had on the pod,
trying to build more composable systems that make this whole layer more programmable.
And it seems to me like, you know, we're at this inflection point in data where, with libraries like what you built with DataFusion, with Wasm coming online, we'll now have actual programmable data stacks in a sense that we haven't had before. We'll be able to operate a layer above the infrastructure. Less time connecting a little piece of JavaScript to talk to a Kafka pipeline, then talk to Flink, and more of a higher-level orchestration layer that we then submit to this layer. And I think things like Data Fusion
enable that. I came to this conclusion
three or four years ago when we were building Cape Privacy
and what we were trying to build is
a data security and privacy layer
so that compliance folks,
security folks can write some data security
rules about what data can pass to
what places. And then you go and look at the tools and you realize that there's no insertion point,
right? Like there's no good insertion point. There's no good place to build this type of
infrastructure. You have to go rely on the engineers to like put the control points into
the data flow. And then you have to audit that the control points are in the data flow. And it's just
like a terrible experience. So I guess my hot and spicy take, and I started making a point and now I'm on a diatribe, is it seems to me like the best place for the engineering organization to get to is this sort of beautiful world of plug-and-play data streams. I can listen to anything that's there, and then we can drop in pieces of functionality and control points without impacting the customer. The customer in this case being the engineer actually trying to build a piece of functionality that works.
I'm curious from your experience with Spark
and with the stuff you've been doing with Rapids,
how do you feel like the pluggability is in Spark today, right?
Like today versus where it could be,
what do you see that difference being?
Yeah, Spark actually has a pretty good story
in terms of like its plugin architecture.
So yeah, you can certainly add your own plugins
to do things like replacing the physical plan
or providing your own shuffle manager.
There are multiple companies
that are doing some form of acceleration with Spark.
So yeah, over the years,
all of those extension points have been kind of baked in.
So I would say it does a really good job in that area.
Cool.
We've touched on a lot of these kinds of questions. Is there something that you're personally most excited about? Like something that people will be generally excited about? Like maybe it's about RAPIDS, maybe about Data Fusion, that you're looking forward to?
So I'm not working on anything really outside of the day job anymore. So there's no kind of surprise things coming up
for me. I think what I'm excited about really is just seeing the momentum of data fusion keep building.
Performance is pretty good in data fusion already.
In some cases, it's really good.
In other cases, not so much.
And that's partly because people are building their own platforms on it.
They aren't necessarily interested in optimizing all the things in data fusion.
But I'm seeing more of a drive to get better results on some of the popular public benchmarks, because that's kind of important. People will look at these benchmark results and get excited about the fastest thing. So, you know, I think it is important that the Data Fusion community does put some more time into that, so we're kind of up there. That's pretty exciting, to continue to grow in maturity and get faster
and just have a bigger community behind it.
Awesome.
Well, Andy, thanks so much for coming on our podcast.
What's the best way for people to follow you
or look at the work you do?
What's the best way to find that?
So, x slash Twitter, I'm andygrove underscore io.
You can find me on GitHub as andygrove and my website
andygrove.io, I guess, are the best places.
Awesome. Thank you
so much, Andy. It was such a pleasure. Well, thanks so much
for having me on. This was a lot of fun.