The Infra Pod - Rust is for big data! Chat with Andy Grove
Episode Date: December 18, 2023. Ian and Tim sat down with Andy Grove (creator of DataFusion, Apache Arrow PMC chair) to talk about his original mission to build the big data ecosystem in Rust, and how that has evolved until now. ...
Transcript
Welcome back to yet another Infra Deep Dive.
This is Tim from Essence VC, and let's go, Ian.
And this is Ian from Snyk.
And I am so excited today to be joined by Andy Grove,
who is the creator of the Data Fusion Rust library.
Andy, actually, why don't you just introduce yourself?
Absolutely. And thanks for having me on. I'm happy to be here.
So I try and keep this fairly short, but I've been a software developer for an embarrassing number of decades in the industry.
And I guess it got kind of interesting 10 to 15 years ago.
I moved from the UK to the US to join a startup and started building some data infra. I guess we started out with a database sharding solution. We had some technical success, but not a ton of
commercial success with that. And then the company pivoted and we started building a distributed SQL
query engine specifically for the reinsurance industry, which is a little bit weird. And I
started out in C++ way back, jumped onto the Java bandwagon before 1.0
because I didn't have to worry about segfaults anymore.
That was cool.
And I could kind of be more productive.
So I did JVM for a very long time.
Developers tend to build things in the languages that they know,
not necessarily the best languages for a project.
So I was building this distributed SQL query engine in Java.
It was really interesting, learning a lot. And this was around the same time that Apache Spark was becoming popular, and our main client and investor at the time decided to switch to Apache Spark, which was a sensible choice for them. So the work we had been doing kind of became redundant, and basically that was kind of the end of that company. So I was left in a situation where
I'd been learning about building query engines, and it's really fascinating to me, and I wanted
to carry on learning. And at the same time, I'd started to learn Rust. So just for fun,
I started building some query engine tech in Rust in my spare time. Eventually, I got to a point
where I decided that I was going to try and build something like Apache Spark in Rust.
So that's kind of where that all started.
Amazing. And so basically your career has been like rebuilding like data processing engines
over and over and over again.
What was it about Rust that you said, ah, I'm going to go rebuild a massive system like Apache Spark?
Why was rebuilding Apache Spark in Rust interesting?
What do you think the advantages were of Rust for this job type?
So to me, Rust kind of represented the best of both worlds of C++ and Java.
The performance is very similar to C++.
And then, like Java, you don't have to worry about memory safety,
but in Java that's thanks to the garbage collector, which has pros and cons.
And Rust seems like an ideal compromise between those.
So there's no GC, obviously.
But the compiler saves you from making terrible mistakes
that end up in segmentation faults.
And that seems just really interesting to me.
And it seemed ideal for the intense parts of data processing systems.
It's really good to have a native language.
And I just didn't want to go back to doing C++.
I kind of resisted that.
So I was kind of very stubborn and just waited a long time
until a better language was invented.
And when Rust came along, I thought, ah, this is my chance
to go and build something more efficient and more scalable.
And with distributed systems, it's kind of interesting.
Like the moment you have too much data to process
in a single node and you go distributed,
you have a ton of overhead.
So the more you can do in a single node,
the longer you can kind of put off that point where you have to go distributed,
or you can at least reduce the number of nodes you have, which cuts down on some of the overhead.
So I was excited about the memory efficiency of Rust compared to Java. And that's something
that I saw kind of bear out in some of my early experiments in this area.
So a lot of it sounds like the JVM at scale has huge amounts of overhead from the garbage collector and just running the JVM in general. With Rust, I can build a more memory-efficient version of an engine, have more control over it, and have the safety aspect of not dealing with segfault hell.
And plus a new language that was interesting and fun. On the problem domain: when you first started building, and today, actually, in general, how well is Rust suited to solving these data problem domains?
Often, if you were to ask someone
about building a data pipeline
or data scientists spend a lot of time
using Python tools,
and we have Kafka
as one of the best examples
of a thing in the JVM.
What is your answer to those questions?
Yeah, it's certainly challenging.
The main problem was just the lack of maturity
of the whole ecosystem around Rust.
There was no FlatBuffers implementation in Rust yet, and Protocol Buffers support wasn't quite there.
That was really a big issue
in some of the things I was trying to do initially,
like building a distributed system.
And it got to a certain point where I realized this project was just too
ambitious, too hard, and it's kind of hard to build a community around it.
So I kind of pared back the scope a bit, and instead of going straight for distributed, I focused on making it an in-process query engine, which, you know, got rid of a lot of requirements around things like serialization formats.
That's probably the major issue. There were some annoyances as well back then because Rust
was earlier on and the language kept changing.
So you'd get the next nightly release and your project wouldn't compile anymore.
That was fun. I think you are either a core contributor
or were at one point a core contributor of the Arrow implementation
for Rust. Can you explain to our listeners, what is Apache Arrow?
Why is it amazing?
Then can you talk a little about why Arrow helps building these data tools?
Yeah, absolutely.
So my initial excitement was around Rust the language, and that's all very cool.
But as I started sharing details of the project on Reddit, lots of smart people gave me feedback.
And one of the things that multiple people told me is that nobody should be building
row-based query engines in 2018.
You should be doing columnar processing.
I was aware of the concept, but I hadn't used it, and I didn't really get it at the time.
But I started doing some experiments with columnar.
I saw some big speedups right away.
And people told me to go check out Arrow, which I did.
But there wasn't a Rust implementation. I was just using Rust's Vec type to hold arrays of data.
And yeah, I mean, that was showing some benefits.
But Arrow definitely appealed to me.
So I started building a Rust implementation of Arrow.
And I have to really caveat that.
And the initial thing I built and donated was a very small subset of Arrow.
And what exists today is massively better, thanks to the huge community behind it. I just kind of got the ball rolling at least and created
like an MVP of a Rust implementation of Arrow. Going back to the question you asked I guess,
so what is Arrow? Arrow really started out as a memory specification for columnar data and the
benefit of having a specification for a data format is that you can kind of share data between different languages, even between different processes, without the cost of serialization.
When you're using things like Apache Spark, and then you want to kind of go to Python, you've got to take the data from Spark formats, rewrite it in some other formats, and that's like incredibly expensive and wasteful. So with Arrow, there's an opportunity now. It can be kind of retrofitted
to existing systems, but we're seeing like a whole new ecosystem built that's Arrow native.
And that gives you that kind of seamless interop between different languages. That's pretty
exciting. And that's just the memory formats. Then there are implementations of Arrow, which
provide basically compute kernels. So this is just code that you can run against your data.
So maybe you have two arrays of numerics.
You want to add them together just to take a trivial example.
Or maybe you want to do some kind of an aggregate.
There's a bunch of highly optimized kernels there in multiple languages now.
For a long time, it was primarily Java and C++.
But Rust now is another very active implementation.
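To make the compute kernel idea concrete, here is a minimal sketch using the Rust arrow crate to add two numeric arrays element-wise. Treat it as illustrative: the exact module path for the arithmetic kernels has moved between arrow-rs releases, so the import shown here matches older versions and may need adjusting.

```rust
use arrow::array::Int32Array;
// NOTE: the kernel location has shifted across arrow-rs releases
// (e.g. compute::kernels::arithmetic vs compute::kernels::numeric),
// so this import is illustrative rather than version-exact.
use arrow::compute::kernels::arithmetic::add;

fn main() {
    // Two columns of numerics, stored contiguously in Arrow's columnar layout.
    let a = Int32Array::from(vec![1, 2, 3, 4]);
    let b = Int32Array::from(vec![10, 20, 30, 40]);

    // A vectorized compute kernel: it operates on whole arrays at once
    // instead of looping row by row, which is where the columnar speedups come from.
    let sum = add(&a, &b).expect("arrays have the same length");

    assert_eq!(sum.value(2), 33);
}
```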
Rust is a super interesting ecosystem.
One of the core questions I have is
today, there's a lot of people building
data tools in Rust. What do you attribute that rise to?
You're one of the very first people. I mean, you started
for example, the Arrow project.
I remember looking at the Rust ecosystem, seeing DataFusion
three years ago and being like,
this is really interesting. So what is it
about Rust today and why
are we seeing so many more people building on Rust?
And if you have some examples of popular projects
or pieces of software of people that have built stuff in Rust,
that'd be great too.
Yeah, sure.
I mean, I guess when I started, Rust was maybe controversial.
It certainly wasn't widely popular.
But now, so many of the large companies are backing Rust.
I think Microsoft just came out with an announcement.
They're investing heavily.
Amazon are building a bunch of infrastructure in Rust.
So I think Rust has just matured to the point where it has that kind of universal acceptance
that this is a real thing that companies can afford to kind of take a bet on,
which maybe was seen as risky early on.
And yeah, there are lots of projects being built in Rust.
Polars is probably one of the most well-known ones, which is essentially like a Rust-native Pandas replacement. It uses its own
implementation of Arrow, not the official
Rust Arrow library. That may be
changing over time, I'm not sure.
InfluxData, with InfluxDB, is maybe the most well-known one as well. They decided to completely
rebuild their core database engine in Rust.
And rather than start from scratch and have to build a whole new engine,
which is a lot of work, they decided to base it on Data Fusion.
That gave them a huge head start, and they've been really great
at contributing back and helping make Data Fusion more mature,
which has been really cool.
You have a super fascinating journey, because I'm reading the blog post from 2018, Rust is for big data, right? I think you even attributed to that post, in a lot of the recent posts, that this is where it started. Right. And it all started with, I don't want the JVM. Like basically, I want to explore how to use this newer, much more efficient language. And that journey led down to, you know, Data Fusion, Arrow, Ballista, and now you're back working on Spark again, using RAPIDS. So I want to maybe talk about this sort of Ballista, Data Fusion era. You started it as a side project wanting to use Rust to build for big data, and you went down the path of wanting to build a new Spark. Yeah.
What is the biggest learning for you trying to rebuild Spark from scratch using Rust?
What was like the biggest challenge and what are surprising things you learned in this journey?
It was really interesting for sure.
And like, it's a lot of work to build a distributed query engine.
It wasn't like I was trying to make something that has the same maturity as Spark,
which has been around for like a decade
with more than a thousand contributors.
You know, mine was definitely like a toy project.
It took me a long time to really build anything.
I mean, I was working on this in my spare time,
took a break from the project when I got frustrated,
then I had to come back a few months later
and have another go.
So it took a very long time.
Momentum was kind of slow,
but after like a few years working on this, I got to the point where I could run the TPC-H queries.
That was a really big milestone for me. But I think one thing I learned is that Spark is a really mature product, and it was really challenging for me to even match the performance of Spark for a long time, even though I was using things like Arrow and Rust. A lot of the code in Spark has been heavily optimized over the years.
It doesn't really matter which language you're using.
You always have to go through that work
of finding the best data structures and algorithms
and fine-tuning things to get really great results.
We just assume Rust will just give you efficiencies.
But like you mentioned, so much of the performance of Spark is beyond just the pure language. Of course, Spark has been developed for so long now that they've optimized it with more efficient Java and JVM techniques. What is the hardest part when it comes to even getting close to that performance?
Is it just the algorithm side? Like, how do I actually get better ways of doing, you know, SQL and some algorithm-type stuff?
Yeah, I think there's two main areas. One is classic query optimizations to your query plan. So imagine doing a join between two tables, and you have some filter conditions to filter some rows out. You know, it's better to filter the rows out before you do the join, otherwise you're producing all this data and then filtering it down. So all query engines have basic optimization rules, like pushing filters through joins, just to take a simple example. But then there are more advanced optimizations that use statistics.
So there are different join algorithms, but a very common one is a hash join, where essentially you load one side of the join into memory in a hash table, then you stream the other side and do the lookups in the hash table. But you really want to put the smaller side of the join in the hash table,
not the larger side, because of the memory constraints.
So you need rules to figure out, like, okay,
which side of this join is going to be larger.
And that's not always simple.
You can't just look at the number of rows because there are filter conditions.
So you have to predict how selective those filters are.
Is this filter going to reduce the table by 90% or not at all?
So that's the side of it where there's been a ton of research
into that literally over decades.
So having all of those optimizations to produce a good plan
is one side of it.
Then there's the actual implementation code for those algorithms,
just trying to make those as efficient as possible.
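As a rough illustration of the hash join Andy describes (a toy sketch, not any particular engine's implementation), here is plain Rust that builds a hash table from the smaller input and streams the larger one through it. Real engines work on columnar batches, handle nulls and spilling, and use statistics and selectivity estimates to choose the build side.

```rust
use std::collections::HashMap;

/// Toy hash join on integer keys: build a hash table from the smaller input,
/// then stream the larger input and probe the table.
fn hash_join(left: &[(i64, String)], right: &[(i64, String)]) -> Vec<(String, String)> {
    // Pick the smaller side as the build side to keep the in-memory hash table
    // small -- the decision a cost-based optimizer makes using row counts and
    // estimated filter selectivity.
    let (build, probe, build_is_left) = if left.len() <= right.len() {
        (left, right, true)
    } else {
        (right, left, false)
    };

    // Build phase: key -> all payloads with that key.
    let mut table: HashMap<i64, Vec<&String>> = HashMap::new();
    for (key, payload) in build {
        table.entry(*key).or_default().push(payload);
    }

    // Probe phase: stream the larger side and look up matches.
    let mut out = Vec::new();
    for (key, probe_payload) in probe {
        if let Some(matches) = table.get(key) {
            for build_payload in matches {
                if build_is_left {
                    out.push(((*build_payload).clone(), probe_payload.clone()));
                } else {
                    out.push((probe_payload.clone(), (*build_payload).clone()));
                }
            }
        }
    }
    out
}
```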
I guess I'm very curious, what is Data Fusion now?
It seems like it's now sort of like a mainstream project per se,
but it's quite unknown what the adoption is, what the actual state of it is.
Can you maybe talk about what Data Fusion is?
Absolutely.
So Data Fusion and Ballista are both part of the Apache Arrow project now. I'm still involved a little bit, but not actively coding right now. So what is Data Fusion today? Data Fusion is a great foundation for building new query engines or data systems. Data Fusion is very composable, it's very extensible. The design of it is totally separate modules, or crates: you have a SQL parser, you have a query planner, a query optimizer, there's the execution engine, and you've got your extension points for plugging in user-defined code, user-defined functions. So if you wanted to build a new query engine today, maybe with some proprietary format, you can be up and running in a few days, because it's like a toolkit.
You take the bits you need,
you plug in the code that's special
to your application or file formats,
and you've got a great start.
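For a flavor of what that toolkit looks like when embedded, here is a minimal, hedged sketch of using Data Fusion as a library. The table name and file are made up, and the exact API surface (SessionContext, CsvReadOptions, and so on) has shifted across DataFusion releases; a custom or proprietary format would instead plug in through DataFusion's TableProvider extension point.

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // A SessionContext bundles the SQL parser, planner, optimizer and
    // execution engine; each piece can also be used or replaced separately.
    let ctx = SessionContext::new();

    // Register an existing file as a table (file and table names are hypothetical).
    ctx.register_csv("orders", "orders.csv", CsvReadOptions::new()).await?;

    // Plan, optimize and execute a SQL query; results come back as Arrow record batches.
    let df = ctx
        .sql(
            "SELECT customer_id, SUM(amount) AS total \
             FROM orders GROUP BY customer_id ORDER BY total DESC LIMIT 10",
        )
        .await?;
    df.show().await?;

    Ok(())
}
```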
There's another project that I drew inspiration from,
which is Apache Calcite.
Calcite was used as the foundation
for quite a few kind of big data query engines
over the years.
But Calcite is JVM-based, not Rust.
Data Fusion is a very active project.
Every time I go and look, there's just so much activity.
It's kind of hard to keep up with.
And there are many, many companies now building new platforms
on top of Data Fusion.
And the best-known being InfluxDB.
Yeah.
There's a company called Synnada doing some interesting things
on there as well.
If you go to the Data Fusion repo in Apache Arrow, there is a list of projects building on top of Data Fusion.
And yeah, it's a pretty long list these days.
It feels like the project's got that kind of critical mass of momentum where it's going to be around for a while and people are depending on it, investing in it.
And if you were to start a new project or a new company today, you're going to go solve some data problem, where would you land? Is your
perspective that Rust is ready for primetime to go build? If you were going
to build a Kafka replacement, would you start with Rust? If you were going to go build a better
Apache Spark, would you start with Rust? Is that where you think the future is?
Yeah, great question. So in terms of languages, yeah, I'm
still a big fan of Rust. I think Rust is a great solution for building data systems. However, within any system, there are portions where performance is really critical, like actually processing these joins, and then there are things like a scheduler that's keeping track of what's happening and the different processes running on the network, where performance matters less.
And maybe it was overkill just to try and write everything in Rust.
There was one project a few years ago, it's no longer active, called Blaze, which basically provided a Spark accelerator using Data Fusion.
It used Spark for query planning and orchestration, but then delegated down to Rust code to actually do the query execution.
And they were seeing some pretty reasonable results, like 2x on queries.
So with hindsight, that would have been maybe a smarter strategy for me to try and make all this,
like get momentum sooner.
If people could just keep running Spark, but slowly replace bits of it with Rust,
that might have been a better path to adoption rather than just throw this away and start again
with this product that has 1% of the features.
Out of curiosity, and I know this might be sensitive, but I imagine you could probably work on Data Fusion full time, but now you're working on the RAPIDS Accelerator for Spark.
Was there a reason you wanted to work on that?
Oh yeah, no, it's a really easy question to answer.
NVIDIA kind of noticed what I was doing with Data Fusion,
saw that I had skills around Arrow and query engines,
and they called me and said, would you like to come work for us?
And I said, yes, and that was basically it.
So NVIDIA is a great company to work for.
And so the work I'm doing there, we're accelerating Apache Spark using the cuDF GPU DataFrame library. So essentially delegating the actual physical execution down to the GPU, which can produce some pretty great results.
It's interesting. Like Data Fusion, I was building query engines as a hobby. There have been times when I've been doing aspects of it as my job, but then it stopped being my job and it became my hobby for a while, and now it's my job again. And now that I'm doing it as my job, I'm less inclined to spend my weekends continuing to do similar types of work on Data Fusion, if that makes sense.
Yeah, that's really interesting.
Like GPU with Spark is not new,
but it never really gained wide adoption per se.
It's always some interesting prototype.
And let me ask, like, why would somebody want to use RAPIDS, and to accelerate what kind of workloads on Spark? And where do you find the trade-offs? Because obviously GPUs are not that cheap or easy to come by, so the running cost will be much higher. Where do you see people adopting RAPIDS for Spark acceleration? And how do you usually see people actually use it in practice?
Sure.
And I think that on the NVIDIA website,
we have some case studies of people using this,
and that's probably a great place to go
to look at kind of numbers and cost savings.
But the one nice thing about the solution
is that there are no code changes required.
You literally drop in a new jar and some configs,
and now your SQL ETL jobs are accelerated on the GPU.
If that works for you in your use case and you see great results,
then it's kind of a no-brainer.
It can vary depending on your exact use case
and what functionality you're using as to how good the performance improvements are.
If companies have GPUs, they're running Spark,
they can drop this in and make it go faster.
It's just like an easy thing to do for cost savings
or getting the jobs to run faster.
Are there certain types of workloads
where it makes sense to do this,
where it doesn't make sense for others?
Like what are some circumstances
where like using a GPU with Spark makes a lot of sense?
What are some circumstances where it's like,
you shouldn't bother?
It's not performant enough to justify the additional cost of the GPU in the cloud or, you know, of buying a GPU?
I'm not sure I can give a great answer to that.
I mean, I was kind of surprised when I first got involved in the project
because I didn't realize that GPUs would be good at the type of operations
that happen within like a SQL engine.
But it just turns out that GPUs have just like got so many cores,
even if you're doing something that's not as efficient as it would be on the CPU,
just the fact that you have so many cores means it can go faster anyway.
And, for sure, there are certain operations that, you know, are really well
optimized for GPU, some others maybe not so much yet, but it's always an ongoing effort
to keep optimizing kind of more and more of the different operations and expressions that
people typically use in Spark.
Cool.
So we're going to jump into what we call the spicy future section.
What we usually talk about is the future. Where do you see the whole data infrastructure space going? I think particularly given your involvement writing sort of the new data infrastructure in Rust, and now even getting into accelerators. Curious what you see the next five years will look like.
Do you see the ecosystem change?
Do you see more data infra work built on new languages?
And maybe just give us your hot take.
So I'm not sure these are really like my hot takes.
Other people have formed these thoughts as well.
It's not just me.
There's a definite trend towards composable data systems
and composable query engines.
So it's no longer the case where you necessarily have to have
like one thing end-to-end.
There's some interesting projects.
Data Fusion is one that's very extensible.
Meta are building Velox, which is a C++ query engine.
It doesn't have a front-end.
It doesn't have a SQL parser or query optimizations.
It's just the execution.
And then there are projects like Substrait, which is kind of an intermediate representation of query plans. So theoretically, you could maybe take Apache Calcite or Data Fusion to do your SQL parsing and query planning, go through Substrait into Velox for execution. So you can start mixing and matching different
things. Voltron Data, they're producing a lot of good content in this area.
They've been investing in Ibis, which is a Python front end for queries that can plug into different backends. That's one area that looks pretty interesting.
I'm keen to hear more about Wasm in this space. I haven't really heard a lot about that,
but it seems a bit wasteful that Velox and Data Fusion, Databricks with their Photon engine, lots of people are building these query engines. It'd be great if we could share more work across
them. But we've got Rust and C++ and Java for these different things. I don't know,
it'd be interesting to see if there's some WASM or if users could write their user defined functions
in WASM so they can run on any of these platforms, that'd be kind of an interesting area. And I also think one of the big challenges, like moving data around. So like these
days, compute and storage is very separate. And it'd be great if we could push filters, predicates
down from query engines into storage. Again, maybe with Wasm and Substrait, there's like a universal way of being more specific
with storage about what data you actually want to retrieve.
That's one whole area that is kind of interesting to me.
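As a sketch of the Wasm UDF idea (an assumption about how it could look, not an API that any of these engines ships as discussed here), a Rust-based engine could host user functions compiled to WebAssembly with a runtime like wasmtime. The module file name and exported function below are hypothetical.

```rust
use wasmtime::{Engine, Instance, Module, Store};

fn main() -> anyhow::Result<()> {
    // Compiler/runtime for WebAssembly modules.
    let engine = Engine::default();

    // A user-defined function compiled to Wasm from any source language
    // (Rust, C, AssemblyScript, ...). The file name is made up for illustration.
    let module = Module::from_file(&engine, "udf.wasm")?;

    let mut store = Store::new(&engine, ());
    let instance = Instance::new(&mut store, &module, &[])?;

    // Look up a typed export: here, an i64 -> i64 scalar function the host
    // engine could call per value (or, more realistically, per batch).
    let udf = instance.get_typed_func::<i64, i64>(&mut store, "add_one")?;

    // The host stays the same no matter what language the UDF was written in --
    // that's the portability argument.
    let result = udf.call(&mut store, 41)?;
    println!("udf(41) = {result}");
    Ok(())
}
```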
It's super interesting. Wasm is a shiny
opportunity for interoperability. It's interesting
how it has so much opportunity to reinvent so many
parts of the stack, specifically in data land, like the ability
to move the UDF around
between systems,
or even just to move a UDF
that can be written in any language
and compile it to Wasm
and then send to the system.
I think, like, from my perspective,
that would really democratize access
to these systems.
Like, one of the biggest challenges
that we have to get any engineer
to work with some of these tools
is with Spark as a great example.
It's like, well, you've got the JVM or you've got PySpark.
And PySpark is not Python, right?
And the JVM is not Python.
It's a very different system.
So if you're a JavaScript developer, a TypeScript developer,
or somebody that's building websites,
you do not have anywhere near the skill set to even pick some of these things up.
Now, you could argue and say that those types of people will never have big data problems.
But there's lots of organizations that have big data problems and their engineering skills
that they're equipped with is, I'm a TypeScript engineer, or I write some Go, and now I need
to go solve a data problem.
When you pick up those tools, they don't look anything like what you know. And if you're a VP of eng or a CTO, your choices are basically, oh, I've got to go hire data engineers, people who are used to working on these large data ecosystems, who are used to these things, these very specific skill sets.
So I think Wasm is really interesting because it would enable interoperability. Assuming you had, like, a serverless platform, and I could write the functions in, you know, my TypeScript, and then it compiled to Wasm and I could push it, think of how that changes the way we structure engineering organizations, and also think about how that opens up more people to move up and down the stack in a way they couldn't before, right? Because I think everyone can understand the complexities of, like, how much throughput I'm trying to put through this thing, right? That thought process moves between the different systems, the different modalities, but the specifics of the JVM or PySpark don't move as easily.
So we've spent a lot of time talking about how important GPUs are to enable LLMs, how LLMs are going to change the world. Well, what about GPUs in the data ecosystem, right? What about databases with GPUs? Where do you see the future of the GPU in the context of
Yeah, that's a great question.
Obviously, I work for NVIDIA, so I'm very biased on this, and I have to be careful what I say.
But I mean, seeing the work that the Rapids team does and the Spark Rapids team,
there's some incredible engineering going into building data technology.
So I do believe that GPUs will continue to play an important role in this space.
Some of the speedups on GPU, I mean, again, it really depends on the use case.
And the NVIDIA blog website has all the information, but there's some incredible speed ups on particular operations on the GPU.
So I think for the larger companies that have the true big data and have the budget for GPUs, I think it's a very compelling solution.
And then there's the other end, like the people using Polars and things like that.
GPUs can still play a really good role there too.
I mean, RAPIDS has cuDF,
which can be used pretty much
as a drop-in replacement for Pandas.
So it's in a similar space as Polars.
So yeah, I highly recommend people try that out as well
if they haven't used that yet.
So I want to go back and touch
on your composable data systems.
It's really interesting, right?
Because like I said, we see more projects
that are able to be more composable or extensible.
Looking at all the traditional databases,
there's always been something somewhat composable. There's plugins, there's extensions you can do.
That's the extent of what you could do for the most part.
Now you look at the more traditional, even like data warehouses,
they allow you to do like Snowpark, right?
You can write Python code and stuff like that.
But we're talking much deeper now, right?
Now building any data systems or database query engine type things,
I can grab a couple of projects and be able to create a new data system faster.
One advantage I can see having a composable data system is, hey, me building a new database,
I have some particular requirements.
I don't have to go from scratch, right?
We have storage engines.
We have that queries parser.
We have this kind of thing.
I think Facebook has another project too, right?
There's a bunch of these things to go faster. But there's also another
angle of what composable data systems are. From a user perspective, I should be using something
that's more composable because maybe there's some motivation for me to not be locked into
one particular implementation. Maybe I do want it extended. I'm curious about your thoughts,
because I'm trying to figure out why do we want composable data
systems in the first place? Is it just
faster implementation for new ones?
Is there any benefits for the user side?
Or do you see something else? Why
is the world moving towards that?
And what's pushing this this way?
Yeah, so
I think there's maybe two use cases that come
to mind. One is what you kind of touched on already,
which is people building new systems from scratch.
I want to get that headstart,
not having to build all the components.
But another area is where you have
kind of integration requirements
between different languages or tech stacks.
So maybe you already have an investment
in a bunch of Java code
that's integrated with Apache Calcite for query planning
and you're happy with that,
but you want the faster execution. So if you can bridge your JVM query planning with like Velox or Data
Fusion for execution, that could be a quick win if you're moving away from some other kind of JVM
based execution engine. That also ties into this whole concept of Spark accelerators. And I mentioned
the Blaze project. So if you're doing something like that, if you have Spark or some other query engine,
and you're just trying to accelerate certain components of it,
that's where the composability can really come in,
just to swap out certain pieces.
Writing databases is always hard, right?
There's not that many developers in this world
that can really competently build a production database.
People can learn, but it takes years, right?
I do wonder, making databases more composable
and easier to build
doesn't mean everybody would just go ahead
and just do them.
Do you believe there's going to be more people
building their own data systems
and capable of leveraging this?
So like, hey, maybe the enterprise is like,
instead of just buying products,
we should actually come in and use these and piece together our Legos here.
Do you see that happening way more often? And is lowering the barrier to entry of building a
system going to enable this too? I think people love building databases and data systems. And
yeah, having these composable systems will maybe encourage more people to do that. Whether that's
a good idea or not, I'm not sure. I don't think I appreciated when I started on this just how much work it is to have a production quality system.
Like, you look at Apache Spark, there are so many different operators
and expressions to support. And you get into details of any of these expressions and things
like type coercion and casting between types. There's like so much technical detail. It's
a lot to build from scratch. So having these composable systems hopefully takes care of a lot of that stuff for you so you can
just focus on building the parts that are unique to your business.
So yeah, I think there will be more of that. In one role I had
a few roles ago, we had a situation where
we had a Spark cluster. For certain queries, it was fine. Other queries, it didn't make so much sense.
And we built a query gateway.
So as queries came in, we could inspect them,
see what type of queries they were,
and then make a decision like,
do we just delegate to Spark for this, because we really need a cluster since it's kind of computationally expensive? Or is this just a trivial query that we can just process ourselves in memory?
And that was kind of interesting.
So having these kind of building blocks will allow people to build those kind of infrastructure more easily.
You know, things like that can add a lot of value to an organization.
And it's not building a complete database.
It's just more query intelligence and routing.
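As a rough sketch of that kind of query gateway (not the system Andy built), one way to do the routing in Rust is to plan the SQL locally with DataFusion and inspect the logical plan. The "heavy query" heuristic here is made up purely for illustration, and method names like into_optimized_plan vary across DataFusion versions.

```rust
use datafusion::logical_expr::LogicalPlan;
use datafusion::prelude::*;

/// Decide whether to run a query in-process or hand it off to a cluster engine,
/// based on what the optimized logical plan contains.
async fn route(ctx: &SessionContext, sql: &str) -> datafusion::error::Result<&'static str> {
    // Parse, plan and optimize the query without executing it
    // (assumes the referenced tables are already registered in ctx).
    let plan = ctx.sql(sql).await?.into_optimized_plan()?;

    // Illustrative heuristic: treat joins and aggregations as "heavy".
    fn is_heavy(plan: &LogicalPlan) -> bool {
        matches!(plan, LogicalPlan::Join(_) | LogicalPlan::Aggregate(_))
            || plan.inputs().iter().any(|child| is_heavy(child))
    }

    Ok(if is_heavy(&plan) {
        "delegate to the Spark cluster"
    } else {
        "execute in-process"
    })
}
```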
I'm curious to get your take on interplay of Rust and LLMs and data systems and how these all interplay.
One good example of a database system that everyone's talking about now that no one talked about two years ago is the Vector Store.
I'm kind of curious to get your spicy take on where you think the future of Vector Stores are going and how that interplays if you have an opinion on it.
But more importantly, I'm really interested to get your opinion and perspective on the future of LLMs and data infrastructure and where they fit or maybe don't fit.
So I don't know too much about vector stores, vector databases.
The one area where I do see a natural fit is more in the front end.
There's already projects out there doing this, but like who wants to write complex SQL when you can just ask a question?
I see that being very powerful. In terms of more backend infrastructure things, I guess I don't have enough knowledge around LLMs for that.
Yeah, we're all figuring it out.
So no worries on that.
Going back to one of the questions, because you actually touched on a really interesting point, right?
I think composable data systems allow more possible layers to exist in production stacks.
You talk about the proxying layer, and we actually see a lot of different proxies being created these days.
But most of the people building them are large enterprises or people that have a dedicated team. They're the only ones that can do them. You need specific knowledge of how things work.
I think we alluded to there being more people building this. What kind of patterns do you see people building, what new layers?
Is it proxying just for cost reasons?
Do you see other things or other layers
people will be building for production
that this allows them to do?
Like, I'm just curious, what have you been observing?
Like, oh, wow, that's not just a database anymore.
We see people doing more of these kind of things.
Sure.
And honestly, this is kind of a weird, frustrating thing for me: I don't really get to see much of that.
There are lots of people using Data Fusion and contributing,
but I don't really know what they're doing with it.
And I'm sure some of it is like stealth startups.
But yeah, there are probably people out there
building these cool kind of gateway, router-type solutions.
I don't really have insights into what people are doing with that these days,
because I'm very much focused on just kind of building these foundational components.
One of the, certainly the biggest frustrations I've had in constructing data infrastructure and thinking about how you secure it is, like, mesh now. There's been a lot of talk in the data world about the concept of data mesh, the sort of inversion of control, specifically in online data systems: less about the warehouse, more about, I've got, you know, a database that's producing a CDC stream into a Kafka, and that's being produced into some other streaming system, then some other streaming system, then back into a database to then be served to some customer someplace, right? Like the actual online loop,
if you will. And my frustration has been like building stuff in that space and making it
composable is almost impossible. And there are companies, some that we've had on the pod,
trying to build more composable systems that make this whole layer more programmable.
And it seems to me like, you know, we're at this inflection point in data where, with libraries like what you built with DataFusion, with Wasm coming online, we'll now have actual programmable data stacks in a sense that we haven't had before. We'll be able to operate a layer above the infrastructure. Less time connecting a little piece of JavaScript to talk to a Kafka pipeline, then talk to Flink, and more of a higher-level orchestration layer that we then submit to this layer. And I think things like Data Fusion
enable that. I came to this conclusion
three or four years ago when we were building Cape Privacy
and what we were trying to build is
a data security and privacy layer
so that compliance folks,
security folks can write some data security
rules about what data can pass to
what places. And then you go and look at the tools and you realize that there's no insertion point,
right? Like there's no good insertion point. There's no good place to build this type of
infrastructure. You have to go rely on the engineers to like put the control points into
the data flow. And then you have to audit that the control points are in the data flow. And it's just
like a terrible experience. So I guess my hot and spicy take, and I started making a point and now I'm on a diatribe, is it seems to me like the best place for the engineering organization to get to is this sort of beautiful world of plug-and-play data streams. I can listen to anything that's there, and then we can drop in pieces of functionality and control points without impacting the customer. The customer in this case being the engineer actually trying to build a piece of functionality that works.
I'm curious from your experience with Spark
and with the stuff you've been doing with Rapids,
how do you feel like the pluggability is in Spark today, right?
Like today versus where it could be,
what do you see that difference being?
Yeah, Spark actually has a pretty good story
in terms of like its plugin architecture.
So yeah, you can certainly add your own plugins
to do things like replacing the physical plan
or providing your own shuffle manager.
There are multiple companies
that are doing some form of acceleration with Spark.
So yeah, over the years,
all of those extension points have been kind of baked in.
So I would say it does a really good job in that area.
Cool.
We've touched on a lot of these kinds of questions. Is there something that you're personally most excited about? Like something that people will be generally excited about? Like maybe it's about RAPIDS, maybe about Data Fusion, that you're looking forward to?
So I'm not working on anything really outside of the day job anymore. So there's no kind of surprise things coming up
for me. I think what I'm excited about really is just seeing the momentum of data fusion keep building.
Performance is pretty good in data fusion already.
In some cases, it's really good.
In other cases, not so much.
And that's partly because people are building their own platforms on it.
They aren't necessarily interested in optimizing all the things in data fusion.
But I'm seeing more of a drive to get better results on some of the popular public benchmarks, because that's kind of important. People will look at these benchmark results and get excited about the fastest thing. So, you know, I think it is important that the Data Fusion community does put some more time into that, so we're kind of up there. That's pretty exciting, to continue to grow in maturity and get faster
and just have a bigger community behind it.
Awesome.
Well, Andy, thanks so much for coming on our podcast.
What's the best way for people to follow you
or look at the work you do?
What's the best way to find that?
So, x slash Twitter, I'm andygrove underscore io.
You can find me on GitHub as andygrove and my website
andygrove.io, I guess, are the best places.
Awesome. Thank you
so much, Andy. It was such a pleasure. Well, thanks so much
for having me on. This was a lot of fun.