The Data Stack Show - 175: The Parts, Pieces, and Future of Composable Data Systems, Featuring Wes McKinney, Pedro Pedreira, Chris Riccomini, and Ryan Blue
Episode Date: January 31, 2024

Highlights from this week's conversation include:
- Introduction of the panel (0:05)
- Defining composable data stack (5:22)
- Components of a composable data stack (7:49)
- Challenges and incentives for composable components (10:37)
- Specialization and modularity in data workloads (13:05)
- Organic evolution of composable systems (17:50)
- Efficiency and common layers in data management systems (22:09)
- The IR and Data Computation (23:00)
- Components of the Storage Layer (26:16)
- Decoupling Language and Execution (29:42)
- Apache Calcite and Modular Frontend (36:46)
- Data Types and Coercion (39:27)
- Describing Data Sets and Schema (42:00)
- Open Standards and Frontiers (46:22)
- Challenges of standardizing APIs (48:15)
- Trade-offs in building composable systems (54:04)
- Evolution of data system composability (56:32)
- Exciting new projects in data systems (1:01:57)
- Final thoughts and takeaways (1:17:25)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack, visit rudderstack.com.
Transcript
Discussion (0)
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by Rudderstack, the CDP for developers.
You can learn more at rudderstack.com.
Welcome to the Data Stack Show.
We have a truly incredible panel here to discuss the topic of composable data stacks.
So many topics to cover today.
So let's get right into introductions.
And I'm just going to do it in the order that it shows up on my screen.
Chris, do you want to start out by giving us a quick background and
intro? Sure. Yeah. My name is Chris Riccomini. I have spent the last 20 years of my career at
two companies, mostly LinkedIn, where I spent a lot of time on streaming and stream processing
and was the author of Apache Samza, which was an early stream processing system,
kind of similar to Flink. And most recently at a company called WePay, which was acquired by JPMorgan Chase, where I ran our payments infrastructure, data infrastructure
and data engineering teams for a stretch of time. I've also written a book for new software
engineers, kind of a handbook, because I was tired of saying the same thing in one-on-ones
over and over again. I've been involved in open source. I was a mentor for the Airflow project
and helped guide it through Incubator
on Apache. I also do a little bit of investing. And so that's where I spend a chunk of my time
now. And I, yeah, write a little newsletter on all things systems infrastructure. That's me in a
nutshell. Very cool. Wes, you're up. Yeah, I'm Wes McKinney. I'm a serial open source software developer.
I've created or co-created a number of popular open source libraries: pandas and Ibis for
Python, and Apache Arrow, kind of an in-memory data infrastructure layer.
It's very relevant to the topic of today's show. I've been involved in a bunch of companies,
most recently a co-founder of Voltron Data,
building accelerated computing software
for the Composable Data Stack
and Posit, a data science platform company
for R and Python.
I'm an author of the book,
Python for Data Analysis.
So popular reference book for Python data science stack.
And I also do a fair bit
of angel investing
in and around next generation
data infrastructure startups.
Very cool.
Ryan, you're next on my screen.
Oh, thanks.
I'm Ryan Blue. I'm the co-creator of Apache Iceberg, which is used to architect big data systems, especially in object stores. I'm also a co-founder of Tabular, where we sell an Iceberg-based architecture that has security and data management services baked in.
I left Netflix to found Tabular.
At Netflix, we were on the open source big data team,
so I got to work on Parquet and Iceberg
and replaced the read and write paths
in Spark and various other things.
Very cool.
And Pedro.
All right.
Hello, everyone.
I'm happy to be here once again.
I'm Pedro Pedreira,
software engineer. I've been at Meta for a little bit over 10 years, always involved in projects around data infrastructure, a little bit closer to analytic engines, log processing engines. So
it's been most of my career just kind of developing databases and data processing
engines. And I think in about the last five years, I started getting a little closer to
this idea of composability and how we can make the development of those engines more efficient.
So I started working on a variety of projects related to the space. One of the projects that
we eventually open sourced, and that got a little more visibility in the industry, was Velox,
which was recently open sourced, around this idea of making execution more composable for data management systems.
But inside Meta, I work with a variety of teams across most of the large warehouse compute engines, like Presto and Spark.
So kind of this data processing area for analytics, developing efficient query engines, that's sort of the thing I do.
Very cool. All right. Well, I just want to dive right into it. And Wes, I'm going to point the
first question at you and then have the rest of the panel, you know, sort of weigh in with,
you know, agreements or disagreements or comments, but let's try to define what a composable data
stack is. The term composability has been thrown around a lot.
There are even companies sort of co-opting the term for marketing purposes, which always
adds a lot of confusion out there in the marketplace.
But can you give us a definition of what a composable data stack means to you? Yeah, so it's a project or collection of projects that serves to address a
data processing need, but where the component systems are built using common open source
standards that lend themselves to efficient interoperability, either efficient or simple
interoperability. So the different pieces that you assemble to create the ultimate solution for your data platform, you know, can be achieved
without the developer having to write nearly so much, you know, glue or custom code to fit the
pieces together. So those points of contact between the different components of the stack
are based on well-defined open
standards that are all kind of agreed upon and shared amongst the different component systems.
Any additions or disagreements from the crew?
Yeah, I think maybe just adding to what Wes said, I see that as maybe two different aspects.
One is, I think, this idea of having open APIs and standards
upon which different components can communicate,
but there's also the idea of using common components, right?
So I think at least how we see this internally,
this idea of how can we factor out common components
between engines as libraries,
and how can we define common APIs
and the kind of common standards between them to communicate, right?
So if you look at the industry or the projects in this area, there are usually those two things.
One is just defining the API and the standard, and the other one is actually implementing
things that do something with those standards.
I think there's this idea of just providing those components that communicate via common
APIs and making sure that they're somewhat interchangeable.
I'd like to add a question here, because we are talking about APIs and libraries and all these things, but what are these projects?
Let's say, if we had to define a minimum
set of APIs that we can compose
data systems with today, what would that be?
And I'll start with you, Wes, because I think a lot of that stuff started with Arrow, defining,
let's say, the vocabulary that we use today.
So what's the minimum set of these tools that we need in order to do that?
Yeah, I mean, I think the easiest way to think about it and the way that I often explain to people
is to think about all of the different layers
of a traditional database system.
So historically, you would have database companies like Oracle that would create these vertically
integrated systems that are responsible for implementing every layer of the stack, data
storage, metadata management, physical query execution, query planning, query optimization.
And then all the way down at the user level, you have
the user API, which would generally be SQL. And so if you think about this
vertically integrated stack of technologies and you start to think about the logical components
of the system, you can start to ask, okay, are there pieces of the stack which
could be peeled off and turned into reusable components? If you want to have a reusable component, let's just say storage, you start thinking about designing open source file formats or open source systems for it. But designing composable components, making the pieces of a vertically integrated system separable and reusable, is a lot more difficult and a lot more engineering.
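To make that concrete, here is a minimal sketch of what a "peeled off" storage component looks like in practice. It assumes the pyarrow package; the file name is hypothetical.

```python
# Minimal sketch: the storage layer as reusable open components (pyarrow assumed).
import pyarrow as pa
import pyarrow.parquet as pq

# Write a table in an open file format (Parquet) with an explicit schema.
table = pa.table({
    "user_id": pa.array([1, 2, 3], type=pa.int64()),
    "amount": pa.array([9.99, 5.00, 12.50], type=pa.float64()),
})
pq.write_table(table, "events.parquet")  # hypothetical file name

# Read it back with any engine that speaks the same open standard; here the
# "engine" is pyarrow itself, but DuckDB, Spark, Presto, etc. can consume the
# same file without custom glue code.
roundtrip = pq.read_table("events.parquet")
assert roundtrip.schema.equals(table.schema)
```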
Ryan, what's your take on that?
I think it's pretty funny.
You're right. It is a lot harder.
But I think the Hadoop world taught us that you don't actually have to do that work.
Hive tables, no one ever did that work.
It was just unsafe.
And sometimes you clobbered the result that someone else was getting. And like, you know, we lived with unsafe transactions
in the storage layer for a really long time,
but it was still super useful.
Because a lot of the time,
you only had one person changing a table at a time
and you were reading historical data
and it just sort of worked.
So I think we actually backed into,
at the storage layer, at least,
making it more reliable
and having the behavior and guarantees
that we wanted to have.
Pedro, please.
Yeah, I think I would even go further
to what Wes was saying.
I'll say that most companies,
they don't even have the right incentives
to invest in composable components, right?
Because I think, like we said, developing components is a lot more expensive, right?
There's a lot more thought on what are the APIs, like it's a separate project.
You know, if you need to open source this, there's a cost of maintaining this open source
community.
So if you're developing a single engine, it's a lot more efficient for you to just write
this as a small monolith because you have full control of that.
It's a lot easier to evolve. It's a lot easier to control the direction the different features
and the architecture are going in. But actually thinking through what the right APIs are,
working as a community, identifying how this should work on other entities, it's a lot more
expensive. So I think that's why historically most of the companies, they just, you know,
they focus on developing the system,
focus on the particular workload
they have in mind.
I think where this breaks is
if you're a company
who actually needs to maintain
too many of those systems,
then, you know,
economically it starts to make sense.
Okay, let's actually see
what we can share between those things.
And I think, in addition to this,
open source has gotten
to a point where a lot of those components
are already available and already pretty high quality.
I think that's why we're getting to this inflection point
where people are actually rethinking their strategy
as a kind of proprietary monolithic software
and thinking a little more about composability
and open source and open standard.
Now, that makes a lot of sense,
but a quick question here and I want your take on that first, Pedro, and then I want to ask
the rest of the folks here because I want the perspective from both someone who works
in a hyperscale company like Meta, but also trying to figure out how that reflects to the rest of the world out there because not everyone's meta, right?
So you talked about this inflection point.
At some point, you need this modularity.
It emerges as a need.
Can you tell us a little bit more of how this was experienced by you?
Because I'm pretty sure there was some kind
of evolution. There was Hive, then we started having the rest of the system, then we reached
the point where you even had to take out the execution engine itself and make it a module
on its own with Velox. So tell us a little bit more about how these things happened inside a company like Meta.
Yeah, sure. I think a lot of that just comes because data workloads are always evolving,
right? So first you want to execute large MapReduce jobs, then you want to execute
SQL queries, then there's stream processing, then there's log analytics, then there's transactional.
So I think there's a lot of different types of data workloads. And I think the fact is just that
we cannot build a single engine
to support all of them.
So this kind of drives what we call specialization.
So what we end up doing is
you develop a single engine to support
each kind of slice of this workload.
So we have one engine that supports
really large ETL SQL-like queries.
You have another one for interactive dashboards.
You have a series of engines
for kind of transactional workloads.
You have a stream processing engine.
You have now like a training engine
that can feed PyTorch and keep the GPUs busy.
So it's just because we have so many data workloads,
it kind of drove this requirement of specialization.
I think the problem is that those things
were done a lot more organically than intentionally, right?
So it's just, well, there's a new workload evolving.
People just go create a team and they start kind of developing a new engine from scratch, right?
And then you get to a point where we have 20 of those.
And then if you really look closely at them, like they are not the same, but a lot of the components are very similar.
So I think specifically,
I think to your question around execution, if you look at the execution of all those engines,
they're very similar. Not just, of course, looking at different analytic engines, but even
if you look at something like a stream processing engine, it's not exactly the same, but the way you
define functions and you execute expressions and you do joins, all of that is very similar.
So that's how we started this idea of,
let's first look at the execution
and see what are the common parts
and what we can factor out as a library.
And then just integrate and reuse within those libraries.
So this is how we created Velox,
which is something we're integrating across our execution engines.
And what we saw is that the more we talk to other companies
and we talk to
the community, the more people are really interested in that, because developing those
things is a very expensive project. It costs you, I don't know, hundreds of engineers and it takes
you 10 years. So if it's something that can actually join forces of much larger community
or just reuse an open source project that already does all those things in a very efficient manner,
that's just, you know, it saves you a lot of effort.
But this is how we kind of got into this idea for execution, but there are
similar projects targeted to other parts of the stack as well.
Okay.
That makes sense.
Chris, I think you have something to add here.
Yeah.
So I more or less agree with what Pedro was saying.
I think that the key word there was sort of the organic aspect of this. And I think Ryan called this out as well as looking back to the early
days with HDFS and stuff. I think the big evolution of which S3 is just a continuation
is the separation of storage and compute. And I think Pedro is focused much more on the query
engine aspect of it. I think that probably is a symptom of being at Meta, which is a very large company.
But the alternative sort of, I don't know, storyline that I think people go through is
they get their data on S3 and they're like, oh, okay, I need to query it.
And then like Ryan said, well, there's no ACLs.
And so now you need some kind of ACL thing on top of the query engine.
And then, oh, you need a data catalog or some form of information schema.
And so very organically, you start building out these components.
But because it's kind of piecemeal, because initially you just wanted to query your logs,
right?
And then you start getting streaming data or OLTP data in there.
You start adding stuff over time. And so I think that has been more my journey,
is one of not so much going horizontally
across a bunch of different query engines,
but starting to add more and more features
that a normal database would have,
especially being in fintech most recently.
It's like, you know, you take security very seriously and data
discovery is a whole thing there. So I think that's another point of view on how this stuff
evolved. Yeah, that makes sense, Chris. I want to ask you, because you have also experienced,
let's say, not only the part where the data is at rest and getting processed, but also the capturing phase of
data, the delivery of the data.
Do you see this
concept of composability
that we are talking about, which
let's say comes a little bit more from
the data warehousing or the
OLAP systems, but do you see this kind of
concept being part also of
these systems in front,
systems like Kafka or even OLTP?
What's your take on that?
Yeah, absolutely.
I mean, taking streaming, for example,
when I was working on Samza,
and I think Flink handles this as well now,
there's streaming and the sort of near-line ecosystem
is pretty adjacent to batch.
And so many of the streaming query engines
can also do batch processing on top of HDFS or S3.
It's a very concrete example.
Going even farther upstream,
there are OLTP databases that are experimenting
with sort of bridging the gap as well,
whether that's something like Materialize,
so pseudo OLAP, or something like Neon
that has a tiered storage
layer that includes S3, where it's made persistent via the write-ahead log.
So definitely, you can kind of look in any direction and see disaggregation and components
being overlapped or shared.
Yeah, makes sense.
Ryan, I think you wanted to add something.
Oh, yeah. Our perspective at Netflix was that we had a lot of different engines, and we needed them to work on the same data sets. And we needed to have this sort of architecture
that all plays nicely with streaming and batch
and ad hoc queries and, you know,
anything from a Python script to Snowflake or Redshift.
Whereas I think Pedro's perspective is kind of fun
because he's coming at it from like,
how do we build engines and share components
like the optimizer,
which is a lot of fun as well.
And, you know, Wes as well, where, you know,
can we have really
high-bandwidth transfer of data between those components within the engine
itself?
So I think there are like two separate ends of this conversation.
Yeah. Yeah, we have
more of the user
side of things and the builder
side of things. Wes, I think
you also wanted to add something on the previous thing.
Yeah,
I think it's interesting.
I think the way that we arrived
even at this concept
of composable data stack
or composable data systems, you know, was a little bit organic.
So when I got involved with what became Apache Arrow, I was needing to define basically an ad hoc table format or an in-memory data representation for data frames or tables so that I could hook pandas up to systems in the Hadoop ecosystem.
And there were many other systems that had defined in-memory tabular columnar formats,
either for transferring data, for example, between Apache Hive and clients of Apache Hive,
or many database systems had built in-memory columnar formats that were essentially implementation details of their execution engine.
And they had no interest in exposing those memory formats to the outside world.
And so as I was finding myself basically, you know,
starting to create an ad hoc solution to the problem that I had,
which was connecting Python to these other systems,
it was only at that moment that there was,
you know, a collective realization that we should try to create
some piece of technology that could be used in a variety of different circumstances for solving
that problem rather than creating yet another ad hoc solution that's incompatible with every
other solution to that problem.
And so I think as time has gone on,
people find themselves reinventing the same wheels.
And then finally, if you have the bandwidth or the motivation to build a software project or an open source project
or an internal corporate project that is more reusable or more composable
and you have the experience to do it the right way,
then I think that's what's caused this to happen now
as opposed to, you know, 10 or 15 years ago
when the open source ecosystem
was comparatively a lot more nascent, emerging,
whereas it's a lot more mature and mainstream now.
Yeah, that makes a lot of sense.
Pedro, you want to add something, so please.
Yeah, no, I think just quickly addressing Ryan's point.
I think it makes sense. For us, when we started looking at this space, it
was a lot more from a practical perspective, right?
How can we be more efficient as an organization?
How can we...
And a little more from a software development perspective.
But as we made progress,
we tried to get a little more scientific
with that as well, right?
So essentially, if you stop,
if you remove how the engines
are developed today
and the standards
and components we have
and just think about
what are the different layers,
like what are the common layers
between every data management system?
So we kind of define an architecture saying that every single data management system has a language layer,
which is essentially take something from a user.
Sometimes it's a SQL statement.
Sometimes it's PySpark or pandas or something non-SQL, but you take this.
So there's another component that is just how you represent the computation, right? So you take the user input and you create an IR, which is, I think Substrait was one project targeting
kind of standardizing the IR. But if you look at every single system from analytics to transactional
to data ingestion to machine learning, anything, you have a language, it translates to IR. There's
a series of transformations that you do on this IR, both for metadata, resolving views,
ACLs, security, all sorts of things.
And at some point, you get an IR
that is ready for execution.
It goes through an optimizer.
So every single engine has an optimizer,
or sometimes the optimizer just doesn't do anything.
But there's a system that takes this IR
and generates an IR ready for execution.
There's some code or some component
that can actually execute
this IR given a host
and given resources, which is a little
more what we're targeting with Velox.
And then you can, of course, there's a lot more details.
There's the environment where you run those things,
which is the runtime, and then it goes from
MapReduce, Spark, to, I don't know,
now I heard that Redshift
can run on several architectures, but it's just
this environment that we call the runtime.
So we kind of defined this, and we say that if you look at every single data management system today, they are all composed of those layers.
And of course, those layers are not completely aligned between them and they don't use open standards.
So there's a discussion of exactly which projects are addressing each one of those components and what the right APIs are.
But if you look at all those engines, they kind of follow this model. So this is
sort of the mental model we have internally. Like I said, Velox addresses the execution part,
but you also have some other efforts on the language part, on the IR, even on the
kind of common optimizer.
That makes sense. Ryan, I also want to ask you, based on this mental model that Pedro described, where does Iceberg fit? But you also wanted to add something.
Well, I was going to initially say that our experience creating Iceberg was
largely like wes's where we sort of backed into it by
saying, how do we make people more efficient? How do we make these things work together without
stomping on one another's results and things like that? And what we ended up with was
kind of like Pedro's talking about, we said, Hey, what are the existing concepts for this space
that we should reuse? You know, how do people expect it to work? And I actually really like
Pedro's, you know, breakdown of the different layers, right? The language layer, the IR,
the optimization layer, the execution layer. And then the, I guess, environment, environmental sort of layer,
I forget what you called that one. And then underneath that, I think storage, and that's
where Iceberg fits in, which is weird and orthogonal. And another thing that interests
me here is what is moving between layers. So, you know, security is traditionally done at that very top
point where you understand what the user is trying to do, and then you have enough information to say
whether or not they can do it. But if you have multiple different systems, right, if you're
talking like, hey, streaming or maybe a process, or some SQL warehouse or other system, you need all of those things to have coherent and
similar policy, you actually have to move that down, right? You have to move it beneath all of
those very different engines, and actually into the storage layer. So composability is really causing a lot of change and friction in the ecosystem right now.
So would you say that, let's say, access controls is another component of the stack we are talking about?
I would probably add access controls. I think someone mentioned views,
but like, you know,
reusable IR or
a view type concept
is definitely there.
I would also say
the catalog
as another sort of
reusable component of the storage layer
and how we talk to catalogs.
Here again,
everything has been talking the Hive Thrift protocol for so long that we really
need something to replace it.
Iceberg is coming out with a REST protocol
to try and do that.
So there are a lot of fairly niche components
even within that storage layer.
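As a rough illustration of what that looks like from a client, here is a hedged sketch assuming the pyiceberg library; the endpoint, warehouse, and table names are hypothetical.

```python
# Sketch: addressing an Iceberg REST catalog instead of the Hive Thrift protocol.
# Assumes pyiceberg is installed; URI, warehouse, and table names are hypothetical.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "prod",
    **{
        "type": "rest",                        # Iceberg REST catalog protocol
        "uri": "https://catalog.example.com",  # hypothetical endpoint
        "warehouse": "analytics",
    },
)

# Any engine that speaks the REST protocol can resolve the same table metadata.
table = catalog.load_table("events.page_views")
print(table.schema())
```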
Chris, I think you want to add something.
I wanted to add another one on top of the stuff that Ryan's been talking about.
This is something I spent a lot of time thinking about,
and that's data model.
So, and really, you know, data description.
One thing I didn't mention in the introduction
is I have an open source project
I've been hacking around for a year that is trying to unify kind of nearline, offline and online
data models. So this is sort of thinking about, you know, how do you represent an integer? How
do you represent a date? And that's something that flows up and down all the way down to the
Parquet layer. Parquet has kind of punted on what the data model should be. It's very simplistic
and there's logical stuff. And then on top of that, you start compounding, you know, things all the way up to
the language layer. So it kind of runs the gamut. And it's something that I don't think we fully
nailed. You know, it'd be nice to just say, oh, we're going to use Postgres or we're going to
just use Hive or we're just going to use, you know, DuckDB's format or whatever. But inevitably,
what we seem to end up with is like a lot of coercion and sort of munging.
I think Arrow has wrestled with this a lot.
You know, I was talking beforehand
before we started recording
about their schema flat buffer,
which is a really good reference
if you want to look at like an attempt
to try and model what data looks like
across a bunch of different systems.
It's super non-trivial.
So that's another one I'd like to throw in there
that I would love to see more progress on.
I'll stop there.
Yeah.
Pedro, you also want to add something?
Yeah, no, I think just
I think to the point that Chris and
Ryan just raised, I think
our current at least model is
that language and execution should be
decoupled and they should communicate
via one API.
This API would probably be the IR,
but that essentially means that anything related to the data model,
like SQL, MDX, non-SQL, like anything,
all of that should be resolved on the language layer
in addition to kind of metadata operations, security,
resolving ACLs, checking if users can access particular columns.
All of that should be encapsulated in a way
orthogonal to execution, and they
should communicate via an IR.
So even things like if you want to express
graph computation, then this IR
should have nodes that express
graph execution
primitive. And all of that, so in a
way, all those things should be decoupled from execution
and execution should only take
this IR as input.
And of course,
there are, I think, security details,
like you need to carry
a security token
to make sure that it can actually
pull this data from storage.
But like all the logic of checking
if people have access to columns,
like resolving ACLs,
doing privacy checks,
like all that stuff
should be decoupled from the IR.
It doesn't mean that it necessarily
should be part of the SQL parser library.
It could be something that you have many language libraries that generate an IR,
and then there's some processing that happens on this IR so that those ACL checks,
privacy checks can actually be decoupled, can be orthogonal from whether users are expressing things using SQL or non-SQL.
But we see all of that as being, in a way, orthogonal to execution.
So execution should just mean, OK, this is the computation I need to execute.
It's already checked.
It's already saved.
Let me actually go and execute it.
Chris, please go on.
I think you want to make a comment here.
Well, I actually just wanted to, maybe Pedro would be the best person, but can you define what
you mean by IR? I think that's something that
maybe not everybody intuitively knows
if they've not been knee-deep
in databases for a long time.
Yeah, no, I think that's a good point. I think
IR is a term that we
sort of borrowed from compilers,
but it's essentially this idea of having some
intermediate data structure that can
represent your computations
in a way that you can execute that without ambiguity.
So essentially, in most query engines, that means the physical query plan.
But I just call this IR because that's kind of the term used in compilers for the same idea of kind of decoupling front end and back end.
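To make that idea concrete, here is a toy sketch of what such an intermediate representation might look like; this is purely illustrative and is not the IR of any real project.

```python
# Toy illustration only: a tiny IR for "SELECT region, sum(amount) FROM sales
# WHERE amount > 100 GROUP BY region", decoupled from any SQL parser or engine.
from dataclasses import dataclass


@dataclass
class Scan:
    table: str
    columns: list[str]


@dataclass
class Filter:
    child: object
    predicate: str          # e.g. "amount > 100"


@dataclass
class Aggregate:
    child: object
    group_by: list[str]
    aggregates: list[str]   # e.g. ["sum(amount)"]


plan = Aggregate(
    child=Filter(
        child=Scan(table="sales", columns=["region", "amount"]),
        predicate="amount > 100",
    ),
    group_by=["region"],
    aggregates=["sum(amount)"],
)
# Any backend that understands this structure can execute it, regardless of
# whether the frontend was SQL, a dataframe API, or something else.
```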
Ryan.
So I completely agree.
I think IR and Substrait or similar projects
is one area that I'm most excited about
because it is super useful, right?
Being able to exchange query plans
basically gives you views.
Being able to pass off something from any language, whether that is SQL, or
hopefully something eventually better. You know, like, that is all really cool. But I think one
aspect that I want to bring in here is that it's probably not enough. There's always going to be that guy who doesn't want to use a data frame API in Python to do his processing.
He wants to jump into Python code.
And like people have attempted this with like taking Java bytecode and translating it into SQL operators.
And it's a gigantic mess. So like, you have to have either some willingness to
use a language that produces IR, or the rest of the components in the stack actually need to
support everything with even stronger, like, orthogonality. So I think that when it comes to building at least a storage layer, the storage
layer doesn't actually get to assume that you're going to use IR and an optimizer or any particular
execution, right? We need to be able to secure the data, no matter what you're using. We need to,
you know, be able to give you that data no matter what you're
using and have like very well-defined protocols and standards at that level, at least.
Ryan, one question though here, and then I'll give the microphone like to Wes because I think
he can also like add a lot to that. You mentioned, I mean, I understand what you're saying about
like storage, and how storage
is decoupled from that, but there is one thing, and that connects with what Chris
was also talking at some point about like the type systems and like the data types and
the model itself, which at the end goes down to storage too, right?
Like this data somehow needs to be represented, and it has to be able to serialize,
deserialize, whatever types you have there,
and this is something
that, like, goes through, let's say,
all the different components.
So, how do you
deal with that? I mean, because
what I hear so far, and sorry for that,
is that, oh, storage says,
like, well, we can stay
away from that.
Pedro says, oh, these things will be resolved
by the front-end parts,
like the parser and whoever generates the IR.
It's like a hot potato that you throw to someone else.
And at some point, we have to deal with it.
So there was a reason that this whole thing was a monolith, right?
And I think that's what we are starting to surface here.
These APIs at the end, communicating with the openness that we want to have, it's not that easy. I'll go back to Chris's point, right? Which is, if Iceberg and Arrow and our IR all used similar type systems,
then we would be a whole lot better off. I do not doubt that. If Wes and I had agreed
10 years ago on the set of types that we would support, it would be a whole lot easier.
And that's why Substrait,
when they started that project, they took a look at all the type systems out there and said,
we're only going to allow types in if it's supported in like two or more large open source projects. I think one of them was Arrow, one of them was Iceberg, you know, I think Spark and
some others, you know, so that is definitely a problem
where we could use a more coherent standard. But let me also explain and argue for,
you know, the fracturing here. It is the way it is because there's a huge trade-off.
And dealing with that trade-off is why we have so many different approaches.
I think Arrow takes the side of the trade-off
to be more expressive and say,
hey, if you want to use four, two,
or eight bytes for that type,
you can go ahead and do that.
Whereas on the Iceberg side,
we're trying to keep the spec very small
to make it easy for people to implement.
And there's just a fundamental trade-off there,
and you've got to strike the right balance.
Anyway, sorry.
I've talked for a long time.
Wes, you wanted to add something?
Yeah.
I mean, on this kind of discussion
of IRs and the relationship
between the front end of a data system
and the back end,
I mean, I think one of the earliest
and probably most successful systems
in the composable data stack
is Apache Calcite,
which was
created by Julian Hyde.
The idea was it's
a database front-end
as a Java library, so it
does SQL parsing,
query optimization,
query planning, and it
can emit an optimized
query plan on the other end, which
could be used for physical execution
or a logical query plan,
which can be turned into a physical plan and then executed.
But I think that Calcite was really important
in terms of socializing this idea of a modular front end
that could take
responsibility for those parts of building a database system or a data warehouse that are
you know, traditionally something that system developers would want to have a lot of control over.
I think Substrait is interesting because it's provided for, like, standardizing on what that thing is that your
parser, optimizer, and query planner
emit. And that's
something that historically was not standardized.
And so when people would want to use
Calcite, they would implement Calcite, but they'd have
a bunch of Java code using Calcite
and then they'd have
a bridge between Calcite
and their system to go
from the logical query plan into execution.
And so it's definitely been a journey
to get to where we are now.
And obviously, down in the weeds,
we have the issue of the data types
and trying to agree on all the data types
that we're going to support in all these different layers,
which creates a lot of complexity.
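As one concrete example of that standardization, DuckDB ships a Substrait extension that can emit and consume serialized plans. The sketch below assumes that extension is available; the exact Python API and extension packaging may differ between versions.

```python
# Sketch: exchanging a standardized plan between a frontend and a backend.
# Assumes DuckDB's Substrait extension is installable; API may vary by version.
import duckdb

con = duckdb.connect()
con.install_extension("substrait")
con.load_extension("substrait")
con.execute("CREATE TABLE t AS SELECT range AS x FROM range(10)")

# "Frontend": compile SQL into a serialized Substrait plan (protobuf bytes).
plan_bytes = con.get_substrait("SELECT x FROM t WHERE x > 5").fetchone()[0]

# "Backend": any Substrait-consuming engine could execute the plan; here DuckDB itself.
print(con.from_substrait(plan_bytes).fetchall())
```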
Yeah. Before I give
the microphone to Pedro because he
would like to add something here,
I do have to ask
the two of you, Wes and Ryan,
why you couldn't
agree on the types
10 years ago?
10 years ago, we were still screwing up timestamp.
I think it's still-
That is correct.
Yeah.
I think the answer to that is because there was a lack of composability, which means that
people were implementing the same thing over and over in slightly different ways.
That makes sense. Pedro, you wanted to add something.
Please, go ahead.
Yeah, no, just adding on the conversation about data types.
I think that is complicated because we're talking about different things.
There's many different levels of data types.
I think on this discussion, there are at least three, right?
There's the storage data types, which are the things that the storage and the file
format actually understand, which are usually things like integers, floats, and strings,
really kind of primitive data types.
Then there are kind of logical data types that the execution can understand, which are
things like maybe timestamps; sometimes the storage also understands timestamps. But there are
also maybe user defined data types, which are kind of higher level types that users can define.
I think some examples are things like sometimes when people are defining IDs, they don't want them to be just integers.
They want actually like a higher level data type that just maps to an integer, but adds some level of semantic, right?
So I think there's different levels and those different levels have different trade-offs and some of them are easier to extend.
Some of them are more efficient than others.
But I think that's why we have this model of defining what should be resolved
where: things like resolving user IDs into integers, those are things that should
be resolved in the language layer.
They should be translated into types that the execution understands,
so we can efficiently process those things.
Things like, for example, defining functions that are based on those types.
That only works if the execution understands those types.
And then there are also types that need to be understood by the storage, which is a lot
more around kind of storage efficiency, size, and related to encoding.
So they're kind of different levels.
So it depends on which types exactly we're talking about, but anything related to more logical types, the data model,
again, I would say that all those things
should be resolved on the language layer
and then just captured in the IR.
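As a purely illustrative sketch of those three levels (all names and mappings below are made up, not from any real engine), the language layer might resolve types roughly like this:

```python
# Purely illustrative: three levels of types, with made-up names and mappings.

# Language layer: user-defined types resolve to logical types the engine understands.
USER_DEFINED = {"UserId": "int64", "Money": "decimal(38, 9)"}

# Storage layer: logical types map down to physical storage types and encodings.
LOGICAL_TO_STORAGE = {
    "int64": "INT64 (plain)",
    "decimal(38, 9)": "FIXED_LEN_BYTE_ARRAY",
    "timestamp(us)": "INT64 (delta encoded)",
}


def resolve(declared_type: str) -> tuple[str, str]:
    """Resolve a user-facing type to (logical type, storage type)."""
    logical = USER_DEFINED.get(declared_type, declared_type)  # language layer
    storage = LOGICAL_TO_STORAGE[logical]                     # storage layer
    return logical, storage


print(resolve("UserId"))         # ('int64', 'INT64 (plain)')
print(resolve("timestamp(us)"))  # ('timestamp(us)', 'INT64 (delta encoded)')
```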
Okay, that makes sense.
One last question here about types,
and I'll shut up about types.
I promise.
And I want to ask Chris,
because Chris can make this connection
with the part we are talking about,
which is the database systems,
but there's also applications out there.
There are application developers.
There are people out there who generate the data
that we store and then we process.
And in many cases,
these people don't necessarily
have or need to understand what's going on with the data processing infrastructure that we have,
but they are feeding us with the data. So Chris, when it comes to the type system that we are
talking about or the formats that we are using to represent the information and move it around, from your perspective, is there something else that has to be added to solve this problem end-to-end?
Something else to be added. I think in my mind, sort of my intention with Recap, was to have something
like a substrate or an IR, but more specific to the metadata layer, which is a way to describe,
in the abstract, the data that is flowing across the online, nearline, and offline world
that would account for a large amount
of the coercion that we see now.
I think there's always going to be
a little bit of coercion
because to not have type coercion,
you essentially need something
that looks a lot more like CUE,
which is this sort of academic,
very academic looking project
that is essentially not usable
for the average engineer.
It's just too complicated.
And so to Ryan's point around the complexity
around all this stuff,
you can't make it too complicated
for these application developers to use.
And then as soon as you try
and make it a little more simple,
you end up with some form of coercion.
But I think the thing that I would like
is some common way to describe
the data sets across the different stacks and layers.
The closest I've seen to this instantiation,
aside from Recap, is actually what Arrow does.
They essentially have two different layers.
One is the schema flat buffer layer that describes,
the way Pedro was talking about it,
a very specific, like, here are the bytes.
You can have a float, and the float can be 16, 32, 64, 128, right?
But most developers don't want to say, like,
float 16, float 32, or whatever.
So what Arrow ends up having on top of that is like an actual implementation
that gives you decimal 128 as an actual type.
And so there's sort of two tiers to it.
But, you know, for better or for worse,
that schema stuff is mostly wrapped up and used by Arrow.
And I would like to take that out of Arrow
and use it across all these systems
so that you can sort of portably
move around the data description
from one vertical to the next.
That's sort of the area that I would like to
see improved.
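For reference, this is roughly what that "implementation" tier looks like with pyarrow today; the field names below are just examples, while the type constructors are real pyarrow calls.

```python
# Example: Arrow's concrete type system at the implementation level.
import pyarrow as pa

schema = pa.schema([
    pa.field("id", pa.int64()),
    pa.field("price", pa.decimal128(38, 9)),       # a concrete decimal128 type
    pa.field("score", pa.float16()),               # explicit bit-width floats
    pa.field("created_at", pa.timestamp("us", tz="UTC")),
])
print(schema)
# Under the hood, this schema can be serialized via Arrow's Schema flatbuffers
# and shared across processes and languages.
```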
Ryan, you want to add something here?
So
Chris's
description here just triggered
something in my head, which is I have a little soapbox about
losing structure of data and constantly coercing types. And I really think that one of the
promising factors or, you know, promising aspects of composable data systems is like, stop losing structure, you know, share tables instead
of sharing CSV, right? It's always pretty ridiculous that I'm dropping CSV or JSON,
right? I'm destroying the structure of my data set in order to push it over to you.
So I think that it will actually get better, hopefully, when we have, you know, more ability to
share the actual representation and do so securely, and some of these other things mature. But I also
entirely agree that we need to get to the point where we, you know, have some idea of a format that can actually do that
exchange as well.
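A small illustration of the "losing structure" point; the file names are hypothetical, and pyarrow is assumed.

```python
# Round-tripping through CSV loses type fidelity; an open columnar format keeps it.
import pyarrow as pa
import pyarrow.csv as pacsv
import pyarrow.parquet as pq

table = pa.table({
    "user_id": pa.array([1, 2, 3], type=pa.int32()),
    "amount": pa.array([9.99, 5.00, 12.50], type=pa.float32()),
})

pq.write_table(table, "events.parquet")
print(pq.read_table("events.parquet").schema)  # int32 / float preserved exactly

pacsv.write_csv(table, "events.csv")
print(pacsv.read_csv("events.csv").schema)     # types re-inferred (e.g. int64 / double)
```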
Okay. Enough with types.
Eric,
I've monopolized the conversation.
Oh, it's been
so good.
Yeah, I think one thing
that's been so fun just hearing the conversation
is that I've
heard multiple times, oh man, we've come so far, and then also, well, yeah, you know, that is a problem. And I think the common
thread through all of that seems to be this desire for open standards, and there are different areas
of the composable stack, you know, where that seems to be a big need.
And so I'd just love to hear where have we come in the last several years as far as open standards?
And then what are sort of the frontiers that are most important?
Wes, maybe we can start with you.
Just give us a brief history sort of of your view of where we've come.
Yeah.
I think there's...
I mean, I think, you know,
the main things that came out
of the Hadoop ecosystem was
open standards for
file formats.
So basically the foundations
of what we now call
open data lakes.
So we ended up with,
of course, multiple competing standards,
so Parquet files and
ORC files and some other
open standards like
Avro for developing
RPC client-server protocols,
things like Thrift and Protobuf, which have been widely
adopted for building
client-server interfaces.
I think in the last
10 years, moving up the stack
into
in-memory data transfer,
in-process
memory transfer or
inter-process memory transfer
with Arrow.
That was a hard-won battle, but it's great.
It's been great to see that achieve, you know,
wide adoption.
I think interoperable or open standards for computing, more of the computing layer, is more of an emerging thing that's starting to happen, that historically
there really wasn't very much of. And so I think we've gone from an era of these limited open
standards for data interchange and data storage to starting to think about more of the
runtime, you know, what happens inside of processes, rather than just how we store data
at rest or move data on the wire.
Makes total sense. Any other thoughts from the rest of the panel?
Yeah, I think maybe just adding to what Wes said.
I see that, again, my mental model is that there are two things.
One is defining what are the APIs and standards.
And the second thing is actually having implementation for those things.
And those don't really go hand in hand.
Sometimes there's no standard and sometimes there is a standard, but there are multiple implementations and they're not compatible or the opposite might also be true.
But I think maybe to your question, to what are the open standards and APIs,
like if we follow the model I presented
and go around the stack,
like if you go,
start from the storage layer,
usually the storage layer
is just some sort of block API, right?
So this is, you know,
already pretty well understood.
You usually have some notion of a file
handle, offset, and size.
So this is just how you pull blocks.
And then there's this idea
of how do you interpret what those blobs of data mean, right?
And then I think like, well, like Wes mentioned, I think like Parquet, ORC, Avro, those are
kind of well understood, even though the implementation is all over the place.
There can be many Parquet reader and writer implementations.
They're not necessarily compatible, but there is a standard.
So you start from decoding those things.
Then there's how do you represent those things in memory, which is, I think, what Apache
Arrow defines really well.
So if you need to represent this columnar data set in memory, how do you lay out this
thing in memory?
So I think Arrow solves that problem.
Then there's the question of if you need to process this data and apply functions, apply, you
know, different operators, what is the relational semantic you follow?
So I think there's another discussion.
So if you look at Spark, Spark has a certain semantic, which kind of loosely follows ANSI
SQL.
Presto has a different semantic.
MySQL, Postgres, and you name it, there are probably like 50 different semantics you can
find, and none of them are compatible with each other.
They all sort of look the same.
They have similar functions, but they're never compatible.
So there's this idea of like, what is the standard for the semantic that your operations
and your functions need to follow?
Then if you go up another layer, there's a discussion on how you represent this computation.
So how do you know that, well, you need to scan this data, then you need to sort it,
then you need to apply filter and then you need to join and you need to shuffle this.
So this is what Substrait was supposed to do, or it's meant to do, just essentially
having an open standard for representing this computation.
Then if you go up, there is a discussion on what the APIs are for how users represent
this computation, right?
Which is where we have SQL, which is probably the worst standard of all of those.
It's very loosely defined.
There are just so many different implementations.
They're never compatible between each other.
There's also discussion of non-SQL API.
So you have Pandas, you have PySpark.
They're all non-SQL, but still there's no standard.
So I'd say that this is probably the highest level on top of that.
There might be even higher levels of how developers interact with those things.
Like maybe you can have ORM or different sorts of abstraction that actually map
into non-SQL APIs or map into IRs.
So there might be some other APIs on top of that, but I don't think there are any
real industry standards on those. So at least
that's my mental model if you go across the stack. Some of them, for some of those areas,
there exists some standards, but they are not as strongly defined as we would like them to be.
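One way to picture that stack is as a set of narrow interfaces between layers. The sketch below is purely illustrative; none of these names correspond to a real project's API.

```python
# Purely illustrative: the layers described above, expressed as narrow interfaces.
# None of these names correspond to a real project's API.
from typing import Iterable, Protocol


class BlobStore(Protocol):          # storage layer: block/blob API
    def read(self, path: str, offset: int, length: int) -> bytes: ...


class FileFormat(Protocol):         # file format layer: Parquet, ORC, Avro, ...
    def decode(self, blob: bytes) -> "RecordBatch": ...


class TableFormat(Protocol):        # table layer: Iceberg, Delta, Hudi, ...
    def plan_files(self, table: str, predicate: str) -> Iterable[str]: ...


class ExecutionEngine(Protocol):    # execution layer: evaluates an IR plan
    def execute(self, plan: "IRPlan") -> Iterable["RecordBatch"]: ...


class Frontend(Protocol):           # language layer: SQL, dataframes, ...
    def to_ir(self, user_program: str) -> "IRPlan": ...


class RecordBatch: ...              # stand-in for an Arrow-style in-memory batch
class IRPlan: ...                   # stand-in for a Substrait-style plan
```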
Yeah. Would you say, I mean, it kind of sounds like roughly not perfectly, but,
you know, sort of from bottom to top, as you described it, is sort of the sliding scale of maturity, right?
Like the stuff at the top, you know, sort of least representative of open standards.
Would you say that's sort of generally true?
What do you think about that?
I think not necessarily.
I think it depends on which projects came first.
What is the, I don't know, how much people actually adopt them in practice.
I'm not sure if there's a correlation to kind of how deep or how high up in the hierarchy it is.
And sorry, there's one, I think, very obvious layer that I forgot to mention,
which is a table API, right?
Which is somewhere between the storage layer and processing, I think.
I think that's a big one, which is, well, we have Iceberg, but we also have Hudi.
We also have Delta Lake.
We have Meta Lake inside Meta. So there's another example of, you know, there is an open standard, but it's not, you know,
100% adopted everywhere, which is also, I mean, good, but not great.
Sure.
Chris?
Yeah, I was just going to answer, I think, Eric, your question on sort of where things
are the most mature.
And I really think it has a lot to do
with who wins a given space, right? And so I think if you look at what drives a lot of things,
it's like the Arrow API or data frames or whatever. And so everything is having to integrate with that,
right? And so I think as there is, you know, theoretically one winner, we can all dream,
one winner in the storage layer, then that will sort
of solidify what that protocol is going to look like.
But the more people you have competing, the more chaotic it is and the harder it is.
Yeah.
I think what really drives a lot of the APIs actually is sort of organic through whoever
wins gets to decide what the API looks like.
So in that regard, I don't think it's bottom-up the way you described.
I think it's kind of middle-out.
I think the API layer is really enforcing people to fit into that a lot more than bottom-up.
Yep, makes total sense.
One thing I'd love to discuss is we kind of already got into data types, you know, which is certainly, you know, a trade-off, I would say.
When you think about composability as compared with, you know, sort of a singular monolithic system, what are some of the other trade-offs?
Like, how would you define some of the other trade-offs of the composable system?
And Pedro, maybe we can start with you.
I think that's interesting, right?
I think some of the discussions we usually have with people working on this space is,
I think what we mentioned before, like in a lot of cases, it's harder to build a composable
system than it is to build a monolith, right?
So sometimes if you're just optimizing for speed and you just want to build a prototype
and have an MVP customer running on top of it as fast as you can, it's easier to just prototype and just create a new monolithic system.
Where we think that this fails is that it's usually easy to create a first version.
So you can run a very simple query that supports this workload.
But then a few months from now, they need to support a new
operator and they need a new function. And then as this thing grows, I think it kind of slows down.
And then I think in the long run, it just, it doesn't pay off. But it's a lot harder when you're
starting something like, should we actually spend a few months trying to understand, I don't know,
engaging with the Arrow community, understanding how Arrow works, or understanding how Velox works.
And if you need to make any changes,
you need to engage with the community.
And it's a lot easier in a lot of ways
to just kind of fork all those things
and kind of make sure you can move a lot faster.
So I think this is one of the obvious trade-offs.
There's also this kind of bias developers have
that it's something we elaborate on in the paper as well,
that we like to think that we can do things better, right?
So I'm not going to use Arrow.
Like if I need to create this, I'm probably going to write this better.
So we see this kind of pattern with users, especially with, well,
experienced engineers, over and over.
People don't want to reuse something because they think they can just
do it better, and in some cases they can, but in a lot of cases it's also not true.
There's also this part of just people prefer to write their own code than to understand other
people's code. So instead of spending a month understanding Velox, I'm just going to go and
create something that I fully understand in a few weeks. And then six months later, when you leave
the team, then the next engineer has the same problem. I think there's a lot of those kind of
fallacies that we hear over and over. Some of them are kind of fair, like this time to market. I do feel like it's true, but you usually end up paying the price in
the long run. Again, like I mentioned, we elaborate on some of that in the paper as kind
of the reasons why composability hasn't happened before. It's just because there's a lot of
kind of those internal biases that engineers have, and some of them are kind of driven
by business needs as well.
Can I...
Sorry, Eric.
I want to ask something here.
So, because you mentioned about, like,
why composability didn't happen earlier.
And actually, it is, like, a question that I have
about data management systems.
Because data management systems are, like, very complex,
but they are not the only complex systems we have out there.
We have operating systems,
and composability in operating systems
has been a thing for a very long time.
Same thing also in a little bit of a different way,
but also with compilers.
We see the difference between having the front-end
and LLVM and other systems on the back-end
and all these things.
But why, in database systems, did it take us that long
to get to the point of appreciating
and actually implementing this composability?
And this is a question to all of you, obviously, guys,
because you have all of you a very extended experience on that.
So please.
I think that part of this comes down to, you know, commercial interests, which I think is a big part of the data industry, right?
At least where we sit at the storage layer.
Storage provides opportunities that the execution layer can take advantage of for better performance. And if you can control the opportunities and you control the execution layer,
you can make something that just fits together really nicely
and has excellent performance.
And at least so far in the storage world,
it has not been a thing to get your storage from another vendor. Like, that is, you know, a really weird thing that is happening
now, that Databricks and Snowflake, or, you know, choose your other vendors, Redshift,
can share the same data sets. And it comes down to who controls that data set, who controls the opportunities that are presented to the other execution layers.
Like the world gets really weird in this case.
And I think that, you know, part of it is just how we've historically architected these systems.
Right. I think of the Hadoop sort of experiment as this Cambrian explosion of, you know, data
projects that questioned the orthodoxy of how we were building data warehouses.
And that led to, you know, pretty primitive separation of compute and storage that we
then, you know, matured and eventually got to this point where, yeah,
you can, using projects like Iceberg, you can safely share storage underneath these
execution engines.
And that's what is really pretty weird right now.
But all throughout our history, we have not been able to actually share storage, share it reliably, share it with
high performance, and things like that. So I think that, you know, the business model of all those
companies that have never had to share storage and are built around like, hey, we sit on all your
data and, you know, lease it back to you for compute dollars.
That has been a very powerful driver in the other direction.
That makes sense.
Pedro, you want to add something?
Yeah, I think maybe adding to what Ryan said and addressing your question of why we think composability is more important for data systems,
but why isn't it such a big thing for compilers and operating systems, for example? I see maybe a lot of that just driven by variety. So how many widely used C++
compilers can you name and how many operating systems can you name and how many data systems
can you name? So there's a lot more, I would say, even wasted engineering effort in redesigning and redeveloping those things for data systems than there is in operating systems and compilers.
I think a lot of that is just because the APIs of those systems are actually user-facing,
right?
So users interact directly with databases.
So users' needs and user requirements are evolving a lot faster than requirements for
operating systems and compilers.
So I feel like the APIs of those systems are a lot more stable and they don't evolve as fast.
So I think those systems are also a lot more mature, right? So there may be less incentive to make them composable because there are only a handful of implementations of those. But data systems are a place where you literally have hundreds, probably thousands of different engines that all have some degree of repetition. So I think there's a lot more incentive to say, okay, let's actually see what are the right
libraries that we can use to accelerate those engines, especially because workloads have
been evolving and they're going to continue evolving in the future, right?
So the workloads we had 10 years ago are very different from the ones we have today, and
they're probably going to be very different from the workloads we're going to have five years from now.
So it's not just making the engines we have more efficient; we need to be more efficient as a community in how we adapt data systems as user workloads evolve.
Eric, back to you. Sorry for interrupting again.
Oh, yeah. Well, actually, we're fairly close to the buzzer here.
We got a couple more minutes.
And so one thing that I would love to hear from each of you is what new projects you're
most excited about.
You know, we've talked a lot about sort of the history of open standards and projects
that are, you know, that have pretty wide adoption.
But I'd just love to know what excites you
to sort of look at the newest stuff out there.
So Ryan, why don't we start with you?
I'm pretty excited about IR and projects like Substrait.
I'm also excited about some of the other more composable
or newer APIs in this layer.
Iceberg just added views.
We are also standardizing catalog interaction
through a REST protocol,
really trying to make sure that everything
that goes into that storage layer
has an open standard and spec around it. And I think that is going to really open up not just
composable things, but towards the modular end of the spectrum where stuff just fits together
nicely. You can take any database product off the shelf and say, hey, my data is over here.
Talk to it.
So I'm pretty excited about how easy it will be to build not just these systems, but actual
like data architecture based on, you know, these ideas.
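Just to make that "my data is over here, talk to it" idea concrete, here is a minimal sketch using pyiceberg as one possible client; the catalog name, endpoint, token, and table name below are all made up, and the exact properties depend on your deployment.

    from pyiceberg.catalog import load_catalog

    # Connect to a REST catalog (hypothetical endpoint and credentials).
    catalog = load_catalog(
        "prod",
        **{"type": "rest", "uri": "https://catalog.example.com", "token": "<token>"},
    )

    # Any engine or client that speaks the same REST protocol resolves the
    # same table by name and reads the same underlying data files.
    table = catalog.load_table("analytics.events")
    arrow_table = table.scan().to_arrow()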
Very cool.
Chris, you're next on my screen.
Yeah, I think I'm going to call out one that's sort of obscure.
There's a fellow over at LinkedIn working on a project called Hoptimator that kind of creates this single view of streaming queries and also kind of data warehousing queries,
and they've done some really interesting experiments to sort of plug it into Kubernetes
and stuff, so that's something that I think is really fascinating. It essentially kind of uses Kubernetes
as like a metadata layer. And then it has a JDBC implementation that will allow you to query,
like basically do a join between Postgres and Materialize or between, you know, data that's in
Iceberg and data that's in MySQL. And so it's really an experimental project.
It's super interesting.
And the guy that's working on it, Ryan Dolan, really has a lot of interesting ideas.
So that's the one I want to call out.
Very cool.
All right, Costas, you're next on the screen.
Did you think I wasn't going to ask you?
Oh, me?
Yeah.
Yeah.
That's a good question.
Actually, I'll probably mention something that is not that much in the context that we are talking about, but in a bit more of an extended context.
I'm very excited about virtualization technologies
like gVisor, for example, and Firecracker, and how these
can change the way that we build data systems.
One of the really hard problems, in my opinion, when you build solutions and products
in that space is how you build multi-tenancy and how you can deliver
that in a way that you, as a vendor, can build margins
and at the same time provide the isolation guarantees
that people need out there for their data.
So this interaction between virtualization and on top of that,
like how you build systems, it's something that I find extremely interesting.
There are some companies
like Modal, for example, experimenting
and using gVisor.
It's very interesting.
Anything that
takes the data systems and
delivers them to the users
out there in ways that
are, let's say, a little bit closer
to the stuff we've seen
with applications.
It's super, super interesting, I think.
And we see a little bit more of this in the OLTP space, like with the Neon database and these vendors that are, let's say, a little bit more ahead of the curve compared to the OLAP systems. But I think there's a lot of opportunity also for the OLAP systems to exploit these technologies
and build some amazing new stuff.
So that's what I would say.
All right, Pedro.
Yeah, I think there are a lot of open source projects that come to mind, but I think
specifically in this composability area, I would say, personally, and just making a quick plug here, it's Velox.
I think that's the project closer to my heart.
We're making really good progress.
We're getting to a point where more than 20 different companies are engaging with us and helping develop.
We have more than 200 developers.
So we're making some really quick progress.
It's integrated into Presto, integrated into Spark. We're seeing like 2 to 3x efficiency wins on those systems, which is huge. So I think this is the project
that I'm very closely involved in. So it's super close to my heart. We're also working with
hardware vendors to add support for hardware acceleration in kind of a transparent manner.
So I do feel like it's going to become kind of even more popular than
it is today.
There's also a discussion on file formats, right?
So I think, well, today Apache Parquet is probably the biggest one,
but I think there's a consensus in the community that it's already getting closer to the end of its life.
So there's a discussion of what's next,
what's after Parquet and what this format looks like
and how do we actually create this format
and create a community around this.
So I would say that we're probably going to see more projects
specifically on this area of file formats soon.
I think going up the stack, Substrait
was something that I was super interested in,
like actually having the way of expressing computation
across engines standardized,
I think was a super interesting proposition.
Even though I think the actual adoption in existing systems has been a little slower.
I think there's also a discussion that, well, from a business perspective, why would you invest in using Substrait instead of your own thing?
Like, what is the value of that?
I think with Velox, there was a clear value of actually making your system cheaper. So I think for Substrait,
maybe there's still a discussion on how exactly we frame this and how exactly we position the project for larger companies. But I think in general,
another project that's super interesting to me.
And Wes, bring us home.
Yeah.
Like Pedro, I'm really
excited about the progress
in modular
composable execution engines,
Velox and DuckDB being
two prime examples. Another one is
DataFusion, a Rust-based
query engine in the
Arrow ecosystem.
And so my company, Voltron Data, you know, we basically were building a system, you know, maybe inspired by the name of the company,
to be able to take advantage of the best of breed in modular execution engines. And so I
think we'll see more and more systems built to, you know, take advantage of this trend as time goes on,
rather than building your own
execution engine, so that you can just use the best available engines for your particular hardware,
SIMD architecture, or whatever the characteristics of your data system are. Another thing that I'm really
excited about real quick is something we haven't really talked much about, which is the language
API, the language front end for interacting with these systems.
I think there's kind of an awakening,
and for a long time people have been awake to it,
that SQL is kind of awful.
And so there's a bit of a movement
to define new query interfaces
that do away with some of the awfulness of SQL,
but also can hide the complexity
of supporting different dialects.
So a couple of projects there,
pretty interesting.
Malloy, created at Google
by the former Looker people.
There's another project called PRQL.
Yep.
Prequel.
That's very cool.
You know, I created a project called IBIS for Python, which is like a data frame API
that generates many different SQL dialects, you know, under the hood.
There's a big team, pretty good sized team working on IBIS now.
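As a rough sketch of that idea (the table schema here is made up, and the exact API can vary between Ibis versions):

    import ibis

    # One expression against a hypothetical table schema...
    t = ibis.table({"user_id": "int64", "amount": "float64"}, name="orders")
    expr = t.group_by("user_id").aggregate(total=t.amount.sum())

    # ...rendered as SQL for different backends from the same code.
    print(ibis.to_sql(expr, dialect="duckdb"))
    print(ibis.to_sql(expr, dialect="postgres"))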
Thomas Neumann and Viktor Leis from TU Munich have a new project called SaneQL and a paper at CIDR 2024 discussing the many ways
in which SQL is awful and proposing a new way of writing relational queries. So, you know, since a lot of us are very focused on building viable systems for solving these problems, I think being able to move on to more like, hey, well, how do we make people even more productive? And that includes all the way down to the code that they're writing to interact with these systems. That user productivity is going to become an increasingly important focus area, I think, for the coming few years.
Yeah, absolutely. Ryan had mentioned that earlier, and I was
like, ooh, should I jump in and go down that rabbit
hole?
One question.
You mentioned Calcite earlier, but I haven't heard from anyone about anything that is happening when it comes to optimizers, which is a big part of these systems.
At the end, that's the part that actually takes the computation,
optimizes it, and makes it efficient, and does all the magic there.
Is there anything interesting happening in this space?
Would you like to see anything happening there?
I'll go. I think for me and our team, we have been deeply discussing this,
basically using the same ideas
of having more modular execution.
Can we have a more modular optimizer?
I think the first reaction
that people have is that,
well, that's unthinkable,
that the optimizer is very specific
to how the engine works.
But if you actually stop and think about this,
there are ways to build this.
So we actually,
we have someone on the team
prototyping
some of those ideas. I think
where this stops is basically on
prioritization. How much value
does that provide, and why would people invest in this right now?
So I think that's where things
stand. I think from a more
academic and scientific perspective,
it's super interesting. It's something that I would love
to spend some more time on,
but I see maybe less business value
in investing in this
versus investing in things
like common table formats,
faster execution, and better language.
So I do feel like this is going to happen
at some point.
And as we started looking at this space,
we saw that many other partners
were interested in this.
I think it's more a matter
of what are the incentives.
Makes sense. Anyone else who would like to add something about optimizers here?
I think it depends on what type of optimizer, right? Cost-based optimization is definitely
more tied to the engine. We could probably share a whole bunch of rule-based optimizations and things like that. And I actually think that sort of work is going to happen as we coalesce around an intermediate representation.
Someone's going to write a good optimizer for Substrait at some point.
Or we'll be able to translate to and from easily with it.
And we'll just have that ability, I think.
But then it's like all those designs around,
well, now that I need to incorporate
how my engine actually acts
and what the real costs are, that gets pretty hairy.
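To make the rule-based part a little more concrete, here is a toy, engine-agnostic rewrite over a made-up plan representation (this is not Substrait's actual format): a filter that only touches one side of a join gets pushed below the join, and nothing in the rule depends on a particular engine.

    from dataclasses import dataclass

    @dataclass
    class Scan:
        table: str
        columns: tuple

    @dataclass
    class Filter:
        child: object
        column: str
        value: object

    @dataclass
    class Join:
        left: object
        right: object
        on: str

    def output_columns(node):
        # Columns produced by a node, used to decide where a filter may go.
        if isinstance(node, Scan):
            return set(node.columns)
        if isinstance(node, Filter):
            return output_columns(node.child)
        return output_columns(node.left) | output_columns(node.right)

    def push_down_filter(plan):
        # Generic rule: Filter(Join(a, b)) -> Join(Filter(a), b) when the
        # predicate only references columns from the left input.
        if isinstance(plan, Filter) and isinstance(plan.child, Join):
            join = plan.child
            if plan.column in output_columns(join.left):
                return Join(Filter(join.left, plan.column, plan.value),
                            join.right, join.on)
        return plan

    plan = Filter(
        Join(Scan("orders", ("order_id", "user_id", "region")),
             Scan("users", ("user_id", "name")), on="user_id"),
        column="region", value="EU",
    )
    print(push_down_filter(plan))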
Yeah.
I think, just quickly adding to this,
that was essentially the discussion
we had, like, are there actually common things that we can extract from the optimizer?
And I think just adding more color to what I mentioned, we saw that at least like all
the optimizers, they have ways of, okay, how do you define what are the physical capabilities
you have?
How do you cost those capabilities?
I think the discussion was on providing the right APIs so that engines can actually say, well, I support merge joins, index joins, and hash joins, and these are their costs. The part of actually exploring the plan space and costing those things, all that part is very common. So I think the idea was just, how do we define the right API so that it could be reused across engines?
But again, I think it's something where we just stopped on this part of, why should we fund this?
And maybe that's something we will look into in the future.
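Just to sketch the shape of that kind of API (this is hypothetical, not an existing Velox or Substrait interface): the engine declares which join strategies it supports and how to cost them, and the shared part only enumerates candidates and picks the cheapest.

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class JoinCandidate:
        strategy: str   # e.g. "hash", "merge", "index"
        build_rows: int
        probe_rows: int

    @dataclass
    class EngineCapabilities:
        supported_joins: tuple
        cost: Callable[[JoinCandidate], float]  # engine-specific cost model

    def pick_join(build_rows, probe_rows, engine):
        # Shared logic: enumerate the strategies the engine claims to support
        # and pick the cheapest one according to the engine's own cost model.
        candidates = [JoinCandidate(s, build_rows, probe_rows)
                      for s in engine.supported_joins]
        return min(candidates, key=engine.cost).strategy

    # A made-up engine that supports hash and merge joins.
    engine = EngineCapabilities(
        supported_joins=("hash", "merge"),
        cost=lambda c: (c.build_rows * 1.5 + c.probe_rows
                        if c.strategy == "hash"
                        else (c.build_rows + c.probe_rows) * 2.0),
    )
    print(pick_join(1_000, 1_000_000, engine))  # picks "hash" here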
Yeah, makes sense.
Okay, one last thing.
And that's like a question specifically for Ryan.
What is going to replace Apache Ranger, for the enterprise at least?
Come on, I'm hopeful. Someone has to go
and fix this.
I think it's
solving the wrong problem.
So I'm going to answer a different
question that I think
is hopefully relevant.
I am
not bullish on sharing
the policy representation, because in a lot of different systems, the edges have very different behaviors.
So, like, a future-table permission like you have in Tabular versus an inherited database-level permission or something like that, right?
Like what do you do with objects as they go into and out of existence?
I don't think that sharing policy is the right path
because we would have to come up with a union of all the policy choices
and capabilities out there.
I think sharing policy decisions is the right path.
So what we're incorporating into the Iceberg
Catalog REST API, or that protocol, is the ability to say, this user has the ability to read this
table, but not columns X and Y. And the benefit there is it doesn't matter how you store that. It doesn't matter if you're using ABAC or RBAC or whatever you want to use for that
internal representation.
The catalog just tells you the decision.
It says, this user can read it, but not these things.
Or they can't read this at all or something.
And so I think that is going to be the best way
to standardize an exchange, which is essentially you have to pass through the user or context
information necessary for making these decisions. And then the decision comes back in an easily represented way, because it's a concrete decision from the policy rather than storing the policy and all of the ways you could interpret that policy.
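Roughly, the shape of a decision-based exchange might look something like this; it's a made-up sketch, not the actual Iceberg REST catalog spec.

    # Hypothetical decision payload a catalog could hand back to an engine.
    decision = {
        "principal": "analyst@example.com",
        "table": "analytics.events",
        "can_read": True,
        "masked_columns": ["email", "ip_address"],
    }

    def readable_columns(schema_columns, decision):
        # The engine enforces the concrete decision; it never needs to know
        # whether the policy behind it was expressed as RBAC, ABAC, or anything else.
        if not decision["can_read"]:
            return []
        return [c for c in schema_columns if c not in decision["masked_columns"]]

    print(readable_columns(["user_id", "email", "ip_address", "event_ts"], decision))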
Yeah, I think we need to have, like, a dedicated episode just talking about that stuff, to be honest, but for another time. The theme today was different. But Eric, back to you.
Yeah, look, we have a couple episodes here.
We probably should talk about data types. We should probably talk about the death of SQL and access policies.
So we can line up a couple more episodes here.
Gentlemen, thank you so much for joining us.
This has been so helpful.
We've learned a ton.
I know our listeners have as well. So thank you
for giving us some of your time. Thank you. Yeah, thanks. We hope you enjoyed this episode
of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified
about new episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric at datastackshow.com. That's E-R-I-C
at datastackshow.com. The show is brought to you by Rudderstack, the CDP for developers.
Learn how to build a CDP on your data warehouse at rudderstack.com. Thank you.