The Data Stack Show - 175: The Parts, Pieces, and Future of Composable Data Systems, Featuring Wes McKinney, Pedro Pedreira, Chris Riccomini, and Ryan Blue

Episode Date: January 31, 2024

Highlights from this week's conversation include:

- Introduction of the panel (0:05)
- Defining composable data stack (5:22)
- Components of a composable data stack (7:49)
- Challenges and incentives for composable components (10:37)
- Specialization and modularity in data workloads (13:05)
- Organic evolution of composable systems (17:50)
- Efficiency and common layers in data management systems (22:09)
- The IR and Data Computation (23:00)
- Components of the Storage Layer (26:16)
- Decoupling Language and Execution (29:42)
- Apache Calcite and Modular Frontend (36:46)
- Data Types and Coercion (39:27)
- Describing Data Sets and Schema (42:00)
- Open Standards and Frontiers (46:22)
- Challenges of standardizing APIs (48:15)
- Trade-offs in building composable systems (54:04)
- Evolution of data system composability (56:32)
- Exciting new projects in data systems (1:01:57)
- Final thoughts and takeaways (1:17:25)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data. RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcript
Discussion (0)
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by Rudderstack, the CDP for developers. You can learn more at rudderstack.com. Welcome to the Data Stack Show. We have a truly incredible panel here to discuss the topic of composable data stacks. So many topics to cover today.
Starting point is 00:00:35 So let's get right into introductions. And I'm just going to do it in the order that it shows up on my screen. Chris, do you want to start out by giving us a quick background and intro? Sure. Yeah. My name is Christopher Comiti. I have spent the last 20 years of my career at two companies, mostly LinkedIn, where I spent a lot of time on streaming and stream processing and was the author of Apache SAMSA, which was an early stream processing system, kind of similar to Flink. And most recently at a company called WePay, which is acquired by JPMorgan Chase, where I ran our payments infrastructure, data infrastructure and data engineering teams for a stretch of time. I've also written a book for new software
Starting point is 00:01:14 engineers, kind of a handbook, because I was tired of saying the same thing in one-on-ones over and over again. I've been involved in open source. I was a mentor for the Airflow project and helped guide it through Incubator on Apache. I also do a little bit of investing. And so that's where I spend a chunk of my time now. And I, yeah, write a little newsletter on all things systems infrastructure. That's me in a nutshell. Very cool. Wes, you're up. Yeah, I'm Wes McKinney. I'm a serial open source project, open source software developer. I've created or co-created a number of popular open source libraries, Pandas and IBIS for Python, Paciero, kind of in-memory data infrastructure layer.
Starting point is 00:01:59 It's very relevant to the topic of today's show. I've been involved in a bunch of companies, most recently a co-founder of Voltron Data, building accelerated computing software for the Composable Data Stack and Posit, a data science platform company for R and Python. I'm an author of the book, Python for Data Analysis.
Starting point is 00:02:29 So popular reference book for Python data science stack. And I also do a fair bit of angel investing in and around next generation data infrastructure startups. Very cool. Ryan, you're next on my screen. Oh, thanks.
Starting point is 00:03:10 I'm Ryan Blue. I'm the co-creator of Apache Iceberg. Very cool. Ryan, you're next on my screen. architect big data systems, especially in object stores. I'm also a co-founder of Tabular, where we sell an iceberg-based architecture that has security and data management services baked in. I left Netflix to found Tabular and Netflix. At Netflix, we were on the open source big data team. So I got to work on Parquet and Iceberg and replace the read and write paths and Spark and various other things. Very cool. And Pedro.
Starting point is 00:03:40 All right. Hello, everyone. I'm happy to be here once again. I'm Pedro Pedreira, software engineer. I've been at Meta for a little bit over 10 years, always involved in projects around data infrastructure, a little bit closer to analytic engines, log processing engines. So it's been most of my career just kind of developing databases and data processing engines. And I think about in the last five years, I started getting a little closer to this idea of composability and how can can make the development of those engines more efficient.
Starting point is 00:04:10 So I started working on a variety of projects related to the space. One of the projects that we eventually open sourced that got a little more visibility on the industry was Bellox, which was recently open sourced to this idea of making execution more composable for data management systems. But inside Meta, I work with a variety of teams with most of the warehouse compute, large warehouse compute engines like Presto, like Spark. So kind of this data processing area for analytics, developing efficient query engines, that's sort of the thing I do. Very cool. All right. Well, I just want to dive right into it. And Wes, I'm going to point the first question at you and then have the rest of the panel, you know, sort of weigh in with, you know, agreements or disagreements or comments, but let's try to define what a composable data
Starting point is 00:05:03 stack is. The term composability, you composability has been thrown out a lot. There are even companies sort of co-opting the term for marketing purposes, which always adds a lot of confusion out there in the marketplace. But can you give us a definition of what a composable data stack means to you? Yeah, so it's a project or collection of projects that serves to address a data processing need, but where the component systems are built using common open source standards that lend themselves to efficient interoperability, either efficient or simple interoperability. So the different pieces that you assemble to create the ultimate solution for your data platform, you know, can be achieved without the developer having to write nearly so much, you know, glue or custom code to fit the
Starting point is 00:05:57 pieces together. So those points of contact between different the different components of the stack are based on well-defined open standards that are all kind of agreed upon and shared amongst the different component systems. Any additions or disagreements from the crew? Yeah, I think maybe just adding to what Wes said, I see that as maybe two different aspects. One is, I think, this idea of having open APIs and standards upon which different components can communicate, but there's also the idea of using common components, right?
Starting point is 00:06:32 So I think at least how we see this internally, this idea of how can we factor out common components between engines as libraries, and how can we define common APIs and the kind of common standards between them to communicate, right? So if you look at the industry or the projects on this area, there's usually those two things. One is just defining the API and the standard, and the other one is actually implementing things that do something with those standards.
Starting point is 00:06:56 I think there's this idea of just providing those components that communicate via common APIs and making sure that they're somewhat interchangeable. I'd like to add a question here, because we are talking about APIs and libraries and all these things, but who are these projects? Who are these, let's say, if we had to define a minimum set of what defines these, let's say, set of APIs that we can compose data systems with today, what would that be? And I'll start with you, Wes, because I think a lot of that stuff started with Arrow, defining, let's say, the vocabulary that we use today.
Starting point is 00:07:49 So what's the least of these tools that we need in order to do that? Yeah, I mean, I think the easiest way to think about it and the way that I often explain to people is to think about all of the different layers of a traditional database system. So historically, you would have database companies like Oracle that would create these vertically integrated systems that are responsible for implementing every layer of the stack, data storage, metadata management, physical query execution, query planning, query optimization. And then all the way down at the user level you have the
Starting point is 00:08:26 the user api which would generally be you know generally be sql and so if you think about this vertically integrated stack of technologies and you start to think about the logical components of the system you can start to think about okay well you know are there pieces of the stack which could be peeled off and turned into reusable components? And if you want to have a reusable component, per se, let's just say, storage, you think you start thinking about designing open source file formats, or open source systems to say, like, if you want to turn something into a reusable component, like designing composable or to make the components of a vertically integrated system separable and reusable is a lot more difficult and a lot more engineering. Ryan, what's your take on that? I think it's pretty funny. You're right. It is a lot harder. But I think the Hadoop world taught us that you don't actually have to do that work.
Starting point is 00:09:54 Hive tables, no one ever did that work. It was just unsafe. And sometimes you clobbered the result that someone else was getting. And like, you know, we lived with unsafe transactions in the storage layer for a really long time, but it was still super useful. Because a lot of the time, you only had one person changing a table at a time and you were reading historical data
Starting point is 00:10:19 and it just sort of worked. So I think we actually backed into, at the storage layer, at least, making it more reliable and having the behavior and guarantees that we wanted to have. Pedro, please. Yeah, I think I would even go further
Starting point is 00:10:38 to what Wes was saying. I'll say that most companies, they don't even have the right incentives to invest in composable components, right? Because I think, like we said, developing components is a lot more expensive, right? There's a lot more thought on what are the APIs, like it's a separate project. You know, if you need to open source this, there's a cost of maintaining this open source community.
Starting point is 00:10:56 So if you're developing a single engine, it's a lot more efficient for you to just write this as a small monolith because you have full control of that. It's a lot easier to evolve. It's a lot easier to control the direction of the different features and the architecture is going. But actually thinking through what are the right APIs, working as a community, identifying how this should work on other entities, it's a lot more expensive. So I think that's why historically most of the companies, they just, you know, they focus on developing the system, focus on the particular workload
Starting point is 00:11:29 they have in mind. I think where this breaks is if you're a company who actually needs to maintain too many of those systems, then you start, you know, economically starts to make sense. Okay, let's actually see
Starting point is 00:11:41 what we can share between those things. And I think this, in addition to open source, they can get into a point where a lot of those components are already available and already pretty high quality. I think that's why we're getting to this inflection point where people are actually rethinking their strategy as a kind of proprietary monolithic software
Starting point is 00:11:58 and thinking a little more about composability and open source and open standard. Now, that makes a lot of sense, but a quick question here and I want your take on that first, Pedro, and then I want to ask the rest of the folks here because I want the perspective from both someone who works in a hyperscale company like Meta, but also trying to figure out how that reflects to the rest of the world out there because not everyone's meta, right? So you talk about this inflation point. At some point, you need this modularity.
Starting point is 00:12:35 It emerges as a need. Can you tell us a little bit more of how this was experienced by you? Because I'm pretty sure there was some kind of evolution. There was Hive, then we started having the rest of the system, then we reached the point where you even had to take out the execution engine itself and make it a module on its own with Velux. So it was a little bit more of like how these happen, like inside a company like Meta. Yeah, sure. I think a lot of that just comes because data workloads are always evolving, right? So first you want to execute large map-produced jobs, then you want to execute
Starting point is 00:13:16 SQL queries, then there's stream processing, then there's log analytics, then there's transactional. So I think there's a lot of different types of data workloads. And I think the fact is just that we cannot build a single engine to support all of them. So this kind of drives what we call specialization. So what we end up doing is you develop a single engine to support each kind of slice of this workload.
Starting point is 00:13:38 So we have one engine that supports really large ETL SQL-like queries. You have another one for interactive dashboards. You have a series of engines for kind of transactional workloads. You have another one for interactive dashboards. You have a series of engines for kind of transactional workloads. You have a stream processing engine. You have now like a training engine that can feed PyTorch and keep the GPUs busy.
Starting point is 00:13:56 So it's just because we have so many data workloads, it kind of drove this requirement of specialization. I think the problem is that those things were done a lot more organically than intentionally, right? So it's just, well, there's a new workload evolving. People just go create a team and they start kind of developing a new engine from scratch, right? And then you get to a point where we have 20 of those. And then if you really look closely at them, like they are not the same, but a lot of the components are very similar.
Starting point is 00:14:23 So I think specifically, I think to your question around execution, if you look at the execution of all those engines, they're very similar. Not just, of course, looking at different analytic engines, but even if you look at something like a stream processing engine, it's not exactly the same, but the way you define functions and you execute expressions and you do joins, all of that is very similar. So that's how we started this idea of, let's first look at the execution and see what are the common parts
Starting point is 00:14:49 and what we can factor out as a library. And then just integrate and reuse within those libraries. So this is how we created Valux, which is something we're integrating across execution. And what we saw is that the more we talk to other companies and we talk to the community, the more people are really interested on that because developing those things is a very expensive project. It costs you, I don't know, hundreds of engineers and it takes
Starting point is 00:15:15 you 10 years. So if it's something that can actually join forces of much larger community or just reuse an open source project that already does all those things in a very efficient manner, that's just, you know, it saves you a lot of effort. But this is how we kind of got into this idea for execution, but there are similar projects targeted to other parts of the stack as well. Okay. That makes sense. Chris, I think you have something to add here.
Starting point is 00:15:37 Yeah. I, so I'm more or less agree with what Pedro was saying. I think that the key word there was sort of the organic aspect of this. And I think Ryan called this out as well as looking back to the early days with HDFS and stuff. I think the big evolution of which S3 is just a continuation is the separation of storage and compute. And I think Pedro is focused much more on the query engine aspect of it. I think that probably is a symptom of being at Meta, which is a very large company. But the alternative sort of, I don't know, storyline that I think people go through is they get their data on S3 and they're like, oh, okay, I need to query it.
Starting point is 00:16:19 And then like Ryan said, well, there's no ACLs. And so now you need some kind of ACL thing on top of the query engine. And then, oh, you need a data catalog or some form of information schema. And so very organically, you start building out these components. But because it's kind of piecemeal, because initially you just wanted to query your logs, right? And then you start getting streaming data or OLTP data in there. You start adding stuff over time. And so I think that has been more my journey,
Starting point is 00:16:47 is more one of not so much going horizontally across a bunch of different query engines, or maybe I'm not sure horizontally, vertically, but not so much going across a bunch of different query engines, but starting to add more and more features that a normal database would have, especially being in fintech most recently. It's like, you know, you take security very seriously and data
Starting point is 00:17:06 discovery is a whole thing there. So I think that's another point of view on how this stuff evolved. Yeah, that makes sense, Chris. I want to ask you, because you have also experienced, let's say, not only the part of the data where it's at rest and they are getting processed, but also in the capturing phase of data, the delivery of the data. Do you see this concept of composability that we are talking about, which let's say comes a little bit more from
Starting point is 00:17:35 the data warehousing or the OLAP systems, but do you see this kind of concept being part also of these systems in front, systems like Kafka or even OLTP? What's your take on that? Yeah, absolutely. I mean, taking streaming, for example,
Starting point is 00:17:53 when I was working on Samsung, and I think Link handles this as well now, there's streaming and the sort of near-line ecosystem is pretty adjacent to batch. And so many of the streaming query engines can also do batch processing on top of HTFS or S3. It's a very concrete example. Going even farther upstream,
Starting point is 00:18:12 there are LTP databases that are experimenting with sort of bridging the gap as well, whether that's something like Materialize, so pseudo OLAP, or something like Neon that has a tiered storage layer that includes S3, or it's made persistent, not as write-ahead log. So definitely, you can kind of look in any direction and see disaggregation and components being overlapped or shared.
Starting point is 00:18:40 Yeah, makes sense. Ryan, I think you wanted to add something. Oh, yeah. Yeah, makes sense. Ryan, I think you want to add something? and we needed them to work on the same data sets. And we needed to have this sort of architecture that all plays nicely with streaming and batch and ad hoc queries and, you know, anything from a Python script to Snowflake or Redshift. Whereas I think Pedro's perspective is kind of fun because he's coming at it from like,
Starting point is 00:19:22 how do we build engines and share components like the optimizer, which is a lot of fun because he's coming at it from like, how do we build engines and share components like the optimizer, which is a lot of fun as well. And, you know, Wes as well, where, you know, can we, can we have really high, high bandwidth transfer of data between those components within the engine itself? So I think there are like two separate ends of this conversation.
Starting point is 00:19:44 Yeah. Yeah, we have more of the user side of things and the builder side of things. Wes, I think you also wanted to add something on the previous thing. Yeah, I think it's interesting. I think the way that we arrived
Starting point is 00:20:00 even at this concept of composable data stack or composable data systems, you know, was a little bit, you know, it was a little bit organic. So when I got involved with what became Apache Arrow, I was needing to define basically an ad hoc table format or an in-memory data representation for data frames or tables so that I could hook pandas up to systems in the Hadoop ecosystem. And there were many other systems that had defined in-memory tabular columnar formats, either for transferring data, for example, between Apache Hive and clients of Apache Hive, or many database systems had built in- columnar formats that were essentially implementation details of their execution engine. And they had no interest in exposing those memory formats to the outside world.
Starting point is 00:20:54 And so as I was finding myself basically, you know, starting to create an ad hoc solution to the problem that I had, which was connecting Python to these other systems, it was only at that moment that we, you know, there was a, you know, a collective realization that like, we should try to create some piece of technology that could be used in a variety of different circumstances for solving that problem rather than creating yet another ad hoc solution that's incompatible with every other solution to that problem.
Starting point is 00:21:26 And so I think as time has gone on, people find themselves reinventing the same wheels. And then finally, if you have the bandwidth or the motivation to build a software project or an open source project or an internal corporate project that is more reusable or more composable and you have the experience to do it the right way, then I think that's what's caused this to happen now as opposed to, you know, 10 or 15 years ago when the open source ecosystem
Starting point is 00:21:56 was comparatively a lot more nascent, emerging, whereas it's a lot more mature and mainstream now. Yeah, that makes a lot of sense. Pedro, you want to add something, so please. Yeah, no, I think just quickly addressing Ryan's point. I think it makes sense for us when we started looking at this space was a lot more from a practical perspective, right? How can we be more efficient as an organization?
Starting point is 00:22:21 How can we... And a little more from a software development perspective. But as we made progress, we tried to get a little more from a software development perspective, but as we made progress, we tried to get a little more scientific with that as well, right? So essentially, if you stop, if you remove how the engines are developed today
Starting point is 00:22:35 and the standards and components we have and just think about what are the different layers, like what are the common layers between every data management system? So we kind of define an architecture saying that every single data management system has a language layer, which is essentially take something from a user.
Starting point is 00:22:54 Sometimes it's a SQL statement. Sometimes it's PySpark or Panda or something non-SQL, but you take this. So there's another component that is just how you represent the computation, right? So you take the user input and you create an IR, which is, I think Substrate was one project targeting kind of standardizing the IR. But if you look at every single system from analytics to transactional to data ingestion to machine learning, anything, you have a language, it translates to IR. There's a series of transformations that you do on this IR, both for metadata, resolving views, app calls, security, all sorts of things. And at some point, you get you an IR
Starting point is 00:23:30 that is ready for execution. It goes through an optimizer. So every single engine has, or sometimes the optimizer just doesn't do anything. But there's a system that takes this IR and generates an IR ready for execution. There's some code or some component that can actually execute this IR given a an IR ready for execution. There's some code or some component that can actually execute
Starting point is 00:23:45 this IR given a host and given resources, which is a little more what we're targeting with VALUX. And then you can, of course, there's a lot more details. There's the environment where you run those things, which is the runtime, and then it goes from MapReduce, Spark, to, I don't know, now I heard that Redshift
Starting point is 00:24:02 can run on several architectures, but it's just this environment that we call the runtime. So we kind of define this and we say that if you look at every single data management system today, they all compose of those layers. And of course, those layers are not completely aligned between them and they don't use open standards. So there's a discussion of what exactly the project's addressing each one of those components and what are the right APIs. But if you look at all those, so all those engines, they kind of follow this mode. So this is sort of the mental model we have internally. Like I said, Velux addresses the execution part, but you also have some other efforts on the language part, on the IR, even on the
Starting point is 00:24:39 kind of common optimizer it makes sense ryan yeah and i'll i i also want to ask you like based on this like mental model that's better described where iceberg feeds right but you also want to add something well i was gonna gonna initially say that our our experience creating iceberg was largely like wes's where we sort of backed into it by saying, how do we make people more efficient? How do we make these things work together without stomping on one another's results and things like that? And what we ended up with was kind of like Pedro's talking about, we said, Hey, what are the existing concepts for this space that we should reuse? You know, how do people expect it to work? And I actually really like
Starting point is 00:25:31 Pedro's, you know, breakdown of the different layers, right? The language layer, the IR, the optimization layer, the execution layer. And then the, I guess, environment, environmental sort of layer, I forget what you called that one. And then underneath that, I think storage, and that's where Iceberg fits in, which is weird and orthogonal. And another thing that interests me here is what is moving between layers. So, you know, security is traditionally done at that very top point where you understand what the user is trying to do, and then you have enough information to say whether or not they can do it. But if you have multiple different systems, right, if you're talking like, hey, streaming or maybe a process, or some SQL warehouse or other system, you need all of those things to have coherent and
Starting point is 00:26:32 similar policy, you actually have to move that down, right? You have to move it beneath all of those very different engines, and actually into the storage layer. So composability is really causing a lot of change and friction in the ecosystem right now. So would you say that, let's say, access controls is another component of the stack we are talking about? I would probably add access controls. I think someone mentioned views, but like, you know, reusable IR or a view type concept is definitely there.
Starting point is 00:27:13 I would also say the catalog as another sort of reusable component of the storage layer and how we talk to catalogs. Here again, everything has been talking like the Hive Thrift protocol for so long that we really need something to replace it.
Starting point is 00:27:34 Like the iceberg is coming out with a REST protocol to try and do that. So there are a lot of fairly niche components even within that storage layer. Chris, I think you want to add something. I wanted to add another one on top of the stuff that Ryan's been talking about. This is something I spent a lot of time thinking about, and that's data model.
Starting point is 00:27:57 So, and really, you know, data description. One thing I didn't mention in the introduction is I have an open source project I've been hacking around for a year that is trying to unify kind of nearline, offline and online data models. So this is sort of thinking about, you know, how do you represent an integer? How do you represent a date? And that's something that flows up and down all the way down to the Parquet layer. Parquet is kind of punted on what the data model should be. It's very simplistic and there's logical stuff. And then on top of that, you start compounding, you know, things all the way up to
Starting point is 00:28:30 the language layer. So it kind of runs the gamut. And it's something that I don't think we fully nailed. You know, it'd be nice to just say, oh, we're going to use Postgres or we're going to just use Hive or just going to use, you know, DuckDB's format or whatever. But inevitably, what we seem to end up with is like a lot of coercion and sort of munging. I think Arrow has wrestled with this a lot. You know, I was talking beforehand before we started recording about their schema flat buffer,
Starting point is 00:28:53 which is a really good reference if you want to look at like an attempt to try and model what data looks like across a bunch of different systems. It's super non-trivial. So that's another one I'd like to throw in there that I would love to see more progress on. I'll stop there.
Starting point is 00:29:08 Yeah. Pedro, you also want to add something? Yeah, no, I think just I think to the point that Chris and Ryan just raised, I think our current at least model is that language and execution should be decoupled and they should communicate
Starting point is 00:29:23 via one API. This API would probably be the IR, but that essentially means that anything related to the data model, like SQL, MDX, non-SQL, like anything, all of that should be resolved on the language layer in addition to kind of metadata operations, security, resolving app call, checking if users can access particular columns. All of that should be encapsulated in a way
Starting point is 00:29:47 orthogonal to execution, and they should communicate via an IR. So even things like if you want to express graph computation, then this IR should have nodes that express graph execution primitive. And all of that, so in a way, all those things should be decoupled from execution
Starting point is 00:30:03 and execution should only take this IR as input. And of course, there's, I think, security details of what you need to carry a security token to make sure that it can actually pull this data from storage.
Starting point is 00:30:13 But like all the logic of checking if people have access to columns, like resolving ACLs, doing privacy checks, like all that stuff should be decoupled from the IR. It doesn't mean that it necessarily should be part of the SQL parser library.
Starting point is 00:30:28 It could be something that you have many language libraries that generate an IR, and then there's some processing that happens on this IR so that those echo checks, privacy checks can actually be decoupled, can be orthogonal from whether users are expressing things using SQL or non-SQL. But we see all of that as being, in a way, orthogonal to execution. So execution should just mean, OK, this is the computation I need to execute. It's already checked. It's already saved. Let me actually go and execute it.
Starting point is 00:30:56 Chris, please go on. I think you want to make a comment here. Well, I actually just wanted to, maybe Pedro would be the best person, but can you define what you mean by IR? I think that's something that maybe not everybody intuitively knows is they've not been knee-deep in databases for a long time. Yeah, no, I think that's a good point. I think
Starting point is 00:31:15 IR is a term that we sort of borrowed from compilers, but it's essentially this idea of having some intermediate data structure that can represent your computations in a way that you can execute that without ambiguity. So essentially that in most query engines, that means the query, the physical query plan. But I just call this IR because that's kind of the term used in compilers for the same idea of kind of decoupling front end and back end.
Starting point is 00:31:42 Ryan. So I completely agree. I think IR and substrate or similar projects is one area that I'm most excited about because it is super useful, right? Being able to exchange query plans basically gives you views. Being able to pass off something from any language, whether that is SQL, or
Starting point is 00:32:08 hopefully something eventually better. You know, like, that is all really cool. But I think one aspect that I want to bring in here is that it's probably not enough. There's always going to be that guy who doesn't want to use a data frame API in Python to do his processing. He wants to jump into Python code. And like people have attempted this with like taking Java bytecode and translating it into SQL operators. And it's a gigantic mess. So like, you have to have either some willingness to use a language that produces IR, or the rest of the components in the stack actually need to support everything with even stronger like orthogonality. So like, I think that when, when it comes to building at least a storage layer, the storage layer doesn't actually get to assume that you're going to use IR and an optimizer or any particular
Starting point is 00:33:15 execution, right? We need to be able to secure the data, no matter what you're using. We need to, you know, be able to give you that data no matter what you're using and have like very well-defined protocols and standards at that level, at least. Ryan, one question though here, and then I'll give the microphone like to Wes because I think he can also like add a lot to that. You mentioned, I mean, I understand what you're saying about like storage and storage data hub is decoupled from that, but there is one thing and that connects with what like Chris was also talking at some point about like the type systems and like the data types and
Starting point is 00:33:56 the model itself, which at the end goes down to storage too, right? Like these data somehow needs to be represented and and it has to be, like, able to serialize, deserialize, whatever types you have there, and this is something that, like, goes through, let's say, all the different components. So, how do you deal with that? I mean, because
Starting point is 00:34:17 what I hear so far, and sorry for that, is that, oh, store ads, like, well, we can't stay away from that. Pedro says, oh, these things will be resolved by the front-end parts, like the parser and whoever generates the IR. It's like a hot potato that you throw to someone else.
Starting point is 00:34:39 And at some point, we have to deal with it. So there was a reason that this whole thing was a monolith, right? And I think that's what we are starting to surface here. These APIs at the end, communicating with the openness that we want to have to is not that easy. Chris's point, right? Which is if Iceberg and Arrow and our IR all used similar type systems, then we would be a whole lot better off. I do not doubt that. If Wes and I had agreed 10 years ago on the set of types that we would support, it would be a whole lot easier. And that's why Substrate, when they started that project, they took a look at all the type systems out there and said,
Starting point is 00:35:36 we're only going to allow types in if it's supported in like two or more large open source projects. I think one of them was Arrow, one of them was Iceberg, you know, I think Spark and some others, you know, so that is definitely a problem where we could use a a more coherent standard but let me also explain and argue for you know fracturing here it is the way it is because there's a huge trade-off. And dealing with that trade-off is why we have so many different approaches. I think Arrow takes the side of the trade-off to be more expressive and say, hey, if you want to use four, two,
Starting point is 00:36:18 or eight bytes for that type, you can go ahead and do that. Whereas on the Iceberg side, we're trying to keep the spec very small to make it easy for people to implement. And there's just a fundamental trade-off there, and you've got to strike the right balance. Anyway, sorry.
Starting point is 00:36:39 I've talked for a long time. Wes, you wanted to add something? Yeah. I mean, on this kind of discussion of IRs and the relationship between the front end of a data system and the back end, I mean, I think one of the earliest
Starting point is 00:36:57 and probably most successful systems in the composable data stack is Apache Calcite, which is created by Julian Hyde. The idea was it's a database front-end as a Java library, so it
Starting point is 00:37:14 does SQL parsing, query optimization, query planning, and it can emit an optimized query plane on the other end, which could be used for physical execution or a logical query plan, which can be turned into a physical plan and then executed.
Starting point is 00:37:34 But I think that Calcite really was really important in terms of socializing this idea of a modular front end that could take responsibility for those parts of building a database system or a data warehouse that are you know traditionally something that that system developers would want to have a lot of control over i think substrate is interesting because it's it's provided for like standardizing on what's that thing that your parser optimizer query planner emits. And that's
Starting point is 00:38:10 something that historically was not standardized. And so when people would want to use Calcite, they would implement Calcite, but they'd have a bunch of Java code using Calcite and then they'd have a bridge between Calcite and their system to go from the logical query plan into execution.
Starting point is 00:38:29 And so it's definitely been a journey to get to where we are now. And obviously, down in the weeds, we have the issue of the data types and trying to agree on all the data types that we're going to support in all these different layers, which creates a lot of complexity. Yeah, Before I give
Starting point is 00:38:47 the microphone to Pedro because he would like to add something here, I do have to ask the two of you, Wes and Ryan, why you couldn't agree on the types 10 years ago? 10 years ago, we were still screwing up timestamp.
Starting point is 00:39:10 I think it's still- That is correct. Yeah. I think the answer to that is because there was a lack of composability, which means that people were implementing the same thing over and over in slightly different ways. That makes sense. Pedro, you wanted to add something. Please, go ahead. Yeah, no, just adding on the conversation about data types.
Starting point is 00:39:30 I think that is complicated because we're talking about different things. There's many different levels of data types. I think on this discussion, there are at least three, right? There's the storage data types, which are the things that we, the storage and the file format actually understand, which are usually things like integers, floats, strings, and, but then like really kind of primitive data types. Then there are kind of logical data types that the execution can understand, which are things like maybe timestamps, like sometimes the storage also understand timestamp, but
Starting point is 00:40:00 there's, you know, logical data types that the execution can understand, but there are also maybe user defined data types, which are kind of higher level types that users can define. I think some examples are things like sometimes when people are defining IDs, they don't want them to be just integers. They want actually like a higher level data type that just maps to an integer, but adds some level of semantic, right? So I think there's different levels and those different levels have different trade-offs and some of them are easier to extend. Some of them are more efficient than others. But I think that's why we think that this model of defining things that should be resolved in the language, like things resolving user IDs into integers, those are things that should
Starting point is 00:40:37 be resolved in the language. It should be transparent to the execution type that the execution needs to understand so we can efficiently process those things. Things like, for example, defining functions that are based on those types. That only works if the execution understands those types. And then there are also types that need to be understood by the storage, which is a lot more around kind of storage efficiency, size, and related to encoding. So they're kind of different levels.
Starting point is 00:41:00 So depending on which types exactly we're talking about, but anything related to more logical types, data model, again, I could say that all those things should be resolved on the language layer and then just capture in the IR. Okay, that makes sense. One last question here about types, and I'll shut up about types. I promise.
Starting point is 00:41:22 And I want to ask Chris, because Chris can make this connection with the part we are talking about, which is the database systems, but there's also applications out there. There are application developers. There are people out there who generate the data that we store and then we process.
Starting point is 00:41:42 And in many cases, these people don't necessarily have or need to understand what's going on with the data processing infrastructure that we have, but they are feeding us with the data. So Chris, when it comes to the type system that we are talking about or the formats that we are using to represent the information and move it around, from your perspective, is there something else that has to be added to solve this problem end-to-end? Something else to be added. I think in my mind and sort of my intention with ReCap was to have something a substrate or an IR, but more specific to the metadata layer, which is a way to describe in the abstract, the data that is flowing across the online, nearline and offline world
Starting point is 00:42:42 that would account for a large amount of the coercion that we see now. I think there's always going to be a little bit of coercion because to not have type coercion, you essentially need something that looks a lot more like Q, which is this sort of academic,
Starting point is 00:42:57 very academic looking project that is essentially not usable for the average engineer. It's just too complicated. And so to Ryan's point around the complexity around all this stuff, you can't make it too complicated for these application developers to use.
Starting point is 00:43:11 And then as soon as you try and make it a little more simple, you end up with some form of coercion. But I think the thing that I would like is it's some common way to describe this, the data sets across the different stacks and layers. The closest I've seen to this instantiation, aside from Recap, is actually what Arrow does.
Starting point is 00:43:30 They essentially have two different layers. One is the schema flat buffer layer that describes, the way Pedro was talking about it, a very specific, like, here are the bytes. You can have a float, and the float can be 16, 32, 64, 128, right? But most developers don't want to say, like, float 16, float 32, or whatever. So what? But most developers don't want to say like float 16, float 32 or whatever.
Starting point is 00:43:48 So what Arrow ends up having on top of that is like an actual implementation that gives you decimal 128 as an actual type. And so there's sort of two tiers to it. But, you know, for better or for worse, that schema stuff is mostly wrapped up and used by Arrow. And I would like to take that out of Arrow and use it across all these systems so that you can sort of portably
Starting point is 00:44:07 move around the data description from one vertical to the next. That's sort of the area that I would like to see improved. Ryan, you want to add something here? So Chris's description here just triggered
Starting point is 00:44:23 something in my head, which is I have a little soapbox about losing structure of data and constantly coercing types. And I really think that one of the promising factors or, you know, promising aspects of composable data systems is like, stop losing structure, you know, share tables instead of sharing CSV, right? It's always pretty ridiculous that I'm dropping CSV or JSON, right? I'm destroying the structure of my data set in order to push it over to you. So like, I think that it will actually get better hopefully when we have you know more ability to share the actual representation and do so securely and some of these other things mature but i also entirely agree that like we need to get to the point where we you know have some idea of a format that can actually do that
Starting point is 00:45:25 exchange as well. Okay. Enough with types. Eric, I've monopolized the conversation. Oh, it's been so good. Yeah, I think one thing that's just hearing the conversation
Starting point is 00:45:41 has been so fun because I've heard multiple times, oh man, we've come so far. And then also like, well, yeah, that's just hearing the conversation has been so fun because i've heard multiple times oh man we've come so far and then also like well yeah that's you know that is a problem and i think the common thread through all of that seems to be this desire for open standards and there are different areas of the composable stack you know where that seems to be a big need. And so I'd just love to hear where have we come in the last several years as far as open standards? And then what are sort of the frontiers that are most important? Wes, maybe we can start with you.
Starting point is 00:46:18 Just give us a brief history sort of of your view of where we've come. Yeah. I think there's... I mean, I think, you know, the main things that came out of the Hadoop ecosystem was open standards for file formats.
Starting point is 00:46:35 So basically the foundations of what we now call open data lakes. So we ended up with, of course, multiple competing standards, so Parquet files and ORC files and some other open standards like
Starting point is 00:46:54 Avro for developing RPC client-server protocols, things like Thrift and Protobuf, which have been widely adopted for building client-server protocols, things like Thrift and Protobuf, which have been widely adopted for building client-server interfaces. I think in the last 10 years, moving up the stack into
Starting point is 00:47:15 in-memory data transfer, at in-process memory transfer or inter-process memory transfer with Arrow. That was a hard one, hard one battle, but it's great. It's been great to see that, you know, achieve wide, you know, wide adoption.
Starting point is 00:47:38 I think interoperable or like open standards for computing and like more of the computing layer is more of like an emerging thing that's becoming like that's starting to happen that historically there really wasn't very much of and so i think we've gone from an era of like these limited open standards for data interchange data storage to starting to think about more you know more of the runtime you know what happens inside of processes rather than just how we store data at rest or move data on the wire. Makes total sense. Any other thoughts from the rest of the panel? Yeah, I think maybe just adding to what Wes said. I see that, again, my mental model is that there are two things.
Starting point is 00:48:22 One is defining what are the APIs and standards. And the second thing is actually having implementation for those things. And those are, they don't really go hand to hand. Sometimes there's no standard and sometimes there is a standard, but there are multiple implementations and they're not compatible or the opposite might also be true. But I think maybe to your question, to what are the open standards and APIs, like if we follow the model I presented and go around the stack, like if you go,
Starting point is 00:48:51 start from the storage layer, usually the storage layer is just some sort of block API, right? So this is, you know, already pretty well understood. You usually have some notion of file, handle, offset, and size. So this is just how you pull blocks.
Starting point is 00:49:04 And then there's this idea of how do you interpret what those blobs of data mean, right? And then I think like, well, like Wes mentioned, I think like Parquet, ORC, Avro, those are kind of well understood, even though the implementation is all over the place. They can have many Parquet readers and writers implementation. They're not necessarily compatible, but there is a standard. So you start from decoding those things. Then there's how do you represent those things in memory, which is, I think, what Apache
Starting point is 00:49:31 Arrow defines really well. So if you need to represent this columnar data set in memory, how do you lay out this thing in memory? So I think Arrow solves that problem. Then there's the question of if you need to process this data and apply functions, apply, you know, different operators, what is the relational semantic you follow? So I think there's another discussion. So if you look at Spark, Spark has a certain semantic, which kind of loosely follows NC
Starting point is 00:49:56 SQL. Presto has a different semantic. MySQL, Postgres, and you name it, there are probably like 50 different semantics you can find, and none of them are compatible with each other. They all sort of look the same. They have similar functions, but they're never compatible. So there's this idea of like, what is the standard for the semantic that your operations and your functions need to follow?
Starting point is 00:50:17 Then if you go up another layer, there's a discussion on how you represent this computation. So how do you know that, well, you need to scan this data, then you need to sort it, then you need to apply filter and then you need to join and you need to shuffle this. So this is what Substrate was supposed to do, or it's meant to do, just essentially having an open standard for representing this computation. Then if you go up, there is a discussion on what are the APIs to how users represent this computation, right? Which is how well we have SQL, which is probably the worst standard of all of those.
Starting point is 00:50:48 It's very loosely defined. There are just so many different implementations. They're never compatible between each other. There's also discussion of non-SQL API. So you have Pandas, you have PySpark. They're all non-SQL, but still there's no standard. So I'd say that this is probably the highest level on top of that. There might be even higher levels of how developers interact with those things.
Starting point is 00:51:10 Like maybe you can have ORM or different sorts of abstraction that actually map into non-SQL APIs or map into IRs. So that might be some other APIs on top of that, but I don't think there's any very industry standards on those. So at least that's my mental model if you go across the stack. Some of them, for some of those areas, there exists some standards, but they are not as strongly defined as we would like them to be. Yeah. Would you say, I mean, it kind of sounds like roughly not perfectly, but, you know, sort of from bottom to top, as you described it, is sort of the sliding scale of maturity, right?
Starting point is 00:51:46 Like the stuff at the top, you know, sort of least representative of open standards. Would you say that's sort of generally true? What do you think about that? I think not necessarily. I think it depends on which projects came first. What is the, I don't know, how much people actually adopt them in practice. I'm not sure if there's a correlation to kind of how deep or how high up on the high arc. And sorry, there's one, I think, very obvious layer that I mentioned that I forgot to mention,
Starting point is 00:52:15 which is a table API, right? Which is somewhere between the, I think, the storage layer process. I think that's a big one, which is, well, we have Iceberg, but we also have Hudi. We also have Delta Lake. We have Meta Lake inside Meta. So there's another example of, you know, there is an open standard, but it's not, you know, 100% adopted everywhere, which is also, I mean, good, but not great. Sure. Chris?
Starting point is 00:52:36 Yeah, I was just going to answer, I think, Eric, your question on sort of where things are the most mature. And I really think it has a lot to do with who wins a given space, right? And so I think if you look at what drives a lot of things, it's like the Arrow API or data frames or whatever. And so everything is having to integrate with that, right? And so I think as there is, you know, theoretically one winner, we can all dream, one winner in the storage layer, then that will sort of solidify what that protocol is going to look like.
Starting point is 00:53:08 But the more people you have competing, the more chaotic it is and the harder it is. Yeah. I think what really drives a lot of the APIs actually is sort of organic through whoever wins gets to decide what the API looks like. So in that regard, I don't think it's bottom up the way you described. I think it's kind of decide what the API looks like. So in that regard, I don't think it's bottom-up the way you described. I think it's kind of middle-out. I think the API layer is really enforcing people to fit into that a lot more than bottom-up. Yep, makes total sense.
Starting point is 00:53:40 One thing I'd love to discuss is we kind of already got into data types, you know, which is certainly, you know, a trade-off, I would say. When you think about composability as compared with, you know, sort of a singular monolithic system, what are some of the other trade-offs? Like, how would you define some of the other trade-offs of the composable system? And Pedro, maybe we can start with you. I think that's interesting, right? I think some of the discussions we usually have with people working on this space is, I think what we mentioned before, like in a lot of cases, it's harder to build a composable system than it is to build a monolith, right?
Starting point is 00:54:20 So sometimes if you're just optimizing for speed and you just want to build a prototype and have an MVP customer running on top of it as fast as you can, it's easier to just prototype and just create a new monolithic system. Where we think that this fails is that it's usually easy to create a first version. So you can run a very simple query that supports this workload. But then a few months from now, they need to support a new operator and they need a new function. And then as this thing grows, I think it kind of slows down. And then I think in the long run, it just, it doesn't pay off. But it's a lot harder when you're starting something like, should we actually spend a few months trying to understand, I don't know,
Starting point is 00:55:00 engaging with the Arrow community, understanding how Arrow works, or understanding how Velux works. And if you need to make any changes, you need to engage with the community. And it's a lot easier in a lot of ways to just kind of fork all those things and kind of make sure you can move a lot faster. So I think this is one of the obvious trade-offs. There's also this kind of bias developers have
Starting point is 00:55:18 that it's something we elaborate on the paper as well, that we like to think that we can do things better, right? So I'm not going to use Arrow. Like if I need to create this, I'm probably going to write this better. So, so we see this kind of pattern with users, especially like with well, experienced engineers over and over. People don't want to reuse something because they think they can just do it, do it better and in some cases they can, but in a lot of cases it's also not true.
Starting point is 00:55:40 There's also this part of just people prefer to write their own code than to understand other people's code. So instead of spending a month understanding Valox, I was just going to go and create something that I fully understand in a few weeks. And then six months later, when you leave the team, then the next engineer has the same problem. I think there's a lot of those kind of fallacies that we hear over and over. Some of them are kind of fair, like this time to market. I do feel like it's true, but you usually end up paying the price on the long run. Again, like I mentioned, we elaborate some of that on the paper as kind of the reasons why composability hasn't happened before. It's just because there's a lot of kind of those internal biases that engineers have, and some of them are kind of driven
Starting point is 00:56:23 by business needs as well. Can I... Sorry, Eric. I want to ask something better here. So, because you mentioned about, like, why composability didn't happen earlier. And actually, it is, like, a question that I have about data management systems.
Starting point is 00:56:38 Because data management systems are, like, very complex, but they are not the only complex systems we have out there. We have operating systems, and composability in operating systems has been a thing for a very long time. Same thing also in a little bit of a different way, but also with compilers. We see the difference between having the front-end
Starting point is 00:56:59 and LLVM and other systems on the back-end and all these things. But why, in database systems, did it take us this long to get to the point of appreciating and actually implementing this composability? And this is a question to all of you, obviously, because all of you have very extensive experience with that. So please.
Starting point is 00:57:20 I think that part of this comes down to, you know, commercial interests, which I think is a big part of the data industry, right? At least where we sit at the storage layer. Storage provides opportunities that the execution layer can take advantage of for better performance. And if you can control the opportunities and you control the execution layer, you can make something that just fits together really nicely and has excellent performance. And at least so far in the storage world, it has not been a thing to get your storage from another vendor. Like, that is, you know, really weird thing that is happening now, that the Databricks and Snowflake, or, you know, choose your other vendors, Redshift
Starting point is 00:58:17 can share the same data sets. And it comes down to who controls that data set, who controls the opportunities that are presented to the other execution layers. Like the world gets really weird in this case. And I think that, you know, part of it is just how we've historically architected these systems. Right. I think of the Hadoop sort of experiment as this Cambrian explosion of, you know, data projects that questioned the orthodoxy of how we were building data warehouses. And that led to, you know, pretty primitive separation of compute and storage that we then, you know, matured and eventually got to this point where, yeah, you can, using projects like Iceberg, you can safely share storage underneath these
Starting point is 00:59:12 execution engines. And that's what is really pretty weird right now. But all throughout our history, we have not been able to actually share storage, share it reliably, share it with high performance, and things like that. So I think that, you know, the business model of all those companies that have never had to share storage and are built around like, hey, we sit on all your data and, you know, lease it back to you for compute dollars. That has been a very powerful driver in the other direction. That makes sense.
Starting point is 00:59:52 Pedro, you want to add something? Yeah, I think maybe adding to what Ryan said and addressing your question of why we think composability is more important for data systems, but why isn't it such a big thing for compilers and operating systems, for example? I see maybe a lot of that just driven by variety. So how many widely used C++ compilers can you name and how many operating systems can you name and how many data systems can you name? So there's a lot more, I would say even wasted engineering effort in redesigning and redeveloping those things for data systems than they are in operating systems and compilers. I think a lot of that is just because the APIs of those systems are actually user-facing, right?
Starting point is 01:00:35 So users interact directly with databases. So user needs and user requirements are evolving a lot faster than requirements for operating systems and compilers. So I feel like the APIs of those systems are a lot more stable and they don't evolve as fast. So I think those systems are also a lot more mature, right? So there may be fewer incentives to make them composable, because there are only a handful of implementations of those. But data systems are a place where you literally have hundreds, probably thousands of different engines that all have some degree of repetition.
Starting point is 01:01:09 I think there's a lot more incentive than, okay, let's actually see what are the right libraries that we can use to accelerate those engines, especially because workloads have been evolving and they're going to continue evolving in the future, right? So the workloads we had 10 years ago are very different from the ones we have today, and they're probably going to be very different from the workloads we're going to have five years from now. So not just making the engines we have more efficient, but we need to be more efficient as a community on how do we adapt data systems as user workloads evolve. Eric, back to you. Sorry for interrupting again. Oh, yeah. Well, actually, we're fairly close to the buzzer here.
Starting point is 01:01:47 We got a couple more minutes. And so one thing that I would love to hear from each of you is what new projects you're most excited about. You know, we've talked a lot about sort of the history of open standards and projects that are, you know, that have pretty wide adoption. But I'd just love to know what excites you to sort of look at the newest stuff out there. So Ryan, why don't we start with you?
Starting point is 01:02:17 I'm pretty excited about IR and projects like Substrait. I'm also excited about some of the other more composable or newer APIs in this layer. Iceberg just added views. We are also standardizing catalog interaction through a REST protocol, really trying to make sure that everything that goes into that storage layer
Starting point is 01:02:45 has an open standard and spec around it. And I think that is going to really open up not just composable things, but towards the modular end of the spectrum where stuff just fits together nicely. You can take any database product off the shelf and say, hey, my data is over here. Talk to it. So I'm pretty excited about how easy it will be to build not just these systems, but actual data architectures based on, you know, these ideas. Very cool. Chris, you're next on my screen.
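To make the catalog protocol Ryan describes a bit more concrete, here is a minimal, hedged sketch of what loading a table through an Iceberg-style REST catalog can look like. The host, namespace, table name, and token are invented placeholders, and the exact endpoint shape and auth scheme depend on the catalog deployment; treat it as an illustration of the idea, not a definitive client.

```python
# Hedged sketch: loading a table through an Iceberg-style REST catalog.
# The URL, namespace, table name, and token below are invented placeholders.
import requests

CATALOG_URL = "http://localhost:8181"  # assumed: a REST catalog running locally

# The REST catalog spec exposes tables under
#   /v1/{prefix}/namespaces/{namespace}/tables/{table}
# (prefix omitted here for simplicity).
resp = requests.get(
    f"{CATALOG_URL}/v1/namespaces/analytics/tables/events",
    headers={"Authorization": "Bearer <token>"},  # auth depends on the deployment
)
resp.raise_for_status()
load_result = resp.json()

# The response carries what an engine needs to plan a scan:
# current metadata location, schema, partition spec, snapshots.
print(load_result.get("metadata-location"))
print(load_result["metadata"].get("current-snapshot-id"))
```

Because any engine can speak the same protocol, the catalog, rather than one vendor's engine, becomes the place where the storage layer is described.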
Starting point is 01:03:29 Yeah, I think I'm going to call out one that's sort of obscure. There's a fellow over at LinkedIn working on a project called Hop- that kind of creates this single view of streaming queries and also kind of data warehousing queries, and they've done some really interesting experiments to sort of plug it into Kubernetes and stuff, so that's something that I think is really fascinating. It essentially kind of uses Kubernetes as like a metadata layer. And then it has a JDBC implementation that will allow you to query, like basically do a join between Postgres and Materialize or between, you know, data that's in Iceberg and data that's in MySQL. And so it's really an experimental project. It's super interesting.
Starting point is 01:04:27 And the guy that's working on it, Ryan Dolan, is really, has a lot of interesting ideas. So that's the one I want to call out. Very cool. All right, Costas, you're next on the screen. Did you think I was going to ask you?
Starting point is 01:04:40 Oh, me? Yeah. Yeah. That's a good question. Actually, I'll probably mention something that is not so much in the context that we are talking about, but a bit more of an extended context. I'm very excited about virtualization technologies
Starting point is 01:05:01 like gVisor, for example, and Firecracker, and how these can change the way that we build data systems. One of the really, in my opinion, hard problems when you build solutions and products in that space is how you deliver, like how you build multi-tenancy and how you can deliver that in a way that you, as a vendor, can build margins and at the same time provide the isolation guarantees that people need out there for their data. So this interaction between virtualization and, on top of that,
Starting point is 01:05:40 like how you build systems, it's something that I find extremely interesting. There are some companies like Modal, for example, experimenting with and using gVisor. It's very interesting. Anything that takes the data systems and delivers them to the users
Starting point is 01:05:57 out there in ways that are, let's say, a little bit closer to the stuff we've seen with applications. It's super, super interesting, I think. And we see a little bit more of that in the OLTP space, like the Neon database and these vendors that are a little bit more ahead of the curve, let's say, compared to the OLAP systems. But I think there's a lot of opportunity also for the OLAP systems to exploit these technologies
Starting point is 01:06:31 and build some amazing new stuff. So that's what I would say. All right, Pedro. Yeah, I think there are a lot of open source projects that come to mind, but I think specifically in this composability area, I would say personally, and just making a quick plug here, it's Velox. I think that's the project closest to my heart. We're making really good progress. We're getting to a point where more than 20 different companies are engaging with us and helping develop it.
Starting point is 01:06:59 We have more than 200 developers. So we're making some really quick progress. It's integrated into Presto, integrated into Spark. We're seeing like 2 to 3x efficiency wins on those systems, which is huge. So I think this is the project that I'm most closely involved in, so it's super close to my heart. We're also working with hardware vendors to add support for hardware acceleration in kind of a transparent manner. So I do feel like it's going to become even more popular than it is today.
Starting point is 01:07:28 There's also a discussion on file formats, right? So I think, well, today Apache Parquet is probably the biggest one, but I think there's a consensus in the community that it's already getting closer to the end of its life. So there's a discussion of what's next, what comes after Parquet, what this format looks like, and how do we actually create this format and create a community around it. So I would say that we're probably going to see more projects
Starting point is 01:07:50 specifically in this area of file formats soon. I think going up the stack, Substrait was something that I was super interested in, like actually having the way of expressing computation across engines standardized, I think, was a super interesting proposition. Even though I think the actual adoption in existing systems has been a little slower. I think there's also a discussion that, well, from a business perspective, why would you invest in using Substrait instead of your own thing?
Starting point is 01:08:19 Like what is the value of that asset? I think with Velox, there was a clear value of actually making your system cheaper. So I think for Substrait, maybe there's still a discussion on how exactly we frame this and how exactly we increase adoption of the project among larger companies. But in general, I think it's another project that's super interesting to me. And Wes, bring us home. Yeah.
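For readers who haven't looked at an intermediate representation like Substrait, the idea Pedro describes can be sketched with a toy example: a frontend lowers a query into one engine-neutral plan, and any engine that understands that plan can execute it. This is a deliberately simplified stand-in, not the actual Substrait format, which is protobuf-based.

```python
# Toy stand-in for the "shared IR" idea. This is NOT the real Substrait format.
from dataclasses import dataclass

@dataclass
class Read:
    table: str
    columns: list

@dataclass
class Filter:
    input: object
    predicate: str

@dataclass
class Aggregate:
    input: object
    group_by: list
    measures: list

# A SQL frontend and a dataframe frontend could both lower to this same plan.
plan = Aggregate(
    input=Filter(Read("events", ["user_id", "amount"]), "amount > 100"),
    group_by=["user_id"],
    measures=["sum(amount)"],
)

def explain(node, depth=0):
    """Walk the plan the way any consuming engine would before executing it."""
    print("  " * depth + type(node).__name__)
    child = getattr(node, "input", None)
    if child is not None:
        explain(child, depth + 1)

explain(plan)  # Aggregate -> Filter -> Read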
Starting point is 01:08:45 Like Pedro, I'm really excited about the progress in modular, composable execution engines, Velox and DuckDB being two prime examples. Another one, DataFusion, is a Rust-based query engine in the
Starting point is 01:09:03 Arrow ecosystem. And so my company, Voltron Data, you know, we've basically been building a system, maybe inspired by the name of the company, to be able to take advantage of the best of breed in modular execution engines. And so I think we'll see more and more systems built to take advantage of this trend as time goes on, so that rather than building your own execution engine, you can just use the best available engines for your particular hardware, SIMD architecture, or whatever the characteristics of your data system. Another thing that I'm really excited about real quick is something we haven't really talked much about, which is the language
Starting point is 01:09:42 API, the language front end for interacting with these systems. I think there's kind of an awakening, and for a long time people have been awake to it, that SQL is kind of awful. And so there's a bit of a movement to define new query interfaces that do away with some of the awfulness of SQL, but also can hide the complexity
Starting point is 01:10:07 of supporting different dialects. So a couple of projects there, pretty interesting. Malloy, created at Google by the former Looker people. There's another project called PRQL. Yep. Perkle.
Starting point is 01:10:23 That's very cool. You know, I created a project called IBIS for Python, which is like a data frame API that generates many different SQL dialects, you know, under the hood. There's a pretty good-sized team working on IBIS now. Thomas Neumann and Viktor Leis from TU Munich have a new project called SaneQL in a paper at CIDR 2024 discussing the many ways in which SQL is awful and proposing a new way of writing relational queries. So I think, you know, since a lot of us are very focused on building viable systems for solving these problems,
Starting point is 01:11:05 I think to be able to move on to more like, hey, well, how do we make people even more productive? And that includes all the way down to the code that they're writing to interact with these systems. That user productivity is going to become an increasingly important focus area, I think, for the coming few years. Yeah, absolutely. Ryan had mentioned that earlier, and I was like, ooh, should I jump in and go down that rabbit hole? One question.
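Wes's point about language frontends is easy to see with Ibis: one dataframe expression can be compiled to several SQL dialects. A small, hedged sketch follows; the table and column names are invented, and the exact SQL text and the to_sql signature can vary between Ibis versions.

```python
# Sketch: one Ibis expression rendered as different SQL dialects.
# Table and column names are invented; output SQL varies by Ibis version.
import ibis

t = ibis.table({"user_id": "int64", "amount": "float64"}, name="events")

expr = (
    t.filter(t.amount > 100)
     .group_by("user_id")
     .aggregate(total=t.amount.sum())
)

# The same expression, different dialects under the hood.
print(ibis.to_sql(expr, dialect="duckdb"))
print(ibis.to_sql(expr, dialect="postgres"))
```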
Starting point is 01:11:33 You mentioned earlier call sites, but I haven't heard from anyone about anything that is happening when it comes to optimizing, which is a big part of these systems. At the end, that's the part that actually takes the computation,
Starting point is 01:11:51 optimizes it, and makes it efficient, and does all the magic there. Is there anything interesting happening in this space? Would you like to see anything happening there? I'll go. I think for me and our team, we have been deeply discussing this, basically using the same ideas of having more modular execution. Can we have more modular optimizer? I think the first reaction
Starting point is 01:12:13 that people have is that, well, that's unthinkable, that the optimizer is very specific to how the engine works. But if you actually stop thinking about this, there are ways to build this. So we actually,
Starting point is 01:12:23 we have someone on the team prototyping some of those ideas. I think where this stops is basically on prioritization. How much value does that provide and why people would invest on this right now? So I think that's where things
Starting point is 01:12:37 stand. I think from more academic and scientific perspective, it's super interesting. It's something that I would love to spend some more time, but I see maybe less business value in investing in this versus investing in things like common table formats,
Starting point is 01:12:53 faster execution, and better language. So I do feel like this is going to happen at some point. And as we started looking at this space, we saw that many other partners were interested in this. I think it's more a matter of what are the incentives.
Starting point is 01:13:10 Makes sense. Anyone else who would like to add something about optimizers here? I think it depends on what type of optimizer, right? Cost-based optimization is definitely more tied to the engine. We could probably share a whole bunch of rule-based optimizations and things like that. And I actually think that sort of work is going to happen as we coalesce around an intermediate representation. Someone's going to write a good optimizer for Substrait at some point. Or we'll be able to translate to and from it easily. And we'll just have that ability, I think. But then it's like all those designs around, well, now that I need to incorporate
Starting point is 01:13:54 how my engine actually acts and what the real costs are, that gets pretty hairy. Yeah. I think once we, just quickly adding to this, that was essentially the discussion we had, like, are there actually common things that we can extract from the optimizer? And I think just adding more color to what I mentioned, we saw that at least like all the optimizers, they have ways of, okay, how do you define what are the physical capabilities
Starting point is 01:14:19 you have? How do you cost those capabilities? I think there was a discussion on providing the right APIs so that engines can actually say, well, I support merge joins, index joins, and hash joins, and those are the costs. The part of actually exploring the plan space and costing those things, all that part is very common. So I think the idea was just, you know, how do we define the right API so that it could be reused across engines? But again, I think it's something that we just stopped on this part of why should we fund this? And maybe that's something we will look into in the future.
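The split Ryan and Pedro land on, with engine-specific cost models on one side and potentially shareable rule-based rewrites on the other, can be illustrated with a toy filter-pushdown rule. The plan nodes and the rule below are invented for illustration; they are not the API of any existing optimizer.

```python
# Toy illustration of an engine-agnostic rule-based rewrite (filter pushdown).
# The plan nodes and the rule are invented; not any real optimizer's API.
from dataclasses import dataclass

@dataclass
class Scan:
    table: str

@dataclass
class Filter:
    input: object
    predicate: str

@dataclass
class Project:
    input: object
    columns: list

@dataclass
class FilteredScan:
    table: str
    predicate: str

def push_filter_into_scan(node):
    """Rewrite Filter(Scan) into FilteredScan, recursing through the plan.
    Nothing here needs engine internals, which is why rules like this could
    plausibly be shared; choosing between merge, index, and hash joins still
    needs an engine-specific cost model."""
    if isinstance(node, Filter) and isinstance(node.input, Scan):
        return FilteredScan(node.input.table, node.predicate)
    child = getattr(node, "input", None)
    if child is not None:
        new_child = push_filter_into_scan(child)
        if new_child is not child:
            node = type(node)(**{**vars(node), "input": new_child})
    return node

plan = Project(Filter(Scan("events"), "amount > 100"), ["user_id"])
print(push_filter_into_scan(plan))
# Project(input=FilteredScan(table='events', predicate='amount > 100'), columns=['user_id'])
```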
Starting point is 01:14:51 Yeah, makes sense. Okay, one last thing. And that's like a question specifically for Ryan. What is going to substitute Apache Ranger for the enterprise, at least? Come on, it's hopeful. Someone has to go and fix this. I think it's solving the wrong problem.
Starting point is 01:15:14 So I'm going to answer a different question that I think is hopefully relevant. I am not bullish on sharing policy or representation because a lot of different systems, the edges have very different behaviors. So like a future table permission like you have in Tabular versus like a inherited database level permission or something like that, right? Like what do you do with objects as they go into and out of existence?
Starting point is 01:15:49 I don't think that sharing policy is the right path because we would have to come up with a union of all the policy choices and capabilities out there. I think sharing policy decisions is the right path. So what we're incorporating into the Iceberg Catalog REST API, or that protocol, is the ability to say, this user has the ability to read this table, but not columns X and Y. And the benefit there is it doesn't matter how you store that. It doesn't matter if you're using ABAC or RBAC or whatever you want to use for that internal representation.
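A hedged sketch of the contrast Ryan is drawing, with invented payloads on both sides. Neither structure is the actual Iceberg REST catalog spec; the point is only that a decision is a small, concrete answer for one principal and one table, while an exported policy has to encode every system's model.

```python
# Invented illustration of "share policy decisions, not policies".
# Neither structure below is the actual Iceberg REST catalog spec.

# Sharing the policy itself forces every consumer to understand every model
# (RBAC vs. ABAC, inheritance rules, grants on future objects, and so on):
exported_policy = {
    "model": "rbac",
    "roles": {
        "analyst": [
            {"on": "db.events", "action": "select", "except_columns": ["email"]}
        ]
    },
    "inheritance": "database -> table",
}

# Sharing the decision is just a concrete answer for one principal and object,
# however the catalog represents policy internally:
access_decision = {
    "principal": "analyst_17",
    "table": "db.events",
    "can_read": True,
    "masked_columns": ["email"],
}

def readable_columns(all_columns, decision):
    """What an engine would do with a decision: plan only the allowed columns."""
    if not decision["can_read"]:
        return []
    return [c for c in all_columns if c not in decision["masked_columns"]]

print(readable_columns(["user_id", "email", "amount"], access_decision))
# ['user_id', 'amount']
```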
Starting point is 01:16:32 The catalog just tells you the decision. It says, this user can read it, but not these things. Or they can't read this at all or something. And so I think that is going to be the best way to standardize an exchange, which is essentially you have to pass through the user or context information necessary for making these decisions. And then the decision comes back in a easily represented way because it's a concrete decision from the policy rather than storing the policy and all of the ways you could interpret that policy. Yeah, I think we need to
Starting point is 01:17:13 have like a dedicated episode, like just talking about that stuff, to be honest, but like for another time, like the theme today was like different. But Eric, back to you. Yeah, look, we have a couple episodes here. We probably should talk about data types. We should probably talk about the death of SQL and access policies. So we can line up a couple more episodes here. Gentlemen, thank you so much for joining this. This has been so helpful. We've learned a ton.
Starting point is 01:17:44 I know our listeners have as well. So thank you for giving us some of your time. Thank you. Yeah, thanks. We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric at datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by Rudderstack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com. Thank you.
