The Data Stack Show - 185: The Evolution of Data Processing, Data Formats, and Data Sharing with Ryan Blue of Tabular

Episode Date: April 10, 2024

Highlights from this week's conversation include:
- The Evolution of Data Processing (2:36)
- Ryan's Background and Journey in Data (4:52)
- Challenges in Transitioning to S3 (8:47)
- Impact of Latency on Query Performance (11:43)
- Challenges with Table Representation (15:26)
- Designing a New Metadata Format (21:36)
- Integration with Existing Tools and Open Source Project (24:07)
- Initial Features of Iceberg (26:11)
- Challenges of Manual Partitioning (31:49)
- Designing the Iceberg Table Format (37:31)
- Trade-offs in Writing Workloads (47:22)
- Database Systems and File Systems (55:00)
- Vendor Influence on Access Controls (1:01:58)
- Restructuring Data Security (1:03:39)
- Delegating Access Controls (1:07:22)
- Column-level Access Controls (1:14:19)
- Exciting Releases and Future Plans (1:17:47)
- Centralization of Components in Data Infrastructure (1:25:37)
- Fundamental Shift in Data Architecture (1:28:28)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by Rudderstack, the CDP for developers. You can learn more at rudderstack.com. Welcome back to the Data Stack Show. Costas, we've talked a lot about databases, database technology. You know, it's been a common theme on the show.
Starting point is 00:00:39 But today, we're going to dig really deep into that world, that high scale. So Ryan Blue is our guest. He helped create Iceberg, which is now part of the Apache Foundation. And it's going to be a great story. I mean, I am really interested in hearing the background of the challenges that they faced at Netflix, you know, where this was originally developed. And then it's above my pay grade, but I am really interested if you would be willing to ask him about file formats, because that is actually another interesting thing that
Starting point is 00:01:21 we haven't covered in great detail. I mean, we've done it here or there, but you know, that's a huge topic when it comes to Iceberg, when we think about data lakes. So that's another topic that I've been thinking about just as it relates to all of Ryan's experience. So hopefully I didn't steal your thunder on the file. But what do you want to ask about? Yeah. Hopefully I didn't steal your thunder on the file submits question, but what do you want to ask about? Yeah, I mean, first of all, I know that most people, when they think about Ryan, they think of Iceberg.
Starting point is 00:02:00 But what is, I think, extremely interesting is that Ryan has been around for a very long time. He has been part of building some very foundational pieces of technology that we are using today. Like things like Avro, Parquet, and obviously like the table formats like Iceberg is. So outside of anything technical that we will be talking about with him, one of the things that I will spend quite some time with him is do a little bit of history. Why things actually happen the way that they happen. We touched with him, and it's, in my opinion, super interesting. It's about how, when it comes to data processing, there are actually two parallel tracks of development
Starting point is 00:02:47 that happened in the past 10, 15 years. One which is coming primarily from the database folks that were building database systems. And another one is coming actually from people that were primarily distributed systems people. And that's where things like MapReduce came, stuff like Hadoop, and like all these big data technologies
Starting point is 00:03:12 that we are talking about. And we will see that. And there are like some very interesting comments and points that are made of like how we reinvented some things or we did some things like differently, why this happened and Ryan gives like a very
Starting point is 00:03:31 interesting perspective into the evolution of these systems and how they happened and why and outside of that we'll talk a lot about file formats which is also quite of a hot topic Parquet for example has been out for a while there are like a lot of conversations of like,
Starting point is 00:03:46 we need to update it. There are some actually new things coming out these days. So I think it's a very good time to do a refresher on what file formats are and for storing data and how they differ between them and how they differ to table formats like Iceberg, right? And on top of that, we'll talk also like a little bit about like Tabular, his company, and also about some other like really interesting things that are
Starting point is 00:04:16 happening right now in in the space. So make sure you listen to the episode. It's very interesting. Ryan has like a lot to share and we have a lot to learn from him. Agreed. Well, let's dig in and talk about Iceberg and all the other fun stuff. Let's do it. Ryan, hello.
Starting point is 00:04:38 It's very nice to have you here again. We had a panel a couple of weeks ago together, but now we'll have the opportunity to be just the two of us, actually, today. And talk about many interesting things, spanning from table formats to security and anything that has to do with data lakes out there. But before we get to that, let's do a quick introduction. Tell us a little bit about you. How did you get into that stuff? Why you decided to work on these things?
Starting point is 00:05:10 And what you're doing today. Thank you. And thanks for having me back on the show. It's fun to have these conversations and do all these things. Yeah, so I accidentally fell into the big data area. I was not a fan of like databases or database systems. When I had college classes on them. I was really more of a distributed systems, you know, kind of engineer. And as I was doing distributed systems in the government, I got more and more into big data because everything was a distributed systems problem at the time with the early days of Hadoop.
Starting point is 00:05:55 So I eventually worked at Cloudera and then moved on to Netflix. And that was because I was working on the Parquet project. So I was on the Parquet and Avro sort of data formats team at Cloudera. And then we use them very heavily at Netflix because we needed obviously a whole lot better performance just reading bytes off disk. Okay, that's great. You said Parquet and Avro. And I have to ask you, what's the difference between the two and why we need to? Well, so Avro is a record-oriented format. So you sort of blend together all of the columns and you keep chunks of records together, individual records together. Of course, Parquet is a columnar format. So you reorganize and store all of column A, then all of column B, etc. And the reason that you do
Starting point is 00:06:55 that is so that you can basically keep the columns separate. If you only need columns A and E, you can read them effectively as large chunks, which is, like I was saying, what we were interested in at Netflix. Because what we didn't want to do was read a ton of data, every single column, every single value from S3. What we wanted to do was read just the pieces that we need, because you're talking a very different bandwidth and latency profile for all of those requests when you're working with S3 instead of HDFS. It's also just a generally good practice to store your analytic data in a columnar format.
Starting point is 00:07:40 Yeah, so in a way you went from distributed systems to not just databases, but to the very, very core part, the storage of that, how we lay out data out there. Yeah. Well, that was actually an accident. I joined a different team at Cloudera, and then three months in, my team took over data formats and that was pretty much all of what we did and i'd done like specifications for apis and things in the past so i was pretty familiar with like you know bytes on disk or supporting something forever those sorts of things that definitely came in handy when we were talking about Parquet
Starting point is 00:08:25 and Avra. Yeah, 100%. Okay, so you joined Netflix and moved away from HDFS to S3. And what happened there? Like, what were, like, the new things that you showed there that were, like, different and hopefully interesting compared to, like, the stuff that happened on the VFPS? Well, so we had all the same problems. They were just 10 times worse because the latencies were 10 times higher. So the Hive table format that we were using at the time had to list a whole
Starting point is 00:09:00 bunch of directories to figure out what files are in those directories. And that was okay when you were talking seven milliseconds to the name node. When we moved to S3 and it's 70 milliseconds, all of a sudden you're talking like 10, 15 minutes of planning time for regular queries. And it was awful. So we started band-aiding and sort of patching all these solutions. So we parallelized the file listing in Spark. But then you're still running up against things like, well, the listing might be inconsistent. So you're not only checking S3, you're also checking some source of truth in DynamoDB.
Starting point is 00:09:44 That was our Semper project. And we just kept creating these sort of workarounds and band aids. We actually had our own implementation of the S3 file system that was based on the Presto one. And it was funny, because we would turn off certain checks, you know, to meet the requirements of a file system. You have to know that, you know, when you list a directory, that it's not a file or vice versa. And we would turn off those checks because, you know, if you need to check if something's a directory in S3, you actually need to check, does the object exist? And, you know, say, hey, I'm a directory, the underscore dollar folder dollar paths. But you also have to do a listing to determine if it's a
Starting point is 00:10:34 directory and not a file. And you have to have some behavior to say, like, hey, if it is a file and a directory, what do we do? So it was all just sort of bonkers. And we said, you know what, we're just going to turn off those checks. And our jobs ran a whole lot faster because we actually treated S3 like an object store instead of trying to put this facade of a file system on it. Yeah, yeah. Kind of reminds me a little bit of this other thing that professionals say in database classes,
Starting point is 00:11:07 that you have to bypass the operating system in a way to start working with that stuff. So I think we see that on a different scale. But you mentioned latency as one of the main reasons behind the frustration and the problem that we were trying to shoulder. But can you elaborate a little bit more on what latency means and what from 7 milliseconds to 70 milliseconds does at the end to the user and why this is becoming a problem?
Starting point is 00:11:40 Because 70 milliseconds on its own doesn't... Anyone would be like, okay, whatever. Who understands the number of 70 milliseconds on its own, does it like... Anyone would be like, okay, 70, whatever. Who understands the number of 70 milliseconds, right? So what is happening there that snowballs down to a big problem at the end? Yeah, I mean, it's just the minimum amount of time that any request to a service takes. So that's what I think of as latency. Of course, you could have requests
Starting point is 00:12:06 that take much longer, right? But there's a certain bottleneck if there's some minimum that you can't do better than. And in the case of the Hive tables that we were using, there were just a lot of those requests. So for everything we need to check, like, hey, is this directory a file?
Starting point is 00:12:29 And that was part of listing, right? And like I said, you can go in and make some assumptions and turn off this or that, but if you're making multiple requests to S3 for every quote-unquote directory that you need to list, then you're talking like, you know, that adds up to maybe 200, maybe 300 milliseconds. And if every single partition takes a substantial part of a second and you're listing say a thousand, like that's a bad situation. And when you want your queries to
Starting point is 00:13:06 ideally take, you know, a few seconds, that planning time was just completely, you know, there was just a huge problem that we couldn't work around. And I think there were also enough problems with the other aspects of this. You know, so the fundamental problem is that tables were keeping the state of the table in both the file system and a metastore. The metastore says, hey, here's a list of directories you need to go look at. And the file system or S3 object store was responsible for what is in those directories so you've got the latency problem where multiple requests stack up and you're doing you know thousands of operations which also by the way turns out to be super expensive just in s3 request costs alone if you have this small files problem, you can actually
Starting point is 00:14:06 have request costs that exceed your storage costs. So don't get into that. That's a future topic. But another thing that was a problem here was that Hive can only push filters to the partition level. So you can say, oh, okay, well, I can select these partitions, but then you don't have any information about which files in that partition match your query.
Starting point is 00:14:34 And that was another order of magnitude, if not two orders of magnitude improvement when we said, hey, let's keep more metadata about what's in those files so that we can select individual files and not these several hundred megabyte groups or more large groupings of files. Okay, question.
Starting point is 00:14:56 Why we have this separation with the metadata? Why we didn't keep just everything on the metastore, right? And we had to have some stuff on the metastore and then some on the file system, because I would assume that the metastore is optimized for accessing metadata, right? It's like the file system is file system. So what was behind this decision back then to design things like that? I think that the Hive table format really...
Starting point is 00:15:31 And when I say table format, I'm referring to this space of things that we sort of labeled as, like, how do I keep track of which files are in my table? In a way, table formats have always existed. Certainly Oracle and MySQL and Postgres have some way of baking their storage down to files, but we never see or care what that is. It's only really come to light in this Hadoop era where we did it sort of naively. And to answer your actual question,
Starting point is 00:16:02 I think that the problem was that we didn't design it, right? Like me, we were all distributed systems engineers. And so we went for something simple that seemed to approximate a table if you squint. And what we had was, okay, we'll just lay out our files in directories. Directories have some property that we can filter by, and then we'll figure it out from there. And, you know, we're using commodity hardware, so who cares if you can't, you know, select individual files?
Starting point is 00:16:36 I'm sure it'll be fine. It just was not well thought through because I think everyone was trying to move so fast in those early days. Hadoop was like a Cambrian explosion of projects. And in the early days of Hive, they were working on both the storage representation as well as an entire SQL engine. And I think the SQL engine was the more interesting, harder problem at the time. And we sort of just let the table sit there. And it did evolve.
Starting point is 00:17:08 So you also asked, why did we not just put everything in the metastore? Well, the metastore was sort of an afterthought as well. We had suddenly thousands of partitions and thought, oh, it's really slow to go list and discover the partitions. And we sort of need this ability to swap partitions at times. And so the meta store was bolted on after the fact, and it just reflected this already existing convention that we're going to use directories to track what's in our table.
Starting point is 00:17:42 Mm-hmm. That makes total sense. Okay, and what's next? What happens next? So you are facing, you're trying to scale the infrastructure there, obviously, like, okay, you're on Netflix. At that time, Netflix probably was also going through an explosion in growth, so in this type of company, an explosion
Starting point is 00:18:00 of growth means explosion in data. So, what do you do? Like, how do you, I mean, there are only so many options that you can turn off on a file system, right? So what did you do next? Well, it was that. Like, we're running out of Band-Aids, right? We also realized that the techniques
Starting point is 00:18:22 and sort of Band-Aid approaches that we took, those were all specific to Netflix. So we had this massive development cost of like, every time a new Spark version came out, we had to update it, right? We had to port our changes to that new Spark version, and it was a giant pain. And we also hadn't even begun to fix some of the other problems. So problems like users expect SQL behavior. We were getting a lot of data engineers with backgrounds in Vertica or Teradata or these other like data warehouse systems, and we would have to retrain them. And that retraining was not like, hey, here's a list of the things you can't do, including write from Presto, rename columns and things like that. Which, like, can you imagine if you're a data engineer and on your first day of work, they hand you a list of like, don't rename columns, don't write from these systems. We couldn't actually attack any of those problems either because we were spending all our time
Starting point is 00:19:25 on band-aids and things like that. So what we realized was these were all the same problem. It fundamentally was that our format for keeping track of what is in a table is too simplistic. We didn't have the metadata to do better filtering for faster queries. We were treating S3 very badly and not using it how it's intended to be. In addition, bringing in problems like eventual consistency with listings that could affect and just completely destroy query results. And then we had all these usability challenges. You know, our workarounds were restrictive and you couldn't write from Presto and things like that. And we realized these were
Starting point is 00:20:11 all the same problem. It was that our view of what makes a table was way too simple and we needed to replace that. And that we could actually fix a lot of things if we did. I mentioned earlier the small files problem. And another thing that we were hitting was, if you don't have the ability to change your table atomically without corrupting the results that someone else sees, another one of those minor usability problems, you actually can't fix those small files problems. So you're talking about assigning a person to go make sure that data was written correctly the first time.
Starting point is 00:20:53 And it creates a massive amount of work for data engineers that the table representation was too simplistic. So, you know, part of what we wanted to do in addition was go fix those situations automatically. There was no reason why your database shouldn't be maintaining your table data. It was just that we didn't trust ourselves to do it for some reason. Yeah, yeah, yeah. I understand. So what's... Okay, so tell us a little bit about the journey there. So how you started trying to solve this problem in not a band-aid way, right? And how did you end up with the solution?
Starting point is 00:21:33 What the solution looked like, right? Well, so what we knew we wanted to do was track state somehow, right? Basically track all of the files in a given table. We wanted to do that and solve these sort of table-level concerns. And we'd been accumulating this list.
Starting point is 00:21:54 I mentioned working on the Parquet project. And we kept having people come to the Parquet project with problems that were not file-level concerns, but were table-level concerns. Like, hey, I've got two files with different schemas. And I don't know what the table schema is. Can you guys, you know, build a merge schema tool that tells me what the table schema is? We're like, well,
Starting point is 00:22:16 if one file has column E and the other file doesn't, was it added? Or was it deleted? You really don't know. And so, you know, we kept having to say, like, this isn't something that Parquet can solve. So we grabbed a whole bunch of those problems, grabbed that list. We said, okay, we know we want to track metadata at the file level so that we can do better pruning, so that we can make modifications, like fine-grain modifications to that file list, and do things like atomic commits and things like that. And so, you know, we basically designed for those constraints and, yeah, built essentially a metadata format with a very simple commit protocol to make all that happen. And how was that implemented inside the... because you are describing
Starting point is 00:23:09 an environment that's quite complex, right? There are many tools out there. You mentioned Presto, you mentioned Hive, you mentioned Spark, and maybe others. And suddenly you come in and you say, hey, we're going to monitor the data that are important for these systems, right? Because if these systems ignore these metadata, then what's the point? So there is, in my mind at least, and correct me if I'm wrong, there is a problem here on how you start a project like that.
Starting point is 00:23:39 Because it's not just laying out a specification out there or even implementing a way to store and manage this metadata. You also have to integrate that with a number of tools. So how did this happen? How did it work inside Netflix, I guess, initially? And then what's the story after that? Because we have an open source project that came out of that, right? Tell us a little bit about this journey.
Starting point is 00:24:07 So luckily, and I think the reason why this made sense for Netflix as an investment, was that so much of our time was spent delivering new Spark versions or delivering new Presto versions or things like that, where we thought that we could long-term reduce the amount of time to release a new version because we wouldn't have to port all of our band-aid fixes over to that new version and update them and then test that it all worked. We wanted a system where we could just plug in using the existing APIs. That didn't actually work out in Spark. We had to replace the API in Spark to actually plug in. But Trino and Presto, it was called Presto at the time,
Starting point is 00:24:55 Trino had a very good API that we could cleanly plug into. So at least there was that body of time that we could either spend constantly maintaining or building something new to reduce that maintenance work so that gave us the essentially ability to spend or invest in this area and from there we decided you know what we needed because we had multiple different projects that were all going to interact with this thing, we needed a core library. The core library is what an engine will use to essentially say, hey, give me data matching this.
Starting point is 00:25:35 Here's a filter, give me the files back. So we built that library and then the ability to commit in that library. And then that's what we integrated into these other systems. Luckily, we had, because we'd been porting our own sort of hacky solution from version to version, we had a lot of knowledge of the inner workings of those projects. So with one core library, we were able to integrate that fairly easily into the projects. All right. And that's what became Iceberg, correct?
Starting point is 00:26:10 Exactly. I'd love to hear from you if you remember that first version of Iceberg, right? The first thing that you put out there. What was the features that Iceberg had in this first version? And the reason I'm asking for that is because, I'm sure many things have evolved through the years, but it's always interesting to see these first choices that they are making because they also, in a way, say what was the big problem at that time
Starting point is 00:26:43 or what could be solved on a timely manner. Tell us a little bit about that. Okay. I apologize if I get any of this wrong, because this is from memory. But I knew that I wanted to get rid of manual partitioning. Because so many of these things that inspired Iceberg Features were constant problems that we had people coming to us and asking about. And we knew that
Starting point is 00:27:13 if we were replacing this bottom layer in our software stack, we needed to get it right because we were only going to do it once. So we included a number of other things in the initial version that we thought were very important. So first of all was that hidden partitioning. Don't make users need to know the physical layout of the table in order to effectively query it.
Starting point is 00:27:38 Same thing with schema evolution across multiple formats. So we said if we're going to get you know, fix this rename problem or fix the ability you know, if you drop and add a column, you resurrect zombie data, right? We knew we had to fix those things. So we had schema evolution,
Starting point is 00:28:00 hidden partitioning, and metadata you know, file, essentially file-level metadata tracking and still the same atomic swap that we use for commits that we do today was all in the initial version. What was missing was we didn't yet have the metadata push down. So we didn't have column level metadata in there
Starting point is 00:28:31 at first. We added that very quickly afterwards for that file level pruning, because that was actually a huge win. Because going to users in Netflix, we said, said hey we can solve a lot of your problems and they said that's awesome I don't really care about you know not being able to rename columns and we're like but that's a correctness issue that you spent a week last month fixing and they were like yeah but it sounds hard to migrate my tables. So we ended up going in and building some of these bigger features in order to entice people to move over. Because if we were just replacing Hive tables with something that had the same performance profile as a Hive table, but was safe,
Starting point is 00:29:25 people just didn't see the need to move over. So we pretty quickly added the early use cases were around scale, you know, going in and getting to scale that we couldn't before, especially in planning time, our early slides talked about taking a job that took, I think, 10 minutes to plan, even in parallel, and making it run in 45 seconds. And I think it took more than 24 hours to run because we couldn't do that file level pruning. We were reading every single file in a huge set of partitions, and there was just no need. You could get that down by 100x. So that was really cool. The other was actually thanks to hidden partitioning. What we could do was we built a flink sink that would write directly into event time partitioning, rather than processing time partitioning. And that was huge for our users because they always want to deal with downstream event time.
Starting point is 00:30:31 No one wants to deal with processing time because then you've got to have multiple filters. You've got to say, okay, well, I'm willing to take this window of when the data may have arrived, but I'm really looking for this data in terms of when it actually occurred. It's a massive pain. And if you can deliver data and partition it from the get-go in event time, and then maybe use automatic compaction and things to take care of the small files problem,
Starting point is 00:31:02 that relieves an enormous burden on data engineers for actually the way they think about and manage that data. You mentioned partitioning a couple of times already. And I mean, it is a big part of working with a lake architecture. Can you tell us a little bit about... Again, you mentioned you knew that one of the first things that should be delivered there was the automatic partition thing, right?
Starting point is 00:31:30 And change the way that people interacted with partitioning the data. Tell us a little bit more about how it happened before from the perspective of the user, right? Like the data engineer was going and working with the data there. How they were doing it before and what changed after Iceberg
Starting point is 00:31:46 was integrated there. Yeah. So it was very primitive before. Partitioning is the idea that you just, essentially, going back to that earlier definition of database split across multiple directories. A partition is a directory. And so we would break down data into these directories based on special columns called partition columns. So imagine you have data flowing into this table. You need to have a
Starting point is 00:32:22 column for how you want to break it up by directory. So you might take the processing time, and then you might derive a date from that processing time and store all the records for that date in the same directory. Now, there are a number of problems with the way that actually happened. So that whole process is the same as what Iceberg does. The difference is that Iceberg does that automatically. We say, oh, we know the relationship between your layout and this directory structure and the original timestamp field this is coming from. In Hive, it was all manual.
Starting point is 00:33:05 So you needed to supply the date for any given record. And that sounds easy. I can derive a date from a timestamp. But it's actually not. I mean, first of all, like, well, originally, Hive didn't have date types, so you would derive a string. Or at Netflix, we were still using what we called a date int, which is basically the integer 2013-06-11, right?
Starting point is 00:33:37 Something like that. And so you'd have to derive these things yourself and put data in there, which means if you do that wrong, you are putting data in the wrong place and no one will ever find it. So if you use the wrong time zone, right, if you use the wrong date format and all of these things could just be silently wrong. Not only that, users that were reading the table needed to know that they were querying by this other special date call and not just the original timestamp because
Starting point is 00:34:15 if they say oh well i'm looking for event time between you know t0 and t1 they go do that, but Hive has no idea how to turn that into a partition selection. And so they would have to say, okay, well, I'm looking for this range of timestamps and you can find them in this set of days. And that translation,
Starting point is 00:34:40 again, is like, are you using the right time zone to derive the days for your start and end time and you also get really wacky queries if you go down to our level partitioning because you've got to say okay well this partition column between here and here and then on the first day between you know 11 a.m and midnight and on the last day between zero you know midnight and 11 a.m. and midnight and on the last day between midnight and 11 a.m., it was just a total mess. And if you get any of that wrong, you get a full table scan and partitioning doesn't
Starting point is 00:35:14 help you at all. So this was just full of user errors. And a lot of places built up libraries that they would use for doing this right like netflix had daydant libraries and it wasn't awful but it was a huge tax on people and what we said in iceberg was like let's just make all that automatic you should figure a table with something that says hey uh you know chunk up this timestamp into day size ranges and use that. And then if you have any filter whatsoever on that timestamp column,
Starting point is 00:35:53 we can bake that down into the partition ranges that we need. And we can also then use the additional metadata of profile column ranges on your timestamp column for that purpose. That's what we ended up doing. And I think it was really helpful because it freed people up to no longer worry about those low-level details. In retrospect, when we look back on what we've done it's really restoring the table abstraction from sql it's saying like hey users of tables don't need to think about how this bakes down into files yeah and that frees us up as the people managing those files and and managing the table data
Starting point is 00:36:42 to say like hey we can have real database functions on this table data. We can have automatic compaction. We can build these things. And it's not going to screw up users who, for some reason, look under that abstraction layer and want to do specific things. So it both makes data engineers a lot more productive and means that we can do more as a data platform. Mhm. Yeah, yeah, 100%. Okay, and you mentioned, we talked a little bit about how you integrate these with different query engines out there, right? So there was a library that was built. But outside of the library, how do you build a table format?
Starting point is 00:37:27 As a software project, right? I know it sounds almost like a silly question that I'm asking, but what does it mean? We know if we want to build a web application, we kind of have a mental model of what that means, right? But when you're starting to build a table format, what does this mean in terms of building it as a software artifact? What are the choices that you make? How do you represent these? How do you store them? Tell us a little bit more about that. Not about the integration with the libraries,
Starting point is 00:37:59 obviously there are APIs that have to be built there to access the metadata, but more about, let's say, the format itself, like how you, as the designer of a system, you approach designing it and implementing it. The first stage was just rapid iteration. So we bit off this problem area of we know we need to track individual file level. We know we need atomic changes to that file level. And then it's about solving the next challenge
Starting point is 00:38:33 and the next. So we initially said, okay, we're going to have a file that tracks every file in the table. And then we want some way to atomically swap from one file to the next. And that's essentially the basic design, right? And then you layer on new things that help you gain performance.
Starting point is 00:38:55 So if you think of this conceptually as a big list of files, that's great. But you don't actually want to rewrite the entire list of files for a table every single time. So the next insight is, okay, well, we're not going to have one single manifest file for everything in a table. We're going to have a set of manifests that have different lifespans. So older data goes in one, you know, medium age data goes in another, new data goes in another, and so on. And you basically have the ability to swap out certain files to create different versions of the table that gives you, you know, lower write amplification. Right. So that was, you know, one of the very early things that we ran into it was like okay
Starting point is 00:39:45 hey we're not going to be able to rewrite all the table metadata every single time and in fact like some of the you know multi-petabyte tables have like tens of gigabytes of metadata so it really is not practical and then you say okay well how do we keep track of these manifest files? Well, we need something that holds their metadata, right? And so you introduce this manifest list level where you can, and actually we keep partition ranges and metadata in that list. So when we're planning a query, we don't even need to go to every single metadata file in the table. We can go and say, okay, which metadata files do we need to read? Read those metadata files to get the data files. And then the layer above that is just, where do we track all the current versions of the table? and what do we use for basically an atomic swap? Because the commit protocol is super simple. It is a pointer to the current table state.
Starting point is 00:40:54 And you can swap that pointer if you say, hey, I'm replacing version 4 with version 5. The metastore swaps v4 for v5, and you're done. If someone else gets there first then you need to rewrite and essentially retry that that commit operation so it's all very simple solving you know one problem after another and we just iterated on it quickly i think the original library like took four or five months to write, and then it's largely unchanged from there. I shouldn't say largely unchanged. The design is very similar. Yes, yeah, yeah, 100%. And okay, I want to ask you a little bit about atomicity,
Starting point is 00:41:38 because you mentioned this a couple of times. And it sounds important, right? Obviously, you don't want someone writing, someone else also writing, and making a mess there, right? But how big of a problem is in a system that it's not, let's say, an OLTP system, right? Like something that is primarily for writing and updating data
Starting point is 00:42:03 converted to a system that's probably reading. We write the whole idea behind all app systems is that we're going to move the data, prepare the data in a way that's going to be very efficient to read. So we can go there, and when we query the data, be really performant, right?
Starting point is 00:42:20 And scale up to large amounts of data. So it's like a dominant reading operation. So in what scenarios, like atomicity, when we write, is becoming important? And how much of, let's say, similar scale of a problem is compared to the transactional databases, right?
Starting point is 00:42:44 So atomicity is always important because you want to make sure your changes either are entirely applied or not at all and fail. Because I always compare this to cutting and pasting between spreadsheets. It's something that you're going to do. You're going to grab a ton of rows and you're going to copy from here and paste in the other. Well, what if that paste operation came back and said, hey, we got like 350 out of 400 rows, but we can't tell you which ones.
Starting point is 00:43:20 Right. You'd be like, what? And not only that, what if that spreadsheet were you know driving some dashboard that a c-suite person is currently looking at yeah right clearly none of that is okay right you don't want to have to figure out and reconcile which changes were made and which were not and you also don't want some important query to be lying while you're doing that reconciliation. So atomic changes are always important. Now in the analytics space, the changes are larger and fewer. So you might have hundreds or thousands or even millions of rows changing in a single operation.
Starting point is 00:44:08 And you're also, you don't need the state of any given row immediately. So in a transactional system, I might make several updates within milliseconds of one another and I always want it to be up to date and so your architecture includes you know in memory caching of changes a write-ahead log to ensure durability and on like a lot of things to make sure small changes are always up to date in the analytic world though because you're willing to take that trade-off, you're willing to say, okay, there can be no runtime infrastructure other than a catalog. So it's not like you have something maintaining a write-ahead log on a local disk. Yeah. Like, you write it all into S3, you swap that pointer over to, you know, make it live,
Starting point is 00:45:21 and then that is your table state. And basically, the only thing that knows about it is the metastore that tracks the current table pointer or current table metadata, and then all the readers go directly to S3. And that allows us to scale well beyond
Starting point is 00:45:37 what a metastore, the Hive metastore would. Okay, that's a great point, actually, because I think in many people's minds, atomicity is more about competing writers, but it's not just that. There are many more things that are happening. As you said, you start writing, something might fail.
Starting point is 00:45:57 What do you do in that case? You don't want a state that is just a bad state, right? You need to keep track of that and make sure that either you write everything or you don't write. So that's, okay, I get that when it comes to the analytical use cases where you have, let's say, a query engine that's going to query the data
Starting point is 00:46:19 to do interactive analytics and stuff like that. But then usually in these environments, we also have ETL, right? Where the workloads there are like more mixed between like reading and writing. And the reason I'm asking for that is because you said like, okay, we can make an assumption here
Starting point is 00:46:35 in the trade-off that, okay, we can sacrifice a little bit of like writing, right? But we want to make sure that what we write is correct. And at the same time, when we read, we're like super, super at the same time, when we read, we're super, super fast. But when we start having, let's say, more mixed workload, where writing is also important, because when you are ingesting data, for example,
Starting point is 00:46:55 you might have to write a lot of data there, and you have to be more performant in the writing part. How do we manage that? If it is a problem, say to me, you know what, no, it's not. We can write data from Kafka to Iceberg and do it very fast and everything is great. We don't have issues there. But is there a difference in the trade-offs when we are also about writing, and how do we deal with that in terms of designing? So it's all about the amplification of that write. How much do I have to rewrite in order to express this change?
Starting point is 00:47:41 In Iceberg, we use a Git-like model where everything is a full file, a full tree structure of your table data. And you can go read that independently as opposed to an SBN-type model where you're just accumulating diffs against the last version. And you have to sort of reconcile those over time. Right amplification then means like, you know, if I want to change a single row, well, I've got to rewrite that data file. And then I've got to rewrite the manifest that tracks it. And then I've got to rewrite the manifest list to swap the new manifest for the old one. And I've got like, so there is quite a bit of rewriting and write amplification going on for a single change. Generally, that's okay if you're coming in with a bulk or batch workload. And I include micro batch in here. So streaming data from Kafka and writing it into an iceberg table works extremely well,
Starting point is 00:48:40 because you create, you know, say, 100 files and commit them all at once and the cost of that commit is you know you're getting potentially millions of rows in every commit and so that's what makes the analytic world really work is getting a lot of work done in a single commit and the trade-off is really that you can use this underneath so many different engines because they're not sharing anything. The minute you introduce that runtime infrastructure, that write-ahead log or something like that, other than like a caching tier, I'm not talking about that. But if you're talking about having a write-ahead log so that very fine-grained commits
Starting point is 00:49:25 can be durable and served, then you're creating a bottleneck. The beauty of Iceberg is that the scalability is the scalability of the underlying object store. And S3 does pretty well. Yeah. You see this in a lot of databases where, you know, say you have a Trino cluster sitting there. And, you know, you had some infrastructure underneath that Trino cluster to be able to, you know, satisfy queries and do this, essentially this finer grained changes to your data really quickly, assuming that exists. Well, if you have 1000 processes all start up and hit that at once, you've got this thundering herd problem where it just it takes down, you have not scaled up your cluster to the point where it can handle that workload. Whereas if you're talking iceberg, just sitting in an S3, you can do any of that, you can scale up 10 Presto clusters
Starting point is 00:50:25 at the same time that you have a Hadoop cluster running a thousand Spark workloads. None of that actually affects the scalability of the system. Okay, that makes a lot of sense. Okay, I have to ask you that. So you mentioned S3, right? It scales really well and all that stuff, but there's not only S3 out there.
Starting point is 00:50:50 Pretty much every cloud provider has the equivalent of S3. And okay, it's okay to be political correct and say everyone is doing great, right? And what Microsoft has is also doing great, and what GCP has is also doing great. But have you seen, because you're in a very unique position to have to understand very well and work
Starting point is 00:51:10 with block storage, to go and deliver the performance. Have you experienced differences? Is there also any anecdotal information that you find interesting to share
Starting point is 00:51:26 about the differences between how these systems are operating and are built? Because the APIs are pretty much the same. They're not crazy different, I would say, but there's huge technology investment behind that to make it work, right? So I'm sure there are like interesting yeah how many okay so to me i'm a very like first principles in design kind of person so to me an object store is an object store yeah right like they are fundamentally scalable
Starting point is 00:52:01 because of the design which comes down down to you've got a block store with some sort of distributed index on top of it. And so you've got a key-based index, and that's a key value store. And then you've got a block store that's also a key value store. And those things work very well and scale well as distributed systems, because you can always know from metadata who is responsible for either this key or this block. So yeah, those two key value stores that make up an object store, that makes a lot of sense to me as a scalability thing. I think where you get into the performance differences and trade-offs
Starting point is 00:52:50 is when you talk about the guarantees on those indexes. This is why S3 didn't have consistent listing for the longest time. And this is why read-after-write consistency why S3 didn't have consistent listing for the longest time. Yeah. Right? And this is why, you know, read after write consistency and sort of like the guarantees are hard. Because, you know, you have this key value store,
Starting point is 00:53:16 you distribute state to nodes, and then you have a problem of how do we keep those nodes in sync? Yeah. And what if you hit the wrong node? And of course you're going to have duplication and replicas and sort of that problem. And then it's how well can we manage the trade-offs of keeping these things in sync
Starting point is 00:53:38 and also parallelizing in order to have that workload scale up. And I think that's where you see different performance trade-offs. So back in the day, S3 didn't have consistent listing, but Azure Storage did. And I looked at that and didn't think, oh, hey, that's great. I'm like, oh, okay, so Azure is saying I'm going to wait for my listing
Starting point is 00:54:08 request until like everything is reconciled. And in S3, I'm not. So I should expect lower latency on the S3 call, higher latency on the Azure call. And, you know, you can choose whichever one for design or convenience but I really see like I want to keep with the underlying trade-off that they're making on our behalf yeah and you know I think that's kind of where we ended up as everyone said hey consistent listing is conceptually important enough that we're waiting we're willing to pay that latency penalty. Yeah, that makes total sense. Cool. All right. So we're talking about tables here. So we're talking about database systems. And the database systems, they are data management systems, right? It's not just a file system at the end. Not that the file
Starting point is 00:55:05 system doesn't have controls there to manage, let's say, how people access or how they access the data and what you can do with the data. But obviously a file system is a much more simplistic system compared to a database when it can't how you interact
Starting point is 00:55:21 with the assets that are exposed to the system, right? And there's a lot of... I mean, when it can't, like, to how you interact with the assets that are, like, exposed through the system, right? Sure. And there's a lot of, I mean, people that are coming, like, from databases, they are obviously, like, exposed to a very rich environment of, like, how you can say, hey, you know what? Like, Ryan can have access to this table, but, like, Kostas cannot, right? Or maybe you can't have controls even on, let's say, the columns themselves and, like, all that stuff.
Starting point is 00:55:45 Or what I can do, can I create or I cannot create, right? Obviously, something like S3 has some controls, but these controls, first of all, have been designed in a completely different way for different users, not people who are querying data, right? I would find it very interesting if anyone who wants to query data,
Starting point is 00:56:03 they're like, oh, let me create a role here in IAM to make sure that data is directed by the appropriate person, right? Not only is that usability just very foreign to a database consumer, but it's also intended for a completely different audience, right? The IAM policy is generally managed by an administrator or something like that. I mean, I would not want that level of complexity placed on individual database users just trying to give one another access to the data set they just created. 100%. And if anyone hasn't tried to create an IAM policy out there and see how it works, I would suggest them to go and see it. Because it's a really good experience.
Starting point is 00:56:58 And that says a lot about the people who are taking care of that stuff. I have a huge appreciation for the people managing these things. They are complex for a good reason, right? I'm not saying that there's something wrong with them, but they are a very different audience. So how do we, let's say, start adding what has been pretty much like a standard thing when it comes to database management systems out there, when it comes to security, access controls, and all that stuff. But now, we are building on top of a file system, a distributed file system, and we
Starting point is 00:57:32 want, at some point, to expose these things, these kind of functionalities, also to these users, show how does it work, if it works, how it was in the past and how it is today, right? Because I'm sure that if we take the Hadoop days and compare them with today, probably there's a gap there. So it would be great to also share a little bit of the history, to be honest, because you've seen that. So it would be good for people to get an idea of how kind of a wild waste it was in the past when it comes to who has access to what on the data there. Yeah, well, I mean, the original Hadoop days
Starting point is 00:58:12 and I guess to some extent today because it translated into object stores, it was very much like the Hive table format where we'd already exposed individual files and what's underneath this table level abstraction to users. And so we wanted to have both sort of security models.
Starting point is 00:58:34 And we ended up with something that was just horribly confusing, because some people wanted to just go access files underneath these locations directly. And some people wanted to treat everything like tables. And I think you want only one abstraction level for security. So that was a huge problem in the Hadoop and Ranger days, where you had to have separate policies for whether you're coming in through HDFS or you're coming in through a table abstraction. And it just got really messy. Not only that, the world got a whole lot harder in the Hadoop ecosystem
Starting point is 00:59:16 because the assumption in the Hadoop ecosystem is that you're using multiple engines. And these engines don't really know about one another. So you've got Hive, Presto, Trino, Sparge, Pig at one point, like all these different flink, all interacting with the same tables at the same time. And that was something that technologically Iceberg fixed, the ability to work with those tables at the same time without corrupting results and things like that. But the security model was never something that Iceberg, you know, wanted to wade into because that's super messy.
Starting point is 00:59:58 You know, the solution was you lock down files in Ranger, you lock down tables in Ranger, and you sort of hope that those things are consistent. It's not a great strategy, in my opinion. That was all then moved into S3 and the data lake world when Hadoop sort of gave way to these object stores. But we still have generally a problem where you have multiple security models in mind. And what we see probably the most common is that you lock down Trino with Ranger and you leave Spark wide open through IAM policy. And you're like, well, if you have access to run Spark jobs, you just have access to everything.
Starting point is 01:00:46 Or maybe you lock it down by bucket, but then any physical access policy in IAM is very inflexible. Because it's not like you can just go make a programmatic change to say, oh, hey, swap these permissions or do this other thing. Because you might have hundreds or thousands or millions of objects in that object store
Starting point is 01:01:12 that you need to swap the policy on. And not only that, how do you translate policy from a table-level policy like you can read this table down to those objects. It's a massive, ugly problem. Yeah. Okay. I have also
Starting point is 01:01:35 a feeling from seeing the industry, how it operates out there and especially vendors that are building companies, right? Access controls and security in general is also like a way to create like a lock-in and like kind of keep the customer in and like monetize more on that. And I think that kind of creates also
Starting point is 01:02:02 not the right environment like to find a solution at the end. Because no one is in a way incentivized to cooperate with the other engines that they go and work on the data. I might be wrong. I'd love to hear your opinion on that. Because I've seen it, right? Everyone says like, okay, let's say we have data peaks with Spark and they have their way of doing things, right? You have Starburst with Trino. Again, they have their ways of providing access controls out there and controlling the data and all these things, but all of them at the end, and for a good reason, it's not like they are bad actors or whatever. They have an incentive to try and keep
Starting point is 01:02:51 the user in their own gated environment for these kind of features, which are extremely important also in the enterprise. If we're talking about a small company, maybe they don't care that much, but if you have like thousands and thousands of users and you're a publicly traded company, yeah, like access control like to data is like an important thing, right? So how do you see, and because you're like also like in a very unique position as like with the Iceberg and Tabular also to be interacting with all these actors that are creating this industry.
Starting point is 01:03:28 What's your feeling of how things are working today and where are they going? Because you mentioned there is a gap. So where do you see things going? I think we are just now coming to
Starting point is 01:03:43 the realization that we have to share things. Before, traditionally, databases have been storage and compute glued together, and you do things a certain way. shared until we got to Hive. And Hive did it so poorly that we didn't really have time to look at the problems other than like correctness issues all the time. Now we have projects like Iceberg that solve that core problem. And now we're looking at, I think, big questions about how databases are going to be restructured in the next decade. So I put this very centrally in that bucket of problems. We previously saw security as something that happens at the query layer. The query gets in, we parse it, we figure out what it needs to access, we check all
Starting point is 01:04:40 those accesses before moving on. And doing this at the top level means that I need that in Trino and I need that in Flink and Spark and Snowflake and whatever else touches my data. And not all those things can be trusted. Spark and Flink, they accept user code, so we can't just make security decisions in process. The other two are very opinionated about what they support and what permissions mean. And things like select, that's easy. Things like who gets what access when you create a table, that's really hard. And so I look at this as really a restructuring: given that we now have this shared storage layer,
Starting point is 01:05:32 how do we secure data? And to me, the clear answer is the storage needs to take on data security. It needs to take on access controls because what users want is a common set of identities, a common set of access controls defined on the table or the storage level and implemented everywhere. So I really think that, and this is one reason why Tabular, my company, built an access control system that handles all of those concerns and
Starting point is 01:06:07 plugs in nicely to other things. I think that is now going to be one of the biggest concerns moving forward. Because with GDPR and CCPA and other new regulations on data coming online, I don't think we're going to be able to just give blanket access to anyone that can write a Spark job anymore. That makes sense. You say the responsibility moves to the storage layer. It makes sense when you want to have all these multiple different engines going and working there.
Starting point is 01:06:49 And I get how things can work with a system that's, let's say, more relational, right? In terms of creating access controls and all that stuff. But as you mentioned about Flink and Spark, we are talking about systems here where you can literally write and execute any type of code, right? How do you deal with that when you have to be agnostic, right? Because now you are the storage, you have to control the access there, but you have different, let's say, ways that people interact with the data.
Starting point is 01:07:19 So how do you solve this problem? Well, so for Tabular, we are primarily coming from a structured or relational model. So when you load a Tabular table, we provide you with temporary access credentials to be able to access the files underneath that table with the permissions that you had when you loaded it. So if you have read permissions but not write permissions, we'll give you credentials that can be used against S3 with that. Now that also has a lot of assumptions baked into it about how we manage the files, where
Starting point is 01:07:58 they go, how we break down tables. And so it's a whole system inside the catalog to make sure that the access controls are not violated when we hand out credentials. But that's how we map our logical table-level permission system to the physical file access in Tabular. When you look more broadly outside of just the tables and that structure, then you need some strategy, some similar strategy for doing that, right? You need to be able to delegate permission
Starting point is 01:08:39 to that underlying storage system. And that delegation naturally has to make assumptions about where you're going to be placing files and where files exist that you have read access to. That makes sense. Okay, so we have two potential personas here interacting with Tabular, right? One is, let's say, the actual user,
Starting point is 01:09:04 like someone who's creating the data through a system, and they probably don't even care what the system is as long as they can access their data. And the other one is like a vendor. Like someone, let's say, Trino. They want support to
Starting point is 01:09:19 delegate access controls to Tabular. What's the experience of each one of them when they are doing their work and they are relying on the Tabular catalog to access and manage access controls? Well, so at the user level, it should be seamless.
Starting point is 01:09:48 Now, what we do is there's some administrator setup, and this is on the side of the person running Trino, to establish a chain of trust, right? We need to delegate trust to Trino so that it can represent its identity, or preferably that user's identity, to Tabular. So that when Trino goes and loads a table, we say, oh, this is Trino acting on behalf of Joe, right? We send a credential back that says, hey, Joe can access this table, here is a credential for doing it, and then we trust Trino to do that responsibly. So the user just knows that someone has set that up; they've logged into Trino and have provided their identity to Trino, and the chain of trust goes from there. The administrator, ideally, gets that set up through an OAuth exchange. So hey, our ideal is a screen in Starburst or some other Trino provider that says, you know, connect to Tabular.
Starting point is 01:10:58 You click on it and Tabular shows you something just like the screen that says, hey, this app would like to access your Google contacts. We show you one that says, you know, hey, we're Tabular, Trino wants to be able to access your Tabular catalog. And you say, okay, that sounds good. And if you're an administrator in Tabular, you can set up that access and, you know, establish that chain of trust that allows Tabular to know this is a trusted Trino system that is set up by an administrator who in the
Starting point is 01:11:32 end owns the data in Tabular. Yeah, makes sense. And how does it work for the vendor, right? I'm the creator of Trino, let's say, and now I want to extend my access control infrastructure
Starting point is 01:11:48 that I have in my product, my system, to support Tabular. What's the process there? Is it an API, is it SDKs that someone would take and integrate? How does it work? The open source Iceberg project has a catalog protocol that all of the engines should be using to talk to catalogs.
Starting point is 01:12:16 So this is a standard protocol, much like the Hive Thrift protocol, only again, we wanted to make sure that it was well-designed rather than being something that became a standard through ubiquity. We wanted to design it as a standard. So we carefully designed this with the ability to pass multiple forms of authentication through. So if I've given you a token, you can put that in an authorization header.
Starting point is 01:12:45 You can also use AWS native SIGV4, which is a way of taking your AWS identity and signing requests with that and then passing it on. So you have identity passed through that catalog API, and then we have the data sharing or access to underlying data also documented and passed through that REST catalog API. So you say, hey, here's my identity token. You know, load this table. Tabular then says, yep,
Starting point is 01:13:18 I can see that you have access to this table. I'm going to generate, you know, the underlying data credential for you and pass that back to you, then you can use that. And you might be Trino acting on behalf of you, or you might be just you with a Python process. And that all works in this model. Okay, well, that's great.
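(For readers who want to see that exchange in code: below is a minimal sketch based on the open source Iceberg REST catalog spec. The endpoint, namespace, table name, and token are made up, and the exact headers and response keys can differ between catalog implementations, so treat it as an illustration of the flow rather than Tabular's exact API.)

```python
# Rough sketch of loading a table through an Iceberg REST catalog that vends
# storage credentials. URL, namespace, table, and token are placeholders.
import requests

CATALOG = "https://catalog.example.com/v1"   # hypothetical REST catalog endpoint
TOKEN = "my-identity-token"                  # identity established out of band (e.g. OAuth)

resp = requests.get(
    f"{CATALOG}/namespaces/analytics/tables/events",
    headers={
        "Authorization": f"Bearer {TOKEN}",
        # Ask the catalog to return short-lived storage credentials with the table.
        "X-Iceberg-Access-Delegation": "vended-credentials",
    },
)
resp.raise_for_status()
load_result = resp.json()

# The metadata location tells the engine where the table's metadata lives...
print(load_result["metadata-location"])

# ...and the config map carries the scoped, temporary credentials the catalog
# decided this identity may use (read-only if the user only has read access).
storage_config = load_result.get("config", {})
print(storage_config.get("s3.access-key-id"))
```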
Starting point is 01:13:41 And when you say someone can have access controls through Tabular, I assume that someone can control what tables a user can have access to. What else is there? Is it possible to define, let's say, which columns someone can have access to? Or how do you define this? Is it role-based, or, you know, is it a whole bottomless rabbit hole there of how these things can be managed? So what is supported there? And what should a user expect from Tabular on that front?
Starting point is 01:14:19 Yeah, so this is where, you know, actually, if you're going with just SQL access controls, you start running into some pretty hairy situations, because column-level access is not really standard. So what we do is we have a system of labeling columns as sensitive. So you create a label and you create a policy for that label that says, don't show this unless you have select access on the label, or maybe mask it if you don't have select access on the label. And then you can label columns throughout your warehouse and we'll apply that policy.
Starting point is 01:14:56 Now we apply that policy in a couple different scenarios or ways. In a trusted system, if Trino comes to us and says, hey, I am Trino, I am trusted, we can send back something that says, well, on behalf of this user you can access columns A, B, and D, not column C. And we trust that Trino is going to enforce that. So that is something that we're standardizing now in the REST protocol, to be able to make those policy decisions and communicate them through the protocol to trusted endpoints.
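(A toy illustration of the label-based decision Ryan is describing, not Tabular's actual API, just hypothetical names: what a catalog could hand back to a trusted engine, namely which columns to show, which to mask, and which to hide entirely. The hide case is the Spark behavior he explains next.)

```python
# Hypothetical sketch only, not Tabular's API: mimic the policy decision a
# catalog makes for a column-labeled table, given the labels a user may select.
from dataclasses import dataclass


@dataclass
class ColumnPolicy:
    visible: list[str]  # columns the engine may project as-is
    masked: list[str]   # columns the engine must mask (nulls, hashes, etc.)
    hidden: list[str]   # columns to drop from the schema entirely


def decide(columns: dict[str, str | None],
           user_labels: set[str],
           mask_labels: set[str]) -> ColumnPolicy:
    """columns maps column name -> sensitivity label, or None if unlabeled."""
    visible, masked, hidden = [], [], []
    for name, label in columns.items():
        if label is None or label in user_labels:
            visible.append(name)   # unlabeled, or the user holds the label
        elif label in mask_labels:
            masked.append(name)    # policy says mask instead of hide
        else:
            hidden.append(name)    # pretend the column doesn't exist
    return ColumnPolicy(visible, masked, hidden)


# Example: 'email' is labeled pii and should be masked; 'ssn' is hidden outright.
print(decide({"id": None, "email": "pii", "ssn": "restricted"},
             user_labels=set(), mask_labels={"pii"}))
# ColumnPolicy(visible=['id'], masked=['email'], hidden=['ssn'])
```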
Starting point is 01:15:34 Of course, for places like Spark, how do you do that? We have two modes. So one is you deny access to the table entirely because any data file has data that you can't see. And therefore, if you can't see all the columns, we deny access. You can see metadata.
Starting point is 01:15:57 You can't see data files. That works. Another mode is actually just more permissive where you can just lie to Spark. Okay. No, it sounds crazy, and this is not real security. But this is essentially taking care of the accidental leak. The accidental data leak rather than the malicious data leak. Because yes, if I give you
Starting point is 01:16:25 a key that can access a data file that has data you can't see, that's not real security because you can grab that key out of memory and then go download the file yourself. On the other hand, what most people are interested in is avoiding the accidental data leak. And for that, we simply say, this column doesn't exist. When you load that table in Spark, the column is not listed. And therefore, Spark has no idea how to project it or use that column. It's just going to ignore it. And that is pseudo security, but it's what a lot of people actually want in practice. So I think it's important that you get to choose as an administrator whether you want to fail all queries or just have this pseudo security. And if you want real column level security, then you go through an engine that is actually trusted to provide it. Yeah, that makes total sense. That's super interesting. Alright, so
Starting point is 01:17:26 we're close to the end here. I just wanted to ask you to share with us a few things that you're excited about in the future, both about Iceberg, Tabular, and the industry in general. Alright? Yeah.
Starting point is 01:17:42 So, yeah. What excites Ryan these days? Well, I do want to highlight a couple of releases that just went out this week. So Python, PyIceberg, just had a release in which we finally got write support, so you can append to or overwrite unpartitioned tables, which is really great. We expect providing these tools to the Python world will really help it integrate with the rest of the data ecosystem inside of companies.
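(A minimal sketch of that new PyIceberg write path; the catalog name, table identifier, and data below are placeholders, and the exact catalog configuration depends on your setup.)

```python
# Sketch of the PyIceberg write support described above; names are placeholders.
import pyarrow as pa
from pyiceberg.catalog import load_catalog

# Catalog connection details come from ~/.pyiceberg.yaml or environment variables.
catalog = load_catalog("default")
table = catalog.load_table("analytics.events")   # hypothetical namespace.table

batch = pa.table({
    "event_id": [1, 2, 3],
    "event_type": ["click", "view", "click"],
})

table.append(batch)       # add the rows as a new snapshot
# table.overwrite(batch)  # or replace the table's existing data instead
```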
Starting point is 01:18:20 Similarly, the first release of Iceberg Rust just went out this week. It is early. It probably is not directly useful yet, but it's really cool to see the interaction with Rust catalogs and the basic ability to work with Iceberg metadata and read it. And we expect that to actually mature quite quickly. So I'm really excited about those. I'm also excited about the upcoming Iceberg release
Starting point is 01:18:51 with view support. I mentioned restructuring the database world now that we have shared storage. Security is one thing that also moves down to the data layer, but so do definitions of things like views, right? Views are really critical as shorthand for things like, oh, apply this filter, or remove these columns, or apply some transformation, and they're very useful. So Iceberg is now trying to standardize how we express views so we can use them across engines. And that gets to kind of the IR conversation that we had the last time I was on the podcast.
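(To make the view idea concrete, here is a simplified sketch of a view version carrying more than one SQL representation, loosely following the open source Iceberg view spec; the fields are abbreviated and the SQL is illustrative.)

```python
# Simplified sketch of an Iceberg view version with per-dialect SQL
# representations, loosely following the view spec; fields are abbreviated.
view_version = {
    "version-id": 1,
    "schema-id": 0,
    "default-namespace": ["analytics"],
    "representations": [
        {"type": "sql", "dialect": "spark",
         "sql": "SELECT event_id FROM analytics.events WHERE event_type = 'click'"},
        {"type": "sql", "dialect": "trino",
         "sql": "SELECT event_id FROM analytics.events WHERE event_type = 'click'"},
    ],
}

# An engine picks the representation whose dialect it understands; all of them
# are required to produce the same result.
spark_sql = next(r["sql"] for r in view_version["representations"]
                 if r["dialect"] == "spark")
print(spark_sql)
```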
Starting point is 01:19:36 That's our use case around an intermediate representation for SQL logical plans. What's the status of that project right now, in terms of figuring out, let's say, how it will be implemented? Because views are very interesting, as you said, very important, but also very hard, right? There are systems out there, and figuring out
Starting point is 01:20:06 the semantics and how they can be equivalent between different systems, all these things can become very hard. And yeah, it's always the edge cases, but the question is how many edge cases there are, how long this long tail out there is, right? Yeah. So what's the current stage of that and what are you looking into, right? Like, are you building your own IR, for example?
Starting point is 01:20:31 You're considering doing that, like defining a new way of representing, let's say, intermediate representation of a query? Are you evaluating some projects? And what's the collaboration also with the engines because at the end the engines are going to be executing this and they're portable at the end. Tell us a little bit more about that.
Starting point is 01:20:53 That is I think the big question. Right now what we've done is we've allowed the ability to create multiple representations of a view that are mandated to produce the same thing. So right now you can create a view through the API that has, say, SQL for Spark and SQL for Trino. And Trino will use the Trino SQL, Spark will use the Spark SQL, and you have compatibility that way. You could also have Spark say use the Trino SQL and just parse it and try to rely on the fact
Starting point is 01:21:31 that it is a standard language. That doesn't work for a lot of cases, but there are a lot of cases it does work for. For things like very simple transformations, things like running a filter statement, projecting columns, it's pretty simple. Even stuff like union all. So if you have a use case like, I've got fast-moving data coming over here and slow-moving data in this table, I can union those tables together. That works really nicely. But we do want to get to this state where there's some other intermediate representation that has really high fidelity to essentially translate between SQL plans
Starting point is 01:22:22 and this intermediate representation. So Trino is responsible for their translation, Spark is responsible for their translation, and we all agree on what the intermediate representation says. That is further out. We're taking a look at IR projects now, and we ideally do not want to have one in the Iceberg project.
Starting point is 01:22:47 That would be very difficult. But even things like Substrait, which I think is the most well-known one out there, that is difficult to work with because it expresses literally everything. So even things like Boolean expressions are custom functions that are defined in YAML files that you can swap out. So things like equality, you've got to pull in a definition of that from somewhere. And it's a little too expressive. So we'll need to look into exactly how to make use of those types of projects. And of course, like you were saying, the community side is tough, because it's not that Iceberg makes this decision and everyone follows along without complaining. We've got to convince everyone that, you know, the representation that we are choosing
Starting point is 01:23:46 is reasonable and should be supported in Trino and Spark and Flink and whatever other commercial engines are looking at it. That's going to be difficult but at the
Starting point is 01:24:02 same time, like, super valuable if we can get it right. Yeah, 100%. Especially for practitioners out there. Okay, the vendors are a different kind of beast to tackle, and convincing them why this is important, I think it's the market that has to enforce that at the end. But yeah, hearing about people having to migrate Hive views in big systems and move to other systems, it's a lot.
Starting point is 01:24:32 And it's a hard problem, and at the end it is an impediment to progress, right? Like, people get stuck in systems that they cannot move out of just because of the amount of resources that are needed to go and migrate, right? That's what's so encouraging about this new world, though. In a world where we have shared storage,
Starting point is 01:24:52 the migration costs drop to zero. Well, not zero, but it's a step function down. Yeah. And then if we're able to move things like security and access controls and views, more and more things are going to be useful across engines. That's a radically different world, and it's really encouraging. I love that you talked about practitioners, because that's really who we're trying to serve with all of this. You know, people who spend their days, you know, making sure that the copy over to Redshift
Starting point is 01:25:33 or Snowflake is in sync with the original data. And like, are we securing, locking down these two things with separate permissions models, even though they're 99% the same permissions model, they're separate and they're stored separately and they're maintained separately, are those things in sync? And I think with the new design here, we can really do away with a lot of those challenges and just annoyances. I think we're moving to a world where we have centralized components and specialized components. And storage, views, encryption,
Starting point is 01:26:13 security and access controls, those are all going to be centralized components. And that's a really exciting world to be in. Hopefully then we have a simple OAuth process to plug in new specialized components like your Trinos and your Sparks and your Snowflakes. And that is a really cool data infrastructure world. I 100% agree. I think that's also the only way that we can, you know, start accelerating how we build value
Starting point is 01:26:45 here because the more we turn these people, the practitioners, into operators instead of builders, which is how it was in the past, and for a good reason back then, the less we give them the space to go and create value.
Starting point is 01:27:02 And I do think that especially if you compare data practitioners and compare them to application engineers and see the difference of the tooling that one and the other has and how much they can focus on, hey, now I'm
Starting point is 01:27:18 building a new feature. Okay, all of us have to debug, but even that, it's much more streamlined than it is in the data world. That's what we should go after. That's the experience that practitioners should also have out there. Not spending 80, 90% of their time trying to figure out, oh shit, did I copy these table
Starting point is 01:27:37 rights from one system to the other? No, that's not what humans should be doing. There are much more interesting things for them to do. And we need that. If we're talking about all this AI and all these things out there, how are all these things going to happen without data that are consistent
Starting point is 01:27:52 at the minimum, right? Yes, we have to make practitioners productive just with data in order to get to these other things. Because if there are people spending their 40 hours a week just looking at, are all the access controls in sync?
Starting point is 01:28:10 Oh no. 100%. And that's what I think makes these next couple of years really exciting for anyone who's working in this space. I think it's going to be a very interesting place to be and this crazy opportunity to build very interesting things. I agree. It's an exciting time because
Starting point is 01:28:30 we're seeing this fundamental shift in data architecture. I don't think we've ever seen anything like this shift to centralization for storage and security and things. Yep. Yep. A hundred percent. So Ryan, thank you so much. We're a little bit over the time here, but it was an amazing conversation. Thank you so much. And I can't wait to have you back and continue the conversation. I think there is more to talk about. So looking forward to doing that again soon.
Starting point is 01:29:02 Thanks for having me on. It was a lot of fun. We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.
