The Data Stack Show - The PRQL: The Two Parallel Tracks of Development In Data Processing with Ryan Blue of Tabular

Episode Date: April 8, 2024

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data. RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show prequel. This is a short bonus episode where we preview the upcoming show. You'll get to meet our guest and hear about the topics we're going to cover. If they're interesting to you, you can catch the full-length show when it drops on Wednesday. Welcome back to the Data Stack Show. Costas, we've talked a lot about databases, database technology. You know, it's been a common theme on the show. But today, we're going to dig really deep into that world at high scale. So Ryan Blue is our guest. He helped create Iceberg,
Starting point is 00:00:51 which is now part of the Apache Foundation. And it's going to be a great story. I mean, I am really interested in hearing the background of the challenges that they faced at Netflix, where this was originally developed. And then it's above my pay grade, but I am really interested if you would be willing to ask him about file formats. Because that is actually another interesting thing that we haven't covered in great detail. I mean, we've done it here or there,
Starting point is 00:01:25 but that's a huge topic when it comes to Iceberg and we think about data lakes. So that's another topic that I've been thinking about just as it relates to all of Ryan's experience. So hopefully I didn't steal your thunder on the file format question, but what do you want to ask about? Yeah, I mean, first of all,
Starting point is 00:01:45 I know that most people, when they think about Ryan, they think of Iceberg, but what is, I think, extremely interesting is that Ryan has been around for a very long time. He has been part of building some of
Starting point is 00:02:00 the very foundational pieces of technology that we are using today, things like Avro, Parquet, and obviously table formats like Iceberg. So outside of anything technical that we will be talking about with him, one of the things that I will spend quite some time on with him is like,
Starting point is 00:02:27 do a little bit of history, like why things actually happened the way that they happened. One thing we touched on with him that is, in my opinion, super interesting is how, when it comes to data processing, there are actually two parallel tracks of development that happened in the past 10-15 years. One is coming primarily from the database folks that were building database systems. And another one is coming actually from people that were primarily distributed systems people. And that's where things like MapReduce came from, stuff like Hadoop and all these big data technologies that we are talking about. And we will see that there are some very interesting comments and points that are made about
Starting point is 00:03:16 how we reinvented some things or we did some things differently, and why this happened. And Ryan gives a very interesting perspective into the evolution of these systems and how they happened and why. And outside of that, we'll talk a lot about file formats, which is also quite a hot topic. Parquet, for example, has been out for a while. There are a lot of conversations like, we need to update it.
Starting point is 00:03:45 There are actually some new things coming out these days. So I think it's a very good time to do a refresher on what file formats for storing data are, how they differ from one another, and how they differ from table formats like Iceberg, right? And on top of that, we'll also talk a little bit about Tabular, his company, and also about some other really interesting things that are happening right now in the space. So make sure you listen to the episode.
Starting point is 00:04:19 It's very interesting. Ryan has a lot to share and we have a lot to learn from him. Agreed. Well, let's dig in and talk about Iceberg and all the other things. Let's do it. All right, that's a wrap for the prequel.
Starting point is 00:04:35 The full-length episode will drop Wednesday morning. Subscribe now so you don't miss it.
