Orchestrate all the Things - Data management in 2024. Featuring Peter Corless, Director of Product Marketing at StarTree, and Alex Merced, Developer Advocate at Dremio

Episode Date: January 11, 2024

For many organizations today, data management comes down to handing over their data to one of the "Big 5" data vendors: Amazon, Microsoft Azure and Google, plus Snowflake and Databricks. But analysts David Vellante and George Gilbert believe that the needs of modern data applications coupled with the evolution of open storage management may lead to the emergence of a "sixth data platform". The sixth data platform hypothesis is that open data formats may enable interoperability, leading the transition away from vertically integrated vendor-controlled platforms towards independent management of data storage and permissions. It's an interesting scenario, and one that would benefit users by forcing vendors to compete for every workload based on the business value delivered, irrespective of lock-in. But how close are we to realizing this? To answer this question, we have to examine open data formats and their interoperability potential across clouds and formats, as well as on the semantics and governance layer. We caught up with Peter Corless and Alex Merced to talk about all of that. Article published on Orchestrate all the Things: https://linkeddataorchestration.com/2024/01/11/data-management-in-2024-open-data-formats-and-a-common-language-for-a-sixth-data-platform/

Transcript
Starting point is 00:00:00 Welcome to Orchestrate all the Things. I'm George Anadiotis, and we'll be orchestrating all the things together. Stories about technology, data, AI and media, and how they flow into each other, shaping our lives. For many organizations today, data management comes down to handing over their data to one of the biggest data vendors: Amazon, Microsoft Azure and Google, plus Snowflake and Databricks. But analysts David Vellante and George Gilbert believe that the needs of modern data applications, coupled with the evolution of open storage management, may lead to the emergence of what is called the sixth data platform.
Starting point is 00:00:37 The sixth data platform hypothesis is that open data formats may enable interoperability, leading the transition away from vertically integrated, vendor-controlled platforms towards independent management of data storage and permissions. It's an interesting scenario, and one that would benefit users by forcing vendors to compete for every workload based on the business value delivered, irrespective of lock-in.
Starting point is 00:01:01 But how close are we to actually realizing this? To answer this question, we have to examine open data formats and their interoperability potential across clouds and formats, as well as the semantics and governance layer. We caught up with Peter Corless and Alex Merced to talk about all of that. I hope you will enjoy this. If you like my work and Orchestrate all the Things, you can subscribe to my podcast, available on all major platforms. My self-published newsletter,
Starting point is 00:01:31 also syndicated on Substack, HackerNin, Medium, and DZone, or follow and orchestrate all the things on your social media of choice. Hey, everybody. My name is Alex Merced. I'm a developer advocate at Dremio. And basically, my role at Dremio is, well, one to help enable people in using the Dremio. But more of my role has really been in helping educate and advocate for the Data Lakehouse architecture. And particularly for just helping educate people on table formats in particular.
Starting point is 00:02:00 So a lot of the articles and stuff that I've written has really been on sort of what are the different formats? What are the different considerations of the different formats? Now I'm starting to move more into sort of like, okay, now people know what the formats are. Now let's talk more about implementation and the tooling around that. So things like Project Nessie that enable more like Git-like operations on Iceberg and Dremio, which really kind of enables sort of a center point of access for a lake house. And just kind of talking about how to put all these things together that kind of really have a fast performance, easy to a lake house and just kind of talk about how to put all these things together that kind of really have an now a fast performant easy to use lake house that kind of unites everything and fulfills the promise of what daily houses were meant to be cool thank
Starting point is 00:02:35 you i'm peter corliss been around silicon valley for a long while but most recently i am at star tree where i'm the director of product marketing. And before that, I was at Cilidivi and also Aerospike. So I've been on both the OLTP side, as well as the OLAP side. And I'm just taking a look at the kind of meta patterns that are emerging in real world architecture. I've been writing up a lot of case studies, and it's interesting to see the patterns of what's being adopted by users in the real world. And sometimes that aligns with what us as vendors say and sometimes it's radically different. So that's where a lot of the thoughts I have today come from.
Starting point is 00:03:16 Okay, well, great. Thank you, Peter. And then, all right, I thought maybe it's good to start with a little bit of history. That's obviously table stakes for both of you, but maybe not necessarily for everyone who may be listening in. So I thought it may be a good idea to sort of retrace, let's say, maybe the last 10 years or so in data management. So going from, let's say, traditional data warehouses to Hadoop and to cloud storage and eventually data lake houses and, well, everything else. Sure. I love the pun, by the way, the table stakes pun. It wasn't intentional. But I think that that's a great point, is that tables themselves are not all one thing anymore, right?
Starting point is 00:04:07 We have highly optimized row stores, highly optimized column stores, some databases that purport to support both. And then even within row store or column store format, there's so many options to choose from. What are you optimizing for is a key thing. Like for instance, this morning I was listening to Lance, which is a new format that's coming out of LanceDB, and they're focusing upon very large blobs and making a column store that's aligned with machine learning. Now, not everybody is doing machine learning, but when that's important, they believe that they have some advantages there. And that's just one example out of thousands that are out now. And I'd like to cede to Alex because I think he's knee deep in the table formats right now.
Starting point is 00:04:55 Yeah. So I agree. There's a lot of really interesting sort of new projects coming along, like Apache Pymon and LanceDB. But kind of going back to sort of like, where did this all come from? Basically, we go back to like you back to sort of like, where did this all come from? Basically, we go back to like, you mentioned traditional data warehouses, where basically, basically, you had this physical infrastructure on premise in your in your office. And then basically, you had an issue with scaling, like if I needed more storage, well, I have to buy a node and I'm getting this compute I don't need when I need storage. When I need compute, I still got to get extra storage. So one, you this, the scaling was more expensive. And two, you couldn't just scale the drop of a hat. You kind of have to plan, okay,
Starting point is 00:05:28 how many nodes am I going to need ahead of time? So this kind of caused a lot of tension. And that's kind of what led to the impetus of sort of cloud data warehouses. So Snowflake allowing you to say, hey, we'll separate your compute and your storage and you can pay for what you need here in the cloud. But you still have to kind of move your data
Starting point is 00:05:44 into the data warehouse. You'd still have to kind of move your data into the data warehouse. You'd still have to move it into Snowflake. So you're still having duplicative data, data in your data lake and data in your data warehouse. So that's the idea of a data lake house being like, hey, you know what? We love this decoupling of compute and storage. We love the cloud, but wouldn't it be nice
Starting point is 00:06:02 if we don't have to duplicate the data and we can just operate over the data store you already have, your data lake, and then just take all that data warehouse functionality and start trying to move it on there, which has been a total order. Because in the earliest stages of the data lake, you had HDFS, where you had on-prem data lakes with the similar issues you had with on-prem data warehouses. But in order to enable SQL in the data warehouse, I mean, in the data lake, because what happened was Hadoop had a framework called MapReduce, which was really difficult to write for. So then they came out with Apache Hive,
Starting point is 00:06:33 which would take SQL and translate it to MapReduce jobs. And in order to have SQL, though, you need tables. So you got to figure out how do I represent a table on the data lake? And Hive decided to do that through a directory structure saying, hey, this folder is a table and any sub folder is a partition. We'll track those folders in the Hive metastore. And then that's how we're going to define a table. And that worked. That enabled Hive's ultimate goal, which is to be able to write SQL, translate it to MapReduce. But there was still a lot, like it was still really hard to
Starting point is 00:07:04 do granular, granular, more granular updates and deletes, do them with sort of asset guarantees. There's all these things that are still lacking from a traditional sort of data warehouse or database environment that made it sort of impractical to really make the data lake your home of all your data. Go for it. On this, for instance, if you go back a decade or more, partitioning wasn't a guarantee, right? Automatic partitioning, auto sharding, right? And that's one of the things that's become a presumption over the past decade is that your database will be auto partitioning, auto sharding, you know, auto balancing. All of those things came along with the movement to cloud services. And databases used to be far more oriented towards files in a file system. And these days, you're being presented with an API
Starting point is 00:07:52 to get your data. And it's more of the virtual data catalog, as opposed to physical files in a file system that you're managing, you're backing up and rotating and all the rest of that kind of stuff. So I think that that was one of the advantages of going to the cloud is making it so that you weren't so hands-on with the file system underneath. So I think that there's, and with that then means that the systems themselves had to become those administers. They had to become very intelligent on how they did sharding and partitioning and distribution and replication. And all of that has become a requirement. Like that's a baseline requirement these days if you're building a big distributed system.
Starting point is 00:08:30 And that's sort of like where we're at right now with like in the data lakehouse space. Because that's basically now you, the databases and the warehouses have automated all this stuff in the cloud. And now, like you said, that's like assumed. But in the data lakehouse world, that's still not quite there yet. Like you have Apache Iceberg Tables or Delta Lake Tables.
Starting point is 00:08:46 It's not assumed that they're just going to automatically partition themselves and all this stuff. That's where all these data management systems come in. Like this is what services like Tabular and Dremio now provide for like Apache Iceberg Tables or Databricks has different services
Starting point is 00:08:58 that they provide on for Delta Lake Tables or OneHouse provides for Apache Hoodie Tables. And that's sort of like the next like sort of big competition in the data lakehouse space because now, okay, everyone's agreed. I wasn't quite the right solution. So let's move to these newer formats that do kind of address that Apache iceberg, Apache hoodie and Delta Lake.
Starting point is 00:09:17 But now, okay, how, where are the services that allow us to, to, to maintain partitioning, to maintain compaction, to sort these data files and do all that data management. Where's that going to come from? Because there isn't this unified system now. It's all decoupled and modular. And in that, it's very interesting because in the Delta Lake world,
Starting point is 00:09:37 you're kind of just stuck with one option. Databricks provides all those services and there really isn't a good alternative for Delta Lake. So it's like oftentimes with Delta Lake, you really kind of are marrying Databricks while like Apache Icebreak, you have Tabular, you have Dremio, Snowflake has just gotten to the table management game and AWS has gotten to the table management game afar, not just being a catalog,
Starting point is 00:09:55 but also managing the tables under the hood and the data files underneath. And then one house is starting to make some splashes with their new one table project and getting more attention towards the hoodie project. So it's going to be an interesting sort of battle the next couple of years over that. Yeah. And I think the other thing that is becoming presumed in the platform is that things like normalization, sanitation, cleansing of data, right? Sanity of data, that is becoming now more of a core feature that's required because nobody wants to manage like a data swamp, right? People really do want to have their tables normalized,
Starting point is 00:10:33 sanitized against each other, that he has one customer record that's kind of universal. And we get into things like silver and gold tables, like how pure is pure. And I think that that's another thing that's been a movement in recent days, whereas in the past, all of that was a completely manual head-scratching thing that the high priests of SQL would answer for you. Like, when is my data actually going to be ready for analytics?
Starting point is 00:10:57 And it took a long while back in the day. So I think that's another kind of service that's also required. And I want to throw on top of this is that not only has there been this move towards the automation of a lot of these kind of data janitorial services, but now there's the movement from batch to real time, where we're seeing now like a good quarter of some organizations data is streaming data. And that is a relentless advance towards moving from reports that'll be ready in an hour
Starting point is 00:11:26 and reports that'll be ready in a day towards reports that'll be ready in seconds or are always up to date to the second, right? And I think that that requires a whole different level of approach because whenever you're optimizing something, you're optimizing for something and you're optimizing away from something else and the the techniques and efficiencies that you look for in batch data may actually be an anti-pattern for real-time analytics true but actually before we get to the part about real-time analytics let's revisit a little bit because to me and I was just telling Alex earlier before you joined Peter that you know it's been a while that I actually caught up with this space and I feel like I'm
Starting point is 00:12:11 like left ages behind already so I need to catch up so you have to bear with me gentlemen so one thing that's not entirely clear for me and I guess people like me as well is what exactly is the relationship between these alternate let's say these different data table formats that we now have across these lake house products and the way that seems more approachable to me to be able to get a grasp of that is well the relationship with Hadoop that you both pointed out at some point and HDFS and the reason I'm saying that is that well because for a number of reasons first Hadoop was I guess the first platform that well first of all it was at some point sort of synonymous with big with big data and data lakes and all of that and also it was the first platform that actually
Starting point is 00:13:02 introduced this separation between data and compute. It didn't quite go down in the cloud there, which is why it's sort of left behind now. However, I found interesting that there is one sort of connecting thread, let's say, to these table formats that are now prevalent. There's something called Apache Hudi, and apparently, to the best of my understanding, it's sort of Linus, let's say, dates back to Hadoop and its own HDFS, and it's sort of an evolution of that. Is my understanding correct? Is that the case indeed?
Starting point is 00:13:41 I would say that Apache Hoodie and Delta Lake, they both continued, built on what Hive built. So that whole sort of like where there's a directory and the subdirectories that determine the partitioning, they both kind of leaned in on that. So they have more backward compatibility theoretically with Hive. This has some trade-offs. So like Iceberg completely decoupled from that need to have all your files
Starting point is 00:14:00 organized into certain like subpartition folders and all that stuff. And because Iceberg did that, it's enabled possibilities of not... makes migration easier because you don't have to move all your files or rewrite all your files because you can just leave them where they're at. It also allows it for unique file structures, which allow to work around certain object storages like throttling and access limitations. But the nice thing about having everything in one folder is that you know the whole table
Starting point is 00:14:23 is in one folder and it's much easier to, let's say, zip it up and then send it over email to somebody. But I would say, I forget exactly when each one was created. But I mean, Hoodie was pretty early on over there at Uber. And its design is very, very much sort of stemmed or very clearly sort of stems from where Hive was at the time. They've added all sorts of different layers to it so basically you have the the folder and the nested folders and then you generally have the separate folder of metadata that is organized into a variety of different indexes a metadata table a variety of different abstract bloom filters all these different things that are used to help query the data faster i would say like
Starting point is 00:15:02 coding delta both kind of use that fundamental hive structure and then built a more robust metadata structure on top of it iceberg completely just decided to do something completely completely different okay all right and then another thing that's not very clear to me is the relationship of all these table formats with underlying cloud-native globe storage, like S3 or Google Storage and so on. So do they sit precisely on top of those? Yes. So essentially, they're agnostic as far as where you store your data.
Starting point is 00:15:36 So theoretically, you could use them on Hadoop or any storage layer. So basically, what they do is all of these formats, they create a standard for how metadata is written regarding referring to the files. And then that metadata is usually co-located with the files. So in a hoodie folder, Delta Lake folder, even in most Apache iceberg tables, there's a folder, there's a data, and then there's a metadata. But basically, instead of going directly to the data, like a traditional data lake pattern, where I would just go like in a hive table, I would just say, Hey, this folder is the directory. And then the engine would just do
Starting point is 00:16:08 a direct file listing of that folder, and then iterate through all the files in those folders to kind of build the table. What is going to happen with like Apache, Iceberg, Hoodie, and Delta, instead, the engine is going to go read this metadata, and be able to use the metadata, which has a lot of aggregate statistics to be able to determine, hey, which files are even relevant to my query. So basically, it skips the whole process of doing all these file listing operations and can build the list of files it needs to scan before it even touches any of those directories, allowing it for a much more performant scanning. One thing I'm also going to point out is that, for instance, you have some formats that are
Starting point is 00:16:43 optimized for in-memory, right? Like Apache Arrow. And then you have others that are far more aligned towards disk storage, right? And that would be Parquet. And there are attempts now to have common representations of data both in-memory and on storage, right? So that's the kind of the current holy grail that people are chasing is how do we make it such that stuff and this is because of the need to go to tiered storage. You want to have something because your tiers could go from in memory to SSD to some sort of blob storage, right? So that people really want flexibility in where and how they store their data. And if they have to do transmogrifications in real time, that kind of defeats a lot of purposes
Starting point is 00:17:28 of what they're trying to do with tiered storage. So I think that that's the current paradigm that people are trying to push is, is there a best, most universal format? But again, there's always these optimizations and trade-offs because what's efficient to be stored on disk, let's say a highly compressible file, which is great for storage might not be the best thing for in-memory, right? So, so if you
Starting point is 00:17:51 actually get into file formats and there's a whole, there's hours, there's like a Carnegie Mellon level course that's available on file formats. But, but in, in short, again, whenever you're optimizing for something, you're optimizing away for something else. But I'm very eager to see where that kind of universal sort of representation of data is taking us right now. If I think if I put myself in the shoes of users, my primary concern with any of those formats would be, so great. But is it actually interoperable? So can I theoretically at least have, you know, part of my storage on S3 and then another part on Azure storage and then another part
Starting point is 00:18:33 on Google Cloud storage and somehow join all of that together if I use Iceberg or Hoodie or what else? Something else? Yeah, yeah. I mean, bottom line, like as far as like, they're just going to allow the basically that those files, those Parquet files and storage
Starting point is 00:18:50 to be recognized as a table. Now being able to connect to multiple clouds at the same time, that's going to be dependent on the tool. So like a tool like Dremio, you can connect to those multiple clouds at the same time and it can recognize Iceberg tables
Starting point is 00:19:01 or Delta Lake tables on those stores. And then you can join them together and do it just like as if it were just one big database. recognize iceberg tables or delta lake tables on those stores and then you can join them together and do it just like as if it were just one big database now of course there are some caveats with that because when you're you know basically there'll be one cloud that's sort of your primary cloud and the other two cl and if you're bringing in files from all three clouds you might run into egress costs or network costs from transferring those files outside of each cloud so that's always a consideration it's like can you do multi-cloud?
Starting point is 00:19:26 Yes. Do you want to split your files evenly across the clouds? That might be expensive. But platforms like Dremio, Dremio definitely believes in that story where it's not just data lakehouse. Like just having being able to access your data as a database and a data warehouse is a good start, but no one's just going to have all their data in one place. There's always going to be sprinkles of data in other systems or in multiple clouds.
Starting point is 00:19:46 And that's why you need like good federation and virtualization that can exist at scale. And that's the primary sort of like two pillars of what Dremio is doing, providing you this platform for working out with the data lake house, giving that snowflake-like feel on your data lake and to being able to federate and then do that virtualization at scale across different sources. Okay, so it sounds like, yes, you can do it, but the actual integration or virtualization, or however you want to call it, it's not done on the protocol, the table format layer. It's actually done on the platform, on the tool layer, right? Exactly.
Starting point is 00:20:23 100%. And again, what are you optimizing for? There's a reason why there's so many CSV files still in the universe, on the tool layer, right? Exactly. 100%. And again, what are you optimizing for? There's a reason why there's so many CSV files still in the universe. They're not the most efficient way to store data, but they're ubiquitous because every tool can produce CSV. Every tool in the world does CSV. Parquet is going to make it five to eight times more efficient but you know until let's say a microsoft excel standardly outputs data in parquet file format people are still going to be making their csvs right yeah indeed and well so since you mentioned it and you know parquet is obviously that much faster and better than csv
Starting point is 00:21:00 why doesn't the ex why don't the excels of excels of the world support Parquet right out of the... That is a good question. I mean, I guess I would put it like, yeah, I don't know why. I would assume that that feature would be added. I mean, I would say probably just because Parquet is going to come towards a specific need for analytics when you get to a certain scale. So basically, like, you know, if you're dealing with smaller data sets in an Excel spreadsheet, you can a lot of people can go very far with just that because they're not the scale. The size of these files aren't that large.
Starting point is 00:21:36 Then you get to a certain point where the file getting so large, the number of records are so large. It makes a lot of sense to go to Parquet. It's going to operate a lot faster. And then eventually you start putting the data set across multiple Parquet files, and then it's going to make sense. So then apply a table format like Apache Iceberg or Delta Lake on top of that to get the performance. So it's an issue of scale. So I assume if I'm
Starting point is 00:21:54 Microsoft and I'm thinking about Excel, how many of my customers are potentially at that scale, that that's the next feature that I want to add, I assume would be the product sort of perspective, because since it's such a general use platform. But I have to imagine it'd be inevitable would be the product sort of perspective, because since it's such a general use platform, but I have to imagine it'd be inevitable before there's a save as Parquet files in
Starting point is 00:22:10 Excel. Parquet's definitely become ubiquitous at this point. I mean, it's you know, like a lot of platforms, like I'm sure, like I know in Dremio, probably in Snowflake, probably in other platforms, you can literally just upload a CSV and then drop it back down as Parquet files within a few clicks. So, I mean, more and more tools to do
Starting point is 00:22:28 that are coming. Yeah, I would assume, you know, sort of going from CSV to Parquet must be, again, very, very much standardized these days. Yeah, I'm just going to give a shout out to a company called CData. They actually do have a drop-in for excel to save this parquet so you can get as an extension i think that we'll just see how microsoft you know adopts that format in the future but but again you can do it and there's plenty of python jockeys out there that know how to do this the hard way but but you're right i mean this is the kind of thing that it separates the typical office worker from the data scientist. Right. OK, so we've already sort of covered one aspect of interoperability, let's say.
Starting point is 00:23:15 So the cross cloud aspect. And now I'm wondering about another one. So cross format interoperability. So if I have like my Iceberg tables and my Delta Lake tables and my Hoodie tables, is there some way to move from one to the other? Can I switch? Yeah, there's a couple different ways that enable that now. I'll give you some caveats at the end. But bottom line, there's also the tool answer.
Starting point is 00:23:42 So again, you can use a tool like a Dremio, like a Trino. There's a few other engines. All of them, I can't remember off the top of my head, but lots of tools that support Iceberg, Delta, and Hoodie. So then you can just add that tool layer, join them, work with them, transfer them between formats. But again, your tool might not. So is there a way outside of that to do that?
Starting point is 00:24:01 There's a couple of different options. If you're using Databricks, they have in the newest version of Delta Lake, Delta Lake 3.0, there's this feature called Uniformat. Right now, all it does is that you can have a Delta Lake table in your Unity catalog
Starting point is 00:24:13 and you can enable this. And periodically what it'll do is it'll write Iceberg metadata that's accessible. So essentially you'll have not a one-to-one copy, but a close enough copy because if you have like 20 transactions that occur back-to-back to your Delta Lake table, it'll end up batching them and only writing one new snapshot to the Iceberg table or the Iceberg metadata.
Starting point is 00:24:35 So it's not perfect, but it offers some exposure to Iceberg. And then OneHouse, they started this project called OneTable, which creates this utility tool that can transfer any format to any format. But it's just like a one-off transaction. I would say, hey, here's a table, run the tool, and it outputs the file. So that's great for migrations. That would be great for if a platform wanted to build an export tool where it can say, hey, I want to export this data set as X format. That's good for that, but it wouldn't necessarily enable, hey, I'm going to use my table as all three formats all the time,
Starting point is 00:25:08 or they're just completely interoperable. Just because writing that metadata three times would probably introduce too much latency to be doing that, to make sort of like a, hey, let's just write all three every time. And then plus the way they track snapshots can be quite different in certain parts. So for example, like hoodie, you have to have these additional columns for a Hoodie-based primary key. And it's not like an optional thing.
Starting point is 00:25:33 You have to have these two columns in your data, while columns like that aren't necessarily required in Delta and Iceberg. So there's differences between the formats that make it where you still have to kind of work with one, but it's much easier to convert between them. And one other thing to point out is that like these formats have come out of the Apache hoodie and Delta Lake camps and mostly to facilitate going into Iceberg. Just because there has been such momentum in the ecosystem for Iceberg and such growth in the Iceberg ecosystem, everyone's kind of realizing, OK, we need to make sure that our tools can still work with that ecosystem, even though we're not part of that ecosystem. And I think this key architectural question as to whether you leave the tables where they are in the systems where they are, with some sort of federated query, hopefully some sort of metadata modeling that's common across them, schema management. Everybody knows that schema migration is one of the toughest challenges of even running a single system, right? And then trying to do schema management changes across
Starting point is 00:26:29 three different systems that you're querying could be quite problematic, right? So I don't think this is a solved problem. I think vendors are starting to try and make it simpler. But the alternative, the fallback, is even if there's terabytes of data in a system, but it's not formatted the way I need to then ingest that into another system. And for instance, like Apache, you know, does that right where we are focused on real time analytics. But if there's a gold or silver table that you need historical data to combine with your real time data for complete view of your business, you just ingest those tables, right? And so it really depends upon the scale and scope of what you're talking about, the speed at which you need your results.
Starting point is 00:27:13 And so there's not a really a one size fits all best answer right now. And I think that if, when we talk about what we should be driving towards over the next decade, better ways of giving options to consumers, right, data consumers. How do they want to size up these problems? And how can we give them predictive ways to plan for these trade-offs? If you do it with, let's say, a federated query, these are the costs and benefits. If you want to do a kind of like, you know, a hybrid table of real time and batch data,
Starting point is 00:27:46 what's the trade-offs and advantages of that? And I think that people just don't even, we don't even have a grammar to describe those kinds of hybrid or complex data products these days. I think that the whole concept of let's say data catalogs, like we do have the DCAT standard, but a lot of the standards around making a data mesh really happen don't exist yet. And I think that that's the kind of thing we owe consumers
Starting point is 00:28:10 over the coming decade. It's better ways for them to understand where their data lives right now, and better ways for them to understand how to plan for the consumption of their data in the future. Yeah, well, thanks, Peter, because actually, the conversation was sort of naturally flowing towards federated systems. And I was going to ask you precisely to share a little bit about what you do with PNOT and how do you approach this federated issue? Yeah, so I think that the answer for a lot of people is just Trino right now, right? They'll make the federated query with a kind of like a higher level system like a Trino. There's some people that want to make GraphQL the lowest common denominators to be able to query any system so that you can hide and abstract the actual details of what the system
Starting point is 00:28:57 is in the backend so that if you change from vendor to vendor, the programmers, the developers themselves aren't changing their queries, right? But even GraphQL has its own limitations. There are some times where you want to have a native SQL query because there might be a difference in how null is handled between a Postgres and a different database, right? So I think that that's the issue is that there's also this dynamic between having a common API versus having specific semantics to get the most out of your data. Like for instance, I'll just throw this out there, like a graph database, you might really want to do a native graph database query because there's some very cool and sophisticated things you can do there that you wouldn't get out of generic SQL, right? So I really think that that's an evolving thing too,
Starting point is 00:29:48 is how do we federate queries? And when do we stay in our native rich semantics for a specific database design for a specific project? Yeah, well, since you mentioned GraphQL, I think part of the promise of the allure of GraphQL was precisely this being agnostic, but at the same time having this quick and dirty, let's say, interoperability layer that will take your generic GraphQL and then quickly pass it on to whatever underlying implementation you have. And in theory, you can have anything, be it a graph database or SQL or what have you. Yeah, Alex, how is your team taking a look at that kind of concept of federating queries or staying native for semantic, you know, advantages?
Starting point is 00:30:33 Understood. Bottom line is like Dremio, we do federate queries. So like sort of one distinction between sort of like, you know, what you're doing with Apache Pinot and StarTree, that you guys are like federating a lot of these are like really great like real-time sources well like in germio well we don't have any like you know kafka kinesis connectors right now but we do have connectors to pretty much databases data warehouse and data lakes on-prem and in the cloud so this allows people to kind of connect their data across lots of different places and also has a built-in semantic layer to kind of model that data. So generally what Dremio hopes to be
Starting point is 00:31:07 is just sort of like this unified access layer where basically wherever your data lives, you can connect to it, then model it virtually, which kind of gets back to sort of like what you were mentioning earlier, where basically a lot of these database systems now, you're really just sort of dealing with this virtual layer over the actual fundamental data.
Starting point is 00:31:22 Dremio aims to be sort of that data lake house's sort of virtual layer above all those different data sources. So we generally, our architecture lends towards sort of the gravity being on the data lake. So like you're going to get sort of like the best architecture when most of your data is on your data lake in ideally Apache Iceberg format. But there's so many different possible permutations of how you could use Dremio. Like I've seen one way the federation on Dremio has been used a lot is one for on-prem to cloud migrations. So companies will have a do cluster on-prem. They want to move to cloud. But the problem is, for the most part, a lot of the tooling on-prem is completely different than the tooling on cloud.
Starting point is 00:32:00 There aren't really many tools that go through both. And oftentimes tools like Trino have some through both. And oftentimes, tools like TreeDock has some flexibility there, but oftentimes at scale, it can be a little tricky. Like great for ad hoc, little gets a little trickier at very large scale queries. So with Dremio, they would first bring in Dremio connected to the on-prem cluster. They immediately see like the performance benefits. But more importantly, what they do is they create this unified access point. So from the end user, they're just accessing data from Dremio. They don't really care where the data exists.
Starting point is 00:32:29 They're just accessing the views that were curated. They're curating their own views from the data that they have been granted access to. And then what happens is they'll move the data over to the cloud, and then they can just adjust the queries in those views to now query the objects, the object storage copy of that. And then basically, there's a frictionless migration from the end user point of view. So there's no disruption and no having the need to retrain your fundamental end users at the end, making those kind of data movements really easy. So that's a popular use case.
Starting point is 00:33:04 So federation has, one, can allow you to do ad hoc analytics, which is generally what most people think of when they think like federated queries. But again, it enables like all sorts of different types of data movements. It enables data mesh because then your different domains and your different teams can curate from their own data sources. Now everyone has to agree and like, oh, we're all going to store our data here. We can connect those different sources and then curate those data products in one unified semantic layer. So Jeremio really tries to kind of create that one layer that unifies your data where it is and allows you to bring in data from everywhere, deliver it everywhere, and do it performantly. Because there's this feature called reflections that, I won't go too deep into it now,
Starting point is 00:33:35 but it really, really kind of fixes the scale issue when you start trying to do virtualization at scale by basically automating away the lots of the difficult materializations you would normally be doing otherwise. But so Federation is a very big part of what we do and what we think about. Right now it's mostly connected to the raw data sources, so not much real time yet. So that would be like a great place for a platform like StarTree and Apache Pinot.
Starting point is 00:34:00 And then, so there's some great options when it comes. And that's a great thing about the Data Lakehouse. You can use multiple tools for your multiple use cases and bring them all together and be able to have sort of this unified data platform that doesn't require 20 copies of the data. You're operating on a single copy of the data across multiple tools.
Starting point is 00:34:17 And that's this whole modular multi-platform promise. And in that kind of architecture that Alex just talked about, Peter, what would the role of Peanut be? Like doing analytics on incoming real-time streaming data and then sort of
Starting point is 00:34:36 delivering them to the lake house? Well, probably we'd be ingesting from the lake house. So probably what would happen let's let's presume upstream you have an oltp type system right and it's probably being fed from maybe mobile devices or something on the edge right iiot or something so some oltp system is getting information and it's doing individual row based transactions and it's probably doing that it may be a hundred
Starting point is 00:35:04 thousand maybe a million operations per second and And then from that, there's going to be change data capture. And like, for instance, Debezium is a common way of doing change data capture. And Debezium is going to be feeding all of those changes happening in real time straight on into Kafka topics or Pulsar or Red Panda, right? It's going to go into a real-time event streaming system. Now, that in and of itself may not be sufficient. And so you might have stream processing also happening in real time. And Flink is decorating and annotating and sanitizing and doing a lot of transformations on those real-time events, enriching them. And there's probably going to be two consumers. One is going to be the batch data warehouse, you know, running the critical business reports that need to happen.
Starting point is 00:35:50 But those are going to be happening over the span of minutes, hours, or even a day or more, right? And then on top of that, a different consumer would be an Apache Pinot, which is watching for these real-time signals happening at a million events per second. And so part of it will be informed from, Pinot, which is watching for these real-time signals happening at a million events per second.
Starting point is 00:36:11 And so part of it will be informed from, again, probably a silver or gold table from the data warehouse, because it needs to know not the whole complete history of everything, but maybe like the last week or last 30 days of information on what's going on with your enterprise. And so that batch data will be ingested, like StarTree has a data manager, which automates this whole process on top of Apache Kino. And so it combines and makes a hybrid table out of both your real-time and your historical data for a complete view of really important thing
Starting point is 00:36:39 that you need to monitor in real-time. And then for instance, we can do anomaly detection on top of that as well. So when is, and it's not just like a parametric, like, you know, if it's greater than this or less than that, that's very simple statistical, very simple, like parametrics, but we do statistical analysis. So it does this fall within your interquartile range, right? Or is this beyond that kind of stuff? So we use statistical analysis to watch for when things truly are an anomaly. So you're not just constantly hammered
Starting point is 00:37:07 with false positives, right? So that's the kind of thing that's really important for a business who's watching, let's say, their Black Friday sales. You anticipate, you want to see Black Friday go big, right? It should be falling out of your normal Monday to Friday business. And the question is, does this
Starting point is 00:37:26 equal or exceed our last Black Friday? And if so, by how much? And you don't have until Tuesday to figure out what's going on on Cyber Monday. Or if you're delivering pizzas, you don't have until after lunch hour to figure out how many people need to be delivering pizzas during lunch hour. You need to solve some of these problems in real time and there's just no way that you can wait for a batch analytics report. It has to be solved in real time. So I'm trying to wrap my head around how is that different from the kind of processing you can get from you know something like the platforms that you mentioned like Flink or Red Panda or... Sure, just as example, OLTP systems are fantastic row stores, and you can do all sorts
Starting point is 00:38:09 of like row-based caching, and there's tons of caching available when you're talking about row-based stuff. But with analytics, you're probably looking for really fast aggregations, right? And here, you don't have the luxury of just cubing everything all the time and keeping it all in memory you would die right so as an example there's a specialized kind of index for apache you know known as the star tree index which is where my company gets its name and that means that you can do fast aggregations without having to do a full-blown, all-dimensions-by-all-dimensions cubing. And it's efficient, and it's fast, and it's our secret sauce. It's why Apache Pino is Apache Pino, is that StarTree index.
Starting point is 00:38:54 But for instance, if you're trying to get that cab to that street corner like Uber does, they're trying to get a person who's got a pin drop on a map connected with their car, their ride. So there's, beyond just the StarTree index, there's an H3 index for geospatial indexing, right? So flexible indexing on top of data is also a key requirement. Not everything is a forward or reverse index. We got a lot of people doing a lot more kinds of, so the exact same data may be indexed multiple different ways depending upon what you're trying to get out of it are you trying to get aggregations are you trying to get pin drops to match within a certain radii you know all of this is is is the same data indexed different ways
Starting point is 00:39:38 for different use cases at the same time all right Now, let's get back to my favorite actual topic, as you must have figured by now. So, interoperability. We've talked a little bit about how it works or it kind of works, you know, on the table format, let's say, level. But there's also two other very important aspects for which I don't think we have adequate answers, but let's try and address them anyway. So we've talked about some semantics layer, and we've talked about governance as well. So are there is there any kind of way, besides the specific tool kind of layer, let's say, is there any kind of standardized way to address semantics and governance in this multi-table format world? Bottom line, when it comes to like, for the most part, semantics are going to have to generally
Starting point is 00:40:30 come, like a semantic layer is going to generally come from some sort of tool, because generally that's not, the way all the formats work is generally the metadata has information about the table, but not about the overall catalog of tables to be able to kind of document the semantics of all those tables and how they relate to each other at least a standard format of doing that doesn't exist yet there is something like that but only for iceberg tables there's something called a project called nessie which creates this sort of like open source catalog that not that that allows you to not only track and version individual iceberg tables but also but also version views. So that can almost make a semantic layer portable.
Starting point is 00:41:08 Always the issue there is your views are SQL, and SQL syntax of every tool is the same. So even if the SQL is portable via this catalog, there's some challenges there. So it'll be difficult to kind of create like a... There would have to be sort of like a... I mean, I guess ANSI SQL is sort of like, you have to just make sure that everything sticks to ANSI.
Starting point is 00:41:28 But far as interoperability in the catalog, that's getting there. Nessie eventually will have Delta Lake support. It's tried to have Delta Lake support in the past, but the Delta Lake project never took the pull request. But that project is open source and would eventually create that
Starting point is 00:41:43 portability to bring tables of multiple formats. So far as the semantic. Now, as far as governance, same thing also there. Nessie also allows you to actually create governance rules where you can make certain branches accessible to the users. Because basically what would happen is that each user would get a token from Nessie that they use to access the catalog from whatever tool, whether it's Spark, Trino, Gremio, whatever. And then technically, hey, their token may or may not be able to access that item in the catalog.
Starting point is 00:42:13 So theoretically, as the Nessie project builds up, that actually does create a layer where you can kind of make semantics and governance portable. But at the end of the day, at this current state of things, it's still fairly tool-driven. I guess the best you could do is you could also lock down the files. So basically, access rules on your cloud storage,
Starting point is 00:42:33 it doesn't matter what the format's in. If they can't access the files, they can't access the table. You can do that across any format. But you should still do file-level permissions on top of your table-level permissions just for the extra security. But that's kind of where things are at.
Starting point is 00:42:47 The Nessie projects, I think the tool and then competitively like Unity catalogs also trying to do that, but primarily for like Delta Lake tables and try to create like a more portable catalog for their Delta Lake tables. So those would be the two projects that I think are trying to solve that problem. And then Dremio has a productization of that messy open source project called Dremio Arctic, which is our internal catalog that leverages that. So basically, those tables that you create in Dremio are portable to all these other tools if you wanted to. So I was going to give a shout out actually to the federal government. I've been taking a look at what they're trying to do, the UK government there. And in fact, I'll share this link.
Starting point is 00:43:23 It's at the W3C. It's called the DCAT. DCAT 3 is their latest version. And it's an attempt to define a data catalog and vocabulary for data catalogs. And I don't think I've seen it widespread enough. I don't think that businesses have been embracing it the same way that governments have. But they now have, for instance, they have Freedom of Information Act and a whole bunch of other kinds of directives about how to share data products that the government has. And so they relied heavily upon this to say, well, here's a data source that you can get. Here's a bunch of files in a file system. And how do we describe what these data products are? And I think that this
Starting point is 00:44:05 is, again, this is the next thing that I think as an industry we can tackle is better defining both batch data products, but also real-time streaming products where these things might be changing. If we have schema problems managing real-time analytics, then, you know, the data products from those real-time systems also needs to have their own semantics to it, especially as we continue to enrich things. Like for instance, the Kafka topic, when it first gets ingested, may be highly different than the data product that comes out after it's been through stream processing and Flink, et cetera. So this is just an example, I think, of a stab at it. I don't know that DCAT is going to win, but it's definitely an interesting attempt to herd these cats. Because I think that one of the things I have seen in my career here, I've been in Silicon Valley since 1989, is the move away from standard. For instance, even when we talk about ANSI SQL, how no handling happens.
Starting point is 00:45:08 If it was all the same, we'd all handle no the same, right? But even just that is an illustrative of, it just illustrates how far we can gink standards as particular vendors, right? And I think that there's still attempts like here for the W3C and from other standards bodies to make it so that we are interoperable. But I think that vendors, there are billions of dollars at stake. And a lot of these big organizations really do want to make the standard not quite standard so they have an advantage in the marketplace. And I think that that's, if anything, I would like to go back to our old school roots where we had stronger interoperability standards. And I think that the real place where this has to be driven from is the customers themselves. If they said, listen, all you cloud vendors,
Starting point is 00:45:56 all you data warehouse vendors, you all get a room and you solve it and then we'll consume it, but you make it solved, right? I think that if we had stronger presentation from the customers, there would be a lot more drive for interoperability than from the vendors themselves. You're right about decat. And actually, since semantics, and you know, data management and knowledge management happens to be my my cup of tea. This is something I'm familiar with. And there's actually a ton of other like standard vocabularies
Starting point is 00:46:26 that could potentially be used for that purpose but I think it all comes down to what you said it's not really about having lack of technical solutions it's about you know which vendor or its tool provider just going out and doing their own thing basically. And I'm saying that is coming from a vendor. I think we have to be honest with ourselves and with our customers that we need to do a better job as an industry. Yeah, I mean, it's totally understandable why vendors would like to do their own thing for a number of reasons.
Starting point is 00:46:59 It doesn't necessarily have to be like, you know, lock-in. People just generally think, okay, I can produce a better solution to that problem. necessarily have to be like you know lock-in people just generally think okay i can i can produce a better solution to that problem but well maybe but probably the other problem you're generating by doing that is that well in the end there's no interoperability and yeah it's just kind of right well i mean i think that there were attempts to try and make standards back in the 2014 2011 days i took a look at some earlier attempts at this for various elements of the database industry.
Starting point is 00:47:29 And you can't standardize too quickly because then you put a chilling effect on innovation. And there's a tremendous amount of innovation. So if anything, though, what we should be taking a look at is ways to make extensible standards. I mean, the IETF hit this kind of well with Ethernet and with TCPIP, I think they did okay too, where you make it such that your current standards do not preclude advancement and innovation, right? You say, this is the standard for 2023 as is, and we will build an extensible grammar on top of what we
Starting point is 00:48:05 concurrently do. And if you take a look at just, for instance, OAuth, right? OAuth 1.0, that completely got pretty, you know, it got hammered. And OAuth 2.0 is significantly different. There's a lot of stuff that's not allowed in OAuth 2.0 that was in OAuth 1.0. And I think that we should, as data systems providers, we should not be afraid to try and work together. And maybe we'll get it wrong in 2023, but we should try and at least work together. And then if we need to revise for 2025, 2027, and 2030 and beyond, we need to be able to have those kinds of gut checks, interoperability discussions, and do it with an eye for the
Starting point is 00:48:47 benefit of the customer, not just our own bottom lines. We shouldn't be using data formats and query languages to lock people in. Yeah, agreed. I mean, you know, there's plenty of room for healthy competition in terms of implementation and what use cases you optimize for and, don't know marketing even and what have you but that should be not a part of it in my humble opinion. Fair enough. Okay so I don't know if we've actually let's say answered the the initial question at least so where is this six what is the sixth platform going to be like but I think we've addressed a number of questions. And since we're
Starting point is 00:49:25 kind of coming closer to wrapping up, let's just, you know, I'm just going to ask you point blank. So do you think there's going to be such a thing as a sixth platform, let's say? And if yes, what do you think it's going to look like? I would say yes. I mean, we're already starting to see it. I mean, this is essentially Jeremio's whole thesis of there being sort of this open lake house where it's not just operating on the data lake, but several different tools, several different sources, like basically being a unifying layer and all of that. So that way you can have that sort of more modular approach with a nice, easy base on it. So we're seeing customers embrace that, mainly either because they have a need to federate different sources, or they really want to just do
Starting point is 00:50:10 more with Apache Iceberg on their data lake, or, you know, they want to improve their BI performance with the features like reflections, like they are embracing sort of this sort of like, hey, let's do more with our data lake. Let's do more with a variety of tools in a more modular way using open formats like Apache Iceberg, like Apache Arrow, like Apache Parquet, and seeing how all these can unify to create a greater level of interoperability. Of course, still imperfect interoperability, but a much better state of things than in the past. Yeah, I don't know that there's going to be one system to bind all the customers, right? It's not going to be like the one ring. I think that we're still going to see a drive towards clusters of clusters, systems of systems, where you're going to have some things that are focused, again, on OLTP-ish workloads.
Starting point is 00:50:58 There's going to be some OLAP-ish workloads. There's going to be some batch workloads. There's going to be some real-time workloads. And I think that this is going to be driven by the customers to, they're going to want these things to work better together. They're going to want them to feel more like Lego bricks that snap together easily. They're not going to want it to be like an Ikea piece
Starting point is 00:51:19 of furniture that comes with a hex wrench, right? With some assembly required. There's going to be a lot more drive towards the automation of integration, but I don't know that customers want to walk away from the last generation where they were locked into a vendor to a new generation where they're locked into a vendor. They're going to want to mix and match components a lot more than ever before. They're going to want to be able to put in their own innovation where they can because that's a competitive advantage, a trade secret, some way that they're skinning the cat that
Starting point is 00:51:53 is an open source. And I think that we're just going to see a lot more hybridization than we're going to see everybody roll over to a single monolithic model. Yeah, well, I know that that's what I would like to see for one. And yes, I should have clarified when I was, in my mind, at least when I was talking about this notion of the six-plus one, I wasn't necessarily thinking of like, oh, okay, so it's like this vendor that's going to come up
Starting point is 00:52:19 and take over the world or something. No, I was more thinking about along the lines of something that you just described, Peter. Yeah, yeah. And I personally don't think that we have sufficient grammar to describe what that's going to be. I've been calling it federated data systems,
Starting point is 00:52:37 but federation already is a very complex namespace that's already like, again, federated queries has a meaning, a semantic meaning, federated learning in AIML has a semantic meaning. But I, so if other people have a better way to describe the, some people call it a data mesh, but I think a data mesh today is still too conceptual. Like I, if you were to tell a coder, make me a data mesh, there is no model for it, right? There is no standard for it. So I think that we need to come up with a better way of describing this language of interoperability. And again, DCAT's an
Starting point is 00:53:11 example, just one example of how you semantically describe these beasts that we're building. But I think we need to come up with other ways of saying what these systems are so that they can machine to machine explore each other, understand what they store within each other. They know whether it's a dead end to even connect to you or not. And we need to have a language, a grammar for the systems themselves to understand each other. So that's what I'm looking for for the next decade. Thanks for sticking around.
Starting point is 00:53:40 For more stories like this, check the link in bio and follow Link Data Registration.
