The Data Stack Show - Re-Air: The Data Economy: Turning Information into a Tradable Commodity with Viktor Kessler of Vakamo

Episode Date: October 29, 2025

This episode is a re-air of one of our most popular conversations from this year, featuring insights worth revisiting. Thank you for being part of the Data Stack community. Stay up to date with the latest episodes at datastackshow.com. This week on The Data Stack Show, the crew brings you another conversation live from Data Council in Oakland, California. In this episode, Viktor Kessler from Vakamo explores the evolution of data architecture from rigid warehouses to flexible lakehouse systems. Powered by Apache Iceberg, this new approach enables seamless data sharing, governance, and potential monetization. Viktor discusses how open-source innovation is transforming data management, highlighting the shift towards treating data as a product and the emerging potential for AI-driven data exchanges. The conversation provides insights into the future of decentralized, adaptable data infrastructure and so much more.

Highlights from this week's conversation include:

- Viktor's Background and Journey in Data (1:20)
- Evolution of Data Architecture (4:41)
- The Lakehouse Concept (7:12)
- Open Source Innovation (11:05)
- Data Production and Decentralization (15:06)
- Governance in Decentralized Systems (18:53)
- Data Economy and Monetization (21:15)
- Security Concerns in Data Processing (24:21)
- Impact on Data Consumers (27:37)
- Compaction Issues in Data Tables (29:39)
- Open Source Lakekeeper Tool and Parting Thoughts (33:02)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data. RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Hosted by Simplecast, an AdsWizz company. See pcm.adswizz.com for information about our collection and use of personal data for advertising.

Transcript
Starting point is 00:00:00 Hey, everyone, before we dive in, we wanted to take a moment to thank you for listening and being part of our community. Today, we're revisiting one of our most popular episodes in the archives, a conversation full of insights worth hearing again. We hope you enjoy it, and remember you can stay up to date with the latest content and subscribe to the show at datastackshow.com. Hi, I'm Eric Dodds. And I'm John Wessel. Welcome to The Data Stack Show. The Data Stack Show is a podcast where we talk about the technical, business, and human challenges involved in data work.
Starting point is 00:00:33 Join our casual conversations with innovators and data professionals to learn about new data technologies and how data teams are run at top companies. Before we dig into today's episode, we want to give a huge thanks to our presenting sponsor, RudderStack. They give us the equipment and time to do this show week in, week out, and provide you the valuable content. RudderStack provides customer data infrastructure and is used by the world's most
Starting point is 00:01:02 innovative companies to collect, transform, and deliver their event data wherever it's needed, all in real time. You can learn more at rudderstack.com. Welcome back to The Data Stack Show. We are recording live in Oakland, California at the Data Council conference. It's been an awesome week so far. And we have a really fun guest. Viktor, welcome to the show. Yeah, thank you, thank you guys for inviting me. All right, well, we want to talk about all things
Starting point is 00:01:32 lakehouses and talk about lakekeeper and but first tell us about your backup absolutely well first of all i'm based out of switzerland so from europe and i'm happy to be here at data console i am one of the co-founder of a company named bakamo and spakamo is the company which develops a lake keep open source of my iceberg catalog and from my background i'm extremely ex-mongo be yes in X risk management absolutely you know like I'm based previous doubt of Germany in Germany we have a lot of insurance companies and one one of the companies I used to work was Munich re-argo in different companies so Victor one of the topics that I think we need you to clear us up on as catalogs and the
Starting point is 00:02:16 maybe a little bit of a misnomer so we want to get some clearer from you on that and then what do you want to talk about yeah well well catalog from you know like what I hear is a misused word and unfortunately like everything is a catalog. And if you're going to talk about catalogs around Lakehouse, there is a specific technical catalog what you need for Apache Iceberg. And what we would love to talk like from Avakamo is all about the metadata and technical metadata which need to become actionable. Yeah, love it. Awesome. Well, let's dig in. All right, Victor, so excited to connect here in person at Data Council. When you, we were talking a bit about your background. The word you use is that you've worked on ancient systems. And here we are
Starting point is 00:03:00 today in 2025, all the way up to Lakehouse Architecture, Iceberg. What is, what's maybe one takeaway or kind of lesson that you've kept with you all the way up to today, maybe that you kind of learned working on those ancient systems? Yeah. So maybe just, you know, like to mention what is the ancient system? And, you know, like, so some people probably who are going to listen that they don't even know that type of a system. And from my experience, I can go to some kind of like IT archaeology. I just imagine you with a brush and with a whole... Yes, that's something what you can do.
Starting point is 00:03:37 And what I actually mean, it's a super robust system. But I used to work with mainframe DB2 and with something like cobal copybooks, which used to be very useful in 60s, I would say. But right now it's super hard to go. them, even looking at the assembler or something like this. But I'm grateful to have that experience because, you know, like you understand what was the beginning, like punch card. What is the punch card exactly? And now moving on on the stack, you see that there's one thing is, which is the change is continuous. It's not like, oh, now we have an AI revolution. No, we have like all way down to that
Starting point is 00:04:17 punch card to today some kind of a change. And that's under what I learned. That's even like in next couple of years, we're going to talk about some new topic, I guess. Yeah. Okay, so let's, I want to dig into that a little bit, and I love that perspective. I mean, clearly AI is going to have an impact on the way that, you know, the way that a lot of things are done. And, but it is just one of many, if we think about the IT archaeology, right? It's one of many, one of many ages, right?
Starting point is 00:04:49 How do you think Lakehouse Archaeology? architecture, or I guess maybe a better way to put it, would be, is it a fundamental shift? Yeah. Do you believe we're at the front of a fundamental shift, or is it just another component sort of in a landscape? Yeah, so maybe let's just go on a journey of a data platform to understand why we end up with a lakehouse and what is a lakehouse and what type of challenges we're trying to solve here. So in a back days, we had a nice system called data warehouse, and we used all the different databases like Postgres or maybe like SQL server DB2 oracle teradata and they've
Starting point is 00:05:29 been like a monolithic type of a system one box and so it was good to serve the amount of data that time pre-internet era let's put it that way and we could actually store the amount of data we had and so the amount of reports we can serve for that type of a system but the system itself was monolithic and very bureaucratic so it's like you had some someone sitting in an ivory tower who was making a decision twice a year we're going to make an adjustment or a data model and that was like holy grail for that person the db admin yes kiss my hand kiss the ring and i will make the change yeah you need to go to an altar and make some yeah an actual animal sacrifice That was a quite funny time.
Starting point is 00:06:19 At the data temple, right? At the data temple, yeah, we're in the data temple. Exactly. But, you know, things has changed, and especially with internet, we got a vast amount of data, which didn't fit inside of a data warehouse part of the, you know, like you have your schema, you have your star schema, storeflex schema. We've got like all that clicks, IoT, all that different data, which you need to capture somehow and analyze.
Starting point is 00:06:42 It's why, like, after data warehouse, we've got data lake. And data lakes was based primarily on the Hadoop system, and then we got like object storages with S3, where we could store some formats like Avro, Parquet, or C, and we got like up to the petabytes. And that was like in a parallel world to data warehouse, and we had the data lakes with all that file-based systems. The thing on a data lake was that you could not go and make transaction. It was super hard on a schema evolution. All that stuff was very kind of. heart from a modeling perspective, but it was very flexible. So I could now just go, like, stole a bit about the Vega and then take something like Trino, Presto, whatever, different tools,
Starting point is 00:07:26 and just analyze it. And it was awesome. But again, maybe like here, hello from Germany, for GDPI, European type of regulation that someone will tell you, or you got to delete that data. And you don't know how to do that. Yeah. It was super hard. So now we have like two paradigms. We have a data warehouse, very bureaucratic, rigid, and they have a very flexible, chaotic beta lake with height metastore. And, you know, like, you can do that parallel, but at the end, we understood like, okay, we need to do something around that part. So we need just to take that best from both worlds, combine that. And that's why we got a lakehouse, which can serve exactly that pattern. So from one side, we have all the transaction guarantees.
Starting point is 00:08:10 And we have like schema evolution, time travel. level capabilities. And in parallel, we have now infinite storage on a tree where we're going to store like parquet file. The question what you have is it kind of a evolution or revolution. And from my perspective, we have revolution because what we actually got with a lakehouse, we've got an open table format named Ice Prope, which made it data free. So the thing is, right now you can just go and store your Ice Group in whatever place you want. You can store that in a cloud. And now with a cloud repatriation, you can go and store that on-prem. And so you can write that with Spark by Icebrook.
Starting point is 00:08:48 You can use a like five train, all that different tools. And then by reading, you can, again, use all that different tools as well. And the good thing is that you're not stacked with one type of technology, which dictates you how you're going to analyze your data. It's more like, okay, my use case is A, B, C, and that is dictated by business where I need to drive value. And in that situation, I can just go and pick a technology which will drive the best value for me so I can be more competitive.
Starting point is 00:09:14 And that is a revolutionary thing because the data is now in an icebook format free. But they're caveats, their challenges, you need to manage icebergs. So one of the things that I think we talked about this last night, that's really practical but easy to gloss over, is data people have been doing this long enough. One of the reasons that's so attractive is because they've been through these system migrations. Like they get acquired and then they're like all of your technology, isn't this technology. You have to move it to this technology by this day. You spend 18 months
Starting point is 00:09:45 doing it. That's a topic for a cop gem and I accent. You know, like going migration from ED2 to Oracle and then to onto open and to Snowflin. Or you just get a new leader and they decide that you're going to use the new technology. And you can make a whole career of these companies essentially. What did you do? Like I think I just migrated things between my entire career. So that's one of the reason I think it's so attracted to people. Well, okay, one interesting question, actually, John, for both you and Victor, especially as we think about this, I love the concept of archaeology.
Starting point is 00:10:15 I'm trying to figure out how to fit a paleontology. I think you're going to say Indiana Jones, but all right. One interesting, the forces that are pulling these advances have pulled these advances, I think about two main forces, and maybe I'm thinking
Starting point is 00:10:33 about this, you know, maybe my view is too near, but you have this pull of cost, right, where I know I want to do this thing, but it's just way too expensive with the current technology. And then you have use cases, right? I need to do something that I can't do because of limitations here, right? And there's obviously a relationship there. Are those the two primary forces or what are the other forces? What I would love to add to absolutely agree on that two forces. And there's one additional, which is an open source community. That sometimes follows that two forces, but sometimes they have like a different understanding of the world.
Starting point is 00:11:10 And that is a very essential part. And nowadays, you can see that the open source community drives a lot of innovation. And that open source even goes, well, probably they have a crystal ball and they're just trying to see like in the future and develop some stuff that can be used later on by a bank, by insurance. And that is kind of a cool thing, what I would like just to add forces. Yeah, yeah. That's an interesting, that's an interesting dynamic because the first two are almost completely commercial. I mean, you have cost, right? We're trying to manage a balance. We have a use case. We're trying to essentially add revenue through executing some use case of data. But the open source community is driven by innovation, the joy of building things, right? Curiosity. Yeah, exactly. It's like Indiana Jones probably, but for a little. Yeah, totally. There we go. You have the Indiana Jones.
Starting point is 00:12:06 And I mean, I think there's also the driver of light developer experience, right? Like a lot of people are solving their own, like, painful, like, pinpoints. And I've been using this tool. It's awful. Like, this is my life. I have to create something better. Which is the innovation part, but also it's just like a, is it a frustration perspective? Yeah.
Starting point is 00:12:26 Yeah. Can we look at those, let's look at this revolution in those three lenses, right? So we know that cost, I think, is probably the easy. one in terms of that pattern was established by, you know, S3 basically, right? We can store, essentially have unlimited storage that, you know, a very low cost, practically, but there were all these limitations, right? So that aspect is very clear when we think about the lakehouse where there is a cost driver. What about the use case side of it?
Starting point is 00:13:00 Yeah. Maybe even like just to talk about that costs and use case, let's just look at the lakehouse, architecture how it's structured so you have like main three components on a lakehouse the first one is storage which like might write cost or maybe like lower the cost aspect and that is kind of a solved issue with like amazon street it's google and azure or you can go like and use some dell storages some plan it's a commodity absolutely then the second component which is commodity as well it's a compute and we have classically like two type of a compute writes and reads and then on right you have like spark pie iceberg and all the different eKL tools and on reach you have like
Starting point is 00:13:41 presto trino dot db data fusion and that is again like a large list and it's kind of a challenge for our companies to pick a right compute and that can drive the cost up and down depending like on your use case and then the last third component in order to get your lake house alive is well now we are with a word catalog because in all order to create a table, iceberg table, like you have a DDS statement, create table, alter table, you communicate eventually with a catalog, which will execute that to create the metadata layer of Icebrook table. And then your compute will communicate with catalog to understand that meta, data, and then write the parkey file to a street storage. So it's all
Starting point is 00:14:25 distributed right now, and it helps you actually to scale every component on your demand and the use case. So, and that's quite interesting because on a use case, perspective, is right now you have like a classical way we have like a centralized data engineers who are just trying to collect all the data in the one space but what happens in parallel to the organization we decentralized the whole stuff right now like every company want to be a startup and now we have inside of a company our marketing is a startup and sales is startup and everyone is like independent which is actually kind of a not aligned with the way how we treat data and what What we actually need to do here is to think like every department, aka startup, now needs
Starting point is 00:15:13 to treat data as a product and think about like, okay, I'm the one who understands the data. I'm the one who can prepare that as a product and give it to someone. So I'm a data producer. It's my data manufacturing machine and everyone can consume that via through API, SQL, whatever give you like protocols. and now we have like MCP for AI agents and so forth. And that is something what you will look at the use case side. So you have like all that different use case
Starting point is 00:15:41 and they can be solved by teams or data domains itself, but not century. And that is a different kind of a trend, what we have here. Yep. We're going to take a quick break from the episode to talk about our sponsor, Rudder Stack. Now, I could say a bunch of nice things as if I found a fancy new tool.
Starting point is 00:15:57 But John has been implementing Rudder Stack for over half a decade. John, you work with customer event data every day and you know how hard it can be to make sure that data is clean and then to stream it everywhere it needs to go. Yeah, Eric, as you know, customer data can get messy. And if you've ever seen a tag manager, you know how messy it can get. So Rudderstack has really been one of my team's secret weapons. We can collect and standardize data from anywhere, web, mobile, even server side, and then send it to our downstream tools. Now, rumor has it that you have implemented
Starting point is 00:16:30 the longest running production instance of Rudder Stack at six years in going. Yes, I can confirm that. And one of the reasons we picked Rudder Stack was that it does not store the data and we can live stream data to our downstream tools. One of the things about the implementation that has been so common over all the years and with so many Rudder Stack customers is that it wasn't a wholesale replacement of your stack. It fit right into your existing tool set.
Starting point is 00:16:57 Yeah, and even with technical tools, Eric, things like coffee, or PubSub, but you don't have to have all that complicated customer data infrastructure. Well, if you need to stream clean customer data to your entire stack, including your data infrastructure tools, head over to rudderstack.com to learn more. John, you asked about the term catalog. Let's dig into that because it is, I, Victor, you, when you were talking before we hit record, the term came up and you got a nice, sly grin on your face, and you're chuckling now that John you had some questions about that term i think it'd be helpful to kind of overlay so most of
Starting point is 00:17:37 our listeners will be very familiar with let's say postgres right to overlay what postgres kind of bundles for you and then let's look at that in this new architecture and talk about the different layers and what's happening and then talk about the names like the misnomer around catalogs yeah yeah so you know like if you look at the like let's take postgres so it's a box which has everything it's storage compute it takes care of your table life cycle, it takes care of access management. But what happened eventually that someone took a tollhammer, as my co-founder would say, and just 8,200 Postgres, and it full apart. And now if you look at storage from a Postgres, you have S3.
Starting point is 00:18:18 And then if you look at compute, you have something like Spark. And if you look at exactly that part which managed the Postgres, so you as a user or whoever can communicate, that is exactly the catalogue part with what you can actually call information schema where you have like your tables views you have some objects inside of your information schema and that's exactly what we call a catalog in lake house and there are some like benefits of that type of architecture but we need to think about like how we're going to manage the governance in that case because at the end it's not a single system which controls who is writing and who is reading
Starting point is 00:18:58 Now you have, like, again, getting back to that startup type of organization, marketing uses by iceberg, sales uses snowflake, and then how are you going to give access to your table, who is going to read, who is going to run? And then there's, I think there's also the use case of data sharing between business units or between partners or between vendors. Like, I think that's going to grow as well. That's a topic for itself. But that's, you know, like you touch something.
Starting point is 00:19:23 So like, when 99% of my discussions, like, okay, we have a big company and we would like just to build a lighthouse. And then there are sometimes discussions, okay, let's zoom out on a supply chain. And like I'm from Germany, we have like manufacturing cars, automotive, pharmaceutical, and then in that supply chain, you have like thousands of different suppliers.
Starting point is 00:19:42 And now is the question. Let's assume I'm continental, I'm producing tires. And then you're a Mercedes. You're building a Mercedes and then you buy Mercedes and you drive-drawn Mercedes. So me as a continental, I have an R&D department who made an assumption about, like how you're going to drive in san francisco and then someone drives in i don't know in a different part of that country and me as continental i would love to get that data back cycle to understand
Starting point is 00:20:09 that somewhere where it's like he's foreign guides plus 70 and someone who is by plus 30 is a different type of tires what i need or like a robber on my go and that is kind of a question what right now is kind of unsolved because my R&D tries to predict, but getting back exactly to that zoom out on the supply chain, we need to build a sharing, and not just sharing of the data, we need to govern that sharing. And there are two aspects to that, and especially Lakehouse can solve that because Lakehouse offers us some sort of a no-copy architecture. So I store that in an S-3, and then I can give access to S-3 to all the different partners.
Starting point is 00:20:50 But I need somehow to manage what the purpose of reading of that data, who is going, I want to read that data. I need to audit all that different reads. And therefore, I need a data contract, but not like a PDF and a Wikipedia page. I need to have part of a computational. Not in DocuSign, right? Well, you can try and try. But that's going to be hard, especially like if you want to automate the whole stuff. And if you look at in the future, right now we have in that process of sharing humans. And I can call, again, John, and ask you, like, can I get your data? But in the future, we will have like, AI agents and they need a way to automate the whole process. And that process cannot be done
Starting point is 00:21:30 just, you know, like on the phone. I might call each other, maybe. But I think they expect to have MCP typeish protocol just to negotiate on the way, how are we going to use it. And the funny part is if you have like that supply chain, you might ask yourself, okay, so now I'm like producing a data product. And then I have someone who just want to consume it. Inside of organization, outside of organization. So can I put a price take on my data product? So can I just drive the value from that? So we can actually go and then say, well, now we can actually create a data economy
Starting point is 00:22:05 because now we can sell data products. And that's how data becomes oil, wheat, or whatever type of a commodity. Well, I think there's this interesting thing that we touched on that is part of the evolution of is that separation between storage and compute, right? Super important part of the evolution. and essentially all of the cost is emitted you. The storage is almost free. Like, not quite.
Starting point is 00:22:30 There's a same level where you can wrap up some costs. But for most companies, it's almost free. If you're in the majority of companies that don't have that much data, it's very cheap. And then I think you touched on this too. You've got like, okay, so I've got all the storage and then what up the person that's asset, like you have to handle the governance,
Starting point is 00:22:48 but then the person accessing the data brings their compute. So there's this interesting cost dynamic here too where like it's just there's an easiness to like yeah like you're going to have out to the state that we handle governance you bring your own compute and then from a cost standpoint like you're paying for whatever you're using. I think that's an interesting. Yeah maybe maybe that's that's super awesome because well first thing about that storage doesn't cost much well just try to count how much do we need to pay like for S3 to store a petabyte. The new terabyte is a petabyte. That's kind of like with situation nowadays. And there is this estimation that we have 150
Starting point is 00:23:26 zabytes of data stored and the estimation by 2030 is going to be like 2,000 zabytes. So you can just say that. So someone who is building their business on storage, it's a good time. Yeah. It's been a good time for a long time. Yes. Yes. So it's not going to be that cheap anymore I guess and therefore we need to drive value. But the good point is how to to use the computer on a different way. And there is two ways. So you can go on and use your NDP engine or you can think about like if majority of selects or reits is not that big it's like one gig yeah right and the question is why not to use your own laptop with something like
Starting point is 00:24:06 doug db beta fusions and especially with the power of the the individual machines now absolutely absolutely and that's something what we think is really in a laykeeper why not just to take something like duct db or data fusion use a wasm embed that in a browser and you just go open the browser and then you can write your query which will use your computer on your local machine go through catalog to rest read the prokey file and you will get your results so it might be not a millisecond response it's a couple of seconds but if it does the job so why not to do so what do you think about in that architecture like obviously people are going to have security concerns because i think they probably have a little bit of a false sense of security when everything's like processed on secure servers But, you know, and then now it's like processing it local. Like, how do you think that is going to be approached, the security challenge? Yeah. So the governance is a very hard topic.
Starting point is 00:25:02 And he's like kind of a question, who is going to issue the key of access? Yeah. And if you look at the organizations and that they are very free in choosing the tool, so usually if you go to like the large enterprise, they have everything. Yeah. And then there's a poor guy, CISO, who needs to. just to say, like, it's all secured, don't worry. There's no data bridges.
Starting point is 00:25:26 Right. Yeah. And then this is... You said, we have everything and don't worry. There's no data breach. Yeah, that's kind of a tricky question. And now looking at Lakehouse, it might be my biased opinion, but I think that there are only one place where I can just say that a person, a group, a tool can read, a right,
Starting point is 00:25:50 is a catalog because the catalog is the same like in the Postgres Postgres you will say okay lateral can read and the same in a catalog
Starting point is 00:25:57 and it's actually what we do we connect to the IDP and that was one of the decision of Playkeeper team so we're not
Starting point is 00:26:04 going to be at IDP so we're not issuing any tokens whatever we can just connect to Entro ID Octa Kit log and then we're going
Starting point is 00:26:11 to use that token and the inside of Playkeeper we have authorization concept based on Google Zanzibap paper we use OpenFGA so we rebag
Starting point is 00:26:20 a subset of a bag or our bag whatever bag and we can actually manage inside of catalog and say okay group a can read the table and person b can write to the table and the catalog is replaced and then it doesn't matter which tool all of them will go to us and say okay i am a person and we can actually solve that problem for a CISO yeah very cool yeah i was thinking about the CISO and the concept of IT archaeology that's not the type of dig you want to but i mean that is a pretty strong selling point around security because it's like we we just drastically simplified very big problem yeah and maybe just what i would like to add because i have a lot of conversation for
Starting point is 00:27:06 companies is like okay so from a governance perspective you have like a security but there is an additional concept which which is well companies trying to avoid that but let's assume the situation that I'm an owner of a table and I let's take a customer table and someone is using that table but I have usually no idea about that person that they use a table customer for whatever purposes but I hope that the purpose is just to build a report and go and make a business decision which will low the cost or it's just get some revenue in a company and if you go to the enterprise you will find like that situation that we have 100,000 of tables and the owner has usually no understanding who is using what type of the table. But what I can do as an owner,
Starting point is 00:27:50 I can go and make an alternative. So from security, from Airbag, it's all solved. But that will cause a problem on your side or a consumer side. Because your report is not working anymore, it will break. So you are impacted on a business decision. And from a governance perspective, that's a very eventual kind of, it's a very important part because what we need to do is somehow to solve a problem that the consumption pipeline is unbreakable, that business is not interrupted. And that's exactly what we built inside the Playkeeper, the interface that communicates with a contract engine, which means, let's assume I'm running Altotable. So the Airbag will tell the Eurovd Administrator, go for it.
Starting point is 00:28:34 On a second step, there is a business constraint inside of data contract, and there is an SLO stable schema, which actually prohibits me to do that type of operation. laykeeper will examine that as a law tell me there's a conflict. On the next step, the laykeeper can inform every consumer in me that there are two personal, one person who uses that product. So what I can do now, I can go and churn the contract and say, okay, someday, there is grace period, change your report, adapt to the new system. And that's the way how we can achieve that, that the pipeline, the consumption pipeline will become unbreakable. And going forward in couple of next years if there's no like a human that process but the eye agents and that will
Starting point is 00:29:18 help them you know like just to retail so what do you think the biggest practical barrier is to adopting that like today to adopting this architecture and then maybe where does that look like next year in a few years yes so the i would say from a lakehouse perspective is that it's still brand new and we miss a couple of things now what do you think the biggest missing things like for people that are like I really want this yeah from a technical perspective the missing part is the optimization of compaction of a table so it's very hard issue right now so because you know on a day one you start insorting in your table all is good on a day two you're trying to run your report it doesn't work anymore because the table has too many
Starting point is 00:30:02 small files and I think yesterday was like LinkedIn presenting some data about like compaction how hard that issue actually is and which means you need need to go on day two and run a compaction where you're going to take all that small files, let's say 10 small archive files and write one big file. Because it's not just the performance, it's costs, you know, like every get and list on S3 cost you money. Yeah. And if it's like 100 gets instead of one get, so you have like a different cost bill at that. And that is a very hard issue right now. So to solve the compaction, I know like a lot of companies trying to do that. And so well again, from a catalog
Starting point is 00:30:42 perspective. I think catalogs is the best place. A way to just tick the box and say like, okay, that table should be optimized one today. That's it. I know we're getting close here, but I want to ask about something you mentioned at the beginning, which is fascinating. So this architecture enables a world where you use supply chain, for example. So tires, you know, car manufacturer and then the actual, you know, John driving the road in San Francisco.
Starting point is 00:31:09 Actually, I guess technically, really, the future is that you're not in the car or you're not driving. You're just sitting in the car, you know, but it's still, you know, rubber around the road. What's interesting to think about if we, you know, there's all this technology underlying that. The catalog enables, you know, all these interesting ways to execute contracts between multiple different parties. But what we're talking about is an economy where products are being exchanged, right? like there's an exchange of goods, it's just that it's data, and the architecture actually enables that. How do you think that economy will form in terms of the actual format of the transactions,
Starting point is 00:31:49 right? Because there is this really fascinating set of commodities that are currently not monetized because the pathway to monetization is very inefficient. Like, it is actually a ton of work. There are security considerations, right? but the future you are describing is that we now have an architecture that can create efficiencies there and so what's the mechanism that's actually going to enable the exchange of goods well i think we still have to develop some stuff because when i talk to companies and they would like to share some
Starting point is 00:32:24 data and that is a misconception you shouldn't share data you should share the data product yes And the data product is a bit more than just a raw table. And so that is a piece which we don't have at the moment. So I know a lot of startups trying to build something around the data product. Because in a physical world, you don't want to buy plastic you would like to buy a product, right? That they can use. And if we have that piece, then we can think about like what type of platforms we can use to exchange for goods. Is it going to be like the Amazon for data products?
Starting point is 00:32:58 Right. Is it going to be a NASDAQ for some sort of like a commodity exchange and so on? So there are a lot of new stuff coming up in the next five to 10 years around the dating up. Yep. Man, that's going to be really fascinating. Okay, Victor, we're at the end, but tell our listeners where they can find out about Lakekeeper and it's an open source. It's an open source tool so they can go try it out.
Starting point is 00:33:21 Yeah. Well, everyone is invited just Lakekeeper.com. And then you will find actually the whole information or just go to the GitHub, up, try it out, give us a feedback. We're building that not in a bubble. So everyone needs just to try it out, give us a feedback. And if you like it, give us a star. It would awesome just to get a star. And we open for contribution. So if you want to develop a feature, you're welcome. Great. Awesome. All right. Well, thank you so much for joining us here in Oakland, Victor. Thank you, guys. All right. That's a wrap for episode four here in person at Data Council.
Starting point is 00:33:55 So I'll stay tuned. We've got more coming your way. The Datastack show is brought to you by Rudderstack, the warehouse native customer data platform. Rudderstack has purpose built to help data teams turn customer data into competitive advantage. Learn more at Rudderstack.com.
