The Data Stack Show - 92: Building a Decentralized Storage System for Media File Collaboration with Tejas Chopra
Episode Date: June 22, 2022

Highlights from this week's conversation include:
- Tejas' background and career journey (2:49, 43:04)
- Digital collaboration with Netflix Drive (7:57)
- A formal version control component (23:44)
- Centralized store vs. local affairs (31:05)
- The different skill sets a data engineer needs (37:38)
- How to get into data engineering (40:57)
- New technologies coming into day-to-day work (44:39)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by Rudderstack, the CDP for developers.
You can learn more at rudderstack.com.
Welcome back to the Data Stack Show. Today,
we're talking with Tejas from Netflix. He is building Netflix Drive, which is a fascinating
system that has enabled Netflix artists and employees around the world to collaborate
on media files. Super interesting. Kostas, I am really interested to know, working in the cloud is so second nature to most of us, right?
When you think about Google Drive or Dropbox or whatever, right?
Even files you can easily share on your phone, right?
It's just so natural.
And so I want to know what it was like
before they started building Netflix Drive.
And I know the pandemic was a catalyst for that,
but that's going to be my question. What was the workflow like before? And then how did they start to undertake migrating
that into the cloud? How about you? Yeah, it's a very good opportunity to discuss with
an expert, actually, what this whole thing about local first is when it comes to building
application experience and product experience.
So there's a lot of conversation and noise around that stuff, more on the web application
space.
And I mean, we see a lot of that happening actually in applications that are like Figma,
for example, where you can edit things
and you can collaborate online and be also offline and then continue working from there.
There's like a bunch of applications like
this that assume this kind of, let's say, local-first experience.
So it would be great to talk with him and see what exactly it means on the back end for that, and what it means to try and do that not at the scale of a small image with two or three people, but at the scale of Netflix, right, where you have huge media files and very complicated workflows.
So yeah, I'm very, very excited to chat about that stuff with him. All right. Well, let's do it.
Let's do it.
Tejas, welcome to the Data Stack Show.
We are so excited to chat with you.
Thank you.
And it's a pleasure being here to meet Kostas and you as well, Eric.
So thank you for having me here.
Absolutely.
Okay.
Well, we always start in the same place.
So tell us about your background and then how you ended up at Netflix.
Sure.
Yeah, so I actually grew up in India and came to the US around 12 years ago.
Did my master's from Carnegie Mellon University and started working in the Bay Area at some
smaller companies.
My focus has always been on backend systems, low-level operating systems.
That's where I started working.
And I was writing GNU debuggers.
So a lot of debugging for processor cores.
Through acquisitions, I went through several companies and then I
worked at a startup called Datrium.
And Datrium was trying to revolutionize how we think about storage,
about virtual machine storage.
And it was the one-stop shop for not just primary, but backup use case as well.
There I worked on file systems.
So I wrote, I helped write a file system and some components of it and some data
management primitives, like snapshotting, replication, all of that.
And then after Datrium, I got a job at Box.
And Box is a pioneer in content management, cloud content management.
So I was working there, building a lot of the services to power petabytes of data on the cloud.
Intelligently placing data on the cloud and leveraging a lot of techniques for on-premise and cloud storage and developing solutions around that.
At Netflix, I started working around two years ago.
And my focus has been mostly on something called Netflix Drive.
And the way to look at it is it's like Google Drive, but Google Drive is for your files and folders.
Netflix Drive is for media assets.
So when you think about Netflix, you think about the great movies that you watch.
And we also produce movies. We also make movies through Netflix studios.
Now, when you make
a movie, you have artists that collaborate to work on a movie, the visual effects, the animation side.
Typically, they used to go to the production site and work there. But with the pandemic gripping our
world today, they work from their home. So how do you build solutions that can give them the same
experience of collaboration? That is something that Netflix Drive enables.
And that's been my focus at Netflix as well.
So that's how I got into data.
That's how my journey has been so far.
Wonderful.
Okay, I want to hear about Netflix Cloud.
But first, and our audience knows that I always do a little bit of LinkedIn stalking.
I noticed that you were a software engineering intern at Apple very early on.
And I think on LinkedIn it said 2011, which was a really interesting time because the iPhone had come out and widespread worldwide adoption was happening.
So I just am super curious, what did you work on there and what was that like?
Absolutely.
So Apple, I was a part of the media IMG group then, which is image and multimedia,
if I remember it correctly.
And a lot of us were working on applications such as FaceTime and a lot of like processor cores that were being licensed by Apple.
How to do testing for those cores.
So we used to get processor cores from external companies. If I
remember correctly, it was Imagination. And they also used to provide us software. But their
processor cores, how do they fit on the Mac or the iPads and whether they render the image correctly
or not, that had to go through a rigorous process of testing and validation. And one of my first
jobs was to validate. So I used to write kernel extensions
to validate those processor cores on the iPads. That was my job. But some of my peers were working
on initial versions of FaceTime at that point in time. So it was really a fun time. It was
right around the time when Steve Jobs was still around. So we did bump into him a couple of times in Apple.
It was, I mean, life was very different back then.
Like, and it was, I guess technology was there, but Bitcoin wasn't there.
So, or at least I didn't know about it. So Apple was the craze.
It still is the craze, but it was, it was such a, such a great feeling to, you know,
be in college and work for Apple.
So I was, I was really in a happy space. And that was my first
time in California. So when I landed in California, I remember I saw everything golden. And I thought,
this must be heaven. It's just so beautiful. It is so beautiful. So I remember that feeling very
well. Yeah. Okay. Was Steve Jobs wearing a black turtleneck when you bumped into him?
Oh, yeah. Oh, yeah.
Perfect. That may be the best thing that I've heard in a long time. That's so great.
Yeah. Yeah.
Okay. So Netflix Cloud, this is what I'm interested in. It's really interesting, I think,
for me and probably some of our listeners to hear something like Google Drive or collaborating on business files or documents
or, you know, whatever, code, all that sort of stuff. It's so second nature now for anyone who,
you know, working around technology. And so it's a little bit funny to say, like,
imagine Google Drive, right? It's like, well, yeah, I mean, isn't that just how people work?
So yes. Can you explain this? Like, what was the infrastructure like, and how did people interact with it before the pandemic?
Sure. When it comes to how we collaborate and work, it's second nature to all of us.
We don't even realize the things and services we use.
But when it comes to movie making, you have a camera that captures a movie.
But the movie, when it is captured, is very different than the movie that you see on the
screen.
And there are so many things that go behind the scenes.
There's like cuts, edits, rendering.
There are so many different variations of the movie based on your device type.
So all of those processing,
pre-processing, post-processing activities
on a recorded image or a recorded movie
happens behind the scenes, right?
And you have a lot of camera footage that gets collected
and only 1% actually makes it to the final cut.
So, with that amount of data to transfer, typically, earlier, artists used to actually go to the production site, just because you can avoid that transfer of data and the time it takes to transfer that data. You can directly work there, and then you can actually, you know, have the final iterations that you can work off, and you can use the cloud. Let's say you worked on an image, you posted it to the cloud using Google Drive.
Let's say a small image, right?
It's like a photo.
And then you can share it with some other artists that wants to like either add some
color or some other edits to that image.
But the problem is at scale, this doesn't work.
Google Drive has limitations, right?
It only allows some thousands, tens of thousands of files.
And you know, you when you are an artist and you have a huge corpus of data, you
want to have the ability to just work on assets that you care about.
So surfacing the right assets on your purview is very important.
So you need that control when it comes to data.
It's not just you show all the data to everyone.
You need levels of control.
You need access levels, authorization, all of that to be built in. And those primitives,
those security primitives are lacking in Google Drive because it's just imagined to be a file
sharing service. So we wanted to take that forward. So that was one thing that was a problem.
The other thing is when artists work from their homes, they work from different machines.
You have Photoshop on one machine, you've configured your brush size and everything,
and you're working on an image.
Say you just close your laptop, you go to another machine.
You want to have the same image with your same settings persisted, right?
All of these fit in some files in these applications.
A simple way is you have those files, you put them in your email or your Google Drive
and you bootstrap the application on the other machine with those pulled in. But this can all be made
seamless. You can actually run Photoshop off a shared cloud drive that allows you to sync between
machines, sync with other artists, collaborate remotely with other people. That is what the
vision was for Netflix Drive.
And it's just one part of the equation.
There are so many other things that it can enable
because right now we are talking about your machines,
like your Mac OS, Windows, your Linux boxes.
So it has that component
where it has different OS versions that it supports.
But also, if we move away from media,
if you think about any form of sharing,
not just media and not just studios, Netflix Drive, if built correctly with that vision, can actually solve all of those issues. It's a superset of Google Drive. We've designed it in a way where you can plug in any metadata store and any data store on its backend. So it's an abstraction layer. We can plug in a cloud database and a cloud data store, we can plug in an on-premise database and an on-premise data store, or we can plug in a hybrid one and it'll work. We plan to open source it so people can actually use it. Currently, the first version that we've built internally works with S3 as the object store, and it works with CockroachDB as the metadata store. So we have a layer on top of CockroachDB and a layer on top of S3, and that takes care of the first version of Netflix Drive. But that's the vision with Netflix Drive.
Fascinating. Okay,
Kostas, I'm going to hand the mic to you. I usually monopolize at this point in the conversation,
but I'm so interested to hear what you're going to ask, especially because you just mentioned CockroachDB.
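As a rough illustration of the pluggable backend Tejas describes (a drive layer that only talks to a metadata-store interface and a data-store interface, so cloud, on-premise, or hybrid implementations can be swapped in), here is a minimal sketch. All class and method names are hypothetical, not Netflix's actual code:

```python
# Minimal sketch of a pluggable metadata store + data store behind one
# abstraction layer. A cloud deployment might back these with CockroachDB
# and S3; an on-premise deployment would pass local implementations.
from abc import ABC, abstractmethod


class MetadataStore(ABC):
    """Maps logical file paths to backend object identifiers."""

    @abstractmethod
    def put_entry(self, path: str, object_ids: list[str]) -> None: ...

    @abstractmethod
    def get_entry(self, path: str) -> list[str]: ...


class DataStore(ABC):
    """Stores opaque blobs addressed by object id."""

    @abstractmethod
    def put_object(self, object_id: str, data: bytes) -> None: ...

    @abstractmethod
    def get_object(self, object_id: str) -> bytes: ...


class DriveBackend:
    """The drive layer is written against the two interfaces only."""

    def __init__(self, metadata: MetadataStore, data: DataStore):
        self.metadata = metadata
        self.data = data

    def save_file(self, path: str, content: bytes) -> None:
        object_id = f"{path}:v0"          # placeholder id scheme
        self.data.put_object(object_id, content)
        self.metadata.put_entry(path, [object_id])

    def read_file(self, path: str) -> bytes:
        object_ids = self.metadata.get_entry(path)
        return b"".join(self.data.get_object(oid) for oid in object_ids)
```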
Yeah, we had the pleasure to have an episode with them some time ago. So obviously a very interesting database system.
But I mean, I was about to ask you, actually, if you have any plans to open source anything, but you answered that already.
So I have to find another question, which is:
after it gets open sourced, are you going to start a company?
I think so far, we're not thinking that far ahead.
We have a lot of plans with Netflix Drive.
Open sourcing is one.
Trying to see the different applications it can support.
Building the different abstraction layers.
And one other thing with Netflix Drive is when you think about a file system, right?
You think about your local machine.
You have reads, writes, all of these calls that happen.
Netflix Drive not only does that, it also exposes APIs.
So you can actually call APIs on a live file system.
And these APIs are used to actually enable workflows that are built on top of Netflix Drive.
Like I said, to surface the right files, hydrate a new machine with just a subset of the files from your older machine.
And we are using Netflix Drive actively in different types of ways.
We use it in animation.
We use it in rendering.
We use it in users' home directories.
So your MacBook, all of the files on your Mac can actually run off Netflix Drive,
and you can go on your other machine and it'll just surface all the files up.
So APIs allow that.
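To make the "file system that also exposes APIs" idea concrete, here is an illustrative sketch with hypothetical names (not the real interface): normal reads and writes work against the mounted drive, while control-plane calls let a workflow do things like hydrate a new machine with only a chosen subset of files.

```python
# Hypothetical sketch of a drive that is both a file system and an API surface.
class DriveMount:
    def __init__(self, namespace: dict[str, bytes]):
        self.namespace = namespace     # full set of files known to the backend
        self.local = {}                # what is materialized on this machine

    # --- file-system-style calls -----------------------------------------
    def read(self, path: str) -> bytes:
        if path not in self.local:                 # fetch lazily on first touch
            self.local[path] = self.namespace[path]
        return self.local[path]

    def write(self, path: str, data: bytes) -> None:
        self.local[path] = data
        self.namespace[path] = data

    # --- control-plane APIs, callable on the live file system ------------
    def list_namespace(self) -> list[str]:
        return sorted(self.namespace)

    def hydrate(self, paths: list[str]) -> None:
        """Materialize only the assets a workflow cares about."""
        for path in paths:
            self.local[path] = self.namespace[path]


backend = {"show_a/shot_001.exr": b"...", "show_a/shot_002.exr": b"...", "notes.txt": b"..."}
drive = DriveMount(backend)
drive.hydrate(["show_a/shot_001.exr"])    # a new machine gets just what it needs
```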
Yeah, I have a question because, okay, like these cloud file systems have been around for a while, like Dropbox, Box, Drive.
And, okay, they are
designed for sharing,
but not necessarily for
concurrently working over the
data.
Exactly.
That's right.
And that's something that,
I mean, okay, it would be
awesome if you can do it. I remember
the first time that we, in my first company,
we had Dropbox to share data between the founders there and the employees.
And we were like, oh, it looks like a file system.
Let's start editing the same thing at the same time.
And then it was a bit of a mess, to be honest.
It's not exactly like...
Then we understood that, okay, this file system is not built
as a network file system, like those that have been used in the past.
So is this something that Netflix Drive can do?
It allows, like, it's designed with this in mind?
Yeah, that's right.
Because a lot of other things that we work on are creative iterations where artists actually work on, you know, drawings which require
strict requirements of latency and experience.
So Netflix Drive uses tiered forms of storage.
So it works with your local files, your local storage.
It will cache the files in your local storage to give you the great performance.
But it also allows you to have tiered intermediate storages before cloud.
And the way to think about this is,
let's say you're an artist.
You work on 100 iterations of something,
and then you're like, aha, this one is what I like.
You know, this is the one that I want to collaborate
and share with someone.
But those 100 iterations,
if you just build a system like Dropbox,
where you have cloud and you have your local machine,
and you don't have tiering,
all of these 100 iterations will go to cloud, right, because you don't have enough space on your machine. So you're paying for that cost, that is, the storage cost in the cloud, the request cost, or whatever cost you pay for cloud, and then you'll have to delete files, but you are still paying something. But having these tiered forms of storage, you can actually have those 99 iterations sit in the middle, and only the final cut can actually, by the use of the API, make it to cloud, so that the other artists can collaborate with you on that. Now this is unique. This control is not there elsewhere. They probably have tiered storage in the background, like Box and Dropbox probably have it, but the user cannot control it.
Yeah.
Netflix Drive allows the user to control it so that they can actually build different types of applications on top of it.
This is a very, very simple example of iterations, but like, let's say you,
some files are temporary in nature.
You do not want these temporary files.
You want the files to be around for some time, but you don't want them to like go to
cloud because you don't want to pay the cost.
You can still use the intermediate storage pods and store these files there. And Netflix already has these storage pods around the globe, so we actually can bootstrap these different caches and media stores relatively simply, and we can provide the same experience. And I'll explain why we also need this. Just to take a step back.
Many locations do not have great accessibility to cloud.
So the pipe between cloud and their machine is not that wide.
So they don't get great throughput and bandwidth.
But Netflix has its own back channel of great throughput and bandwidth by the use of Open Connect and our CDNs. So we actually leverage that high network throughput that we get to stand up these
intermediate storage locations that are closer to the artists.
So if artists are working in LA, they do not have to push their files to cloud.
They can just work off the media stores or storage locations in the middle.
And most of the files can be surfaced from there.
So this gives great performance again.
So performance was the main reason why we designed it in a tiered form.
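As a hedged sketch of the tiered-storage idea just described, the following toy code shows writes landing in a local cache, intermediate iterations parked in a nearby storage pod, and only an explicitly promoted asset paying cloud costs. The class and tier names are illustrative; the dicts stand in for real stores:

```python
class TieredDrive:
    def __init__(self):
        self.local_cache = {}        # artist's workstation
        self.intermediate_pod = {}   # regional storage pod close to the artist
        self.cloud = {}              # e.g. an S3 bucket

    def write_iteration(self, path: str, data: bytes) -> None:
        # Every working iteration stays local; nothing is pushed to cloud yet.
        self.local_cache[path] = data

    def checkpoint(self, path: str) -> None:
        # Spill to the nearby pod so other machines/sites can read it cheaply.
        self.intermediate_pod[path] = self.local_cache[path]

    def promote_to_cloud(self, path: str) -> None:
        # Only the iteration the artist explicitly selects pays cloud costs.
        data = self.intermediate_pod.get(path) or self.local_cache[path]
        self.cloud[path] = data

    def read(self, path: str) -> bytes:
        # Serve from the fastest tier that has the file.
        for tier in (self.local_cache, self.intermediate_pod, self.cloud):
            if path in tier:
                return tier[path]
        raise FileNotFoundError(path)


drive = TieredDrive()
for i in range(100):
    drive.write_iteration("shot_042.exr", f"iteration {i}".encode())
drive.promote_to_cloud("shot_042.exr")   # only the final cut reaches cloud
```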
Okay.
And okay.
What are the trade-offs there when it comes like to concurrency?
So let's say we have two editors, they want to edit like the same
file at the same time, right?
Yes, there are.
And that's a great, great question because the simplest solution is what we tried to
design first, which is the last writer wins.
Whoever edited the file last wins, but that may actually result in losing work, right?
So the way we do that is we actually allow the user to select what they want to
surface.
So if every, like, let's say that two artists are working on the same asset,
they can either overwrite the existing asset with their own final copy,
or they can accept the changes.
And we, so far we haven't designed it to do this, but the vision is whenever
an artist writes to a file, it generates an event and that event is actually
consumed by other artists that are
collaborating on the same file.
And when that event is consumed, they can take a decision whether they want to
just overwrite the current copy with what the other artist has
worked on, save their current copy to a temporary location and do a git merge
in some ways on their own, or they want to write and reject that completely.
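A minimal sketch of the two conflict-handling strategies just described, plain last-writer-wins versus emitting a change event that collaborators can accept, stash for a manual merge, or reject. All names here are hypothetical, not the production design:

```python
import time
from dataclasses import dataclass, field


@dataclass
class WriteEvent:
    path: str
    author: str
    timestamp: float
    data: bytes


@dataclass
class SharedAsset:
    path: str
    data: bytes = b""
    timestamp: float = 0.0
    subscribers: list = field(default_factory=list)

    def write_lww(self, author: str, data: bytes) -> None:
        # Last-writer-wins: a later timestamp silently overwrites earlier work.
        now = time.time()
        if now >= self.timestamp:
            self.data, self.timestamp = data, now
        # Event-based alternative: notify collaborators instead of deciding.
        event = WriteEvent(self.path, author, now, data)
        for callback in self.subscribers:
            callback(event)


def on_remote_change(event: WriteEvent) -> None:
    # A collaborator can overwrite their copy, stash it for a manual,
    # "git merge"-style reconciliation, or reject the incoming change.
    print(f"{event.author} changed {event.path}; choose: accept / stash / reject")


asset = SharedAsset("scene_007.psd")
asset.subscribers.append(on_remote_change)
asset.write_lww("artist_a", b"version from artist A")
asset.write_lww("artist_b", b"version from artist B")  # B wins under LWW
```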
Can you think of an architecture where you have, let's say,
these events that are emitted when something's happening
on the binary file level,
and then the clients implement some kind of CRDT
that they can technically, automatically resolve
any kind of conflicts and then eventually be consistent
at the end and do it in such a way?
Because CRDTs are used a lot in that kind of environment, it makes total sense, but at the
same time, they're mainly designed for text edits, right?
Exactly.
Not exactly something that you're going to do on a petabyte binary file or something, right?
So is this how you think about it?
Yes. So for the first version, we are not thinking about CRDTs because a lot of these files are image files that are compressed differently.
So if you do the binary translation, it may not be the direct editing that may work
because you may not be able to see the image after that.
So we want the artist to actually open their own copy of the image, open
the other copy of the image, see if they want to apply something, change
something, and then commit a final copy to cloud.
So we're keeping it simple.
For the first few versions that we are thinking of, we anticipate that we will not have a lot of artists collaborating on the same asset at the same time.
So we envision it to be working between artists that work in different parts of the world that have different time zones that they work on.
So it's like a pipeline at that point where you have one workflow that you know, persist changes and then the other workflow in the next
stage picks it up and works on it.
So that's how we are thinking of it initially. But yes, that's a great point, and those are great things we learn along the way. We'll probably have to implement a CRDT for media files like that.
Yeah, that's super interesting.
Sorry.
Oh, go for it.
Go for it, Kostas.
Very good.
Interrupted.
I'm very excited.
So you have to...
Yeah, no, no, no.
Go for it.
Yeah, because I was, I mean, not playing around, but taking a look at what's going on with CRDTs, because there's a lot of, not exactly hype, but there are things happening in that space.
But their use case is extremely focused on collaborative text editing.
And maybe, let's say, the most, I'll say that like the different things that you might see
out there is like what Figma was doing.
You have like a visual environment and you have like the changes there that you track
and you try to make them visually consistent.
But at the same time, like, okay, it makes a lot of sense to think about how you can
use some kind of this functionality with data in general, right?
Like not just like a sequence of edits, like on a string.
So I'm very excited to hear that there is actually a use case that might make sense,
but I just have no idea about like the research that is going on there, to be honest.
So I don't know if anything has been done there yet.
Yeah, I'm not sure either.
So, I mean, the way I look at it, right, we are probably not the folks that are best aware of how to merge two images. That's where an artist comes in with their creativity, right? And I think at some point, all the technology will not ever replace creativity. So we want to still keep the true essence of that alive. So I think that for smaller merges,
maybe it's easier to like tackle those
because Netflix Drive is a generic system.
It's not just used for images and files,
images and media files, but also other files as well.
Like there could be some tracking files
and other files, metadata files.
So those could definitely be solved by this,
but I still believe that there always will be an area where creativity will trump technology.
Yeah, 100%.
I mean, at the end, there are limitations and you need to find the right trade-offs, and when to involve a human there to decide that, okay, this is what we keep at the end and whatnot.
Eric, yours.
I'm sorry, but I got too excited.
Oh, no, no.
It's super interesting.
Well, one of the questions that flows from this naturally
is sort of a formal version control component of the system, right?
So, like, last writer wins,
and then you have, like, intermediary stores.
Have you considered a formal version control mechanism? Which is interesting to think about. The way that generally happens, at least in the contexts where I've seen people work on media, is you just save a working version of the file and append a number to it, right? V1, V2, V3, right?
And so with heavy media that, you know,
you run and you like create a ton of storage flow
and all that sort of stuff,
you know, there's horrible documentation
because it's just a bunch of files in a folder
that are named sequentially, maybe, right?
So has that entered the conversation at all?
Oh, yeah, yeah, yeah.
So we are using for our first version because we are building with S3 as the backend data store.
S3 allows you to have multiple versions of an object.
So when Netflix Drive comes up, it has two variants or it has multiple variants.
But two of them are you explicitly save a file using APIs and the other is you automatically save a file in
the background. So while you're working on a file, it'll automatically sync the file. And every such
sync actually, you know, creates a new version of the file in S3. So it creates a completely new
version. It creates a new version. We are, the way we are thinking about it and it's still in the
works. When you think about media, media is a big file, right?
Now, even if you change a small pixel in the file, you don't want the entire file to be a new version.
So the way we do it is we chunk the file into parts.
And that allows us to just actually think and replicate or create a new version for the chunk that has the changes.
That's one.
So 99% of the file doesn't even need to be streamed to cloud.
It'll only be that chunk that changed.
That's one.
The other is it allows us to also de-duplicate better in the future.
Because if there are two media files, two big files, two movie files, 99% of the
movie is the same, you can actually de-duplicate and you will reduce your
cloud storage.
So we do have versioning on the background that S3 takes care of, and
we can also surface the correct version.
So we can, we're right now imagining if there's a time machine kind of an
interface where you can not just look at the current files, but go back and look
at the versions of your file by picking the right version from cloud.
So that's how version control can actually help us.
And even if two artists collaborate,
let's say they're both writing to the same folder.
And so one version will be overwritten by the other version.
You can always go back to that other person's version as well.
So that's the beauty of having that versioning with object stores.
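A conceptual sketch, under assumptions rather than the production design, of why chunking helps versioning and dedup: each version of a file is just a list of chunk hashes, only chunks whose content changed are uploaded again, and unchanged chunks are shared between versions (and between files). The variable names are illustrative:

```python
import hashlib

CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB, matching the fixed chunk size mentioned later

chunk_store = {}      # sha256 hex digest -> chunk bytes (objects in S3 in reality)
file_versions = {}    # path -> list of versions; each version is a list of digests


def save_version(path: str, data: bytes) -> int:
    digests = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in chunk_store:          # only changed chunks hit cloud
            chunk_store[digest] = chunk
        digests.append(digest)
    file_versions.setdefault(path, []).append(digests)
    return len(file_versions[path]) - 1        # version index ("time machine" style)


def load_version(path: str, version: int) -> bytes:
    return b"".join(chunk_store[d] for d in file_versions[path][version])
```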
Fascinating.
And so just to be clear, because this is really interesting,
that happens without the end user, the artist, having to declare anything related to version control.
But then they can sort of access the versions as needed.
Exactly. And also, auto-checkpointing implies that the artist doesn't even need to click on save. It's like your Google Doc where it'll auto save.
But some artists, you know, they don't want to like have the auto checkpointing.
They know that they'll work on their machine, and they only want to save a file, or copy it to the cloud or to intermediate data stores, when they really are done with the file.
So they will explicitly like not use the checkpointing feature.
They will call the save button
and it'll automatically overwrite
the previous version in the background.
They don't need to rename or anything.
It'll just automatically take care of it in the background.
Fascinating.
And this is a really specific question,
but I'm just really interested.
When you chunk the file,
like when you break it apart,
well, two questions related to this.
So when they download the file to work on, are you actually
stitching it back together?
Yeah.
When they download it to work on or whatever.
So today, our metadata store is the one that maps a file to its chunks.
So given a file, if it has a hundred chunks, it'll have that metadata mapping between the file and all the objects, and each chunk becomes an object in S3.
So the file-to-object translation happens via the metadata, and that's where the value is. And when we have to download a file, we actually look at the file.
We get all the objects that belong to a file.
Today as it stands, currently we download all the objects for a given file. But in the future, we have plans to just download specific chunks for the file
based on if the user requests a specific offset.
We do have on-demand prefetch, which means you do not fetch the files from cloud
unless you really touch the file locally.
So you will only fetch the metadata.
So your LS and other commands can work.
But when you start working on a file, that's when we will prefetch the file from cloud
and get all the objects for the file.
So we do have that today, but we get it still at the granularity of a file.
And we do not, today we don't have the implementation to get just specific objects from a file or
specific chunks from a file, but that's in the works.
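A minimal sketch, with illustrative names, of the file-to-chunk mapping and on-demand prefetch just described: the metadata store records which object keys make up a file (CockroachDB and S3 in the real system), listing only touches metadata, and the objects are fetched and stitched back together when the file is actually opened:

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB fixed-size chunks, as in the first version

metadata_store = {}   # path -> ordered list of object keys
object_store = {}     # object key -> bytes


def upload(path: str, data: bytes) -> None:
    keys = []
    for i in range(0, len(data), CHUNK_SIZE):
        key = f"{path}/chunk-{i // CHUNK_SIZE:06d}"
        object_store[key] = data[i:i + CHUNK_SIZE]   # each chunk is one object
        keys.append(key)
    metadata_store[path] = keys


def list_files() -> list[str]:
    # `ls` and friends only need metadata; no objects are fetched.
    return sorted(metadata_store)


def open_file(path: str) -> bytes:
    # Touching the file triggers the fetch and stitches chunks back together.
    # (Fetching only the chunks for a requested byte offset is the future
    # work mentioned above; here the whole file is pulled.)
    return b"".join(object_store[key] for key in metadata_store[path])
```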
Got it.
And the second question is, how did you decide how big the chunks are?
That's a great question. So we started this project when S3 did not have support for large file sizes.
We typically see movie files are upwards of five gigabytes sometimes. S3's maximum size was much less than that.
I think 500 MB at that point in time. So we decided, you know, we have to take matters in our own hands.
We have to chunk.
So we decided we'll go with 64 megabyte chunks.
That's something that we chose a number and we found that that
gives us the maximum benefit.
But we recently had a hackathon in Netflix and one of the projects that my team
worked on was variable chunking, which is don't choose 64 megabyte chunks.
Check if you can use variable chunking
algorithms like Rabin-Karp fingerprinting to choose variable size chunks, because that gives
you a higher probability of deduplication. And so we will explore that. That's one of the features
we can add in the future. But so far, we have fixed size chunks. Got it. And it's interesting.
So that's interesting. One of the reasons I asked actually was,
I wasn't initially thinking about storage capacity limitations.
I was more thinking about,
based on the average length of a movie or show file,
is there a particular slice of that in terms of how long it
is time-wise where it makes sense to have a cutoff or something? Today, I think there are a lot of
algorithms that try to research on what's the right chunk at which you break a movie. And people have gone into a lot of depth with these algorithms.
We don't, we haven't used them, but as part of this hackathon project, it was
just to see if we ever were to go that route, what are the savings we can get?
What are the impact we can have?
You know, it, it simplifies some things, but it complicates other things.
It's really the cost of how complicated you want your code to be versus the benefits of having a simple solution, and sometimes, you know, keeping things simple really is the way to go.
So that was the goal.
And we still have to do a full analysis of if we were to variably chunk files
and store, what does that translate into savings? And what does that translate into performance?
So we're looking into that.
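As a rough stand-in for the Rabin-fingerprint style variable chunking mentioned above, here is a content-defined chunker using a simple rolling hash. The window size, divisor, and chunk bounds are illustrative, not Netflix's parameters; the point is that boundaries depend on content, so an edit early in a file only disturbs nearby chunks, which raises the odds of deduplication:

```python
WINDOW = 48
DIVISOR = 1 << 13          # roughly one boundary per ~8 KiB of content on average
MIN_CHUNK = 2 * 1024
MAX_CHUNK = 64 * 1024
BASE = 257
MOD = (1 << 61) - 1
BASE_POW = pow(BASE, WINDOW - 1, MOD)   # weight of the byte leaving the window


def chunk(data: bytes):
    """Yield variable-size chunks whose boundaries depend on content."""
    start = 0
    h = 0
    window = []
    for i, byte in enumerate(data):
        if len(window) == WINDOW:
            out = window.pop(0)
            h = (h - out * BASE_POW) % MOD   # drop the outgoing byte
        h = (h * BASE + byte) % MOD          # add the incoming byte
        window.append(byte)
        size = i - start + 1
        if (size >= MIN_CHUNK and h % DIVISOR == 0) or size >= MAX_CHUNK:
            yield data[start:i + 1]
            start = i + 1
            h, window = 0, []
    if start < len(data):
        yield data[start:]
```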
Yes.
I have a question that is more about working with data in the scope of, let's say, data
analytics and data science.
But you are describing an environment where people are working with data again, right?
Like audio files or video files, but still data.
But the approach there is like local first, right?
Like you need to have the file like locally to work.
You cannot just like edit the file like remotely on a server, in a VM like on AWS. Now, when we are talking about like data analytics and in general, like the more,
let's say, structured data kind of work, we usually take, assume that like
everything is centralized, right?
Like we have a data warehouse or even a data lake or a lake house or
whatever we want to call it.
But still we are talking about like a centralized storage where the data is
and we execute all the queries there.
We don't have like a local first kind of mentality there.
Do you feel like we are going to see a transition or do you see use cases where local first
makes sense also for these use cases?
It depends.
I do believe that it may not be local-first that makes sense for these use cases, but I do believe that decentralized data lakes and data warehouses will be something that will happen in the future.
And let me take a step back and explain what I mean here.
When you think about a data lake, right, you have the central place where all the data lives and you run algorithms on top of it. So you're taking your code to the data. But now imagine if I tell you that there is a way to split this data into pieces, and each data piece can be operated upon by a subset of that algorithm, and the overall impact of applying these algorithms in parallel on these data pieces is the same as applying the algorithm on that entire data lake.
It may seem impossible.
It may seem like that may not work because you need so much information from the entire data.
But there are techniques today where you can work on pieces of data and still aggregate them in a way such that the total sum of all of these aggregations is equivalent to operating on a big data lake.
I think that is where we will move, the world will move because having a central data lake
has a lot of restrictions when it comes to privacy, when it comes to security.
It has privacy and security, but you can think of ways in which you have a medical
industry, right?
You have healthcare data.
You're working on some algorithms and machine learning on that healthcare data.
Now you go to a separate company.
You have healthcare data there.
You cannot transfer all the data because that's compromising user information.
You cannot also hide a lot of the learnings because some of these learnings may tell you
a lot about the users as well. So it can tell you personally identifiable information. So how do you deal
with such situations where you want to apply the learnings from one dataset to the other
dataset, right? This becomes a classic case of there are two data lakes and you want to
apply some algorithms and take learnings from both. Now take this concept to other places.
You want a way,
and there are techniques
in privacy preserving computation
where you can work on
decentralized data storage backends
and still preserve the privacy,
work on encrypted data
instead of decrypting the data
and securely get your learnings.
That is where I think
we will move in the future.
And that's how I think.
As regards the current case of having local storage or not,
I think that for some of these applications,
latency is not that big of a concern
as much as it is throughput and bandwidth.
So for local storage,
you really are solving for the latency problem
where you have a user that needs a great
user experience. And these are creative artists, right? So, or you want a user who's clicking on
an application and they expect great UX. So you want that to be served locally. Some of these are,
you know, you run queries over large datasets, all of which may not be serviced locally. So I think
that we will still, until we get to the point where we solve the
decentralized data lake problem, we will still work with ways in which we run
algorithms on top of data, rather than taking data to algorithms.
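As a tiny worked example of the split-and-aggregate idea described above, assuming nothing beyond basic Python: each data piece is processed where it lives, only small partial results are combined, and the answer matches what a centralized computation over the whole "data lake" would give. Here the computation is a mean built from per-partition (sum, count) pairs:

```python
partitions = [
    [4.0, 8.0, 15.0],        # data piece held in one location
    [16.0, 23.0],            # another location
    [42.0],                  # another
]

# Per-piece work: each site only sees its own data.
partials = [(sum(p), len(p)) for p in partitions]

# Aggregation: combine the small partial results, not the raw data.
total_sum = sum(s for s, _ in partials)
total_count = sum(c for _, c in partials)
decentralized_mean = total_sum / total_count

# Equivalent centralized computation over the whole data lake.
all_data = [x for p in partitions for x in p]
assert decentralized_mean == sum(all_data) / len(all_data)
```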
And about these algorithms and techniques that you talked about, where do we stand today in terms of the state of the art? Do you think that we can already build products on that, or is there still research that has to be done before we can start even thinking of productizing this?
Yeah, there is
research going on right now. There are different
ways in which you can work
on this data. So there is SMPC
which is secure multi-party computation.
There's an entire field of
research that allows you to break your data
into pieces, operate on each piece individually, and then collect all of the learnings and have the same impact as working on that huge, humongous piece of data.
The problem with these fields is that there's a lot of message passing between all of these different pieces for them to come to an agreement of what that eventual result should be.
That's a problem of consensus and it takes time.
So every operation that you do on each piece of the data,
you need to tell all your peers about it and you have to come to consensus.
So that is what is impacting it.
That is why it's not mainstream today.
But I imagine that there'll be a lot of research that will come out in the future,
which will try to remove this consensus requirement or figure out a way to make it much better.
That's when we'll have this more mainstream.
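A minimal sketch of one building block behind secure multi-party computation, additive secret sharing: each party splits its private value into random shares, and only the sum of everyone's shares is ever reconstructed, so the joint total is learned without revealing any single party's value. Real SMPC protocols add far more machinery and, as noted above, a lot of message passing; the party count and prime here are illustrative:

```python
import secrets

PRIME = 2**61 - 1          # arithmetic is done modulo a large prime
N_PARTIES = 3


def share(value: int) -> list[int]:
    """Split `value` into N_PARTIES random shares that sum to it mod PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(N_PARTIES - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares


private_inputs = [120, 45, 90]          # one secret value per party

# Each party shares its value; party i ends up holding the i-th share of everyone.
all_shares = [share(v) for v in private_inputs]
held_by_party = list(zip(*all_shares))

# Each party locally sums the shares it holds and publishes only that sum.
published = [sum(held) % PRIME for held in held_by_party]

# Combining the published sums reveals the total, and nothing else.
total = sum(published) % PRIME
assert total == sum(private_inputs) % PRIME
print("joint sum:", total)              # 255, computed without pooling raw inputs
```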
That's pretty cool.
All right.
I think we talked a lot about Netflix Drive.
I can't wait to see it open source, to be honest.
Do you have any estimation of when this is going to happen?
So we are thinking we'll try to open source it this year.
And if not this year, then maybe early next year.
But yeah, that's the plan.
Okay, that's awesome.
I'll keep an eye on it, see when it happens.
Thank you.
So I know that you have also other interests.
It's not just the Netflix drive that you are working on.
And you are also a very experienced engineer yourself, and you have seen
things changing and happening all these years. So there is something that I'd like to ask you,
based also on your experiences, not only in Netflix, but also other companies that
you have worked. And that's about the introduction of, let's say, a new engineering discipline,
which is data engineering and data engineers,
and how this is different than an application engineer
or a backend engineer or a systems engineer or, I don't know, whatever,
all these different flavors that we have of engineers out there.
Why is it different? And if in your mind there is a good reason for that, what constitutes this difference? What are the different skill sets that a data engineer needs?
Yeah, and that's a very pertinent and good question, I think. So from the way I look at it, right, one thing that binds all these engineering disciplines together, the common thing between all of them, is curiosity. You have to be curious with regards to any field that you are specializing in. And that curiosity
can have different dimensions. When it comes to systems engineer, you're looking at how systems
work, trying to squeeze out latency, trying to squeeze out CPU performance, power, all of that,
optimizing for that. So that's something that you focus on, and that's something where you can work in a silo: you can, you know, look at metrics for your CPU and all of that on your machine and work off it. When it comes to application engineering, not just front end but back end and front end both, you can work on the full stack, but again, your view is for a particular application, for a particular machine, for a particular environment that it's written in. When it comes to data, it's actually far broader, because you cannot get a lot of learnings from just the data that is produced in these two different streams.
You actually have to work on systems that can allow you to operate on data at 10x the scale.
And so you actually can leverage a lot of tools such as Spark, Hadoop, and all of that, that can work in parallel to observe data, to get learnings from data.
These tools only give you benefit when they work at scale. So I think scale is a bigger
difference in data compared to these other disciplines. And also you optimize for,
there are different things you optimize for in these different fields, right? Like for an application, you optimize for user experience.
For systems, you optimize for system performance.
When it comes to data engineering, you optimize for learnings from the data.
Now, you want to remove the outliers.
You want to have the least amount of false positives or false negatives.
So you actually cleaning the data, having the right source of data,
how do you optimize the performance or parallelize your operations on data?
How do you become more cost efficient when it comes to data?
The other two fields do not have cost efficiency as a big metric there.
But with humongous amount of data, how can you like apply tiering of storage? How can you apply compute and storage scaling differently to save your costs
and reduce your time it takes to run these queries on top of that data?
Those are some of the things that differ here.
So I think that that's a different mindset and you have to look at every
problem with that mindset to see if you fit what it takes to be a data engineer.
Yeah.
Oh, I loved what you said about the different things that you optimize for. I think that's very to the point. That was great. And I mean, based on your experience, because you have seen many workplaces with many engineers out there, where do the best data engineers come from?
Because no one, like, okay, starts today out of college and says, I'm a data engineer.
Like, no.
So what is, let's say, the journey, the path that you see that's probably the most successful for someone to get into data engineering at some point?
Yeah.
So I think there are two ways, two things that I've seen.
One is if you work at a big company that has a humongous amount of data, you can
kind of learn about the tools that exist today, the state-of-the-art tools.
So if you work at a company such as Facebook or Netflix or Google of the world, which have a huge amount of data, you can look at how Spark jobs are optimized, how Hadoop is being used, how there are different data lakes that are being used for different applications.
So that gives you a very good idea to get started.
But I think that as an engineer, every time a system becomes 10 times its size,
you have to re-architect systems, right?
That's a rule of thumb.
Like every time you 10x, you have to throw away the older
architecture and re-architect.
You want to go through that in your life at least once or twice.
For you to like understand that what worked
for a petabyte of data will not work for 10 petabytes of data, and you have to throw away the existing tools that you're using to analyze that data and see what you can use. So if you go through that cycle once, you kind of know how to operate at that scale. And then you are well accustomed to bootstrapping a startup where you have very limited data and then scaling it, as well as working in a bigger organization where you have lots of data and you just need to optimize the cycles.
So I think these two are two variants that can help you as a data engineer.
Yeah, makes total sense.
Makes total sense.
Okay, one last question from me, and then I'll give the microphone to, and the stage to Eric.
So starting, if I remember when we had like a quick introduction before we
started recording, you said your first job was at Apple and you were writing
extensions to the kernel there, like to do testing.
From that to architecting and building like large scale distributed file
systems on the cloud that also operate locally, how was the journey?
And what is the difference between you back then and today as an engineer?
Yeah, I think that you always stand on the
shoulders of giants, I would say.
So I think in my case, it's been, when I was right out of college, I was very
new to the field of software, my goal was to throw code at a problem.
You know, write a lot of code to solve a problem.
With my learnings and with my mentors, I've learned that sometimes you need
to remove code to solve a problem.
So the more code you write, the more bugs you will have. So learning such as those have really helped me. And also I have been exposed to a lot of technology in these different
companies and have learned to look at it, try to fit the puzzles together in different industries.
So that has really helped me. And the third thing is
that the confidence that my peers and my mentors have shown on me. So sometimes you don't even
realize your own potential until you're faced with a challenge or a problem. So I think that
in my case, I've been very lucky with that. So that's helped me in my journey. And I hope to,
you know, give back in the same way. Like I hope to be a good... I'm still learning from my mentors.
I hope to mentor other people as well
to keep the cycle of growth for everyone going.
But that's, I would say, a short answer.
Awesome, awesome.
Eric, all yours.
All right, well, we have time.
I'm going to switch this subject matter up
just a little bit
because one of your other passions
is blockchain, Web3, and all the
subjects that surround those. And as we were chatting before the show a little bit, you know,
we were talking about how those can be very sort of buzzwordy topics, you know, and actually,
I was thinking, Kostas, we had a, do you remember when we interviewed, I think it was Peter from Aquarium and he had worked on self-driving cars, you know, and it was like, man, the media wave on self-driving cars hit way too early.
You know, it was just like, this is awesome.
And then it's like, okay, you know, this, the, you know, I think he said that famous quote, the future is here, it's just not distributed yet.
So is that the case with sort of blockchain and web three?
And I think specifically, because you work so deeply in data infrastructure,
what I'd love to hear is, when do you think, say, sort of your average data engineer,
those sorts of technologies are really going to impact their day to day work, you know, in a widespread way? And what is that going to look like?
Sure.
So I think when you step back and look at the world around us, right?
The internet was one of the first decentralized architectures. The internet is made up of a collection of machines, a collection of nodes, and it self-corrects. If some nodes go down, you get routed to the right information. The internet solves the problem of getting to a piece of information. But there are other parts to information, which are storing information and processing information, right? So the internet kind of solves the decentralization problem of not having a central authority when it comes to transfer of information. But when it comes to storage of information or processing of information, they are still under the centralized waters, because today you have AWS, Google Cloud, Microsoft, all of them, big humongous entities that own the cloud in some ways. So I feel that what blockchain does in general is, you know, take these parts out of those centralized waters. Because you no longer have to have storage that is centralized, and processing is nothing but something that takes storage, or data, as an input and produces some other data in some other form as an output. So by taking data storage out of the centralized waters, you've inherently also taken processing out of the centralized waters. That is why blockchain is more exciting these days, because it takes us to that vision of having all the three corners of transfer, storing, and processing of data out of the centralized waters. So that's one.
I think what will happen is, and you know, people use the term blockchain very loosely. People try to retrofit a lot of applications and make them blockchain-y in some ways. But really, what the blockchain is for is this: if there is some form of transaction, and this transaction can be you sharing an image with someone, or you giving a file to someone else, that information should be stored on the blockchain; everything else does not need to be. Even with the existing blockchains, like Filecoin, IPFS, all of those, the metadata is stored on the blockchain but the data itself lives off the chain, because there is no value to putting data on a chain.
And the blockchain itself gets replicated on every node.
So if you have even a megabyte or a five megabyte
or a huge file, that's it.
You've now exhausted all the nodes in the network
because they will all have to download the entire chain.
So the way I feel what will happen is
we'll move away from the concept of centralized authorities owning the cloud to a decentralized cloud, where compute, storage, and services are not tied to AWS or Google or Microsoft, but are run by a bunch of nodes around the world. It'll be tokenized, and folks like you and me can actually give our spare CPU cycles and spare hard drive space to this decentralized cloud, and we'll have encryption and other niceties take care of storing the files on those nodes, and we'll get tokens and rewards for it. That's where I think the future
will go with this. The other thing is also there is when we think about the physical world around us, we can look at scarcity, right?
There's land, but it's scarce.
You have a bottle and you can touch it and it's scarce because it's right there and it's one.
But if you have to take the same scarcity concept to the digital realm, how do you do that?
NFTs, non-fungible tokens in the blockchain enable you to do that.
They can take the concept of scarcity that exists in the physical world to a digital realm.
So you can imagine your land sales actually have tokens on the blockchain, which represent the land.
So you will avoid cases where the same land is sold to multiple people.
And there are many frauds that happen because of that.
So you avoid that completely.
The other thing, you pay taxes today to your government, right?
Any government.
But you don't know exactly where the money is going.
By having a blockchain take care of that, you actually can look at how much money the government is spending on different initiatives.
And that's open for everyone to view.
So I think that blockchain enables a lot of things that are not possible today because of regulations, because of central authorities. And when it comes to data, I think decentralized data cloud or decentralized cloud, or I mean, for lack of a better term, sky, would be how blockchain can disrupt data infrastructure today. So this ties into the same conversation we were having about
decentralized data lakes. So I think that that can be enabled with using blockchain-like technology.
Fascinating. That is absolutely fascinating. And so this is interesting. So it sounds like
you're proposing that the big three who have these massive businesses built on storage will see disruption from this decentralization.
They should see disruption because you do not.
So there are so many problems, right?
There's vendor lock-in.
Then you pay a lot of money.
You may think that storage is cheap because of the tiers. But if you just peek at the alternatives that you have, like Storj and Filecoin, which are the decentralized storage alternatives, they are like one-tenth the cost of Amazon.
So you actually can, you know, use these decentralized techniques.
The only challenge is it'll come down to performance.
When you use Amazon, you know that you will be within the latency requirements
and performance requirements
that Amazon has, the SLAs.
But when it comes to decentralization,
unless you have a big number of nodes,
your performance will always be a bit flaky.
So it's like a cold start problem
where you really need to have
a lot of participants
for it to even make sense.
So I think that a lot of efforts
will go into that direction.
And I truly believe that
owning your data
and securing your data
and privacy are the new,
like our table stakes now.
It's not an afterthought.
Security will not be an afterthought
for any data service.
It will be built in
when designing services.
So decentralization
enables you to do that.
With central authorities, you're putting your keys in their basket and hoping that they comply,
which I don't think will work in the future. So I think that if you look at the world 20 years
from now, 30 years from now, and if you iterate backwards, this is the right time to invest in
researching on these things. Yeah, absolutely fascinating. Absolutely
fascinating. Yes, I know that we have some listeners who are looking for the decentralized
storage layer companies to invest in. Very cool. Well, Tejas, this has been such a fun,
such a fascinating episode. I have learned so much and just really appreciate you giving us the time
and teaching us so much about Netflix Cloud and then also what the future of blockchain and data
is. Absolutely. It was a pleasure being here. And thank you so much for having me here. And I hope that I
was able to provide value to anyone who was listening in. And I hope that people bootstrap
ideas that can enable these technologies and make data better for the world.
Wonderful. Well, thanks again. Thank you. I have two takeaways from this. One is that I really
appreciated when we were talking about the decision around how to chunk the files and how big to make
those file sizes or lengths, that they had actually done
a lot of research on that and decided that keeping things simple and just doing 64 megabyte chunks
was just fine, right? They weren't going to try to over-optimize that, which seems like a really
natural thing to do. He wasn't opposed to doing that in the future. But I, you know, when we've heard this a lot of times on the show where, you know, we could do something more complicated, but simple works really, really well, which is great. And we haven't asked any of them, like, can you go home and watch Netflix without thinking about work?
And that was a huge missed opportunity.
I'm so mad I didn't ask that.
Yeah, that's true.
That's true.
I think next time we should do that.
Maybe we should also do like a Netflix reunion or something.
Get all of them.
We should.
We really should.
Yeah.
Yeah.
Yeah, we should do that.
Yeah, I mean,
I think Netflix is a very special company for this show. I mean, we've had some amazing conversations with great people so far and it keeps being amazing what kind of projects are
coming out of this company. I'm really looking forward to seeing the next generation of cloud storage companies after Dropbox and Google Drive,
because it seems, and that's what I'm keeping from today, that there's still a lot of space for disruption there.
And by the way, they're going to open source it at some point.
So, yeah, that is going to be so interesting.
Yep.
So we'll see. Maybe we'll have him back to be so interesting. Yep.
So we'll see.
Maybe we'll have him back on as a founder.
Maybe.
All right.
Well, thanks again for joining the Data Stack Show, and we will catch you on the next one.
We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every
week.
We'd also love your feedback.
You can email me, Eric Dodds, at eric@datastackshow.com.
That's E-R-I-C at datastackshow.com.
The show is brought to you by Rudderstack, the CDP for developers.
Learn how to build a CDP on your data warehouse at rudderstack.com.