The Data Stack Show - 92: Building a Decentralized Storage System for Media File Collaboration with Tejas Chopra
Episode Date: June 22, 2022

Highlights from this week's conversation include:
- Tejas' background and career journey (2:49, 43:04)
- Digital collaboration with Netflix Drive (7:57)
- A formal version control component (23:44)
- Centralized store vs. local affairs (31:05)
- The different skill sets a data engineer needs (37:38)
- How to get into data engineering (40:57)
- New technologies coming into day-to-day work (44:39)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by Rudderstack, the CDP for developers.
You can learn more at rudderstack.com.
Welcome back to the Data Stack Show. Today,
we're talking with Tejas from Netflix. He is building Netflix Drive, which is a fascinating
system that has enabled Netflix artists and employees around the world to collaborate
on media files. Super interesting. Kostas, I am really interested to know, working in the cloud is so second nature to most of us, right?
When you think about Google Drive or Dropbox or whatever, right?
Even files you can easily share on your phone, right?
It's just so natural.
And so I want to know what it was like
before they started building Netflix Drive.
And I know the pandemic was a catalyst for that,
but that's going to be my question. What was the workflow like before? And then how did they start to undertake migrating
that into the cloud? How about you? Yeah, it's a very good opportunity to discuss with
an expert, actually, what this whole thing about local first is when it comes to building
application experience and product experience.
So there's a lot of conversation and noise around that stuff, more on the web application
space.
And I mean, we see a lot of that happening actually in applications that are like Figma,
for example, where you can edit things
and you can collaborate online and be also offline and then continue working from there.
There's like a bunch of applications like
this that assume this kind of, let's say, local-first experience.
So it would be great to talk with him and see what exactly it means on the back end for that, and what it means to try and do that not at the scale of a small image with two or three people, but at the scale of Netflix, right, where you have huge media files and very complicated workflows.
So yeah, I'm very, very excited to chat about that stuff with him. All right. Well, let's do it.
Let's do it.
Tejas, welcome to the Data Stack Show.
We are so excited to chat with you.
Thank you.
And it's a pleasure being here to meet Kostas and you as well, Eric.
So thank you for having me here.
Absolutely.
Okay.
Well, we always start in the same place.
So tell us about your background and then how you ended up at Netflix.
Sure.
Yeah, so I actually grew up in India and came to the US around 12 years ago.
Did my master's from Carnegie Mellon University and started working in the Bay Area at some
smaller companies.
My focus has always been on backend systems, low-level operating systems.
That's where I started working.
And I was writing GNU debuggers.
So a lot of debugging for processor cores.
Through acquisitions, I went through several companies and then I
worked at a startup called Datrium.
And Datrium was trying to revolutionize how we think about storage,
about virtual machine storage.
And it was the one-stop shop for not just primary, but backup use case as well.
There I worked on file systems.
So I wrote, I helped write a file system and some components of it and some data
management primitives, like snapshotting, replication, all of that.
And then after Datrium, I got a job at Box.
And Box is a pioneer in content management, cloud content management.
So I was working there, building a lot of the services to power petabytes of data on the cloud.
Intelligently placing data on the cloud and leveraging a lot of techniques for on-premise and cloud storage and developing solutions around that.
At Netflix, I started working around two years ago.
And my focus has been mostly on something called Netflix Drive.
And the way to look at it is it's like Google Drive, but Google Drive is for your files and folders.
Netflix Drive is for media assets.
So when you think about Netflix, you think about the great movies that you watch.
And we also produce movies. We also make movies through Netflix studios.
Now, when you make
a movie, you have artists that collaborate to work on a movie, the visual effects, the animation side.
Typically, they used to go to the production site and work there. But with the pandemic gripping our
world today, they work from their home. So how do you build solutions that can give them the same
experience of collaboration? That is something that Netflix Drive enables.
And that's been my focus at Netflix as well.
So that's how I got into data.
That's how my journey has been so far.
Wonderful.
Okay, I want to hear about Netflix Cloud.
But first, and our audience knows that I always do a little bit of LinkedIn stalking.
I noticed that you were a software engineering intern at Apple very early on.
And I think on LinkedIn it said 2011, which was a really interesting time because the iPhone had come out and widespread worldwide adoption was happening.
So I just am super curious, what did you work on there and what was that like?
Absolutely.
So Apple, I was a part of the media IMG group then, which is image and multimedia,
if I remember it correctly.
And a lot of us were working on applications such as FaceTime and a lot of like processor cores that were being licensed by Apple.
How to do testing for those cores.
So we used to get processor cores from external companies. If I
remember correctly, it was Imagination. And they also used to provide us software. But their
processor cores, how do they fit on the Mac or the iPads and whether they render the image correctly
or not, that had to go through a rigorous process of testing and validation. And one of my first
jobs was to validate. So I used to write kernel extensions
to validate those processor cores on the iPads. That was my job. But some of my peers were working
on initial versions of FaceTime at that point in time. So it was really a fun time. It was
right around the time when Steve Jobs was still around. So we did bump into him a couple of times in Apple.
It was, I mean, life was very different back then.
Like, and it was, I guess technology was there, but Bitcoin wasn't there.
So, or at least I didn't know about it. So Apple was the craze.
It still is the craze, but it was, it was such a, such a great feeling to, you know,
be in college and work for Apple.
So I was, I was really in a happy space. And that was my first
time in California. So when I landed in California, I remember I saw everything golden. And I thought,
this must be heaven. It's just so beautiful. It is so beautiful. So I remember that feeling very
well. Yeah. Okay. Was Steve Jobs wearing a black turtleneck when you bumped into him?
Oh, yeah. Oh, yeah.
Perfect. That may be the best thing that I've heard in a long time. That's so great.
Yeah. Yeah.
Okay. So Netflix Cloud, this is what I'm interested in. It's really interesting, I think,
for me and probably some of our listeners to hear something like Google Drive or collaborating on business files or documents
or, you know, whatever, code, all that sort of stuff. It's so second nature now for anyone who,
you know, working around technology. And so it's a little bit funny to say, like,
imagine Google Drive, right? It's like, well, yeah, I mean, isn't that just how people work?
So yes. Can you explain this? Like, what was the infrastructure like, and how did people interact with it before the pandemic?
Sure. When it comes to how we collaborate and work, it's second nature to all of us.
We don't even realize the things and services we use.
But when it comes to movie making, you have a camera that captures a movie.
But the movie, when it is captured, is very different than the movie that you see on the
screen.
And there are so many things that go behind the scenes.
There's like cuts, edits, rendering.
There are so many different variations of the movie based on your device type.
So all of those processing,
pre-processing, post-processing activities
on a recorded image or a recorded movie
happens behind the scenes, right?
And you have a lot of camera footage that gets collected
and only 1% actually makes it to the final cut.
So, with that amount of data to transfer, typically, earlier, artists used to actually go to the production site, just because you can avoid that transfer of data and the time it takes to transfer that data. You can directly work there, and then you can actually, you know, have the final iterations that you can work off, and you can use the cloud. Let's say you worked on an image, you posted it to the cloud using Google Drive.
Let's say a small image, right?
It's like a photo.
And then you can share it with some other artists that wants to like either add some
color or some other edits to that image.
But the problem is at scale, this doesn't work.
Google Drive has limitations, right?
It only allows some thousands, tens of thousands of files.
And you know, you when you are an artist and you have a huge corpus of data, you
want to have the ability to just work on assets that you care about.
So surfacing the right assets on your purview is very important.
So you need that control when it comes to data.
It's not just you show all the data to everyone.
You need levels of control.
You need access levels, authorization, all of that to be built in. And those primitives,
those security primitives are lacking in Google Drive because it's just imagined to be a file
sharing service. So we wanted to take that forward. So that was one thing that was a problem.
The other thing is when artists work from their homes, they work from different machines.
You have Photoshop on one machine, you've configured your brush size and everything,
and you're working on an image.
Say you just close your laptop, you go to another machine.
You want to have the same image with your same settings persisted, right?
All of these fit in some files in these applications.
A simple way is you have those files, you put them in your email or your Google Drive
and you bootstrap the application on the other machine with those pulled in. But this can all be made
seamless. You can actually run Photoshop off a shared cloud drive that allows you to sync between
machines, sync with other artists, collaborate remotely with other people. That is what the
vision was for Netflix Drive.
And it's just one part of the equation.
There are so many other things that it can enable
because right now we are talking about your machines,
like your Mac OS, Windows, your Linux boxes.
So it has that component
where it has different OS versions that it supports.
But also, if we move away from media,
if you think about any form of sharing,
not just media and not just studios, Netflix Drive, if built correctly with that vision, can actually solve all of those issues. It's a superset of Google Drive. We've designed it in a way where you can plug in any metadata store and any data store on its backend. So it's an abstraction layer. We can plug in a cloud database and a cloud data store, we can plug in an on-premise database and an on-premise data store, or we can plug in a hybrid one and it'll work. We plan to open source it so people can actually use it. Currently, the first version that we've built internally works with S3 as the object store, and it works with CockroachDB as the metadata store. So we have a layer on top of CockroachDB and a layer on top of S3, and that takes care of the first version of Netflix Drive. But that's the vision with Netflix Drive.
Fascinating. Okay,
Kostas, I'm going to hand the mic to you. I usually monopolize at this point in the conversation,
but I'm so interested to hear what you're going to ask, especially because you just mentioned CockroachDB.
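As a rough illustration of the pluggable backend Tejas describes (a drive layer that only talks to a metadata-store interface and a data-store interface, so cloud, on-premise, or hybrid implementations can be swapped in), here is a minimal sketch. All class and method names are hypothetical, not Netflix's actual code:

```python
# Minimal sketch of a pluggable metadata store + data store behind one
# abstraction layer. A cloud deployment might back these with CockroachDB
# and S3; an on-premise deployment would pass local implementations.
from abc import ABC, abstractmethod


class MetadataStore(ABC):
    """Maps logical file paths to backend object identifiers."""

    @abstractmethod
    def put_entry(self, path: str, object_ids: list[str]) -> None: ...

    @abstractmethod
    def get_entry(self, path: str) -> list[str]: ...


class DataStore(ABC):
    """Stores opaque blobs addressed by object id."""

    @abstractmethod
    def put_object(self, object_id: str, data: bytes) -> None: ...

    @abstractmethod
    def get_object(self, object_id: str) -> bytes: ...


class DriveBackend:
    """The drive layer is written against the two interfaces only."""

    def __init__(self, metadata: MetadataStore, data: DataStore):
        self.metadata = metadata
        self.data = data

    def save_file(self, path: str, content: bytes) -> None:
        object_id = f"{path}:v0"          # placeholder id scheme
        self.data.put_object(object_id, content)
        self.metadata.put_entry(path, [object_id])

    def read_file(self, path: str) -> bytes:
        object_ids = self.metadata.get_entry(path)
        return b"".join(self.data.get_object(oid) for oid in object_ids)
```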
Yeah, we had the pleasure to have an episode with them some time ago. So obviously a very interesting database system.
But I mean, I was about to ask you, actually, if you have any plans to open source anything, but you answered that already.
So I have to find another question, which is:
after it gets open sourced, are you going to start a company?
I think so far, we're not thinking that far ahead.
We have a lot of plans with Netflix Drive.
Open sourcing is one.
Trying to see the different applications it can support.
Building the different abstraction layers.
And one other thing with Netflix Drive is when you think about a file system, right?
You think about your local machine.
You have reads, writes, all of these calls that happen.
Netflix Drive not only does that, it also exposes APIs.
So you can actually call APIs on a live file system.
And these APIs are used to actually enable workflows that are built on top of Netflix Drive.
Like I said, to surface the right files, hydrate a new machine with just a subset of the files from your older machine.
And we are using Netflix Drive actively in different types of ways.
We use it in animation.
We use it in rendering.
We use it in users' home directories.
So your MacBook, all of the files on your Mac can actually run off Netflix Drive,
and you can go on your other machine and it'll just surface all the files up.
So APIs allow that.
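To make the "file system that also exposes APIs" idea concrete, here is an illustrative sketch with hypothetical names (not the real interface): normal reads and writes work against the mounted drive, while control-plane calls let a workflow do things like hydrate a new machine with only a chosen subset of files.

```python
# Hypothetical sketch of a drive that is both a file system and an API surface.
class DriveMount:
    def __init__(self, namespace: dict[str, bytes]):
        self.namespace = namespace     # full set of files known to the backend
        self.local = {}                # what is materialized on this machine

    # --- file-system-style calls -----------------------------------------
    def read(self, path: str) -> bytes:
        if path not in self.local:                 # fetch lazily on first touch
            self.local[path] = self.namespace[path]
        return self.local[path]

    def write(self, path: str, data: bytes) -> None:
        self.local[path] = data
        self.namespace[path] = data

    # --- control-plane APIs, callable on the live file system ------------
    def list_namespace(self) -> list[str]:
        return sorted(self.namespace)

    def hydrate(self, paths: list[str]) -> None:
        """Materialize only the assets a workflow cares about."""
        for path in paths:
            self.local[path] = self.namespace[path]


backend = {"show_a/shot_001.exr": b"...", "show_a/shot_002.exr": b"...", "notes.txt": b"..."}
drive = DriveMount(backend)
drive.hydrate(["show_a/shot_001.exr"])    # a new machine gets just what it needs
```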
Yeah, I have a question because, okay, like these cloud file systems have been around for a while, like Dropbox, Box, Drive.
And, okay, they are
designed for sharing,
but not necessarily for
concurrently working over the
data.
Exactly.
That's right.
And that's something that,
I mean, okay, it would be
awesome if you can do it. I remember
the first time that we, in my first company,
we had Dropbox to share data between the founders there and the employees.
And we were like, oh, it looks like a file system.
Let's start editing the same thing at the same time.
And then it was a bit of a mess, to be honest.
It's not exactly like...
Then we understood that, okay, this file system is not built
as a network file system, like those that have been used in the past.
So is this something that Netflix Drive can do?
It allows, like, it's designed with this in mind?
Yeah, that's right.
Because a lot of other things that we work on are creative iterations where artists actually work on, you know, drawings which require
strict requirements of latency and experience.
So Netflix Drive uses tiered forms of storage.
So it works with your local files, your local storage.
It will cache the files in your local storage to give you the great performance.
But it also allows you to have tiered intermediate storages before cloud.
And the way to think about this is,
let's say you're an artist.
You work on 100 iterations of something,
and then you're like, aha, this one is what I like.
You know, this is the one that I want to collaborate
and share with someone.
But those 100 iterations,
if you just build a system like Dropbox,
where you have cloud and you have your local machine,
and you don't have tiering,
all of these 100 iterations will go to cloud, right, because you don't have enough space on your machine. So you're paying for that cost, that is, the storage cost in the cloud, the request cost, or whatever cost you pay for cloud, and then you'll have to delete files, but you are still paying something. But having these tiered forms of storage, you can actually have those 99 iterations sit in the middle, and only the final cut can actually, by the use of the API, make it to cloud, so that the other artists can collaborate with you on that. Now this is unique. This control is not there elsewhere. They probably have tiered storage in the background, like Box and Dropbox probably have it, but the user cannot control it.
Yeah.
Netflix Drive allows the user to control it so that they can actually build different types of applications on top of it.
This is a very, very simple example of iterations, but like, let's say you,
some files are temporary in nature.
You do not want these temporary files.
You want the files to be around for some time, but you don't want them to like go to
cloud because you don't want to pay the cost.
You can still use the intermediate storage pods and store these files there. And Netflix already has these storage pods around the globe, so we actually can bootstrap these different caches and media stores relatively simply, and we can provide the same experience. And I'll explain why we also need this. Just to take a step back.
Many locations do not have great accessibility to cloud.
So the pipe between cloud and their machine is not that wide.
So they don't get great throughput and bandwidth.
But Netflix has its own back channel of great throughput and bandwidth by the use of Open Connect and our CDNs. So we actually leverage that high network throughput that we get to stand up these
intermediate storage locations that are closer to the artists.
So if artists are working in LA, they do not have to push their files to cloud.
They can just work off the media stores or storage locations in the middle.
And most of the files can be surfaced from there.
So this gives great performance again.
So performance was the main reason why we designed it in a tiered form.
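As a hedged sketch of the tiered-storage idea just described, the following toy code shows writes landing in a local cache, intermediate iterations parked in a nearby storage pod, and only an explicitly promoted asset paying cloud costs. The class and tier names are illustrative; the dicts stand in for real stores:

```python
class TieredDrive:
    def __init__(self):
        self.local_cache = {}        # artist's workstation
        self.intermediate_pod = {}   # regional storage pod close to the artist
        self.cloud = {}              # e.g. an S3 bucket

    def write_iteration(self, path: str, data: bytes) -> None:
        # Every working iteration stays local; nothing is pushed to cloud yet.
        self.local_cache[path] = data

    def checkpoint(self, path: str) -> None:
        # Spill to the nearby pod so other machines/sites can read it cheaply.
        self.intermediate_pod[path] = self.local_cache[path]

    def promote_to_cloud(self, path: str) -> None:
        # Only the iteration the artist explicitly selects pays cloud costs.
        data = self.intermediate_pod.get(path) or self.local_cache[path]
        self.cloud[path] = data

    def read(self, path: str) -> bytes:
        # Serve from the fastest tier that has the file.
        for tier in (self.local_cache, self.intermediate_pod, self.cloud):
            if path in tier:
                return tier[path]
        raise FileNotFoundError(path)


drive = TieredDrive()
for i in range(100):
    drive.write_iteration("shot_042.exr", f"iteration {i}".encode())
drive.promote_to_cloud("shot_042.exr")   # only the final cut reaches cloud
```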
Okay.
And okay.
What are the trade-offs there when it comes like to concurrency?
So let's say we have two editors, they want to edit like the same
file at the same time, right?
Yes, there are.
And that's a great, great question because the simplest solution is what we tried to
design first, which is the last writer wins.
Whoever edited the file last wins, but that may actually result in losing work, right?
So the way we do that is we actually allow the user to select what they want to
surface.
So if every, like, let's say that two artists are working on the same asset,
they can either overwrite the existing asset with their own final copy,
or they can accept the changes.
And we, so far we haven't designed it to do this, but the vision is whenever
an artist writes to a file, it generates an event and that event is actually
consumed by other artists that are
collaborating on the same file.
And when that event is consumed, they can take a decision whether they want to
just overwrite the current copy with what the other artist has
worked on, save their current copy to a temporary location and do a git merge
in some ways on their own, or they want to write and reject that completely.
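A minimal sketch of the two conflict-handling strategies just described, plain last-writer-wins versus emitting a change event that collaborators can accept, stash for a manual merge, or reject. All names here are hypothetical, not the production design:

```python
import time
from dataclasses import dataclass, field


@dataclass
class WriteEvent:
    path: str
    author: str
    timestamp: float
    data: bytes


@dataclass
class SharedAsset:
    path: str
    data: bytes = b""
    timestamp: float = 0.0
    subscribers: list = field(default_factory=list)

    def write_lww(self, author: str, data: bytes) -> None:
        # Last-writer-wins: a later timestamp silently overwrites earlier work.
        now = time.time()
        if now >= self.timestamp:
            self.data, self.timestamp = data, now
        # Event-based alternative: notify collaborators instead of deciding.
        event = WriteEvent(self.path, author, now, data)
        for callback in self.subscribers:
            callback(event)


def on_remote_change(event: WriteEvent) -> None:
    # A collaborator can overwrite their copy, stash it for a manual,
    # "git merge"-style reconciliation, or reject the incoming change.
    print(f"{event.author} changed {event.path}; choose: accept / stash / reject")


asset = SharedAsset("scene_007.psd")
asset.subscribers.append(on_remote_change)
asset.write_lww("artist_a", b"version from artist A")
asset.write_lww("artist_b", b"version from artist B")  # B wins under LWW
```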
Can you think of an architecture where you have, let's say,
these events that are emitted when something's happening
on the binary file level,
and then the clients implement some kind of CRDT
that they can technically, automatically resolve
any kind of conflicts and then eventually be consistent
at the end and do it in such a way?
Because CRDTs are used a lot in that kind of environment, it makes total sense, but at the
same time, they're mainly designed for text edits, right?
Exactly.
Not exactly something that you're going to do on a petabyte binary file or something, right?
So is this how you think about it?
Yes. So for the first version, we are not thinking about CRDTs because a lot of these files are image files that are compressed differently.
So if you do the binary translation, it may not be the direct editing that may work
because you may not be able to see the image after that.
So we want the artist to actually open their own copy of the image, open
the other copy of the image, see if they want to apply something, change
something, and then commit a final copy to cloud.
So we're keeping it simple.
For the first few versions that we are thinking of, we anticipate that we will not have a lot of artists collaborating on the same asset at the same time.
So we envision it to be working between artists that work in different parts of the world that have different time zones that they work on.
So it's like a pipeline at that point where you have one workflow that you know, persist changes and then the other workflow in the next
stage picks it up and works on it.
So that's how we are thinking of it initially. But yes, that's a great point, and those are great things we learn along the way. We'll probably have to implement a CRDT for media files like that.
Yeah, that's super interesting.
Sorry.
Oh, go for it.
Go for it, Kostas.
Very good.
Interrupted.
I'm very excited.
So you have to...
Yeah, no, no, no.
Go for it.
Yeah, because I was, I mean, not playing around, but taking a look at what's going on with CRDTs, because there's a lot of, not exactly hype, but there are things happening in that space.
But their use case is extremely focused on collaborative text editing.
And maybe, let's say, the most, I'll say that like the different things that you might see
out there is like what Figma was doing.
You have like a visual environment and you have like the changes there that you track
and you try to make them visually consistent.
But at the same time, like, okay, it makes a lot of sense to think about how you can
use some kind of this functionality with data in general, right?
Like not just like a sequence of edits, like on a string.
So I'm very excited to hear that there is actually a use case that might make sense,
but I just have no idea about like the research that is going on there, to be honest.
So I don't know if anything has been done there yet.
Yeah, I'm not sure either.
So, I mean, the way I look at it, right, we are probably not the folks that are best aware of how to merge two images. That's where an artist comes in with their creativity, right? And I think at some point, all the technology will not ever replace creativity. So we want to still keep the true essence of that alive. So I think that for smaller merges,
maybe it's easier to like tackle those
because Netflix Drive is a generic system.
It's not just used for images and files,
images and media files, but also other files as well.
Like there could be some tracking files
and other files, metadata files.
So those could definitely be solved by this,
but I still believe that there always will be an area where creativity will trump technology.
Yeah, 100%.
I mean, at the end, there are limitations and you need to find the right trade-offs, and when to involve a human there to decide that, okay, this is what we keep at the end and whatnot.
Eric, yours.
I'm sorry, but I got too excited.
Oh, no, no.
It's super interesting.
Well, one of the questions that flows from this naturally
is sort of a formal version control component of the system, right?
So, like, last writer wins,
and then you have, like, intermediary stores.
Have you considered a formal version control mechanism? Which is interesting to think about. The way that generally happens, at least in the contexts where I've seen people work on media, is you just save a working version of the file and append a number to it, right? V1, V2, V3, right?
And so with heavy media that, you know,
you run and you like create a ton of storage flow
and all that sort of stuff,
you know, there's horrible documentation
because it's just a bunch of files in a folder
that are named sequentially, maybe, right?
So has that entered the conversation at all?
Oh, yeah, yeah, yeah.
So we are using for our first version because we are building with S3 as the backend data store.
S3 allows you to have multiple versions of an object.
So when Netflix Drive comes up, it has two variants or it has multiple variants.
But two of them are you explicitly save a file using APIs and the other is you automatically save a file in
the background. So while you're working on a file, it'll automatically sync the file. And every such
sync actually, you know, creates a new version of the file in S3. So it creates a completely new
version. It creates a new version. We are, the way we are thinking about it and it's still in the
works. When you think about media, media is a big file, right?
Now, even if you change a small pixel in the file, you don't want the entire file to be a new version.
So the way we do it is we chunk the file into parts.
And that allows us to just actually think and replicate or create a new version for the chunk that has the changes.
That's one.
So 99% of the file doesn't even need to be streamed to cloud.
It'll only be that chunk that changed.
That's one.
The other is it allows us to also de-duplicate better in the future.
Because if there are two media files, two big files, two movie files, 99% of the
movie is the same, you can actually de-duplicate and you will reduce your
cloud storage.
So we do have versioning on the background that S3 takes care of, and
we can also surface the correct version.
So we can, we're right now imagining if there's a time machine kind of an
interface where you can not just look at the current files, but go back and look
at the versions of your file by picking the right version from cloud.
So that's how version control can actually help us.
And even if two artists collaborate,
let's say they're both writing to the same folder.
And so one version will be overwritten by the other version.
You can always go back to that other person's version as well.
So that's the beauty of having that versioning with object stores.
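A conceptual sketch, under assumptions rather than the production design, of why chunking helps versioning and dedup: each version of a file is just a list of chunk hashes, only chunks whose content changed are uploaded again, and unchanged chunks are shared between versions (and between files). The variable names are illustrative:

```python
import hashlib

CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB, matching the fixed chunk size mentioned later

chunk_store = {}      # sha256 hex digest -> chunk bytes (objects in S3 in reality)
file_versions = {}    # path -> list of versions; each version is a list of digests


def save_version(path: str, data: bytes) -> int:
    digests = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in chunk_store:          # only changed chunks hit cloud
            chunk_store[digest] = chunk
        digests.append(digest)
    file_versions.setdefault(path, []).append(digests)
    return len(file_versions[path]) - 1        # version index ("time machine" style)


def load_version(path: str, version: int) -> bytes:
    return b"".join(chunk_store[d] for d in file_versions[path][version])
```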
Fascinating.
And so just to be clear, because this is really interesting,
that happens without the end user, the artist, having to declare anything related to version control.
But then they can sort of access the versions as needed.
Exactly. And also, auto-checkpointing implies that the artist doesn't even need to click on save. It's like your Google Doc where it'll auto save.
But some artists, you know, they don't want to like have the auto checkpointing.
They know that they'll work on their machine, and they only want to save a file, or copy it to the cloud or to intermediate data stores, when they really are done with the file.
So they will explicitly like not use the checkpointing feature.
They will call the save button
and it'll automatically overwrite
the previous version in the background.
They don't need to rename or anything.
It'll just automatically take care of it in the background.
Fascinating.
And this is a really specific question,
but I'm just really interested.
When you chunk the file,
like when you break it apart,
well, two questions related to this.
So when they download the file to work on, are you actually
stitching it back together?
Yeah.
When they download it to work on or whatever.
So today, our metadata store is the one that maps a file to its chunks.
So given a file, if it has a hundred chunks, it'll have that metadata mapping between the file and all the objects, and each chunk becomes an object in S3.
So the file-to-object translation happens via the metadata, and that's where the value is. And when we have to download a file, we actually look at the file.
We get all the objects that belong to a file.
Today as it stands, currently we download all the objects for a given file. But in the future, we have plans to just download specific chunks for the file
based on if the user requests a specific offset.
We do have on-demand prefetch, which means you do not fetch the files from cloud
unless you really touch the file locally.
So you will only fetch the metadata.
So your LS and other commands can work.
But when you start working on a file, that's when we will prefetch the file from cloud
and get all the objects for the file.
So we do have that today, but we get it still at the granularity of a file.
And we do not, today we don't have the implementation to get just specific objects from a file or
specific chunks from a file, but that's in the works.
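A minimal sketch, with illustrative names, of the file-to-chunk mapping and on-demand prefetch just described: the metadata store records which object keys make up a file (CockroachDB and S3 in the real system), listing only touches metadata, and the objects are fetched and stitched back together when the file is actually opened:

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB fixed-size chunks, as in the first version

metadata_store = {}   # path -> ordered list of object keys
object_store = {}     # object key -> bytes


def upload(path: str, data: bytes) -> None:
    keys = []
    for i in range(0, len(data), CHUNK_SIZE):
        key = f"{path}/chunk-{i // CHUNK_SIZE:06d}"
        object_store[key] = data[i:i + CHUNK_SIZE]   # each chunk is one object
        keys.append(key)
    metadata_store[path] = keys


def list_files() -> list[str]:
    # `ls` and friends only need metadata; no objects are fetched.
    return sorted(metadata_store)


def open_file(path: str) -> bytes:
    # Touching the file triggers the fetch and stitches chunks back together.
    # (Fetching only the chunks for a requested byte offset is the future
    # work mentioned above; here the whole file is pulled.)
    return b"".join(object_store[key] for key in metadata_store[path])
```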
Got it.
And the second question is, how did you decide how big the chunks are?
That's a great question. So we started this project when S3 did not have support for large file sizes.
We typically see movie files are upwards of five gigabytes sometimes. S3's maximum size was much less than that.
I think 500 MB at that point in time. So we decided, you know, we have to take matters in our own hands.
We have to chunk.
So we decided we'll go with 64 megabyte chunks.
That's something that we chose a number and we found that that
gives us the maximum benefit.
But we recently had a hackathon in Netflix and one of the projects that my team
worked on was variable chunking, which is don't choose 64 megabyte chunks.
Check if you can use variable chunking
algorithms like Rabin-Karp fingerprinting to choose variable size chunks, because that gives
you a higher probability of deduplication. And so we will explore that. That's one of the features
we can add in the future. But so far, we have fixed size chunks. Got it. And it's interesting.
So that's interesting. One of the reasons I asked actually was,
I wasn't initially thinking about storage capacity limitations.
I was more thinking about,
based on the average length of a movie or show file,
is there a particular slice of that in terms of how long it
is time-wise where it makes sense to have a cutoff or something? Today, I think there are a lot of
algorithms that try to research on what's the right chunk at which you break a movie. And people have gone into a lot of depth with these algorithms.
We don't, we haven't used them, but as part of this hackathon project, it was
just to see if we ever were to go that route, what are the savings we can get?
What are the impact we can have?
You know, it, it simplifies some things, but it complicates other things.
It's really the cost of how complicated you want your code to be versus the benefits of having a simple solution, and sometimes, you know, keeping things simple really is the way to go.
So that was the goal.
And we still have to do a full analysis of if we were to variably chunk files
and store, what does that translate into savings? And what does that translate into performance?
So we're looking into that.
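As a rough stand-in for the Rabin-fingerprint style variable chunking mentioned above, here is a content-defined chunker using a simple rolling hash. The window size, divisor, and chunk bounds are illustrative, not Netflix's parameters; the point is that boundaries depend on content, so an edit early in a file only disturbs nearby chunks, which raises the odds of deduplication:

```python
WINDOW = 48
DIVISOR = 1 << 13          # roughly one boundary per ~8 KiB of content on average
MIN_CHUNK = 2 * 1024
MAX_CHUNK = 64 * 1024
BASE = 257
MOD = (1 << 61) - 1
BASE_POW = pow(BASE, WINDOW - 1, MOD)   # weight of the byte leaving the window


def chunk(data: bytes):
    """Yield variable-size chunks whose boundaries depend on content."""
    start = 0
    h = 0
    window = []
    for i, byte in enumerate(data):
        if len(window) == WINDOW:
            out = window.pop(0)
            h = (h - out * BASE_POW) % MOD   # drop the outgoing byte
        h = (h * BASE + byte) % MOD          # add the incoming byte
        window.append(byte)
        size = i - start + 1
        if (size >= MIN_CHUNK and h % DIVISOR == 0) or size >= MAX_CHUNK:
            yield data[start:i + 1]
            start = i + 1
            h, window = 0, []
    if start < len(data):
        yield data[start:]
```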
Yes.
I have a question that is more about working with data in the scope of, let's say, data
analytics and data science.
But you are describing an environment where people are working with data again, right?
Like audio files or video files, but still data.
But the approach there is like local first, right?
Like you need to have the file like locally to work.
You cannot just like edit the file like remotely on a server, in a VM like on AWS. Now, when we are talking about like data analytics and in general, like the more,
let's say, structured data kind of work, we usually take, assume that like
everything is centralized, right?
Like we have a data warehouse or even a data lake or a lake house or
whatever we want to call it.
But still we are talking about like a centralized storage where the data is
and we execute all the queries there.
We don't have like a local first kind of mentality there.
Do you feel like we are going to see a transition or do you see use cases where local first
makes sense also for these use cases?
It depends.
I do believe that it may not be local-first that makes sense for these use cases, but I do believe that decentralized data lakes and data warehouses will be something that will happen in the future.
And let me take a step back and explain what I mean here.
When you think about a data lake, right, you have the central place where all the data lives and you run algorithms on top of it. So you're taking your code to the data. But now imagine if I tell you that there is a way to split this data into pieces, and each data piece can be operated upon by a subset of that algorithm, and the overall impact of applying these algorithms in parallel on these data pieces is the same as applying the algorithm on that entire data lake.
It may seem impossible.
It may seem like that may not work because you need so much information from the entire data.
But there are techniques today where you can work on pieces of data and still aggregate them in a way such that the total sum of all of these aggregations is equivalent to operating on a big data lake.
I think that is where we will move, the world will move because having a central data lake
has a lot of restrictions when it comes to privacy, when it comes to security.
It has privacy and security, but you can think of ways in which you have a medical
industry, right?
You have healthcare data.
You're working on some algorithms and machine learning on that healthcare data.
Now you go to a separate company.
You have healthcare data there.
You cannot transfer all the data because that's compromising user information.
You cannot also hide a lot of the learnings because some of these learnings may tell you
a lot about the users as well. So it can tell you personally identifiable information. So how do you deal
with such situations where you want to apply the learnings from one dataset to the other
dataset, right? This becomes a classic case of there are two data lakes and you want to
apply some algorithms and take learnings from both. Now take this concept to other places.
You want a way,
and there are techniques
in privacy preserving computation
where you can work on
decentralized data storage backends
and still preserve the privacy,
work on encrypted data
instead of decrypting the data
and securely get your learnings.
That is where I think
we will move in the future.
And that's how I think.
As regards the current case of having local storage or not,
I think that for some of these applications,
latency is not that big of a concern
as much as it is throughput and bandwidth.
So for local storage,
you really are solving for the latency problem
where you have a user that needs a great
user experience. And these are creative artists, right? So, or you want a user who's clicking on
an application and they expect great UX. So you want that to be served locally. Some of these are,
you know, you run queries over large datasets, all of which may not be serviced locally. So I think
that we will still, until we get to the point where we solve the
decentralized data lake problem, we will still work with ways in which we run
algorithms on top of data, rather than taking data to algorithms.
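As a tiny worked example of the split-and-aggregate idea described above, assuming nothing beyond basic Python: each data piece is processed where it lives, only small partial results are combined, and the answer matches what a centralized computation over the whole "data lake" would give. Here the computation is a mean built from per-partition (sum, count) pairs:

```python
partitions = [
    [4.0, 8.0, 15.0],        # data piece held in one location
    [16.0, 23.0],            # another location
    [42.0],                  # another
]

# Per-piece work: each site only sees its own data.
partials = [(sum(p), len(p)) for p in partitions]

# Aggregation: combine the small partial results, not the raw data.
total_sum = sum(s for s, _ in partials)
total_count = sum(c for _, c in partials)
decentralized_mean = total_sum / total_count

# Equivalent centralized computation over the whole data lake.
all_data = [x for p in partitions for x in p]
assert decentralized_mean == sum(all_data) / len(all_data)
```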
And about these algorithms and techniques that you talked about, where do we stand today in terms of the state of the art? Do you think that we can already build products on that, or is there still research that has to be done before we can start even thinking of productizing this?
Yeah, there is
research going on right now. There are different
ways in which you can work
on this data. So there is SMPC
which is secure multi-party computation.
There's an entire field of
research that allows you to break your data
into pieces, operate on each piece individually, and then collect all of the learnings and have the same impact as working on that huge, humongous piece of data.
The problem with these fields is that there's a lot of message passing between all of these different pieces for them to come to an agreement of what that eventual result should be.
That's a problem of consensus and it takes time.
So every operation that you do on each piece of the data,
you need to tell all your peers about it and you have to come to consensus.
So that is what is impacting it.
That is why it's not mainstream today.
But I imagine that there'll be a lot of research that will come out in the future,
which will try to remove this consensus requirement or figure out a way to make it much better.
That's when we'll have this more mainstream.
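A minimal sketch of one building block behind secure multi-party computation, additive secret sharing: each party splits its private value into random shares, and only the sum of everyone's shares is ever reconstructed, so the joint total is learned without revealing any single party's value. Real SMPC protocols add far more machinery and, as noted above, a lot of message passing; the party count and prime here are illustrative:

```python
import secrets

PRIME = 2**61 - 1          # arithmetic is done modulo a large prime
N_PARTIES = 3


def share(value: int) -> list[int]:
    """Split `value` into N_PARTIES random shares that sum to it mod PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(N_PARTIES - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares


private_inputs = [120, 45, 90]          # one secret value per party

# Each party shares its value; party i ends up holding the i-th share of everyone.
all_shares = [share(v) for v in private_inputs]
held_by_party = list(zip(*all_shares))

# Each party locally sums the shares it holds and publishes only that sum.
published = [sum(held) % PRIME for held in held_by_party]

# Combining the published sums reveals the total, and nothing else.
total = sum(published) % PRIME
assert total == sum(private_inputs) % PRIME
print("joint sum:", total)              # 255, computed without pooling raw inputs
```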
That's pretty cool.
All right.
I think we talked a lot about Netflix Drive.
I can't wait to see it open source, to be honest.
Do you have any estimation of when this is going to happen?
So we are thinking we'll try to open source it this year.
And if not this year, then maybe early next year.
But yeah, that's the plan.
Okay, that's awesome.
I'll keep an eye on it, see when it happens.
Thank you.
So I know that you have also other interests.
It's not just the Netflix drive that you are working on.
And you are also a very experienced engineer yourself, and you have seen
things changing and happening all these years. So there is something that I'd like to ask you,
based also on your experiences, not only in Netflix, but also other companies that
you have worked. And that's about the introduction of, let's say, a new engineering discipline,
which is data engineering and data engineers,
and how this is different than an application engineer
or a backend engineer or a systems engineer or, I don't know, whatever,
all these different flavors that we have of engineers out there.
Why is it different? And if in your mind there is a good reason for that, what constitutes this difference? What are the different skill sets that a data engineer needs?
Yeah, and that's a very pertinent and good question, I think. So from the way I look at it, right, one thing that binds all these engineering disciplines together, the common thing between all of them, is curiosity. You have to be curious with regards to any field that you are specializing in. And that curiosity
can have different dimensions. When it comes to systems engineer, you're looking at how systems
work, trying to squeeze out latency, trying to squeeze out CPU performance, power, all of that,
optimizing for that. So that's something that you focus on, and that's something where you can work in a silo: you can, you know, look at metrics for your CPU and all of that on your machine and work off it. When it comes to application engineering, not just front end but back end and front end both, you can work on the full stack, but again, your view is for a particular application, for a particular machine, for a particular environment that it's written in. When it comes to data, it's actually far broader, because you cannot get a lot of learnings from just the data that is produced in these two different streams.
You actually have to work on systems that can allow you to operate on data at 10x the scale.
And so you actually can leverage a lot of tools such as Spark, Hadoop, and all of that, that can work in parallel to observe data, to get learnings from data.
These tools only give you benefit when they work at scale. So I think scale is a bigger
difference in data compared to these other disciplines. And also you optimize for,
there are different things you optimize for in these different fields, right? Like for an application, you optimize for user experience.
For systems, you optimize for system performance.
When it comes to data engineering, you optimize for learnings from the data.
Now, you want to remove the outliers.
You want to have the least amount of false positives or false negatives.
So you actually cleaning the data, having the right source of data,
how do you optimize the performance or parallelize your operations on data?
How do you become more cost efficient when it comes to data?
The other two fields do not have cost efficiency as a big metric there.
But with humongous amount of data, how can you like apply tiering of storage? How can you apply compute and storage scaling differently to save your costs
and reduce your time it takes to run these queries on top of that data?
Those are some of the things that differ here.
So I think that that's a different mindset and you have to look at every
problem with that mindset to see if you fit what it takes to be a data engineer.
Yeah.
Oh, I loved what you said about the different things that you optimize for. I think that's very to the point. That was great. And I mean, based on your experience, because you have seen many workplaces with many engineers out there, where do the best data engineers come from?
Because no one, like, okay, starts today out of college and says, I'm a data engineer.
Like, no.
So what is, let's say, the journey, the path that you see that's probably the most successful for someone to get into data engineering at some point?
Yeah.
So I think there are two ways, two things that I've seen.
One is if you work at a big company that has a humongous amount of data, you can
kind of learn about the tools that exist today, the state-of-the-art tools.
So if you work at a company such as Facebook or Netflix or Google of the world, which have a huge amount of data, you can look at how Spark jobs are optimized, how Hadoop is being used, how there are different data lakes that are being used for different applications.
So that gives you a very good idea to get started.
But I think that as an engineer, every time a system becomes 10 times its size,
you have to re-architect systems, right?
That's a rule of thumb.
Like every time you 10x, you have to throw away the older
architecture and re-architect.
You want to go through that in your life at least once or twice.
For you to like understand that what worked
for a petabyte of data will not work for 10 petabytes of data, and you have to throw away the existing tools that you're using to analyze that data and see what you can use. So if you go through that cycle once, you kind of know how to operate at that scale. And then you are well accustomed to bootstrapping a startup where you have very limited data and then scaling it, as well as working in a bigger organization where you have lots of data and you just need to optimize the cycles.
So I think these two are two variants that can help you as a data engineer.
Yeah, makes total sense.
Makes total sense.
Okay, one last question from me, and then I'll give the microphone to, and the stage to Eric.
So starting, if I remember when we had like a quick introduction before we
started recording, you said your first job was at Apple and you were writing
extensions to the kernel there, like to do testing.
From that to architecting and building like large scale distributed file
systems on the cloud that also operate locally, how was the journey?
And what is the difference between you back then and today as an engineer?
Yeah, I think that you always stand on the
shoulders of giants, I would say.
So I think in my case, it's been, when I was right out of college, I was very
new to the field of software, my goal was to throw code at a problem.
You know, write a lot of code to solve a problem.
With my learnings and with my mentors, I've learned that sometimes you need
to remove code to solve a problem.
So the more code you write, the more bugs you will have. So learning such as those have really helped me. And also I have been exposed to a lot of technology in these different
companies and have learned to look at it, try to fit the puzzles together in different industries.
So that has really helped me. And the third thing is
that the confidence that my peers and my mentors have shown on me. So sometimes you don't even
realize your own potential until you're faced with a challenge or a problem. So I think that
in my case, I've been very lucky with that. So that's helped me in my journey. And I hope to,
you know, give back in the same way. Like I hope to be a good... I'm still learning from my mentors.
I hope to mentor other people as well
to keep the cycle of growth for everyone going.
But that's, I would say, a short answer.
Awesome, awesome.
Eric, all yours.
All right, well, we have time.
I'm going to switch this subject matter up
just a little bit
because one of your other passions
is blockchain, Web3, and all the
subjects that surround those. And as we were chatting before the show a little bit, you know,
we were talking about how those can be very sort of buzzwordy topics, you know, and actually,
I was thinking, Kostas, we had a, do you remember when we interviewed, I think it was Peter from Aquarium and he had worked on self-driving cars, you know, and it was like, man, the media wave on self-driving cars hit way too early.
You know, it was just like, this is awesome.
And then it's like, okay, you know, this, the, you know, I think he said that famous quote, the future is here, it's just not distributed yet.
So is that the case with sort of blockchain and web three?
And I think specifically, because you work so deeply in data infrastructure,
what I'd love to hear is, when do you think, say, sort of your average data engineer,
those sorts of technologies are really going to impact their day to day work, you know, in a widespread way? And what is that going to look like?
Sure.
So I think when you step back and look at the world around us, right?
The internet was one of the first decentralized architectures. The internet is made up of a collection of machines, a collection of nodes, and it self-corrects. If some nodes go down, you get routed to the right information. The internet solves the problem of getting to a piece of information. But there are other parts to information, which are storing information and processing information, right? So the internet kind of solves the decentralization problem of not having a central authority when it comes to transfer of information. But when it comes to storage of information or processing of information, they are still under the centralized waters, because today you have AWS, Google Cloud, Microsoft, all of them, big humongous entities that own the cloud in some ways. So I feel that what blockchain does in general is, you know, take these parts out of those centralized waters. Because you no longer have to have storage that is centralized, and processing is nothing but something that takes storage, or data, as an input and produces some other data in some other form as an output. So by taking data storage out of the centralized waters, you've inherently also taken processing out of the centralized waters. That is why blockchain is more exciting these days, because it takes us to that vision of having all the three corners of transfer, storing, and processing of data out of the centralized waters. So that's one.
I think what will happen is, and you know, people use the term blockchain very loosely. People try to retrofit a lot of applications and make them blockchain-y in some ways. But really, what the blockchain is for is this: if there is some form of transaction, and this transaction can be you sharing an image with someone, or you giving a file to someone else, that information should be stored on the blockchain; everything else does not need to be. Even with the existing blockchains, like Filecoin, IPFS, all of those, the metadata is stored on the blockchain but the data itself lives off the chain, because there is no value to putting data on a chain.
And the blockchain itself gets replicated on every node.
So if you have even a megabyte or a five megabyte
or a huge file, that's it.
You've now exhausted all the nodes in the network
because they will all have to download the entire chain.
So the way I feel what will happen is
we'll move away from the concept of centralized authorities owning the cloud to a decentralized cloud, where compute, storage, and services are not tied to AWS or Google or Microsoft, but are run by a bunch of nodes around the world. It'll be tokenized, and folks like you and me can actually give our spare CPU cycles and spare hard drive space to this decentralized cloud, and we'll have encryption and other niceties take care of storing the files on those nodes, and we'll get tokens and rewards for it. That's where I think the future
will go with this. The other thing is also there is when we think about the physical world around us, we can look at scarcity, right?
There's land, but it's scarce.
You have a bottle and you can touch it and it's scarce because it's right there and it's one.
But if you have to take the same scarcity concept to the digital realm, how do you do that?
NFTs, non-fungible tokens in the blockchain enable you to do that.
They can take the concept of scarcity that exists in the physical world to a digital realm.
So you can imagine your land sales actually have tokens on the blockchain, which represent the land.
So you will avoid cases where the same land is sold to multiple people.
And there are many frauds that happen because of that.
So you avoid that completely.
The other thing, you pay taxes today to your government, right?
Any government.
But you don't know exactly where the money is going.
By having a blockchain take care of that, you actually can look at how much money the government is spending on different initiatives.
And that's open for everyone to view.
So I think that blockchain enables a lot of things that are not possible today because of regulations, because of central authorities. And when it comes to data, I think decentralized data cloud or decentralized cloud, or I mean, for lack of a better term, sky, would be how blockchain can disrupt data infrastructure today. So this ties into the same conversation we were having about
decentralized data lakes. So I think that that can be enabled with using blockchain-like technology.
Fascinating. That is absolutely fascinating. And so this is interesting. So it sounds like
you're proposing that the big three who have these massive businesses built on storage will see disruption from this decentralization.
They should see disruption because you do not.
So there are so many problems, right?
There's vendor lock-in.
Then you pay a lot of money.
You may think that storage is cheap because of the tiers. But if you just peek at the alternatives that you have, like Storj and Filecoin, which are the decentralized storage alternatives, they are like one-tenth the cost of Amazon.
So you actually can, you know, use these decentralized techniques.
The only challenge is it'll come down to performance.
When you use Amazon, you know that you will be within the latency requirements
and performance requirements
that Amazon has, the SLAs.
But when it comes to decentralization,
unless you have a big number of nodes,
your performance will always be a bit flaky.
So it's like a cold start problem
where you really need to have
a lot of participants
for it to even make sense.
So I think that a lot of efforts
will go into that direction.
And I truly believe that
owning your data
and securing your data
and privacy are the new,
like our table stakes now.
It's not an afterthought.
Security will not be an afterthought
for any data service.
It will be built in
when designing services.
So decentralization
enables you to do that.
With central authorities, you're putting your keys in their basket and hoping that they comply,
which I don't think will work in the future. So I think that if you look at the world 20 years
from now, 30 years from now, and if you iterate backwards, this is the right time to invest in
researching on these things. Yeah, absolutely fascinating. Absolutely
fascinating. Yes, I know that we have some listeners who are looking for the decentralized
storage layer companies to invest in. Very cool. Well, Tejas, this has been such a fun,
such a fascinating episode. I have learned so much and just really appreciate you giving us the time
and teaching us so much about Netflix Cloud and then also what the future of blockchain and data
is. Absolutely. It was a pleasure being here. And thank you so much for having me here. And I hope that I
was able to provide value to anyone who was listening in. And I hope that people bootstrap
ideas that can enable these technologies and make data better for the world.
Wonderful. Well, thanks again. Thank you. I have two takeaways from this. One is that I really
appreciated when we were talking about the decision around how to chunk the files and how big to make
those file sizes or lengths, that they had actually done
a lot of research on that and decided that keeping things simple and just doing 64 megabyte chunks
was just fine, right? They weren't going to try to over-optimize that, which seems like a really
natural thing to do. He wasn't opposed to doing that in the future. But I, you know, when we've heard this a lot of times on the show where, you know, we could do something more complicated, but simple works really, really well, which is great. And we haven't asked any of them, like, can you go home and watch Netflix without thinking about work?
And that was a huge missed opportunity.
I'm so mad I didn't ask that.
Yeah, that's true.
That's true.
I think next time we should do that.
Maybe we should also do like a Netflix reunion or something.
Get all of them.
We should.
We really should.
Yeah.
Yeah.
Yeah, we should do that.
Yeah, I mean,
I think Netflix is a very special company for this show. I mean, we've had some amazing conversations with great people so far and it keeps being amazing what kind of projects are
coming out of this company. I'm really looking forward to seeing the next generation of cloud storage companies after Dropbox and Google Drive,
because it seems, and that's what I'm keeping from today, that there's still a lot of space for disruption there.
And by the way, they're going to open source it at some point.
So, yeah, that is going to be so interesting.
Yep.
So we'll see. Maybe we'll have him back to be so interesting. Yep.
So we'll see.
Maybe we'll have him back on as a founder.
Maybe.
All right.
Well, thanks again for joining the Data Stack Show, and we will catch you on the next one.
We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every
week.
We'd also love your feedback.
You can email me, Eric Dodds, at eric@datastackshow.com.
That's E-R-I-C at datastackshow.com.
The show is brought to you by Rudderstack, the CDP for developers.
Learn how to build a CDP on your data warehouse at rudderstack.com.