The Data Stack Show - 17: Working with Data at Netflix with Ioannis Papapanagiotou
Episode Date: December 9, 2020

This week on The Data Stack Show, Kostas and Eric are joined by Ioannis Papapanagiotou, senior engineering manager at Netflix. Ioannis oversees Netflix’s data storage platform and its data integration platform. Their conversation highlighted the various responsibilities his lean teams have, utilizing open source technology and incorporating change data capture solutions.

Key points in this week’s episode include:
Ioannis’ background with academia and Netflix (4:42)
Comparing the data storage and data integration teams (6:19)
Discussing indexing and encryption (20:31)
Netflix’s role in the open source community (27:21)
Implementing change data capture (40:42)
Using Bulldozer to efficiently move data in batches from data warehouse tables to key-value stores (42:43)

The Data Stack Show is a weekly podcast powered by RudderStack. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Welcome back to the Data Stack Show.
We have an extremely exciting guest today, Ioannis from Netflix.
And if you're in the world of data and technology and open source tooling,
there's a really good chance that you've heard of Netflix because they have so many
projects that have become extremely popular and have really done some amazing things. So
this is going to be a very technical conversation, probably some of it over my head, but it's rare
that we get a chance to talk with someone who's had such close involvement with projects like
this. Costas, what are the burning questions in your mind that you want to ask Ionis about
working on the data side of Netflix?
Yeah, first of all, what I think is going to be very interesting is that Yannis is actually
managing two teams over there.
One is dedicated to anything that has to do with media storage.
And the other team is dedicated to syncing data between the different data systems that
they have for more
kind of analytics use cases. So it's very interesting. I mean, it's quite rare to find
someone who has experienced this kind of different diverse use cases of manipulating and working with
data. And it would be amazing to hear from him what commonalities are there or like what
differences. So that's one thing, which I think is super interesting. The other thing, of course, is scale. I mean, Netflix
is a huge company dealing with millions of viewers. They have like very unique requirements
and needs around the data that they work with, something that's not easy to find in other
companies. So I think that would be great to hear from him
what it means to operate data teams
inside such a big organization,
not in terms of the people involved necessarily,
but at least in terms of the data that needs to be handled
and, of course, all the use cases
and the products that they build on top of data.
I mean, everyone knows and talks and makes comments
around the recommendation algorithms, for example,
that Netflix has.
And of course, all these are driven and supported
by the teams that Yannis is managing.
So I think it's going to be super interesting
to learn from his experience.
I agree.
I think the other thing that will be interesting
is to hear about some of the tools that they use
that people might
not be as familiar with, or that are popular products. They get so much attention for the things that they've built, so it'll be great to learn more about some of the more common tools that they use
internally. So why don't we jump in and start learning about Netflix? We have a really special
guest today who I'm so excited to learn from,
Ioannis from Netflix.
He's a senior engineering manager, and we're going to learn about the way that they do things at Netflix
and the way that they build things.
Ioannis, thank you so much for taking time to join us on the show today.
Thank you so much, Eric and Kostas, for having me.
Yeah, that's great to have you here today, Yanni.
For me, there's another reason, actually.
It's not just, I mean, the stories that you can share with us from Netflix,
which obviously is going to be interesting for everyone.
But for me, it's also important because you are the first Greek person
that we are having on this show.
So you are another expat from Greece, as I am.
So double happy today for this episode.
And I'm really looking forward to discuss and learn about your experience at Netflix.
You know, I found out over the last few years that there are like a lot of people working
in the data space that are Greeks.
I don't know if this is for a specific reason, or maybe they had great faculty members
in Greece in the data space that resulted in them working in this.
But you know, there's a lot.
So maybe you'll have more in the near future.
Yeah, yeah.
Actually, that's very interesting.
I don't know exactly why this is true.
But first of all, there's a quite big team in Redshift.
There's a kind of Greek mafia there in the engineering of Redshift.
Snowflake has quite a few Greeks also working there.
And there are many... I mean, you know more about the academic space here, but from what I know, there are also quite a few Greeks in the academic space in the United States working on database systems.
So it looks like the Greeks have a thing around databases.
Databases and DevOps and SREs also.
They are also quite well known for having like good SREs.
So that's also interesting.
Yeah.
Cool.
So let's start.
Yannis, can you do a quick introduction?
Tell us a few things about you and your background.
What were you doing before you joined Netflix? And of course, what is your role there and what are you doing at Netflix today?
Yeah, absolutely.
I kind of have a little bit of a different career path
than most people in the Bay Area.
I joined Netflix about five and a half years ago
out of academia.
So I was a faculty member at Purdue University
before I came to Netflix.
Well, before that, I was a software engineer as well.
So for me, coming to the Bay Area
about five and a half years ago was a new thing.
I joined Netflix as a senior software engineer, working mainly on the key value stores. And over
the years, for the last three years, I have been managing parts of the infrastructure,
especially starting from the key value store, NoSQL databases, and then recently moving to
a new organization about a year ago called the Storage and Data Integrations.
Our focus is building integration solutions and storage solutions for whatever the company needs us to provide.
Yeah, that's very interesting.
I know that actually there are like two parts
and two teams under your organization.
Let's say one that is like working with a data storage
platform and one which is the data integration platform, and you have two separate teams. And if I remember correctly, and correct me if I'm wrong on that, the data storage platform is more responsible for having an overall storage solution for the company, which includes how you also store your media, which is, of course, a very big thing at Netflix, right? And then there's also the data integration platform, which from what I understand works mainly on how the different data systems can exchange and sync data between them. Is this correct?
Yeah, that's correct. I'm surprised with the background research you have done. So yeah, that's absolutely correct. On the storage side, we've seen that we're ingesting more media assets out of our productions, and those productions happen anywhere around the globe. So my team is responsible for some of the transfer solutions and also how we store the data. Most of the data end up being stored, in the final cut, on S3 and encrypted. So my team has been responsible for services on how we transfer and store the data, how we index the data, how we encrypt the data, effectively from the time they arrive at Netflix up to the time they are stored in an object storage in our infrastructure. Those systems, of course, in the last few years, as the company is growing a lot and becoming one of the largest studios, have seen a great evolution. And that's one of the reasons we actually started building solutions in this space.
And the other side of the team, the integration team,
is mainly focusing on building integrations
between like different data systems,
like Cassandra, AWS Aurora, Postgres, MySQL with other systems
like sending the data to Elasticsearch
or sending the data to a data warehouse.
And both teams actually evolved
in the last few years
out of the needs of the company
to invest in this space. An example
would be, you know, we're building a lot of services, especially on the content side, that,
you know, we were using one source of truth, and then we're having another database to,
for example, index the data, like Elasticsearch. And, you know, we're chatting about ways that we
can effectively synchronize those data systems. And a few years back, there was not a good solution, you know, we were just using some scripts or some jobs
over the weekend. And then we thought, you know, what's the best possible way for us to build a
solution that will kind of synchronize those two systems. And then eventually, as we evolved as a
team, we started supporting like more systems, like moving data from Airtable and Google
Sheets to data warehouses, and also moving data from data warehouses to key value stores
for, for example, our machine learning team to do online machine learning.
So yeah, this is how we effectively form those two teams right now.
And new teams, great and exciting areas to work on.
Yeah, that's super interesting.
Actually, before we move into more details about the technical side of things,
from an organizational point of view,
these two teams, I mean, from someone from the outside,
it sounds like they are working on quite different things.
I mean, okay, it's data again,
but very, very different types of data.
And I assume that inside the organization, let's say the consumers of this data are also different, right?
So how does this work in terms of like managing these teams?
Are there, on a technology or organizational level, similarities between the problems that are solved?
What do they have in common?
And what's the difference between these two?
And I guess that this is very interesting for me
because it's quite unique.
And of course, it's also because it has to do
with being in Netflix and you have like a studio there
and you have like the scale of Netflix.
But usually, you know, I meet data teams that work mainly on database systems, more structured data.
So I would like to hear from
you what the commonalities are between the two problems, and what are also the differences and the challenges that you have seen in managing these two teams?
Yeah, that's a great question.
You know, both teams have evolved from the needs of the company in the emerging content space. So
both the two teams have been working, focusing a lot on the content space.
And, you know, while the technology they're building is different, they have a few common things.
The first and most important thing is they're solving immediate business problems, right? And, you know, given, of course, the status of the team and the evolution of the team.
And the second aspect of both teams is they're building what we call high leverage data platform solutions. So they're building solutions that can be used by many different teams.
Now, in regards to the other aspect of your question about the challenges in leading two teams,
I think there are challenges, of course, but, you know, we have spent a significant amount of time
in, you know, hiring and retaining, you know, really amazing talent in the team. And eventually that becomes a little easier for the manager to kind of manage the team. And of course, over the last few years, we have evolved some of the practices, or have shared some of the practices within the team, in terms of the way we do product management and project management.
And we have found some interesting efficiencies
and organizational structures as a group,
which effectively make everyone's life easier, right?
The other aspect also is that, looking at pretty much the identity of a Netflix engineer, which is usually at the senior software engineer level, we're hiring people who are great in communication, great in terms of building products, but we're also hiring people that are great in terms of how they deal with customers, how they deal with partners, and how they work cross-functionally. So that's why, you know, usually the management aspect becomes a little easier for the manager. And that's one of the reasons I would say that, you know, my job hasn't been extremely hard in managing that wonderful team.
That's great. That's great to hear.
So can you share a little bit more about the structure of the teams that exist there?
First of all, are the structures between the two teams identical?
Are there any differences there?
And share a little bit more information about the size, the roles, and stuff like that.
Just to get an idea of how a company like Netflix has evolved in managing these kinds of problems.
Yeah, I would say that we're definitely not the normal team
that you see in other platform teams in Netflix.
Again, a lot of that has been part of the quick evolution of the team.
So I think one of the teams is about, I think,
if I recall correctly, like 12 engineers or so.
And the other side of the team is about five engineers.
So mainly on the storage space we have done a little more investment than on the integration space in terms of size. But to some extent, both teams are working cross-functionally with many other departments in the company. So you may think that we are building a product as a team, but we're not. We're usually building products in collaboration with many other partners. And again, this is an artifact, as I said, of the need of many teams to jump in and solve those business problems, and the fact that we were a little more lean and agile in terms of how we do those practices. And of course, there are many other teams at Netflix that actually, as you
said before, they work in a specific problem. One team may do the warehouses, the other team may do database,
the other team may do streaming platforms, and so forth. So yeah, we are, I guess,
the odd one out to some extent.
It's very interesting. I mean, it's surprising for me how lean the teams are for the size of the company, and I find this very, very interesting. Cool.
So, I mean, my interest is more around the integration,
the data integration team, to be honest,
also because of my background and the stuff that I have done in my working life.
But before we move there and we can discuss further about it,
can you give us some more information,
technical information around the data storage platform
that you have,
especially for the media.
I think it's a very unique problem that you are solving there on a global scale.
As you said, you mentioned that you have production teams
all over the world.
And I think it would be great to know a little bit more
about what kind of technologies you are using
and what are also the use cases,
how the teams are interacting with this data
and what's the lifecycle of this data?
Yeah, absolutely.
So we, as you said,
we have productions anywhere on the globe.
We're ingesting data into our infrastructure,
the storage infrastructure,
through either like people uploading the data to Netflix,
we provide some sort of an upload manager
and there is a UI that people can use
to effectively upload the assets.
We're also providing, well, we plan to provide very soon, a file system in user space where people can actually store the data, and effectively the data will be backed to the cloud. You can think about it like Dropbox for media assets, I would say. Then we also have ways that people can upload the data through different APIs. And finally, there are productions that even upload the data through Snowball devices, you know, those big suitcases that AWS provides, where eventually the data are being stored in S3.
But in our case, you know, in all these cases, in the end of the day, the data are encrypted, and
they are stored in a specific format that we use on AWS S3.
That's where they finally get into.
And while the data are being transferred, we're also pretty much indexing each of these files.
So we know what's the size of the file, what's the metadata of the file.
And then we can even group files together and create file sessions
or we can group files together in what pretty much we call
on the media industry side, the assets,
where an asset can be, let's say, a movie.
And this is represented by many different files.
And on top of a lot of that, you have higher-level services that are using those file and folder services, those metadata to some extent, to generate whatever business need they have.
And this is how at a high level the storage team is organized.
As a storage team, we also offer some other products.
We offer a file system as a service.
There are places in the company that also use AWS file systems, but we also offer our own file service, which is based on Ceph.
And as I said, our team is also, importantly, managing the way we store the data on S3 as well.
Ioannis, one question for you on file storage.
And I thought of this when you mentioned groups of assets
and I may be thinking about this incorrectly, but do you, so, you know, you serve ads in a dynamic way on certain content. How are those files
managed? Because that can change depending on the context of the user. Are the ad assets and the
actual sort of media assets of like the show or movie or
piece of content that the user's consuming, are those stored together? And if not,
are there challenges around sort of delivering those in a dynamic way?
Yeah, that's a good question. First of all, I think that Netflix does not offer advertisements
on the platform. But the kind of area that we have been focusing on is more of how we ingest media assets into Netflix, not how we stream the media assets out of Netflix. The streaming side is handled by a different team, which is the Open Connect organization, where we have caches around the globe, and when you click to play a movie, you effectively get the content from that cache.
Our team is mainly focusing right now on the time that the data arrive from production to Netflix, up to the time that, you know, we do any internal post-production activities like encoding and so forth.
Interesting.
And one follow-up question to that would be, so five years plus at Netflix, compression and file format are concerns.
What changes have you seen from that standpoint?
And has it affected the way that you store that data?
You know, I'm not sure, I would say, is the honest answer to that. I don't think we're compressing data right now in the way we store them. Of course, they can use a file format depending on the resolution they have been encoded in and so forth, or being captured by the cameras. But I don't think that we're actually compressing the data right now before we store them on object storage. This is probably something that we should be looking at, right? But, you know, I'm also not sure about the efficiencies we can get in terms of compression.
So yeah, I'm not sure about that area, to be honest.
So, Yanni, if I understand correctly,
the part of your work, in terms of the lifecycle of the productions at Netflix, starts from the production,
I mean, when the content is actually captured
and it ends when it goes through production
and also probably post-production,
and then you're done, right?
Then it's another team that is responsible
about taking this content and actually figuring out
how it has to be streamed and delivered to the end user.
Is this correct?
Yeah, that's correct.
But it's not only about, you know, doing that from the production.
Of course, the productions can do what they really like as well in some cases.
But it's also like, you know, there can be like post-production vendors that may use
our ecosystem.
So, you know, VFX artists, for example, can use our system. Even the animation space, or other post-production vendors, can use it.
So it can be used by like different partners, I would say.
And a lot of that is also, to some extent,
some of them are abstracted from us
because they're actually using some of the higher level
business logic applications that the company has built.
Usually it comes to us when a file arrives at Netflix and it becomes indexable in our ecosystem for us to use.
So, a bit of more technical questions on that.
You mentioned two things about these assets.
One is indexing and the other is encryption.
So, let's start with indexing.
When you say indexing, you are talking about indexing the metadata of these assets,
or you also perform some other analysis on the video itself that can be searchable?
Yeah, that's a good question. For us, we're a platform team. We're like a low-level platform
team. For the file itself, we keep metadata, of course. We keep an ID for each of the files, and through that ID we can characterize the files themselves. And of course, with that ID we keep a structured format about the metadata of the file itself. So for example, when you want to
see like how many files have been stored for a specific production, you can actually use that
ecosystem to derive those statistics. And then after that, we actually send the data to S3,
and then we kind of encrypt the objects.
And then we have our own key management service
that effectively takes the data and encrypts this data,
and then we store them on S3 eventually.
And then we keep also some form of metadata
for the objects we store on S3 as well.
Okay, so this indexing happens where? Where are these indices stored and made searchable?
Is this part of S3 again, or like you have a different kind of technology where this
indexing happens and then it's exposed like to the users for searching and whatever other
use cases you have?
Yeah.
So in terms of the files, we are actually having a service that kind of does that.
And then this is backed by,
it used to be backed by a graph database in the past,
which was based on TitanDB, or the more modern JanusGraph. We recently replaced that with CockroachDB.
And then there is some indexing capabilities of that through Elasticsearch.
And then for the metadata about how we store the data effectively on AWS S3, we're actually using a Cassandra cluster. And of course, we also have an Elasticsearch cluster for indexing the data.
Oh, that's very interesting.
How did you decide to use CockroachDB, by the way?
I mean, there are some qualities of CockroachDB
that we appreciated.
And as we want to effectively make
some of these services more global,
the ability to have distributed transactions became fairly important for these services.
So we thought that Cockroach is more what we call a new SQL database that provides those new capabilities.
And therefore, it was interesting for us because the wire protocol it provides is based on Postgres. So it was kind of fairly easy for us. People already understood SQL. And so it became an easy transition for us from a TitanDB interface,
which we initially thought was great,
but then effectively understood that the level of nestings
between like different files are not that many.
So that's why eventually we decided CockroachDB.
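Because CockroachDB speaks the Postgres wire protocol, an ordinary Postgres driver is all it takes to talk to it, which is what makes the transition Ioannis describes easy. Here is a small, hypothetical sketch of that idea; the host, database, table, and column names are made up for illustration and are not Netflix's schema:

```python
# Hypothetical sketch: querying CockroachDB with a standard Postgres driver,
# since CockroachDB exposes the Postgres wire protocol. All names are made up.
import psycopg2

conn = psycopg2.connect(
    host="cockroach.example.internal",  # hypothetical cluster endpoint
    port=26257,                          # CockroachDB's default SQL port
    user="asset_service",
    dbname="assets",
)
with conn, conn.cursor() as cur:
    # Plain SQL, exactly as you would write it against Postgres.
    cur.execute(
        "SELECT file_id, size_bytes FROM files WHERE production_id = %s",
        ("prod-123",),
    )
    for file_id, size_bytes in cur.fetchall():
        print(file_id, size_bytes)
conn.close()
```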
That's very interesting.
You are like one of the first... I'm aware of CockroachDB and I'm following their development, but it's very interesting to hear from someone
who's using it in the production environment. So that's why I wanted to ask and I didn't want to
miss the opportunity to ask about it. That's great.
So second question, because I said that one is the indexing, the other is
the encryption. So how important is encryption
and how do you perform encryption efficiently
on such a large scale?
Because I assume that if we're talking about
uncompressed media files,
we are talking about huge volumes of data.
So how does this work and what kind of overhead
it adds to the whole platform?
It's definitely like a lot of media assets
at the petabyte scale.
But at the same time, the speed at which we receive the assets is not as huge as what you can expect, let's say, from a direct-to-consumer case, because this is more on the enterprise software side, right? So in that case, the speed is less important, though it is important in many cases when we have to turn around pretty fast for a production. For us, it's important because we want to make sure that we store our data in a secure way. And then even the access mechanism of the data
is fairly controlled.
So we want to make sure that whoever accesses the data
has the right to be able to do that.
So the data cannot be viewed by anybody external to that.
So that's why we kind of focus a lot on the encryption side
of the data.
And of course, we have different formats that we store the data,
encoding formats, and of course,
each one of them is encrypted as well.
So, Yanni, just to understand a little bit better: the way that you have implemented this, is it, let's say, a file system that itself implements encryption, or do you encrypt the object itself on top of the file system?
Yeah, so we encrypt the actual object itself that is being stored on AWS,
but the way we are going to
present the data to a user
could be through a UI
in a file format, or it could be
through a file system in user space. And those data,
the way we're going to see them, are not going to be encrypted, right?
You're going to think you're using
a normal file system,
and you're doing normal interactions
as you would do with, let's say, an NFS mount on your laptop, right?
And you just see the data.
But of course, in order to get the data, you'll have to get the proper privileges and have
the proper access to the proper project and so forth.
So from the user perspective, let's say from the artist perspective, you're just seeing
a file system.
But the way that actual data is stored on the cloud is encrypted.
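To make the pattern concrete, here is a minimal, hypothetical sketch of client-side object encryption before an S3 upload, in the spirit of what Ioannis describes. The key handling (a Fernet key passed in directly) and all names are illustrative stand-ins; Netflix uses its own key management service and storage format:

```python
# Hypothetical sketch: encrypt the object itself before it is written to S3,
# so the bucket only ever holds ciphertext, while higher-level interfaces
# present plain files to users who hold the right key.
import boto3
from cryptography.fernet import Fernet

def upload_encrypted(local_path, bucket, object_key, data_key):
    cipher = Fernet(data_key)                      # data key would come from a KMS in practice
    with open(local_path, "rb") as f:
        ciphertext = cipher.encrypt(f.read())      # encrypt the whole object client-side
    boto3.client("s3").put_object(Bucket=bucket, Key=object_key, Body=ciphertext)

def download_decrypted(bucket, object_key, data_key):
    obj = boto3.client("s3").get_object(Bucket=bucket, Key=object_key)
    return Fernet(data_key).decrypt(obj["Body"].read())   # caller must hold the right key

# Usage (hypothetical names):
# key = Fernet.generate_key()   # in reality issued and stored by a key management service
# upload_encrypted("scene_042.mov", "media-assets", "prod-123/scene_042.mov", key)
```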
Oh, that's super interesting.
And the appropriate user management that you
have built and access
management and all that stuff, do you use technologies
from AWS to do that?
Like IAM or something?
Or it's something that you have built internally?
We use IAM.
IAM?
IAM, yeah. I-A-M.
Yeah.
But of course, there's a number of internal services that our information security team has built
that are specialized for the Netflix business itself.
Right.
Okay.
I think enough questions for the data storage platform.
As I said, I didn't intend to ask so many questions around that,
but it was super interesting.
And every one of your answers actually brought in more questions.
So let's move forward to some other questions.
So I have a question.
I mean, I've seen that Netflix is quite active in terms of open source.
And when I say active in open source,
both in terms of how you adopt open source internally,
and I think we've heard examples of technologies already,
but also by contributing back to the open source community, let's say.
What's the reason for that?
I understand why you use the tools,
although I'd like to hear also your opinion on that,
why you prefer to use open-source solutions.
But what's the relationship that Netflix has with open-source
and why you decided to do that?
And what's the value that you see not only as an organization,
but you personally as Yanis for that?
Yeah, so I'm going to probably focus a little
on the data platform perspective
for this question. But, you know, Netflix kind of follows a balanced approach. There are a number of systems that, of course, we're building in-house. We're also users of many open source projects, you know, like Apache Cassandra, Elasticsearch, Kafka, Flink, and many others. And we have also open sourced a number of our own solutions like Metacat, Iceberg, EVCache for our caching solution, and Dynomite, which is a proxy layer for Redis.
And we're also using vendor solutions like a lot of relational database offerings from AWS.
We have invested a lot into both open sourcing some of our code, but also supporting some of the open source community.
I mean, as an example,
we have a healthy number of Apache Cassandra committers
in our database team.
And of course, there are many projects
that we're supporting the community
as we use those products,
both because we use them
and we want to make sure that if there's a bug,
we can fix it,
but as well as we want to support back
the open source community.
There are many reasons that we also do open sourcing, but I think fundamentally one of them
is, of course, the hiring. You can get really great engineers when they contribute to your
project. You tend to know them better by the way they interface, not only about the technical
skills, but also some sort of how they collaborate, how they communicate, and so forth.
But there are also, I think, other benefits in my opinion. Like for example, when somebody open sources a project
and then maintains that project properly and so forth,
it becomes like an identity, right?
You tend to have these external identities.
So to some extent, you know, you make yourself marketable
in the future as well.
So that's why we see like many ICs are excited about,
you know, open sourcing some of the projects. Another reason as well is that, of course,
we run systems in productions that we have in the open source space. And many of these systems,
we want the community to contribute to them, evolve and make them better so that they can
fix bugs, we can fix bugs that we see, maybe we're going to see similar problems. But the more our open source systems are adopted by the community, the more we're going to have those commonalities between the different companies that use the same open source projects.
And of course, as I said,
there are a number of projects that we have either donated
to the Apache community or the Cloud Foundation community
and so forth, so that we can
effectively enlarge the community from just
Netflix engineers working on a project.
Yeah, I think you touched on some very good points around open source and why it's important in a company. That's really, really interesting to hear, especially what you said about two things. One, hiring, which is important,
but the other is also about what you mentioned
about collaboration.
That's super interesting.
So in terms of the projects that you have open sourced so far,
and if you know, which one is the most successful
in terms of adoption by the open source community
so far from Netflix?
I think there are a number of projects
that have been successful.
And to be honest with you,
most of the most successful ones, I was not involved in them. So, off the top of my head, what I'm thinking right now,
I think Spinnaker has been fairly successful
as a multi-cloud continuous delivery platform.
Metaflow is another recent example, which we have recently spoken about publicly. So these are the two main projects that come to my mind
right now that have been very much a big success recently.
And your favorite one that came out of your teams?
My favorite one. So I would say I have two favorite ones, out of the fact that I was managing those teams, the key value stores at Netflix. One of them is EVCache, which is our caching solution that we use at Netflix. And then the second one was Dynomite, which is a proxy layer we use for some of the, again, NoSQL key value stores that we have here at Netflix. When I joined Netflix, I was part of the Dynomite team for about two and a half years, helping this project and contributing back to the open source. And I would say that it was really, really exciting
to work and collaborate with a number of companies and open source users.
That's interesting. What was the initial need that made you build Dynomite? You said Dynomite is the cache on top of Redis, right?
Yeah, because I think back then, fundamentally speaking,
Redis was a single node system.
I think later on with Redis cluster,
it became like, again, like a multi-node system,
but it was more like a master-slave system,
like primary-secondary system,
where it's great, it focuses a lot on,
if you think about the CAP theorem or the consistency and partition tolerance,
whereas Dynomite at Netflix is mostly focusing more on the
availability side because a lot of what
makes sense for the business is to make sure
that we achieve like seven nines of availability.
So that's why we wanted a system
that would still have the
properties of Redis, which is really amazing
in terms of a NoSQL key value store
with advanced data structures and all the amazing work that Salvatore Sanfilippo has done, but still make it highly
available.
And that's why we chose to build that kind of proxy layer above Redis.
There are also a few other things.
We were working on the Cassandra space for many years now, again, another AP system,
and we had substantial experience with the way the Dynamo protocol works.
So a lot of the sidecars and the components of the ecosystem were pretty easy, or the automation was kind of pretty easy to adapt with Dynomite, based on this architecture.
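As a rough illustration of the Dynomite idea described here, a proxy layer that adds Dynamo-style, availability-first replication on top of plain Redis nodes, here is a hypothetical sketch. The class, topology, and quorum handling are simplified stand-ins, not Netflix's actual implementation:

```python
# Hypothetical sketch of the Dynomite idea: a thin proxy that adds Dynamo-style
# replication on top of plain Redis nodes so the cache stays available even if
# one replica is down (availability over strict consistency, in CAP terms).
import redis

class TinyDynomiteProxy:
    def __init__(self, replica_addrs):
        # One Redis client per replica (e.g., one per region or rack).
        self.replicas = [redis.Redis(host=h, port=p) for h, p in replica_addrs]

    def set(self, key, value):
        # Fan the write out to every replica; tolerate individual failures.
        acks = 0
        for client in self.replicas:
            try:
                client.set(key, value)
                acks += 1
            except redis.exceptions.ConnectionError:
                continue
        return acks  # the caller can decide what quorum it needs

    def get(self, key):
        # Read from the first healthy replica (a real proxy would prefer the
        # local region and repair stale replicas in the background).
        for client in self.replicas:
            try:
                value = client.get(key)
                if value is not None:
                    return value
            except redis.exceptions.ConnectionError:
                continue
        return None

# Usage (hypothetical addresses):
# proxy = TinyDynomiteProxy([("redis-us-east", 6379), ("redis-us-west", 6379)])
# proxy.set("profile:42", "...serialized preferences...")
```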
Ioannis, a question on internal projects that's sort of more general, not necessarily about specific projects, but
what is the process like of deciding to undertake a project like Dynomite? Do you have
lots of conversations about those things internally? And then as a follow-up, are there
lots of things that you talk about that you don't end up building?
You know, I think one of the great things about Netflix is the fact that a lot of the decision-making is happening at the software engineering layer, and we have this notion
of informed captains.
So yeah, usually the informed captain brings up a business use case on why we need to build
a product or a project, and then tries to communicate that with a number of partners
and tries to make sure that, you know, there's alignment that this will provide substantial business value to the company. And then they continue building the project to some extent, and then try to showcase, through maybe a prototype, the value this project is going to have for the company. And then it takes some sort of a natural way, I would say, with the leadership team funding the project and making it a successful project within the company.
So, Yanni, I mean, from your experience so far with all these open source projects that you have published at Netflix, and considering that many of these projects are the outcome of very specific and large-scale problems
that Netflix has. So I'm pretty sure that there are many people out there like other data engineers
who are dealing with probably similar problems, but like not at the same scale, right? So what's
your advice towards the people out there who learn about these technologies, on how they should use and try to adapt these technologies to the scale of problem that they have?
Is there something like that you have seen
or you have communicated with the communities out there?
And what do you think is important for someone to keep in mind
when using all these projects that Netflix is maintaining right now?
Yeah, that's a good question.
I mean, I have found it fairly challenging even myself to really identify the right source of information overall. So I understand, when people see many companies, including Netflix, open sourcing projects, which one is the one that a person may want to invest in. And to be honest, in many cases, some of these projects are being built based on the advanced needs of a specific company. So, you know, if I was starting new, what I would propose is to first understand the problem space before going deeper into a specific solution. Again, unfortunately, I have not really found a great description of our space, which is kind of data platforms, other than, I'll say, the most recent post by Andreessen Horowitz about the high-level architecture of data platforms. And the second step would probably be to identify a project that is in an interesting area, maybe has a healthy number of contributors that someone can collaborate with, and grow by learning from other people that are more experienced. And of course, that project does not have to be necessarily a Netflix project.
But, you know, as I said, you know,
if somebody would be interested
in a Netflix project,
you know, there are like a few of them
that have a very healthy community around them.
Like one of them is, as I said, Spinnaker, which, as I said, is a multi-cloud continuous delivery platform. And of course, there are other projects that Netflix has been using. For example, we have been doing a fair number of contributions on the Memcached infrastructure and
many other projects as well
that other companies or other entities
have built. That's some great advice, I think.
Cool. Thank you so much
for that. So, moving
forward and let's talk
a little bit more about the data integration platform.
Can you describe in a little bit more detail what the data integration platform does and what's the problem behind it? Why is it a problem at Netflix? And what are the solutions that you have come up with for these problems?
Yeah. So the data integrations team is like, I would say,
like a small but very talented team,
which effectively, you know,
focuses on building an array of integrations.
The formation of the team initially
was done, you know,
based on the fact that
we wanted to build some solutions
in which we will be able
to keep multiple data systems in sync.
And so we started investing in building change data capture solutions and connectors, you know, for relational databases like Postgres, MySQL, Aurora. And most recently, about a year ago, we also started investing in the NoSQL space, like Cassandra.
The latter is a little more complicated because it has those, you know, characteristics of a
multi-master eventually consistent system. Actually, one of my team members gave a talk recently at QCon.
So he spoke in the details about those
for people who are quite interested to listen about that.
And of course, we have written a few blog posts
about Delta and DBLog,
which is kind of the systems we have built.
But at a high level, we were seeing patterns
that people were building,
they were having different data systems,
they were trying to solve this problem with some sort of multi-system transactions
which don't really work, or even like with some sort of repair jobs
when one of the systems was becoming inconsistent.
So we tried to build some sort of solutions that would kind of not need to do that,
but rather to somehow parse the log of a database, send this log through a streaming system, and then send the data to another system that is going to be your secondary system. But as our infrastructure evolved, more services were actually using those
database integrations. And effectively, we came towards a more high-level project, which is now a project that many teams are working on, called the data mesh, which is more about centralizing
a lot of how we move data between different data systems. We also started in parallel some sort of
a different effort. We call it the batch data movement effort, where the focus is more about how we efficiently move the data from, you know, data warehouses to another key value store, so that people can do point queries over there. And of course, as I said, we've been working also on systems for moving data from semi-structured, more rudimentary systems like Airtable and Google Sheets to our data warehouse
so we can do some business analytics
and build business intelligence on top of that.
So this is kind of the area
where we have been investing with this team
in the last about a year and a half from now.
That's super interesting.
So, I mean, I'm aware of like CDC technologies
like Debezium, for example.
I mean, based on my understanding, at least,
the most common way of performing CDC is by attaching to the replication log
or on the log mechanism that the database has,
listen to the changes that happen there
and then replicate these to another system.
That's, on a very high level, how CDC is usually implemented on a database.
But you mentioned also Cassandra,
and you said that there are specific challenges there
because of the eventual consistency.
You have a multi-node environment and all that stuff.
So can you give us a little bit more information
on how the CDC paradigm is implemented on something like Cassandra.
And where do we stand on that?
Do you have this currently implemented
and using it inside Netflix?
And what are the differences and challenges there
compared to the more traditional CDC
that we have seen on something like Postgres or MySQL?
Yeah, so the CDC events from NoSQL database, like active-active setups like Cassandra,
they do have some unique challenges in terms of data partitioning and replication.
So most of the current CDC solutions for this rely on running within the database cluster and providing a stream with duplicate events. Our solution was more focused on deduplicating the stream in a stream processing framework. So effectively, this involves having a distributed copy of the
source database in a stream processing framework like Apache Flink, which we use a lot on Netflix.
And this enables, to some extent, a better handling of the CDC streams,
since we have before and after images of the row changes themselves.
This is a little different from the traditional CDC
that you have seen, as I said, in Postgres and MySQL,
MariaDB and other systems,
in which you have a single stream of events
that comes from a single node,
which is kind of like your primary node.
So as I said, yeah, it was a little more difficult
to do it in Cassandra because of the challenges
of partitioning and replication and so forth.
And mainly it was on the duplication of the events.
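To illustrate the deduplication idea Ioannis describes, here is a small, hypothetical sketch of keeping per-key state and emitting an event only when the row image actually changes, with before and after images attached. It is plain Python standing in for the logic that would run in a keyed Flink job, and the event shape is made up for the example:

```python
# Hypothetical sketch: Cassandra-style CDC emits the same mutation once per
# replica, so a keyed stream-processing job can keep the last-seen state per
# primary key and only emit an event when the row actually changed, attaching
# before/after images. Plain Python standing in for the Flink logic.

last_state = {}  # primary key -> last row image seen (keyed state in a real job)

def process_cdc_event(event):
    """event: {'key': ..., 'row': {...}} as produced by a CDC connector (illustrative shape)."""
    key, new_row = event["key"], event["row"]
    old_row = last_state.get(key)
    if old_row == new_row:
        return None  # duplicate from another replica; drop it
    last_state[key] = new_row
    return {"key": key, "before": old_row, "after": new_row}

# Example: the same mutation arriving from three replicas yields one output event.
events = [{"key": "movie:1", "row": {"title": "Up"}}] * 3
outputs = [e for e in map(process_cdc_event, events) if e is not None]
assert len(outputs) == 1
```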
That's interesting.
You mentioned that another thing that you are doing recently
is moving data out of the data warehouse
and syncing these into a key value store, right?
Two questions here.
One is, what's the use case behind this?
Like traditionally, I mean, data warehouse is considered mainly the destination of data,
right?
Like we collect the data, do some ETL and all that stuff, put it into the data
warehouse.
And from there, we do the analytics reporting.
And that's like the traditional BI that we have.
So what you're describing goes like a step beyond that.
And you actually want to pull the data
out of the data warehouse
and push it into a key value store.
Like, why do you want to do that?
And what kind of data are these
that you are pushing into the key value stores?
Yeah.
So, you know, recently we wrote about a system we developed.
We called it the Bulldozer.
It's an interesting name.
So as I said, you know, there are many services that have the requirement to do a fast lookup of fine-grained data, which need to be generated periodically. An example would be enhancing our user experience, where an online application fetches subscriber preference data to recommend movies and TV shows. And the truth
is that, you know, data warehouses are not designed to serve those point queries, but rather the key
value stores are designed to do that. Therefore, you know, we built Bulldozer as a system that can move the data from the data warehouse to a globally low-latency, fairly reliable key value store. And we tried to make Bulldozer some sort of a self-service platform that can be used fairly easily by users, by just effectively working on a configuration. Behind that, it uses some of the ecosystem we call the
Netflix Scheduler, which is some sort of scheduling framework built on top of Meson,
which is a general purpose workflow orchestration system. And I guess the use cases include members
who want predicted scores of data to help improve their personalized experience, or it can be
metadata from Airtable to Google Sheets for data lifecycle management or even like, you know, modeling data
for messaging personalization.
Or it can even be, you know, when you want to do online machine learning, right? You can't really do that on the data warehouse. You probably need to do that on a key value store.
When you think, though, about the CDC and the Delta concept which I described before, it's kind of different, right?
Because it's actually the opposite direction, right?
You move it from primary data store to a secondary, which that secondary could be a data warehouse.
Whereas the bulldozer is more from the data warehouse to the key value stores.
And how does this differ from the traditional CDC approach?
Or you see it as the same thing,
just flipping between the destination and the source?
Is the methodology the same?
Or because you have to primarily pull data out of the data warehouse,
things have to change?
Yeah.
So if you think about the system where you pull the data out of the data warehouse, you're probably going to do that in some sort of batch way. So you're going to read the data and take them out. Whereas the CDC ecosystem is focusing more on the real time, parsing the log in real time. Of course, like many people say, you know,
what does really real time mean, right? But what you need is in the CDC aspect to really get the
mutations that happen with the database pretty fast and then move it in another system because
latency matters. On the other hand, when you move data from your data warehouse to a key value store,
the latency of actually moving those batches is important, but it's not that important. What is important
is the latency to access the data from the key value stores. So fundamentally speaking, the systems are
totally different. One of them is, as I said, using a scheduling framework called Meson,
sorry, a workflow orchestration framework called Meson, whereas the other one is more like parse the log, throw it into Kafka, and at the same time also do enrichments,
because what happens is once you move the data from a primary source
to secondary, you may want to enrich the data on the way
by combining information from different other types of services,
which eventually makes the way you design microservices much simpler.
That's super interesting.
That's really, really interesting.
So what do you use as a data warehouse at Netflix?
I assume, I mean, it's some kind of these popular cloud data warehouses like Redshift or Snowflake, or you have something that was built in-house?
So I think our data warehouse consists of a large number of data sets that are stored in Amazon S3 via Hive.
We use Druid, Elasticsearch, Redshift, Snowflake, MySQL.
Our platform supports anything from Spark, Presto, Pig, Hive,
and many other systems.
So it's not just a system.
It's multiple systems that we use.
Okay. And
Bulldozer can interact with all these different systems?
Well, right now the way we pull the data is we pull the data out of an Iceberg format, right? And then we send the data to the key value store. So in fact, we pull the data out of S3 buckets, in an Iceberg format, and then we throw them to the key value store.
Actually, we use an abstraction of the key value stores.
And then through that, then it's being sent to the key value stores
like a caching system.
Oh, okay.
That's interesting.
So you have the different, let's say, data warehouse technologies like Redshift and Spark and all these different technologies.
But at the end, you sync all this data that lives there on S3 using Iceberg.
And Bulldozer comes after Iceberg
to pull the data from there
and push it back to the key value stores
that you want through this abstraction layer
that you have built, right?
Right. So Bulldozer effectively comes after Iceberg. That's correct.
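As a rough, hypothetical sketch of the batch movement pattern described here, reading a warehouse snapshot and loading the rows into a key-value store for point lookups, consider the following. The file path, key scheme, TTL, and Redis target are illustrative stand-ins, not Netflix's actual Bulldozer:

```python
# Hypothetical sketch: take a snapshot of a warehouse table (Parquet files such
# as those an Iceberg table tracks on S3) and load each row into a key-value
# store so online services can do low-latency point lookups.
import json
import pandas as pd
import redis

# Configuration-driven, in the spirit of a self-service job definition.
job_config = {
    "source_path": "s3://example-warehouse/recommendation_scores/snapshot.parquet",
    "key_column": "subscriber_id",
    "target": {"host": "localhost", "port": 6379},
    "ttl_seconds": 24 * 3600,  # refreshed by the next scheduled batch run
}

def run_batch_move(config):
    df = pd.read_parquet(config["source_path"])   # read the warehouse snapshot
    kv = redis.Redis(**config["target"])
    for row in df.to_dict(orient="records"):
        key = f"{config['key_column']}:{row[config['key_column']]}"
        # Store the full row as JSON so services can fetch it in one point query.
        kv.set(key, json.dumps(row, default=str), ex=config["ttl_seconds"])

# run_batch_move(job_config)  # would be triggered by a workflow scheduler
```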
All right.
That's great.
So we have talked about data sources so far that are traditional data-based systems or
data processing systems like Kafka or Spark.
How do you also work?
I know that you're very pro-microservices at Netflix.
Do microservices also play a role in all this
process? Is it something that, for example, you consider, like CDC implemented also on top
of like microservices, like pulling data or events from there and moving it around,
or that's like something that the team does not work with, or you don't utilize,
or there's no reason to do that at Netflix in terms of the architecture in general that you have?
So, you know, Delta, or those CDC events, right? The Delta platform is not just the CDC. It's also the way we do enrichments, and the way we talk to other services to get information has simplified a lot of the microservices we use. But, you know, of course, the microservice communication happens over gRPC or REST endpoints and so forth, which is
kind of a different thing.
But, you know, if you think about how we used to implement some things in terms of the way we communicate with microservices, Delta can simplify parts of that.
And an example would be, I'll give you an example today
for like a movie service, right?
Where, you know, for a movie service,
you know, we want to effectively find information
from different microservices,
like let's say the deal service
or the talent service or the vendor service.
And what we used to do in the past,
we used to have this polling system
that would actually pull this information
from the movie search database, from the deal service, from the talent service, from the vendor service,
and then combine them and then send them to the derived data source.
With the Delta itself, what we have done is we have simplified a lot this architecture
by effectively, instead of polling the database itself, the movie search data store, what we do is we have the connector service
that are pulling the mutations from the data store.
And then the Delta application itself
does the queries with all the services
without really needing to build like a polling service.
So in this architecture,
which we also described in the blog post of Delta,
it has substantially simplified the way
we do microservices in some areas of the business itself, which is
mainly on the content side.
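As a rough, hypothetical sketch of the contrast Ioannis draws, here are the two patterns side by side: a periodic polling aggregator versus a CDC-driven handler that enriches each mutation as it arrives. The service names come from his example, while the client objects and event shape are made up for illustration:

```python
# Hypothetical sketch contrasting the two patterns described above. The
# deal/talent/vendor services come from the example; the duck-typed clients
# and event shape are illustrative, not Netflix's actual Delta implementation.
def poll_and_aggregate(movie_db, deal_svc, talent_svc, vendor_svc, sink):
    """Old pattern: periodically scan the source database and re-query every service."""
    for movie in movie_db.scan_all():                      # full periodic poll
        enriched = {
            **movie,
            "deals": deal_svc.get(movie["id"]),
            "talent": talent_svc.get(movie["id"]),
            "vendors": vendor_svc.get(movie["id"]),
        }
        sink.write(enriched)

def on_cdc_event(event, deal_svc, talent_svc, vendor_svc, sink):
    """Delta-style pattern: a connector streams only the mutations, and the
    enrichment happens per change, so no polling service is needed."""
    movie = event["after"]                                 # row image from the CDC connector
    enriched = {
        **movie,
        "deals": deal_svc.get(movie["id"]),
        "talent": talent_svc.get(movie["id"]),
        "vendors": vendor_svc.get(movie["id"]),
    }
    sink.write(enriched)
```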
Super interesting.
I mean, I think we need
another episode just to discuss that. And I'll probably also use a shared screen or something, because I think we are getting into a lot of the complexity that an infrastructure like Netflix
has, but that's super, super interesting.
So, Yanni, okay, we are close to the end of the episode.
Two questions for you before the end.
So, one is, you mentioned a lot Bulldozer and Delta.
Are any parts of these open-sourced right now?
Or do you plan to open-source some parts of these projects?
I think the way that
most data systems are going to evolve,
it's going to be systems of systems
where you're going to be using
those open-source components
to build those systems.
So an example would be
that Delta is using underneath
Apache Flink and Apache Kafka.
And then there's also the CDC aspect,
which we were thinking of actually open sourcing. We haven't been at the state where we're ready to open source it, but this is something that we're seriously considering on the CDC aspect. But open sourcing the whole platform doesn't really make sense because it's kind of comprised of many systems.
And I guess similarly, with Bulldozer, we wrote a blog post about a month ago about how Bulldozer works, trying to push the learnings out to the community so we can receive that feedback.
But, you know,
open sourcing that,
again,
it's probably opinionated
to how we do things
at Netflix.
So I'm not confident that it would have value to be open sourced, to some extent.
But, you know,
whenever we think
that something is like, hey, it's an entity that can be open sourced, then definitely this is the focus we're going to usually move forward with.
Yeah, makes total sense. And last question, what do you think are the next data platform
problems or challenges that the industry is going to spend time and resources in general to solve?
What are some interesting problems that you see out there
that haven't been addressed yet?
Or is it just that they, you know, happened because of the evolution of the industry?
Yeah, I think there are a number of interesting problems that, you know, we're going to see in the near future in the data platform space.
One of them, you know, is data governance in terms of like, you know,
looking at, you know, data quality, data detection,
how do you build insights about the data?
How do you catalog the data? How do you do lineage?
And what we have seen around the industry is that, you know,
there are many, you know, separate solutions that address each of these problems.
But, you know,
I'm curious to see if there's a solution
that can actually address this in a more total way.
And the other aspect that we're investing heavily here at Netflix
is the notion of data mesh, where, you know,
we want to abstract a lot of the aspects of the data platform.
And, you know, instead of like people, you know,
developing their own pipelines, we want to provide, you know,
that abstraction there.
So it can be like a more centralized, you know,
pipeline to some extent, with all the features that the users need. So that's why I think some of the data platform problems will start to become more high level than they are today, which is data warehouses and databases.
That's great. That's some great insight into what is happening in the industry right now. Yanni, thank you so much.
I mean, we could keep chatting for at least another hour,
but I think we have reached the limit of our time for this episode.
Thank you so much for spending this time
and sharing all this valuable information with us.
And yeah, personally, I'm really looking forward to meeting you again in the future and discussing more about whatever interesting stuff you will be doing.
Yeah, thank you so much for having me.
Well, that was an absolutely fascinating conversation.
And just the types of problems that they have to face at Netflix because of the scale. And just to hear about things that,
even things like team structure, where the streaming team deals with the distribution of assets, whereas Ioannis's team deals with sort of collecting and storing and making available
those assets. Most companies, those are the same teams. It was just so interesting to learn about
that. What stuck out to you, Kostas, as sort of the major takeaways?
There are a couple of things, actually. I was very impressed by the size of the teams. First
of all, we are talking about pretty small teams, if you think about it, but they are taking care
of such a huge infrastructure. And I mean, both the storage infrastructure that they have and
also the analytics infrastructure that they have. That was very interesting to see how these small teams
can be so agile and so effective.
The other thing, which I guess is not only a characteristic of Netflix,
but other companies of their size,
is like how many different technologies are involved.
They pretty much use, across the whole organization, every data product that exists out there.
Every possible data
warehouse technology from cloud to on-prem.
And at the same time, they still have to be, let's say, on the state of the art of things
and build their own technology to support their needs.
So that was very, very interesting.
And I think that's a big benefit of observing what these companies are doing, because you can take a glimpse of the future, let's say, of what problems data engineers will have to deal with in the future.
And the other thing is open source.
They are contributing a lot in open source.
They have many projects that they maintain out there and quite interesting projects also.
But at the same time, probably the complete data stack
that they have is based on open source solutions. And they contribute back again. We had Ioannis, for example, talking about the contribution to Cassandra. They have quite a few committers in the company, and they're committing back to Cassandra. So these are the things that I found
extremely interesting. Another thing that I would like to ask everyone to pay some attention to is the concept of CDC.
They are really investing a lot on implementing solutions on top of CDC.
And it's something that I think we will be hearing more and more about in the near future.
And of course, that's something that we discussed also with DeVaris from Meroxa, if you remember. I mean, his company and his product is all about CDC. And I think that this is a term that we will hear more and more in the near future.
Great. Well, thank you so much again for joining us on the Data Stack Show, and we will catch you next time.