The Data Stack Show - 17: Working with Data at Netflix with Ioannis Papapanagiotou

Episode Date: December 9, 2020

This week on The Data Stack Show, Kostas and Eric are joined by Ioannis Papapanagiotou, senior engineering manager at Netflix. Ioannis oversees Netflix's data storage platform and its data integration platform. Their conversation highlighted the various responsibilities his lean teams have, utilizing open source technology and incorporating change data capture solutions.

Key points in this week's episode include:
- Ioannis' background in academia and at Netflix (4:42)
- Comparing the data storage and data integration teams (6:19)
- Discussing indexing and encryption (20:31)
- Netflix's role in the open source community (27:21)
- Implementing change data capture (40:42)
- Using Bulldozer to efficiently move data in batches from data warehouse tables to key-value stores (42:43)

The Data Stack Show is a weekly podcast powered by RudderStack. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcript
Starting point is 00:00:00 Welcome back to the Data Stack Show. We have an extremely exciting guest today, Ioannis from Netflix. And if you're in the world of data and technology and open source tooling, there's a really good chance that you've heard of Netflix, because they have so many projects that have become extremely popular and have really done some amazing things. So this is going to be a very technical conversation, probably some of it over my head, but it's rare that we get a chance to talk with someone who's had such close involvement with projects like this. Kostas, what are the burning questions in your mind that you want to ask Ioannis about
Starting point is 00:00:46 working on the data side of Netflix? Yeah, first of all, what I think is going to be very interesting is that Yannis is actually managing two teams over there. One is dedicated to anything that has to do with media storage. And the other team is dedicated to syncing data between the different data systems that they have, for more analytics use cases. So it's very interesting. I mean, it's quite rare to find someone who has experience with such diverse use cases of manipulating and working with
Starting point is 00:01:17 data. And it would be amazing to hear from him what commonalities are there or like what differences. So that's one thing, which I think is super interesting. The other thing, of course, is scale. I mean, Netflix is a huge company dealing with millions of viewers. They have like very unique requirements and needs around the data that they work with, something that's not easy to find in other companies. So I think that would be great to hear from him what it means to operate data teams inside such a big organization, not in terms of the people involved necessarily,
Starting point is 00:01:55 but at least in terms of the data that needs to be handled and, of course, all the use cases and the products that they build on top of data. I mean, everyone knows and talks and makes comments around the recommendation algorithms, for example, that Netflix has. And of course, all these are driven and supported by the teams that Yanis is managing.
Starting point is 00:02:14 So I think it's going to be super interesting to learn from his experience. I agree. I think the other thing that will be interesting is to hear about some of the tools that they use that people might not be as familiar with. They get so much attention for the things that they've built, so it'll be great to learn more about some of the more common tools that they use
Starting point is 00:02:34 internally. So why don't we jump in and start learning about Netflix? We have a really special guest today who I'm so excited to learn from, Ioannis from Netflix. He's a senior engineering manager, and we're going to learn about the way that they do things at Netflix and the way that they build things. Ioannis, thank you so much for taking time to join us on the show today. Thank you so much, Eric and Kostas, for having me. Yeah, that's great to have you here today, Yanni.
Starting point is 00:03:01 For me, there's another reason, actually. It's not just, I mean, the stories that you can share with us from Netflix, which obviously is going to be interesting for everyone. But for me, it's also important because you are the first Greek person that we are having on this show. So you are another expat from Greece, as I am. So double happy today for this episode. And I'm really looking forward to discuss and learn about your experience at Netflix.
Starting point is 00:03:29 You know, I found out over the last few years that there are like a lot of people working in the data space that are Greeks. I don't know if this is for a specific reason, or maybe they had like great faculty members in Greece in the data space that resulted them in working this. But you know, there's a lot. So maybe you'll have more in the near future. Yeah, yeah. Actually, that's very interesting.
Starting point is 00:03:48 I don't know exactly why this is true. But first of all, there's a quite big team in Redshift. There's a kind of Greek mafia there and the engineering of Redshift. Snowflake has quite a few Greeks also working there. And there are many... I mean, you know more about the academic space here. But from what I know, there are also quite a few Greeks also working there. And there are many, I mean, you know more about like the academic space here, but from what I know,
Starting point is 00:04:08 there are also like quite a few like Greeks in the academic space in the United States working on database systems. So it looks like the Greeks have a thing around databases. Databases and DevOps and SREs also. They are also quite well known for having like good SREs. So that's also interesting. Yeah.
Starting point is 00:04:26 Cool. So let's start. Ioannis, can you do a quick introduction? Tell us a few things about you and your background: what you were doing before you joined Netflix, and of course, what is your role there and what you're doing at Netflix today? Yeah, absolutely. I have kind of a little bit of a different career path
Starting point is 00:04:46 than most people in the Bay Area. I joined Netflix about five and a half years ago out of academia. So I was a faculty member at Purdue University before I came to Netflix. Well, before that, I was a software engineer as well. So for me, coming to the Bay Area about five and a half years was a new thing.
Starting point is 00:05:05 I joined Netflix as a senior software engineer, working mainly on the key value stores. And over the years, for the last three years, I have been managing parts of the infrastructure, especially starting from the key value store, NoSQL databases, and then recently moving to a new organization about a year ago called the Storage and Data Integrations. Our focus is building integration solutions and storage and also like storage solutions for whatever the company needs for us to provide. Yeah, that's very interesting.
Starting point is 00:05:37 I know that actually there are like two parts and two teams under your organization. Let's say one that is like working with a data storage platform and one which is the data integration platform and you have like two separated teams and if i remember correctly and correct me if i'm wrong on that the data storage platform is more responsible about having an overall like storage solution for the company which includes like how you also store your media which is of, like a very big thing in Netflix, right? And then there's also the data integration platform, which from what I understand works mainly in how the different data systems can
Starting point is 00:06:15 exchange and sync data between them. Is this correct? Yeah, that's correct. I'm surprised with the background you have done. So yeah, that's absolutely correct you know on the storage side you know we've been we've seen like you know we're ingesting more media assets out of our productions and those productions happen you know anywhere around the globe so you know my team is responsible for some of the you know transfer solutions and also how we store the data most of the data end up being stored like in the final cut on s3 and encrypted so my team has been responsible for like services on how we transfer store the data how we index the data how we encrypt the data and effectively like from the time they arrive to netflix up to the time they they are stored in an object storage in our infrastructure
Starting point is 00:07:00 So those systems, you know, of course in the last few years, as the company is growing a lot and becoming one of the largest studios, have seen a great evolution. And that's one of the reasons we actually started, you know, building solutions in this space. And the other side of the team, the integration team, is mainly focusing on building integrations between different data systems,
Starting point is 00:07:28 like Cassandra, AWS Aurora, Postgres, MySQL with other systems like sending the data to Elasticsearch or sending the data to a data warehouse. And both teams actually evolved in the last few years out of the needs of the company to invest in this space. An example would be, you know, we're building a lot of services, especially on the content side, that,
Starting point is 00:07:51 you know, we were using one source of truth, and then we're having another database to, for example, index the data, like Elasticsearch. And, you know, we're chatting about ways that we can effectively synchronize those data systems. And a few years back, and there was not a good solution, you know, we're just using some scripts or using some jobs over the weekend. And then we thought, you know, what's the best possible way for us to build a solution that will kind of synchronize those two systems. And then eventually, as we evolved as a team, we started supporting like more systems, like moving data from Airtable and Google Sheets to data warehouses, and also moving data from data warehouses to key value stores
Starting point is 00:08:30 for, for example, our machine learning team to do online machine learning. So yeah, this is how we effectively form those two teams right now. And new teams, great and exciting areas to work on. Yeah, that's super interesting. Actually, before we move into more details about the technical side of things, from an organizational point of view, these two teams, I mean, from someone from the outside, it sounds like they are working on quite different things.
Starting point is 00:09:00 I mean, okay, it's data again, but very, very different types of data. And I assume that inside the organization, let's say the consumers of this data are also different, right? So how does this work in terms of like managing these teams? Are there like on a technology or organizational level similarities between the problems that are solved? What do they have in common? And what's the difference between these two? And I guess that this is very interesting for me
Starting point is 00:09:29 because it's quite unique. And of course, it's also because it has to do with being in Netflix and you have like a studio there and you have like the scale of Netflix. But I'm usually, you know, I meet data teams that they work mainly on database systems, more structured data. So I would like to hear from
Starting point is 00:09:46 you what's the commonalities between the two problems and what are also the differences and the challenges that you have seen by managing these two teams? Yeah, that's a great question. You know, both teams have evolved from the needs of the company in the emerging content space. So both the two teams have been working, focusing a lot on the content space. And, you know, while the technology they're building is different, they have a few common things. The first thing, most important thing is this is they're solving like immediate business problems. Right. And, you know, given, of course, like the status of the team and the evolution of the team. And the second aspect of both teams is they're building what we call high leverage data platform solutions. So they're building solutions that can be used by many different teams.
Starting point is 00:10:30 Now, in regards to the other aspect of your question about the challenges in leading two teams, I think there are challenges, of course, but, you know, we have spent a significant amount of time in, you know, in hiring and entertaining, you know, really amazing talent in the team. And, you know, eventually that becomes a little easier for the manager to kind of manage the team. And eventually that becomes a little easier for the manager to kind of manage the team. And of course, like over the last few years, we have evolved some of the practice or have shared some of the practice
Starting point is 00:10:53 within the team in terms of the way we do product management and project management. And we have found some interesting efficiencies and organizational structures as a group, which effectively make everyone's life easier, right? The other aspect also is that looking about pretty much the identity of a Netflix engineer, which is usually on the senior software engineer,
Starting point is 00:11:14 we're hiring people who are great in communication, great in terms of building products, but they're also hiring people that are great in terms of how they deal with customers, they deal with partners, and we deal with cross-functionally. So that's why, you know, usually the management aspect of the manager becomes a little easier. And that's one of the reasons I would say that, you know, my job has been extremely hard to manage that wonderful team.
Starting point is 00:11:40 That's great. That's great to hear. So can you share a little bit more about about the structure of the teams that exist there? First of all, are the structures between the two teams identical? Are there any differences there? And share a little bit more information about the size, the roles, and stuff like that. Just to get an idea of how a company like Netflix has evolved in managing these kinds of problems. Yeah, I would say that we're definitely not the normal team that you see in other platform teams in Netflix.
Starting point is 00:12:09 Again, a lot of that has been part of the quick evolution of the team. So I think one of the teams is about, I think, if I recall correctly, like 12 engineers or so. And the other side of the team is about five engineers. So mainly on the storage space we have done a little more investment than integration space in terms of like the size but you know to some extent both teams are working uh cross-functionally with many other departments in the company so you know you may think that you know we are building a product
Starting point is 00:12:40 as a team but we are not we're usually you know building products in collaboration with many other partners and again this is an artifact as i said of of the need of many teams to jump in and solve as a team, but we're not. We're usually building products in collaboration with many other partners. And again, this is an artifact, as I said, of the need of many teams to jump in and solve those business problems. And the fact that we were a little more lean and agile in terms of how we do those practices. And of course, there are many other teams at Netflix that they actually, as you said before, they work in a specific problem. One team may do the warehouses, the other team may do database, the other team may do streaming platforms, and so forth. So yeah, we are, I guess, the odd one out to some extent. It's very interesting. I mean,
Starting point is 00:13:17 it's surprising for me how linked the teams are and for the size of the company and find this very, very, very interesting. Cool. So, I mean, my interest is more around the integration, the data integration team, to be honest, also because of my background and the stuff that I have done in my life working. But before we move there and we can discuss further about it, can you give us some more information, technical information around the data storage platform that you have,
Starting point is 00:13:46 especially for the media. I think it's a very unique problem that you are solving there on a global scale. As you said, you mentioned that you have production teams all over the world. And I think it would be great to know a little bit more about what kind of technologies you are using and what are also the use cases, how the teams are interacting with this data
Starting point is 00:14:05 and what's the lifecycle of this data? Yeah, absolutely. So we, as you said, we have productions anywhere on the globe. We're ingesting data into our infrastructure, the storage infrastructure, through either like people uploading the data to Netflix, we provide some sort of an upload manager
Starting point is 00:14:22 and there is a UI that people can use to effectively upload the assets. We also plan to provide very soon a file system in user space where people can actually store the data, and effectively the data will be backed by the cloud. You can think about it
Starting point is 00:14:37 like Dropbox for media assets, I would say. Then we also have ways that people can upload the data through different say. And then, you know, we also have like ways that people can upload the data through, you know, different APIs. And finally, you know, there are like, there are productions that even upload the data through like Snowball devices, you know, those big suitcases that AWS provides that, you know, if eventually the data are being stored in S3.
Starting point is 00:15:00 But in our case, you know, in all these cases, in the end of the day, the data are encrypted, and they are stored in a specific format that we use on AWS S3. That's where they finally get into. And while the data are being transferred, we also like pretty much indexing each of these files. So we know what's the size of the file, what's the metadata of the file. And then we can even group files together and create file sessions or we can group files together in what pretty much we call
Starting point is 00:15:30 on the media industry side, the assets, where an asset can be, let's say, a movie. And this is represented by many different files. And a lot of that, then you have like an abercrater services that are using those files and folder services, those metadata to some extent, to generate any kind of business need they have. And this is how at a high level the storage team is organized.
Starting point is 00:15:53 As a storage team, we also offer some other products. We offer a file system as a service. There are places in the company that also use AWS file systems, but we also offer our own file file service, which is based on Seth. And as I said, our team is also important, you know, is managing the way we store the data on S3 as well. Aonis, one question for you on file storage.
Starting point is 00:16:19 And I thought of this when you mentioned groups of assets and I may be thinking about this incorrectly, but do you, so, you know, you serve ads in a dynamic way on certain content. How are those files managed? Because that can change depending on the context of the user. Are the ad assets and the actual sort of media assets of like the show or movie or piece of content that the user's consuming, are those stored together? And if not, are there challenges around sort of delivering those in a dynamic way? Yeah, that's a good question. First of all, I think that Netflix does not offer advertisements on the platform, but the kind of the area that we have been focusing on is more of how we ingest media assets to Netflix,
Starting point is 00:17:11 not on how we stream the media assets to Netflix. The streaming side is handled by a different team, which is the Open Connect organization, which we have cashes around the globe where when you know, when you click to play a movie, effectively get the content from that cast. Our team is mainly focusing right now the time that the data arrive from production to Netflix after the time that, you know, they get, you know, we do any internal post-production activities
Starting point is 00:17:38 like encoding and so forth. Interesting. And one follow-up question to that would be, so five years plus at Netflix, compression and file format are concerns. What changes have you seen from that standpoint? And has it affected the way that you store that data? You know, I am not sure. I would say the honest answer to that. I don't think we're compressing right now data in the way we store them of course
Starting point is 00:18:26 like they're you know they can use file format like they can use like you know depending on the resolution they have been encoded and so forth or being captured by the cameras but i don't think that we're actually compressing right now the data before we store them or object storage this is probably something that we should be looking at right but you know i'm also not sure about you know the efficiencies we can get in terms of compression. So yeah, I'm not sure about that area, to be honest. So, Yanni, if I understand correctly, the parts of your work in terms of the lifecycle
Starting point is 00:18:59 of the productions in Netflix starts from the production, I mean, when the content is actually captured and it ends when it goes through production and also probably post-production, and then you're done, right? Then it's another team that is responsible about taking this content and actually figuring out how it has to be streamed and delivered to the end user.
Starting point is 00:19:22 Is this correct? Yeah, that's correct. But it's not only about, you know, doing that from the production. Of course, the productions can do what they really like as well in some cases. But also, you know, there can be post-production vendors that may use our ecosystem. So, you know, VFX artists can use our system. So even the animation space or other post-production vendors can use it.
Starting point is 00:19:43 So it can be used by like different partners, I would say. And a lot of that is also, to some extent, some of them are abstracted from us because they're actually using some of the higher level business logic applications that the company has built. Usually it comes to us when a file arrives to Netflix and it becomes an indexable ecosystem for us to use. So, a bit of more technical questions on that.
Starting point is 00:20:12 You mentioned two things about these assets. One is indexing and the other is encryption. So, let's start with indexing. When you say indexing, you are talking about indexing the metadata of these assets, or you also perform some other analysis on the video itself that can be searchable? Yeah, that's a good question. For us, we're a platform team. We're like a low-level platform team. We are actually, for the file itself, we keep metadata, of course. Metadata, of course, we keep an ID for each of the files. And through that ID, we can characterize the files themselves. And of course, like we think that ID we keep like
Starting point is 00:20:48 a structured format about the metadata of the file itself. So for example, when you want to see like how many files have been stored for a specific production, you can actually use that ecosystem to derive those statistics. And then after that, we actually send the data to S3, and then we kind of encrypt the objects. And then we have our own key management service that effectively takes the data and encrypts this data, and then we store them on S3 eventually. And then we keep also some form of metadata
Starting point is 00:21:19 for the objects we store on S3 as well. Okay, so where does this indexing happen? Where are these indices stored, and are they searchable? Is this part of S3 again, or do you have a different kind of technology where this indexing happens, and then it's exposed to the users for searching and whatever other use cases you have? Yeah. So in terms of the files, we actually have a service that kind of does that. And then this is backed by,
Starting point is 00:21:48 it used to be backed by a graph database in the past, which was based on Titan TB, or like the most modern Janus graph. We recently replaced that with using Cockroach TB. And then there is some indexing capabilities of that through Elasticsearch. And then the metadata for how we store the data effectively on AWS S3, we're actually using Cassandra cluster. And of course we also have Elasticsearch cluster for index store the data effectively on AWS S3. We're actually using a Cassandra cluster.
Starting point is 00:22:06 And of course, we also have an Elasticsearch cluster for indexing the data. Oh, that's very interesting. How did you decide to use CockroachDB, by the way? I mean, there are some qualities of CockroachDB that we appreciated. And as we want to effectively make some of these services more global,
Starting point is 00:22:29 the ability to have distributed transactions became fairly important for these services. So we thought that Cockroach is more what we call a new SQL database that provides those new capabilities. And therefore, it was interesting for us because it provides the guard protocol is based on Postgres. So it was kind of fairly easier for us because it provides the guard protocol is based on Postgres. So it was kind of like fairly easier for us. People didn't understand SQL. And so it became like an easy transition for us from like a Titan DB interface, which we initially thought was great, but then effectively understood that the level of nestings
Starting point is 00:22:58 between like different files are not that many. So that's why eventually we decided CockroachDB. That's very interesting. You are like one of the first... I'm aware of CockroachDB and I'm following their development, but it's very interesting to hear from someone who's using it in the production environment. So that's why I wanted to ask and I didn't want to miss the opportunity to ask about it. That's great. So second question, because I said that one is the indexing, the other is the encryption. So how important is encryption
Starting point is 00:23:27 and how do you perform encryption efficiently on such a large scale? Because I assume that if we're talking about uncompressed media files, we are talking about huge volumes of data. So how does this work and what kind of overhead it adds to the whole platform? It's definitely like a lot of media assets
Starting point is 00:23:53 at the petabyte scale. But at the same time, the speed in which we receive the assets is not that huge that you can expect, let's say, from a direct-to direct to consumer case because this is like more on the enterprise software side right so in that case the speed is is less of importance though it is importance in many cases when you know we have to turn around pretty fast for a production so you know for us it's important because you know we want to make sure that you know we store our data in a secure way. And then even the access mechanism of the data
Starting point is 00:24:26 is fairly controlled. So we want to make sure that whoever accesses the data has the right to be able to do that. So the data cannot be viewed by anybody external to that. So that's why we kind of focus a lot on the encryption side of the data. And of course, we have different formats that we store the data, encoding formats, and of course,
Starting point is 00:24:44 each one of them is encrypted as well. So, Yanni, just to understand a little bit better, the way that you have implemented this is like, as we consider, let's say, a file system that itself implements encryption, or you encrypt the object itself on top of the file system? Yeah, so we encrypt the actual object itself that is being stored on AWS, but the way we are going to present the data to a user could be through a UI in a file format, or it could be through a file system in user space. And those data,
Starting point is 00:25:16 the way we're going to see them, are not going to be encrypted, right? You're going to think you're using a normal file system, and you're doing normal interactions as you would do with, let's say, an NFS mount on your laptop, right? And you just see the data. But of course, in order to get the data, you'll have to get the proper privileges and have the proper access to the proper project and so forth.
Starting point is 00:25:36 So from the user perspective, let's say from the artist perspective, you're just seeing a file system. But the way that actual data is stored on the cloud is encrypted. Oh, that's super interesting. And the appropriate user management that you have built and access management and all that stuff, do you use technologies from AWS to do that?
Starting point is 00:25:58 Like IAM or something? Or it's something that you have built internally? We use IAM. IAM. IAM, yeah. But I-A-M. IAM, yeah. Yeah. But of course, there's a number of internal services that our information security team has built
Starting point is 00:26:13 that are specialized for the Netflix business itself. Right. Okay. I think enough questions for the data storage platform. As I said, I didn't intend to ask so many questions around that, but it was super interesting. And every one of your answers actually brought in more questions. So let's move forward to some other questions.
Starting point is 00:26:37 So I have a question. I mean, I've seen that Netflix is quite active in terms of open source. And when I say active in open source, both in terms of how you adopt open source internally, and I think we've heard exams of technologies already, but also by contributing back to the open source community, let's say. What's the reason for that? I understand why you use the tools,
Starting point is 00:27:04 although I'd like to hear also your opinion on that, why you prefer to use open-source solutions. But what's the relationship that Netflix has with open-source and why you decided to do that? And what's the value that you see not only as an organization, but you personally as Yanis for that? Yeah, so I'm going to probably focus a little on the data platform perspective
Starting point is 00:27:25 for this question but you know netflix kind of follows like a balanced approach there are a number of systems that of course we're building in-house they're also like a number of what users of many open source projects you know like kapatska sandra elastic shirts kafka flink and many others and you know we have also sourced a number of our own solutions like Metacad, Iceberg, Evcast for our caching solution, Dynamite, which is a proxy layer for Redis. And we're also using vendor solutions like a lot of relational database offerings from AWS. We have invested a lot into both open sourcing some of our code, but also open sourcing supporting some of the open source community.
Starting point is 00:28:06 I mean, in an example, we have a healthy number of Apache Cassandra committers in our database team. And of course, there are many projects that we're supporting the community as we use those products, both because we use them and we want to make sure that if there's a bug,
Starting point is 00:28:20 we can fix it, but as well as we want to support back the open source community. There are many reasons that we also do open sourcing, but I think fundamentally one of them is, of course, the hiring. You can get really great engineers when they contribute to your project. You tend to know them better by the way they interface, not only about the technical skills, but also some sort of how they collaborate, how they communicate, and so forth. But it's also, I think, are also other benefits in my opinion.
Starting point is 00:28:47 Like for example, when somebody opensource a project and then maintains that project properly and so forth, it becomes like an identity, right? You tend to have these external identities. So to some extent, you know, you make yourself marketable in the future as well. So that's why we see like many ICs are excited about, you know, open sourcing some of the projects. Another reason as well is that, of course,
Starting point is 00:29:09 we run systems in productions that we have in the open source space. And many of these systems, we want the community to contribute to them, evolve and make them better so that they can fix bugs, we can fix bugs that we see. Maybe we're going to see similar problems. But the more of our open source teams that are adopted by the community, the more we're going to have those commonalities between those different commons that use the same open source projects. And of course, as I said, there are a number of projects that we have either donated
Starting point is 00:29:43 to the Apache community or the Cloud Foundation community and so forth, so that we can effectively enlarge the community from just Netflix engineers working on a project. Yeah, I think you taught some very good points around open source and why it's important in a company. That's
Starting point is 00:29:59 really, really interesting to hear, especially what you said about two things. One, hiring, which is important hear, especially what you said about two things. One, hiring, which is important, but the other is also about what you mentioned about collaboration. That's super interesting. So in terms of the projects that you have on short so far,
Starting point is 00:30:17 and if you know, which one is the most successful in terms of adoption by the open source community so far from Netflix? I think there are a number of projects that have been successful. And to be honest with you, most of the most successful ones, I was not involved into them.
Starting point is 00:30:35 So out of the way, what I'm thinking right now, I think Spinnaker has been fairly successful as a multi-cloud continuous delivery platform. Metaflow is another recent example, which we recently spoken about publicly. So these are the two main projects that come to my mind right now that have been very much a big success recently. And your favorite one that came out of your teams?
Starting point is 00:30:59 My favorite one. So I would say I have two favorite ones out of the fact that it was managing those teams, the key value stores on Netflix. so one of them is evcast which is our casting solution that we use on netflix and then the second one was dynamite which is like a proxy layer we use for some of the again no key value stores that we have here at netflix i was part of the when i joined netflix i was part of the dynamite team for about two and a half years helping this project you know contributing back to the open source and And I would say that it was really, really exciting to work and collaborate with a number of companies and open source users. That's interesting. What was the initial need that made you like build Dynamite? You said Dynamite
Starting point is 00:31:41 is the cash on top of Redis, right? Yeah, because I think back then, fundamentally speaking, Redis was a single node system. I think later on with Redis cluster, it became like, again, like a multi-node system, but it was more like a master-slave system, like primary-secondary system, where it's great, it focuses a lot on,
Starting point is 00:32:03 if you think about the CAP theorem or the consistency and partition tolerance, whereas Dynamite and Netflix mostly is focusing more on the availability side because a lot of what makes sense for the business is to make sure that we achieve like seven nines of availability. So that's why we wanted a system that would still have the
Starting point is 00:32:19 properties of Redis, which is really amazing in terms of like a no-key value store with advanced data structures and all the amazing work that Salvatore Sanfilippo has done, but still make it highly available. And that's why we chose to build that kind of proxy layer above Redis. There are also a few other things. We were working on the Cassandra space for many years now, again, another AP system, and we had substantial experience with the way the Dynamo protocol works.
Starting point is 00:32:45 So a lot of the sidecars and the components of the ecosystem were pretty easy or automation was kind of pretty easy to adapt with Dynamite based on this architecture. Ioannis, a question on internal projects that's sort of more general, not necessarily about specific projects, but do you, what is the process like of deciding to undertake a project like Dynamite? Do you have lots of conversations about those things internally? And then as a follow-up, are there lots of things that you talk about that you don't end up building? You know, I think one of the great things about Netflix is the fact that a lot of the decision-making is happening at the software engineering layer, and we have this notion of informed captains.
Starting point is 00:33:36 So yeah, usually the informed captain brings up a business use case on why we need to build a product or a project, and then tries to communicate that with a number of partners and tries to make sure that, you know, there's alignment that this is, this will provide like, you know, substantial business value to the company.
Starting point is 00:33:52 And then, and then continues like building the project to some extent and then try to showcase through maybe a prototype that the value this project is going to have to the company. And then,
Starting point is 00:34:02 it then takes some sort of a natural way, I would say, by, you know, the leadership team funding the project and then making like a successful project within the company so yanni i mean from your experience so far with all these open source projects that you have published at netflix and considering that many of these projects are the outcome of like very specific and at large scale like problems that Netflix has. So I'm pretty sure that there are many people out there like other data engineers who are dealing with probably similar problems, but like not at the same scale, right? So what's
Starting point is 00:34:38 your advice towards like the people out there that they learn about these technologies and how they should use and try to adapt these technologies to the scale of problem that they have about these technologies and how they should use and try to adapt these technologies to the scale of problem that they have. Is there something like that you have seen or you have communicated with the communities out there? And what do you think is important for someone to keep in mind when using all these projects that Netflix is maintaining right now? Yeah, that's a good question.
Starting point is 00:35:02 I mean, I have found even myself fairly challenging to really identify the right source of information overall. So I understand when people see many companies, including Netflix, open source and other projects, which one is the one that some person may want to invest? And to be honest, in many cases, some of these projects are being built based on the advanced needs of a specific company. So, you know, if I was starting new, you know, what we'll propose is, you know, first understand the problem space, you know, before kind of going deeper in a specific solution. Again, unfortunately, I have not really found a great description about, you know, our space, which is kind of data platforms. Other than, I'll say, like, most recent post by Andreessen Horwitz about, you know, the high-level architecture of data platforms other than i'll say like most recent post by andreessen horwitz about you know the high level architecture of data platforms but you know and the second step would probably be like identify a project that is in an interesting area maybe have like a healthy number of contributors that someone can collaborate and grow by learning from other people that are more experienced and of course like that project does not have to be like
Starting point is 00:36:03 necessarily like a Netflix project. But, you know, as I said, you know, if somebody would be interested in a Netflix project, you know, there are like a few of them that have a very healthy community around them. Like one of them is, as I said, Spinnaker, which as I said, a multi-cloud
Starting point is 00:36:15 continuous delivery platform. And of course, like there are other projects that, you know, Netflix has been using. For example, we have been doing, you know, a fairly number of contributions on the Memcast infrastructure and many other projects as well that other companies or other entities
Starting point is 00:36:32 have built. That's some great advice, I think. Cool. Thank you so much for that. So, moving forward and let's talk a little bit more about the data integration platform. Can you describe in a little bit more detail what the data integration platform. Can you describe in a little bit more detail what the data integration platform does and what's the problem behind it? Why it's a problem
Starting point is 00:36:53 like in Netflix? And what's the solutions that you have come up for these problems? Yeah. So the data integrations team is like, I would say, like a small but very talented team, which effectively, you know, focuses in building an array of integrations. The formation of the team initially was done, you know,
Starting point is 00:37:14 based on the fact that we wanted to build some solutions in which we will be able to keep multiple data systems in sync. And so we start investing in building like change data capture solutions and connector, you know, for relational databases like Postgres, MySQL, Aurora, or, and recently,
Starting point is 00:37:31 most recently, about a year ago, we also started investing in the NoSQL space like Cassandra. The latter is a little more complicated because it has those, you know, characteristics of a multi-master eventually consistent system. Actually, one of my team members gave a talk recently at QCon. So he spoke in the details about those for people who are quite interested to listen about that. And of course, we have written a few blog posts about Delta and DBLog, which is kind of the systems we have built.
Starting point is 00:37:57 But at a high level, we were seeing patterns that people were building, they were having different data systems, they were trying to solve this problem with some sort of multi-system transactions which don't really work, or even like with some sort of repair jobs when one of the systems was becoming inconsistent. So we tried to build some sort of solutions that would kind of not need to do that, but rather some sort of parse, some sort of the log of a database, send this
Starting point is 00:38:26 log through a streaming system and then send the data to another system that it's going to be like your secondary system. But as our infrastructure evolved, more services were actually using those database integrations. And effectively, we came towards a more high level project, which is now a project that many teams are working on, which more high-level project, which is now a project that many teams are working on, which is called the data mesh, which is more about centralizing a lot of how we move data between different data systems. We also started in parallel some sort of a different effort. We call it the batch data movement effort, which the focus is more about how we efficiently move the data from like, you know, data warehouses
Starting point is 00:39:07 to effectively to another like key value store. So some sort of, you know, people can do like point queries over there. And of course, like, as I said, you know, we've been working also in systems like, you know, moving data from semi-structured rudimental systems like Airtable and Google Seeds
Starting point is 00:39:23 to our data warehouse so we can do some business analytics and build business intelligence on top of that. So this is kind of the area where we have been investing with this team in the last about a year and a half from now. That's super interesting. So, I mean, I'm aware of like CDC technologies
Starting point is 00:39:41 like DeBasium, for example. I mean, based on my understanding, at least, the most common way of performing CDCs by attaching to the replication log or on the log mechanism that the database has, listen to the changes that happen there and then replicate these to another system. That's on a very high level of how CDCs
Starting point is 00:40:03 usually implemented on a database. But you mentioned also Cassandra, and you said that there are specific challenges there because of the eventual consistency. You have a multi-node environment and all that stuff. So can you give us a little bit more information on how the CDC paradigm is implemented on something like Cassandra. And where do we stand on that?
Starting point is 00:40:29 Do you have this currently implemented and using it inside Netflix? And what are the differences and challenges there compared to the more traditional CDC that we have seen on something like Postgres or MySQL? Yeah, so the CDC events from NoSQL database, like active-active setups like Cassandra, they do have some unique challenges in terms of data partitioning and replication. So, and most of the current CDC solutions for this rely on running within the database cluster and
Starting point is 00:40:59 providing a stream with duplicate events. Our solution was more focused on by dedupling the stream in a stream processing framework. So effectively, this involves having a distributed copy of the source database in a stream processing framework like Apache Flink, which we use a lot on Netflix. And this enables, to some extent, a better handling of the CDC streams, since we have before and after images of the road changes themselves. This is a little different from the traditional CPC that you have seen, as I said, in Postgres and MySQL, MariaDB and other systems,
Starting point is 00:41:32 in which you have a single stream of events that comes from a single node, which is kind of like your primary node. So as I said, yeah, it was a little more difficult to do it in Cassandra because of the challenges of partitioning and replication and so forth. And mainly it was on the duplication of the events. That's interesting.
Starting point is 00:41:54 You mentioned that another thing that you are doing recently is moving data out of the data warehouse and syncing these into a key value store, right? Two questions here. One is, what's the use case behind this? Like traditionally, I mean, data warehouse is considered mainly the destination of data, right? Like we collect the data that's doing some ETL and all that stuff, put it into the data
Starting point is 00:42:20 warehouse. And from there, we do the analytics reporting. And that's like the traditional BI that we have. So what you're describing goes like a step beyond that. And you actually want to pull the data out of the data warehouse and push it into a key value store. Like why you want to do that?
Starting point is 00:42:37 And what kind of data are these that you are pushing into the key value stores? Yeah. So, you know, recently we wrote about a system we developed. We called it the Bulldozer. It's an interesting name. So as I said, you know, there are like many of services that have the requirement to do like a fast lookup
Starting point is 00:42:56 for fine-grained data, which need to be generated like periodically. An example would be like enhancing our user experience on like online application fetches of subscriber preference data to recommend movies and TV shows. And the truth is that, you know, data warehouses are not designed to serve those point queries, but rather the key value stores are designed to do that. Therefore, you know, we built like some sort of bulldozer as a system that can move the data from the data warehouse to a globally low latency, fairly reliable key value store.
Starting point is 00:43:31 And we tried to make a bulldozer some sort of a self-service platform that can be used fairly easily by users by just effectively working on a configuration. Behind that, it uses some of the ecosystem we call the Netflix Scheduler, which is some sort of scheduling framework built on top of Meson, which is a general purpose workflow orchestration system. And I guess the use cases include members who want predicted scores of data to help improve their personalized experience, or it can be metadata from Airtable to Google Sheets for data lifecycle management or even like, you know, modeling data for messaging personalization. When you think about that case of like, or even it can be like, you know,
Starting point is 00:44:13 when you want to write online machine learning, right? You can do that on the data warehouse. You probably need that to do that on a key value store. When you think about though the CDC and the Delta concept which I described below, it's kind of different, right? Because it's actually the opposite direction, right? You move it from primary data store to a secondary, which that secondary could be a data warehouse. Whereas the bulldozer is more from the data warehouse to the key value stores.
Starting point is 00:44:40 And how does this differ from the traditional CDC approach? Or you see it as the same thing, just flipping between the destination and the source? Is the methodology the same? Or because you have to primarily pull data out of the data warehouse, things have to change? Yeah. So if you think about the system where you pull the data
Starting point is 00:45:05 of the dead house, you probably are going to do that in some sort of in a batch way. So you're going to read the data and take them out. Whereas more of the CDC ecosystem is focusing on the real time, parsing the data, parsing the log and real time. Of course, like many people say, you know, what does really real time mean, right? But what you need is in the CDC aspect to really get the mutations that happen with the database pretty fast and then move it in another system because latency matters. On the other hand, when you move data from your data warehouse to a key value store, the latency of actually moving those batches is important, but it's not that important. What is important is the latency to access the data from the value stores. So fundamentally speaking, the systems are
Starting point is 00:45:53 totally different. One of them is, as I said, using a scheduling framework called Meson, sorry, a workflow orchestration framework called Meson, whereas the other one is more like pass the log, throw it to Kafka. And at the same time, thrown into Kafka and then also doing enrichments because what happens is once you move the data from a primary source to secondary, you may want to enrich the data on the way by combining information from different other types of services, which eventually makes the way you design microservices much simpler. That's super interesting.
Starting point is 00:46:26 That's really, really interesting. So what do you use as a data warehouse, Netflix? I assume, I mean, is some kind of like this popular cloud data warehouses like Redshift or Snowflake, or you have something that was built like in-house? So I think our data warehouse consists of a large number of datasets our Snowflake or you have something that was built in-house? So I think our data warehouse consists of a large number of data sets that are stored in Amazon S3 via Hive. We use Druid, Elasticsearch, Redshift, Snowflake, MySQL.
Starting point is 00:46:55 Our platform supports anything from Spark, Presto, Big, Hive, and many other systems. So it's not just a system. It's movable systems that we use. Okay. And Bulldozer can interact with all these different systems? Well, right now the way we put the data is we put the data
Starting point is 00:47:14 out of an iceberg format, right? And then we send the data to the key value store. So in fact, we put the data out of S3 buckets and then in an iceberg format and then we throw them to the key value. Actually, we use an abstraction of the key value stores. And then through that, then it's being sent to the key value stores
Starting point is 00:47:32 like a caching system. Oh, okay. That's interesting. So you have like the different, like what we say of like a data warehouse like Redshift and Spark and and all these different technologies. But at the end, you sync all this data that they leave there on S3 using Iceberg. And Bulldozer comes after Iceberg
Starting point is 00:47:54 to pull the data from there and push it back to the key value stores that you want through this abstraction layer that you have built, right? Right. So Iceberg effectively comes after Iceberg. That's correct. All right. That's correct. After the way we started. All right.
Starting point is 00:48:07 That's great. So we have talked about data sources so far that are traditional data-based systems or data processing systems like Kafka or Spark. How do you also work? I know that you're very microservices pro at Netflix. Do microservices also play a role in all this process? Is something that like, for example, you consider like CDC implemented also on top of like microservices, like pulling data or events from there and moving it around,
Starting point is 00:48:36 or that's like something that the team does not work with, or you don't utilize, or there's no reason to do that like in Netflix in terms of like the architecture in general that you have so you know delta or those cdc events right it's in the platform delta platform is not just the cdc it's also like the way we do enrichments and the way we talk to other services to get information is has simplified a lot of the way we a lot of like the microservice we do we use but you know of course like the micro a lot of like the microservice we do, we use, but, you know, of course, like the microservice communication happens, like gRPC or S10 points and so forth, which is kind of a different thing. But, you know, if you think about, you know, how we used to implement some things in terms
Starting point is 00:49:17 of like the way we communicate with microservices, there are like, Delta can simplify parts of that. And an example would be, I'll give you an example today for like a movie service, right? Where, you know, for a movie service, you know, we want to effectively find information from different microservices, like let's say the deal service
Starting point is 00:49:34 or the talent service or the vendor service. But in this like, and then what we used to do in the past, we used to have this polling system that would actually pull this information from the movie search database, from the deal service, from the talent service, from the vendor service, and then combine them and then send them to the derived data source. With the Delta itself, what we have done is we have simplified a lot this architecture
Starting point is 00:49:58 by effectively, instead of polling the database itself, the movie search data store, what we do is we have the connector service that are pulling the mutations from the data store. And then the Delta application itself does the queries with all the services without really needing to build like a polling service. So in this architecture, which we also described in the blog post of Delta, it has substantially simplified the way
Starting point is 00:50:22 we do microservices in some areas of the business itself, which is mainly on the content side. Super interesting. I mean, I think we need another episode just to discuss about that. And I'll probably also use a search screen or something because I think we are getting
Starting point is 00:50:39 on a lot of complexity that an infrastructure like Netflix has, but that's super, super interesting. So, Yanni, okay, we are close to the end of the episode. Two questions for you before the end. So, one is, you mentioned a lot Bulldozer and Delta. Are any parts of these open-sourced right now? Or do you plan to open-source some parts of these projects?
Starting point is 00:51:07 I think the way that most data systems are going to evolve, it's going to be systems of systems where you're going to be using those open-source components to build those systems. So an example would be that Delta is using underneath
Starting point is 00:51:20 Apache Flink and Apache Kafka. And then there's also the CDC aspect, which we were thinking of actually open sourcing that. We haven't been to the state where we're ready to open source it, but this is something that we're seriously considering on the CDC aspect. But open sourcing the whole platform does really make sense because it's kind of comprised of many systems.
Starting point is 00:51:42 And I guess similarly, Bulldozer, we wrote a blog post about a month, similarly, Bulldozer, we wrote a blog post about a month ago about how Bulldozer works, trying to push the license out to the community so we can receive that feedback.
Starting point is 00:51:53 But, you know, open sourcing that, again, it's probably opinionated to how we do things at Netflix. So I'm not confident that it has to have
Starting point is 00:51:59 a value to be open sourced to some extent. But, you know, whenever we think that something is like, hey, it's an entity that can be open sourced, then definitely this think that something is like, hey, it's an entity that can be open sourced, then definitely this is the focus we're going to usually move forward with. Yeah, makes total sense. And last question, what do you think are the next data platform
Starting point is 00:52:16 problems or challenges that the industry is going to spend time and resources in general to solve? What are some interesting problems that you see out there that haven't been addressed yet? Or it's just that they just, you know, like happened because of the evolution of the industry? Yeah, I think there's like, I think there's like a number of interesting problems that, you know, we're going to see in the near future about in the data platform.
Starting point is 00:52:43 One of them, you know, is data governance in terms of like, you know, looking at, you know, data quality, data detection, how do you build insights about the data? How do you catalog the data? How do you do lineage? And what we have seen around the industry is that, you know, there are like many, you know, separate solutions that addresses each of these problems. But, you know,
Starting point is 00:53:03 I'm curious to see if there's a solution that can actually address this in a more total way. And the other aspect that we're investing heavily here at Netflix is the notion of data mess, where, you know, we want to abstract a lot of the aspects of the data platform. And, you know, instead of like people, you know, developing their own pipelines, we want to provide, you know, that abstraction there.
Starting point is 00:53:22 So it can be like a more centralized, you know, pipeline to some extent with all the features that the users are need need so that's why i think like some sort of the data platform problems will start to become more of like a high level than they are today which is like data warehouses and databases that's great that's some great insights in what is happening in the industry right now. Yanni, thank you so much. I mean, we could keep chatting for at least another hour, but I think we have reached the limit of our time for this episode. Thank you so much for spending this time and sharing all this valuable information with us.
Starting point is 00:53:59 And yeah, personally, I'm really looking forward to see, to meet you again in the future and discuss more about whatever interesting stuff you will be doing in the future. Yeah, thank you so much for having me. Well, that was an absolutely fascinating conversation. And just the types of problems that they have to face at Netflix because of the scale. And just to hear about things that, even things like team structure, where the streaming team dealing with the distribution of assets, whereas Jonas's team deals with sort of collecting and storing and making available those assets. Most companies, those are the same teams. It was just so interesting to learn about
Starting point is 00:54:43 that. What stuck out to you, Costas, as sort of the major takeaways? There are a couple of things, actually. I was very impressed of the size of the teams. First of all, we are talking about pretty small teams, if you think about it, but they are taking care of such a huge infrastructure. And I mean, both the storage infrastructure that they have and also the analytics infrastructure that they have. That was very interesting to see how these small teams can be so agile and so effective. The other thing which I guess it's not only characteristic of Netflix, but other companies of their size,
Starting point is 00:55:15 is like how many different technologies are involved. Pretty much they use across the whole organization, like every data product that exists out there. Every possible data warehouse technology from cloud to on-prem. And at the same time, they still have to be, let's say, on the state of the art of things and build their own technology to support their needs. So that was very, very interesting.
Starting point is 00:55:39 And I think that's a big benefit of observing what these companies are doing is because you can take a glimpse of the future, let's say, of what problems data engineers will have to deal in the future. And the other thing is open source. They are contributing a lot in open source. They have many projects that they maintain out there and quite interesting projects also. But at the same time, probably the complete data stack that they have is based on open source solutions. And they contribute back again. We had Yiannis,
Starting point is 00:56:12 for example, saying about the contribution to Cassandra. They have quite a few committers in the company, but they're committing back to Cassandra. So these are the things that I found extremely interesting. Another thing that I would like to ask everyone to pay some attention to is the concept of CDC. They are really investing a lot on implementing solutions on top of CDC. And it's something that I think we will be hearing more and more about in the near future. And of course, that's something that we discussed also with Devaris from Roxa, if you remember. I mean, his company and his product is all about CTC. And I think that this is like a term that we will hear more and more in the near future.
Starting point is 00:56:52 Great. Well, thank you so much again for joining us on the Data Stack Show, and we will catch you next time.
