In The Arena by TechArena - The Demand for Memory Innovation to Fuel ML with Netflix

Episode Date: March 26, 2024

TechArena host Allyson Klein interviews Netflix’s Tejas Chopra about how Netflix’s recommendation engines require memory innovation across performance and efficiency, in advance of his keynote at MemCon 2024 later this month.

Transcript
Starting point is 00:00:00 Welcome to the Tech Arena, featuring authentic discussions between tech's leading innovators and our host, Allyson Klein. Now, let's step into the arena. Welcome to the Tech Arena. My name is Allyson Klein. This morning, I'm so delighted to be joined by Tejas Chopra of Netflix. Tejas, welcome to the program. I'm so happy to have you on the show. Thanks a lot, Allyson, and I'm very honored to be here. Thank you so much for interviewing me today. So Netflix obviously needs no introduction, but it's the first time we've had Netflix on the show. Why don't you talk about your role at Netflix and how it relates to our topic of memory innovation today? Absolutely. I have been at Netflix for a little more than three and a half years now, and currently I work on machine learning infrastructure,
Starting point is 00:01:08 which means that when you go into Netflix and look at all the recommendations, they are powered by algorithms that run on the platform my org and my team support. A lot of it has to do with how we manage storage, compute, and memory appropriately to run these machine learning models, which is why memory is a very critical aspect of our entire machine learning lifecycle. We consider it one of the most important pillars for developing the recommendations that delight our customers. Now, I think everyone has experienced your recommendations. I do almost every day. But can you take us back for a second and just describe the architecture that fuels the Netflix that everybody enjoys, from cloud to edge, and why it's so important to have
Starting point is 00:02:07 that continued connectivity across that pipeline to ensure great customer experiences, including recommendations? Absolutely. If you think about how any software is delivered, how you get your bits to your device, to your phone, your laptop, your tablet, there are multiple ways to do that. One way you can think about it is: I have this content in the cloud, and whenever someone wants to watch a movie, I stream it from the cloud. There are two disadvantages to this approach. One of the biggest is that cloud is not cheap when it comes to egress, which means it's easy to put data in, but very expensive to pull data out of the cloud.
Starting point is 00:02:50 Now, if we paid that cost every time any of our millions of customers watched a movie, it would become very expensive for Netflix to maintain. The obvious next solution is to have locations around the world where you can cache content, so users connected closer to those locations can pull the content from that intermediate location rather than from the cloud. That is where Netflix really thrived. What we did was partner with internet service providers to have a device installed in their physical locations. This device is just a regular off-the-shelf box, but it runs Netflix code, which is Open Connect. Open Connect is an open source CDN that Netflix has developed and open-sourced,
Starting point is 00:03:39 so all that device does is run this Open Connect code, which allows Netflix to push its content to the edge, to the ISP location. It's a win-win for the ISP as well, because most of their bandwidth is no longer occupied by Netflix traffic. Instead, they give up a little physical space in their hub, and Netflix can have its services installed there and leverage its own bandwidth. Now, as you can imagine, there are a lot of challenges
Starting point is 00:04:15 when it comes to content delivery. The first challenge is that you need low latency, because that's the entire experience, so you have to cache the video content at the edge of the network. Today, the Netflix Open Connect CDN consists of 17,000 edge servers, installed in more than a thousand locations worldwide. The second challenge is the rise of 4K, HDR, and high frame rate content, so you need edge servers with powerful CPUs, GPUs,
Starting point is 00:04:51 and network interfaces to handle the large data traffic. Netflix has developed a lot of proprietary video encoding and adaptive bitrate streaming algorithms that allow us to deliver the highest quality for the screen on which you are watching Netflix. Of course, Netflix has also pioneered the idea of microservices, so scalability and reliability are ensured by our very decoupled microservices architecture. These are the various ways in which we have made delivery of video at the edge really unique for Netflix, differentiating us from the competition.
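To make the edge-serving model concrete, here is a minimal sketch of CDN request steering in the spirit of Open Connect: prefer the nearest ISP-embedded appliance that already holds the title, and fall back to the cloud origin, with its egress cost, only on a miss. The names here (EdgeCache, pick_server) are illustrative assumptions, not Netflix's actual API.

    # A sketch of Open Connect-style request steering. EdgeCache and
    # pick_server are hypothetical names, not Netflix's real interfaces.
    from dataclasses import dataclass, field

    @dataclass
    class EdgeCache:
        name: str
        rtt_ms: float                          # measured latency to the client
        titles: set = field(default_factory=set)

    def pick_server(title_id, edges, origin="cloud-origin"):
        # Serve from the nearest edge that has the title cached;
        # only a miss falls through to the origin (and pays egress).
        candidates = sorted((e for e in edges if title_id in e.titles),
                            key=lambda e: e.rtt_ms)
        return candidates[0].name if candidates else origin

    edges = [EdgeCache("isp-hub-a", 8.0, {"tt001", "tt002"}),
             EdgeCache("regional-pop-b", 25.0, {"tt001"})]
    print(pick_server("tt001", edges))   # isp-hub-a
    print(pick_server("tt999", edges))   # cloud-origin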
Starting point is 00:05:37 Yeah, I mean, I use your infrastructure and your solutions as an example of so many different technologies and first innovations. First innovations in ML, first innovations in CDN. It's really groundbreaking what you have all been able to accomplish. When you look at ML, and at the recommendation engines that have really defined the Netflix experience, what are you trying to innovate today? Obviously, AI is advancing rapidly. How do you see the world of recommendation engines changing? And what are you trying to do from the performance perspective of the infrastructure? That's a great question. AI is everywhere now.
Starting point is 00:06:20 And when it comes to recommendations, you can start by understanding, first and foremost, what recommendation is and what it isn't. If someone ends up searching for a category on Netflix, or puts anything in that search bar, it means a loss for recommendation, because we were not able to recommend something; they had to search for it. So those are signals we collect: people searching, people scrolling. At its core, recommendation is powered by collecting data about what the user has watched and what the user has skipped. We collect this data and put it in our storage media. This is something that I work on, the parts where this is stored.
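A hedged sketch of the implicit-feedback collection Tejas describes, with a search event treated as a recommendation miss. The event names and the in-memory list are illustrative stand-ins for a real durable event log, not Netflix's actual pipeline.

    # Hypothetical signal logging: plays, skips, scrolls, and searches are
    # recorded as implicit feedback and later read by offline training jobs.
    import json, time

    EVENT_TYPES = {"play", "skip", "scroll", "search"}

    def log_signal(profile_id, event, payload, store):
        assert event in EVENT_TYPES
        store.append(json.dumps({
            "profile_id": profile_id,
            "event": event,            # "search" counts against the recommender
            "payload": payload,
            "ts": time.time(),
        }))                            # stand-in for a durable event log

    store = []
    log_signal("profile-42", "play", {"title_id": "tt001", "watched_s": 1800}, store)
    log_signal("profile-42", "search", {"query": "korean thriller"}, store)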
Starting point is 00:07:13 And then we train our models offline to actually get better over time. When it comes to challenges, or what recommendations can evolve to with the new AI, the first thing, and it's why we are talking here, is the growing importance of memory. Memory has become very critical to recommendations, because as generative AI and machine learning models grow, the number of parameters used to train these models has also increased. Today, we are looking at hundreds of billions of parameters, even trillions. All of this requires a lot of memory to be trained and built efficiently. Secondly, there is this notion with Gen AI of encodings. It's something new, right? We
Starting point is 00:07:59 use encodings so that we can get inferencing faster, which means it's not only about training the model; it's also about getting the result quickly when you ask for something. That is where encodings are used. And for such huge models, the encoding space is also huge. Encodings are generally kept in memory, which means that the footprint of memory is continuously increasing.
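As a rough illustration of why those in-memory encodings grow the footprint, here is a minimal sketch of an embedding-style lookup table kept resident in RAM so that ranking becomes a dot product instead of a full model pass. The dimensions, the random vectors, and the cache itself are purely illustrative assumptions.

    # Hypothetical in-memory embedding cache for fast inference.
    import numpy as np

    DIM = 256
    embedding_cache = {}                  # title_id -> vector, RAM-resident

    def get_embedding(title_id):
        vec = embedding_cache.get(title_id)
        if vec is None:
            vec = np.random.rand(DIM).astype(np.float32)  # stand-in for a model pass
            embedding_cache[title_id] = vec
        return vec

    def score(profile_vec, title_id):
        # Ranking reduces to a dot product against the cached vector.
        return float(profile_vec @ get_embedding(title_id))

    # Footprint grows with catalog size: 1e6 titles * 256 dims * 4 bytes
    # per float32 is roughly 1 GB for a single embedding table.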
Starting point is 00:08:22 Netflix is trying to pioneer a lot of interesting innovation in the field of memory and infrastructure to support Gen AI. So when it comes to memory, I think most people's minds go to standard DDR technology, DDR5. And platforms are offering up to 12 channels of memory
Starting point is 00:08:43 at this point in standard configurations. You can load a system full of a lot of memory, but that doesn't solve the whole problem for you. So tell me, what are the characteristics of memory that you're looking for, between capacity, latency, et cetera? And what is going to fit the bill for these recommendation engines? That's actually a great question. There are several types of memory you can think about. There's DRAM, and then there are LPDDR5 and HBM, and they all have their advantages and disadvantages. DRAM is generally the workhorse when you need caching
Starting point is 00:09:27 and buffering. HBM is specifically being used for training large language models with huge data sets. And then you have LPDDR5, which supports 4K and 8K video streaming and is supposed to be very beneficial for devices. So they have different requirements and needs. In our case, we use a plethora of these options, along with persistence, because persistence becomes very critical for us. We've developed something called EVCache, which is Ephemeral Volatile Cache.
Starting point is 00:10:03 You can think of it as a memcached alternative, but with some notion of persistence as well. A very good example of this: if I need recommendations and I'm running queries over a lot of data, the query may be slow if that data sits in some storage medium like S3. But by having this layer of ephemeral volatile cache in front, those queries complete much faster.
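That read path is the classic cache-aside pattern. Here is a minimal sketch of it, assuming a hypothetical CacheClient and a dictionary standing in for S3; EVCache's real client API differs.

    # Cache-aside sketch: check the fast RAM-backed layer first, fall back
    # to slow object storage on a miss, then populate the cache.
    class CacheClient:
        def __init__(self):
            self._data = {}               # RAM-resident in a real deployment
        def get(self, key):
            return self._data.get(key)
        def set(self, key, value):
            self._data[key] = value

    blob_store = {"features/profile-42": b"..."}   # stand-in for S3

    def read_features(key, cache):
        value = cache.get(key)            # fast path: microseconds
        if value is None:
            value = blob_store.get(key)   # slow path: object storage
            if value is not None:
                cache.set(key, value)     # populate for the next query
        return value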
Starting point is 00:10:37 So that's where we are leveraging memory a lot today. For us, it's about balancing the needs of cost and utility, because good memory is very expensive, and cheap storage is not good for queries. Having the right balance, and fitting the right thing where possible, is very helpful. That's how we look at the different memory options today. Tejas, you almost broke my heart with the word persistence. I think I'm one of the only people on the planet that worked on persistent memory, both at Intel and at Micron. When you think about persistence and why that's so important, what is the industry delivering today that fits that need? And what do you want to see from the industry in this space? Because I'm struggling with the available options to see what's going to actually satisfy your demand. That's a great question again.
Starting point is 00:11:41 So when you think about persistence, the reason you need it is that you have a lot of data that cannot sit in memory, and you would pay through the nose to get that much memory. So what do you do? You use a notion that has been true since the start of time: you keep the hot data in memory, somehow identify the cold data,
Starting point is 00:12:08 and then move it to persistent stores or some other media. That media happens to be a persistent store because it's cheaper than constantly applying charge to keep the memory up and running.
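A minimal sketch of that hot/cold tiering, assuming an idle-time threshold for demotion; the tier names and TTL are illustrative, not a description of Netflix's actual policy.

    # Hypothetical two-tier store: recently touched keys stay in DRAM,
    # idle keys are demoted to a cheaper persistent tier and promoted
    # back on access.
    import time

    HOT_TTL_S = 3600                      # assumed idle threshold
    hot = {}                              # key -> (value, last_access) in DRAM
    cold = {}                             # stand-in for NVMe / persistent store

    def read(key):
        now = time.time()
        if key in hot:
            value, _ = hot[key]
            hot[key] = (value, now)       # refresh recency
            return value
        if key in cold:
            value = cold.pop(key)
            hot[key] = (value, now)       # promote on access
            return value
        return None

    def demote_idle_keys():
        # Periodic job: push anything idle past HOT_TTL_S down a tier.
        now = time.time()
        for key in [k for k, (_, ts) in hot.items() if now - ts > HOT_TTL_S]:
            value, _ = hot.pop(key)
            cold[key] = value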
Starting point is 00:12:33 So what I want from the industry is advancements in non-volatile memory technologies for persistent storage. And there are options here; NVMe is one of them. Non-volatile memory capacities are reaching multiple terabytes per DIMM, and what I would like to see is read-write latencies for these falling below 100 nanoseconds, for example. That would enable new possibilities for in-memory databases, real-time analytics, and large-scale AI model checkpointing, things we cannot do today because there is a big gap between what is memory and what is storage.
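For a rough sense of the memory-storage gap Tejas points to, here is a back-of-envelope comparison using commonly cited ballpark latencies; the exact figures vary widely by device and are assumptions here, not measurements.

    # Ballpark access latencies (illustrative orders of magnitude only).
    latency_ns = {
        "DRAM": 100,               # ~100 ns
        "NVMe SSD read": 100_000,  # ~100 us
    }
    gap = latency_ns["NVMe SSD read"] / latency_ns["DRAM"]
    print(f"NVMe is ~{gap:.0f}x slower than DRAM per access")
    # Sub-100 ns non-volatile memory would close most of this gap, enabling
    # the in-memory database and AI checkpointing use cases he describes.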
Starting point is 00:13:18 There is another thing I think many companies will care about: the energy footprint of AI. That is something I talk about generally as well. What's the carbon footprint of AI? I think energy-efficient memory is one other thing I would like to see in 2024, with companies building architectures that are better for the environment. So those are the two major things. Yeah, that's fantastic. MemCon is coming up later this month. I'm really excited about it. What are you looking forward to at the event, and what is your talk going to feature? Yeah, so I'm very excited about MemCon. I'm looking at several areas where I want to learn more about how the industry is approaching things. Like I said, memory-centric computing architectures:
Starting point is 00:14:08 what are the new architectures there? Samsung has come up with processing-in-memory DRAM technologies. There are, like I mentioned, advancements in non-volatile memory technologies for persistent storage. And advancements in memory specific to AI, meaning high bandwidth, low latency memory.
Starting point is 00:14:32 NVIDIA's GPUs work with a particular type of memory; what are some of the ways that can expand to other memory architectures? And seamless integration of memory across the stack: this is CXL, which a lot of people talk about these days, and which enables memory sharing between CPUs, GPUs, and other accelerators.
Starting point is 00:14:51 I would like to see how CXL is being adopted and embraced in the industry, and also what software optimizations people in different industries are building on top of existing memory primitives. So those are some of the areas I would love to learn more about. I am talking about how memory is being used when it comes to machine learning infrastructure, focusing specifically on some of the
Starting point is 00:15:16 optimizations you can look at when it comes to ML: what is the nature of the data structures, and why is memory so important in ML? Well, I am really excited for your talk. I will be in the audience listening, and you gave us a really nice preview today. So one final question for you, Tejas, before I let you go: where can folks connect with you and find out more about your machine learning work? Absolutely. I work on the infrastructure side of machine learning, powering some of these models and applications. People can connect with me on LinkedIn or Twitter. And I'll be at MemCon as well, so that's a good opportunity to share ideas and have conversations.
Starting point is 00:16:07 Fantastic. MemCon, of course, is March 26th and 27th in Mountain View at the Computer History Museum. You can meet Tejas there as well as myself. So thanks so much for your time today. Thanks a lot, Allyson. It was a pleasure, and thank you for inviting me today. Thanks for joining the Tech Arena.
Starting point is 00:16:30 Subscribe and engage at our website, thetecharena.net. All content is copyright of The Tech Arena.
