Orchestrate all the Things - LinkedIn's feed evolution: more granular and powerful machine learning, humans still in the loop. Featuring LinkedIn Senior Director of Engineering Tim Jurka and Staff Software Engineer Jason Zhu

Episode Date: October 16, 2023

LinkedIn is a case study in terms of how its newsfeed has evolved over the years. LinkedIn's feed has come a long way since the early days of assembling the machine learning infrastructure that powers it. Recently, a major update to this infrastructure was released. We caught up with the people behind it to discuss how the principle of being people-centric translates to technical terms and implementation. Article published on Orchestrate all the Things

Transcript
Starting point is 00:00:00 Welcome to Orchestrate All the Things. I'm George Anadiotis and we'll be connecting the dots together. Stories about technology, data, AI and media, and how they flow into each other, shaping our lives. LinkedIn's feed has come a long way since the early days of assembling the machine learning infrastructure that powers it. Recently, a major update to this infrastructure was released. We caught up with the people behind it to discuss how the principle of being people-centric translates to technical terms and implementation. I hope you will enjoy the podcast. If you like my work, you can follow Linked Data Orchestration on Twitter, LinkedIn, and Facebook. Hey George, thanks for having me. I'm Jason, a machine learning engineer in LinkedIn's Foundational AI Technologies organization.
Starting point is 00:00:46 So we are a horizontal team aiming to connect multiple AI-powered applications at LinkedIn with state-of-the-art AI techniques and algorithms. We basically prototype and build the foundations behind those advanced algorithms that can benefit multiple vertical use cases at LinkedIn, which can potentially deliver tremendous values to our members. I grew my interest in machine learning back in undergraduate study,
Starting point is 00:01:12 where I was fascinated about computer vision, specifically facial recognition, landmark detection, those kind of stuff. I was trying to connect those with multiple real-world applications. And then I came to US to pursue my graduate study at REST University. I got a great opportunity to intern at LinkedIn feed AI team. I worked with a bunch of talented engineers on high impactful projects. It was really an exceptional experience. So naturally, I decided to come back and join as a full time engineer, continuously working on ranking and personalization problems.
Starting point is 00:01:50 Yeah, this is pretty much how I come to the field. Okay, great. Thank you. Thanks for the intro. And I also thought it would be a good idea for the benefit of people who may be listening to the conversation to give a little bit of background on myself and LinkedIn. So it was a long time ago, like six years ago by now, that I had the opportunity to converse on the principles and inner workings of LinkedIn's feed and how specifically machine learning powers the feed with the people who were responsible for it back then. Now, obviously, it's been a long time since.
Starting point is 00:02:33 I mean, six years is like a century or something in technology. And back then, it was only the beginning of applying machine learning to the field. And so we got to talk a lot about principles and how everything came together. Now, things have evolved quite a bit, I guess, since then. And to be perfectly honest with you, I missed some of that evolution. However, I was also wondering how have some key principles that were outlined back then hold up today?
Starting point is 00:03:07 For that purpose, we also have joining us today Tim Jurka, who will address these questions before moving on to the upgrades that Jason's team implemented. Tim is Senior Director of Engineering at LinkedIn, currently leading a global team of data scientists, AI engineers, and computational linguists that use data to guide LinkedIn's flagship app strategy and make the member experience more intelligent. So, Tim, what I wanted to ask you first is, back in 2017, the human-in-the-loop strategy was a key part of LinkedIn's approach to how machine learning was used to power the feed and other applications. Is that still a thing today? Yeah, so I would say human loop is definitely
Starting point is 00:03:52 still a thing. It's actually probably even more a thing than it was, you know, several years ago. Really, it boils down to two applications for the feed. The first is we're really trying to prioritize insightful, authentic, knowledge-oriented content in the LinkedIn feed. The first is we're really trying to prioritize insightful, authentic, knowledge-oriented content in the LinkedIn feed. And that's just, it's a very hard problem to get right just with AI alone. But we have an entire team of award-winning journalists led by our editor-in-chief, Dan Roth. And we've actually created AI algorithms that train on the content that they curate and believe is really the cream of the crop in terms of LinkedIn insights. And so that AI algorithm then tries to say, you know, would an
Starting point is 00:04:32 editor promote this on LinkedIn's platform? And if so, let's try to distribute on the platform and see which audiences resonate with that particular piece of content. So that's one particular area we use it. The second is just with large language models, you know, really taking the world by storm. The way that we make those just really high quality is by fine tuning them using techniques like reinforcement learning with human feedback. And so, you know, large language models, again, are presenting yet another way where human loop is much more deeply integrated into the AI workflow versus where it was maybe five or six years ago. Okay, that's great. Thank you. And then in terms of spam detection, I know previously in 2017, LinkedIn used a three-bucket system. Basically, content was classified in three different buckets
Starting point is 00:05:24 according to its likelihood of being low quality. And then those three buckets were further refined. Is that still a similar approach today or has that changed at all? So our system of spam detection works in the same way. There's really, like you said, three buckets. One is, is there content that violates LinkedIn's Terms of Service, in which case we want to make sure we remove that from the platform we try to identify low quality content
Starting point is 00:05:48 and then everything else is clear and kind of distributes on the platform i would say what has maybe changed uh since the last time we chatted is that that third bucket the kind of clear bucket we've we've invested a lot more in deeper content understanding understand the intent behind the post. Is this somebody trying to share a job opportunity? Is it somebody trying to share their opinion about the news? And getting a lot more granular so we know how to distribute the kind of clear content on our platform.
Starting point is 00:06:16 Okay. So how exactly does AI play into this process? Yeah. So first and foremost, when content is violating LinkedIn's terms of services, we try to be very proactive and our AI classifiers try to take that down in an automated way. So that's, you know, our first line of defense. But, you know, as with any probabilistic system, AI is not perfect and some things do get through.
Starting point is 00:06:42 And that's where we have multiple tiers of defense, including proactive kind of human detection, right? Either our members reporting it and us reacting and making sure we evaluate the content or content moderators looking out on the platform for anything that might have split through our automated defenses. So those are the main ways we use AI and humans together to moderate our platform. Thank you. Then if you could just quickly describe the rationale for how the feed has evolved over the years from 2017 to 2021,
Starting point is 00:07:19 when you all talked a little bit more about multi-objective optimization being brought to the feed. and then to where we are today with this use of embeddings and similar features. Yeah, the evolution of the feed ranking algorithm, I mean, it really follows the evolution of the LinkedIn feed. And since 2017, we have just a lot more members using the LinkedIn feed product. We have a lot more content creators publishing on LinkedIn. And our algorithms have had to keep pace with all the kind of diverse use cases that people come to with LinkedIn feed. And so a lot of the complexity here is just making sure that our AI models can capture all those diverse use cases.
Starting point is 00:08:00 For example, you might have a post from a member mentioning that they raised their Series A for their startup, and now they're hiring. And that post can be perceived through three very different lenses. If you're a job seeker, that post might actually be an entry point for you to reach out and get a job. And the AI model has to understand that value to you. If you're a first degree connection of that individual, it might be just to congratulate them and say, hey, I'm just checking to see, you know, you raised series A, congrats. If it's somebody who's in maybe the venture capital space, this is actually an insight where they are like, huh, I didn't know that this particular company raised series A. They may want to dig in and learn
Starting point is 00:08:35 a bit more about the term sheet. And so our AI models actually have to capture all these different types of value depending on who the member is and what goals they're trying to accomplish with the LinkedIn feed. And so the size of our models has kind of grown, complexity of our models has grown, the architecture has changed to really capture all that nuance. And so it seems like it was around 2021 that LinkedIn sort of overhauled the feed ranking platform. And the idea there was to introduce something called multi-objective optimization. And, well, there were a number of TensorFlow models trained.
Starting point is 00:09:14 And, Jason, I think it would be a good idea if you could sort of gradually introduce us to what it was that you did back then. And from there, we can actually use it as a stepping stone to come to what it is that you did today. Yes. Before we dive into the details of the architectures and TensorFlow multitask learning model, I'd like to first give our audience some context on the problem space of the blog post we published, which is the second-pass ranking problem. As you may know, like, our homepageage Feed contains a heterogeneous list of updates
Starting point is 00:09:47 from members' connections and a variety of recommendations, such as jobs, people, articles. I think the objective here is to provide a personalized rank list that helps our professional on the platform to be more productive and successful. So basically, we are adopting a two-state ranking,
Starting point is 00:10:06 where each type of updates are called first pass ranker. So basically, the top key candidates of each update are selected by individual algorithms before sending to the second pass ranker for a final green ranking. So essentially, we are working on the second pass ranking problem, which is a final green ranking. So basically, the models are presented with a set of features from both member and content set.
Starting point is 00:10:33 The model is trying to predict a set of likelihood of responses, such as the likelihood of starting a conversation, likelihood of performing a viral actions on particular post executive. Yeah, basically, from the modeling perspective of how we achieve this goal, right? Things evolved so fast, and we have keep up with rapidly evolving industry to continuously modernize our stack to deliver the best value and experience to our members. I think the earlier blog post mentioned that we transition our ranking model from a linear-based model to a deep neural network model because we firmly believe that the deep model is more effective because it introduces nonlinearities through activation functions, basically enable the models to capture more complex and nonlinear relationships in our data and learn more powerful representations. I think basically with this unified setup in the earlier post,
Starting point is 00:11:33 so instead of having individual linear model for predicting each different responses separately, we have a multi-task deep learning model that shares parameters between different tasks. So this sort of like learning of each task can benefit the other tasks through the transfer learning, right? With the large data and this inter-task relationship can be captured more effectively than having separate linear models for predicting responses separately.
Starting point is 00:11:59 I think some of the challenges we are facing is, for example, our data set is highly skewed between different responses, prediction tasks. So we have, for example, we have more click responses than engagement. We need to do like cautious sampling and rewriting to balance that factor to not cause some negative task interference during the training process because all those tasks are sharing parameters. Okay, so you introduced what I would call the cause behind this evolution. So basically, you wanted to go from linear to non-linear models in order to have, I guess, to be able to have a better feed in the end. So to be able to accommodate more inputs and more parameters and to have more fine grade models
Starting point is 00:12:56 that are also able to feed on each other's output, I guess. So that also suggests that what happened as an effect of this evolution was probably that the number of models as well as the number of parameters that these models work with multiplied. And I wonder if you have any data that you may be able to share as to what exactly happened? So was it like, I don't know, factor of two, three, 10, or if you have any idea of the quantitative effect that this evolution had? Yes, I think the model size has grown by 500x and we have also grown our data sets by around 10x, something like that. I think you mentioned the model sets scaled a lot, right? We also changed our data significantly
Starting point is 00:13:57 from our initial version as our model become more powerful and data hungry. I think one of the natural things we can think of is to increase the data sets to harness more from our data, right? Basically, we expanded the training period to a longer duration and sampled more training data from the served model. This process also presented some challenges because, you know, the served model may favor certain content, such as numbers, creators, all those popular ones. These signals are incorporated in our training data because we are collecting the training
Starting point is 00:14:33 data from those served model. This kind of bias can be further reinforced when training the model on those data generated by those models. So we are tackling those kind of challenges. And we mentioned that in our latest blog post, we can dive a little bit deeper into that. If you're interested. Yeah. Yep. Yeah, absolutely. And that was going to be the next set of questions I have for you. But actually, before we go there, I know that a lot of the work that you did, I will sort of hint that you did, I think you serve a dual goal in a way.
Starting point is 00:15:10 So on the one hand, it sounds like you started out with the objective of, again, once more optimizing the behavior of the model and actually also the performance, I guess. But that also drove you in another direction because you also ended up upgrading the infrastructure on which those models run. But before we go into both of those directions, actually,
Starting point is 00:15:34 I would like to take a little bit of time to just summarize in a very abstract way what it is that you did. And because I know that a central part of what you did actually focuses on embeddings, again, for the benefit of people who may be listening to the conversation, if you can just very quickly explain
Starting point is 00:15:56 what embeddings are in the first place. Yeah, yeah, sure. Basically, so in our latest work, I think our focus is on transforming high-dimensional sparse ID features into embedding space. So we can interpret sparse features as categorical one-hot representations, where the cardinality is several millions. In these representations, all the entries are zero except one, which corresponds to ID index. So it can be an example of a hashtag ID or a member ID. They will be transformed to low-dimensional continuous dense space, which we call embedding space, using the embedding lookup tables with hundreds of millions of parameters trained on multi-billion records.
Starting point is 00:16:46 Basically embeddings is just a dense vector with continuous value which provides an effective way of capturing essential relationships and patterns within the data while reducing the computational concept complexity. And in such a case each member's preference is encoded in a very low dimensional dense flow vector. And also, semantically similar content or IDs will have smaller distance in the embedding space. It has been used heavily in the language model domain where you convert those word or tokens into embedding space and similar word will have smaller embedding distance and by like presenting the model with such dense vector the model will understand the semantic meanings between the different words or different ids this is pretty much a high level introduction of the embedding and how we transforming those high dimensional sparse ID features into embedding space. Okay. Just a quick step back again, before we go any further.
Starting point is 00:17:50 So to you, it's obvious, but perhaps it's not obvious to everyone. So why are embeddings relevant in that context? So in other words, why do you end up with very, very sparse vectors? Yes, it is for personalization, right? Things we are trying to convert the ID feature into dense representation. You can think of the ID feature can have millions of dimensions.
Starting point is 00:18:17 For example, we have millions of dimensions for member ID because we have millions of members on the platform. And we want to, for each of the members, we want to learn a personalized dance representation. So this dancer representation will encode users preference, maybe like interaction with the past users in the past, a fixed period or the user's preference of the hashtag in the past fixed period. Basically all those information will be encoded in this dense vector during the training on the large data and by presenting the model with such personalized vector, the model will better be able to capture the dynamic shifting work and
Starting point is 00:19:01 also your members preference better. You can think of it in such a way. Just to be able to visualize that so let's take a random LinkedIn member like I don't know myself for example would this vector that you create for that member have representation slots, let's say, for every possible hashtag, for every possible other LinkedIn member, and I don't know, for every possible group and so on. This would definitely explain why you end up with such a sparse vector. Yeah, yeah. I'll just give you a concrete example, right? So for example, you may have interacted with AI and ML hashtags, right?
Starting point is 00:19:45 In the past, I also interact with the generative AI hashtag. These are different hashtags, but they have semantic meanings in the embedding space. They have very smaller distance in the embedding space. As a result, the model can understand you may have the similar taste as mine, and the model can recommend similar contents in the AI domain for both of us. This is a concrete example of how this come into effective in terms of personalization.
Starting point is 00:20:16 And in our model, as mentioned in the blog post, for each of the content we are generating, for each of the viewer, we are generating the hashtags this viewer has interacted in the past fixed period. And also the member this viewer has preference or often interact with same type of content or similar group of other content, other members tend to have similar embeddings resulting in smaller distance in the embedding space. So this capability basically enables a system to identify and recommend contacts that is contextually relevant or aligns with our members' preference. Okay, well thanks. I think that makes it a bit more approachable to people. So given everything
Starting point is 00:21:15 we've just said about embeddings and what they represent specifically in the context of LinkedIn and the number of parameters that may go into those, specifically affecting the experience that people have on the feed. What was the objective when you started this endeavor to optimize the behavior of your models and how did you go about it? Yes. So basically we're trying to optimize for engagement, basically predicting better likelihood in terms of how user will interact with different posts.
Starting point is 00:21:57 And then by enhancing the prediction of those likelihood, we can better rank the feed that's providing a user more engaging experience. Okay. So how do you embeddings and sparse IDs come into play into this high level objective? So what did you have to do in order to produce this better engagement?
Starting point is 00:22:24 And how does actually also also the upgrading hardware comes into play into that? Yes, I think, for the first question, these embedding tables are trained together with our objective in terms of our like, response prediction model. So basically they are trained end to end. So basically the embedding the dense vector captures the user's preference of the likelihood
Starting point is 00:22:53 of these different responses. So it contributes to the final prediction of the likelihood and the overall ranking, right? In terms of hardware upgrade, i think we definitely upgrade our uh the hardware in our serving and training cluster uh so basically for serving we need to have like hardware that have large larger gpu and computation power uh in order to serve multiple large models in parallel. And for training, we also upgrade our GPU hardware and also move to Kubernetes cluster for more resilient task scheduling.
Starting point is 00:23:37 Okay. Okay. That's, that's interesting. And so it makes me wonder whether this is something that was foreseen at the beginning of this project or something that came up as a side effect, let's say. And then I guess if the latter was the case, then probably you would have to go through some process of approval for getting these upgraded capabilities.
Starting point is 00:24:03 Yes, usually we have to ramp up prototyping model online to prove the online metrics gains and before convincing us that we want to fund in hardware. And also something interesting is we are building this kind of large model in a progressive way. Initially we are facing challenges of serving dozens of large models in parallel. Yeah, we're serving host. After enlarging the model size, but like by 500X, right? The host was just not designed
Starting point is 00:24:35 to be able to handle such scale. So we chose to serve the model with a two-stage strategy in the first place. In the first stage, we have a global model that is trained as normal. So basically, ID features are converted to dense vector from large embedding tables first, and then these dense vectors are sent to deep neural network along with existing features for implicit feature interactions. In the second stage, we split the global model
Starting point is 00:25:01 at the boundary of embedding table and deep neural network. Since from the deep neural network's perspective, it only needs a set of dense vectors, which is the embedding representation of those sparse features for predicting the results. The ID conversion step is not necessary to present it on the critical parts of model serving. So we can essentially convert those embedding offline and push them to high performance key value stores, such as we don't need to host those embedding tables in our serving host because serving host has limited memory at that moment. Well, we are upgrading our host. We use this strategy to deliver tremendous values to our members because the model can still, the features generally offline don't have some lag compared to
Starting point is 00:25:49 you serve them as a whole in memory. But it still generates some values and deliver metrics gains. So we shape that strategy while we are working on upgrading the host. This is something very interesting. And later on, I think once we upgrade our hardware, we move everything in memory
Starting point is 00:26:09 and do a bunch of memory optimizations that we can talk about later on. Okay. So obviously the feed is a very, very central feature of LinkedIn and therefore the work that you did is also by extension very central. I'm wondering, however, if those embedding tables are potentially also used by other models in the context of LinkedIn.
Starting point is 00:26:36 And if yes, then I guess what you did, besides being central in itself, would probably also serve other functions. Good question. Yeah, I think the embedding table itself is highly coupled with the data set or the problem we are trying to solve. But I want to mention that the technique itself is definitely generalizable.
Starting point is 00:26:59 As I mentioned, we are a horizontal team. Though the blog post is specifically for feed, the technology or the infrastructure foundation will be built behind this. It can definitely be leveraged by multiple different use cases. Actually, we've already leveraged the same set of components or techniques by different other use cases within LinkedIn, such as job recommendation and ads recommendation, those kind of vertical use cases. We are adopting a very similar strategy
Starting point is 00:27:31 of converting those first ID to the dense representation. We believe that's a scalable way of developing AI at LinkedIn. So basically develop a generic component that can be leveraged by different use cases. And the learnings we have in feed can also be applied in a bunch of other use cases. But I think the embedding table itself is not that leverageable.
Starting point is 00:27:57 So basically the technique itself can be leveraged by multiple use cases. Okay. I see. I wonder, however, since you know, the embeddings that you already generate and use are sounds like you know, they're quite besides being scalable, they're also quite large. And therefore, do you think that they could potentially be usable as is? Or is it some? Is it maybe that
Starting point is 00:28:24 specific different embedding techniques are used by different models and therefore each model would have to use its own table? Yeah, I think in general, the techniques are are also very similar between different use cases such as for training in order to scale, scale our training to be more faster. We adopted 4D model parallelism. And that that has been leveraged by different use cases to speed up the training. And it is technique is designed is catered towards this embedding lookup architecture, due to some of the bottlenecks within the architecture. And the tech, the technique or component review successfully address such kinds
Starting point is 00:29:15 of problem that has been leveraged by different use cases. Okay, so I guess probably the same. Besides the actual improvement in the technique that you implemented, the upgrade in the infrastructure is probably also just at the time being at least used for the benefit of this specific model. It didn't entail any upgrades that were also shared, let's say, with other models or other functions? Yes, I think the infrastructure we built, the hardware upgrade, and some of the components we built will definitely benefit other use cases such as large language model fine-tuning use cases. Yeah. Okay. Okay. And was there any, did you use in this effort or potentially did you also
Starting point is 00:30:16 contribute to any open source projects? And because I know that LinkedIn has a number of open source projects, some of which it contributes to, some of which it contributes to, some of which it actually started and are maintained within LinkedIn. So I wonder if any of those was involved in some way. Yes, I think as mentioned in that blog post, we adopted and extended Horovod at LinkedIn, which is an open source project originally originated from Uber, if I remember correctly, that provides final grain control for communicating parameters between multiple GPUs. So the motivation for that is originally we use distributed data power,
Starting point is 00:31:00 which means models are replicated across different GPUs. And each GPU is responsible just for a portion of the data. At the end of the training step, right, a communication happens between GPUs to synchronize the gradient and models, and then update the model width. So you can see a lot of inefficiency here that the whole model stays in the memory for each of the GPU. As the model size grows further, it will result in out-of-memory and it just does not fit in one host.
Starting point is 00:31:33 And also during the communication, the gradients of the whole model on each rank, on each GPU, will send to all the other ranks, which is highly inefficient. And we observe this communication as a significant bottleneck during training. So basically, we extend the hardware to support the model parallelism for efficient training. Basically, in model parallelism, model is sharded across different GPUs. In our case, we did table-wise splitting,
Starting point is 00:32:01 meaning that, for example, if you have two embedding parameter tables in the model, each GPU hosts only one parameter table. And during training, the gradient update of embedding table only happens on the belonging GPUs. So at the end of each step, only the gradient in model split boundary is communicated, which is significantly smaller or cheaper than communicating the whole model. And by doing that, we achieved training speed up and it makes possible to train the model with this large embedding table very fast across different use cases. Okay, thank you, Ashim. Okay, well, it sounds like you've done a lot of improvement
Starting point is 00:32:51 on a lot of different fronts, let's say. So what's on your roadmap for what's coming next? What's the next thing you are working on? Assuming you're actually still working on the same model, and where do you go from here? Yeah, we have seen pretty decent experimental results from using Deep Cloud Network V2, which was originally published by Google in 2017.
Starting point is 00:33:19 Basically, currently we are using multi-layer perception or feed-forward network to do feature interactions between those sparse features and the existing number and content set feature. The deep cross network is designed to capture both low-order and high-order feature interactions efficiently. It does this by incorporating a cross network and a deep network within the architecture. For the cross network, basically, it consists of multiple layers of crossing units. So each crossing unit computes a cross feature from the product of two features
Starting point is 00:33:53 and then add it back to the original input features. So basically this cross network helps capture high-order feature interactions. And we also have the deep network in this architecture, which is just a naive feedforward network with multiple hidden layers. It is designed to capture low-order feature interactions and learn complex patterns from the input features. From our offline and initial programming result, we found it is more effective than simply using the feed neural network to do the implicit feature actions among all the features
Starting point is 00:34:25 we presented to the model. And also this architecture incurs significant latency during online serving. So the team is investing on GPU serving to unblock the capabilities of serving the computational intensive models. Right now we have the capability of serving memory-intensive models. I think moving forward, we want to beef up our serving infrastructure to be able to serve computational-intensive model as well. And also another direction would be
Starting point is 00:34:56 scaling the model size by adding more sparse features. I think one of the good things about LinkedIn is it has standardized the professional data, right? And we believe by transforming more features and crossing them efficiently with the first approach I mentioned would yield further gains and values to our members. Yeah, and also, I think we are actively exploring something called continuous training or incremental training to capture the dynamic of shifting work. So basically embeddings currently are trained on a fixed period, which could suffer the problem of inconsistent offline online results
Starting point is 00:35:37 or diminished gains over the long period of time. Essentially by retraining all your experiments frequently, the embeddings can capture the dynamic of the system and better predict personalized feed for our members. Yeah, indeed. I know this last topic that you mentioned, so real-time machine learning training, let's call it, even though I'm not sure how, if you could actually do it in real, real time. But even having more frequent updates, I know it's a very challenging thing to do, and especially at the scale that you work at LinkedIn.
Starting point is 00:36:18 So, well, thanks for the conversation. And it sounds like you have a lot of work lined up for you. So best of luck with that and success. Thanks for having me. Thanks for sticking around. For more stories like this, check the link in bio and follow link data orchestration.
