Orchestrate all the Things - LinkedIn's feed evolution: more granular and powerful machine learning, humans still in the loop. Featuring LinkedIn Senior Director of Engineering Tim Jurka and Staff Software Engineer Jason Zhu
Episode Date: October 16, 2023

LinkedIn is a case study in how its newsfeed has evolved over the years. LinkedIn's feed has come a long way since the early days of assembling the machine learning infrastructure that powers it. Recently, a major update to this infrastructure was released. We caught up with the people behind it to discuss how the principle of being people-centric translates to technical terms and implementation. Article published on Orchestrate all the Things.
Transcript
Welcome to Orchestrate All the Things. I'm George Anadiotis and we'll be connecting the dots together.
Stories about technology, data, AI and media, and how they flow into each other, shaping our lives.
LinkedIn's feed has come a long way since the early days of assembling the machine learning infrastructure that powers it.
Recently, a major update to this infrastructure was released. We caught up with the people behind it to discuss how the
principle of being people-centric translates to technical terms and implementation. I hope you
will enjoy the podcast. If you like my work, you can follow Linked Data Orchestration on Twitter,
LinkedIn, and Facebook. Hey George, thanks for having me. I'm Jason, a machine learning engineer
at LinkedIn's Foundational AI Technologies organization.
So we are a horizontal team aiming to connect
multiple AI-powered applications at LinkedIn
with state-of-the-art AI techniques and algorithms.
We basically prototype and build the foundations
behind those advanced algorithms
that can benefit multiple vertical use cases at LinkedIn,
which can potentially deliver tremendous value to our members.
I grew my interest in machine learning back in my undergraduate studies,
where I was fascinated by computer vision,
specifically facial recognition, landmark detection, those kinds of things.
I was trying to connect those with multiple real-world applications.
And then I came to the US to pursue
my graduate studies at REST University. I got a great opportunity to intern on the LinkedIn Feed AI
team. I worked with a bunch of talented engineers on high-impact projects. It was really an
exceptional experience. So naturally, I decided to come back and join as a full-time engineer,
continuing to work on ranking and personalization problems.
Yeah, this is pretty much how I came to the field.
Okay, great. Thank you. Thanks for the intro.
And I also thought it would be a good idea for the benefit of people who may be listening to the conversation to give a little bit of background on myself and LinkedIn.
So it was a long time ago, like six years ago by now,
that I had the opportunity to converse on the principles and inner workings of LinkedIn's feed
and how specifically machine learning
powers the feed with the people who were responsible for it back then.
Now, obviously, it's been a long time since.
I mean, six years is like a century or something in technology.
And back then, it was only the beginning of applying machine learning to the field. And so we got to talk a lot about principles
and how everything came together.
Now, things have evolved quite a bit, I guess, since then.
And to be perfectly honest with you, I missed some of that evolution.
However, I was also wondering: how have some key principles
that were outlined back then
held up today?
For that purpose, we also have joining us today Tim Jurka, who will address these questions
before moving on to the upgrades that Jason's team implemented.
Tim is Senior Director of Engineering at LinkedIn, currently leading a global team of data scientists,
AI engineers, and computational linguists
that use data to guide LinkedIn's flagship app strategy and make the member experience more intelligent.
So, Tim, what I wanted to ask you first is, back in 2017, the human-in-the-loop strategy was a key part of LinkedIn's approach
to how machine learning was used to power the feed
and other applications. Is that still a thing today? Yeah, so I would say human-in-the-loop is definitely
still a thing. It's actually probably even more of a thing than it was, you know, several years ago.
Really, it boils down to two applications for the feed. The first is we're really trying to
prioritize insightful, authentic, knowledge-oriented content in the LinkedIn feed.
And that's just, it's a very hard problem to get
right just with AI alone. But we have an entire team of award-winning journalists led by our
editor-in-chief, Dan Roth. And we've actually created AI algorithms that train on the content
that they curate and believe is really the cream of the
crop in terms of LinkedIn insights. And so that AI algorithm then tries to say, you know, would an
editor promote this on LinkedIn's platform? And if so, let's try to distribute on the platform and
see which audiences resonate with that particular piece of content. So that's one particular area
we use it. The second is just with large language models, you know, really taking the world by storm. The way that we make those just really high quality
is by fine-tuning them using techniques like reinforcement learning with human feedback.
And so, you know, large language models, again, are presenting yet another way where human-in-the-loop
is much more deeply integrated into the AI workflow versus where it was maybe five or six years ago.
Okay, that's great. Thank you. And then in terms of spam detection, I know previously in 2017,
LinkedIn used a three-bucket system. Basically, content was classified in three different buckets
according to its likelihood
of being low quality.
And then those three buckets were further refined.
Is that still a similar approach today or has that changed at all?
So our system of spam detection works in the same way.
There's really, like you said, three buckets.
One is content that violates LinkedIn's Terms of Service, in which case we want to
make sure we remove it from the platform. Then we try to identify low-quality content,
and then everything else is clear and kind of distributes on the platform. I would say
what has maybe changed since the last time we chatted is that for that third bucket, the kind of
clear bucket, we've invested a lot more in deeper content understanding, to understand the intent
behind the post.
Is this somebody trying to share a job opportunity?
Is it somebody trying to share their opinion about the news?
And getting a lot more granular, so we know how to distribute the kind of clear content
on our platform.
Okay.
So how exactly does AI play into this process?
Yeah.
So first and foremost, when content is violating LinkedIn's
terms of service, we try to be very proactive, and our AI classifiers try to take that down
in an automated way.
So that's, you know, our first line of defense.
But, you know, as with any probabilistic system, AI is not perfect and some things do get through.
And that's where we have multiple tiers of defense,
including proactive kind of human detection, right? Either our members reporting it and us
reacting and making sure we evaluate the content or content moderators looking out on the platform
for anything that might have slipped through our automated defenses. So those are the main ways we use AI and humans together to moderate our platform.
Thank you.
Then if you could just quickly describe the rationale
for how the feed has evolved over the years
from 2017 to 2021,
when you all talked a little bit more
about multi-objective optimization
being brought to the feed, and then to where we are today with this use of embeddings and similar features.
Yeah, the evolution of the feed ranking algorithm, I mean, it really follows the evolution of the LinkedIn feed.
And since 2017, we have just a lot more members using the LinkedIn feed product.
We have a lot more content creators publishing on LinkedIn.
And our algorithms have had to keep pace with all the kinds of diverse use cases that people come to the LinkedIn feed with.
And so a lot of the complexity here is just making sure that our AI models can capture all those diverse use cases.
For example, you might have a post from a member mentioning that they raised
their Series A for their startup, and now they're hiring. And that post can be perceived through
three very different lenses. If you're a job seeker, that post might actually be an entry
point for you to reach out and get a job. And the AI model has to understand that value to you.
If you're a first degree connection of that individual, it might be just to congratulate
them and say, hey, I'm just checking to see, you know, you raised series A, congrats. If it's
somebody who's in maybe the venture capital space, this is actually an insight where they are like,
huh, I didn't know that this particular company raised series A. They may want to dig in and learn
a bit more about the term sheet. And so our AI models actually have to capture all these different
types of value depending on who the member is and what goals they're trying to accomplish with
the LinkedIn feed.
And so the size of our models has kind of grown, complexity of our models has grown,
the architecture has changed to really capture all that nuance.
And so it seems like it was around 2021 that LinkedIn sort of overhauled the feed ranking platform.
And the idea there was to introduce something called multi-objective optimization.
And, well, there were a number of TensorFlow models trained.
And, Jason, I think it would be a good idea if you could sort of gradually introduce us
to what it was that you did back then.
And from there, we can actually use it as a stepping stone to come to what it is that you did today.
Yes.
Before we dive into the details of the architectures and TensorFlow multitask learning model, I'd
like to first give our audience some context on the problem space of the blog post we published,
which is the second-pass ranking problem.
As you may know, our homepage feed contains a heterogeneous list of updates
from members' connections
and a variety of recommendations,
such as jobs, people, articles.
I think the objective here
is to provide a personalized ranked list
that helps the professionals on our platform
be more productive and successful.
So basically, we are adopting a two-stage ranking,
where each type of update has its own first-pass ranker.
So basically, the top-k candidates for each update type
are selected by individual algorithms
before being sent to the second-pass ranker
for the final ranking.
So essentially, we are working on the second-pass ranking
problem, which is that final ranking.
So basically, the models are presented with a set of features from both the member side and the content side.
The model is trying to predict a set of response likelihoods, such as the likelihood of starting a conversation, or the likelihood of performing a viral action on a particular post, etc.
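To make the two-stage setup concrete, here is a toy sketch in Python. The sources, scorers, and candidate dictionaries are hypothetical stand-ins, not LinkedIn's actual components: each first-pass ranker nominates its top-k candidates, and the second-pass ranker produces the final ranking over the merged pool.

```python
# Toy sketch of the two-stage ranking flow described above; all names,
# scorers, and candidates are hypothetical.
from typing import Callable, Dict, List

Candidate = dict
Scorer = Callable[[Candidate], float]

def first_pass(candidates: List[Candidate], score: Scorer, k: int) -> List[Candidate]:
    # Each first-pass ranker selects its own top-k candidates.
    return sorted(candidates, key=score, reverse=True)[:k]

def rank_feed(sources: Dict[str, List[Candidate]],
              fpr_scorers: Dict[str, Scorer],
              second_pass_score: Scorer,
              k: int = 50) -> List[Candidate]:
    pool: List[Candidate] = []
    for name, items in sources.items():  # e.g. connection updates, jobs, articles
        pool += first_pass(items, fpr_scorers[name], k)
    # The second-pass ranker produces the final ranked list over the merged pool.
    return sorted(pool, key=second_pass_score, reverse=True)
```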
Yeah, basically, from the modeling perspective of how we achieve this goal: things evolve so fast, and we
have to keep up with the rapidly evolving industry to continuously
modernize our stack to deliver the best value and experience to
our members. I think the earlier blog post mentioned that we
transitioned our ranking model from a linear model to a deep neural network model, because we firmly believe that the deep model is more effective: it introduces nonlinearities through activation functions, basically enabling the models to capture more complex and nonlinear relationships in our data and learn more powerful representations.
I think basically, with this unified setup
in the earlier post,
instead of having an individual linear model
for predicting each response separately,
we have a multi-task deep learning model
that shares parameters between the different tasks.
So the learning of each task can benefit the other tasks through transfer
learning, right?
With large data, these inter-task relationships can be captured more effectively than with
separate linear models predicting each response separately.
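As a rough illustration of the shared-parameter, multi-task setup described here, a minimal TensorFlow/Keras sketch follows. The feature sizes and task names are invented; this is not LinkedIn's production model.

```python
# Minimal multi-task ranker sketch: shared layers feed one sigmoid head per
# response likelihood, so the tasks share parameters and can transfer learning.
import tensorflow as tf

member_features = tf.keras.Input(shape=(128,), name="member_features")
content_features = tf.keras.Input(shape=(256,), name="content_features")

# Shared trunk: its parameters are updated by all tasks.
x = tf.keras.layers.Concatenate()([member_features, content_features])
x = tf.keras.layers.Dense(512, activation="relu")(x)
x = tf.keras.layers.Dense(256, activation="relu")(x)

# One head per response likelihood (illustrative task names).
heads = {
    name: tf.keras.layers.Dense(1, activation="sigmoid", name=name)(x)
    for name in ["p_click", "p_comment", "p_reshare"]
}

model = tf.keras.Model(inputs=[member_features, content_features], outputs=heads)
model.compile(optimizer="adam",
              loss={name: "binary_crossentropy" for name in heads})
model.summary()
```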
I think one of the challenges we are facing is, for example,
that our data set is highly skewed between the different response prediction tasks.
So we have, for example, many more click responses than deeper engagement responses.
We need to do careful sampling and reweighting to balance that, so as not to cause negative task interference during the training process,
because all those tasks are sharing parameters.
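One simple way to express the kind of reweighting mentioned here is inverse-frequency loss weights for the rarer tasks. The counts below are made up, and the exact sampling and reweighting scheme LinkedIn uses is not detailed in the conversation.

```python
# Hypothetical per-task positive counts in a training window.
positives = {"p_click": 50_000_000, "p_comment": 2_000_000, "p_reshare": 500_000}

# Normalized inverse-frequency weights so rare responses are not drowned out
# by the dominant click task when the per-task losses are summed.
inv = {task: 1.0 / count for task, count in positives.items()}
total = sum(inv.values())
loss_weights = {task: w / total for task, w in inv.items()}
print(loss_weights)

# These could then be passed to a multi-task Keras model like the sketch above:
# model.compile(optimizer="adam",
#               loss={t: "binary_crossentropy" for t in positives},
#               loss_weights=loss_weights)
```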
Okay, so you introduced what I would call the cause behind this evolution. So basically,
you wanted to go from linear to non-linear models in order to have, I guess, to be able to have a better feed in the end.
So to be able to accommodate more inputs and more parameters, and to have more fine-grained models
that are also able to feed on each other's output, I guess. So that also suggests that what happened as an effect of this evolution
was probably that the number of models as well as the number of parameters that these models
work with multiplied. And I wonder if you have any data that you may be able to share as to what exactly happened? So was it like, I don't know,
factor of two, three, 10, or if you have any idea of the quantitative effect that this evolution had?
Yes, I think the model size has grown by 500x and we have also grown our data sets
by around 10x, something like that.
I think you mentioned the model size scaled a lot, right?
We also changed our data significantly
from our initial version,
as our models become more powerful and data-hungry.
I think one of the natural things we can think of
is to increase the data sets to harness more from our data, right? Basically, we expanded the
training period to a longer duration and sampled more training data from the served model.
This process also presented some challenges because, you know, the served model may favor certain content, such as popular members,
creators, all those popular ones.
These signals are incorporated in our training data because we are collecting the training
data from those served models.
This kind of bias can be further reinforced when training the model on the data generated
by those models.
So we are tackling those kinds of challenges, and we mentioned that
in our latest blog post. We can dive a little bit deeper into that, if you're interested.
Yep. Yeah, absolutely. And that was going to be the next set of questions I have for you. But
actually, before we go there, I know that a lot of the work that you did, I would sort of hint,
served a dual goal in a way.
So on the one hand,
it sounds like you started out with the objective of,
again, once more optimizing the behavior of the model
and actually also the performance, I guess.
But that also drove you in another direction
because you also ended up upgrading the infrastructure
on which those models run.
But before we go into both of those directions, actually,
I would like to take a little bit of time
to just summarize in a very abstract way
what it is that you did.
And because I know that a central part of what you did
actually focuses on embeddings,
again, for the benefit of people
who may be listening to the conversation,
if you can just very quickly explain
what embeddings are in the first place.
Yeah, yeah, sure.
Basically, so in our latest work, I think our focus is on transforming high-dimensional sparse ID features into embedding space.
So we can interpret sparse features as categorical one-hot representations, where the cardinality is several million.
In these representations, all the entries are zero except one, which corresponds
to the ID index. An example would be a hashtag ID or a member ID. They will be transformed to
low-dimensional continuous dense space, which we call embedding space, using the embedding lookup
tables with hundreds of millions of parameters trained on multi-billion records.
Basically, an embedding is just a dense vector with continuous values, which provides an effective way
of capturing essential relationships and patterns within the data while reducing the computational
complexity. And in such a case, each member's preferences are encoded in a very low-dimensional dense float vector.
And also, semantically similar content or IDs will have a smaller distance in the embedding space.
It has been used heavily in the language model domain, where you convert words or tokens into embedding space and similar words will have a smaller embedding distance.
By presenting the model with such dense vectors, the model will understand the semantic
relationships between the different words or different IDs. This is pretty much a high-level introduction
to embeddings and how we transform those high-dimensional sparse ID features into embedding space. Okay. Just a quick step back again, before we go any further.
So to you, it's obvious, but perhaps it's not obvious to
everyone. So why are embeddings relevant in that context? So in
other words, why do you end up with very, very sparse vectors?
Yes, it is for personalization, right?
The thing is, we are trying to convert the ID features
into dense representations.
You can think of it this way: an ID feature
can have millions of dimensions.
For example, we have millions of dimensions for member ID,
because we have millions of members on the platform.
And for each of those members, we want to learn a personalized dense representation.
So this dense representation will encode the user's preferences, maybe like interactions with other
members in a past fixed period, or the user's hashtag preferences in a past fixed period.
Basically, all that information will be
encoded in this dense vector during training on the large data set, and by presenting the model with
such a personalized vector, the model will be better able to capture the dynamics of the shifting world and
also the member's preferences. You can think of it in such a way.
Just to be able to visualize that, let's take a random LinkedIn member, like, I don't know, myself,
for example. Would this vector that you create for that member have representation slots, let's say, for every possible hashtag, for every possible
other LinkedIn member, and, I don't know, for every possible group and so on?
This would definitely explain why you end up with such a sparse vector.
Yeah, yeah.
I'll just give you a concrete example, right?
So for example, you may have interacted with AI and ML hashtags, right?
In the past, I also interacted with the generative AI hashtag.
These are different hashtags, but they have similar semantic meanings in the embedding space.
They have a very small distance in the embedding space.
As a result, the model can understand you may have a similar taste to mine,
and the model can recommend similar content
in the AI domain for both of us.
This is a concrete example of how this
becomes effective in terms of personalization.
And in our model, as mentioned in the blog post,
for each piece of content,
for each viewer, we are generating features such as the hashtags this viewer has interacted with in a past fixed period,
and also the members this viewer has a preference for. Viewers who often interact with the same type of content or a similar
group of other members tend to have similar embeddings, resulting in a smaller distance
in the embedding space. So this capability basically enables the system to identify and
recommend content that is contextually relevant or aligns with our members' preferences.
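For readers who want to see what the sparse-ID-to-embedding transformation looks like in code, here is a minimal TensorFlow/Keras sketch. The vocabulary sizes, IDs, and embedding dimension are invented for illustration.

```python
import tensorflow as tf

NUM_MEMBERS = 1_000_000   # cardinality of the member ID space (hypothetical)
NUM_HASHTAGS = 100_000    # cardinality of the hashtag ID space (hypothetical)
EMB_DIM = 64              # the low-dimensional dense space

member_table = tf.keras.layers.Embedding(NUM_MEMBERS, EMB_DIM)
hashtag_table = tf.keras.layers.Embedding(NUM_HASHTAGS, EMB_DIM)

# A sparse ID is conceptually a one-hot vector of size NUM_*; the lookup
# replaces it with a learned dense vector.
member_vec = member_table(tf.constant([42]))          # shape (1, 64)
hashtag_vecs = hashtag_table(tf.constant([[7, 19]]))  # hashtags the viewer engaged with
viewer_pref = tf.reduce_mean(hashtag_vecs, axis=1)    # pooled preference vector

# After training, semantically similar IDs end up close together, e.g. the
# cosine similarity between two hashtag embeddings.
a, b = hashtag_table(tf.constant(7)), hashtag_table(tf.constant(19))
cosine = tf.reduce_sum(a * b) / (tf.norm(a) * tf.norm(b))
print(member_vec.shape, viewer_pref.shape, float(cosine))
```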
Okay, well thanks. I think that makes it a bit more approachable to people. So given everything
we've just said about embeddings and what they represent specifically in the context of LinkedIn,
and the number of parameters that may go into those, specifically affecting
the experience that people have on the feed:
What was the objective when you started this endeavor to optimize the behavior of your
models and how did you go about it?
Yes. So basically we're trying to optimize for engagement,
basically predicting better likelihoods
in terms of how a user will interact with different posts.
And then by enhancing the prediction of those likelihoods,
we can better rank the feed,
providing the user a more engaging experience.
Okay.
So how do embeddings and sparse IDs
come into play in this high-level objective?
So what did you have to do
in order to produce this better engagement?
And how does the upgraded hardware also come
into play in that?
Yes, I think, for the first question, these embedding tables
are trained together with our objective, the
response prediction model.
So basically they are trained end to end.
So basically the embedding, the dense vector, captures
the user's preferences with respect to the likelihoods
of these different responses.
So it contributes to the final prediction
of the likelihoods and the overall ranking, right?
In terms of the hardware upgrade, I think we definitely upgraded
the hardware in our serving and training clusters. So basically, for serving, we need to
have hardware with larger GPU and computation power in order to serve multiple large models in parallel.
And for training, we also upgraded our GPU hardware and also moved to a Kubernetes
cluster for more resilient task scheduling.
Okay.
That's interesting.
And so it makes me wonder whether this is something that was foreseen at the beginning of this project
or something that came up as a side effect, let's say.
And then I guess if the latter was the case,
then probably you would have to go through some process of approval
for getting these upgraded capabilities.
Yes, usually we have to ramp a prototype model online
to prove the online metric gains before deciding that we want to invest in
hardware. And also, something interesting is that we are building this kind of large model in a
progressive way. Initially, we were facing challenges serving dozens of large models in parallel
on our serving hosts.
After enlarging the model size,
like by 500x, right,
the hosts were just not designed
to be able to handle such scale.
So we chose to serve the model
with a two-stage strategy in the first place.
In the first stage,
we have a global model that is
trained as normal. So basically, ID features are converted to dense vector from large embedding
tables first, and then these dense vectors are sent to deep neural network along with existing
features for implicit feature interactions. In the second stage, we split the global model
at the boundary of embedding table and deep neural network.
Since, from the deep neural network's perspective, it only needs a set of dense vectors,
which are the embedding representations of those sparse features, for predicting the results,
the ID conversion step doesn't need to be present on the critical path of model serving.
So we can essentially convert those embeddings offline and push them to high-performance key-value stores, so that we don't need to host those embedding tables in our serving
hosts, because the serving hosts had limited memory at that moment, while we were upgrading our hosts.
We used this strategy to deliver tremendous value to our members because the model can still work:
the features generated offline do have some lag compared to
serving them as a whole in memory,
but it still generates value
and delivers metric gains.
So we shipped that strategy
while we were working on upgrading the hosts.
This is something very interesting.
And later on, I think once we upgraded our hardware,
we moved everything into memory
and did a bunch of memory optimizations
that we can talk about later on.
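A minimal sketch of this two-stage serving strategy, assuming TensorFlow/Keras, with a plain Python dict standing in for the high-performance key-value store (which isn't named in the conversation): the trained model is split at the embedding-table/DNN boundary, embeddings are precomputed offline, and only the DNN runs online.

```python
import tensorflow as tf

EMB_DIM = 64

# Define layers once so the online DNN reuses the trained weights.
member_table = tf.keras.layers.Embedding(1_000_000, EMB_DIM, name="member_table")
hidden = tf.keras.layers.Dense(128, activation="relu")
head = tf.keras.layers.Dense(1, activation="sigmoid")

# Global model (training): ID -> embedding -> DNN, trained end to end.
ids_in = tf.keras.Input(shape=(1,), dtype=tf.int32, name="member_id")
emb = tf.keras.layers.Flatten()(member_table(ids_in))
full_model = tf.keras.Model(ids_in, head(hidden(emb)))

# Offline stage: run only the embedding part and push the vectors to a KV store.
embedding_model = tf.keras.Model(ids_in, emb)
kv_store = {m: embedding_model(tf.constant([[m]])) for m in [1, 2, 3]}  # in reality, all IDs

# Online stage: the DNN only needs dense vectors, so serving hosts never have
# to hold the large embedding table in memory.
dense_in = tf.keras.Input(shape=(EMB_DIM,))
dnn_model = tf.keras.Model(dense_in, head(hidden(dense_in)))  # shares weights with full_model
print(dnn_model(kv_store[1]).shape)  # (1, 1) predicted likelihood
```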
Okay.
So obviously the feed is a very, very central feature of LinkedIn
and therefore the work that you did
is also by extension very central.
I'm wondering, however, if those embedding tables are potentially also used by other
models in the context of LinkedIn.
And if yes, then I guess what you did, besides being central in itself, would probably also
serve other functions.
Good question.
Yeah, I think the embedding table itself
is highly coupled with the data set
or the problem we are trying to solve.
But I want to mention that the technique itself
is definitely generalizable.
As I mentioned, we are a horizontal team.
Though the blog post is specifically for feed, the technology or the
infrastructure foundation will be built behind this. It can definitely be leveraged by multiple
different use cases. Actually, we've already leveraged the same set of components or techniques
by different other use cases within LinkedIn, such as job recommendation
and ads recommendation,
those kind of vertical use cases.
We are adopting a very similar strategy
of converting those first ID
to the dense representation.
We believe that's a scalable way
of developing AI at LinkedIn.
So basically develop a generic component
that can be leveraged by different use cases.
And the learnings we have in feed can also be applied in a bunch of other use cases.
But I think the embedding table itself is not that leverageable.
So basically the technique itself can be leveraged by multiple use cases.
Okay.
I see.
I wonder, however, since, you know, the
embeddings that you already generate and use, it sounds like,
besides being scalable, they're also
quite large. And therefore, do you think that they could
potentially be usable as is? Or is it maybe that
different embedding techniques are used by different models, and therefore each model would have to use its own table?
Yeah, I think in general, the techniques are also very similar between different use cases. For example, for training, in order to
scale our training to be faster, we adopted 4D model
parallelism, and that has been leveraged by different use
cases to speed up training. This technique is designed for,
and catered towards, this embedding lookup architecture, due to some
of the bottlenecks within the architecture. And the
techniques and components we built successfully address those kinds
of problems and have been leveraged by different use cases.
Okay, so I guess probably the same. Besides the actual improvement in the technique that you implemented, the upgrade in the infrastructure is probably also just at the time being at least used for the benefit of this specific model.
It didn't entail any upgrades that were also shared,
let's say, with other models or other functions? Yes, I think the infrastructure we built,
the hardware upgrade, and some of the components we built will definitely benefit other use cases such as large language model fine-tuning use cases.
Yeah.
Okay. Okay.
And in this effort, did you use, or potentially did you also
contribute to, any open source projects?
Because I know that LinkedIn has a number of open source projects, some of
which it contributes to,
and some of which it actually started and are maintained within LinkedIn. So I wonder if
any of those was involved in some way. Yes, I think as mentioned in that blog post,
we adopted and extended Horovod at LinkedIn, which is an open source project that originated from Uber, if I
remember correctly, and that provides fine-grained control for communicating parameters between
multiple GPUs. So the motivation for that is, originally we used distributed data parallelism,
which means models are replicated across different GPUs, and each GPU is responsible just for a portion of the data.
At the end of the training step, right,
communication happens between GPUs
to synchronize the gradients and models,
and then update the model weights.
So you can see a lot of inefficiency here,
in that the whole model stays in memory for each of the GPUs. As the model size grows
further, it will result in out-of-memory errors, and it just does not fit on one host.
And also, during the communication, the gradients of the whole model on each rank, on each GPU,
are sent to all the other ranks, which is highly inefficient. And we observed this communication
as a significant bottleneck during training.
So basically, we extended Horovod
to support model parallelism for efficient training.
Basically, in model parallelism,
the model is sharded across different GPUs.
In our case, we did table-wise splitting,
meaning that, for example,
if you have two embedding parameter
tables in the model, each GPU hosts only one parameter table. And during training, the gradient
update of an embedding table only happens on the GPU it belongs to. So at the end of each step,
only the gradients at the model split boundary are communicated, which is significantly smaller and cheaper than communicating
the whole model. And by doing that, we achieved a training speedup, and it makes it possible to train
models with these large embedding tables very fast across different use cases. Okay, thank you.
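A back-of-the-envelope comparison, with invented sizes, of why this helps: all-reducing gradients for a full embedding table every step dwarfs the cost of exchanging only the dense vectors at the model-split boundary.

```python
# Hypothetical fp32 sizes; real LinkedIn table and batch sizes are not stated here.
BYTES = 4
table_rows = 100_000_000   # rows in one large embedding table (made up)
emb_dim = 64
batch_size = 4_096
num_id_features = 8        # sparse ID features looked up per example

# Pure data parallelism: each step all-reduces gradients for the whole table.
full_table_grad = table_rows * emb_dim * BYTES

# Table-wise model parallelism: only the looked-up dense vectors (and their
# gradients) cross the split boundary each step.
boundary_tensors = batch_size * num_id_features * emb_dim * BYTES

print(f"full-table gradients per step: {full_table_grad / 1e9:.1f} GB")
print(f"boundary tensors per step:     {boundary_tensors / 1e6:.1f} MB")
print(f"roughly {full_table_grad / boundary_tensors:,.0f}x less traffic")
```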
Okay, well, it sounds like you've done a lot of improvement
on a lot of different fronts, let's say.
So what's on your roadmap for what's coming next?
What's the next thing you are working on?
Assuming you're actually still working on the same model,
and where do you go from here?
Yeah, we have seen pretty decent experimental results
from using Deep Cross Network V2,
which was originally published by Google in 2017.
Basically, currently we are using a multi-layer perceptron,
or feed-forward network, to do feature interactions
between those sparse features and the existing member- and content-side features.
The deep cross network is designed to capture both low-order and high-order feature interactions efficiently.
It does this by incorporating a cross network and a deep network within the architecture.
The cross network basically consists of multiple layers of crossing units.
So each crossing unit computes a cross feature
from the product of two features
and then adds it back to the original input features.
So basically this cross network
helps capture high-order feature interactions.
And we also have the deep network in this architecture,
which is just a naive feedforward
network with multiple hidden layers. It is designed to capture low-order feature interactions and learn
complex patterns from the input features. From our offline and initial ramping results,
we found it is more effective than simply using the feed-forward neural network to do the implicit feature interactions among all the features
we presented to the model.
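A minimal sketch of a DCN-V2-style cross layer as described, computing x_{l+1} = x_0 * (W·x_l + b) + x_l, written in TensorFlow/Keras for illustration; this is not LinkedIn's implementation (TensorFlow Recommenders also ships a ready-made cross layer).

```python
import tensorflow as tf

class CrossLayer(tf.keras.layers.Layer):
    # One crossing unit: multiplies the original features with a learned
    # projection of the previous layer's output, then adds the residual.
    def build(self, input_shape):
        dim = input_shape[0][-1]
        self.w = self.add_weight(name="w", shape=(dim, dim), initializer="glorot_uniform")
        self.b = self.add_weight(name="b", shape=(dim,), initializer="zeros")

    def call(self, inputs):
        x0, xl = inputs  # x0: original features, xl: previous cross-layer output
        return x0 * (tf.matmul(xl, self.w) + self.b) + xl

# Cross tower stacked next to a plain feed-forward ("deep") tower.
x0 = tf.keras.Input(shape=(256,))
xl = x0
for _ in range(3):
    xl = CrossLayer()([x0, xl])
deep = tf.keras.layers.Dense(256, activation="relu")(x0)
out = tf.keras.layers.Dense(1, activation="sigmoid")(
    tf.keras.layers.Concatenate()([xl, deep]))
model = tf.keras.Model(x0, out)
model.summary()
```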
This architecture also incurs significant latency during online serving. So the team is investing in GPU serving to unblock the capability of serving
computationally intensive models. Right now we have the capability of serving memory-intensive models. I think moving forward, we want to
beef up our
serving infrastructure
to be able to serve computationally intensive
models as well. And also,
another direction would be
scaling the model size by adding
more sparse features. I think
one of the good things about LinkedIn is
it has standardized the professional
data, right?
And we believe that transforming more features and crossing them efficiently with the approach I mentioned would yield further gains and value to our members.
Yeah, and also, I think we are actively exploring something called continuous training or incremental training to capture the dynamics of the shifting world.
So basically, embeddings currently are trained on a fixed period, which can suffer from the problem of inconsistent offline/online results
or diminished gains over a long period of time.
Essentially, by retraining them frequently, the embeddings can
capture the dynamics of the system and better predict a personalized feed for our members.
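A rough sketch of what incremental training can look like in practice, with a hypothetical checkpoint path and cadence (LinkedIn's actual setup isn't described here): warm-start from the previously served weights and fine-tune only on the freshest window of data.

```python
import tensorflow as tf

CKPT = "ranker.weights.h5"  # hypothetical checkpoint location

def incremental_update(build_model, new_window_dataset):
    # build_model() is assumed to return a compiled tf.keras model.
    model = build_model()
    if tf.io.gfile.exists(CKPT):
        model.load_weights(CKPT)             # warm start from the previously served model
    model.fit(new_window_dataset, epochs=1)  # fine-tune on the freshest data only
    model.save_weights(CKPT)
    return model
```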
Yeah, indeed. I know this last topic that you mentioned, real-time machine learning training, let's call it, even though I'm not sure how,
or if, you could actually do it in real, real time.
But even having more frequent updates,
I know it's a very challenging thing to do,
and especially at the scale that you work at LinkedIn.
So, well, thanks for the conversation.
And it sounds like you have a lot of work lined up for you.
So best of luck with that and success.
Thanks for having me.
Thanks for sticking around.
For more stories like this, check the link in bio and follow Linked Data Orchestration.