Orchestrate all the Things - How LinkedIn is moving towards a skills-based economy with the Skills Graph. Featuring LinkedIn Director of Engineering Sofus Macskássy

Episode Date: December 13, 2023

What is a skills-based economy and how is LinkedIn moving from vision to implementation? As LinkedIn Director of Engineering Sofus Macskássy shares, there's AI, taxonomy, and ontology involved in building the Skills Graph that powers the transition. We discuss the process of extracting skills from text, building a skills graph, and leveraging it for various product lines within LinkedIn. We cover aspects related to explicit and implicit skill provenance, credibility, depth and interoperability. Article published on Orchestrate all the Things: https://linkeddataorchestration.com/2023/12/13/how-linkedin-is-moving-towards-a-skills-based-economy-with-the-skills-graph

Transcript
Starting point is 00:00:00 Welcome to Orchestrate all the Things. I'm George Anadiotis and we'll be connecting the things together. Stories about technology, data, AI and media and how they flow into each other, shaping our lives. What is a skills-based economy and how is LinkedIn moving from vision to implementation? As LinkedIn Director of Engineering Sofus Macskássy shares, there's AI, taxonomy, and ontology involved in building the skills graph that powers it. I hope you will enjoy this. If you like my work on Orchestrate all the Things, you can subscribe to my podcast, available
Starting point is 00:00:33 on all major platforms, my self-published newsletter, also syndicated on Substack, Hackernoon, Medium and DZone, or follow Orchestrate all the Things on your social media of choice. So my name is Sofus Macskássy. I am a director of engineering at LinkedIn. I have been at LinkedIn for a little over four years at this point. I am currently part of the foundational AI technologies group, where we are focusing on a number of foundational technologies, as you can imagine. And my particular responsibilities lie in the knowledge graph aspects and data annotation. When we come to talk about the knowledge graph, which is the topic of this particular podcast, we are focusing on how to extract knowledge from text, particularly skills and other taxonomy and concepts from this and build
Starting point is 00:01:23 it into our knowledge graph to support all our members in their various functions and things that they're looking for within LinkedIn. Myself, I am a PhD in AI. I have done knowledge graphs and recommendations and understanding people and content for all of my career, and I'm very happy to be at LinkedIn to continue to push the envelope on what we can do in this space. Great. Thank you.
Starting point is 00:01:48 And, well, I've had the chance to read through a number of blog posts from LinkedIn that sort of progressively refine, let's say, one central idea, which is the skill-centric approach. And just to go over it very quickly, the idea is basically to move from CVs and from criteria such as where people have graduated from or whom they know to evaluating skills to judge their suitability for professional roles. And I think this is something that makes sense. And actually, it's something that has also piqued my own curiosity. And so I've covered it a number of times in the past. And so initially, to my knowledge, at least,
Starting point is 00:02:39 this was laid out in a blog post by the CEO of LinkedIn. And obviously, I think this was like a couple of years back. And obviously, in order for that to go from a high-level abstract decision to actual implementation, a number of things have to happen. And I guess you must have been, and the team that you lead, must have been very much involved in the process. And so I know that you're also about to publish one more blog post in this series. And this is going to be, I think, the most technical of them all. So going rather deep in detail. And I think the best way to approach this would be to actually go also progressively. So
Starting point is 00:03:19 let's start from the abstract. Then I would like you to give me your own experience, basically, of how you were involved in that and what were the steps involved going from the high level and abstract to actual technical implementation. Absolutely. Skills actually have been on LinkedIn for quite some time at this point. Many years ago, people were able to put in skills and endorse each other for skills. Because we, many years ago, had realized that this was an important vocabulary by which people communicated and understood each other. And so that even preceded a lot of the current conversation that we have talked about. And we realized early that skills, in fact, were very important,
Starting point is 00:04:06 as you mentioned, for people to get matched to jobs, whether they are a seeker or whether recruiters are looking for talent. What we had realized and what then became that blog post from the CEO that you mentioned is that it really goes beyond just finding a job or finding talent. It is really a skills-first economy in many ways, where skills are pervasive across everything we do, or are one of the fundamental pieces of information we use across the board.
Starting point is 00:04:36 And so when the CEO highlighted we should move into a skills-first world, there, of course, was a question, what does that even mean? And from a technical perspective, we also wanted to understand what the vision was from the CEO. And we wanted to understand exactly how they would percolate into the various product lines and how we should think about it. Because that clearly broadened how we should think about skills and also how we should represent them and how they should be used in the various product lines, as I mentioned. And so the first thing we needed to understand was exactly for the various product lines, what are they currently doing and also what are the opportunities where skills might be beneficial. And clearly, as we started digging into the various product lines,
Starting point is 00:05:22 as well as the news feed, advertising, and clearly also in the recruiting domain and talent marketplace, everything that happened in terms of product, in terms of recommendation and searching, there was clearly a benefit to leveraging signals about the content and about the people to do the ranking and recommendation. And so from there, we then tried to understand where some of the pain points and some of the gaps were. The first gap we realized, frankly, was that the skills that we had back then, and how they were used by members, didn't provide the coverage and didn't provide the depth that we really needed. Many people didn't necessarily have all the skills listed on their profile,
Starting point is 00:05:58 and they did not necessarily use the same vocabulary. And it was not easy to figure out whether two skills were related to each other or even synonyms of each other, for example. And so the first approach we took back then was to build out the skills taxonomy, which we covered in an earlier blog post that you alluded to, such that we can expand and create these relationships between skills, such that we can better understand members and content in terms of the skills ecosystem that they live in. Because if you have more skills, then it's a big opportunity to find and connect people to content more rapidly, and you just increase the liquidity.
Starting point is 00:06:37 So that was the first step we took. And then the second step we then took was taking this extra skill graph and actually making sure that our various AI teams and various product lines actually leveraged the skill graph to expand the skills associated with them as part of the recommendation. And we got significant lift from that. And then we started expanding into more types of content. We started first, of course, with the jobs and resumes and member profiles. But then there's also the feed posts. There's also a lot of marketing material in our marketing product line where we could also tie these into the skills to also improve the ranking and recommendation of content.
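As a rough illustration of the two steps described above (mapping different vocabulary onto one canonical skill, then expanding the skills attached to a member or a job through relationships in the graph), here is a minimal Python sketch. The skill names, the `SYNONYMS` and `RELATED` tables, and both function names are hypothetical and are not LinkedIn's data or API.

```python
# Hypothetical toy data, made up for illustration only.
SYNONYMS = {
    "ml": "machine learning",
    "machine learning": "machine learning",
    "js": "javascript",
    "javascript": "javascript",
    "deep learning": "deep learning",
}

# Related-skill edges between canonical skills (assumed, not a real taxonomy).
RELATED = {
    "deep learning": {"machine learning"},
    "machine learning": {"statistics"},
}


def normalize(raw_skills):
    """Map free-text skills to canonical names, dropping unknown ones."""
    canonical = set()
    for raw in raw_skills:
        key = raw.strip().lower()
        if key in SYNONYMS:
            canonical.add(SYNONYMS[key])
    return canonical


def expand(skills, max_hops=1):
    """Add skills reachable within max_hops related-skill edges."""
    expanded = set(skills)
    frontier = set(skills)
    for _ in range(max_hops):
        # Follow one more layer of edges, skipping skills already collected.
        frontier = {r for s in frontier for r in RELATED.get(s, ())} - expanded
        expanded |= frontier
    return expanded


member_skills = normalize(["ML", "JS "])
job_skills = expand(normalize(["Deep Learning"]))
print(sorted(member_skills & job_skills))
```

Without the expansion step the member and the job in this toy example share no skill at all, which is one way to read the "liquidity" point: more skills per entity means more chances to match.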
Starting point is 00:07:17 So that was kind of the progress and the evolution of how we've been looking at it. And then, of course, we have continued to improve the underlying fundamental technologies extracting skills from this content. And that is the theme of the blog post that you mentioned, as well as the theme of this discussion today. Okay, well, thank you. And the next thing I wanted to ask you was precisely if you could give us like a very high level description of the
Starting point is 00:07:46 technical implementation that you have done. And I know that it's a lot to ask because you have written an entire blog post for that, which is quite long and quite technical. But for the purposes of this conversation, let's just stick, let's just keep to the surface. And the reason I'm asking this is because just frankly, to give a little bit of context for the rest of the things that I intend to ask about. So for readers that may not have read the blog post, just to have like a quick anchor point, let's say. Absolutely. At a high level,
Starting point is 00:08:20 the way we should think about this and the way we approach this is more as a data workflow. First, we need to get content. And so we clearly have pipelines that get the content. Either we go through all the content on a periodic basis or we get messages as certain content gets updated. So we have a set of content. Certain content is a little bit more complex than others. For example, resumes and job descriptions have a fair bit of structure to them.
Starting point is 00:08:48 Member profiles and posts have slightly less explicit structure to them. So the first thing we need to do is we need to actually segment the content, particularly for jobs, job descriptions and resumes. We want to segment them into particular blobs of different types of text that are cohesive by themselves.
Starting point is 00:09:03 So for example, for a job description, we want to say, where's the skill section where people talk about the skills required or the skills that you will be applying if you were to get a particular role, such that we understand those segments. Next, we then need to take a look at that content, and we need to extract phrases or sentences within that piece of content that might mention or at least discuss a skill. And so we're looking for skill mentions either directly or indirectly within that piece
Starting point is 00:09:34 of text, such that we have now this set of fragments that contain skills mentioned in some way. Then from there, we then have a model that takes these fragments and phrases to actually more explicitly identify what the most likely skills are. And that's a ranking function. These are the potential skills that could be related to this particular phrase. And then finally, we want to then map it directly into our own taxonomy. And so now we have these examples and these potential candidates, and we try to map it into the taxonomy specifically. One of the things we do is we do leverage all the context around these phrases in case there's ambiguity of multiple candidates.
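The workflow described so far (segment the content, pull out phrases that may mention skills, score candidate skills, and use the surrounding context to settle on one taxonomy entry) can be sketched in a few lines of Python. This is only a toy under stated assumptions: the `TAXONOMY` entries, the context words, the sample `JOB` text, and the keyword-overlap scoring are all hypothetical stand-ins for the models discussed in the episode.

```python
import re

# Hypothetical toy taxonomy: skill -> surface forms and typical context words.
TAXONOMY = {
    "Apache Spark": {"forms": {"spark"}, "context": {"data", "cluster", "pipelines"}},
    "Adobe Spark": {"forms": {"spark"}, "context": {"design", "graphics", "social"}},
    "SQL": {"forms": {"sql"}, "context": {"database", "queries", "data"}},
}

JOB = """About us:
We build design tools.
Requirements:
Experience with Spark on a large cluster; strong SQL for database queries.
"""


def segment(text):
    """Split a document into {section title: body}; a line ending in ':' starts a section."""
    sections, title = {}, "preamble"
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.endswith(":"):
            title = line[:-1].lower()
        else:
            sections.setdefault(title, []).append(line)
    return {t: " ".join(lines) for t, lines in sections.items()}


def extract_phrases(body):
    """Break a section into sentence-like fragments that may mention skills."""
    return [p.strip() for p in re.split(r"[.;]", body) if p.strip()]


def resolve(phrase):
    """Map each surface form found in the phrase to its best-scoring taxonomy skill."""
    words = set(re.findall(r"[a-z]+", phrase.lower()))
    best = {}  # surface form -> (score, skill)
    for skill, entry in TAXONOMY.items():
        # Score a candidate by how many of its context words appear in the phrase.
        score = len(words & entry["context"])
        for form in words & entry["forms"]:
            # On a tie, the entry listed first in TAXONOMY is kept.
            if form not in best or score > best[form][0]:
                best[form] = (score, skill)
    return {skill for _, skill in best.values()}


def tag_skills(text, section="requirements"):
    """Run the whole flow on one section and return the mapped taxonomy skills."""
    body = segment(text).get(section, "")
    tagged = set()
    for phrase in extract_phrases(body):
        tagged |= resolve(phrase)
    return tagged


print(sorted(tag_skills(JOB)))
```

In this toy, the word "cluster" next to "Spark" is what tips the ambiguous mention toward the data-processing skill, while the word "design" in the unrelated "About us" section is ignored because segmentation keeps it out of the requirements text.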
Starting point is 00:10:17 We can use the context around it to really fine-tune and come up with the best candidate, final candidate for that particular skill mentioned in whatever content we have. And so once we have that, we can then map all these skills to a piece of content, whether it's a resume and then indirectly to a member or in a job description or in a piece of content, such that we have that metadata associated with a piece of content that we then later leverage
Starting point is 00:10:43 as part of our various product lines. Then the other AI teams can take that to do their candidate selection and ranking down in the actual products. So that's, at a high level, the components behind this. And I can certainly go into any of these if there's any of the particular components you would like to dig into more deeply. Okay. As far as my understanding is, and you can correct me if I'm wrong, the core of the whole process is obviously the skills graph. And in a way, it's an iterative process or a flywheel or whatever it is that you want
Starting point is 00:11:23 to call it. So you have your taxonomy that helps you identify skills in structured and unstructured text that you evaluate. And then in turn, you use those to enhance and improve the skills graph, which is then again used to harvest more text and so on and so forth. Would you say that's, I know it's simplistic, but would you say that in principle, that's how it works? I think that's a great articulation of how the iterative process works, because clearly, as you rightly pointed out,
Starting point is 00:11:56 it is a feedback loop where we see new skills and new ways of mentioning skills across the board, whether it's on a member profile, in a resume, in a job description, for example. And so definitely we use this to expand the skill graph dynamically as well, as we see new ways of phrasing a skill or even a new skill we had never seen before, such as, for example, prompt engineering is now a new skill that really popped up, as you can imagine, over the last couple of quarters. And so we want to dynamically add to the skill graph, which is exactly what we
Starting point is 00:12:26 do, as you mentioned. And then that goes back and we then use these new skills that can then be tagged in the content. And so it's a virtuous loop. Absolutely. Okay. So I imagine that initially, at least, the skills graph that you started with must have been hand curated and then you iteratively developed your processes and those processes must have evolved by now. And what I'm wondering is what's the cadence for the skill extraction process? So how often do you rebuild and iterate over the skills graph? Great question. You're right.
Starting point is 00:13:08 We definitely have improved the processes by which we did this. And it was very manual earlier. And of course, we are a lot more automatic today in how we do this. Now, so that's expanding the skill graph itself. Of course, mapping it to our members and to our content, that is a continuous process. And the cadence by which we do that definitely differs depending on the type of content. So, for example, we ingest many, many jobs on a daily basis that are provided to us by our customers. We also, of course, have people posting job descriptions online through our own workflow. And so as these come in, we will tag all of those as they come in.
Starting point is 00:13:48 On the member side, whenever a member edits or updates their profile, we do get sent a signal. And so as soon as we get that signal, we will go and then take a look at that member profile and then update that profile as well. So that cadence is a little bit more based on the members themselves. And so we have on the order of
Starting point is 00:14:05 hundreds of such edits per second every day. So it is a decent volume that we have to go through there and we need to be able to update that rapidly. And that goes through our whole workflow to update that and get it into our ecosystem. When we do update the skill graph
Starting point is 00:14:21 or if we have significant upgrades to our capability to extract, identify, and map skills in content, then clearly, as we do this, we do need to go back and go through all the members as well as all the jobs to update them with the new model and the new extraction capability that we have. And so that also happens on a semi-regular basis based on when we have significant improvements in our models or in the graph itself. Okay. So it sounds like the harvesting process happens on demand and you probably sort of collect and batch potential additions or improvements to the skills graph and you iterate on different contents. So you don't do it every time that any single update happens. You just collect a few of those and you collectively go over them when it makes sense, right?
Starting point is 00:15:28 That is correct. Yes. Okay. I should also mention something else that piqued my interest in the blog that I've had the chance to review that it also describes the process through which what you just talked about happens. So whenever people do updates on their profiles or their skills, you actually, in addition to the hand-curated skills that people add to their profiles in order to match their profiles to job listings, for example,
Starting point is 00:16:00 you also harvest skills from text that they may have on the profile, such as resumes or other descriptions and so on. So I wonder, to the best of my knowledge at least, it doesn't look like people are prompted to add those skills to their profiles. But obviously, since you're harvesting those, you must store them somehow and you're also using them. So I wonder, is there any mechanism through which people can access those harvested skills? Like recommendations to add them or, for example, if they export their profiles, do they somehow get access to those implicit skills as well? That is a great question, George. And first off, as you can imagine, trust and member trust is very, very important to LinkedIn. And we are very proud to be one of the most trusted platforms out
Starting point is 00:17:00 there. And we do that particularly through transparency and to make sure that members understand what we do with the content and data that they provide us. So in this case here, members have full access to all the skills that they themselves add. And so that's explicitly where they add these things. And there are surfaces and flows by which members can add skills, as you can imagine, but also we continue to improve those flows where we make suggestions based on what we've seen either in their resume or in their member profile where we suggest they might want to add certain skills and even associate certain skills with certain of their jobs as well.
Starting point is 00:17:39 So we have those flows by which members can explicitly add these skills. And they have 100% ownership of that because that is how they want to represent themselves publicly and visibly to the rest of the world. And those are clearly the ones that we also anchor ourselves to in a number of places when it comes to recommendations and skills. And so when they export their profile, they definitely have access to all of those skills and they can see them. In terms of some of the more implicit and inferred skills that we do have, since we leverage that behind the scenes and these are never explicitly visible
Starting point is 00:18:12 to anybody, recruiters or members themselves, but they are very highly leveraged to do the recommendation and ranking, those skills we do not necessarily surface, because they are more fluid in nature, as you can imagine. Since everything we do is anchored on the content that members themselves provide, if they were to update their profile or update their resume, the skills associated with it would also then obviously be updated. And as we improve our technology behind the scenes, things will change,
Starting point is 00:18:43 so it makes less sense necessarily to surface all of these, and particularly as we are moving forward into how to think about skills more dynamically and more in-depth. These are things we don't necessarily surface to them, but they definitely own the explicit skills, and they can change those, and they can change their member
Starting point is 00:19:00 profile anytime they so desire. So that's how we are thinking about it, and I think we're doing a quite good job at this point. Okay. Yeah. I mean, the way you describe it, it does make sense. My two cents would be that, well, since you go to the trouble of harvesting those anyhow, it may make sense to at least suggest to members something like, OK, we found that you probably have those skills based on the input that you provide. You may want to add them to your profile or something like this. And then they can decide whether they want to do that.
Starting point is 00:19:36 I know that, you know, it's easier said than done because it's a decision that has to go through many, many levels. But just mentioning it. And I think that's also related to something else that you just mentioned. So dealing with provenance and, well, depth when it comes to skills, you said that, well, the implicit skill harvesting process is based on text that members provide themselves. And so it's really hard to evaluate how reliable that is, really. But I know because I read that you also have something else that seems very interesting in that respect.
Starting point is 00:20:15 So something called skill assessment. And I wonder if you'd like to say a few words about how it works and where exactly the connection is with what you do. That's a great question. Let me first make just a comment on your suggestion. I think it's a great suggestion to prompt members to add skills. And we do, as I mentioned, we do have surfaces where they can definitely add skills.
Starting point is 00:20:41 And if they ever go in to say they want to add skills, we make suggestions to them as well. We have done that sometimes a little bit more aggressively and prompting as we get new capabilities. But by and large, there is the opportunity there. And we actually do suggest skills. If people want to edit their skills, we do suggest skills based on their profile. So there is a form to do that. We are just not aggressively asking them to do this because, again, that would not necessarily be a good member experience if we continue to do this.
Starting point is 00:21:08 But it's a point well taken. Now, for the skills assessments, that's an interesting question on provenance in general. Definitely something that we are paying attention to. Again, coming down to trust and making sure that members trust what we're doing with their data. We do take the skills and the content they have on their profiles very seriously because that's how they want to represent themselves. And we have to take that somewhat at face value, that they are representing themselves well. And I don't think we have necessarily observed that people are really stretching the truth in any way here because this is their professional profile and there's a lot of people looking at this.
Starting point is 00:21:48 So people would likely get called out if they're saying things that are not true. But when it comes down to the expertise and how good they are at particular skills, that is a very good point. And this is where skills assessments come in, for example, and other learning material where people can take courses as well on our platform and share that they have taken those courses and get those certificates on the platform as well. So these assessments and these courses certainly align with skills in our skills taxonomy.
Starting point is 00:22:20 And we are representing that internally in our member data. And these things can be taken into account as part of the ranking function. At this point, if you look at the member base overall, the amount of skill assessments we have and the amount of certificates that people have is still very lightweight compared to all the other content we have there. So the signal and the benefit across the board is not as high as other places where we can leverage this, but we are definitely taking this into account. Other places that we can also take a look at is, of course, how they actually communicate and who they communicate with on the platform as well. If they're communicating or interacting with other experts in particular fields,
Starting point is 00:23:19 we can definitely also take that into consideration to understand whether these people actually are experts or know what they're talking about when it comes to certain topics or particular areas that then, of course, map into skills as well. So we are exploring these things and we continue to explore them. Again, there's a question as we explore this in terms of what is the benefit to the member and benefit to the community at large. And so we are taking that also under consideration to focus primarily on things that will provide value add to the member experience. And I have one extra question as far as the skill assessment process goes. I noticed that it refers to something called badges. And I couldn't help but wonder what exactly are those badges and if there's like a specific format that is used for that. And the reason I'm asking is because, well, a few years ago I came across a specification which seems to be specifically cut out for this type of thing. It's called Open Badges, and it's used to represent skills and to do that in a way that's interoperable and can be exported and exchanged and so on. So I wonder if you're aware of that and if it's something that you may use eventually?
Starting point is 00:24:30 I think that's a great question in terms of tying ourselves to open standards and Open Badges in this particular case. And so I was aware of Open Badges. It is not something that I believe we are currently using explicitly within LinkedIn. One of the things that we are looking at particularly, again, focusing primarily on the value add on our platform specifically, is to what extent these would actually improve the experience, whether members get a better feed experience or whether they get better job recommendations or they are easier to find by recruiters. So taking a look at what is the actual member experience value add here. And so far we have been focusing more on that rather than
Starting point is 00:25:11 adopting open standards and giving back in that way. This is not something that's currently a high priority for us. It is definitely something that we are aware of. But at this point, we are focusing primarily on what can help with the member experience on our platform. And this today is not something that we would expect to be the highest priority and the most valuable addition to the platform to help the member experience. Definitely something that is on our radar, but hasn't made it up into our high priorities
Starting point is 00:25:41 quite yet. Yes, and just a quick thought on that. I think it could also help with the skill assessment process itself because if my understanding is correct, it seems like at the moment this process is actually carried out by LinkedIn staff or at least it's tunneled through LinkedIn. If at some point something like Open Badges were to be used, then people could conceivably import their skill assessment from educational institutions or, I don't know,
Starting point is 00:26:13 other types of organizations as well. So just a thought. I think it's a great thought. And as I said, this is something that is on the radar, but it's not something that has gone up to the highest priority quite yet. But this is definitely something we've been discussing as well internally on how to leverage some of this data as well. Yeah, I would personally love to see that happen. And since I guess we're almost out of time, I have one last question. And I saved the most technical one for the end. It's something that I guess people may have asked you about previously. It seems like the
Starting point is 00:26:51 skills graph is built as a taxonomy. So basically, for people who may not be familiar with the term, a taxonomy is a tree. It has a hierarchical relation, so parents and children and so on and so forth. Why use a taxonomy and not an ontology? And again, briefly for people who may not know, an ontology is a tree, but in addition, you can also have arbitrary relations. So you can draw links between nodes that are not in the same branch, basically. And I wonder, obviously, there must be some reasoning behind that. I'm just curious as to what that is. That's a great question. And I think this is actually more aligned with the
Starting point is 00:27:31 narrative that we have been publishing, where, to stay focused, we focus primarily on the taxonomy aspect of what we've been doing. We definitely have a much richer ontology backing this as well. But we wanted to have very concise and focused blog posts that focus on some of the key aspects here. We absolutely have an ontology where we are mapping skills between branches, as you said, as well. We also map skills to other concepts, whether they are titles, for example, or other types of certifications. So we have a very rich backend graph. We just focus primarily here on some of the taxonomy aspects and also how the taxonomy works here, but we definitely use that ontology and it has also provided great improvements in many aspects, as you can imagine,
Starting point is 00:28:21 when it comes to ranking and recommendation. So these are very, very useful. And clearly also with the LLMs coming into the picture now, the breadth and richness of these ontologies that we can create is also something we're exploring today. So it's not that we're only doing taxonomies. This is just what we have been discussing more publicly. But we have a very rich ontology powering everything behind the scenes. That makes lots of sense from a technical point of view, actually, because it would
Starting point is 00:28:53 really be a shame to just not use any of the other types of relationships. And you gain a lot by doing that, as you obviously know. Yes, if I may say so, it sounds like the word ontology scares a few people off. So you probably decided to refrain from using that, at least for the time being. I hope you don't mind if I actually surface that. No, I think it was a great call out and absolutely feel free to surface that. Thanks for sticking around. For more stories like this, check the link in bio and follow Linked Data Orchestration.
