No Priors: Artificial Intelligence | Technology | Startups - Why the Future of Machine Learning is Open Source with Huggingface’s Clem Delangue
Episode Date: February 23, 2023

After starting as a talking emoji companion, Hugging Face is now an organizing force for the open source AI research ecosystem. Its models are used by companies such as Apple, Salesforce, and Microsoft, and it's working to become the GitHub for ML. This week on the podcast, Sarah Guo and Elad Gil talk to Clem Delangue, co-founder and CEO of Hugging Face. Clem shares how they shifted away from their original product, why every employee at Hugging Face is responsible for community-building, the modalities he's most interested in, and what role open source has in the AI race.

Show Links: Hugging Face website | The $2 Billion Emoji: Hugging Face Wants To Be Launchpad For A Machine Learning Revolution - Forbes

Sign up for new podcasts every week. Email feedback to show@no-priors.com

Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @ClementDelangue

Show Notes:
[01:53] - How Clem first became interested in ML, being shouted at by eBay sellers, and the foretelling of the end of barcode scanning
[03:34] - Early iterations of Hugging Face, trying to make a less boring AI Tamagotchi, and switching directions towards open source tools
[05:36] - Advice for founders considering a change in direction, 30%+ experimentation
[07:39] - First users, ML Twitter, approach to community
[10:47] - Enterprise ML maturity, days to production
[12:54] - Open source vs. proprietary models
[15:56] - Main model tasks, architectures, and sizes
[19:12] - Decentralized infrastructure, data opt-out
[24:16] - Hugging Face's business model, GitHub
[28:09] - What Clem is excited about in AI
Transcript
In traditional science, it really used to be the norm that you would have some research
and some research paper, and it wouldn't make its way into production before 10 years, 20 years.
And what we're seeing in machine learning is that it's actually making its way into production
after a year, a few months, a few weeks, sometimes a few days now.
And I really hope in the future that we'll keep this very fast virtuous cycle, this iteration loop
between science to production to science, because to me it's the main driver for the speed of
progress in machine learning.
This is the No Priors podcast. I'm Sarah Guo.
I'm Elad Gil.
We invest in, advise, and help start technology companies.
In this podcast, we're talking with the leading founders and researchers in AI about the biggest
questions.
Originally created to be an AI chatbot companion, six years later,
Hugging Face is now the collaboration backbone of the open source AI research ecosystem.
The company is currently valued at $2 billion with over 10,000 companies using their platform,
including Bing and Apple, and has expanded from popular NLP models to other modalities,
including media, biology, and more.
Our guest, Clem Delangue, co-founder and CEO of Hugging Face, is central to the open source movement in machine learning.
We'll talk about how he built this company, why open source is the future of AI, and what he sees on the horizon.
Clem, welcome to the podcast.
Thanks for having me.
Thanks so much for joining us.
So we were hoping to start with your background, which I think is really interesting.
You grew up in France, where you ran an electronics shop on eBay, which was so prolific that you ended up earning an internship opportunity with eBay.
How did you go from that to image recognition and eventually to Hugging Face?
Yeah, it's actually quite a funny story, because I was one of the biggest French sellers on eBay.
I was kind of like the user-facing team member.
And so they were sending me to all these trade shows in France, which were like the worst experiences ever
because at the time PayPal belonged to eBay.
And so we had a shared booth.
And so all the PayPal users would come to the booth and basically shout at me because PayPal was keeping their money or blocking their
accounts or things like that. It was basically kind of like the worst days ever. But during one of these
days, I bumped into a guy with big round glasses, looking very nerdy. And I remember pretty
vividly, he told me: eBay, you acquired not so long ago a barcode scanning company called
RedLaser to recognize objects. But you need to know that pretty soon, with machine learning,
and he wasn't calling it machine learning at the time,
but with these new algorithms, you won't even need the barcodes anymore.
You'll just recognize the object itself.
And at the time, I was like, who's this crazy guy?
But at night, I did my research and realized that it was a pretty legit guy
coming out of a legit engineering school in France with a small startup, which raised a little
bit of money.
And one thing after the other, I ended up leaving eBay to join the startup doing machine
learning for computer vision. I made the move to machine learning. It was almost 15 years ago now.
I don't regret it at all. It's funny how like a single, small encounter like that can completely
change your trajectory. That's really cool. Can you tell us a bit more about the early iterations
of Hugging Face, how you decided to start it, the early days as a talking emoji, and where it went
from there? Yeah, absolutely. With my co-founders, Julien and Thomas, we kind of like always shared
this passion and excitement for AI and for machine learning. And when we started Hugging Face,
we were like, okay, what can we work on that is both scientifically challenging, but also fun?
Okay, we're going to build some sort of an AI Tamagotchi.
We were heavy users of Alexa and Siri, and we were like, why is it so boring?
Why are they only talking about productivity stuff?
Why is it just telling you the weather?
And so we started to build that, kind of like some sort of an AI friend, an AI Tamagotchi.
Basically, what you see in a lot of sci-fi movies, and
a lot of what people are using ChatGPT for today, actually.
And we did that for almost three years,
got some level of traction,
billions of messages exchanged between users and the chatbots.
So that's how Hugging Face started.
And at what point did you decide to shift it towards an open source community
and model repository?
How did that come about?
It was three years in.
After our seed round.
We'd always been kind of like big open source people,
so we'd always kind of like open sourced part of what we were doing.
When Transformer models started to work, when we started to see BERT getting some traction,
we just saw the number of people using our open source just blow up and start to skyrocket, right?
We went from a couple of people looking at it to hundreds of companies using it.
We raised our series A based on this early traction, and that was really kind of like the signal that we needed to put most of the efforts of the company on this new direction.
That's cool.
So this was open source that you'd already developed and put out into the wild.
And then you saw people starting to use this.
And then you said, wow, there's so much attention here.
We should go and do that instead.
Exactly.
That's really cool.
Yeah, it's always interesting to see these shifts in direction.
I feel like that's every Stewart Butterfield company, right?
That was Slack and that was Flickr before that.
They just built something.
And then it kind of took off separately.
Do you have any sort of advice to founders who are considering changing direction or thinking of new directions for their company?
Or how do you keep your eye out for the things that are really interesting or working that may or may not be the core thing of what you're doing?
Well, I think the best way to do it is to find the good ratio in your company between
like exploitation and exploration.
And I think that's something a lot of startups don't always get right, not only before
product market fit, but also after product market fit. I feel like sometimes companies
before product market fit are experimenting kind of like too much, changing directions every
week.
And I don't think you learn a lot from that.
And then after product market fit, they kind of like stop experimenting,
stop trying new things, stop trying to stay away from the local optimum in a way and looking more
for like the global optimum. So for us, what we've always done, and I think we'll always do at
Hugging Face, is to make sure to spend at least like 30 or 40% of the company's efforts on
exploring new things and kind of like finding the long-term bets that are going to make you
successful. And then give these experiments and initiatives a chance, right? For us, we were
lucky that Thomas, one of our co-founders, was leading these kinds of experiments.
But we have examples of other initiatives that started as experiments from team members
who made it and graduated to a very big bet for the company.
One example of that is Spaces, which are our machine learning demos that have been insanely
successful; we just crossed 50,000 machine learning demos in the past year and a half.
And it started just from one team member kind of like
experimenting with that and being like, oh, I think I can build something cool there.
And one step after the other, it led to where it is today.
It seems like in general, companies that iterate or launch new things early,
keep launching things later in the life of the company and companies that never innovate early,
don't ever innovate again in their lives.
It's kind of like the difference between eBay and Stripe or you can name different
companies.
And so it's awesome that you folks were investing really early in that innovation.
When you first started getting traction with what Hugging Face does now,
did that happen organically?
And it just started growing and taking off,
or did you reach out to specific communities?
Or how did you first get those first users
to use your open source platform?
The distribution really started on Twitter at the beginning.
We just started to tweet about some of the things
that we were doing on open source,
and the machine learning community
on Twitter was already pretty strong.
And then it snowballed, I think,
classic kind of like network effect kind of things,
where researchers started to share their models.
And obviously, like they were getting visibility
for their models, so people in the industry, in companies using these machine learning models,
were hearing about Hugging Face through that, and then were asking more researchers to
add their models to Hugging Face. So kind of like more typical marketplace network effects.
And then something that we did that I think worked really well for us is that we never hired
any community manager, any kind of like communication or PR team members, because we wanted it to be part of
every single team member's job, even kind of like the most technical, specialized scientists. We've
always told them, okay, it's part of your job to interact with the community, to share with
the community, to get visibility for what you're working on. And so we ended up with this
organization where talking to the community and getting visibility is part of everyone's job
instead of being outsourced to a team. And I think that's why
people from the community appreciated us so much, because they could really talk
to the builders directly, the people doing the things. And it created kind of like more
meaningful interactions, I would say. Yeah, it's clearly really authentic to Hugging Face's culture,
the sort of commitment to community and open source. And I remember hearing that many people
in the company run the public Twitter. Yeah, everyone on the team has access to the Twitter
account and tweets from the Twitter account. I think most organizations are not capable of
that sort of risk taking. What else do you think you guys have done right on the sort of community
growth aspect? Because I think now everyone knows that's such a powerful driver for business for
an increasing number of technology companies, but it's pretty hard to actually execute against.
That's a good question. I think timing, obviously we've been really lucky with timing.
Trying to listen, it sounds a bit cliche, right, but actually listening to the community
and implementing what the community is asking. Yeah, and then just like build your culture
around it to have people who are, like, excited about contributing to the community, even
independently of everything else. I think sometimes you have companies where they're doing
community or open source work, but it's almost like a means to other things. And it
sometimes feels like they have to do it to get other things that they're
more excited about. For us, it's been useful to try to hire people who are genuinely excited
about this work. And if they had to, they would kind of like almost
work for free for the community on open source, and they'd be happy about it. And so that creates
like the right culture for this kind of work, I feel like. I feel like one of the roles I see
Hugging Face play is as this conduit for this amazing pace of research in terms of ingest
into industry. And it's interesting to hear you say that you released Transformers as a project,
the open source library that you guys released,
and you had a bunch of companies using it. But it feels to me like there's a huge distance
between where your average enterprise is with their machine learning journey and all the
amazing cutting-edge research being shared on Hugging Face.
Like, how do you reconcile that and how does that gap close?
I think first, compared to traditional science, this gap in machine learning is extremely tiny.
My co-founder, Thomas, who did his PhD in quantum physics and some research in quantum
physics before, could tell you way better than me about it. But in traditional science, it
really used to be the norm that you would have some research and some research paper. And it
wouldn't make its way into production before, you know, 10 years, 20 years. And what we're seeing
in machine learning is that it's actually making its way into production after a year, a few months,
a few weeks, sometimes a few days now. So this is, in my opinion, amazing. And that's
what's driving most of the speed of progress in machine learning.
That's actually why I'm excited to keep investing so much in open source,
and sometimes a bit worried about more proprietary models coming up.
I think if we remove open source from that equation,
if there hadn't been as much open source as there's been in the past five years,
we would be like decades away from where we are now.
And I really hope in the future that we'll keep this very fast virtuous cycle, this iteration loop between science to production to science, because to me it's the main driver for the speed of progress in machine learning.
So I think a lot of people in the ML community share this general vein of concern: of course, we have this wealth of open source models, but large transformer-based models tend to get better when they get bigger, and they can be prohibitively expensive to train.
So there is a concern that the state of the art, which unlocks a bunch of use cases, will be in proprietary labs like DeepMind or OpenAI or Anthropic or what have you.
How do you think about the performance of what's in the open source versus the state of the art?
So, I mean, I think first, sometimes we tend to say like, okay, open source wins or proprietary wins.
The truth is that there's always going to be both, right?
I think if you see most technologies, if you look at search, you always have the Elasticsearch and the Algolia approaches. Or if you look at
databases, you have the MongoDB and the proprietary approaches.
So I'm not too worried about one winning against the other.
I think there's always going to be both.
And I think the way it works is very similar to how science has always worked in the sense
that in some specific area, sometimes you're going to have proprietary approaches that have
taken some advances and have gone faster for X or Y reason.
For example, that's the case right now, maybe in text generation, right?
With, like, ChatGPT giving better results than open source approaches.
And then on other domains, like, for example, text classification, information extraction,
arguably image generation with Stable Diffusion and stuff like that,
open source is ahead of proprietary.
And probably it's going to flip in like a few weeks to the other way around.
And that's the case for all the tasks.
So it's kind of like a race with
dozens of racers, and sometimes one is going ahead of the other,
but at the end there's always going to be both approaches.
I don't really believe in this scenario of like one model, one company to rule them all.
And one kind of like data proof that I see is that on Hugging Face,
we just crossed 250,000 models, right, a quarter of a million models,
uploaded by almost 15,000 companies now.
And I don't believe they're building models just to build models, right?
If there was one model that was better for everything than the others, they wouldn't.
We're always going to be in a world where there's going to be multiple models for companies,
especially because when you look at why companies are using so many models on Hugging Face,
you usually realize that a more specialized model is more efficient.
It's cheaper to run.
It's usually faster to run.
And most of the time, actually, more accurate for the specific use case.
So that's what we're seeing now.
And that's also, to be honest, what we're hoping to see in the future, because we build, and we fund startups, to see the future that we want to see, right?
And personally, I'm more excited about the future where machine learning is available for everyone and everyone can build machine learning versus a world where it's very concentrated and monopolistic.
So I have to ask you because you have this amazing viewpoint into what's happening in the community.
of those 250,000 models,
can you characterize the sort of distribution
of what percentage maybe is like image versus language,
other modalities,
and then from an architectural perspective,
like diffusion transformers,
other interesting approaches?
Yeah, the three main tasks right now
are NLP, so text, right,
from like information extraction,
text generation, text classification.
The second one is text to image
and computer vision, right?
So object detection, text to image, text and image generation.
The third one is audio.
So like speech to text, text to speech, and information extraction, but from audio
rather than text.
And then we're starting to see more and more models on time series.
Right.
So for example, the ETA from Uber, when you get your Uber, is like a transformer time series model,
or like financial models for fraud, or like these kinds of use cases.
And then in biology and chemistry, we're also starting to see more and more models, more and more datasets, and more and more demos there.
So that would be kind of like the main buckets.
And then what's interesting is that you have all sizes of models, like a few million parameters, up to 180 billion parameters, right?
The biggest open source models out there.
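As a rough back-of-the-envelope sketch (not from the episode, and the specific parameter counts are illustrative): model size translates almost directly into serving memory and cost, which is part of why companies reach for smaller specialized models. Assuming fp16 weights at 2 bytes per parameter, and ignoring activations and caches:

```python
def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory needed just to hold the weights, in GB.

    Assumes fp16/bf16 storage (2 bytes per parameter); activations,
    KV caches, and optimizer state are ignored.
    """
    return n_params * bytes_per_param / 1e9

# The range mentioned: a few million parameters up to ~180 billion.
# The 66M figure is a hypothetical small-model size for illustration.
for name, n_params in [("small specialized model", 66e6),
                       ("largest open LLM", 180e9)]:
    print(f"{name}: ~{weight_memory_gb(n_params):,.1f} GB of weights")
```

Under these assumptions, the 180-billion-parameter end of that range needs roughly 360 GB just for weights, versus well under 1 GB for a small classifier, which is the scale of the latency and cost gap being discussed.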
And do all the sizes get used?
Yeah, it's pretty distributed.
That's something we kind of like always look at to inform our thinking.
It depends on your use case, what you want to use.
For example, we have Bloomberg as users, right, and as customers.
And in the Bloomberg terminal, the more real-time, the better for them, right?
And so because they want to be real-time and have as little latency as possible,
they want to use kind of like a smaller model that is automatically going to be faster than the bigger model.
So depending on the use case: companies that want to build something very general,
able to apply to a lot of different use cases, from customer support to the meaning of life,
and who don't care so much about latency or cost for it, can go for a bigger model
that makes more sense. Are there specific areas or trends you're most excited about from either
a research perspective or from a model implementation perspective? I mean, I'm really excited these
days, and it's a bit of an unsexy thing to say, by the infrastructure side of things. Because
I think so far as a whole for like the machine learning domain and ecosystem,
we haven't thought too much about what it costs to run some of these models,
how fast they can go, how slow they can be.
And I hope that this year there's going to be some sort of more clarity around that,
to make sure that as a community, we build something healthy and sustainable.
I feel like sometimes in the field there's something that I call the cloud money laundering,
where you almost can't connect the infrastructure costs to the actual use cases.
I think as the field is maturing, you're going to see much better alignment between the two
and I'm actually excited about that because I think it's going to be a big enabler for the fields
in the long run.
Yeah, that makes a lot of sense.
I guess from an infrastructure perspective or tooling perspective, is there anything that
Hugging Face isn't directly working on, so it's not going to be competitive with you all,
that you really wish existed or that people were working on more actively?
Something we've worked a little bit on, but we haven't really managed to make work,
and that I'd be excited to see more teams working on, is creating some more
decentralization on the infrastructure side.
Because right now it's very centralized, both in terms of like players, but also in terms
of like timing, for example.
Like most of the time, the way you build models is that you train them once and then
maybe you're going to train it again six months later or a year later, which sounds kind
of like archaic in a way. And that creates a lot of challenges like not being able to be
current, right? Like a lot of these models, like they don't know who is the current president
in the United States or stuff like that. So having more decentralization, more online
learning, ways of going from one big training to smaller, more regular trainings, I'm really excited
about that. And the second thing is that I'm really excited about creating more consent from
the people in the dataset.
For example, we've been working with a project called BigCode, which has released
open-source code generation models, on the ability for developers to opt out from
the dataset that the model is trained on. In a similar way, we're starting to see on
Hugging Face more and more opt-in datasets, meaning datasets that contain only data
that the creators of the data have consented to having a model
trained on. It's very interesting, for example, for text-to-image models where there's a lot of
debates right now with the underlying work of the artists being used in the training. So I'm
excited to see more work around that too this year. Will you explain what BLOOM is and, more broadly,
how you decide where Hugging Face should be a first-party participant in model training or
research? Yeah, BLOOM is the result of an initiative called BigScience, which also led to
BigCode, which I just mentioned.
And BigScience was like the largest collaboration in machine learning to date, with like
a thousand researchers from 200 organizations coming together in order to build and train
a large language model completely in the open, right?
So everything was publicly available.
You can see all the runs that they did, all the brainstorming that they did to get to the
decisions that they made.
And it's been really exciting to see that kind of almost organic building,
with our support and the support of a lot of organizations.
Jean Zay, for example, the French supercomputer that provided the compute for this.
And it informed our thinking a lot around ethics and kind of like openness,
because one of the reasons why we're so focused on open source and open science at Hugging Face
is that we believe the two main challenges with AI today are, one, kind of like
the concentration of power, and second, the biases that are encoded
in these models. And for both, we learned that building in the open, with open source,
is actually more part of the solution than part of the problem. Because obviously, for concentration of power,
you distribute it much more. And for biases, you actually include in the process people who are impacted
by these biases, especially underrepresented populations, which is kind of like otherwise
really hard to do if you only do the work behind closed doors in the lab,
in Silicon Valley, with mostly old white dudes.
And so that informed a lot
kind of like our thinking around open source and open science.
There's a lot of excitement in the research community
and amongst entrepreneurs, increasingly, on reinforcement learning
with human feedback.
Does that impact your strategy at all at Hugging Face,
if that's the next step beyond this pre-training?
Yeah, it's a very interesting kind of additional step
on the classic machine learning pipeline that we've been invested in for quite a while.
I think the first reinforcement learning with human feedback models were added to the Hub
like eight months ago, so way before it was as popular as now. We're leading
the development of an open source library that is helping companies integrate that into their
models and into their workflows. It's a really exciting new development.
We've been around the block a little bit,
so every few months
there's always kind of like a new thing.
That's one of the challenges
of building a machine learning startup
these days: you have to have
the flexibility to constantly evolve.
It's been the same when diffusers
started to pick up, right,
and you started to see this new generation of models.
So each time, we kind of adapt to it,
trying to empower
the community to be able to
take advantage of this new progress
in machine learning.
That's cool.
Can you tell us a bit more about Hugging Face's business model today
and how that's going to evolve over the coming years?
Yeah, so I'm really excited about the business models
for machine learning startups this year.
Speaking of how the field can mature,
I think it's going to be the big thing this year.
And for us, as a platform, our vision is that the model is probably going to be,
at a high level, fairly similar to a freemium model, as you would expect.
Right. So right now we have like 15,000 companies using the platform and we have 3,000 companies paying us, right? So you see the majority being open, being free usage, and some of them paying. Now the big question is, where is the delimitation between the two? Obviously, we're starting to see a lot of interest in security features, in compliance features, especially for bigger companies, like when we're talking about Bloomberg, for example, using us, or Meta
using us, both being kind of like customers.
So that's one kind of like way of delimiting it.
We're starting to see also a lot of interest in more of our features around infrastructure.
So, for example, you can upgrade your Spaces to GPUs, or you can use our Inference Endpoints.
There's an interesting thing there, too, around compute and infrastructure.
It's kind of like a little bit more early, but we're seeing a lot of interest from companies
there around helping them optimize.
We talked a little bit about the infrastructure costs of machine learning.
So we're seeing a lot of interest from companies in helping them optimize that for
their use cases.
Yeah, that's really cool.
Yeah, it sounds like to your point, there's an ever-evolving field here.
And it's kind of interesting because GitHub is obviously an imperfect analogy, but an
interesting one, because there are so many things that they could have done, some of which
they're doing now, some of which they've foregone.
It's everything from the GitLab opportunity, in terms of providing like an on-premise
enterprise approach; supply chain monitoring in open source software, so things like
what Snyk or Socket or others are doing; profiles, developer transitions, and tooling.
Like, there's so much around this sort of product in terms of all the things that you can
add, both as something that's very valuable to your users, but also as potential, almost like
lines of business, like amazing company, amazing what they've accomplished.
There's all these other things that could be done, and I'm just curious how you think about
that.
Yeah.
I mean, one big thing that we're thinking about, and that's related to
what I said just before, is that maybe GitHub could have gone a little bit earlier into the
compute game and the infrastructure game. Because what you see with these platforms, and I think
it's the same for us, is that when you get so much usage, so much network effects, you
actually are, for a lot of these projects, the starting point of projects. The same
way, I think, companies are probably starting on GitHub to see the open source projects
before starting a project.
For us, for machine learning projects,
we see companies starting with Hugging Face,
trying to find a model,
find the data set,
find a demo and Hugging Face,
and then taking their infrastructure decisions based on that.
So when you're the start of projects,
you can become some sort of a gate for compute,
in my opinion,
or a gate for infrastructure.
And I think that's something that GitHub started to work on
pretty late in their journey.
So for us, we're trying and testing that a bit earlier.
We already have some infrastructure products.
We have amazing collaborations with the three big cloud providers.
And so that's maybe if I had to point one, it could be this one, like testing the ability
to become a gate for compute and monetize with infrastructure earlier than they did.
Maybe just to zoom out before we run out of time here, what are you most excited about in the next year of AI
or expanding it to the next five years?
I'm really excited about biology and chemistry for machine learning
because I think the way I see machine learning
is really as this new paradigm to build all tech, right?
It's kind of like the analogy from Andrej Karpathy,
where software 1.0 was the first paradigm
and now we're in software 2.0,
which is machine-learning-powered technology building.
And so if you look at like the big sectors
and the big kind of like impactful topics that it could change,
obviously biology and chemistry are up there. And we're seeing kind of like the numbers of
models and datasets and demos on Hugging Face increasing. Like a few days ago, there was a release
of BioGPT by Microsoft. Meta has been doing a lot of work on protein generation and prediction.
So I think there's going to be really cool stuff coming up on these two topics.
Are there particular application areas, like within biology and chemistry, that you think are going
to emerge? No, I've stopped trying to predict. Smart. Yeah. It's proven to be too difficult
in machine learning. I've made too many mistakes in the past, so now I'm not taking the risk
anymore. But I'm particularly excited about what we call kind of like full stack machine learning
companies. Companies like what we've seen more in other domains, like Runway, like
Grammarly, like Wombo, PhotoRoom, like Stability, like these companies that are not just
using machine learning, but really kind of like building machine learning. Because I think,
the same way as for software 1.0, there were the companies more like using
Squarespace or using something to build their website, and then there were the companies
who were really building technology. I think we're going to see the same thing for machine
learning. And when you look at the capabilities of some of these companies when they're like
translated into product building. I mean, Runway, you've all seen the videos of Runway. I think
that's amazing. And I think they're going to be able to really challenge the incumbents
thanks to this ability of building machine learning as machine learning native companies.
Clem, that's all we have time for today. Thank you so much for joining us on the podcast.
Thanks for having me.
Thank you for listening to this week's episode of No Priors.
Follow No Priors for a new guest each week, and let us know online what you think and who in AI you want to hear from.
You can keep in touch with me and Conviction by following @Saranormous.
You can follow me on Twitter @EladGil. Thanks for listening.
No Priors is produced in partnership with Pod People.
Special thanks to our team, Cynthia Galdaya and Pranav Reddy, and the production team at Pod People:
Alex Vigmanis, Matt Saab, Amy Machado, Ashton Carter, Danielle Roth, and Billy Libby.
Also, our parents, our children, the Academy, and
your average friendly AGI world government.