No Priors: Artificial Intelligence | Technology | Startups - Why the Future of Machine Learning is Open Source with Huggingface’s Clem Delangue
Episode Date: February 23, 2023

After starting as a talking emoji companion, Hugging Face is now an organizing force for the open source AI research ecosystem. Its models are used by companies such as Apple, Salesforce, and Microsoft, and it's working to become the GitHub for ML. This week on the podcast, Sarah Guo and Elad Gil talk to Clem Delangue, co-founder and CEO of Hugging Face. Clem shares how they shifted away from their original product, why every employee at Hugging Face is responsible for community-building, the modalities he's most interested in, and what role open source has in the AI race.

Show Links: Hugging Face website | The $2 Billion Emoji: Hugging Face Wants To Be Launchpad For A Machine Learning Revolution - Forbes

Sign up for new podcasts every week. Email feedback to show@no-priors.com

Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @ClementDelangue

Show Notes:
[01:53] - How Clem first became interested in ML, being shouted at by eBay sellers, and the foretelling of the end of barcode scanning
[03:34] - Early iterations of Hugging Face, trying to make a less boring AI Tamagotchi, and switching directions towards open source tools
[05:36] - Advice for founders considering a change in direction, 30%+ experimentation
[07:39] - First users, ML Twitter, approach to community
[10:47] - Enterprise ML maturity, days to production
[12:54] - Open source vs. proprietary models
[15:56] - Main model tasks, architectures, and sizes
[19:12] - Decentralized infrastructure, data opt-out
[24:16] - Hugging Face's business model, GitHub
[28:09] - What Clem is excited about in AI
Transcript
In traditional science, it really used to be the norm that you would have some research
and some research paper, and it wouldn't make its way into production before 10 years, 20 years.
And what we're seeing in machine learning is that it's actually making its way into production
after a year, a few months, a few weeks, sometimes a few days now.
And I really hope in the future that we'll keep this very fast virtuous cycle, this iteration loop
between science to production to science, because to me it's the main driver for the speed of
progress in machine learning.
This is the No Priors podcast. I'm Sarah Guo.
I'm Elad Gil.
We invest in, advise, and help start technology companies.
In this podcast, we're talking with the leading founders and researchers in AI about the biggest
questions.
Originally created to be an AI chatbot companion, six years later,
Hugging Face is now the collaboration backbone of the open source AI research ecosystem.
The company is currently valued at $2 billion with over 10,000 companies using their platform,
including Bing and Apple, and has expanded from popular NLP models to other modalities,
including media, biology, and more.
Our guest, Clem Delangue, co-founder and CEO of Hugging Face, is central to the open source movement in machine learning.
We'll talk about how he built this company, why open source is the future of AI, and what he sees on the horizon.
Clem, welcome to the podcast.
Thanks for having me.
Thanks so much for joining us.
So we were hoping to start with your background, which I think is really interesting.
You grew up in France, where you ran an electronics shop on eBay, which was so prolific that you ended up earning an internship opportunity with eBay.
How did you go from that to image recognition and eventually to Hugging Face?
Yeah, it's actually quite a funny story, because I was one of the biggest French sellers on eBay.
I was kind of like the user-facing team member.
And so they were sending me to all these trade shows in France, which were like the worst experiences ever
because at the time PayPal belonged to eBay.
And so we had a shared booth.
And so all the PayPal users would come to the booth and basically shout at me because PayPal was keeping their money or blocking their
accounts or things like that. It was basically kind of like the worst days ever. But during one of these
days, I bumped into a guy with big round glasses, looking very nerdy. And I remember pretty
vividly, he told me: eBay, you acquired not so long ago a barcode scanning company called
RedLaser to recognize objects. But you need to know that pretty soon, with machine learning,
and he wasn't calling it machine learning at the time,
but with these new algorithms, you won't even need the barcodes anymore.
You'll just recognize the object itself.
And at the time, I was like, who's this crazy guy?
But at night, I did my research and realized that it was a pretty legit guy
coming out of a legit engineering school in France with a small startup, which raised a little
bit of money.
And one thing after the other, I ended up leaving eBay to join the startup doing machine
learning for computer vision. I made the move to machine learning. It was almost 15 years ago now.
I don't regret it at all. It's funny how like a single, small encounter like that can completely
change your trajectory. That's really cool. Can you tell us a bit more about the early iterations
of Hugging Face, how you decided to start it, the early days as a talking emoji, and where it went
from there? Yeah, absolutely. With my co-founders, Julien and Thomas, we kind of like always shared
this passion and excitement for AI and for machine learning. And when we started Hugging Face,
we were like, okay, what can we work on that is both scientifically challenging, but also fun?
Okay, we're going to build some sort of an AI Tamagotchi.
We were heavy users of Alexa and Siri, and we were like, why is it so boring?
Why are they only talking about productivity stuff?
Why is it just telling you the weather?
And so we started to build that, kind of like some sort of an AI friend, an AI Tamagotchi.
Basically, what you see in a lot of sci-fi movies, and
a lot of what people are using ChatGPT for today, actually.
And we did that for almost three years,
got some level of traction,
billions of messages exchanged between users and the chatbots.
So that's how Hugging Face started.
And at what point did you decide to shift it towards an open source community
and model repository?
How did that come about?
It was three years in.
After our seed round.
We'd always been kind of like big open source people,
so we'd always kind of like open sourced part of what we were doing.
When Transformer models started to work, when we started to see BERT getting some traction,
we just saw the number of people using our open source just blow up and start to skyrocket, right?
We went from a couple of people looking at it to hundreds of companies using it.
We raised our series A based on this early traction, and that was really kind of like the signal that we needed to put most of the efforts of the company on this new direction.
That's cool.
So this was open source that you'd already developed and put out into the wild.
And then you saw people starting to use this.
And then you said, wow, there's so much attention here.
We should go and do that instead.
Exactly.
That's really cool.
Yeah, it's always interesting to see these shifts in direction.
I feel like that's every Stewart Butterfield company, right?
That was Slack and that was Flickr before that.
They just built something.
And then it kind of took off separately.
Do you have any sort of advice to founders who are considering changing direction or thinking of new directions for their company?
Or how do you keep your eye out for the things that are really interesting or working that may or may not be the core thing of what you're doing?
Well, I think the best way to do it is to find the good ratio in your company between
like exploitation and exploration.
And I think that's something a lot of startups don't always get right, not only before
product market fit, but also after product market fit. I feel like sometimes companies
before product market fit are experimenting kind of like too much, changing directions every
week.
And I don't think you learn a lot from that.
And then after product market fit, they kind of like stop experimenting,
stop trying new things, stop trying to stay away from the local optimum in a way and looking more
for like the global optimum. So for us, what we've always done, and I think we'll always do at
Hugging Face, is to make sure to spend at least like 30 or 40% of the company's efforts on
exploring new things and kind of like finding the long-term bets that are going to make you
successful. And then give these experiments and initiatives a chance, right? For us, we were
lucky that Thomas, one of our co-founders, was leading these kinds of experiments.
But we have examples of other initiatives that started as experiments from team members
who made it and graduated to a very big bet for the company.
One example of that is Spaces, which are our machine learning demos that have been insanely
successful; we just crossed 50,000 machine learning demos in the past year and a half.
And it started just from one team member kind of like
experimenting with that and being like, oh, I think I can build something cool there.
And one step after the other, it led to where it is today.
It seems like in general, companies that iterate or launch new things early,
keep launching things later in the life of the company and companies that never innovate early,
don't ever innovate again in their lives.
It's kind of like the difference between eBay and Stripe or you can name different
companies.
And so it's awesome that you folks were investing really early in that innovation.
When you first started getting traction with what Hugging Face does now,
did that happen organically?
And it just started growing and taking off,
or did you reach out to specific communities?
Or how did you first get those first users
to use your open source platform?
The distribution really started on Twitter at the beginning.
We just started to tweet about some of the things
that we were doing on open source,
and the machine learning community
on Twitter was already pretty strong.
And then it snowballed, I think,
classic kind of like network effect kind of things,
where researchers started to share their models.
And obviously, like they were getting visibility
for their models, so people in the industry, in companies using these machine learning models,
were hearing about Hugging Face through that, and then were asking more researchers to
add their models to Hugging Face. So kind of like more typical marketplace network effects.
And then something that we did that I think worked really well for us is that we never hired
any community manager, any kind of like communication or PR team members, because we wanted it to be part of
every single team member's job, even kind of like the most technical, specialized scientists. We've
always told them, okay, it's part of your job to interact with the community, to share with
the community, to get visibility for what you're working on. And so we ended up with this
organization where talking to the community and getting visibility is part of everyone's job
instead of being outsourced to a team. And I think that's why
people from the community appreciated us so much, because they could really talk
to the builders directly, the people doing the things. And it created kind of like more
meaningful interactions, I would say. Yeah, it's clearly really authentic to Hugging Face's culture,
the sort of commitment to community and open source. And I remember hearing that many people
in the company run the public Twitter. Yeah, everyone on the team has access to the Twitter
account and tweets from the Twitter account. I think most organizations are not capable of
that sort of risk taking. What else do you think you guys have done right on the sort of community
growth aspect? Because I think now everyone knows that's such a powerful driver for business for
an increasing number of technology companies, but it's pretty hard to actually execute against.
That's a good question. I think timing, obviously we've been really lucky with timing.
Trying to listen, it sounds a bit cliche, right, but actually listening to the community
and implementing what the community is asking. Yeah, and then just like build your culture
around it to have people who are, like, excited about contributing to the community, even
independently of everything else. I think sometimes you have companies where they're doing
community or open source work, but it's almost like a means to other things. And it
sometimes feels like they have to do it to get other things that they're
more excited about. For us, it's been useful to try to hire people who are genuinely excited
about this work. And if they had to, they would kind of like almost
work for free for the community on open source, and they'd be happy about it. And so that creates
like the right culture for this kind of work, I feel like. I feel like one of the roles I see
Hugging Face play is as this conduit for this amazing pace of research in terms of ingest
into industry. And it's interesting to hear you say that you released Transformers as a project,
the open source library that you guys released,
and you had a bunch of companies using it. But it feels to me like there's a huge distance
between where your average enterprise is with their machine learning journey and all the
amazing cutting-edge research being shared on Hugging Face.
Like, how do you reconcile that and how does that gap close?
I think first, compared to traditional science, this gap in machine learning is extremely tiny.
My co-founder, Thomas, who did his PhD in quantum physics and some research in quantum
physics before, could tell you way better than me about it. But in traditional science, it
really used to be the norm that you would have some research and some research paper. And it
wouldn't make its way into production before, you know, 10 years, 20 years. And what we're seeing
in machine learning is that it's actually making its way into production after a year, a few months,
a few weeks, sometimes a few days now. So this is, in my opinion, amazing. And that's
what's driving most of the speed of progress in machine learning.
That's actually why I'm excited to keep investing so much in open source,
and sometimes a bit worried about more proprietary models coming up.
I think if we remove open source from that equation,
if there hadn't been as much open source as there's been in the past five years,
we would be like decades away from where we are now.
And I really hope in the future that we'll keep this very fast virtuous cycle, this iteration loop between science to production to science, because to me it's the main driver for the speed of progress in machine learning.
So I think a lot of people in the ML community share this general vein of concern: of course, we have this wealth of open source models, but large transformer-based models tend to get better when they get bigger, and they can be prohibitively expensive to train.
So there is a concern that the state of the art, which unlocks a bunch of use cases, will be in proprietary labs like DeepMind or OpenAI or Anthropic or what have you.
How do you think about the performance of what's in the open source versus the state of the art?
So, I mean, I think first, sometimes we tend to say like, okay, open source wins or proprietary wins.
The truth is that there's always going to be both, right?
I think if you see most technologies, if you look at search, you always have the Elasticsearch and the Algolia approaches. Or if you look at
databases, you have the MongoDB and the proprietary approaches.
So I'm not too worried about one winning against the other.
I think there's always going to be both.
And I think the way it works is very similar to how science has always worked in the sense
that in some specific area, sometimes you're going to have proprietary approaches that have
taken some advances and have gone faster for X or Y reason.
For example, that's the case right now, maybe in text generation, right?
With, like, ChatGPT giving better results than open source approaches.
And then on other domains, like, for example, text classification, information extraction,
arguably image generation with Stable Diffusion and stuff like that,
open source is ahead of proprietary.
And probably it's going to flip in like a few weeks to the other way around.
And that's the case for all the tasks.
So it's kind of like a race with
dozens of racers, and sometimes one is going ahead of the other,
but at the end there's always going to be both approaches.
I don't really believe in this scenario of like one model, one company to rule them all.
And one kind of like data proof that I see is that on Hugging Face,
we just crossed 250,000 models, right, a quarter of a million models,
uploaded by almost 15,000 companies now.
And I don't believe they're building models just to build models, right?
If there was one model that was better for everything than the others, they wouldn't.
We're always going to be in a world where there's going to be multiple models for companies,
especially because when you look at why companies are using so many models on Hugging Face,
you usually realize that a more specialized model is more efficient.
It's cheaper to run.
It's usually faster to run.
And most of the time, actually, more accurate for the specific use case.
So that's what we're seeing now.
And that's also, to be honest, what we're hoping to see in the future, because we build, and we fund startups, to see the future that we want to see, right?
And personally, I'm more excited about the future where machine learning is available for everyone and everyone can build machine learning versus a world where it's very concentrated and monopolistic.
So I have to ask you because you have this amazing viewpoint into what's happening in the community.
of those 250,000 models,
can you characterize the sort of distribution
of what percentage maybe is like image versus language,
other modalities,
and then from an architectural perspective,
like diffusion transformers,
other interesting approaches?
Yeah, the three main tasks right now
are NLP, so text, right,
from like information extraction,
text generation, text classification.
The second one is text to image
and computer vision, right?
So object detection, text to image, text and image generation.
The third one is audio.
So like speech to text, text to speech, and information extraction, but from audio
rather than text.
And then we're starting to see more and more models on time series.
Right.
So for example, the ETA from Uber, when you get your Uber, is like a transformer time series model,
or like financial models for fraud, or like these kinds of use cases.
And then in biology and chemistry, we're also starting to see more and more models, more and more datasets, and more and more demos there.
So that would be kind of like the main buckets.
And then what's interesting is that you have all sizes of models, like a few million parameters, up to 180 billion parameters, right?
The biggest open source models out there.
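As a rough back-of-the-envelope sketch (not from the episode, and the specific parameter counts are illustrative): model size translates almost directly into serving memory and cost, which is part of why companies reach for smaller specialized models. Assuming fp16 weights at 2 bytes per parameter, and ignoring activations and caches:

```python
def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory needed just to hold the weights, in GB.

    Assumes fp16/bf16 storage (2 bytes per parameter); activations,
    KV caches, and optimizer state are ignored.
    """
    return n_params * bytes_per_param / 1e9

# The range mentioned: a few million parameters up to ~180 billion.
# The 66M figure is a hypothetical small-model size for illustration.
for name, n_params in [("small specialized model", 66e6),
                       ("largest open LLM", 180e9)]:
    print(f"{name}: ~{weight_memory_gb(n_params):,.1f} GB of weights")
```

Under these assumptions, the 180-billion-parameter end of that range needs roughly 360 GB just for weights, versus well under 1 GB for a small classifier, which is the scale of the latency and cost gap being discussed.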
And do all the sizes get used?
Yeah, it's pretty distributed.
That's something we kind of like always look at to inform our thinking.
It depends on your use case, what you want to use.
For example, we have Bloomberg as users, right, and as customers.
And in the Bloomberg terminal, the more real-time, the better for them, right?
And so because they want to be real-time and have as little latency as possible,
they want to use kind of like a smaller model that is automatically going to be faster than the bigger model.
So depending on the use case: companies that want to build something very general,
able to apply to a lot of different use cases, from customer support to the meaning of life,
and who don't care so much about latency or cost for it, can go for a bigger model
that makes more sense. Are there specific areas or trends you're most excited about from either
a research perspective or from a model implementation perspective? I mean, I'm really excited these
days, and it's a bit of an unsexy thing to say, by the infrastructure side of things. Because
I think so far as a whole for like the machine learning domain and ecosystem,
we haven't thought too much about what it costs to run some of these models,
how fast they can go, how slow they can be.
And I hope that this year there's going to be some sort of more clarity around that,
to make sure that as a community, we build something healthy and sustainable.
I feel like sometimes in the field there's something that I call the cloud money laundering,
where you almost can't connect the infrastructure costs to the actual use cases.
I think as the field is maturing, you're going to see much better alignment between the two
and I'm actually excited about that because I think it's going to be a big enabler for the fields
in the long run.
Yeah, that makes a lot of sense.
I guess from an infrastructure perspective or tooling perspective, is there anything that
Hugging Face isn't directly working on, so it's not going to be competitive with you all,
that you really wish existed or that people were working on more actively?
Something we've worked a little bit on, but we haven't really managed to make work,
and that I'd be excited to see more teams working on, is creating some more
decentralization on the infrastructure side.
Because right now it's very centralized, both in terms of like players, but also in terms
of like timing, for example.
Like most of the time, the way you build models is that you train them once and then
maybe you're going to train it again six months later or a year later, which sounds kind
of like archaic in a way. And that creates a lot of challenges like not being able to be
current, right? Like a lot of these models, like they don't know who is the current president
in the United States or stuff like that. So having more decentralization, more online
learning, ways of going from one big training to smaller, more regular trainings, I'm really excited
about that. And the second thing is that I'm really excited about creating more consent from
the people in the dataset.
For example, we've been working with a project called BigCode, which has released
open-source code generation models, on the ability for developers to opt out from
the dataset that the model is trained on. In a similar way, we're starting to see on
Hugging Face more and more opt-in datasets, meaning datasets that contain only data
that the creators of the data have consented to having a model
trained on. It's very interesting, for example, for text-to-image models where there's a lot of
debates right now with the underlying work of the artists being used in the training. So I'm
excited to see more work around that too this year. Will you explain what BLOOM is and, more broadly,
how you decide where Hugging Face should be a first-party participant in model training or
research? Yeah, BLOOM is the result of an initiative called BigScience, which also led to
BigCode, which I just mentioned.
And BigScience was like the largest collaboration in machine learning to date, with like
a thousand researchers from 200 organizations coming together in order to build and train
a large language model completely in the open, right?
So everything was publicly available.
You can see all the runs that they did, all the brainstorming that they did to get to the
decisions that they made.
And it's been really exciting to see that kind of almost organic building,
with our support and the support of a lot of organizations.
Jean Zay, for example, the French supercomputer that provided the compute for this.
And it informed our thinking a lot around ethics and kind of like openness,
because one of the reasons why we're so focused on open source and open science at Hugging Face
is that we believe the two main challenges with AI today are, one, kind of like
the concentration of power, and second, the biases that are encoded
in these models. And for both, we learned that building in the open, with open source,
is actually more part of the solution than part of the problem. Because obviously, for concentration of power,
you distribute it much more. And for biases, you actually include in the process people who are impacted
by these biases, especially underrepresented populations, which is kind of like otherwise
really hard to do if you only do the work behind closed doors in the lab,
in Silicon Valley, with mostly old white dudes.
And so that informed a lot
kind of like our thinking around open source and open science.
There's a lot of excitement in the research community
and amongst entrepreneurs, increasingly, on reinforcement learning
with human feedback.
Does that impact your strategy at all at Hugging Face,
if that's the next step beyond this pre-training?
Yeah, it's a very interesting kind of additional step
on the classic machine learning pipeline that we've been invested in for quite a while.
I think the first reinforcement learning with human feedback models were added to the Hub
like eight months ago, so way before it was as popular as now. We're leading
the development of an open source library that is helping companies integrate that into their
models and into their workflows. It's a really exciting new development.
We've been around the block a little bit,
so every few months
there's always kind of like a new thing.
That's one of the challenges
of building a machine learning startup
these days: you have to have
the flexibility to constantly evolve.
It's been the same when diffusers
started to pick up, right,
and you started to see this new generation of models.
So each time, we kind of adapt to it,
trying to empower
the community to be able to
take advantage of this new progress
in machine learning.
That's cool.
Can you tell us a bit more about Hugging Face's business model today
and how that's going to evolve over the coming years?
Yeah, so I'm really excited about the business models
for machine learning startups this year.
Speaking of how the field can mature,
I think it's going to be the big thing this year.
And for us, as a platform, our vision is that the model is probably going to be,
at a high level, fairly similar to a freemium model, as you would expect.
Right. So right now we have like 15,000 companies using the platform and we have 3,000 companies paying us, right? So you see the majority being open, being free usage, and some of them paying. Now the big question is, where is the delimitation between the two? Obviously, we're starting to see a lot of interest in security features, in compliance features, especially for bigger companies, like when we're talking about Bloomberg, for example, using us, or Meta
using us, both being kind of like customers.
So that's one kind of like way of delimiting it.
We're starting to see also a lot of interest in more of our features around infrastructure.
So, for example, you can upgrade your Spaces to GPUs, or you can use our Inference Endpoints.
There's an interesting thing there, too, around compute and infrastructure.
It's kind of like a little bit more early, but we're seeing a lot of interest from companies
there around helping them optimize.
We talked a little bit about the infrastructure costs of machine learning.
So we're seeing a lot of interest from companies in helping them optimize that for
their use cases.
Yeah, that's really cool.
Yeah, it sounds like to your point, there's an ever-evolving field here.
And it's kind of interesting because GitHub is obviously an imperfect analogy, but an
interesting one, because there are so many things that they could have done, some of which
they're doing now, some of which they've foregone.
It's everything from the GitLab opportunity, in terms of providing like an on-premise
enterprise approach; supply chain monitoring in open source software, so things like
what Snyk or Socket or others are doing; profiles, developer transitions, and tooling.
Like, there's so much around this sort of product in terms of all the things that you can
add, both as something that's very valuable to your users, but also as potential, almost like
lines of business, like amazing company, amazing what they've accomplished.
There's all these other things that could be done, and I'm just curious how you think about
that.
Yeah.
I mean, one big thing that we're thinking about, and that's related to
what I said just before, is that maybe GitHub could have gone a little bit earlier into the
compute game and the infrastructure game. Because what you see with these platforms, and I think
it's the same for us, is that when you get so much usage, so much network effects, you
actually are, for a lot of these projects, the starting point of projects. The same
way, I think, companies are probably starting on GitHub to see the open source projects
before starting a project.
For us, for machine learning projects,
we see companies starting with Hugging Face,
trying to find a model,
find the data set,
find a demo and Hugging Face,
and then taking their infrastructure decisions based on that.
So when you're the start of projects,
you can become some sort of a gate for compute,
in my opinion,
or a gate for infrastructure.
And I think that's something that GitHub started to work on
pretty late in their journey.
So for us, we're trying and testing that a bit earlier.
We already have some infrastructure products.
We have amazing collaborations with the three big cloud providers.
And so that's maybe if I had to point one, it could be this one, like testing the ability
to become a gate for compute and monetize with infrastructure earlier than they did.
Maybe just to zoom out before we run out of time here, what are you most excited about in the next year of AI
or expanding it to the next five years?
I'm really excited about biology and chemistry for machine learning
because I think the way I see machine learning
is really as this new paradigm to build all tech, right?
It's kind of like the analogy from Andrej Karpathy,
where software 1.0 was the first paradigm
and now we're in software 2.0,
which is machine-learning-powered technology building.
And so if you look at like the big sectors
and the big kind of like impactful topics that it could change,
obviously biology and chemistry are up there. And we're seeing kind of like the numbers of
models and datasets and demos on Hugging Face increasing. Like a few days ago, there was a release
of BioGPT by Microsoft. Meta has been doing a lot of work on protein generation and prediction.
So I think there's going to be really cool stuff coming up on these two topics.
Are there particular application areas, like within biology and chemistry, that you think are going
to emerge? No, I've stopped trying to predict. Smart. Yeah. It's proven to be too difficult
in machine learning. I've made too many mistakes in the past, so now I'm not taking the risk
anymore. But I'm particularly excited about what we call kind of like full stack machine learning
companies. Companies like what we've seen more in other domains, like Runway, like
Grammarly, like Wombo, PhotoRoom, like Stability, like these companies that are not just
using machine learning, but really kind of like building machine learning. Because I think,
the same way as for software 1.0, there were the companies more like using
Squarespace or using something to build their website, and then there were the companies
who were really building technology. I think we're going to see the same thing for machine
learning. And when you look at the capabilities of some of these companies when they're like
translated into product building. I mean, Runway, you've all seen the videos of Runway. I think
that's amazing. And I think they're going to be able to really challenge the incumbents
thanks to this ability of building machine learning as machine learning native companies.
Clem, that's all we have time for today. Thank you so much for joining us on the podcast.
Thanks for having me.
Thank you for listening to this week's episode of No Priors.
Follow No Priors for a new guest each week, and let us know online what you think and who in AI you want to hear from.
You can keep in touch with me and Conviction by following @Saranormous.
You can follow me on Twitter @EladGil. Thanks for listening.
No Priors is produced in partnership with Pod People.
Special thanks to our team, Cynthia Galdaya and Pranav Reddy, and the production team at Pod People:
Alex Vigmanis, Matt Saab, Amy Machado, Ashton Carter, Danielle Roth, and Billy Libby.
Also, our parents, our children, the Academy, and
your average friendly AGI world government.