No Priors: Artificial Intelligence | Technology | Startups - Can AI replace the camera? with Joshua Xu from HeyGen
Episode Date: June 20, 2024
AI video generation models still have a long way to go when it comes to making compelling and complex videos, but the HeyGen team are well on their way to streamlining the video creation process by using a combination of language, video, and voice models to create videos featuring personalized avatars, b-roll, and dialogue. This week on No Priors, Joshua Xu, the co-founder and CEO of HeyGen, joins Sarah and Elad to discuss how the HeyGen team broke down the elements of a video and built or found models to use for each one, the commercial applications for these AI videos, and how they’re safeguarding against deep fakes. Links from episode: HeyGen McDonald’s commercial Sign up for new podcasts every week. Email feedback to show@no-priors.com Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @joshua_xu_ Show Notes: (0:00) Introduction (3:08) Applications of AI content creation (5:49) Best use cases for HeyGen (7:34) Building for quality in AI video generation (11:17) The models powering HeyGen (14:49) Research approach (16:39) Safeguarding against deep fakes (18:31) How AI video generation will change video creation (24:02) Challenges in building the model (26:29) HeyGen team and company
Transcript
Welcome, Joshua. We're so excited to have you here today. How are you?
Hey, Sarah. I'm so excited to be here. Thanks for having me today.
It's our pleasure. Let's get started. Welcome to the Huberman Lab podcast, where we discuss
science and science-based tools for everyday life. I'm Sarah Guo, and I'm a professor of
neurobiology and ophthalmology at the school of medicine.
Wait, Sarah, I'm so confused. What's going on here? Is this thing on?
Today, we're here to discuss how AI can benefit your health and what medicinal properties the technology holds.
Sarah, I'm so lost. Isn't this the No Priors podcast where you interview technology superstars like Garry Tan and Alexandr Wang?
No, that's only for humans.
We're really excited to have you. Welcome, Joshua.
Yeah, I'm excited to be here. Thank you for having me.
So let's start with a little bit of backstory. You started this company, HeyGen; it's had this amazing growth trajectory and is being used by millions of people now.
What's the story of starting the company?
Yeah, sure.
So, yeah, hello everyone.
My name is Joshua, co-founder and CEO of HeyGen.
We founded the company roughly three and a half years ago, and before that I was working at Snapchat for about six and a half years. I studied robotics at Carnegie Mellon and joined Snap back in 2014. I initially worked on machine learning in Snapchat ads, doing ranking and recommendation.
Then I spent my last two years at Snap working on AI cameras.
So, you know, Snap leveraged a lot of AI technology to enhance the camera experience.
If you look at, you know, 2018 Snapchat, we released the baby filter and the Disney-style filter. That was the first time I saw a computer could actually create and generate something that doesn't exist in the real world. I was just so fascinated by the technology back then. And I had a feeling that it would potentially change the way people create content. So, you know, Snapchat is a camera company, and everybody created content through the mobile camera. But we wanted to replace the camera, because we think AI can create the content, and AI could become the new camera. And that's how we got started with HeyGen, and our mission is to make visual storytelling accessible to all.
I love it.
The greatest minds of our generation, you know, inspired by, you know, your face as a cute kitten or whatever. What does replacing the camera mean to you? Like, why do we need to do that? I use my camera a lot.
I kind of grew up in my career in the whole, you know, mobile camera space, where we worked on a lot of the software and technology to help people feel comfortable and make it easier for people to create content through the mobile camera. But, you know, there are still lots of people who are not able to create quick content using the camera today. And we felt that if we can replace the camera, that means we can remove the barrier for visual storytelling, for visual content creation. And that will help us step ahead in terms of the whole content creation space.
What are some of the areas that you think, you know,
the technology that you developed is applied to? Because I think you've started with different
forms of, like, virtual avatars, so that you can take a video of yourself and then turn it into an avatar that you can then feed text to; it can speak in your voice, it can do all sorts
of really interesting things for different areas.
How did you both decide to start with avatars and then where do you think the main applications
are?
When we initially started the company, we tried to, like, disassemble the whole video production process. It's really about camera and then editing. So camera is more about A-roll, which represents a human spokesperson, the avatar piece. Editing is more about B-roll: adding different assets, voiceover, music, transitions, animations, stuff like that. On editing, we just learned from customers that editing is not that expensive, because it's a pretty standard service, but camera is super expensive. And imagine, you know, the CEO of a company wants to record something; we probably need to schedule that, you know, two weeks ahead of time. We need to bring in a camera crew and have a studio to actually record it. And even for two minutes of footage, sometimes we need to record for 20 minutes, because people need to remember the script. And that's the piece that's blocking a lot of businesses from creating new content. So that's how we got started: trying to replace that piece of the process and making avatars to replace the camera for video production.
Where do you think that goes in the future?
So, you know, people are already using HeyGen for all sorts of different application areas in terms of, you know, marketing and sales, and in some cases, like, internal webinars or learning or other things. I'm a little bit curious, like, is the eventual form of
this, you know, everybody has somebody who steps in for them for their zooms or is it used
for entertainment purposes? How do you kind of view the evolution of this sort of technology
over time? Yeah, I'll say there are many possibilities out there. I think the problem we are tackling so far is the entry point of content creation, where all the content starts with the camera. And then we would have people doing a lot of editing after that. You know, we can clearly see a path where people can already assemble all this generative footage and apply AI editing to assemble the final video. And again, you know, if we push the technology forward, making the performance much better, I think we will be able to create experiences like generative video in a streaming way. And that actually will potentially replace a lot of the, you know, real-time conversations we have today, especially with GPT-4o and with all this multimodal real-time streaming technology altogether.
Okay. We're still in asynchronous video creation land in 2024. How do people use HeyGen today?
Like, what are the favorite use cases you have?
I would categorize the use cases of HeyGen into three: create, localize, and personalize. And, you know, people can select, you know, a cast from our avatar library, or create their own digital twin, and just, like, select a template or type the script and generate a video.
This works the best for product explainers, how-to videos, learning and development, and some sales-enablement training content.
We can also take an existing video and localize that into more than 175 different languages and dialects. And, you know, in this way, we can help customers really localize their content into local languages. And last but not least, people can also use HeyGen to personalize video messaging at scale. So I think there are many, many very creative use cases on HeyGen today. We are a very horizontal platform.
I would say one of my favorite use cases is probably the recent launch with McDonald's. McDonald's launched a campaign where they allow people to send a message to a family member in different languages.
I love him so much. I would run out of words to express my love for him.
You know, I just want to call it out, you know, AI is for everyone, grandma and grandchildren alike.
Yeah, that's really cool. I mean, like, that's a big brand in a public, like, consumer-facing use case.
How do you think about, like, the quality of HeyGen today? Like, you know, I would have thought of that as, like, sort of the top of the pyramid in terms of quality. And, you know, how can you tell when the avatars are good enough and not?
Yeah. So I would say quality has always been, you know, the number one priority of the product and business and technology.
I would say, you know, I always have a framework like this: there's an invisible line of quality, you know, let's say that threshold is at 90. Anything below 90 is essentially unusable for the customers, because we cannot really replace the real-life production process they have. We really need to focus on making the video generation quality go above that threshold. And I think, especially for avatars today, it is above that. So we can really help people replace the real camera and unleash a lot of the creative process that helps people scale content production there. And, you know, obviously, you know, there's much more room to improve, for example, generating the full-body avatar, being able to bring all sorts of other elements into a video.
Yeah, we're in the process of that.
What are you most excited about in terms of, like, what's next or new releases you guys have coming?
I think there are many very exciting things going on in our technology and product roadmap. I think particularly I'm very excited for the full-body generation of the avatar. Historically, all the avatar technology has been focused on the upper body. It's really hard to generate the gestures and the body motion, but a lot of academic research has proven that this is very possible now, and we just need to basically take that through the last mile. And another thing I would say I'm very excited about is the streaming avatar, especially with the latest release of GPT-4o, which really helps to improve the performance of real-time interaction with text and voice, and the HeyGen avatar could become a video-ization layer for all those applications.
Obviously, you need, like, full gesture control and movement to get to any video of any kind.
But what do customers want to do in terms of full-body motion today?
Like, you had a demo of walking in, you know, the last couple months.
The way we look at it is that there's a spectrum of quality requirements laid out across different use cases, right? Let's start from the left side of the spectrum. It's the learning and development content, educational content. It's more like one-to-many broadcasting, talking about educational training content. The quality requirement there is lower, because the avatar can be more still, more professional.
But on the right side of the spectrum, we call it, like, the high-end marketing content, really dynamic; you know, one example would be the ads creative. People ship very, very dynamic content in ads, because that can really help to improve the ROI of the content, making it more engaging. I think enabling that full-body rendering will be able to help us bring the avatar, bring the video, to the next level of engaging and authentic. And that will help to unlock a lot of use cases in the broader space of marketing and sales.
Newscasts or other things. To your point, they often have the shot of people walking and talking as, like, a standard canned shot. And there are, like, these standard things that they use that, if you had full body, you could provide for all sorts of application areas. I guess
related to that, what is the technology that you folks are using today? You mentioned some things
like GPT-4o, but you've also built your own models. So, like, how do you think about the technology stack that you're using? And how does that have to evolve in order to be able to do full body or other new things?
There are three models, right? Text, voice,
and video.
So we work with OpenAI, ChatGPT, on the text generation side. Obviously, it also serves as, like, the brain of the orchestration engine that we built internally. And we work with, you know, OpenAI and ElevenLabs on the voice engine, but we build the entire video stack in-house, including avatar creation, video rendering, and B-roll generation.
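As a rough illustration of the text, voice, and video pipeline described here, the sketch below shows how an orchestration engine might chain the three models together. All function names and provider calls are hypothetical stand-ins, not HeyGen's actual internal API.

```python
# Hypothetical sketch of a three-model pipeline: an LLM drafts the script,
# a TTS engine voices it, and an in-house video stack renders the avatar.
from dataclasses import dataclass


@dataclass
class VideoJob:
    topic: str
    avatar_id: str
    voice_id: str


def generate_script(topic: str) -> str:
    """Stand-in for the LLM 'brain' (e.g. a ChatGPT call) that drafts the script."""
    return f"Hi, here is a short explainer about {topic}."


def synthesize_voice(script: str, voice_id: str) -> bytes:
    """Stand-in for the voice engine (TTS) returning audio bytes."""
    return script.encode("utf-8")  # placeholder audio


def render_avatar_video(audio: bytes, avatar_id: str) -> str:
    """Stand-in for the in-house video stack: lip-syncs the avatar to the audio."""
    return f"/renders/{avatar_id}.mp4"  # placeholder output path


def orchestrate(job: VideoJob) -> str:
    # The orchestration engine chains the three models together.
    script = generate_script(job.topic)
    audio = synthesize_voice(script, job.voice_id)
    return render_avatar_video(audio, job.avatar_id)


print(orchestrate(VideoJob("our new product", "avatar_01", "voice_01")))
```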
So I think, over time, the whole technology trend has been moving towards a direction where a lot of these things will be chained together: multimodal models, where multimedia all gets into one single model.
One of the challenges I want to call out for full-body generation is actually how you connect the voice together with the gesture motion. And that's actually something that will be unlocked by getting the voice model and the video model trained together, so that it can sort of build a connection in the underlying model as well. And that has been historically really, really hard, because we have to, like, train the TTS model on one hand and then feed that TTS model's output into a video model. And it's pretty hard to build that connection. But with multimodal model training, that's very possible.
Obviously, Sora is not available to developers and end users today, but there are, like, world-class text-to-video generation models that are generic, not avatars. How does this technology differ from something like Sora?
When we initially started HeyGen, we wanted to help businesses solve the video creation problem. What does a business look for? They're looking for quality, they're looking for control, they're looking for consistency. So then we try to look at: okay, this is the North Star, how can we get there? What's the technical path to get us there? There are essentially two potential paths. One is text-to-video, the Sora approach, where we try to generate the entire thing end to end, and so you get the entire video at once.
And the other approach is what we believe in at HeyGen: we try to disassemble the whole video into different components. Largely it will be A-roll and B-roll. B-roll represents all the different kinds of elements, like voiceover, music, transitions, with A-roll being the avatar. And we try to tackle these components one by one, and then we build an orchestration engine around that to assemble the final video together. We felt that this technical path is more capable of delivering the quality, you know, the control and the consistency that the brand is looking for. Because, for example, there's some stuff we probably should not try to generate, like the logo and the fonts; those need to be very accurate. Not to mention that we also need to be able to learn, especially in the business context, about the brand style, the color mapping, essentially, from a customer. And I think the second approach gives us more flexibility and capability to build that system around it. And in fact, we actually see Sora as our partner, because we are able to integrate that as one of the component generators and then feed that into our orchestration engine for the business application.
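To make the component-based approach concrete, here is a hedged sketch of a timeline that mixes generated A-roll and B-roll with exact, non-generated brand assets like logos. The data structure is illustrative only and not taken from HeyGen.

```python
# Illustrative sketch: assemble a video from generated A-roll/B-roll plus
# exact brand assets (logo, fonts) that are deliberately NOT generated.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Component:
    kind: str        # "a_roll", "b_roll", or "brand_asset"
    source: str      # generator name or literal file path
    generated: bool  # brand assets stay exact, so generated=False


@dataclass
class Timeline:
    components: List[Component] = field(default_factory=list)

    def add(self, component: Component) -> None:
        self.components.append(component)


timeline = Timeline()
timeline.add(Component("a_roll", "avatar_generator", generated=True))
timeline.add(Component("b_roll", "text_to_video_generator", generated=True))  # could be any partner model
timeline.add(Component("brand_asset", "assets/logo.png", generated=False))    # kept pixel-exact

# An orchestration engine would composite these into the final video;
# only the generated pieces come from models, the rest is used as-is.
for c in timeline.components:
    print(c.kind, "from", c.source, "(generated)" if c.generated else "(exact)")
```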
How do you think about what research, you know, if you just focus on, like, components of the experience, in particular the video stack being the thing that you really want to own and be state of the art in at HeyGen. How do you approach, like, new capabilities from a research perspective? Is it, you know, look at what's available in academia, look at the problems customers give you, sort of de novo?
I would say it's a combination.
I would add one more thing: we deeply understand the limitations around the models and try to find the connection between what the customer is looking for and what the technology is capable of. Like, when we really look at it, all AI models have some sort of limitation. And I think the key question, in order to deliver a great product experience for the customer, is how do we design the product around it so that we can try to avoid the limitations of the model but help to amplify the strengths of the model. And that's something that's really important for finding the new areas that unlock new creation experiences.
One example would be when we look at, you know, the video translation technology. It's a whole new way to translate content compared to traditional dubbing: we preserve the user's natural voice and their facial expression. But if you look at the underlying model, what enables that video rendering is actually a lip-sync model, right? We kind of figured out a way to combine all of this together with the voice, as well as the translation with ChatGPT, and build a great experience around it. And so we are creating a whole new experience for localizing video and content.
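As a rough illustration of that translation flow (transcribe the original, translate it, re-voice it in the speaker's cloned voice, then lip-sync the footage), here is a minimal sketch; every function name is a hypothetical stand-in rather than a real API.

```python
# Hypothetical sketch of a video-translation pipeline:
# transcribe -> translate -> re-voice in the speaker's cloned voice -> lip-sync.

def transcribe(video_path: str) -> str:
    """Stand-in for speech-to-text on the original footage."""
    return "Hello, welcome to our product tour."


def translate(text: str, target_language: str) -> str:
    """Stand-in for an LLM translation call."""
    return f"[{target_language}] {text}"


def clone_voice_tts(text: str, speaker_sample: str) -> bytes:
    """Stand-in for TTS in the original speaker's cloned voice."""
    return text.encode("utf-8")


def lip_sync(video_path: str, new_audio: bytes) -> str:
    """Stand-in for the lip-sync model that re-renders the speaker's mouth."""
    return video_path.replace(".mp4", "_translated.mp4")


def localize(video_path: str, target_language: str) -> str:
    script = transcribe(video_path)
    translated = translate(script, target_language)
    audio = clone_voice_tts(translated, speaker_sample=video_path)
    return lip_sync(video_path, audio)


print(localize("demo.mp4", "es"))
```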
So there are lots of, like, great, exciting commercial applications like the McDonald's one. I think a lot of people also think deepfakes are really scary. Like, the ability to, you know, abuse somebody's, you know, likeness or voice is scary. How do you think about safety, election safety, abuse?
First of all, we do not allow any political or election content on our platform today.
HeyGen's policy strictly prohibits the creation of, you know, unauthorized content, and we take abuse of the platform seriously. So we have our safety, you know, security safeguards, including very advanced user verification, including, you know, live video consent, a dynamic verbal passcode, and rapid human review behind all the avatars being created on the platform. Trust and safety is critical to our business, and we are actively partnering across the industry, you know, to continue developing the tools and best practices to combat misinformation and improve AI safety. And we actually build safety in as part of the design. If you look at, you know, the avatar creation process on HeyGen, we bake all these safety concerns and safeguards into every single step of the creation process as well.
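To illustrate how such layered checks might gate a creation request (consent, a dynamic verbal passcode, blocked categories, and escalation to human review), here is a hedged sketch; the fields and rules are hypothetical and not HeyGen's actual implementation.

```python
# Illustrative safety gate: every avatar-creation request must pass
# consent verification and a dynamic verbal passcode, and blocked
# categories are rejected outright; allowed requests may still be
# sampled for human review downstream.
from dataclasses import dataclass


@dataclass
class CreationRequest:
    user_id: str
    has_live_video_consent: bool
    passcode_spoken: str
    passcode_expected: str
    content_category: str  # e.g. "marketing", "political"


def is_allowed(req: CreationRequest) -> bool:
    if req.content_category == "political":
        return False  # political/election content is blocked outright
    if not req.has_live_video_consent:
        return False  # the person being cloned must consent on camera
    if req.passcode_spoken != req.passcode_expected:
        return False  # dynamic verbal passcode guards against replayed footage
    return True


req = CreationRequest("u123", True, "blue-falcon-42", "blue-falcon-42", "marketing")
print("allowed" if is_allowed(req) else "blocked")
```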
It makes a lot of sense. I guess it's kind of interesting if you think about the positive version of this, and you talked about how you try to protect against the negative. The positive version is, you know, you're running for office and you should be able to send a personalized message to each voter, literally into their inbox, with a short video clip of you talking to them specifically, talking to issues that they specifically care about, or things like that. And so you could imagine using this technology in the future for actually hyper-personalized political campaigning. And as long as you can avoid some of the deepfake side of it, then obviously it could actually be quite valuable.
How do you think this ability to really generate large-scale, differentiated, personalized,
et cetera, content of individuals talking?
How does this kind of generation change how people make or use video in general?
You know, if people can generate very engaging and authentic video content, they will basically create more videos and use video more for their business, to grow their business. And we live in a video-first world where every business wants to create more videos. I think the bottleneck today in the industry is that video is just very expensive to make, and it takes, like, weeks or months to make a video. I think it will fundamentally change a lot of the ways people think about how to grow the business, how to do communication, how to do marketing and sales. So I do think there's a huge possibility that we can create and generate a very high degree of personalized video, especially with the full-body avatar being able to deliver very dynamic and high-quality content out there.
So I want to give you one example. I think a lot of AI generation is not only about, obviously, you know, cost-saving and time-saving; that's one aspect of the value of the product. But we are actually seeing a lot of customers using it to unlock new use cases and being able to do something they were not able to do before. I think that's the key point for a lot of business outcomes today.
How do you think about it in the context of
real-time versus asynchronous? It feels like a lot of these technologies are focused right now in
asynchronous use cases. And that's true as well of just pure text-to-speech models. When do you think
we move to any sort of real-time or close to real-time video avatars and sort of the uses of that?
I look at it in two ways.
One is the real-time application of the avatar. Even now it's possible; I think people can already experience that on HeyGen. We are making a new update that can make it even faster. So it can potentially become, let's say, the virtual, you know, AI SDR or virtual support that helps to take customer calls or provide support, right? And, you know, I think the technology has always been developing along this chain. Two years from now, it would not be crazy to see a lot of asynchronous avatar generation pipelines become real-time streaming capable. And I also see the world moving towards a way where we can probably generate the entire video in real time as well in the future, let's say five years from now.
I have an opinion, like, you know, a generative image is still an image, but generative video is not a video. It is a new format. What I mean by that is, you know, when we really look at a video, we look at it as an MP4 file. So it is immutable. Like, for example, if you and I are on Instagram, we probably get recommended two different ads. But as long as we are recommended one from the same business, we are looking at the same MP4 file. But it does not need to be the same. Let's say maybe I like avocado; I should be watching an ad with Coca-Cola and avocado, showing, you know, the new story about Coca-Cola to me. And if you like something else, you could be looking at something else. And this is not possible today because making a video is expensive. But this could be very possible. Let's say we can actually, you know, generate in real time the video ads that you like according to your user attributes; that will potentially become a new format. You know, when we really look at today's video player, it corresponds to only one MP4 file. And it doesn't need to be like that. The video player can actually take in a lot of, you know, user attributes and generate something in real time to match the best way to deliver the content to customers.
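As a rough sketch of that "video as a new format" idea, the example below maps viewer attributes to a generated (or cached) variant instead of serving one fixed MP4. The attribute names and the rendering function are hypothetical.

```python
# Hypothetical sketch: one campaign, many generated variants, chosen per viewer.
from typing import Dict


def render_ad_variant(campaign: str, attributes: Dict[str, str]) -> str:
    """Stand-in for generating (or fetching a cached) personalized video."""
    flavor = attributes.get("favorite_food", "default")
    language = attributes.get("language", "en")
    # In a real system this would trigger avatar + B-roll generation;
    # here we just return a descriptive variant id.
    return f"{campaign}--{flavor}--{language}.mp4"


viewer_a = {"favorite_food": "avocado", "language": "en"}
viewer_b = {"favorite_food": "pizza", "language": "es"}

print(render_ad_variant("coca-cola-story", viewer_a))  # coca-cola-story--avocado--en.mp4
print(render_ad_variant("coca-cola-story", viewer_b))  # coca-cola-story--pizza--es.mp4
```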
Yeah.
Yeah, I think, you know, one interesting analogy would just be if you think about, you know, YouTube as one of the largest learning devices in the world today. Like, it is static, immutable video for everyone, but it's pretty clear from the Bloom studies and everything else that, like, personalized education is going to be the path that is more effective. And people want to learn by video, but it's very hard; it's too expensive to make that video personalized. This feels like, you know, an opportunity for a very different educational future too.
Yeah, and one of the use cases we have seen from
customers is that, you know, Publicis Groupe generated more than a hundred thousand videos, a thank-you video to send to all their employees globally, localized into different languages and personalized with their name and what they like about, you know, when they joined the company and stuff like that. And historically that would actually only be delivered with one video, right? So maybe the CEO or the executive team hops in front of a camera and records something, you know, saying thank you for, you know, 2023. But now that message and communication can really be personalized at a very big scale.
So one thing you mentioned is the various aspects of research
that you're doing in terms of building your own video models as well as using third-party APIs.
What's been difficult or hard from a research perspective?
Unlike a lot of other models, I think for building video models, you know, being able to integrate aesthetics into the AI model is pretty hard. So, you know, video generation is not only about solving a mathematical problem. It's actually about creating something the customer loves and appreciates. So essentially, a model with a lower optimized cost function doesn't mean it actually produces a better visual outcome. So I guess that is the piece that makes it really hard to evaluate, but also really important for delivering the last mile of value for the customer. And, you know, generally, evaluation is also hard. We have to rely on in-product signals, for example A/B tests, to know which model is actually better, because, you know, only the customer can be the judge of that. And this process generally is just not differentiable from a mathematical standpoint. We kind of have to build a system around it and be able to feed those data back into our model training so that we can continuously improve.
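Since the customer-preference signal is not differentiable, model selection ends up looking like an A/B test over in-product metrics rather than a comparison of training losses. Below is a minimal hypothetical sketch of that kind of evaluation loop; the metric and the data are made up for illustration.

```python
# Illustrative sketch: pick the better video model by in-product signal
# (e.g. watch-through rate from an A/B test), not by training loss.
import random
from typing import Dict, List


def watch_through_rate(model_name: str, sessions: List[Dict]) -> float:
    """Fraction of sessions where viewers finished videos made by this model."""
    own = [s for s in sessions if s["model"] == model_name]
    return sum(s["completed"] for s in own) / max(len(own), 1)


# Fake A/B test data; in production these would be logged product events.
sessions = [
    {"model": random.choice(["model_a", "model_b"]), "completed": random.random() > 0.4}
    for _ in range(1000)
]

scores = {m: watch_through_rate(m, sessions) for m in ("model_a", "model_b")}
winner = max(scores, key=scores.get)
print(scores, "->", winner)  # the winner's outputs feed back into training data
```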
Did this approach come to you because of your work at Snapchat, working on consumer products, or is it something that you had to come up with in the context of HeyGen itself?
I would say it's very similar, especially when we worked on the camera software. So how do we know whether this parameter was better or the other one was better? And I think we can definitely come up with some, you know, very objective metrics about, hey, you know, this is the lighting score, this is the resolution. But there are many things where we find out, hey, a better resolution, I mean higher resolution, doesn't mean it's better image quality for customers. If you look at the iPhone, it does not always have the best resolution compared to a lot of other phones, but it does produce the images that most people like, which is why they use the iPhone to capture images. And yeah, there are very similar lessons out there that we learned from in the early days that help with that.
Yeah. What can you say about how big HeyGen is today?
We are a little bit over 40 people,
but we are serving over 40,000 paying customers on the platform today. And I think what's so
interesting about our customers is that these are not the typical AI early adopters. These are
mainstream companies, from European manufacturers to small businesses, to global non-profits, to Fortune 500 companies, which shows the space of the problem we are solving.
Given the 1,000 customers per employee, which is an incredibly impressive metric, are there specific key roles that you're
hiring for or other things that maybe members of our audience may want to apply for?
Sure. Yeah. We're hiring across different teams, basically, product, design, engineering,
AI research, and go to market. Yeah. This has been a great conversation. Thanks, Joshua.
Thanks so much. Yeah. Thank you. Thank you for having me.
Find us on Twitter at NoPriorsPod.
Subscribe to our YouTube channel if you want to see our faces. Follow the show on Apple Podcasts, Spotify, or wherever you listen. That way you get a new episode every week. And sign up for emails or find transcripts for every episode at no-priors.com.