No Priors: Artificial Intelligence | Technology | Startups - Can AI replace the camera? with Joshua Xu from HeyGen
Episode Date: June 20, 2024
AI video generation models still have a long way to go when it comes to making compelling and complex videos, but the HeyGen team are well on their way to streamlining the video creation process by using a combination of language, video, and voice models to create videos featuring personalized avatars, b-roll, and dialogue. This week on No Priors, Joshua Xu, the co-founder and CEO of HeyGen, joins Sarah and Elad to discuss how the HeyGen team broke down the elements of a video and built or found models to use for each one, the commercial applications for these AI videos, and how they’re safeguarding against deep fakes. Links from episode: HeyGen McDonald’s commercial Sign up for new podcasts every week. Email feedback to show@no-priors.com Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @joshua_xu_ Show Notes: (0:00) Introduction (3:08) Applications of AI content creation (5:49) Best use cases for HeyGen (7:34) Building for quality in AI video generation (11:17) The models powering HeyGen (14:49) Research approach (16:39) Safeguarding against deep fakes (18:31) How AI video generation will change video creation (24:02) Challenges in building the model (26:29) HeyGen team and company
Transcript
Welcome, Joshua. We're so excited to have you here today. How are you?
Hey, Sarah. I'm so excited to be here. Thanks for having me today.
It's our pleasure. Let's get started. Welcome to the Huberman Lab podcast, where we discuss
science and science-based tools for everyday life. I'm Sarah Guo, and I'm a professor of
neurobiology and ophthalmology at the school of medicine.
Wait, Sarah, I'm so confused. What's going on here? Is this thing on?
Today, we're here to discuss how AI can benefit your health and what medicinal properties the technology holds.
Sarah, I'm so lost. Isn't this the No Priors podcast where you interview technology superstars like Garry Tan and Alexandr Wang?
No, that's only for humans.
We're really excited to have you. Welcome, Joshua.
Yeah, I'm excited to be here. Thank you for having me.
So let's start with a little bit of backstory. You started this company, HeyGen; it's had this amazing growth trajectory and is being used by millions of people now.
What's the story of starting the company?
Yeah, sure.
So, yeah, hello everyone.
My name is Joshua, co-founder and CEO of HeyGen.
We founded the company roughly three and a half years ago, and before that I was working at Snapchat for about six and a half years. I studied robotics at Carnegie Mellon and joined Snap back in 2014. I initially worked on machine learning in Snapchat ads, doing ranking and recommendation.
Then I spent my last two years at Snap working on AI cameras.
So, you know, Snap leveraged a lot of AI technology to enhance the camera experience.
If you look at, you know, 2018 Snapchat, we released the baby filter and the Disney-style filter. That was the first time I saw a computer could actually create and generate something that doesn't exist in the real world. I was just so fascinated by the technology back then. And I had a feeling that it would potentially change the way people create content. So, you know, Snapchat is a camera company, and everybody created content through the mobile camera. But we wanted to replace the camera, because we think AI can create the content, and AI could become the new camera. And that's how we got started with HeyGen, and our mission is to make visual storytelling accessible to all.
I love it.
The greatest minds of our generation, you know, inspired by, you know, your face as a cute kitten or whatever. What does replacing the camera mean to you? Like, why do we need to do that? I use my camera a lot.
I kind of grew up in my career in the whole, you know, mobile camera space, where we worked on a lot of the software and technology to help people feel comfortable and make it easier for people to create content through the mobile camera. But, you know, there are still lots of people who are not able to create quick content using the camera today. And we felt that if we can replace the camera, that means we can remove the barrier for visual storytelling, for visual content creation. And that will help us step ahead in terms of the whole content creation space.
What are some of the areas that you think, you know,
the technology that you developed is applied to? Because I think you've started with different
forms of, like, virtual avatars, so that you can take a video of yourself and then turn it into an avatar that you can then feed text to; it can speak in your voice, it can do all sorts
of really interesting things for different areas.
How did you both decide to start with avatars and then where do you think the main applications
are?
When we initially started the company, we tried to, like, disassemble the whole video production process. It's really about camera and then editing. So camera is more about A-roll, which represents a human spokesperson, the avatar piece. Editing is more about B-roll: adding different assets, voiceover, music, transitions, animations, stuff like that. On editing, we just learned from customers that editing is not that expensive, because it's a pretty standard service, but camera is super expensive. And imagine, you know, the CEO of a company wants to record something; we probably need to schedule that, you know, two weeks ahead of time. We need to bring in a camera crew and have a studio to actually record it. And even for two minutes of footage, sometimes we need to record for 20 minutes, because people need to remember the script. And that's the piece that's blocking a lot of businesses from creating new content. So that's how we got started: trying to replace that piece of the process and making avatars to replace the camera for video production.
Where do you think that goes in the future?
So, you know, people are already using HeyGen for all sorts of different application areas in terms of, you know, marketing and sales, and in some cases, like, internal webinars or learning or other things. I'm a little bit curious, like, is the eventual form of
this, you know, everybody has somebody who steps in for them for their zooms or is it used
for entertainment purposes? How do you kind of view the evolution of this sort of technology
over time? Yeah, I'll say there are many possibilities out there. I think the problem we are tackling so far is the entry point of content creation, where all the content starts with the camera. And then we would have people doing a lot of editing after that. You know, we can clearly see a path where people can already assemble all this generative footage and apply AI editing to assemble the final video. And again, you know, if we push the technology forward, making the performance much better, I think we will be able to create experiences like generative video in a streaming way. And that actually will potentially replace a lot of the, you know, real-time conversations we have today, especially with GPT-4o and with all this multimodal real-time streaming technology altogether.
Okay. We're still in asynchronous video creation land in 2024. How do people use HeyGen today?
Like, what are the favorite use cases you have?
I would categorize the use cases of HeyGen into three: create, localize, and personalize. And, you know, people can select, you know, a cast from our avatar library, or create their own digital twin, and just, like, select a template or type the script and generate a video.
This works the best for product explainers, how-to videos, learning and development, and some sales-enablement training content.
We can also take an existing video and localize that into more than 175 different languages and dialects. And, you know, in this way, we can help customers really localize their content into local languages. And last but not least, people can also use HeyGen to personalize video messaging at scale. So I think there are many, many very creative use cases on HeyGen today. We are a very horizontal platform.
I would say one of my favorite use cases is probably the recent launch with McDonald's. McDonald's launched a campaign where they allow people to send a message to a family member in different languages.
I love him so much. I would run out of words to express my love for him.
You know, I just want to call it out, you know, AI is for everyone, grandma and grandchildren alike.
Yeah, that's really cool. I mean, like, that's a big brand in a public, like, consumer-facing use case.
How do you think about, like, the quality of HeyGen today? Like, you know, I would have thought of that as, like, sort of the top of the pyramid in terms of quality. And, you know, how can you tell when the avatars are good enough and not?
Yeah. So I would say quality has always been, you know, the number one priority of the product and business and technology.
I would say, you know, I always have a framework like this: there's an invisible line of quality, you know, let's say that threshold is at 90. Anything below 90 is essentially unusable for the customers, because we cannot really replace the real-life production process they have. We really need to focus on making the video generation quality go above that threshold. And I think, especially for avatars today, it is above that. So we can really help people replace the real camera and unleash a lot of the creative process that helps people scale content production there. And, you know, obviously, you know, there's much more room to improve, for example, generating the full-body avatar, being able to bring all sorts of other elements into a video.
Yeah, we're in the process of that.
What are you most excited about in terms of, like, what's next or new releases you guys have coming?
I think there are many very exciting things going on in our technology and product roadmap. I think particularly I'm very excited for the full-body generation of the avatar. Historically, all the avatar technology has been focused on the upper body. It's really hard to generate the gestures and the body motion, but a lot of academic research has proven that this is very possible now, and we just need to basically take that through the last mile. And another thing I would say I'm very excited about is the streaming avatar, especially with the latest release of GPT-4o, which really helps to improve the performance of real-time interaction with text and voice, and the HeyGen avatar could become a video-ization layer for all those applications.
Obviously, you need, like, full gesture control and movement to get to any video of any kind.
But what do customers want to do in terms of full-body motion today?
Like, you had a demo of walking in, you know, the last couple months.
The way we look at it is that there's a spectrum of quality requirements laid out across different use cases, right? Let's start from the left side of the spectrum. It's the learning and development content, educational content. It's more like one-to-many broadcasting, talking about educational training content. The quality requirement there is lower, because the avatar can be more still, more professional.
But on the right side of the spectrum, we call it, like, the high-end marketing content, really dynamic; you know, one example would be the ads creative. People ship very, very dynamic content in ads, because that can really help to improve the ROI of the content, making it more engaging. I think enabling that full-body rendering will be able to help us bring the avatar, bring the video, to the next level of engaging and authentic. And that will help to unlock a lot of use cases in the broader space of marketing and sales.
Newscasts or other things. To your point, they often have the shot of people walking and talking as, like, a standard canned shot. And there are, like, these standard things that they use that, if you had full body, you could provide for all sorts of application areas. I guess
related to that, what is the technology that you folks are using today? You mentioned some things
like GPT-4o, but you've also built your own models. So, like, how do you think about the technology stack that you're using? And how does that have to evolve in order to be able to do full body or other new things?
There are three models, right? Text, voice,
and video.
So we work with OpenAI, ChatGPT, on the text generation side. Obviously, it also serves as, like, the brain of the orchestration engine that we built internally. And we work with, you know, OpenAI and ElevenLabs on the voice engine, but we build the entire video stack in-house, including avatar creation, video rendering, and B-roll generation.
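As a rough illustration of the text, voice, and video pipeline described here, the sketch below shows how an orchestration engine might chain the three models together. All function names and provider calls are hypothetical stand-ins, not HeyGen's actual internal API.

```python
# Hypothetical sketch of a three-model pipeline: an LLM drafts the script,
# a TTS engine voices it, and an in-house video stack renders the avatar.
from dataclasses import dataclass


@dataclass
class VideoJob:
    topic: str
    avatar_id: str
    voice_id: str


def generate_script(topic: str) -> str:
    """Stand-in for the LLM 'brain' (e.g. a ChatGPT call) that drafts the script."""
    return f"Hi, here is a short explainer about {topic}."


def synthesize_voice(script: str, voice_id: str) -> bytes:
    """Stand-in for the voice engine (TTS) returning audio bytes."""
    return script.encode("utf-8")  # placeholder audio


def render_avatar_video(audio: bytes, avatar_id: str) -> str:
    """Stand-in for the in-house video stack: lip-syncs the avatar to the audio."""
    return f"/renders/{avatar_id}.mp4"  # placeholder output path


def orchestrate(job: VideoJob) -> str:
    # The orchestration engine chains the three models together.
    script = generate_script(job.topic)
    audio = synthesize_voice(script, job.voice_id)
    return render_avatar_video(audio, job.avatar_id)


print(orchestrate(VideoJob("our new product", "avatar_01", "voice_01")))
```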
So I think, over time, the whole technology trend has been moving towards a direction where a lot of these things will be chained together: multimodal models, where multimedia all gets into one single model.
One of the challenges I want to call out for full-body generation is actually how you connect the voice together with the gesture motion. And that's actually something that will be unlocked by getting the voice model and the video model trained together, so that it can sort of build a connection in the underlying model as well. And that has been historically really, really hard, because we have to, like, train the TTS model on one hand and then feed that TTS model's output into a video model. And it's pretty hard to build that connection. But with multimodal model training, that's very possible.
Obviously, Sora is not available to developers and end users today, but there are, like, world-class text-to-video generation models that are generic, not avatars. How does this technology differ from something like Sora?
When we initially started HeyGen, we wanted to help businesses solve the video creation problem. What does a business look for? They're looking for quality, they're looking for control, they're looking for consistency. So then we try to look at: okay, this is the North Star, how can we get there? What's the technical path to get us there? There are essentially two potential paths. One is text-to-video, the Sora approach, where we try to generate the entire thing end to end, and so you get the entire video at once.
And the other approach is what we believe in at HeyGen: we try to disassemble the whole video into different components. Largely it will be A-roll and B-roll. B-roll represents all the different kinds of elements, like voiceover, music, transitions, with A-roll being the avatar. And we try to tackle these components one by one, and then we build an orchestration engine around that to assemble the final video together. We felt that this technical path is more capable of delivering the quality, you know, the control and the consistency that the brand is looking for. Because, for example, there's some stuff we probably should not try to generate, like the logo and the fonts; those need to be very accurate. Not to mention that we also need to be able to learn, especially in the business context, about the brand style, the color mapping, essentially, from a customer. And I think the second approach gives us more flexibility and capability to build that system around it. And in fact, we actually see Sora as our partner, because we are able to integrate that as one of the component generators and then feed that into our orchestration engine for the business application.
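To make the component-based approach concrete, here is a hedged sketch of a timeline that mixes generated A-roll and B-roll with exact, non-generated brand assets like logos. The data structure is illustrative only and not taken from HeyGen.

```python
# Illustrative sketch: assemble a video from generated A-roll/B-roll plus
# exact brand assets (logo, fonts) that are deliberately NOT generated.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Component:
    kind: str        # "a_roll", "b_roll", or "brand_asset"
    source: str      # generator name or literal file path
    generated: bool  # brand assets stay exact, so generated=False


@dataclass
class Timeline:
    components: List[Component] = field(default_factory=list)

    def add(self, component: Component) -> None:
        self.components.append(component)


timeline = Timeline()
timeline.add(Component("a_roll", "avatar_generator", generated=True))
timeline.add(Component("b_roll", "text_to_video_generator", generated=True))  # could be any partner model
timeline.add(Component("brand_asset", "assets/logo.png", generated=False))    # kept pixel-exact

# An orchestration engine would composite these into the final video;
# only the generated pieces come from models, the rest is used as-is.
for c in timeline.components:
    print(c.kind, "from", c.source, "(generated)" if c.generated else "(exact)")
```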
How do you think about what research, you know, if you just focus on, like, components of the experience, in particular the video stack being the thing that you really want to own and be state of the art in at HeyGen. How do you approach, like, new capabilities from a research perspective? Is it, you know, look at what's available in academia, look at the problems customers give you, sort of de novo?
I would say it's a combination.
I would add one more thing: we deeply understand the limitations around the models and try to find the connection between what the customer is looking for and what the technology is capable of. Like, when we really look at it, all AI models have some sort of limitation. And I think the key question, in order to deliver a great product experience for the customer, is how do we design the product around it so that we can try to avoid the limitations of the model but help to amplify the strengths of the model. And that's something that's really important for finding the new areas that unlock new creation experiences.
One example would be when we look at, you know, the video translation technology. It's a whole new way to translate content compared to traditional dubbing: we preserve the user's natural voice and their facial expression. But if you look at the underlying model, what enables that video rendering is actually a lip-sync model, right? We kind of figured out a way to combine all of this together with the voice, as well as the translation with ChatGPT, and build a great experience around it. And so we are creating a whole new experience for localizing video and content.
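As a rough illustration of that translation flow (transcribe the original, translate it, re-voice it in the speaker's cloned voice, then lip-sync the footage), here is a minimal sketch; every function name is a hypothetical stand-in rather than a real API.

```python
# Hypothetical sketch of a video-translation pipeline:
# transcribe -> translate -> re-voice in the speaker's cloned voice -> lip-sync.

def transcribe(video_path: str) -> str:
    """Stand-in for speech-to-text on the original footage."""
    return "Hello, welcome to our product tour."


def translate(text: str, target_language: str) -> str:
    """Stand-in for an LLM translation call."""
    return f"[{target_language}] {text}"


def clone_voice_tts(text: str, speaker_sample: str) -> bytes:
    """Stand-in for TTS in the original speaker's cloned voice."""
    return text.encode("utf-8")


def lip_sync(video_path: str, new_audio: bytes) -> str:
    """Stand-in for the lip-sync model that re-renders the speaker's mouth."""
    return video_path.replace(".mp4", "_translated.mp4")


def localize(video_path: str, target_language: str) -> str:
    script = transcribe(video_path)
    translated = translate(script, target_language)
    audio = clone_voice_tts(translated, speaker_sample=video_path)
    return lip_sync(video_path, audio)


print(localize("demo.mp4", "es"))
```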
So there are lots of, like, great, exciting commercial applications like the McDonald's one. I think a lot of people also think deepfakes are really scary. Like, the ability to, you know, abuse somebody's, you know, likeness or voice is scary. How do you think about safety, election safety, abuse?
First of all, we do not allow any political or election content on our platform today.
HeyGen's policy strictly prohibits the creation of, you know, unauthorized content, and we take abuse of the platform seriously. So we have our safety, you know, security safeguards, including very advanced user verification, including, you know, live video consent, a dynamic verbal passcode, and rapid human review behind all the avatars being created on the platform. Trust and safety is critical to our business, and we are actively partnering across the industry, you know, to continue developing the tools and best practices to combat misinformation and improve AI safety. And we actually build safety in as part of the design. If you look at, you know, the avatar creation process on HeyGen, we bake all these safety concerns and safeguards into every single step of the creation process as well.
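To illustrate how such layered checks might gate a creation request (consent, a dynamic verbal passcode, blocked categories, and escalation to human review), here is a hedged sketch; the fields and rules are hypothetical and not HeyGen's actual implementation.

```python
# Illustrative safety gate: every avatar-creation request must pass
# consent verification and a dynamic verbal passcode, and blocked
# categories are rejected outright; allowed requests may still be
# sampled for human review downstream.
from dataclasses import dataclass


@dataclass
class CreationRequest:
    user_id: str
    has_live_video_consent: bool
    passcode_spoken: str
    passcode_expected: str
    content_category: str  # e.g. "marketing", "political"


def is_allowed(req: CreationRequest) -> bool:
    if req.content_category == "political":
        return False  # political/election content is blocked outright
    if not req.has_live_video_consent:
        return False  # the person being cloned must consent on camera
    if req.passcode_spoken != req.passcode_expected:
        return False  # dynamic verbal passcode guards against replayed footage
    return True


req = CreationRequest("u123", True, "blue-falcon-42", "blue-falcon-42", "marketing")
print("allowed" if is_allowed(req) else "blocked")
```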
It makes a lot of sense. I guess it's kind of interesting if you think about the positive version of this, and you talked about how you try to protect against the negative. The positive version is, you know, you're running for office and you should be able to send a personalized message to each voter, literally into their inbox, with a short video clip of you talking to them specifically, talking to issues that they specifically care about, or things like that. And so you could imagine using this technology in the future for actually hyper-personalized political campaigning. And as long as you can avoid some of the deepfake side of it, then obviously it could actually be quite valuable.
How do you think this ability to really generate large-scale, differentiated, personalized,
et cetera, content of individuals talking?
How does this kind of generation change how people make or use video in general?
You know, if people can generate very engaging and authentic video content, they will basically create more videos and use video more for their business, to grow their business. And we live in a video-first world where every business wants to create more videos. I think the bottleneck today in the industry is that video is just very expensive to make, and it takes, like, weeks or months to make a video. I think it will fundamentally change a lot of the ways people think about how to grow the business, how to do communication, how to do marketing and sales. So I do think there's a huge possibility that we can create and generate a very high degree of personalized video, especially with the full-body avatar being able to deliver very dynamic and high-quality content out there.
So I want to give you one example. I think a lot of AI generation is not only about, obviously, you know, cost-saving and time-saving; that's one aspect of the value of the product. But we are actually seeing a lot of customers using it to unlock new use cases and being able to do something they were not able to do before. I think that's the key point for a lot of business outcomes today.
How do you think about it in the context of
real-time versus asynchronous? It feels like a lot of these technologies are focused right now in
asynchronous use cases. And that's true as well of just pure text-to-speech models. When do you think
we move to any sort of real-time or close to real-time video avatars and sort of the uses of that?
I look at it in two ways.
One is the real-time application of the avatar. Even now it's possible; I think people can already experience that on HeyGen. We are making a new update that can make it even faster. So it can potentially become, let's say, the virtual, you know, AI SDR or virtual support that helps to take customer calls or provide support, right? And, you know, I think the technology has always been developing along this chain. Two years from now, it would not be crazy to see a lot of asynchronous avatar generation pipelines become real-time streaming capable. And I also see the world moving towards a way where we can probably generate the entire video in real time as well in the future, let's say five years from now.
I have an opinion, like, you know, a generative image is still an image, but generative video is not a video. It is a new format. What I mean by that is, you know, when we really look at a video, we look at it as an MP4 file. So it is immutable. Like, for example, if you and I are on Instagram, we probably get recommended two different ads. But as long as we are recommended one from the same business, we are looking at the same MP4 file. But it does not need to be the same. Let's say maybe I like avocado; I should be watching an ad with Coca-Cola and avocado, showing, you know, the new story about Coca-Cola to me. And if you like something else, you could be looking at something else. And this is not possible today because making a video is expensive. But this could be very possible. Let's say we can actually, you know, generate in real time the video ads that you like according to your user attributes; that will potentially become a new format. You know, when we really look at today's video player, it corresponds to only one MP4 file. And it doesn't need to be like that. The video player can actually take in a lot of, you know, user attributes and generate something in real time to match the best way to deliver the content to customers.
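As a rough sketch of that "video as a new format" idea, the example below maps viewer attributes to a generated (or cached) variant instead of serving one fixed MP4. The attribute names and the rendering function are hypothetical.

```python
# Hypothetical sketch: one campaign, many generated variants, chosen per viewer.
from typing import Dict


def render_ad_variant(campaign: str, attributes: Dict[str, str]) -> str:
    """Stand-in for generating (or fetching a cached) personalized video."""
    flavor = attributes.get("favorite_food", "default")
    language = attributes.get("language", "en")
    # In a real system this would trigger avatar + B-roll generation;
    # here we just return a descriptive variant id.
    return f"{campaign}--{flavor}--{language}.mp4"


viewer_a = {"favorite_food": "avocado", "language": "en"}
viewer_b = {"favorite_food": "pizza", "language": "es"}

print(render_ad_variant("coca-cola-story", viewer_a))  # coca-cola-story--avocado--en.mp4
print(render_ad_variant("coca-cola-story", viewer_b))  # coca-cola-story--pizza--es.mp4
```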
Yeah.
Yeah, I think, you know, one interesting analogy would just be if you think about, you know, YouTube as one of the largest learning devices in the world today. Like, it is static, immutable video for everyone, but it's pretty clear from the Bloom studies and everything else that, like, personalized education is going to be the path that is more effective. And people want to learn by video, but it's very hard; it's too expensive to make that video personalized. This feels like, you know, an opportunity for a very different educational future too.
Yeah, and one of the use cases we have seen from
customers is that, you know, Publicis Groupe generated more than a hundred thousand videos, a thank-you video to send to all their employees globally, localized into different languages and personalized with their name and what they like about, you know, when they joined the company and stuff like that. And historically that would actually only be delivered with one video, right? So maybe the CEO or the executive team hops in front of a camera and records something, you know, saying thank you for, you know, 2023. But now that message and communication can really be personalized at a very big scale.
So one thing you mentioned is the various aspects of research
that you're doing in terms of building your own video models as well as using third-party APIs.
What's been difficult or hard from a research perspective?
Unlike a lot of other models, I think for building video models, you know, being able to integrate aesthetics into the AI model is pretty hard. So, you know, video generation is not only about solving a mathematical problem. It's actually about creating something the customer loves and appreciates. So essentially, a model with a lower optimized cost function doesn't mean it actually produces a better visual outcome. So I guess that is the piece that makes it really hard to evaluate, but also really important for delivering the last mile of value for the customer. And, you know, generally, evaluation is also hard. We have to rely on in-product signals, for example A/B tests, to know which model is actually better, because, you know, only the customer can be the judge of that. And this process generally is just not differentiable from a mathematical standpoint. We kind of have to build a system around it and be able to feed those data back into our model training so that we can continuously improve.
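Since the customer-preference signal is not differentiable, model selection ends up looking like an A/B test over in-product metrics rather than a comparison of training losses. Below is a minimal hypothetical sketch of that kind of evaluation loop; the metric and the data are made up for illustration.

```python
# Illustrative sketch: pick the better video model by in-product signal
# (e.g. watch-through rate from an A/B test), not by training loss.
import random
from typing import Dict, List


def watch_through_rate(model_name: str, sessions: List[Dict]) -> float:
    """Fraction of sessions where viewers finished videos made by this model."""
    own = [s for s in sessions if s["model"] == model_name]
    return sum(s["completed"] for s in own) / max(len(own), 1)


# Fake A/B test data; in production these would be logged product events.
sessions = [
    {"model": random.choice(["model_a", "model_b"]), "completed": random.random() > 0.4}
    for _ in range(1000)
]

scores = {m: watch_through_rate(m, sessions) for m in ("model_a", "model_b")}
winner = max(scores, key=scores.get)
print(scores, "->", winner)  # the winner's outputs feed back into training data
```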
Did this approach come to you because of your work at Snapchat, working on consumer products, or is it something that you had to come up with in the context of HeyGen itself?
I would say it's very similar, especially when we worked on the camera software. So how do we know whether this parameter was better or the other one was better? And I think we can definitely come up with some, you know, very objective metrics about, hey, you know, this is the lighting score, this is the resolution. But there are many things where we find out, hey, a better resolution, I mean higher resolution, doesn't mean it's better image quality for customers. If you look at the iPhone, it does not always have the best resolution compared to a lot of other phones, but it does produce the images that most people like, which is why they use the iPhone to capture images. And yeah, there are very similar lessons out there that we learned from in the early days that help with that.
Yeah. What can you say about how big HeyGen is today?
We are a little bit over 40 people,
but we are serving over 40,000 paying customers on the platform today. And I think what's so
interesting about our customers is that these are not the typical AI early adopters. These are
mainstream companies, from European manufacturers to small businesses, to global non-profits, to Fortune 500 companies, which shows the space of the problem we are solving.
Given the 1,000 customers per employee, which is an incredibly impressive metric, are there specific key roles that you're
hiring for or other things that maybe members of our audience may want to apply for?
Sure. Yeah. We're hiring across different teams, basically, product, design, engineering,
AI research, and go to market. Yeah. This has been a great conversation. Thanks, Joshua.
Thanks so much. Yeah. Thank you. Thank you for having me.
Find us on Twitter at NoPriorsPod.
Subscribe to our YouTube channel if you want to see our faces. Follow the show on Apple Podcasts, Spotify, or wherever you listen. That way you get a new episode every week. And sign up for emails or find transcripts for every episode at no-priors.com.