This Week in Startups - The Future of Sound: Udio’s Vision for AI-Generated Music | E2016

Starting point is 00:00:00 Hey, everybody. Welcome back to TWiST. This is Alex, and we have a special interview for you today. I am an enormous fan of music. You may not know it, but I grew up playing classical and jazz trumpet throughout my youth. And music has remained an absolutely huge passion of mine throughout, really, my entire life. So when the AI revolution of the last couple years came to the world of music, I was incredibly curious. Two companies have really caught our eye here on This Week in Startups: Udio and, of course, its competitor, Suno. Today we have David Ding, the co-founder and CEO of Udio, on the show to tell us what it's for, who's paying for it, and where AI-based music creation is going. This Week in Startups is brought to you by dot-tech domains.

Starting point is 00:00:44 Don't miss our Jam With J-Cal contest. To apply and get more details, go to JamWithJCal.com. Brought to you by dot-tech domains. LinkedIn Ads. To redeem a $100 LinkedIn ad credit and launch your first campaign, go to LinkedIn.com/thisweekinstartups. And Brave. If you're building AI and search-based applications,

Starting point is 00:01:08 train your models with the Brave Search API. Get started for free at brave.com/Jason. We're going to talk about AI and music. David, hi. How are you? And welcome to the show. Hi. Hello. I'm really excited to be here. So I want to start with some background stuff because I know you were at DeepMind for a while

Starting point is 00:01:26 and Udio is a relatively young company. I think it was founded in 2023. So just give us, what was the moment in time in which you said, I have to leave where I am and go found this company because why? Yeah, sure. So, yeah, as you said, Udio was founded last year, November of 2023. And before that, I was a researcher at DeepMind. And throughout my entire childhood, I've always been interested in two things primarily.

Starting point is 00:01:54 So one is technology, and the other is music. So as a kid, I always wanted to build computers that can simulate the way a human brain works. You know, like maybe you can wire the neurons together and then try to model the brain. And then it turns out that when we went to college, like, this thing was starting to pick up traction. My first year of college, I took a machine learning course so that I can like participate in this field. And my other passion growing up was music. So I played classical piano and played it for at least like 10 years, like growing up before going to college. And I always thought that would be really, really cool if a computer technology could compose and make music.

Starting point is 00:02:43 And so then fast forward to when I was working at DeepMind, generative modeling really took off. You see technologies like ChatGPT or DALL-E or Midjourney emerge that really revolutionize the way that computers can make art. And so at that point in time, I was like, hmm, what happens if we apply the same technology that I've been learning how to build and apply it to music, to have a machine that can help people create music, ideate, and create songs? And so this is why we left DeepMind to create a company that produces a product

Starting point is 00:03:20 to help artists and songwriters turn their ideas into reality. So I want to go back to the point about LLMs to image generation to music generation because, I mean, my day job is writing. So that's kind of what I know the best. And so to me, the idea of a large language model taking in a lot of data and then helping kind of do next word prediction, admittedly, it's more complicated than that. But I can really understand it.

Starting point is 00:03:46 I kind of get how we can use LLMs to do image generation. But when we expand the work done to music, to me, I feel like I'm missing a link in how the technology actually functions. So without spilling any secret sauce, if you will, how do the AI models that I best understand end up creating tunes? Because it just seems to be like a real stretch of what was possible, but clearly it works. Yeah, so similarly to how these large models learn how to produce images and text,

Starting point is 00:04:20 our models, they learn how to produce music by listening to lots of examples of music. So the model listens to music and tries to synthesize the common elements across music. So like elements of music theory, like which chords follow which other chords or how rhythm interacts with the overall structure of the song. As well as other elements: what does it mean to be country music versus rock music, or how does a guitar string vibrate and how does the sound of a piano echo and reverberate around the room? And finally, how does this all interact with the recording technology? How do you take this sound and turn it into stereo?

Starting point is 00:05:06 And so this model, because it's trained on the final output music, it learns how to do everything. So from like the very fundamental music theory level, all the way to how the sound is really recorded by the microphone. Okay, so prepping for our chat today, I was playing around with Udio. By the way, I am now your most recent paying customer. Shout out.

Starting point is 00:05:30 And I decided to throw a curveball at your software. And I said, okay, look, I want to do a progressive metal song that sounds a little bit like Periphery, a band that I love. But I'm like, look, let's do it in 6-8 time. Now, you're a classically trained pianist. I'm a classically trained trumpet player. You and I know that when it comes to time signatures in the world of music, 6-8 is not very complicated, right?

Starting point is 00:05:52 We're not doing like 11-8 or something crazy. You count in sixes and then five. So this is pretty simple. And it kind of did it, but not perfectly. And I know this technology is still improving, so I'm not trying to be negative. But is this going in a direction in which I could tell a service like Udio, like, I want to do a song, first half in 6-8,

Starting point is 00:06:12 second half in 7-8, and I want to do a chord change from C to C major, and then, like, how specific can we get? And then is that underpinned by a very granular understanding of how music was put together, or does the software better understand, like, broader chunks of it versus, like, down to the individual note level? Yeah, so this is definitely a direction that we do want to support, giving users and musicians more ways of controlling the model, like time signature, key, tempo, BPM, or instrumentation, or even like dynamic levels, like start quiet, start swelling, and then like, and then die down again.

Starting point is 00:06:55 So that's something that we definitely want to support. Time signature is something that we do not support at the moment because, you know, when we were training our models, we did not teach them the concept of a time signature when we were annotating data. But key signature, the key of the song, is something that we do support. And this is something that we did not support when we launched the model, version 1, back in April, but it's something that we added in July, because we recognized that users want to be able to control the key. And so then we annotated our dataset to contain, oh, this is B major, this is C-sharp minor.

Starting point is 00:07:34 So that now when you go to Udio and you specify a key, A minor, it will produce a song in that key. Well, A minor is now the most famous key in, I think, all of music, thanks to Mr. Kendrick Lamar. If you don't get that reference, congrats for being offline for the last three months of music history. Wow, this Jam with J-Cal contest has been a blast. So far, I've had the opportunity to meet with four great founders from companies like Corpod, Ulama, Uptrans AI, and the ROMAP, all because they all use dot-tech domains. And we have room for one more.

Starting point is 00:08:08 Do you want to come on the pod and tell me what you're building? Well, you only need two things to enter. You've got to be a founder with under $2 million in funding, and you've got to have one of those awesome dot-tech domains. So head to JamWithJCal.com and tell me what you're building. And if you win, I will invite you onto This Week in Startups and you'll get to share your vision with me and the world. I'm working with dot-tech domains because killer startups use them. You know, 1x.tech, rabbit.tech, so many others. And guess what? We use it too. That's right. Dot-tech powers our Founder Friday

Starting point is 00:08:37 program. So tell me about your awesome dot-tech domain and startup. Apply for the Jam with J-Cal contest today at JamWithJCal.com. We're picking the final winner soon. Okay, so it sounds like what I did there was I asked Udio to do something that it doesn't do quite yet, which is probably why it got a little bit funky. But you said something interesting there, which is data annotation. And that I think is the thing that I was missing, because it sounds like you guys have humans label the data and, like, help the model understand, like, this is a rock drumbeat in 4-4.

Starting point is 00:09:09 So does that create like a flag that then the software or the model can kind of go back to and like point to and understand? Yes, so by annotating the data in a training data set, you teach the model how to associate certain descriptive words with musical elements. So then it sees 3-4, the time signature, and it hears a song that's in 3-4. And it knows, oh, 3-4, it means like you have like three beats. And then like the first beat is emphasized. When a user then asks the model to create 3-4 music, it can then like,

Starting point is 00:09:44 take its understanding of 3-4 and apply it to the composition of the song. It's kind of like when a human learns: if you never teach a human, oh, this is 3-4, you can't ask the human, hey, create me 3-4 music. Even if they can produce 3-4 music, they just don't know what the words 3-4 actually mean. So it sounds like the data annotation then provides almost like a connective layer between music and the user's request

Starting point is 00:10:12 and kind of helps natural language input translate to something the computer can understand as a command prompt, essentially. Yes, exactly. And we aim to improve our model by giving it more annotations to understand more elements of music so that the model can produce these elements upon command. Okay, so I want to go back in time, though,

Starting point is 00:10:34 because I've been playing with Udio since, and this is a true story, one of my friends started sending us funny songs he made for us in the group chat. And they were, they were whimsical things like, Alex doesn't want to go to work tomorrow and like stuff like that. And I was like, okay,

Starting point is 00:10:49 where is this coming from? And it was from you guys. And so I got a kind of early look at the software and I've made different songs and I've gotten to play with the new model some. But back in the beginning day, when you were first getting like the point one version of this out, one, how good or bad was it?

Starting point is 00:11:06 And how, how easy was it to get from like proof of concept, if you will, to something you were confident that people might want to actually use. Yeah, so funny that you mentioned, like the first version of our model, like the baby, the very, very baby version, when we were still debugging our overall code base and training structure. We spent a couple weeks trying to figure out why our model couldn't produce any lyrics.

Starting point is 00:11:32 You provide lyrics, and the model just refuses to sing the lyrics. And then we spent a while looking at the model, like analyzing different loss curves. And then eventually we found the reason to be quite simple: when we were feeding the data set to the model, there was some kind of bug that caused the lyrics to not appear. And so the model never saw the lyrics, and so therefore it couldn't possibly know how to turn the lyrics into a song.
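The bug described here, lyrics silently missing from the training inputs, is the kind of problem a simple data check can surface before training starts. The following is only a minimal sketch: the record layout (`audio_path`, `lyrics`), the function names, and the threshold are assumptions made for illustration, since the conversation does not describe Udio's actual pipeline.

```python
from dataclasses import dataclass


@dataclass
class TrainingExample:
    """Hypothetical training record; the field names are illustrative assumptions."""
    audio_path: str
    lyrics: str


def fraction_missing_lyrics(examples: list[TrainingExample]) -> float:
    """Return the share of examples whose lyrics are empty or whitespace only."""
    if not examples:
        return 0.0
    missing = sum(1 for ex in examples if not ex.lyrics.strip())
    return missing / len(examples)


def check_lyrics_present(examples: list[TrainingExample], max_missing: float = 0.5) -> float:
    """Raise ValueError if more than `max_missing` of the examples carry no lyrics."""
    ratio = fraction_missing_lyrics(examples)
    if ratio > max_missing:
        raise ValueError(f"{ratio:.0%} of examples have no lyrics; check the data loader")
    return ratio
```

Running a check like `check_lyrics_present(batch)` on a sample of what the data loader actually emits (rather than on the raw files) is the point: in the story above, the lyrics existed but never reached the model.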

Starting point is 00:12:05 So essentially, it couldn't run the engine of lyrics because there were no words going in. Exactly, yeah. And so this really goes to show how the process is quite dependent on the data. You have to pay attention to detail, and it's all about the input data. And so we fixed the bug, and then after we fixed the bug, the model actually just kind of took off. Every week we saw improvements. The first week, it probably knew the broad genres like rock versus jazz.

Starting point is 00:12:35 As model training progressed, it started learning more specific keywords like energetic, what's hard rock, what is like smooth jazz, and also the sound quality improved, starting from something that sounds like very noisy to something that's much more refined and more like what you get from a studio. Yeah, no, the actual fidelity is pretty good in my experience. And one thing, as a fan of heavier music in general, there are certain heavy metal subgenres that depend a lot on orchestral additions that are mostly programmed. And so I'm familiar with like the current state of the art for studio music with added digital elements, if you will.

Starting point is 00:13:17 And we're not that far off from this just with Udio's own creation software. So that's very exciting. But it sounds like from the point of inception, from the initial "it works," to going to market, to the 1.5 release in July, is pretty quick. And that chart's going up in terms of model quality and fidelity and so forth. Do you think that trajectory continues for a long time, or were there early wins, David, that let you improve faster than you might be able to now and in the future? So obviously there is a point where you start from zero and you get something.

Starting point is 00:13:53 And so that's the biggest delta. And as you observed, our audio quality is actually pretty good, although there are still areas for improvement which we are working towards. But the big focus going forward is additional controls for users, so giving people more ways of controlling the music. Like maybe you want to provide

Starting point is 00:14:13 like a guide, like you have this melodic line already and you want the model to follow this melodic line and add musical elements to it. Or maybe you have this like musical style but you don't really know how to describe it using words. So how would you take this musical style,

Starting point is 00:14:30 synthesize it and feed it as an example for the model to follow? And so we want to enable these additional controls because we recognize that in music creation, the creator wants to have a lot of control over the music because that is their own creation, right? And so that's the area that we really want to focus on going forward. Okay, I want to do some demos in a little bit

Starting point is 00:14:54 to show people what we're talking about because you and I have used this, of course, a lot and they might not have. But one thing that I was thinking about is who this is for, because I am an enormous music fan. So to me, music is part of my day from kind of when I get up to when I go to bed. I'm either listening to audiobooks or music, right? And so to me, it's very personal, very important, and I know music theory, and I love it, and it's key to me. Not everyone's like that.

Starting point is 00:15:21 And so people have different music tastes, different consumption habits. And so I don't know, is Udio aimed at folks that want to create stuff for their own consumption? Is it more of a rough draft machine for artists as they explore new ideas? Is it a way to generate Muzak for elevators? So I guess kind of like, who do you think this is for now and in the future? So we think that Udio is for people who love music, people like yourself. And also like artists and songwriters who obviously love music as well. We just want to create a tool to make music creation a lot

Starting point is 00:15:59 easier than before, kind of like other tools that came before in the past. Like DAWs, sampling, drum machines, these are all innovations that took something that was a little bit harder before and, with the aid of new technology, made this creation process easier, so that more people can participate in the creation process, and so that existing artists can leverage this to try out ideas at a faster pace and come out with music that incorporates these elements in creative ways that maybe even the creator of the technology never had in mind.

Starting point is 00:16:36 I think one good example for this is Auto-Tune. When Auto-Tune came out, a lot of people, they had qualms about using it. It's like, oh, it's like cheapening the experience. That's a polite way of saying it. Yes, but sorry, keep going. Yeah, like people were like, oh, this thing is like cheapening the experience. Like it allows people who can't sing to sing, and that's a bad thing.

Starting point is 00:16:57 But then, like, you know, like, what really happened was, like, you know, it, like, really transformed the industry. Like, people were using it. And then people found ways of using it very, very creatively, like, you know, like bumping it up beyond, like, the spectrum and embracing the Auto-Tune sound as a musical style, right? And so we think that with these technologies, it makes music creation easier, and people will find, like, creative ways of using it. Okay. So it sounds like for someone like me, a big music fan, I could use it to create fun things for myself to listen to. If I was a musician, I could use it to expand ideas and give me new ideas.

Starting point is 00:17:36 But this doesn't replace, you know, I don't know, my spouse's Spotify account at some point in time. This is more like distinct acts of creation in the future versus passive consumption. Exactly. So, I mean, you said that you play the trumpet, right? Like your trumpet doesn't replace listening to music, like great trumpet players of the past on Spotify, right? Because you enjoy listening to music that other people create, but you also want the joy of creating music yourself.

Starting point is 00:18:08 Yeah, no, I think that's right. And what I like about the idea behind taking modern AI techniques and applying them to music is it just allows a lot more people to do stuff. You know, David, like five years ago, people talked a lot about low code and no code. And there was this big chat about the democratization of software development. And that's kind of worked out. But I love the idea of more power to more people.

Starting point is 00:18:33 And this to me seems to fit into that. Now, on the critical side, though, some musicians are worried that they're going to be replaced whole cloth or diminished in some way. I want to run my theory past you, which is that I don't think that's going to happen because the musicians that I love to listen to have their own very specific, sometimes experimental style that probably couldn't be replicated by even very intelligent models. So to me, this exists, if you will, side by side with kind of how music is made today. I'm curious if that's your view as well. Yeah, so we, so that's definitely my view as well.

Starting point is 00:19:11 We, we, so I believe that people will continue making music the way they've always made music. And this is simply another tool in a toolkit that they can choose to use, or they don't have to use it. But then it's just, um, something additional, right? Just because the electric guitar got invented, it doesn't mean that the acoustic guitar got completely replaced, right? It just means that there's yet another instrument that you can add to your band. Yeah, actually, I remember in my high school jazz band,

Starting point is 00:19:40 we had a song, I think it was an old Buddy Rich tune, and it had a little bit of guitar by itself, and our guitar player played electric, and one time he forgot to turn his guitar up, so we got to that part of the song, and he played, and no sound came out, and our director was like, well, you know, he was a trumpet player, and he was making fun of the electric guitar

Starting point is 00:19:57 for needing, you know, help essentially, and I was like, I don't know, that seems a little bit old-fashioned, but this probably fits somewhere in there. I do want to ask a quick question, though, about where Udio will come up, because you mentioned DAWs, or digital audio workstations, earlier, very much now a well-known entity in the musical world.

Starting point is 00:20:17 Does Udio ever become part of one of those, a plugin, an API that I can call? Does it leave the website and end up somewhere else? Quite possibly. We think a lot of our power users, they use Udio to come up with ideas, and then they download the individual stems, which is a feature that we support. So people can download stems and then load the stems up in their DAW for further post-processing. Okay, so essentially they take the rough draft, bring the stems over, and then you can do more, okay, all right, that's pretty cool. Is it hard to do individual stems? Because that implies that

Starting point is 00:20:56 the model is making a collection of tracks that are then mixed together. Is that how it's always been, or is that a new change to how the underlying model works? The underlying model always produces the fully mixed track.

Starting point is 00:21:12 But then recently with version 1.5, we added the ability for users to download the individual stems, which are separated post hoc from the mixture. Oh, post hoc. Interesting. Yeah. Oh, okay.

Starting point is 00:21:26 So you create something that's mixed and then you isolate. I would have thought it was the other way around, but that's why we ask questions. Yeah. Okay. So before we talk about money, stems are the individual tracks inside of a song, for example, bass or guitar, piano or whatever. I just want to make sure that everyone listening understands stems. David, is that how you would define them as well? Yep.

Starting point is 00:21:49 Okay, cool. So if you don't know what stems are, now you do. Okay, there are more than 50,000 venture-backed startups in the United States alone. This means marketing has to be perfectly targeted. You've got a lot of competition out there. Or you're just going to fade into the background and your money will go with it. All your ad spend will be for naught. You've got to make sure you target the right prospects. So how are you going to do that? Especially in a business-to-business context. Well, the answer is obviously LinkedIn Ads, where you can precisely reach the professionals who are likely to find your ad relevant. Just think about

Starting point is 00:22:21 it. Wouldn't it be great to target your ads by the job title or the industry, the location of that company, you know, and maybe even a very specific company. Maybe you've got a list of 20 lighthouse customers that you want to bear hug, that you want to know about your product or services. LinkedIn Ads is going to help you do that by building a relationship and driving results. LinkedIn is the environment where people are receptive to business. They're not there for food or politics or entertainment or music. They're there to do business. A billion members. A hundred thirty million of them are decision makers and ten million of them are C-level executives. So start converting your B2B audience into high-quality leads today. Get $100 from your boy, J-Cal,

Starting point is 00:23:06 LinkedIn.com/thisweekinstartups to claim that credit. Again, LinkedIn.com/thisweekinstartups, no spaces, no dashes, terms and conditions. Why? Because it's giving you a hundee. Okay, so Udio raised $10 million. That was earlier this year. Andreessen Horowitz was in there, plus a number of artists, including the producer Tay Keith, and I love to see venture capital funds. Glad you guys raised some money. But my thought is this. I currently pay you $10 a month to use something like 1,200 song creation credits. I look at that and I know how much AI costs to run. People talk a lot about that. I feel like I'm burning through your bank account. So is it as expensive as I imagine it is to run the model to create music? Because it sounds very

Starting point is 00:23:55 compute intensive. Yeah, obviously it's a balance for us. We want to make sure the price is set at a point where we allow people who are curious about the technology to try it out while being able to make this process sustainable. So we chose a price in a way to basically allow for this, like to be able to sustain usage while not being very expensive. And going down the road, we definitely want to optimize our models, make them more efficient so that they can run at a cheaper cost

Starting point is 00:24:30 because we want to maintain this commitment to users, but we also want to run a sustainable business. Yeah, so we've seen this with just one big example out there, OpenAI's GPT family of models. When GPT-4o came out, it was one cost, and then it's come down, I think, and we've seen that pretty frequently. Does that mean that you guys are able to extract a lot of efficiency from the underlying model

Starting point is 00:24:54 and that this should get much cheaper to run over time, or is there less low-hanging fruit because it's doing music, which just, to me, seems harder than doing text? We think that there is a lot of room for improvement for sure. I wouldn't comment on whether or not it's on the same scale as OpenAI. OpenAI obviously has entire teams of incredibly talented engineers working on this, and we are a much smaller company.

Starting point is 00:25:21 But we do believe that there are similar levels of efficiency gains to be had. Okay, so essentially, yes, Udio is a smaller company. I think OpenAI has over 1,700 people now, but with work, a similar-ish curve. Okay, that's actually a really good question for me to ask. How big is the company today? What's your current staff size? So we currently have about 17 people.

Starting point is 00:25:48 So we've grown quite a bit. How many people did you have before? When we launched the company, when we launched our model back in April, we only had eight people. Eight. Oh, God. Yeah.

Starting point is 00:26:01 And just because this is a startup show, let's do some basics. Remote, hybrid, or in office? Mostly in office, but some working remote. Okay. And just thinking about staffing for the rest of the year, are you going to keep hiring as aggressively as you have?

Starting point is 00:26:18 Or will that slow down now that you've more than doubled in size? We'll probably stay a little bit more steady. Okay. So I know you guys raised from Andreessen, I mean, a venture firm that everyone watching this show knows. And you guys raised 10 milly. Is that enough money, David? Because some of your competitors have raised more.

Starting point is 00:26:37 And we are in the era right now of companies that use AI raising, lots of money, let's say. So I'm just kind of curious why 10 million was the number and also, you know, how soon are you going to be back on the show telling me about your shiny new round? Yeah, so when we started,

Starting point is 00:26:56 we raised 10 million because we want to be disciplined in how we spend the money. We believe that like there's some amount of truth in the idea that scarcity produces innovation. And so we try to be super efficient in the way that we use our capital to develop our models. So tell me about that

Starting point is 00:27:17 because, you know, developing a model, I mean, people talk about how models are eventually going to cost like a billion dollars to put together, but that's for a very general purpose model and so forth. So for you guys, how do you ensure that your capital expenditures on model creation and improvements are cash efficient? Yeah, so one thing that we do is to try to secure the cheapest

Starting point is 00:27:40 compute power that's available, like the chip that's cheapest in terms of floating point operations per second per dollar. And so we ended up choosing Google Cloud's TPUs, which we identified as offering significant

Starting point is 00:27:59 savings over other chips like NVIDIA GPUs. Google's startup cloud program is one of the occasional sponsors of this show. So I just want to point out that no, no one asked him to say that. That was off the cuff, but there you go. So we're not being biased. Just to put it in perspective for me, though, because I don't get to go to those negotiations.

Starting point is 00:28:21 How much cheaper was GCP for Udio compared to competing providers? Was it a lot cheaper or was it more of a marginal differential? Yeah, I'm not sure if I should comment on specific numbers, but it is. Oh, you should. David, you definitely should. You should drop all the numbers you can right now. Yeah, but it's definitely quite a bit cheaper. And so that's

Starting point is 00:28:42 one factor. And the other factor is that we have quite a few really talented modeling research scientists among our co-founders. And because they have a lot of experience training these really big generative models, they know how to

Starting point is 00:29:00 make maximum use of the available hardware, how to create really efficient programs and how to design architectures that can train efficiently. So if you're doing that work, though, because that's nitty-gritty stuff, if you're doing all that already, why not buy your own H100s or equivalent and just run your own mini data center? It seems to me like if compute's going to be such a core element of what makes the digital brain that you use, why not own the neurons themselves? I guess for us, we as a startup, we didn't really want to deal with the logistics of running our

Starting point is 00:29:39 own data center. And we thought it would be simpler to go with a cloud option. Do you see the company in, let's say five years from now, just looking down the road so far that I know we're making kind of almost like a joke here, but like, do you think that you'll still be on a major public cloud provider

Starting point is 00:29:55 in five years? Or do you eventually off-ramp when you have more money and staff and so forth and do your own data crunching? Yeah, it's hard to say. Like, the cost of computation on the cloud has been going down I think there's like some kind of new law, like Huang's law or something,

Starting point is 00:30:14 that supplanted Moore's law about the cost of GPUs over the years. The cost of cloud compute could go down very significantly, and it's very hard to predict five years down the line whether or not it will be more economical to buy your own chips or to lease them from the cloud. Okay, so Huang's law, by the way, this was, of course, a reference to Jensen from Nvidia, I presume? Yeah.

Starting point is 00:30:37 Yeah, okay. So if you know Nvidia, you know the guy. Huang's Law is, and I'm Wikipedia-ing this live, so this is not very lettered of me. But it's a general idea that as Moore's Law predicted that the number of transistors would double about every two years, Huang's Law is that GPUs will more than double their performance every two years. So it's essentially an acceleration or a faster version of Moore's Law for GPUs. That speaks very well for you guys, because that means that your gross margins should improve over time, just naturally as chip companies make better chips. That's kind of cool. That's a tailwind for you as a CEO.
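To make the doubling arithmetic in this exchange concrete, here is a small sketch. It treats "double every two years" as an exact rule, which for Huang's Law as quoted is only a lower bound ("more than double"), and it assumes hardware prices stay constant; both are simplifying assumptions, and the function names are illustrative rather than taken from any source.

```python
def growth_factor(years: float, doubling_period_years: float = 2.0) -> float:
    """Performance multiple after `years` if performance doubles once per period."""
    return 2.0 ** (years / doubling_period_years)


def relative_cost_per_unit(years: float, doubling_period_years: float = 2.0) -> float:
    """Cost of a fixed amount of compute relative to today.

    Assumes the hardware price is unchanged, so cost falls as performance rises.
    """
    return 1.0 / growth_factor(years, doubling_period_years)


if __name__ == "__main__":
    # Four years at one doubling every two years is two doublings.
    print(growth_factor(4))           # 4.0
    print(relative_cost_per_unit(4))  # 0.25
```

Under those assumptions, the same workload would cost a quarter as much to run after four years, which is the gross-margin tailwind the host is describing.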

Starting point is 00:31:20 Yeah, it's definitely something that we're very excited about, like, cheaper compute, making even more powerful technologies possible. All right. Are you building the next great AI product? Well, if you're doing that, you know how expensive all these APIs can be for model training data, obviously, and training AI is very expensive. That's a fact. We all know that. So you have to try Brave's new search API. Yes, I'm talking about Brave, the privacy browser that I use every day and on my mobile phone. Brave's browser has 65 million users. And that drives a lot of data into the Brave search engine, which is the only global-scale independent search index outside of big tech. And that index is available to anyone with the Brave Search API. So you're going to be able to use the Brave Search API to power your chatbot or train models in full-form answers to real-time queries, and serve images, web results, even rich text snippets. The Brave Search API features an easy-to-use, intuitive data structure, so you're going to be able to get things done quickly, and its data is populated by real human interaction, not web crawlers.

Starting point is 00:32:25 That's critical. And it's all done at a fraction of the cost of the major players, free for up to 2,000 queries per month, so you can try it on, play with it, really sort of brainstorming, and then plans start at as little as $3. So here you go. If you're building next-gen AI apps or chatbots, you've got to try the Brave Search API. Get started today at brave.com slash Jason.

Starting point is 00:32:47 On the public cloud front, I'm going to not ask about an individual provider because I don't want you to get in trouble, but I have heard that there is a capacity crunch out there, that there's not enough total GPU-based compute for people that want it. Has Udio hit any issues getting the amount of compute capacity that it needs at any point?

Starting point is 00:33:07 Yeah, I mean, it's always a balance where eventually we'll be able to get the compute, but definitely at times it just takes a while for different cloud providers to be able to find the chips that are available. Okay. Now, I want to go from there to a demo so everyone can see the product that we're talking about from a compute perspective. So, David, we drew straws before, and you're going to drive because you told me that you have some new stuff to show off.

Starting point is 00:33:36 So let's pull up Udio. If you're watching this on YouTube, you can see what we're doing live. If you're listening to this on Spotify or Apple Podcasts, we will narrate as best we can, but we are going to do a little bit of testing around here to show off what we can pull off.

Starting point is 00:33:51 So David, talk to me. What are you showing me? Yeah, so I'm showing you the Udio create page. It's a dedicated, you can think of it like a creation studio, where you have a list of your recent creations on the right-hand side.

Starting point is 00:34:07 And then on the left-hand side, you have a place where you can specify the type of music you want to create, as well as any lyrics that you want to have. So maybe we can start very simple. Let's start with just creating, I don't know, like rock music. So here we can type rock. And then for simplicity's sake,

Starting point is 00:34:30 I'll just create a song about New York. And so this is a feature that we launched just yesterday, actually, where you can ask the language model to write lyrics for you before you submit the song, and you can even give it suggestions on what to do. Like for example, let's say we want to make it a little bit shorter. Nice. Okay, so you can essentially tell it to get more verbose or less verbose.

Starting point is 00:35:02 You can do other things as well. I don't know, like make sure to mention New York. Does it keep the last prompt in mind when you give it another instruction? So is it still thinking keep this short as you add the make sure to mention New York in the update box? Oh, yes. So like we actually have a prompt history that shows all the prompts that have accumulated so far. So we aim for this to improve upon our previous lyrics writing experience by giving people the aid of AI to help them come up with ideas

Starting point is 00:35:40 when they might have writer's block. They don't really know what to do, for example, like me at this current moment. And so now that I have like the genre and like lyrics, I can hit create. And that will queue up this creation. Yeah. And while that goes, I just,

Starting point is 00:35:58 I mean, thinking about this, because whenever I sit down to use Udio or a similar product, I tend to think not in genre terms, but in terms of bands that I love and kind of how they approach the world. And I'm kind of curious when you're using Udio, do you tend to stick more towards like rock, or do you get hyper-specific, like, make me a rock song with a touch of, I don't know,

Starting point is 00:36:23 Tom Petty or something like that, because you can pull in different influences. Yeah, so I would say that I usually just stick with genre information, but for users who have a specific artist in mind, we provide functionality for a user to type in the name of the artist and we look up the style for the artist. So we don't actually put the name of the artist anywhere in the prompt because we don't want to create something that sounds exactly like that artist. But then for example, when you type in like Led Zeppelin, it will replace Led Zeppelin with the list of genres that they are, you know, commonly associated with. So like, you know, maybe hard rock, maybe male vocalist, 70s, just things like that. Okay, so if I put in, I mean, this is again a niche genre, but like, Periphery, it's going to think progressive metal, guitar forward, male vocals,

Starting point is 00:37:19 so it'll essentially atomize an artist's name. So essentially, then artists become shorthands for genre and style. That's correct. And we try to make sure that the generated outputs are definitely influenced by the style of that artist, but it's not that exact style, because that's one thing that we really do want to avoid. And we are dancing around the lawsuit here, and I'm trying to deliberately ask questions you can answer. But yeah, thank you for answering that. Can we play this? Let's hear it. Yeah, sure. So this is the first example that came out.
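The artist-to-style substitution described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the lookup table `ARTIST_STYLES` and the function `expand_artists` are hypothetical names, the entries are taken from the examples mentioned in the conversation, and nothing here is confirmed to match how Udio actually implements the feature.

```python
import re

# Hypothetical lookup table. The entries come from the examples in the
# conversation, not from Udio's real data.
ARTIST_STYLES = {
    "led zeppelin": ["hard rock", "male vocalist", "70s"],
    "periphery": ["progressive metal", "guitar-forward", "male vocals"],
}


def expand_artists(prompt: str, styles: dict[str, list[str]] = ARTIST_STYLES) -> str:
    """Replace each known artist name in the prompt with its style descriptors,
    so the artist's name itself never reaches the music model."""
    result = prompt
    for artist, tags in styles.items():
        # Match the artist name as a whole phrase, ignoring case.
        pattern = re.compile(r"\b" + re.escape(artist) + r"\b", re.IGNORECASE)
        result = pattern.sub(", ".join(tags), result)
    return result


if __name__ == "__main__":
    print(expand_artists("rock song like Led Zeppelin"))
```

The point of the design, as described in the interview, is that the model only ever sees genre and style descriptors, so an artist name acts as shorthand rather than as a target to imitate.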

Starting point is 00:38:19 It's better than something that I could write. So shout out to that. And then, for example, you can then like add additional descriptors to this. So maybe not just rock, maybe you want to make it A minor. And then you hit create again and then it queues it up. And so you, I'm curious about this,

Starting point is 00:38:41 because there is a little bit of time that goes through from when you click create to when Udio gives you the song, which, by the way, to me, is no big deal. But it does seem to be variable. David, so what makes it a longer or shorter calculation process on the UI side? Yeah, so on the UI side, so we submit the request over to the server, and the server reads the prompt, like rock, A minor,

Starting point is 00:39:06 and tries to figure out what to do with it before it sends it to the model. So we have some processing that goes on. We also run checks for every song that gets submitted. We take the lyrics and we do a copyright check to make sure that the lyrics are not copyrighted lyrics, and this is something that's probably a little bit overly strict right now. There are a lot of public domain songs or things that are not really copyrighted that get flagged, but we want to err on the side of caution rather than not flagging something that's actually copyrighted.
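As an illustration of why a cautious lyrics check produces false positives, here is a minimal sketch of one possible approach: comparing word n-grams in the submitted lyrics against a reference set. The interview does not say how Udio's check works, so the technique, the function names, and the threshold below are all assumptions made for the example.

```python
def word_ngrams(text: str, n: int = 4) -> set[tuple[str, ...]]:
    """Lowercase the text, split on whitespace, and return its set of word n-grams."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def looks_copyrighted(lyrics: str, known_lyrics: list[str],
                      n: int = 4, threshold: float = 0.2) -> bool:
    """Flag lyrics when the share of their n-grams found in any single known
    lyric reaches `threshold`. A low threshold errs on the side of caution,
    at the price of more false positives."""
    submitted = word_ngrams(lyrics, n)
    if not submitted:
        return False  # too short to compare
    for known in known_lyrics:
        overlap = len(submitted & word_ngrams(known, n)) / len(submitted)
        if overlap >= threshold:
            return True
    return False


if __name__ == "__main__":
    reference = ["the quick brown fox jumps over the lazy dog tonight"]
    print(looks_copyrighted("the quick brown fox jumps over the lazy dog tonight", reference))
    print(looks_copyrighted("completely different words about dinosaurs dancing in the city", reference))
```

A text-matching check like this has no notion of copyright status: if the reference set contains public domain lyrics, those get flagged too, which is consistent with the over-strictness described above.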

Starting point is 00:39:41 Okay, now we have this new song, same idea, but now in A minor. Let's listen to the new song, Chasing the Pulse. Let's see. I'm not sure if you have perfect pitch or not, but like, it's a little bit hard for me to tell. It's definitely a minor key. David, I'm not going to lie. I do not have perfect pitch. Indeed, if you had ever heard me sing a bedtime song,

Starting point is 00:40:23 you would think to yourself, that guy plays music? Because it doesn't sound like it. So I can't tell if it's A minor or not. It did sound minor. But what hit me, though, is when the, in the chorus or maybe it was the bridge, when the harmony voices came in, it wasn't on the first note.

Starting point is 00:40:40 They came in a little bit later, which feels very stylistic, and therefore, and I mean this in the best possible sense, like human. It felt like something that would be a music editorial decision that a human can make to say, hey, we're going to have the lead singer and then the harmony come in on a delay. I don't know. I'm always a little bit torn with this between going, this is the coolest thing I've ever seen. And, oh God, are humans going to lose?

Starting point is 00:41:09 Just because, you know, I've sat in the guts of a symphony as we took Beethoven's Fifth down to the studs and rebuilt it. I don't want a future in which we lose that, but at the same time, I'm going to click these buttons 100 times because it's a lot of fun. So maybe I'm part of my own problem, I suppose. Yeah, well, I think that people who play in bands, they don't necessarily, they won't stop

Starting point is 00:41:34 just because there's this additional source of music. One of our co-founders actually is involved in a band, and he regularly toured the UK to perform with his band. And so I think, and he continuously enjoys this, right? Because it's just fun for humans to create music. And we just want to, like, give people,

Starting point is 00:41:54 more people the opportunity. Like, he can create music with his band. But previously, he couldn't create music in his bedroom, or, like, lying down, like, on his couch, and then, oh, I want a song. And, like, previously, what would you do? Like, you can't even do anything. And so now, this is possible.

Starting point is 00:42:14 Yeah. And just because I'm going to be an enormous brat because I can. Court, one of our fine producers here at Twist, has given me a prompt he would like us to try. So, David, if you're up for it, in Zoom chat, there is a prompt entitled, a jazzy, neo-noir, offbeat rap song about dinosaurs, which is evidence that Cort is Gen X, but we'll leave that aside for now. But can we give, can we give that a try? Yeah, Jazzy, Neo-Noir.

Starting point is 00:42:43 offbeat rap song about dinosaurs. Let's see what we get. The person who requested this song, for everyone who's listening to this later on, was in a punk band once. So there you go.

Starting point is 00:43:01 This is what a punk fan is going to put into the Udio generation process. All right, while this is waiting, David, one question I had written down, just because, you know, I love this sort of thing:

Starting point is 00:43:12 what's the craziest song that you guys have seen people come up with? Because everything that I've done thus far has been pretty standard, but I'm curious, has anyone like really blown your head off? Well, one of the examples of a song that took off pretty unexpectedly

Starting point is 00:43:28 is a song called BBL Drizzy. It's a song that one of our users created. He himself is not a musician, but he is a comedian, actually. And so he wrote like the funniest lyrics and he used Udio to turn this set of lyrics into a song, and it ended up being sampled by Metro Boomin,

Starting point is 00:43:51 who created a beat as part of, like, the entire Drake and Kendrick feud and challenged people to create, like, a rap on top of this. And the funny thing is, like, Drake himself actually rapped on top of it. It's pretty amazing watching it from the sidelines to see your tool be genuinely immersed in pop culture. We'll come back to that in a minute,

Starting point is 00:44:15 but I want to play everyone this song. So here is the first sample clip of jazzy neo-noir off-beat rap song about dinosaurs. Hit it, David. Bowman through the night for your start roar. Lost in a city lights, Zeno feet at the floor. Brontosaurus groove.

Starting point is 00:44:30 T-Rex on a roll. Pterodactyl fly, rhythm digging in my soul. Bones of the past. We dance like where ageless history. We breathe in jazzy night. Boundless stages. Actos in the alley. shadows keep time

Starting point is 00:44:42 Jurassic jazz notes And the moons climb Dinosaurs in the urban globe Rhythms are time Let the ancient show Underneath the story flow Where the city And wild collide

Starting point is 00:44:55 We go. I mean, I'm not gonna lie, that's not bad. That's not bad. "Brontosaurus groove, T-Rex on a roll, Pterodactyl fly, rhythm digging in my soul."

Starting point is 00:45:04 That is actually probably better than some stuff that I listen to on Spotify currently. Yeah, our lyrics are very peculiar. It's an artifact of the language model. No, no, no, no. I meant all that.

Starting point is 00:45:21 That was not sarcasm. I mean, I never thought I'd see Brontosaurus, T-Rex, and Pterodactyl, all within one rhyming couplet, essentially. Yeah, no. Again, I don't know if you can answer this, but is the model that writes the lyrics the same model as what does the music

Starting point is 00:45:40 generation or are those two different models that then are brought together for this final product that we just heard? We use different models. So the model that creates the music is a proprietary model that we trained because there's nothing like that elsewhere. And the model that writes the lyrics, we just use GPT actually. Oh, simple enough. Yeah.

Starting point is 00:46:04 Well, I mean, it works pretty well. Okay. I want to talk about 1.5 a little bit and then we'll wrap on virality. So 1.5 came out back in July. That brought key control, improved, I think it was global language, and then also audio quality. So how has the reaction to 1.5 been? And then what's next from Udio in the feature context?

Starting point is 00:46:26 Yeah, I think people were excited by the changes. Like for better global languages, a lot of our Chinese-speaking users remarked how the model suddenly became a lot better at producing lyrics that have Chinese in them. Our key control is definitely a feature that people have wanted

Starting point is 00:46:47 for a very long time. And people love the ability of specifying the key and then modulating within the song. So you can specify a key for the first section. And when you extend this section, you can then specify a different key and you can kind of like, you know, specify your own harmonic progression throughout the course of the song.
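One way to picture per-section key control is as a list of sections, each carrying its own key, from which the modulations between sections follow. The sketch below is purely illustrative: the `Section` type, the `modulation_steps` function, and the sharp-only note table are assumptions for the example and say nothing about how Udio represents keys internally.

```python
from dataclasses import dataclass

# Semitone offsets above C, using sharps only to keep the example small.
NOTE_TO_SEMITONE = {
    "C": 0, "C#": 1, "D": 2, "D#": 3, "E": 4, "F": 5,
    "F#": 6, "G": 7, "G#": 8, "A": 9, "A#": 10, "B": 11,
}


@dataclass(frozen=True)
class Section:
    label: str   # e.g. "verse", "chorus"
    tonic: str   # key name from NOTE_TO_SEMITONE
    mode: str    # "major" or "minor"


def modulation_steps(sections: list[Section]) -> list[int]:
    """Upward semitone distance (0 to 11) between the tonics of consecutive sections."""
    return [
        (NOTE_TO_SEMITONE[b.tonic] - NOTE_TO_SEMITONE[a.tonic]) % 12
        for a, b in zip(sections, sections[1:])
    ]


if __name__ == "__main__":
    song = [
        Section("verse", "A", "minor"),
        Section("chorus", "C", "major"),
        Section("bridge", "D", "minor"),
    ]
    print(modulation_steps(song))
```

Here an A minor verse followed by a C major chorus is a move of three semitones up (to the relative major), and each extension of the song simply appends another section with its own key.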

Starting point is 00:47:06 How long until that's like super visual? Like I can imagine myself like, let's say a song is three minutes and I have like a line and I'm like, this chunk should be in A minor and be uptempo X, Y, Z and then I want six measures of this, that. Like, does this become a visual tool versus just something that I prompt, at least in my experience, with words? Yeah, so this is something that going forward, we do want to make more visual over time. So we recognize there are some deficiencies in our current interface that make it a little bit harder than necessary. And so we want to do like user research to figure out how best to craft the interface

Starting point is 00:47:44 in a way that's intuitive for musicians. Okay. So because musicians are already super familiar with editing software and so forth. So that kind of interface would be second nature to them. Now, I want to talk about virality because if we go back to when you guys announced your fundraise,

Starting point is 00:48:00 I think Bloomberg reported, and I have it somewhere in my notes here, that you were seeing something like 10 songs created every minute or something like that. I forget the exact pace. But how has the company been doing in usage terms in the last couple of months and how much

Starting point is 00:48:14 bigger is it compared to that April, June timeframe? Yeah, so it was actually like 10 songs every second, not minute. But people are still like super engaged with the entire process. We have a very dedicated

Starting point is 00:48:31 group of power users who, you know, go on Discord all the time and they share the songs that they have created. There's actually a bit of a collaborative flow as well where people work on lyrics and songs together. And you see this because in the final output, they will credit each other.

Starting point is 00:48:50 They will say, oh, this song was created with the help of this other user. And it's really fun to see people working on music in this way. It's kind of like how people would jam together in the past, right? And well, people still jam together. But now we have another way for people to jam together. And I think this is also part of what music is about, like bringing people together with a common passion. I agree with that entirely.

Starting point is 00:49:14 I was just thinking that, you know, not everyone now has to go to a jam room, which means that they don't have to get hearing loss. Like I did growing up in my ska band, which rest in peace did not make it big and turn us all into multi-millionaires. I'm sad to say. Now, on the virality point, you mentioned a Discord,

Starting point is 00:49:32 you mentioned power users. I learned about you guys from a friend, but I'm kind of curious, is the product here inherently viral because, you know, I was sent a song about me and my friends, I like to use it, and then I've been playing with this,

Starting point is 00:49:47 and sending it to people. And so I'm just kind of curious if that limits your sales and marketing costs because people are almost taking your product to their own networks ambiously. Yeah, I mean, most of our growth, almost all of our growth is via completely organic channels

Starting point is 00:50:02 where people are just, like, sharing amazing outputs that they have, and then people are asking each other, how do you do that, and then it, like, spreads this way. So how has, like, registered user growth been at the company? Is it still as quick as it was before in, like, percentage terms or gross number terms? How should we think about growth at the company, essentially?

Starting point is 00:50:23 Yeah, I mean, obviously, there was a very large initial spike when we launched, but we do still see, like, steady growth every single month, and we believe that once we launch new versions of the model, there will be renewed excitement as people find new ways of controlling the outputs and using the model for their own production needs. Okay. All right. Well, I mean, I'm going to be watching with very close eyes

Starting point is 00:50:49 because I'm a user and now a customer. But one thing I've heard from VCs lately, I think Sarah Tavel from Benchmark wrote about this, and she said that a lot of the AI, the big AI model companies, the OpenAIs and so forth, are going to go kind of upstack in time. And so that startups, not yours, but some startups that do build products using well-known commercial models,

Starting point is 00:51:12 for example, might eventually get supplanted by their model provider, essentially going upstack and taking their lunch. Are you at all worried as a company about one of the larger model companies, a Mistral, an Anthropic, an OpenAI, going, hey, music is cool, we should do that too, and then kind of bulldozing into your market? Yeah, so we think that inevitably in the future

Starting point is 00:51:36 there will be more companies who enter this music creation space. We believe that music is sufficiently different from text, and there's a significant product element as well. You want to have the right interfaces for people to interact with these models. So like a chatbot, kind of like ChatGPT, is probably not the right interface for people who want to create music. And so I think there's actually like an open question

Starting point is 00:52:06 on how to best produce this type of product. It's something that we're working towards. It's a tight coupling between the model and the product and like getting the right level of controls in the model so that you can expose them to the user in an intuitive way in the product. Okay, and then just to wrap things up here, David. I just want to ask you one more thing before I let you go, which is, you know, I think about Udio, and Suno

Starting point is 00:52:29 is the other company that I think people best know in your space. And so I'm kind of curious, where do you see Udio today in comparison to Suno and how many of their engineers are you currently trying to poach? So I think, like, we want to position ourselves as allies to artists and songwriters and producers. We want to focus on, like, giving them the highest quality tools available. So instead of focusing on, like, meme songs,

Starting point is 00:52:55 in particular, we want to focus on the really powerful creation course to help creatives like make music and make high quality music that they're proud of and maybe eventually they want to incorporate in their other music workflows.

Starting point is 00:53:11 That was a very, very deft non-answer, but let me take another run at this. Do you consider Udio's music model to be the best in the market today? I would say so, yes. It's the only model that produces stereo music

Starting point is 00:53:28 at like 44 kilohertz sampling rate. It's a lot higher fidelity than any other music model that's out there. It has a better understanding of genre than almost any other music model. Okay, I'll take that. I do want to have you back, though,

Starting point is 00:53:46 in I was going to say a year, but given that you launched the product in April and it already feels like we've gone through two generations, probably sooner than that, because I'm curious to see how fast things improve, how competition evolves. And if you guys do decide to go back out into the market, because I think that given your traction, early monetization,

Starting point is 00:54:04 and so forth, you should be able to raise more. So I'll be very curious. But David, thank you so much for coming by Twist. I really appreciate the information and the notes. And thank you for making a new tool for me to play with, because I absolutely love music. Yeah, thank you for hosting the podcast. It's very fun.

Starting point is 00:54:19 All right, everybody. Twist is back. We do live news. If you're not with us on YouTube, see you there, or also on every single podcast platform you can possibly find, and we are always trying to find the best and most interesting founders to explain the market as it is. This has been Udio, David Ding, and Alex.

Starting point is 00:54:33 Hey, we're out of here.
