No Priors: Artificial Intelligence | Technology | Startups - Music consumers are becoming the creators with Suno CEO Mikey Shulman
Episode Date: May 16, 2024

Mikey Shulman, the CEO and co-founder of Suno, can see a future where the Venn diagram of music creators and consumers becomes one big circle. The AI music generation tool trying to democratize music ...has been making waves in the AI community ever since they came out of stealth mode last year. Suno users can make a song complete with lyrics, just by entering a text prompt, for example, “koto boom bap lofi intricate beats.” You can hear it in action as Mikey, Sarah, and Elad create a song live in this episode. In this episode, Elad, Sarah, and Mikey talk about how the Suno team took their experience making a transcription tool and applied it to music generation, how the Suno team evaluates aesthetics and taste because there is no standardized test you can give an AI model for music, and why Mikey doesn’t think AI-generated music will affect people’s consumption of human-made music.

Listen to the full songs played and created in this episode: Whispers of Sakura Stone, Statistical Paradise, Statistical Paradise 2

Sign up for new podcasts every week. Email feedback to show@no-priors.com
Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @MikeyShulman

Show Notes:
(0:00) Mikey’s background
(3:48) Bark and music generation
(5:33) Architecture for music generation AI
(6:57) Assessing music quality
(8:20) Mikey’s music background as an asset
(10:02) Challenges in generative music AI
(11:30) Business model
(14:38) Surprising use cases of Suno
(18:43) Creating a song on Suno live
(21:44) Ratio of creators to consumers
(25:00) The digitization of music
(27:20) Mikey’s favorite song on Suno
(29:35) Suno is hiring
Transcript
Discussion (0)
Hi, listeners, and welcome to No Priors.
Today we're talking to Mikey Shulman, the co-founder and CEO of Suno, an AI music generation tool,
trying to democratize music making.
Users can make a song complete with lyrics just by entering a text prompt.
For example, I was playing with it this morning, and you guys all get to hear "koto boom bap lofi intricate beats."
Great. Nature weave tails in each gentle race.
Interior pedals, fall time slows its pace.
Every feeding cherry boom, a hint of grace.
Okay. So I'm feeling really excited about quality here for a company that is just under two years old, but is making waves in the AI music industry.
Since you came out of stealth mode late last year, Mikey?
That's right.
All right. Well, we're excited to talk to you about AI music models and how it's been going since launch.
Thanks so much for doing this. Welcome.
Thank you. I'm super excited to be here.
Okay, maybe just start us off with a little bit of background.
You're a kid who loved music, playing in bands.
How do you go from that to, you know, a Harvard physics PhD, to building, you know,
a couple AI companies?
Yeah, I guess a bit of a circuitous route.
I've been playing music for a really long time since I started playing piano when I was four.
I played in a lot of bands in high school and college growing up.
And the dirty secret is I'm not that good.
And so the smart move, I suppose, for me, was to pursue the thing that I was relatively better at, which was physics.
I went to college and then to grad school and did a PhD in physics.
Studied quantum computing.
Maybe for your next podcast, I can tell you about why you shouldn't go into quantum computing.
What did you think you were going to do?
Like, did you think you were going to be like a theoretical physicist or like an academic?
Oh, goodness.
Two things.
Like, I've never had a master plan, so I don't think I thought what I was going to do or not going to do.
But I am certainly not great at physics.
You know, I think I had a reasonably successful PhD, not because I'm good at physics.
The quantum mechanics that I studied was worked out in, like, the 50s.
There was a lot of very tricky low-temperature microwave engineering that turns out to be really important for actually doing this stuff.
I got lucky that I was relatively good at that compared to all the other physicists.
So, you know, kind of something on the boundary between two disciplines.
I enjoyed every second of that.
I would do it all again, even knowing what I would be when I grew up or when I grew out of that.
Still very close with my PhD advisor.
I still live walking distance from my old lab.
You know, it's kind of a fun place to just walk around Cambridge, Massachusetts.
But, yeah, quantum computing is cool, not what I wanted to
do with my life. I found a company called Kensho by accident, not founded, found. They were local
and I met them and probably 10 people at the time and I met all 10 and I really, really liked
them. And I said, let's go do this. And I was hired as a software engineer. And I think I got
really, really lucky in terms of timing about a month after I joined, the machine learning opportunities
came along. And in 2014, a guy with a PhD in physics is what passes for a machine learning
engineer. And so I took full advantage of that opportunity, learned a ton, got to build a team,
got to build some fun products. We were acquired by S&P Global in 2018 and got to pursue a lot
of fun stuff after that acquisition as well. So I guess I found my way into AI somewhat by accident,
but I really like it. It's a lot of fun. So you guys actually started with,
this open source model, Bark. Can you talk about what the idea was at the very beginning
and how you ended up in music generation? We were doing all text at Kensho, and we did our first
audio project after we were acquired by S&P Global, which was learning to transcribe earnings calls.
So I'm sure both of you have read an earnings call transcript. Exceedingly likely it was done by
S&P Global. It used to be done completely manually, very painful. And we could lend a lot of speed and
scale by bringing automation to that. And we fell in love with doing audio AI. Like,
we happened to be musicians, but it kind of took this very honestly non-sexy project of earnings
call transcription to show us how much we loved it. And we also realized that certainly compared
to images and text, audio was really, really far behind. And this was in 2020. And I think that's
maybe even more true now if you just look at everything that's happened in images and text in the last
years. Like I said, we never had a master plan. We made Bark as an open source project, and
even before we released Bark, we knew we wouldn't be focusing on speech. I think, if I'm honest,
a lot of people told us, go build a speech company. It is more straightforward. You'll build
a great B2B product, and people will love it. And we couldn't help ourselves. We just love
music too much. And so we decided to build a music company. Why did you know you weren't
going to focus on speech? Speech is super interesting, but the inherent creativity that we were
so drawn to. It was like not really present in speech. Speech just needs to be right. Just like
read me this New York Times article. And if it's a tiny bit non-expressive or a tiny bit robotic,
that'll still get the job done. And the real creativity was happening in a totally different
part of audio, which is music, which all I care about is how it makes me feel.
That's really cool. And then what's the approach that you've taken? Because I guess, of the two main
architectures that people have used for different forms of audio models, a lot of them are
traditionally based on diffusion models. I know there's been more work on the Transformer side.
And then there's obviously a few other types of architectures.
Is there anything you can tell us about sort of the technical approaches you've taken or how you think about it?
And one of the reasons I ask is obviously for a lot of the transformer models,
people just look at scaling laws and how things will sort of adapt with scale.
And I'm a little bit curious how that applies to music and how you think about that future relative to models and approaches.
We don't make it a secret that these are just transformers.
This comes somewhat from our backgrounds doing text before, but also transformers
scale nicely.
A lot of work ends up being done for you
by the open source text community,
which is always really nice.
We can really be choosy with where we innovate,
and where we end up innovating a lot
is how do you tokenize audio?
You know, audio does not do us the favor
of being nicely discretized.
It's sampled very, very quickly,
approximately 50,000 samples per second.
It's a continuous signal.
And so you have to use a set of heuristics
or models in order to turn
that into a manageable set of tokens. And that's where we expend, I think, a lot of our
kind of innovation cycles is really understanding that. As you said, the thing that matters is how it
makes you feel. And so, like, how did you measure quality in your own models? Like, what do you
know about how to train something that creates great generations? Is it just all like Mikey as
human eval? It's definitely not all Mikey as human eval. But, you know, one thing we say here is,
is that aesthetics matter, and I think that is a recognition that I think in all branches of
AI we become slaves to our metrics, and you say I did this accuracy on this benchmark and this
benchmark, and in the real world, sometimes it doesn't necessarily matter. And these benchmarks
are extra terrible in audio just because the field is so new. And so aesthetics matter is like a way
of saying that you have to use your ears in order to evaluate things. You can look at
things like what your final loss is or something like that. But ultimately, it's definitely
more tedious to evaluate than you want it to be. I think the good news is everybody here really
loves music. And so evaluating your models, which means listening to a lot of things and getting
people to listen to a lot of things and doing a lot of A-B tests, turns out to be fun. But I think
we have a long way to go in this journey on how we're actually going to evaluate these things.
And I think we learn a lot about human beings and human emotions while we learn to evaluate these things.
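The A/B testing Mikey describes maps to a very simple aggregation step. As a toy sketch, and emphatically not Suno's actual pipeline (the version names and vote format here are made up for illustration), blind pairwise listener votes can be turned into per-pair win rates like this:

```python
# Toy sketch: aggregating blind pairwise listening tests into win rates.
# Model names ("v1", "v2") and the vote format are hypothetical.
from collections import defaultdict

def win_rates(votes):
    """votes: list of (model_a, model_b, winner) from blind A/B trials.
    Returns {(x, y): fraction of trials between x and y won by x},
    with each pair keyed in sorted order so (a, b) and (b, a) merge."""
    wins = defaultdict(int)
    totals = defaultdict(int)
    for a, b, winner in votes:
        pair = tuple(sorted((a, b)))
        totals[pair] += 1
        if winner == pair[0]:
            wins[pair] += 1
    return {pair: wins[pair] / totals[pair] for pair in totals}

votes = [
    ("v1", "v2", "v2"),
    ("v2", "v1", "v2"),
    ("v1", "v2", "v1"),
    ("v2", "v1", "v2"),
]
print(win_rates(votes))  # {('v1', 'v2'): 0.25}
```

In practice you would also want enough votes per pair for the rate to mean something, for example by attaching a binomial confidence interval before declaring one checkpoint better than another.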
Yeah, it's interesting because as an analog, I know that in the early days of mid journey,
one of the ways it really stood out is people just felt that there is better taste exhibited,
you know, it's better aesthetics versus, hey, there's a much better eval function that they're optimizing against,
although obviously there were things that were doing there as well.
And so it feels similar here where that sort of taste component really matters,
particularly early on.
Are there other ways that your music background has impacted the development of
Suno or really helped sort of facilitate some of the things that you're doing?
There's this cliche about it being really important to look at your results
and look at your data in machine learning and in AI.
And if that is pleasurable, it is not nearly as tedious.
And that's not just for me, that's kind of everybody here.
And that ends up mattering a lot.
I've learned a lot about music actually since starting.
this company and just the exposure to different genres that I never knew existed and exposure
to hybrids of genres that have yet to be created by people has been like really, really
eye-opening. But it's funny because you ask like, okay, you know, maybe the stuff that I know
about music, we actually try very hard not to put too much implicit bias in the model.
The model shouldn't know about music theory. You don't tell GPT this is a noun and this is
a verb. GPT figures it out. If I tell my model, there are only 12 tones. My model will only
know how to output 12 tones. If I tell my model, there's 50 different instruments. I will never get
that unique sound. And so we've really tried very hard not to do anything like that. And honestly,
I don't think this is so smart of us. This is something that we've stolen from the text world
of there's something beautiful about next-token prediction that ends up being very, very powerful.
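To put rough numbers on the tokenization problem Mikey describes: raw audio is a continuous signal sampled at roughly 48 to 50 kHz, so even a crude scalar quantizer turns one second of sound into tens of thousands of tokens. The sketch below is illustrative only. It uses classic mu-law companding from telephony, which is not Suno's tokenizer; a real system would presumably use a learned neural codec precisely to get a much shorter token sequence for next-token prediction:

```python
# Illustrative only: discretizing continuous audio into tokens via
# mu-law companding (a telephony trick, not a modern learned codec).
import math

SAMPLE_RATE = 48_000   # raw audio: ~50k continuous samples per second
VOCAB = 256            # token vocabulary size after quantization

def mu_law_token(x, mu=255):
    """Map one audio sample in [-1, 1] to a discrete token in [0, VOCAB)."""
    y = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)  # still in [-1, 1]
    return min(VOCAB - 1, int((y + 1) / 2 * VOCAB))

# One second of a 440 Hz sine wave becomes 48,000 tokens, far beyond a
# comfortable transformer context for just one second of music.
signal = [math.sin(2 * math.pi * 440 * t / SAMPLE_RATE) for t in range(SAMPLE_RATE)]
tokens = [mu_law_token(x) for x in signal]
print(len(tokens), min(tokens), max(tokens))
```

The point of the example is the count: 48,000 tokens per second is why the tokenizer, not the transformer itself, is where an audio model's innovation budget goes.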
Mikey, what's hard in AI music? Like, I know less about
what this frontier looks like. Where do you want to push in terms of things that are really
hard for the model to get right? Like, you know, in visual models or video like human hands,
object permanence, like there's lots of things that are more intuitive to me there. Yeah,
that's a really good question. I confess, I've not really thought about that too much. There are the
easy things or the easy to describe things like, you know, did you get the stereo right? Did you get
the bit rate high enough, et cetera? Again, I think the reason
music is so special is because it makes you feel a certain way. And like to the extent that
any of this is difficult, it is because you are really targeting human emotions in some way.
And that's not terribly well understood by anyone. And it is also super, super, super, super
diverse and super culturally dependent and super age dependent or demographic dependent.
So, you know, I think what we're doing is so far from objective truth. And it's very easy for
people who spend all their days in text LLMs to be thinking about things like,
this is how well I did on the LSAT.
You know, I can pass the bar with this size model, like the law bar.
And none of that exists for us.
It's really just like I made a song and it made me feel a certain way.
And it may have been grainy audio that made me feel a certain way.
It may have been a long song, a short song.
I think there's a lot more unanswerable questions in this domain.
And then one of the things that you all did quite early is,
I believe you have like a free tier
so people can make up to 10 songs a day
and then you have a subscription-based approach.
How do you think about your users over time
in terms of consumer versus prosumer versus business users?
And is it too early to tell?
Is there a specific area
that you're most focused on?
Like, how do you think about all that stuff?
Yeah, that's a great question.
I would say, you know,
we are trying to change how the entire globe
interacts with music and to open new experiences for people.
And so what that means is that this is a consumer
product. This is not sprinkling AI into Ableton or Logic or Pro Tools. This isn't for the person
already staying up all night as a hobbyist trying to produce music. This is for everybody. This is for
like my mom. And, you know, I think the business side of things, it may not be conventional
wisdom to say start charging immediately for your product. But it's actually really important as we are
trying to create something that is a set of behaviors that does not exist, to be able to
understand what actually makes people want to part with their hard-earned dollars.
If I'm being honest, people ask about the business model of generative AI a ton, and I think
everybody's doing kind of something that looks like SaaS pricing, and it's kind of done very
crudely, and we are certainly no exception to that. But I don't know if this is right in the long
term. And it strikes me as probably just a vestige of it is the same types of people who were
building SaaS companies five years ago and the same investors who were investing in SaaS companies
five years ago who are building it and investing in it this time around. And so it kind of feels like
a bit of a vestige. No offense to you guys. You guys are both great investors. But like this feels
like something that's not totally worked out yet. Yeah. It's interesting because like I remember
talking to some people who were very active in the 90s as the web browser was really coming to the fore
front, and they were trying to figure out the right business model for web pages. And a lot of
the emphasis was actually, should we do micropayments? So every time you read a New York Times article,
you pay a fraction of a cent instead of ads-based models, right? And of course, the world ended up
collapsing on that side to ad-based models. But nobody that I've talked to from that era
actually thinks that was necessarily the right answer. They just think it was the easiest thing to do
in the short run. And so I think there's a really interesting question here, to your point, in
terms of, you know, subscriptions, there's ads, there's other sorts of paid placement. There's a
variety of things you could do over time. There's micro transactions. And so there's reselling things
in a marketplace and letting people take a cut of subscribers, you know, almost like a next-gen
Spotify or something. So it's super interesting to wonder how all this evolves and where you take
it. So it's really cool that you're thinking deeply about it right now. Yeah, it's actually funny to hear
you say that because I remember back in the 90s, my older brother was a beta tester for AOL.
And I actually remember some of these things happening.
And I remember actually watching him beta test these things.
Yeah, that's cool.
Are there any ways that people have started to use a product that were very unexpected for you?
Or surprising use cases or applications or other things people have done with it?
I think so much has been really fulfilling and cool to see and definitely surprising.
And, you know, one thing I'm constantly reminding everyone is that we are eliciting a set of behaviors that are not
common and that are not regular for people to do. And so it's not going to be surprising when we
see stuff that comes out. It's maybe not surprising that people love to feel creative and they love
to feel ownership over what they produce and they love to share it with others. If you want to be
a little bit more reductive about it, they love to feel famous. But I think it's not the same way
that famous people are famous. It's a little bit different. And so
we've seen that people will spend a lot of time in front of their computers enjoying making songs.
This is really cool, and it is different from, I think, the way music is done now.
Music is done now, sometimes painfully, but only in service of the final product.
And I think when you open this up to people, sure, you definitely care about the final product,
about what the song sounds like on the other end, but you also really care about the journey,
and that people will really enjoy making music, regardless of the final product.
And I can tell you, you know, personally, the most fun I have ever had doing
music is playing music with friends, jam sessions, even when you're not recording. And I think
there's something that's like very, um, very akin to that, that we are able to open up with
some of these technologies. It's such like a magical experience. And I, um, I feel like everybody
should, should feel some of that like joy of creation with other people. Maybe you already see
it in the product, but are you imagining that you get that collaboration
joy from, like, you know, the creation's joy of working with yourself, feeling like you are more
skilled, you're collaborating with AI, with Suno? Or are people jamming? Do you see, like, mixtape-like
sharing behaviors today you can talk about? We see all of that, which is super cool. Like a video game,
music is fun by yourself and maybe more fun in multiplayer mode. And so we see people enjoying this
by themselves, but we see people basically hacking multiplayer mode into this in lots of fun
ways where you can have people co-writing lyrics together, trading off words, trading
off verses, I'll write the verse, you write the chorus, or I'll write the lyrics and you
pick all the styles and I'll make a song and then I'll send it to you and you'll make a song
back. And so it's not surprising. I think humans really evolved to resonate strongly with music
and want to do music together. Every culture basically has music. And so it really shouldn't be
surprising that we see all of this. But it is really fulfilling from our perspective because it really
brings people together. It makes people smile. I don't pretend like we're curing cancer at
Suno, but it is really cool to make a lot of people smile. One of the things that you and I talked
about previously was in creation platforms, you have like a very skewed ratio in general. And that
varies by, you know, what the platform is, of like creators and people who are listening, absorbing,
viewing, whatever, right? There, of course, are a lot of people who make music today. But you listen
to the creations of a relatively few number of people, right?
How much do you think that changes with something like Suno?
I think a lot.
I will say, you know, I'm speculating here.
It's still super, super early.
But I think of us opening a few important avenues.
The first is, I guess, all of the sort of smaller niche micro-sharing that is possible
where we can make songs that the three of us are going to listen to.
because it is capturing a moment that the three of us had, the same way we might take a selfie.
And that is sharing dynamics that just are completely absent in music right now.
Let's do it.
Let's do it.
Sorry to interrupt you.
Okay.
I need some genres.
What should we make a song about?
My favorite genre, but I don't know that it's supported yet, is phonk, P-H-O-N-K.
But it may be too obscure.
Okay, that'd be very exciting.
No, I think we can do some.
But let's do some hybrid also, like, I don't know, a song.
Reggae song?
How about some like, yeah, or like Hawaiian R&B?
Ooh, Hawaiian.
You want to choose like an instrument to add in there?
Yeah.
You said Koto before.
Koto?
Or sitar.
Something random.
Sitar.
Sitar sounds cool.
Yeah.
I have heard a lot of really good sitar trap on Suno.
Yeah.
Goes really well together.
That's my second favorite.
About?
Okay, a song about priors in statistics.
We have no priors here.
Let's see how we do.
Just learning from the world.
Ground up.
I've learned a lot.
I've learned a lot of new genres since starting this.
What's your favorite new genre, by the way,
that came out of that?
Gosh, there's some recency bias here, but sitar trap is freaking fantastic.
Yes.
That sounds good.
That sounds good.
Tryers in the game
It's all about
Probability
Got my Hawaiian shirt
Feeling real fly you see
Safe old day to fly you see
Try the other one
Like waves on the shore
I'm got a deep in
In probabilities galore
But since I'm strong
Spreading with the fog
Meats as I am a modest data
Try the other one
We should just change this to the intro
For No Priors going forward
I'm trying to fly
In a statistical paradise
With beats of that
price in the game
So about probability
I'm a fly
To see a bit of
Throwing like waves on the shore
I'm diving deep
In probabilities galore
The Stata strongs
Lending with the phone beats
I like that
It's fun. We're going to have to, like, get an image where I don't even own a Hawaiian shirt, but we're going to have to get an image where we're all, like, wearing the Hawaiian shirts. Spin these. The Palmer Luckey. Yeah, yeah, fine. I'll just get Palmer to do it. I'm maybe the only person who does this with Suno, but every time, like, I create a song, I imagine what the artist would look like that creates the song. I'll, like, visualize it, where I'm like, it's the big dude with the Hawaiian shirt and he's got the sitar with him.
I love that. I will tell you, you know, one like very cool and unexpected thing we saw is we shipped a very simple feature of you can edit your song title. Maybe you fat fingered it or something. And as soon as we did that, people started to put their names in their song titles when they hit our trending page. And people, people like to feel good about their creations and you should know. In hindsight, it's obvious. And people will hack your product and tell you what they want out of it. Just one thing back to your point before, though, Sarah, I think,
we talk a lot about, you know, how asymmetric the creation versus consumption is on different
platforms, and TikTok is famously very creation heavy, although still most of TikTok is consumption.
And I think these set of technologies have the ability to skew that much, much farther,
because the creation process is so enjoyable. But I actually think if we do this right in the future,
these are not going to be the terms that we use to describe what we're doing. We're not
going to say I'm creating or I'm consuming. These things will bleed into one another. We'll have
a lot of lean-in consumption. We'll have a lot of lean-out creation. And I think we will eventually
decline to draw the line of how many people are creating, how many people are consuming. And we'll
just say, like, people are enjoying all of this music stuff. That's a really interesting vision of
the future. I guess that has pretty deep implications as well in terms of how you think
about music, the music industry, how it permeates society.
Do you have a view in terms of what all this looks like five years from now?
If we are correct that there are just modes of experience around music that people don't have access to,
that we can get a billion people much more engaged with music than they are now,
that just in terms of the number of dollars or the amount of time people are spending doing music,
both of those are going to go up dramatically.
That I feel quite confident about.
The exact nature of how this looks, I think, is up for some more debate.
So this is just an opinion.
I don't, because music is so human and so much emotional connection involved in it,
I don't really see people losing connection with their favorite artists at all.
In fact, if you labor around music and you understand the process, you feel a much deeper connection
with the artists that you love.
Another thing I think is likely to happen, if we look at like the last wave of technologies to enter music,
let's say the DAW, this really accelerates how quickly music can change and how quickly
culture can change. You know, music is really just the reflection of culture. And I think
the way that happened is the DAW really let a lot of people start making music who could never
make music. You could do this from your dorm room if you had a good pair of headphones and you had
a good ear and you were willing to put in the work to learn the tool. And I think if we can give
this to so many more people, yes, a lot more people will create. A lot more people will become
tastemakers, but the rate at which culture changes, the rate at which the styles of music
change, the rate at which new styles of music are uncovered, is likely to go up a lot. And I think
even if you were just going to only ever listen to music, which some people will, that will get
so much more interesting. Things are going to change so much more quickly. You will not have people
really, I think, cribbing off of one another in the same way. So I'm really excited about that.
just because not every listener will, like, mix in a DAW, a digital audio workstation like Ableton or something, right?
Like it's, you know, you can generate music, put it on a timeline and create sound as, as Mikey was saying, in your dorm room, in your apartment cheaply.
That was pretty revolutionary when it turned out you didn't need, you know, a $500,000 SSL mixer and a staff of 10 people to cut an album.
That was really revolutionary, and people made tremendous contributions to, like, our collective
culture when that happened. Like, there were 15-year-olds who got discovered that way, and
that was extremely rare before that. I actually think it's really, like, an untold story. I'm not the
right person, but somebody with, like, really rich musical history understanding should, like,
explain what happened with the digitization of music. We're like, ah, I have, like, an infinite set of, like, every
snare drum sound in the world I can think of. Just the ability to be completely unconstrained
on, as you said, something that's much cheaper than traditional tooling, where you don't need
to know how to play any instrument. And now I think some of what Suno is doing is making
the assembly of that, like, another magnitude easier. I think that's right. There's one other thing
that I'm really excited about getting unlocked, which is that if you look at the last 10 years of
music, a lot of the changes are, let's say, sonic: it's like interesting sounds and maybe
slightly less so evolving how interesting songs are. And it's a function of the technology that
got unlocked, like a lot of digitization of things. And I'm actually really excited for the opposite.
Like, AI is certainly able to produce interesting sounds that we've never heard before.
But putting these tools in people's hands, we can unlock song structures and chord changes
and borrow different styles
and mix them with other styles
and make stuff that is not only sonically new
but kind of melodically new.
And I think that has the ability
to really keep people listening to stuff
and on my most optimistic days,
I'll say, un-TikTokify music,
like get us listening to stuff
for more than 30 seconds at a time.
Maybe I'm a little bit naive and optimistic,
but I think it's very possible.
Yeah.
Okay, before we wrap,
Like, I played a song at the beginning.
We made a song.
You got to play one that's your favorite.
That's a creation.
Oh, that's a, let me find it.
I'm tempted to play a song that's at the top of our showcase.
And it's by an artist called Oliver McCann.
It's got a lot of plays.
It's a really interesting song.
It is certainly the public's favorite.
So I can play it now.
Oh, my love, my friend, you know, it's been a while without thinking of you, but the thought makes me smile.
I'm so tough wanting.
But what am I to do?
I need some space to breathe.
So give me some room.
Oh, my love.
It's unbelievable.
The amazing thing about this, by the way,
which, you know, just for a listener's sake,
is the vocals are completely machine-created.
The music is completely machine-created.
The lyrics are machine created.
And so this truly is a synthetic song,
which I think is pretty amazing.
Yeah, it certainly is easy to lose sight of that fact
when you do this day in and day out,
but it is incredible.
I'll say one step further,
the machine doesn't know that there is even a concept of voice.
Like, it's just all sound,
and somehow it's able to produce the sounds
that we have been evolved and acculturated to
resonate with. And so all of that makes me think I have the coolest job in the world.
Not bad for a quantum physicist, a failed one, I guess. Exactly. Mikey, how big is Suno? It's
obviously very popular now. You're growing the team. What are you looking for? Yeah, we are always,
we are always on the hunt for the best people, people who love technology, people who deeply love
music, people who are excited about bringing more music to the world. We're hiring primarily in
Cambridge, Massachusetts, or New York. Come drop us a line: careers at suno.com.
Great. Well, thank you so much for joining us today. I think we covered a lot of great things.
I had a great time. Thanks so much for having me.
Find us on Twitter at NoPriorsPod. Subscribe to our YouTube channel. If you want to see our faces,
follow the show on Apple Podcasts, Spotify, or wherever you listen. That way you get a new episode
every week. And sign up for emails or find transcripts for every episode at no dash priors.com.
Thank you.