No Priors: Artificial Intelligence | Technology | Startups - State Space Models and Real-time Intelligence with Karan Goel and Albert Gu from Cartesia

Episode Date: June 27, 2024

This week on No Priors, Sarah Guo and Elad Gil sit down with Karan Goel and Albert Gu from Cartesia. Karan and Albert first met as Stanford AI Lab PhDs, where their lab invented State Space Models, or SSMs, a fundamental new primitive for training large-scale foundation models. In 2023, they founded Cartesia to build real-time intelligence for every device. One year later, Cartesia released Sonic, which generates high-quality and lifelike speech with a model latency of 135ms, the fastest for a model of this class. Sign up for new podcasts every week. Email feedback to show@no-priors.com

Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @krandiash | @_albertgu

Show Notes:
(0:00) Introduction
(0:28) Use Cases for Cartesia and Sonic
(1:32) Karan Goel & Albert Gu’s professional backgrounds
(5:06) State Space Models (SSMs) versus Transformer Based Architectures
(11:51) Domain Applications for Hybrid Approaches
(13:10) Text to Speech and Voice
(17:29) Data, Size of Models and Efficiency
(20:34) Recent Launch of Text to Speech Product
(25:01) Multimodality & Building Blocks
(25:54) What’s Next at Cartesia?
(28:28) Latency in Text to Speech
(29:30) Choosing Research Problems Based on Aesthetic
(31:23) Product Demo
(32:48) Cartesia Team & Hiring

Transcript
Starting point is 00:00:00 Welcome back to No Priors. We're excited to talk to Karan Goel and Albert Gu, the co-founders of Cartesia, and authors behind such revolutionary models as S4 and Mamba. They're leading a rebellion against the dominant architecture of Transformers, so we're excited to talk to them about that and their company today. Welcome, Karan, Albert. Thank you.
Starting point is 00:00:26 Nice to be here. And Karan, tell us a little bit more about Cartesia, the product, what people can do with it today, some of the use cases. Yeah, definitely. We launched Sonic. Sonic is a really fast text-to-speech engine. So some of the places I think that we've seen people be really excited about, you know, using Sonic is where they want to do interactive, low-latency voice generation.
Starting point is 00:00:45 So I think the two places we've really kind of had a lot of excitement is one in gaming, where, you know, folks are really just interested in powering, you know, characters and roles and NPCs. The dream is to have a game where you have millions of players and they're able to just interact with these models and get back responses on the fly. And I think that's sort of where we've seen a lot of excitement and uptake.
Starting point is 00:01:10 And then the other end is voice agents and being able to power them. And again, low latency there matters. And even with what we've done with Sonic, we're already kind of shaving like 150 milliseconds off of what they typically use. And so the roadmap is let's get to the next 600 milliseconds and try to shave that
Starting point is 00:01:28 off over the course of the year. That's been the place where it's been pretty exciting. I'd love to talk a little bit just about backgrounds and how you ended up starting Cartesia. Maybe you can start with the research journey and, like, what kinds of problems you were both working on. Karan and I both came from the same PhD group at Stanford. I did a pretty long PhD and I worked on a bunch of problems,
Starting point is 00:01:46 but I ended up sort of working on a bunch of problems around sequence modeling. It came out of kind of these problems I started working on actually at DeepMind during an internship, and then I started working on sequence modeling. Around the same time, actually, that Transformers got popular, instead of working on them, I got really interested in these alternate kind of recurrent models, which I thought were really elegant for other reasons. And it kind of felt fundamental in a sense.
Starting point is 00:02:07 And so I was just really interested in them and I worked on them for a few years. A couple years ago, Karan and I worked together on this model called S4, which kind of got popular for showing that some form of recurrent model called a state space model was really effective in some applications. And I've been continuing to push in that direction.
Starting point is 00:02:23 Recently I proposed a model called Mamba, which kind of brought these to language modeling and showed really good results there. And so people have been really interested. We've been using them for applications in other sorts of domains and so on. So yeah, it's really exciting. Personally, I also just started as a professor at CMU this year.
Starting point is 00:02:45 My research lab there is kind of working on the academic side of these questions, while at Cartesia we're kind of putting them into production. Yeah, I guess my story was that I grew up in India, and I came from an engineering family. All my ancestors were engineers, so I actually was trying to be a doctor in high school, but my aptitude for biology was very low, so I abandoned it and instead became an engineer. So, you know, I kind of took a fairly typical path, went to IIT,
Starting point is 00:03:13 came to grad school, and then ended up at Stanford. I actually started out working on reinforcement learning back in 2017, '18, and then once I got into Stanford I started working with Chris, who was somewhat skeptical about reinforcement learning as a field. And so this is Chris Ré. Yes, Chris Ré, who was our PhD advisor. So I had a very interesting sort of transition period when I started the PhD, because I had no idea what I was working on. And so it was just exploring.
Starting point is 00:03:43 And then ended up actually, we did our first project together too. Oh, yeah. And actually, we knew each other before that. And I think then we started working together on that first project, and we would hang out socially and then start working together. The only memory I have of that project was I kept filling up this disk on GCloud and expanding it by one terabyte every time, and then it would keep filling up, and I would insist on only adding a terabyte to it, which he was very mad about for a while.
Starting point is 00:04:12 Well, by the end of the project, it was like running a bunch of experiments and logs would get filled up faster than... Yeah, then you would... Basically, like, I would be there like tracking the experiments and Karan would be there deleting logs like in real time so that our runs didn't crash. It was a really interesting way to get started working together. Yeah, so we started working together then, and then, you know, I eventually started working with Albert on the S4 push when he was pushing for NeurIPS. And I think he was working on it alone and then needed help. I got recruited in to help out because I was just not doing anything for that NeurIPS deadline, so I ended up spending about three weeks on that, two or three weeks, something like that, and then we really pushed hard. And that's kind of how I got interested in it, because, you know, he had been working on this stuff for a while, and, you know, nobody really knew what he was doing. To be honest, in the lab, he was just, like, over in the corner, scribbling away, talking to himself. We didn't really know what was going on.
Starting point is 00:05:05 Could you actually tell us more about SSMs and, you know, how is it different from transformer-based architectures and what does some of the main areas that people are applying them right now? Because I think it's really interesting is sort of another approach. They really kind of got started from work on RNNs or current neural networks that I was working on before. as an intern in 2019, it kind of felt like the right thing to do for sequential modeling because the basic premise of this is that if you want to model a sequence of data, you want to kind of process the sequence one at a time. If you think about the way that you will kind of process information, you're taking in sequentially and kind of encoding it into like your representation of the information
Starting point is 00:05:43 that you know, right? And then you get new information and you update your belief or your state or whatever with the information that you have. You can basically say almost any model actually is doing this. And then there were some connections to other like dynamical systems and other things that I found really interesting mathematically. And I just thought this kind of felt like a fundamental way to do this. It just felt right in some ways.
Starting point is 00:06:04 You can kind of think of these models as doing something, there's like some loose inspiration from the brain even where you kind of think of the model as encoding all the information it's seen into a compressed state. It could be kind of fuzzy compression, but that's actually powerful and something. some ways because it's a way of kind of stripping out unnecessary information and just trying to focus on the things that matter and code those and process those and then and then work with that we can get more than technical details but kind of like at a high level is just this thing it's just representing this idea of this fuzzy compression and fast updating so you're just
Starting point is 00:06:37 keeping this this state in memory that's just always updating as you see new information is a better um architecture for certain types of data or did you have you know applications in mind besides the sort of general architectural concept? Yeah, so it really can be applied to pretty much everything. So just like kind of Transformers, these are applied to everything, so can these sort of models. Over the course of research over a few years, we kind of realized that there are different advantages
Starting point is 00:07:03 for different types of data. And lots of different variants of these models are better at different types of data or others. So the first type of model we worked on, we're really good at modeling kind of perceptual signals. So you can think of text. data as kind of a representation that's already been really compressed and tokenized, right?
Starting point is 00:07:22 Be cooked. Yeah, sure. And it's kind of like very dense. Like every token in text already has a meaning, it's kind of just dense information. Now, if you look at like a video or an audio signal, it's highly compressible. It's, for example, if you sample at a really high rate,
Starting point is 00:07:36 it's basically like it's very continuous. And so that means it's compressible. And it turns out that different types of models just have different inductive biases or like strains at modeling these things. The first types of models we were looking at were really good actually at modeling kind of these raw waveforms, raw pixels, things like that, but not as good at modeling text, and transformers are way better there.
Starting point is 00:08:00 Newer versions of these models like Mamba, which was the most recent one that's been out for a few months, that's a lot better at modeling the same types of data as transformers. Even there, there's subtler kind of trade-offs. But yeah, so one thing we kind of learn is that in general there's no free lunch there. So people think that like you can throw a transformer at like anything and it just works. Actually it doesn't really like if you try to throw it at like the raw pixel level or the raw sample level in audio waveforms, I think it doesn't work nearly as well. And so you have to be a little more deliberate about this. They really evolved hand in hand with the whole ecosystem of the whole training pipeline.
Starting point is 00:08:35 So it's like the places that people use transformers, the data has already kind of been processed in a way that helps the model. For example, people have been talking a lot about tokenization and how it's both extremely important but also like very counterintuitive, unnatural, and has its own issues. That's an example of something that's kind of developed hand in hand with the transformer architecture. And then when you kind of break away from these assumptions, then some of your modeling assumptions no longer hold and then some of these other models actually work better. Do you think of the advantages like natural fit that translates to quality for certain
Starting point is 00:09:11 data types, at least if we think about like, let's say, perceptual data or, I don't know, richer raw pre-cooked, not pre-cooked data, or like, you know, how do you think about efficiency or the other dimensions of, like, comparing the architectures? Yeah, so I guess so far we've talked kind of about the inductive bias or the fit for the data. Now, the other reason why we really cared about these is because of efficiency. So, yeah, maybe we should have led with that even. So people have yelled for a long time about this, like, quadratic scaling of transformers. One of the big advantages of these alternatives is the linear scaling.
Starting point is 00:09:44 So it just means that basically the time it takes the process, any new token is basically constant time for a current model. But for a transformer, it scales with the history that you've seen. This is obviously a huge advantage when you're really scaling to, like, lots of data. But it is actually something that's sort of a little bit of like a no-free lunch thing. The fact that the transformer is processing is taking longer to process things also means that there are things that it's better at modeling. So this is kind of what I was talking about. There's some like subtleties when you're talking about the trade-offs there. One way that we've sort of been thinking about it more is kind of thinking of,
Starting point is 00:10:18 so as I mentioned in the beginning, we think of these state-based models as kind of being fuzzy compressors. And I think maybe kind of like the bulk of the processing should be done there. But at the same time, it benefits from having some sort of like exact retrieval or some cache. And that's exactly what a transformer is. So one way to think about the transformer is that you're processing all this data. And it's just memorizing every single token it's seen, basically. I mean, some kind of representation of it, but it's literally remembering every single thing you've seen, and you're allowed to look back over all of it.
Starting point is 00:10:48 So that's why it's a lot slower, but that could be useful. But probably that shouldn't be what the bulk of your model is doing. Kind of the same way that like, I mean, again, using like very rough and probably not accurate analogies, but like the way a human brain is probably, most of the intelligence is in this, you know, it's the statefulness, this real time processing unit. but it is helpful to augment it with some sort of scratch pad or like lookup ability retrieval right and so these these ideas are actually quite synergistic and what people have recently been doing is finding that combining them into hybrid
Starting point is 00:11:22 models tends to work really really well seems like it's better than either of them individually so and interestingly kind of also maybe in line with that intuition people have found that the optimal ratio tends to be mostly as a some layers with a little bit of attention. So maybe a ratio of like 10 to 1. I know of at least probably like five groups that have independently verified that this is kind of the optimal ratio of things.
Starting point is 00:11:47 So yeah, I think it makes intuitive sense. Are there specific domains that you're seeing initial applications of these hybrid approaches? People are mostly using this on text because that's what everyone cares about. I think they've been investigated a bit on some other things. So I actually just heard from some collaborators today that they applied a mama-based model
Starting point is 00:12:08 on DNA modeling. They're basically bringing this idea of foundation models to DNA, which is kind of this new idea. Sure. I'm just wondering, like, because DNA is just, do you mean translation of DNA into proteins? Is it a protein folding model? No, so what you do is you can like pre-train a model on long DNA sequences and then fine tune it or use it downstream on things such as even just like DNA itself just encodes proteins and RNA that fold into certain molecular shapes. So that's why I was wondering what the problem set is. Yeah, there's a, bunch of them and i'm honestly not familiar with every single like a lot of the exact details yeah i used to be a biologist that's okay yeah yeah that's right that's my background uh well just like
Starting point is 00:12:48 current uh my brain can't handle biology okay yeah yeah yeah i think uh i get it i was just curious like what the specific application area oh there's this is not protein folding um per se but probably more like classification tasks like one thing that people are interested in is like detecting whether like um point mutations in DNA like what downstream effects that can have and stuff um but i'm not sure the exact. Okay. That's really cool. Yeah. And then one of the areas that you folks really started focusing on from a company perspective is tech to speech and voice. How did the research lead into that domain? We were kind of interested in showing the, like, the versatility and the actual use case of these models. So previously, it was done mostly in academic context. And at CMU, my students are
Starting point is 00:13:28 still kind of carrying that fundamental research forward. But we were pretty sure this would just work in a lot of like places that are interesting. And so like we think it will really work on all sorts of data, but kind of audio seemed like a pretty natural fit at first for some of the benefits. Like what we talked about is like much faster inference and so on. And so, um, doing like streaming settings and so on or natural fit, uh, we just thought this would kind of be like cool first application. Maybe Karin can kind of say more about. Yeah. I mean, there's so many applications that are interesting for these models because they're so generically useful. Um, and I think part of the challenge is, you know, sort of picking the ones that are most interesting and impact
Starting point is 00:14:07 long-term. Obviously, DNA is an interesting one, but you know, we don't personally have, it doesn't personally motivate us as much, or we would have worked on DNA. But I think, like, to me, like, the things that are interesting about multimodal data are really the, the places where SSMs have the most advantage, right, which is that you have data that's very, very information sparse. Compression is actually an advantage because you can stream data through the system really fast and then process it very quickly. So you update it. this sort of memory and so being able to handle very large context is kind of something that you want by design the other thing that's i think really interesting about audio is that i think commercially
Starting point is 00:14:48 there's a lot of very interesting applications where audio is starting to be important i think both on the voice agent side and like being able to kind of interact with your system in a more you know natural way like you would with a human is something that you know a lot of people want to be able to do is because there's a lot of places where you don't want to type into your computer you actually want to to talk to a human. Even on things like gaming and stuff, I think it's really interesting to think about how in the future, you will be essentially replacing graphics and rendering with essentially models that are outputting streams of data in real time. So I think the real time aspect of it is sort of really core to signals and sensor data and of which audio and
Starting point is 00:15:28 video are both very important. And I think audio in particular for us felt like a very natural place to start because of, well, I did some work on audio in my PhD. which was also something that was helpful. And I think there's just so many applications in audio that are really emerging right now that require these types of capabilities to exist. So I think that's particularly exciting. The other piece I think that's really interesting about SSMs
Starting point is 00:15:51 and that we're very excited about and are trying to do is the fact that the model is so efficient that you can hope to put it on smaller hardware and actually push inference closer to on device and to the edge. And I think that so far the theme in a lot of the models that people use has been data center, very big model, lots of compute, lots of GPUs being burned. I think that in the long run, like what we would hope is that you're pushing this closer and closer to the edge. You're actually using much less compute to new inference, and you're actually able to basically reproduce a capability that maybe, you know, today costs a million dollars in the data center, in $10 on a commodity. GPU or accelerator at the edge.
Starting point is 00:16:37 So I think that will be a very powerful shift because essentially what it means is that instead of running batch-oriented workloads in the cloud, you're basically pushing the processing and the intelligence to closer to where the data is being acquired and where the sensors are. And that's kind of what you want because, you know, if you think about like security cameras or any kind of like really sensor that's deployed, you really do want to be able to kind of sift through the information very quickly discard what's not useful because most of it isn't and then really kind of remember all the stuff that is and use that to do prediction and generation and understanding problems. So I think that's the theme. I think in general for what we're trying to build is
Starting point is 00:17:18 sort of the infrastructure to be able to train these models, make them run fast, and then bring them closer and closer to kind of be, you know, very edge oriented rather than cloud oriented. That's super interesting. I guess as part of the recent Apple announcement, they mentioned that a lot the models that they're running on device for three billion parameters so in size. And so you really have to focus on small models. So it is partly like in my head like it's always like two waves, right? Like there's like the first wave of companies that came was sort of really about like how do we figure out if we can do something interesting, right?
Starting point is 00:17:49 You know, nobody knew that scaling to this amount of data and compute would be interesting. So somebody took a bet there and did that and that was great because now we have these, all these great models. I think the second wave is always about efficiency and that's been the case in computing. as well where like you know now we have phones that can do so much you know powerful work and similarly i think for models what you would want is the smartest model ever but run it real cheap so you can run it repeatedly at scale you can like run it a hundred times where you might run it once today so a lot of that needs to needs to happen so i think that's what's interesting
Starting point is 00:18:22 so the 3b models are interesting i think but they're like they're still small and not very capable i think the question is like how do you make the capabilities just very very good but then have that low footprint on device, all of that. So I think that's where the technology that we've been kind of playing with for the last few years and then building with now has really huge potential to actually be kind of the default to run these workloads. Because I think that part of the challenge of the Transformers is the fact that if you try to take a LLM 7B and try to run it on your Mac and you open your profiler, you will notice that
Starting point is 00:18:56 the tokens per second goes down and the memory goes up. So I think that's obviously not great. and power and all these things aren't something that people have like, even I think in the data center, people are talking about it now in the last year or so, but now, you know, you'll start to see more of that conversation shift. And so I think that's kind of where we want to be, which is like, you know, the future will be more intelligence everywhere. And how do you kind of enable that piece, I think is kind of what we're excited about.
Starting point is 00:19:20 Yeah, I think we get really different applications if people start making that assumption, right? As you see, as you said, like we see it in the data center first where even as an investor betting on applications or, you know, full-stack companies that say, like, it costs a great deal to do a thousand calls per query right now, but we're just going to assume we can make it cheaper and we can focus on quality first. I think when you assume that you can run the model or if you make it possible to run the model on hardware everybody has already, and you just get very different applications continuously and that quality without that cost and ongoing computing a problem. Yeah, just the set of things you would want to be able to do. will change because in the same dollars you will be able to just do way more intelligent computation and so i think that's cool i think the way that you know you run games on your computer and the games are like very power like very very rich and interesting like how do you kind of bring models to that place where you know if you if you think about on device like i shouldn't be able to have a music model on device
Starting point is 00:20:20 and that should be my personal musician that i can like talk to and get it to play whatever i want and i don't need to you know go to the cloud to do it so i think these are all things that should be possible and just require like this type of infrastructure and work. Yeah. And Cartagena recently launched its initial sort of Texas speech product and it's really impressive in terms of performance and how fast you've gotten to ship something really is that performing. Can you tell us a little bit more about that launch and that product?
Starting point is 00:20:47 Yeah, I think it was sort of a natural, you know, transition for us to kind of now start thinking about how to put the technology to work because, you know, there was a lot of pre-work that happened and Albert continues to do the pre-work for the next set of things. But I think it's sort of like, how do you kind of build an efficient system that will allow you to do, for example, in this case, voice and audio generation. So I think the way we're thinking about it
Starting point is 00:21:09 is we're building these fairly general models inside the company that allow us to kind of do fairly generic tasks very efficiently. So in this case, it's audio generation and then being able to condition on things like text transcripts. The philosophy is like, oh, audio generation is a problem, needs to be very efficient, needs to be very real time. And so we need to kind of work
Starting point is 00:21:28 on the groundwork there to build the model stack. And then we need to have great training stack so that we can actually train a model that's high quality that people want to use and that actually has a really great experience. So when we were putting together the Sonic demo, it was sort of like we wanted to show that the tech that we were using really can kind of give you something that's really interesting.
Starting point is 00:21:49 And Texas speech is very interesting to me because people have been building Texas speed systems for last probably 30, 40 years. There's constantly improvements happening. And yet we're not. at ceiling, right? Like, there's still so much more you can do in this area. Can you actually talk about that? Because I think a lot of people would say, like, that feels a lot more solved in the last year, which is text audio generation.
Starting point is 00:22:09 Like, what's left between here and the ceiling in terms of thinking about the application experience? Yeah, I think, like, the way I think about it is, like, would I want to talk to this thing for more than 30 seconds? And if the answer is no, then it's not solved. And if the answer is yes, then it is solved. And I think most text to speed systems Kern's audio touring test, yeah. Are not that interesting yet. Yeah. You don't feel as engaged as you do when you're talking to a human.
Starting point is 00:22:33 I know there's obviously other reasons you talk to humans, which is, you know, sorry, I don't want to come across as crazy here, but yeah, there's a society that we live in. So we want to talk to people for that reason, obviously. But I do think the engagement that you have with these systems is not that high. When you're trying to build these things, you really kind of get so into the weeds on like, oh, I can't say this thing this way, and it's like so boring when it says it that way. And how do I control this part of it to say it like this, you know, the intonation? Are there specific dimensions that you look at from an eval perspective that you think are most important in terms of how you think about?
Starting point is 00:23:06 Yeah, evals for, you know, generation are generally challenging because they're qualitative and based on sort of, you know, the general perception of someone who looks at something and says, this is more interesting than this. And so there is some dimension to that. But I think for speech, like, you know, emotion is something that matters a lot because you want to be able to kind of control, you know, the way in which things are said. And I think the other piece that's really interesting is how speech is used to embody kind of the roles people play in society. So like different people speak in different ways because they have, you know, different jobs or work in different, you know, areas or live in different parts of the world. And that's sort of the nuance that I don't think any models really capture well, which is like, you know, if you're a nurse, you need to talk in a different way. you're a lawyer or if you're a judge or if you're a venture capitalist, you know, very different
Starting point is 00:23:57 forms of speech. The highest form of voice. So those are all very challenging, I would say. So it's not solved, is my claim. There's also an interesting point which is kind of like, even just for your basic evaluations of like, can your ASR system like recognize these words or can your generation, can your TTS system say this word? Even that is actually not quite a local problem and for a lot of hard things you actually need to really have the language understanding in order to process and figure out what is the right way of pronouncing this and so on and so actually to really get like perfect even just TTS or like speech-to-speech you actually really need to have like a model that
Starting point is 00:24:35 has more understanding like at least of the language but kind of like it's not really an isolated component anymore and so you have to start getting into these multimodal models just to even do one modality well and so that's kind of like somewhere where that we were kind of eyeing from the beginning as well, and we were kind of using this as an entry point into building out the stack toward all of that, and hopefully that's all going to help, it's going to help the audio as well, but also start getting other modalities into that. That's really cool.
Starting point is 00:25:02 I mean, I guess we've done so much pioneering key work on the SSM side. How is multimodality or speech really impacted how you've thought about the broader problem or has it, and it's more just the generic solutions are the ones that make sense? I don't think multimodality by itself has been kind of a driving more. motivation for this work because I kind of think of these space models I've been working on as like basic generic building blocks that can be used anywhere. So they certainly can be used in multimodal systems to good effect, I think. Different modalities have presented different challenges which has influenced the design of
Starting point is 00:25:33 these. But I always look for kind of like the most general purpose, fundamental kind of like building block that can be used everywhere. And so that's like multimodality is more of like a sort of a different set of challenges in terms of like, how are you applying the building blocks to that, but like you still use kind of the same techniques and they mostly work. Given that like versatility of model architecture, generality of the building block, like what's, what do you do next for Cartesia?
Starting point is 00:26:00 You focus on like the headroom for Sonic and audio, you work on other modalities. You know, I'll take that one. You know, we're obviously really excited about the Sonic work because I think it It kind of shows the first example of something that we're excited about, which is it's a real time model. You can run it really, really fast at low latencies, and it's capturing this idea that you want to generate a signal of some kind. So we're going to continue to obviously improve that piece. Also, you know, just generally things that folks want out of speech systems that need to get built that are orthogonal to the technology piece, which is, you know, being able to support lots of languages and just generally providing more controls.
Starting point is 00:26:42 controls, orthogonal access that's really important for generative models, which is, you know, how do you kind of add more controllability in general to the system so you can kind of get the desired output that you want. So that's obviously one focus for us is how to kind of put that piece in. Few things that we're doing that I think in the short term are really interesting. One is bringing Sonic more on device. You know, you can run the model real time in the cloud. Wouldn't it be cool if you could run it on your MacBook and it ran real time and it, you know, was just as good? I actually have a demo I can show there that I think is super cool. Over time, what we want to do is what Albert said, which is that, you know, audio benefits from text reasoning and, you know, the ability to kind of converse with these models
Starting point is 00:27:22 and actually have them understand what you're saying beyond just, you know, superficial understanding is very important. So what we want to do is enable that piece next, which is, you know, you should be able to have a conversation with this thing and actually be able to have it respond to you intelligently and reason over data and context in order to do that. And so Sonic, I think of as sort of the output piece of that in some sense, which is like, what does the response for that model look like? And then there's the input piece, which is ingesting audio natively into these models and doing that kind of thing.
Starting point is 00:27:49 Is the intention then to train a large-scale multimodal language model on your side as well? Yes, but, you know, we have our own sort of set of techniques that we're developing in order to be able to do that effectively. I think that I will maybe leave for another podcast. But I think, yeah, I think that is the intention at the end of the day is build a great multimodal model, but then make it really, really easy to run on device and make it really cheap to run.
Starting point is 00:28:16 And really focus on kind of the audio piece and making that as good as possible because I think that's sort of where, you know, the fidelity and the quality that you get from SSMs is just very different than what you can able to see. That's pretty amazing because it seems like a lot of the limitations right now in terms of different application areas or use cases or text to speech is basically the extra latency around trip
Starting point is 00:28:36 associated with pinging a language model in the middle. Yeah. As you go from speech to text to the text and then out. And so if you do have multimodality, then obviously that shrinks the time on the inference side dramatically and that has a huge impact in terms of the... Yeah, I think the latency is going to be a big theme there
Starting point is 00:28:50 because it's obviously quite painful to orchestrate multiple models to do this piece. And then I think also just the orchestration itself adds so much overhead. It turns what is, in my mind, something that the model should do into an engineering problem that requires so much, you know, orchestration
Starting point is 00:29:08 and just engineering work. It feels almost inelegant from the computer science perspective. Yeah, maybe that's, you know, some of the thematically the general bias here, which is, you know, the ineligent things we're trying to chip away at. In the end, all the systems go away and it's just one model. And then we also go away, apparently. People ask me, like, how do I treat my research problems? And I can't explain. My answer is just aesthetic.
Starting point is 00:29:35 It's just like there's something that find elegant and we're aesthetically pleasing about things. to me that's almost the most important thing and that's kind of driven a lot of these things too so like like i said like for how did ss come about in the first place it's just like i just felt like there was something like really like nice about it like elegant about it and um you just want to keep working on it uh man i'm continuing to try to like do that like find like the simple um nice solutions to hard problems uh but it's not always possible so al cartia we of course need to solve the actual like uh the engineering challenges and there's always going to be hairy things um but as much as I can, I'm always trying to strive to kind of like make everything simple, unified as
Starting point is 00:30:15 possible. That's great. Yeah, I remember, I can't for it, is it Erdos or somebody used to talk about certain theorems coming out of like God's book or something like that? Or so elegant. Yeah, I very much adhere to that idea. So it's called proofs from the book is what he would say. Yeah. And that's actually kind of thing that kind of guides a lot of the way that I like picking, choosing problems. And what you're referring to course, like in pure math. Sometimes you see like proofs or ideas that just feel like this is obviously just the right way of doing things. It's so elegant. It's so correct. Things are not, in machine learning world, things are often not nearly that clean. But you still can have still the same kind of concept, just, you know, maybe a different level of abstraction. But sometimes
Starting point is 00:31:01 certain approaches or something just seems like the right way of doing things. Unfortunately, this thing is also kind of like, it can be subjective. Yeah, sometimes I tell people this is just the right way of doing it, and I can't explain why. But maybe we should kind of have like one of our pillars should be about the book. So I can start saying this. Let's see the demo. Yeah, I'd love to show you.
Starting point is 00:31:25 Cool. Yeah, I have our model running on our standard issue Mac here. Basically, this is our text-to-speech model, Sonic, on our playground is running in the cloud. And so, you know, part of what I talked about earlier was how do you kind of bring this closer to on-device and edge? And I think the first place to start is your laptop and then hopefully bring it, shrink it down and bring it closer and closer to smaller footprint. So let me try running this. It's great to be on the No Priya's podcast today.
Starting point is 00:31:58 You know, we have the same feature set that's in the cloud but running on this. Prove it's real time and not Coup. Say you don't have to believe in God, but you have to believe in the book. I think that's the erdosh quote. Was that the quote? Let me grab a interesting voice for this one. Ordoches, where is Erdush from? Hungary.
Starting point is 00:32:15 Hungary. I mean, that's a default gas for any mathematician. Oh, yeah, sure. He's just assumed it. All right, I'm going to press enter. You don't have to believe in God. You have to believe in the God. That's pretty good.
Starting point is 00:32:28 Lancy is pretty good. Yeah, it works really fast and I think that's part of what I think gets me really excited, which is like, you know, it streams out audio instantly. I would talk to Ardosh on my laptop. Yeah, me too. That would be a great way to get inspired every morning. Yeah, I know. Yeah.
Starting point is 00:32:47 Yeah, that would be great. Your team is now, how many people? We are 15 people now. And eight interns. Sarah always gives me shit for this. It's a big intern class, yeah. That's amazing. Yeah, we have a lot of interns.
Starting point is 00:32:58 I really like interns. They're great. You know, they're excited. They want to do a cool thing, so, and yeah. And are there specific roles that you're currently hiring for adding up? Yeah, we are. Our hiring for model roles specifically, we're hiring across the engineering stack,
Starting point is 00:33:14 but really want to kind of build out our modeling team deeper. So always looking for great folks to come to team SSM and help us build the future. The rebellion. Yeah, the rebellion. Yeah, we used to actually call it. Yeah, yeah. What do we call it, overthrowing the empire?
Starting point is 00:33:30 Yeah, yeah, that was the theme during our PhDs. And yeah, I would love to continue to have folks inbound us and chat with us if they're excited about this technology and the use cases. A lot of exciting work to do, both research and bringing it to people. Yep. Find us on Twitter at No Pryor's Pod.
Starting point is 00:33:52 Subscribe to our YouTube channel if you want to see our faces, follow the show on Apple Podcasts, Spotify, or wherever you listen. That way you get a new episode every week. And sign up for emails or find transcripts for every episode at no-priars.com. Thank you.
