Everyday AI Podcast – An AI and ChatGPT Podcast - AI Can Finally Hear What You Actually Mean. What this unlocks

Starting point is 00:00:00 This is the Everyday AI Show, the everyday podcast where we simplify AI and bring its power to your fingertips. Listen daily for practical advice to boost your career, business, and everyday life. Meet Firefly AI Assistant, now live in Adobe Firefly, the all-in-one creative AI studio. Just describe what you want to create and the assistant handles the rest, orchestrating multi-step workflows across Photoshop, Premiere Express, and more in one conversational interface. You direct the outcome. The assistant accelerates execution. What's the difference between text and understanding tone?

Starting point is 00:00:51 Probably a lot. It's something that I think a lot of us overlook, especially when it comes to using AI. We just assume that, as an example, if we're talking to an AI, that they understand us. Well, maybe not. Maybe all they're seeing or actually understanding is words, or even, maybe worse, all they're looking at or understanding is a series of tokens.

Starting point is 00:01:19 And I think as we look into 2026 and beyond, something I've been personally very bullish on is actual voice AI. But it goes a step further than that, because like I said, there's a big difference between a model, maybe quote unquote, hearing what you're saying or understanding the words that you're saying versus understanding the tone. And now I think the technology is finally there that we can accomplish the latter. And what that unlocks for the everyday business leader is huge. So that's what we're going to be talking about today on Everyday AI. I'm excited for it. I hope you are too.

Starting point is 00:02:01 Let's get into it. If you're new here, welcome. Everyday AI, well, it's for you. It's a daily live stream podcast and free daily newsletter helping everyday business leaders like you and me keep up with the latest AI technology, how to make sense of it, and to leverage all the good stuff, ignore the boring stuff or the stuff that doesn't matter, and use it to grow our companies and careers. So it starts here with the unedited, unscripted, live stream podcast, but to take it to the next

Starting point is 00:02:25 level, please go to our website at YourEverydayAI.com. All right. And if you're looking for the daily AI news, that's going to be in the newsletter as well. All right. Enough of chit chatting with me. Let's bring on a real expert, someone that is building what I think is kind of the next frontier of AI technology, and that's voice enabled and AI that actually understands what we mean, not just what we say. So live stream audience, please help me welcome to the show.

Starting point is 00:02:53 Mike Pappas, the CEO and co-founder of Modulate. Mike, thanks so much for joining the Everyday AI show. Andrew, excited to be here. All right. So if you're an avid, like an avid everyday listener of the show, you've heard Modulate a little bit this week, especially. But Mike, for those that maybe aren't familiar, what does Modulate do, especially around voice? Yeah, so Modulate is a frontier AI developer focused on true voice understanding. So we actually got our start in the online gaming space. We work with Call of Duty, Grand Theft Auto, Rainbow Six Siege on voice moderation, understanding the difference between friends who are trash talking and having a good time with it, which we want to encourage, and that very thin difference that starts to make it actually very

Starting point is 00:03:38 damaging to someone who's not expecting those kinds of interactions. You can't do that based on transcriptions. You have to understand how people are actually experiencing the conversation. So that's what really drove us to push AI to the next level for this kind of understanding. These days, we're working with Fortune 500s across fraud and AI guardrails and all these different things. But the core of it all is we just want to push AI to actually hear us as humans and get the same meaning from those conversations that we can do ourselves. Yeah. And I think even just with how AI works under the hood, right, which we don't have to get too far into, but, you know, ultimately kind of like what I said there, you know, right now,

Starting point is 00:04:21 if you're chatting with the, you know, a quote unquote, you know, voice model from some of the major providers, all it's really doing is taking a transcription and, you know, assigning some tokens to it. But, you know, I'm a dork. I look at the tokenization of different words, right? And one example I always say is, like the word just. The word just can be tokenized at least like seven different ways, right? So how or maybe why is it so important to have an AI as things get more, you know, voice first or voice native? Why is it important to have an AI that actually understands what we mean, not just the words coming out of our mouths? Well, the, you know, the standard answer is you get a text message from a friend. You're running a little late to an event,

Starting point is 00:05:04 and the text message says something like, hey, you coming? And depending on that inflection that I just used, I could have made you feel anxious, I could have made you feel cared for, but you don't get any of that in the text message. And all of us have that experience of the sort of passive aggressive anxiety

Starting point is 00:05:18 of what does this friend actually mean here? Our text doesn't communicate nearly as much as voice does. We have infamous examples within the company of things from some of our gaming customers where someone will say the phrase, hey, come join my private room, which sounds totally ordinary in text. And then you hear the particular way they say that and you feel very personally unsafe.

Starting point is 00:05:39 And there's a lot of ways to articulate things that communicate so much more. And if we're not looking at that, then we're not actually understanding the experience that human beings are having. And how can an AI participate in those experiences or improve those experiences if it doesn't even understand what's going on? And I think that's important, especially, you know, as people maybe are using large language models differently than what they were originally intended for, right? I think it was maybe an OpenAI study that came out at the end of 2025,

Starting point is 00:06:14 really showed that people were using AI a lot more for personal relationships, therapy, things like that. Right, like, we don't have to dive into that, but, you know, like, I'm sure that you guys hear so many different use cases and kind of your example is a great illustration of that where the same words can mean very different things. I'm wondering if you can walk us through maybe a client of yours and just so we can

Starting point is 00:06:39 understand a use case, maybe outside of gaming, because the one that you, you know, gaming is a great use case. But, you know, maybe walk us through kind of a non-gaming customer that you've worked with and, you know, how the ability to understand what's actually happening outside of a transcript, how that's actually helped move the needle. Yeah, absolutely. So one of our first big customers outside of gaming was a Fortune 500 company in the food delivery space. And they had come to us to try to protect their drivers. It's shockingly frequent that you'll have a driver who's running a few minutes late on a delivery.

Starting point is 00:07:13 They'll get a call from an angry customer saying something along the lines of, I'm going to kill you. And then that driver says, I don't want to go to that person's front door anymore. So that was the initial use case, is help us understand when there's those kinds of aggressive emotions being portrayed and help identify calls that are worth the platform taking a closer look at. They came back to us a week later, and we asked, you know, how's the abuse detection going?

Starting point is 00:07:37 And they almost didn't even want to talk about it because they were so excited about how much fraud we were finding for them. And it turns out we had actually found five times more attempted scams against their drivers than their actual fraud detection tools. Because those fraud detection tools were looking at things like metadata of suspicious numbers or, you know, is a weird transaction taking place after the fact. But what we were doing was listening live to the call and saying things like, hey, this person is performing anger,

Starting point is 00:08:05 but they don't actually demonstrate authentic anger. Hey, this person is trying to bypass policies in ways which are very clear and they're trying to make these excuses. Even in some cases with fraud, it's something like, hey, they're apologizing for the baby in the background that's crying and that's why they're so urgent. That is a recording of a baby. We can tell. It's not a baby in the same room that you are and that's clearly a scam in and of itself. So it's all these different kinds of acoustic elements that come together. And that was actually one of our first big proof points because we realized these platforms, even, you know, retail or finance, who we work with a lot today, these spaces, you'd want to know what kind of fraud they're dealing with, what the prevalence is.

Starting point is 00:08:50 But no one knows how to just listen to their voice conversations. Everyone is guessing based on small samples. So just the ability to say we can cover the whole basis and give you that sense of the statistics is so powerful. Yeah. And I think it's important for our audience to know and understand that this isn't some, you know, something coming out of left field or something that's, this doesn't happen very often. It happens all the time. And the technology that people are using is very sophisticated, right, but also easy to use. Right. And an easy example, I think it was in the summer, it was actually the U.S. State Department was targeted, right, by a voice clone impersonating Marco Rubio, right? And like, from reading reports, it seemed like this got pretty far,

Starting point is 00:09:40 maybe farther than it should have. You know, Mike, I'm sure this is something that you guys deal with all the time, right? Just how easy it is for people to create voice clones and the whole deepfake thing, right? Can you talk a little bit about what companies need to be paying attention to, right? And how can they even tell, right? How like, and what should they maybe even be doing internally to make sure that this doesn't happen to them? Because I don't know what you can do, right? Yeah, I mean, synthetic voices are much more prominent and they're better than they've ever been and will keep getting better. Sam Altman sort of famously came out and said, don't even try to detect if they're real. It's completely impossible. But Sam Altman is famously a marketer.

Starting point is 00:10:22 And I would say as a technologist, that's at least not true yet. And I don't expect it to be true for some time. To humans, the way we hear, we're indexing on specific things. We can't hear with the same fidelity. We can't tell. At this point, the synthetic voices are good enough. We genuinely can't tell. You can't train a person to do it much better than chance.

Starting point is 00:10:45 But AI systems can. Technology can notice the discrepancies. And some of those discrepancies are the kind of obvious ones. There's a glitch or something like that. Some of them are more subtle: the way the technology is generating your voice is very authentic, but the room sound keeps changing. It's as if you were teleporting between different environments. Sometimes it's even, you know, the adversary tries to disguise some of those things by adding the sound of a subway in the background and adding a bunch of static. And that fools a lot of systems, but for Modulate, we know what real background

Starting point is 00:11:19 noise sounds like. And just as with the screaming baby, we can say, that's not real, that's not actually happening to you right now, and use it as further evidence. So there is this prevalent phenomenon. You're going to see more and more people, not just copying your CEO's voice, but copying, you know, day-to-day employees' voices. But it is possible to implement the right tools to be able to catch right off the bat in a matter of seconds. Something is wrong here. You need to be paying very close attention in this conversation. Yeah. So I know that you all did just come out with some new research and I'm wondering if that's where this comes into play.

Starting point is 00:11:56 So with the ensemble listening model, right, in the example that you just gave, is that kind of this new technology and the new research? Is that what kind of helps to be able to decipher like, hey, this, you know, this is why it's a deepfake and this is how we can, you know, layer this sound and really deconstruct it? It's the same principle, though synthetic voice detection is even only a corner of what the ELM sort of approach allows us to do. So for those who aren't familiar with the research, this idea of an ELM, the E stands for ensemble, as Jordan mentioned. So the idea is instead of one monolithic black box model, you have a number of different models that are doing different things. And in our case, we have models that are looking at emotional characteristics.

Starting point is 00:12:38 We have some that are looking at prosody. We have some that are looking at the timbre of your voice that implies whether you're synthetic or implies your age potentially. We have others that are looking at what you're saying, your behaviors like interruptions. And you need to have a way to combine all these different data points together, whether to make a decision of, is this a synthetic speaker, are they attempting to be fraudulent, or even sort of more complicated analyses. And so the big innovation in our ELM research was the ability to orchestrate these different kinds of models in a way that's dynamic, that can actually say, hey, because our synthetic voice detector is flagging pretty high, we actually want to trigger our noise detectors

Starting point is 00:13:19 to go to something that's a little more granular, a little bit more precise, because that's what's needed if we want to get even more accurate on synthetic things. And now that we're starting to see synthetic voice, let's take a closer look at some of the fraud behaviors that might be accompanying it. It's worth investing a little bit more time and energy looking at that. So the way we zero into a conversation, just as an actual trained analyst would be, is sort of top down. You start quickly surveying, what do I think are the major things that are happening here?
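The dynamic orchestration described above can be illustrated with a minimal Python sketch. Everything here is an assumption for illustration: the detector functions, the `SYNTHETIC_THRESHOLD` cutoff, and the dictionary-based "clip" are hypothetical stand-ins, not Modulate's actual models or API. The only idea taken from the conversation is that a cheap first pass decides whether the more granular detectors are worth running.

```python
from dataclasses import dataclass, field

# Hypothetical cutoff for "flagging pretty high"; not a value from the source.
SYNTHETIC_THRESHOLD = 0.7


@dataclass
class Analysis:
    scores: dict = field(default_factory=dict)  # detector name -> score in [0, 1]
    steps: list = field(default_factory=list)   # order in which detectors ran


def coarse_synthetic_score(clip: dict) -> float:
    """Stand-in for a fast synthetic-voice detector."""
    return clip.get("synthetic_hint", 0.0)


def fine_noise_score(clip: dict) -> float:
    """Stand-in for a slower, more granular background-noise check."""
    return clip.get("noise_inconsistency", 0.0)


def fraud_behavior_score(clip: dict) -> float:
    """Stand-in for a model that looks at fraud-like behaviors."""
    return clip.get("fraud_hint", 0.0)


def analyze(clip: dict) -> Analysis:
    result = Analysis()
    # Quick survey pass: runs on every clip.
    result.scores["synthetic"] = coarse_synthetic_score(clip)
    result.steps.append("synthetic")
    # Spend extra compute only when the first pass looks suspicious.
    if result.scores["synthetic"] >= SYNTHETIC_THRESHOLD:
        result.scores["noise"] = fine_noise_score(clip)
        result.steps.append("noise")
        result.scores["fraud"] = fraud_behavior_score(clip)
        result.steps.append("fraud")
    return result


suspicious = analyze({"synthetic_hint": 0.9, "noise_inconsistency": 0.8, "fraud_hint": 0.4})
ordinary = analyze({"synthetic_hint": 0.1})
print(suspicious.steps)
print(ordinary.steps)
```

In this toy version the suspicious clip triggers all three detectors while the ordinary clip stops after the first, which is the "top down" zeroing-in that the conversation describes. A real system would presumably use learned models and richer routing than a single threshold.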

Starting point is 00:13:52 And then you look more closely in real time as the conversation is happening at the elements that are going to help further inform your understanding of what's going on. And I really want to dive into this a little bit deeper. But before we do, I'm going to take a quick break here for a word from a sponsor that's very relevant to today's conversation. All right. So, Mike, you were just talking a little bit about the ELM and some of the new research that's gone in. So one thing I kind of thought about and let me know if I'm completely off base. So, you know, in a previous life, I took a lot of photos, right?

Starting point is 00:14:29 And I remember, you know, kind of the difference between a flat JPEG, right? And if you go to, you know, edit a flat JPEG, you know, you don't have a lot of control. But then there's kind of these raw files, right, where you can almost, there's all these different layers. You can individually pull out and inspect and tweak individual ones. So is that kind of how this, you know, ELM works. And if so, can you kind of explain some of the new use cases that this technology unlocks for, you know, everyday enterprises? Yeah, that concept of layers is very appropriate.

Starting point is 00:15:04 The way the ELM is trying to model a conversation, it's looking at each of these different components. And the key innovation, again, isn't that you can ask each of these questions. People for a long time have had tools to transcribe a conversation and tools to say, what's the emotion? But let's take a simple example. If the emotion is sarcastic and the comment is nice job, the meaning is not nice job. And if you're then, you know, asking an AI summarizer to explain what happened in the conversation or, you know, any other attempt to derive something out of what actually happened, your systems will not be able to connect the dots between the emotion sarcasm and the words that were said.

Starting point is 00:15:49 It's that extra layer of not only, hey, FYI, this was sarcastic, but we can feed that back. We can use that to inform our understanding when they said nice job, and that can color our interpretation of what happens next. So it's all continuous.
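The "nice job" example above can be made concrete with a small Python sketch. This is a toy under stated assumptions: the `Utterance` type, the phrase lexicon, and the sign-flip rule are all hypothetical and far simpler than any real system. It only shows the idea of feeding the emotion label back into the interpretation of the words, and of carrying that interpretation forward through the conversation.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    text: str
    emotion: str  # label from a hypothetical emotion model, e.g. "neutral" or "sarcastic"


# Toy lexicon; a real system would use learned models instead.
POSITIVE_PHRASES = {"nice job", "great work", "well done"}


def literal_sentiment(text: str) -> int:
    """Sentiment from the words alone: 1 for a known positive phrase, else 0."""
    return 1 if text.lower().strip(" .!") in POSITIVE_PHRASES else 0


def interpreted_sentiment(utt: Utterance) -> int:
    """Let the emotion label recolor the literal reading (sarcasm flips it)."""
    base = literal_sentiment(utt.text)
    return -base if utt.emotion == "sarcastic" else base


def running_interpretation(conversation):
    """Carry the interpreted sentiment forward so later turns see earlier context."""
    total = 0
    notes = []
    for utt in conversation:
        score = interpreted_sentiment(utt)
        total += score
        notes.append((utt.text, score, total))
    return notes


print(interpreted_sentiment(Utterance("Nice job.", "neutral")))
print(interpreted_sentiment(Utterance("Nice job.", "sarcastic")))
```

A transcript-only pipeline would score both utterances identically; the point of the sketch is that the second one comes out negative once the emotion signal is allowed to inform the words.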

Starting point is 00:16:06 It's all feeding back into itself instead of just being 5, 10, 100 completely independent characterizations of the conversation. That's a great way to kind of illustrate that. And I'm guessing, right, because I'm always thinking, you know, not in a, disruption in a bad way, right, but I'm always looking at different sectors, different departments, you know, that are maybe ripe for disruption in a good way, right? So we stop, you know,

Starting point is 00:16:33 stop doing our day-to-day work like it's, you know, 1920. And one thing I always think of, especially when it comes to voice AI, is customer service, right? I don't know anyone that likes, you know, waiting on hold for hours, and, you know, maybe even after that you're still not very happy with the support that you get, right? So I'm guessing that this is, you know, a space that you've heard a lot about and, you know, that your products are very, you know, crucial for these companies that want to do this. But can you tell us a little bit, just, you know, on the topic of today's show, you know, your example, right, being with a layer, you know, sarcasm with, you know, different tones of voice

Starting point is 00:17:17 and all these different things. What does this mean for services like customer service or departments like customer service? Yeah, customer service is a huge space that we spend a lot of time in here. So I'll focus on AI agents. We can talk about human agents too. But in the AI agent context,

Starting point is 00:17:35 when people talk to AIs, we don't talk normally. We immediately know we're talking to an AI and we regulate ourselves because we're so worried about being misunderstood. So there were some great studies that I saw done on this where, you know, if someone asked you, do you own your home? A human might respond to another human in any of hundreds of different ways. Oh, well, the bank hasn't repossessed it yet. If you know an AI is asking you if you own your home, you have like four or five possible answers.

Starting point is 00:18:06 Because you're so worried that if you say something off book, the AI is not going to be able to understand. So we are restricting ourselves to make it easier on the AI. And that's part of why the traditional experience talking to an AI agent feels so stifling. It doesn't feel authentic. It's in that uncanny valley of it's trying to feel natural, but it's not meeting that bar. Whereas introducing technology like Modulate's allows that AI to actually understand what's going on. So if you're starting to feel frustrated, it can hear it. It can notice it. It can try and resolve that frustration, or it can say, I'm sorry, I clearly can't solve this for you.

Starting point is 00:18:48 Let me escalate to a human. Right. Part of deploying these AI agents is you need to be able to trust that they're not going to go off the deep end when something goes wrong. So using a system like Modulate that can help them effectively introspect and notice when something is going wrong and can provide you trend analysis of what kinds of things are they doing across thousands and millions of calls, that's what creates the enterprise trust and conviction that allows people to actually deploy these agents at scale in the first place. Yeah, and I'm glad we went straight to that kind of example or use case because I know from personal conversations and just from, you know, hearing from a lot of our audience that this is something that business leaders are grappling with, right? Because before when it came to, you know, voice AI agents, it was all about latency, right?
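The escalation behavior Mike describes above (notice rising frustration, then hand the call to a human) can be sketched in a few lines of Python. This is a hedged illustration only: the `EscalationMonitor` class, the 0.6 threshold, the three-turn window, and the idea of a per-turn frustration score in [0, 1] are all assumptions made for the example, not details from the conversation or from any vendor's product.

```python
from collections import deque


class EscalationMonitor:
    """Tracks a rolling average of per-turn frustration scores (hypothetical)."""

    def __init__(self, threshold: float = 0.6, window: int = 3):
        self.threshold = threshold          # assumed cutoff for sustained frustration
        self.recent = deque(maxlen=window)  # keeps only the last `window` scores

    def observe(self, frustration: float) -> str:
        """Record one turn's frustration score and decide what the agent should do."""
        self.recent.append(frustration)
        window_full = len(self.recent) == self.recent.maxlen
        average = sum(self.recent) / len(self.recent)
        # Require a full window so one noisy reading does not trigger a handoff.
        if window_full and average >= self.threshold:
            return "escalate_to_human"
        return "continue"


monitor = EscalationMonitor()
for score in [0.2, 0.8, 0.9]:
    print(monitor.observe(score))
```

Averaging over a short window rather than reacting to a single reading is one possible design choice; it trades a slightly slower handoff for fewer false alarms. Logging each decision alongside the scores would also give the kind of trend data across many calls that the conversation mentions.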

Starting point is 00:19:38 There was too much of a delay and it felt, you know, not human. But then it was all about, you know, oh, our company's knowledge, it has to be easy. And before it was, you know, expensive RAG pipelines. And now it's, you know, with systems, whether you're talking about Google Gemini's, you know, their voice, OpenAI's, ElevenLabs. Now it's seemingly easier for any company to get, you know, pretty human sounding, low latency, AI agents that are connected to their knowledge. But you brought up a good thing about trust and then understanding and the tone. You know, so maybe can you walk us through?

Starting point is 00:20:15 What are those considerations for those people that are on the fence, right? Should I just go ahead and connect all my data and, you know, go put one of these agents live and see what happens? What's the right way to roll this out? Yeah, I mean, the three major considerations that we hear from people who are thinking about these voice AI agents, one is, again, just is it going to go rogue? We've talked to someone in the interviewing space

Starting point is 00:20:40 who we believe misprompted their agent to check if candidates were flexible. So the agent started asking people if they could try out yoga poses. Right. Right. This is apoccalate's level stuff, but it happens all the time because it's just so hard to organize. And so the first need for these customers is just to be able to say, hey, if the AI doesn't know, it's not going to hallucinate,

Starting point is 00:21:06 which is how LLMs are built. Or if it does hallucinate, we will get an alert right then and there and be able to escalate out of that chain so that we don't end up in that dark place. We've seen the courts will uphold if your AI hallucinates a policy and tells it to your customer, you are on the hook for that policy,

Starting point is 00:21:23 even if it's not yours. So this is a serious enterprise consideration. The second fear here is just about scale. If you're deploying this at massive scale, even if you trust it to mostly do a good job, what is it doing? How do you know? How do you find that out? Every AI agent system claims to have some kind of reporting or logging. But the way those systems work, they're not actually picking apart the specifics of what they saw in the conversation because they're too busy participating in it.

Starting point is 00:21:54 Having a system that sits on top of it and actually plucks out the key insights in a structured way makes it much, much easier for you to actually know what's going on, which then feeds into the third and final, which is compliance. If you need to be able to justify, why did you decide that Mike was probably fraudulent? You can't just say, I don't know, the magic box told me. You need to be able to explain what that logic is. ELMs are fundamentally explainable because we can look at the component models and we can tell you here are the things that contributed to that assessment and prove that it's not biased, prove that it's accurate and consistent in a way that black box models can't. Yeah. And I think probably the way that you just described it probably really resonated with a lot of people because I

Starting point is 00:22:40 get the appeal of being able to very easily go get an AI agent, right, like that can go and talk to people in real time, right? But you just hit it on the head right there, like trust, observability, all of these other things, especially when it's happening at scale, right? Like I'm wondering as, you know, someone, you know, both yourself personally and, you know, leading Modulate and, you know, helping to shape the future of this technology, what are some of the things, you know, beyond, you know, today, next week, next month that business leaders need to be thinking about when it comes to, you know, kind of voice enabled AI agents. I mean, this is such a prosaic answer, but cost. Right now so many organizations are coasting on someone high up having said, let's reserve a whole bunch of money to figure out this AI thing. And we're starting to see a lot of businesses run into that wall of, hey, we're hitting the end of that cash and we haven't been able to prove value yet. So I didn't list earlier as one of the big considerations cost because so far people have been much more interested in let's just prove what's possible.

Starting point is 00:23:55 but I think we're going to see a reckoning coming where people actually look at the economics of these systems and have to reimagine it. That's a boring answer though. So let me give a more fun technical one too. I think as much as people want to be thinking about how does the AI take the load off of the platform, there's also a version of the AI taking the load off of the customer.

Starting point is 00:24:21 where instead of me having to call my bank and wait on hold, I can have an AI do it. I can delegate to the AI, which creates a bunch of fun new challenges, like should the AI prioritize or should the bank prioritize my AI's call the way they would have prioritized my call as a human? Can they have two AIs talk to each other? At what point are we just recreating an API? There's a lot of like fundamental design questions of how does any of this impact your brand? What does it mean for you to try to build trust with your customer if your customer

Starting point is 00:24:55 is actually sending a delegate to interact with you instead of coming in directly. So I think there's a much bigger question, not just about voice as a mechanism for completing a transaction, but about voice being the emotional thing that builds brand trust, that builds relationships. And how do we make sure that in our haste to automate so many pieces of this, we're not actually crippling that kind of brand trust and loyalty that so many platforms rely on? So yeah, you just set off so many new questions in my head, but I can't keep you forever. But, you know, Mike, we've talked about a lot in today's conversation from, you know, deep fakes and guardrails around, you know, AI voice to, you know, fraud detection and different sectors being disrupted maybe in a good way. But as we wrap up here, I want to go back to kind of where we started and just ask you.

Starting point is 00:25:47 So now that, you know, through your guys' technology and some of the new advancements that you've come up with, AI can actually hear what we mean and, you know, it can be more than just looking at tokens and text. What does this unlock, right? Like, what should business owners be most excited about? I mean, I think the, I wish I had a punchier way to say it. But I think that the thing that they should be excited about is actually understanding their customers and being able to solve their customers' problems. Right. Like, that's the actual job of customer support. That's the actual job of anyone that's picking up the phone and talking to someone, you want to understand them. You want to build a relationship.

Starting point is 00:26:30 You want to be able to satisfy something. Right now, there are so many frictions in front of us. And we can talk about, hey, we can make AI agents understand people better. What about humans? What about all the sort of culture mismatches that we run into all the time where I don't understand what it is that you meant? I can't tell you the number of people I've talked to and said, I'd love to have a little bird on my shoulder that could tell me, hey, what they meant was this.

Starting point is 00:26:53 and here's how you can communicate what you want to communicate to them. I think that actually opens the door to much richer sort of worldwide conversations overall. So that's kind of me on my founder perch talking about large missions stuff, but that is what really excites me. And I think any platform that is thinking

Starting point is 00:27:14 just about how do I complete this transaction more efficiently is thinking a little bit too small. There's a much greater opportunity here to be saying, how can I use this technology, not just to complete that one transaction, but to enrich the relationship that I'm building with my customers and be someone that they can actually trust to solve their problems in a larger way. This was a great and fun conversation. And I think, you know, Mike, like as I talked about at the beginning, as, you know,

Starting point is 00:27:43 voice becomes more and more native and kind of the de facto interface, I think that you answered a lot of important questions that our audience is going to have, both in 2026 and beyond. So thank you so much for taking time out of your day to join the Everyday AI show. We really appreciate it. Thank you so much for having me. All right. And if you missed anything, y'all, we're going to be recapping it all in today's newsletter. So make sure if you haven't already to please go to YourEverydayAI.com. Thanks for tuning in. Hope to see you back tomorrow and every day for more Everyday AI. Thanks, y'all. Meet Firefly AI Assistant, now live in Adobe Firefly, the all-in-one creative AI studio.

Starting point is 00:28:22 Just describe what you want to create in your own words and the assistant handles the rest, orchestrating multi-step workflows across Adobe Creative Cloud apps, including Photoshop, Premiere Express, and more in one conversational interface. You direct the outcome while the assistant accelerates execution. Stay in control with the ability to step in and refine at any time. See it today at firefly.adobe.com. And that's a wrap for today's edition of Everyday AI. Thanks for joining us.

Starting point is 00:28:56 If you enjoyed this episode, please subscribe and leave us a rating. It helps keep us going. For a little more AI magic, visit YourEverydayAI.com and sign up to our daily newsletter so you don't get left behind. Go break some barriers and we'll see you next time.

