Everyday AI Podcast – An AI and ChatGPT Podcast - AI Can Finally Hear What You Actually Mean. What this unlocks
Episode Date: January 29, 2026

Your company’s goldmine? All those meetings and call recordings. It’s the fuel that AI needs. But here’s the big letdown: those call transcripts only pick up the words. Not what they mean. And the difference? Well… That can make all the difference. But some new technology might change what’s possible. Join us as we talk about it. AI Can Finally Hear What You Actually Mean. What this unlocks: an Everyday AI chat with Jordan Wilson and Modulate’s Mike Pappas.

Website: YourEverydayAI.com
Email The Show: info@youreverydayai.com

Topics Covered in This Episode:
Modulate Velma Voice Native AI Model Overview
Tone, Emotion, and Intent in Voice AI
Differentiating Text vs. True Voice Understanding
Real-World Voice AI Use Cases in Fraud Detection
Synthetic Voice and Deepfake Detection Techniques
Ensemble Listening Model (ELM) Technology Explained
Voice AI for Customer Service and Support
Trust, Compliance, and Observability in Voice AI Agents
Cost and Scalability Challenges for Voice AI
Future Impact of Voice AI on Customer Relationships

Timestamps:
00:00 "Modulate: AI That Understands Tone"
06:15 "AI Use Cases Beyond Gaming"
07:13 "Detecting Abuse and Fraud"
13:19 Dynamic Model Orchestration Innovation
16:22 "Context-Aware AI for Conversations"
17:44 "Voice AI Transforming Customer Service"
22:49 AI Accountability and Compliance Challenges
25:36 AI, Customers, and Brand Trust
28:05 "Enhancing Communication Through AI"

Keywords: Voice AI, voice native AI, voice understanding, tone detection AI, intent detection, emotional AI, prosody analysis, real-time fraud detection, synthetic voice detection, AI guardrails, deepfake detection, customer support AI, call analysis
Transcript
This is the Everyday AI Show, the everyday podcast where we simplify AI and bring its power to your fingertips.
Listen daily for practical advice to boost your career, business, and everyday life.
Meet Firefly AI Assistant, now live in Adobe Firefly, the all-in-one creative AI studio.
Just describe what you want to create and the assistant handles the rest,
orchestrating multi-step workflows across Photoshop, Premiere Express, and more in one conversational interface.
You direct the outcome.
The assistant accelerates execution.
What's the difference between text and understanding tone?
Probably a lot.
It's something that I think a lot of us overlook,
especially when it comes to using AI.
We just assume that as an example,
if we're talking to an AI, that they understand us.
Well, maybe not.
Maybe all they're seeing or actually understanding is words, or maybe even worse, all they're looking at or understanding is a series of tokens.
And I think as we look into 2026 and beyond, something I've been personally very bullish on is actual voice AI.
But it goes a step further than that, because like I said, there's a big difference between a model, quote unquote, hearing what you're saying, or understanding the words that you're saying, versus understanding the tone.
And now I think the technology is finally there that we can accomplish the latter.
And what that unlocks for the everyday business leader is huge.
So that's what we're going to be talking about today on Everyday AI.
I'm excited for it.
I hope you are too.
Let's get into it.
If you're new here, welcome. Everyday AI, well, it's for you.
It's a daily live stream podcast and free daily newsletter helping everyday business leaders
like you and me keep up with the latest AI technology, make sense of it, leverage all the good stuff, ignore the boring
stuff or the stuff that doesn't matter, and use it to grow our companies and careers.
So it starts here with the unedited, unscripted live stream podcast, but to take it to the next
level, please go to our website at YourEverydayAI.com.
All right.
And if you're looking for the daily AI news, that's going to be in the newsletter as well.
All right.
Enough of chit chatting with me.
Let's bring on a real expert, someone that is building what I think is kind of the next
frontier of AI technology, and that's voice-enabled AI that actually understands what we
mean, not just what we say. So live stream audience, please help me welcome to the show
Mike Pappas, the CEO and co-founder of Modulate. Mike, thanks so much for joining the Everyday AI show.
Jordan, excited to be here.
All right. So if you're an avid everyday listener of the
show, you've heard Modulate a little bit this week, especially. But Mike, for those that maybe
aren't familiar, what does Modulate do, especially around voice?
Yeah, so Modulate is a frontier
AI developer focused on true voice understanding. So we actually got our start in the online
gaming space. We work with Call of Duty, Grand Theft Auto, Rainbow Six Siege on voice moderation,
understanding the difference between friends who are trash talking and having a good time with it,
which we want to encourage, and that very thin difference that starts to make it actually very
damaging to someone who's not expecting those kinds of interactions. You can't do that based on
transcriptions. You have to understand how people are actually experiencing the conversation.
So that's what really drove us to push AI to the next level for this kind of understanding.
These days, we're working with Fortune 500s across fraud and AI guardrails and all these
different things. But the core of it all is we just want to push AI to actually hear us as
humans and get the same meaning from those conversations that we can do ourselves.
Yeah. And I think even just with how AI works under the hood, right, which we don't have to
get too far into, but, you know, ultimately kind of like what I said there, you know, right now,
if you're chatting with the, you know, a quote unquote, you know, voice model from some of the
major providers, all it's really doing is taking a transcription and, you know, assigning some tokens
to it. But, you know, I'm a dork. I look at the tokenization of
of different words, right? And one example I always say is, like the word just. The word just can be
tokenized at least like seven different ways, right? So how or maybe why is it so important to have an
AI as things get more, you know, voice first or voice native? Why is it important to have an AI that
actually understands what we mean, not just the words coming out of our mouths? Well, the, you know,
the standard answer is you get a text message from a friend. You're running a little late to an event.
and the text message says something like,
hey, you coming?
And depending on that inflection that I just used,
I could have made you feel anxious,
I could have made you feel cared for,
but you don't get any of that in the text message.
And all of us have that experience
of the sort of passive aggressive anxiety
of what does this friend actually mean here?
Our text doesn't communicate nearly as much as voice does.
We have infamous examples within the company
of things from some of our gaming customers
where someone will say the phrase,
hey, come join my private room,
which sounds totally ordinary in text.
And then you hear the particular way they say that and you feel very personally unsafe.
And there's a lot of ways to articulate things that communicate so much more. And if we're not
looking at that, then we're not actually understanding the experience that human beings are having.
And how can an AI participate in those experiences or improve those experiences if it doesn't even
understand what's going on?
And I think that's important, especially, you know, as people
maybe are using large language models differently
than what they were originally intended for, right?
I think it was maybe an OpenAI study
that came out at the end of 2025,
really showed that people were using AI a lot more
for personal relationships, therapy, things like that.
Right, like, we don't have to dive into that,
but, you know, like, I'm sure that you guys hear
so many different use cases
and kind of your example is a great illustration of that
where the same words can mean very
different things. I'm wondering if you can walk us through maybe a client of yours, just so we can
understand a use case, maybe outside of gaming, because, you know, gaming is a great
use case. But, you know, maybe walk us through kind of a non-gaming customer that you've worked with
and, you know, how the ability to understand what's actually happening outside of a transcript,
how that's actually helped move the needle.
Yeah, absolutely. So one of our first big customers
outside of gaming was a Fortune 500 company in the food delivery space.
And they had come to us to try to protect their drivers.
It's shockingly frequent that you'll have a driver who's running a few minutes late on a
delivery.
They'll get a call from an angry customer saying something along the lines of,
I'm going to kill you.
And then that driver says, I don't want to go to that person's front door anymore.
So that was the initial use case: help us understand when there's those kinds of
aggressive emotions being portrayed and help identify calls
that are worth the platform taking a closer look at.
They came back to us a week later,
and we asked, you know, how's the abuse detection going?
And they almost didn't even want to talk about it
because they were so excited about how much fraud we were finding for them.
And it turns out we had actually found five times more attempted scams
against their drivers than their actual fraud detection tools.
Because those fraud detection tools were looking at things like metadata of suspicious numbers
or, you know, is a weird transaction taking place after the fact.
But what we were doing was
listening live to the call and saying things like, hey, this person is performing anger,
but they don't actually demonstrate authentic anger. Hey, this person is trying to bypass policies
in ways which are very clear and they're trying to make these excuses. Even in some cases with
fraud, it's something like, hey, they're apologizing for the baby in the background that's crying
and that's why they're so urgent. That is a recording of a baby. We can tell. It's not a baby in the same
room that you are and that's clearly a scam in and of itself. So it's all these different kinds of
acoustic elements that come together. And that was actually one of our first big proof points, because
we realized these platforms, even, you know, retail or finance, who we work with a lot today,
in these spaces, you'd want to know what kind of fraud they're dealing with, what the prevalence is.
But no one knows how to just listen to their voice conversations. Everyone is guessing based on small
samples. So just the ability to say we can cover the whole base and give you that sense of
the statistics is so powerful.
Yeah. And I think it's important for our audience to know and
understand that this isn't, you know, something coming out of left field or something that
doesn't happen very often. It happens all the time. And the technology that people are using
is very sophisticated, right, but also easy to use. Right. And an easy example, I think it was in
the summer, it was actually the U.S. State Department was targeted, right, by a voice clone impersonating
Marco Rubio, right? And like, from reading reports, it seemed like this got pretty far,
maybe farther than it should have. You know, Mike, I'm sure this is something that you guys deal
with all the time, right? Just how easy it is for people to create voice clones and the whole
deepfake thing, right? Can you talk a little bit about what companies need to be paying attention
to, right? And how can they even tell, right? And what should they maybe even be doing
internally to make sure that this doesn't happen to them? Because I don't know what you can do, right?
Yeah, I mean, synthetic voices are much more prominent and they're better than they've ever been
and will keep getting better. Sam Altman sort of famously came out and said, don't even try to
detect if they're real. It's completely impossible. But Sam Altman is famously a marketer.
And I would say as a technologist, that's at least not true yet.
And I don't expect it to be true for some time.
To humans, the way we hear, we're indexing on specific things.
We can't hear with the same fidelity.
We can't tell.
At this point, the synthetic voices are good enough.
We genuinely can't tell.
You can't train a person to do it much better than chance.
But AI systems can.
Technology can notice the discrepancies.
And some of those discrepancies are the kind of
obvious ones. There's a glitch or something like that. Some of them are more subtle: the way
the technology is generating your voice is very authentic, but the room sound keeps changing. It's as if
you were teleporting between different environments. Sometimes it's even, you know, the adversary
tries to disguise some of those things by adding the sound of a subway in the background and adding
a bunch of static. And that fools a lot of systems, but for Modulate, we know what real background
noise sounds like. And just as with the screaming baby, we can say,
that's not real, that's not actually happening to you right now, and use it as further evidence.
So there is this prevalent phenomenon. You're going to see more and more people, not just
copying your CEO's voice, but copying, you know, day-to-day employees' voices.
But it is possible to implement the right tools to be able to catch, right off the bat, in a matter
of seconds, that something is wrong here. You need to be paying very close attention in this
conversation.
Yeah. So I know that you all did just
come out with some new research, and I'm wondering if that's where this comes into play.
So with the ensemble listening model, right, in the example that you just gave, is that kind of
this new technology and the new research? Is that what kind of helps to be able to decipher like,
hey, this, you know, this is why it's a deepfake and this is how we can, you know, layer this sound
and really deconstruct it?
It's the same principle, though synthetic voice detection is
only a corner of what the ELM sort of approach allows us to do. So for those who aren't familiar
with the research, this idea of an ELM, the E stands for ensemble, as Jordan mentioned. So the idea
is instead of one monolithic black box model, you have a number of different models that are doing
different things. And in our case, we have models that are looking at emotional characteristics.
We have some that are looking at prosody. We have some that are looking at the timbre of your voice
that implies whether you're synthetic or implies your age potentially. We have others that are looking
at what you're saying, your behaviors like interruptions. And you need to have a way to
combine all these different data points together, whether to make a decision of, is this a synthetic
speaker, are they attempting to be fraudulent, or even sort of more complicated analyses?
And so the big innovation in our ELM research was the ability to orchestrate these different
kinds of models in a way that's dynamic that can actually say, hey, because our synthetic
voice detector is flagging pretty high, we actually want to trigger our noise detectors
to go to something that's a little more granular, a little bit more precise,
because that's what's needed if we want to get even more accurate on synthetic
things. And now that we're starting to see synthetic voice,
let's take a closer look at some of the fraud behaviors that might be accompanying it.
It's worth investing a little bit more time and energy looking at that.
So the way we zero in on a conversation, just as an actual trained analyst would,
is sort of top down.
You start quickly surveying, what do I think are the major things that are happening here?
And then you look more closely in real time as the conversation is happening at the elements that are going to help further inform your understanding of what's going on.
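
The top-down flow described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Modulate's implementation: the `Orchestrator` class, the `feature` helper, the detector names, the feature keys, and the 0.7 threshold are all hypothetical, and the "detectors" simply read precomputed numbers where a real system would run trained models on audio.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# A detector takes precomputed features for one audio chunk and returns a
# score in [0, 1]. Real detectors would be trained models; these are stand-ins.
Detector = Callable[[Dict[str, float]], float]


@dataclass
class Orchestrator:
    """Runs cheap coarse detectors first, then finer ones only when warranted."""
    coarse: Dict[str, Detector]
    # Maps a coarse detector name to the finer detectors it should trigger.
    follow_ups: Dict[str, Dict[str, Detector]] = field(default_factory=dict)
    threshold: float = 0.7  # hypothetical trigger level

    def analyze(self, chunk: Dict[str, float]) -> List[Tuple[str, float]]:
        findings: List[Tuple[str, float]] = []
        for name, detect in self.coarse.items():
            score = detect(chunk)
            findings.append((name, score))
            # A high coarse score justifies spending more compute on detail.
            if score >= self.threshold:
                for fine_name, fine_detect in self.follow_ups.get(name, {}).items():
                    findings.append((fine_name, fine_detect(chunk)))
        return findings


def feature(key: str) -> Detector:
    """Stand-in detector that reads one precomputed feature from the chunk."""
    return lambda chunk: chunk.get(key, 0.0)


orchestrator = Orchestrator(
    coarse={
        "synthetic_voice": feature("synthetic_hint"),
        "anger": feature("anger_hint"),
    },
    follow_ups={
        "synthetic_voice": {
            "fine_grained_noise": feature("room_tone_variance"),
            "fraud_behavior": feature("policy_bypass_hint"),
        }
    },
)

calm_call = {"synthetic_hint": 0.1, "anger_hint": 0.2}
suspicious_call = {
    "synthetic_hint": 0.9,
    "anger_hint": 0.2,
    "room_tone_variance": 0.8,
    "policy_bypass_hint": 0.6,
}

print([name for name, _ in orchestrator.analyze(calm_call)])
print([name for name, _ in orchestrator.analyze(suspicious_call)])
```

On the calm call only the two coarse detectors run. On the suspicious call, the high synthetic-voice score also triggers the finer noise and fraud-behavior checks, which is the "look more closely where it matters" behavior described above.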
And I really want to dive into this a little bit deeper.
But before we do, I'm going to take a quick break here for a word from a sponsor that's very relevant to today's conversation.
All right.
So, Mike, you were just talking a little bit about the ELM and some of the
new research that's gone in.
So one thing I kind of thought about and let me know if I'm completely off base.
So, you know, in a previous life, I took a lot of photos, right?
And I remember, you know, kind of the difference between a flat JPEG, right?
And if you go to, you know, edit a flat JPEG, you know, you don't have a lot of control.
But then there's kind of these raw files, right, where you can almost, there's all these
different layers.
You can individually pull out and inspect and tweak individual ones.
So is that kind of how this, you know,
ELM works? And if so, can you kind of explain some of the new use cases that this technology
unlocks for, you know, everyday enterprises?
Yeah, that concept of layers is very appropriate.
The way the ELM is trying to model a conversation, it's looking at each of these different
components. And the key innovation, again, isn't that you can ask each of these questions.
People for a long time have had tools to transcribe a conversation and tools to say, what's
the emotion? But let's take a simple example. If the emotion is sarcastic and the comment is "nice job,"
the meaning is not "nice job." And if you're then, you know, asking an AI summarizer to explain
what happened in the conversation or, you know, any other attempt to derive something out of what
actually happened, your systems will not be able to connect the dots between the emotion sarcasm
and the words that were said.
It's that extra layer of not only,
hey, FYI, this was sarcastic,
but we can feed that back.
We can use that to inform our understanding
when they said nice job,
and that can color our interpretation
of what happens next.
So it's all continuous.
It's all feeding back into itself
instead of just being 5, 10,
100 completely independent
characterizations of the conversation.
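
The "nice job" example can be made concrete with a toy Python sketch. It is a deliberately simplified illustration, not how any production system works: the `Utterance` type, the phrase list, and the rule that sarcasm flips a positive reading are all assumptions made up for this example, and the emotion label is taken as given from some hypothetical upstream model.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    text: str
    emotion: str  # label from a hypothetical emotion model, e.g. "neutral" or "sarcastic"


# Tiny illustrative phrase list; a real system would use a trained model.
PRAISE_PHRASES = ("nice job", "great work", "well done")


def surface_sentiment(text: str) -> str:
    """Transcript-only reading: positive if the words contain a praise phrase."""
    lowered = text.lower()
    return "positive" if any(p in lowered for p in PRAISE_PHRASES) else "neutral"


def interpret(utterance: Utterance) -> str:
    """Feed the emotion label back into the reading of the words."""
    sentiment = surface_sentiment(utterance.text)
    if utterance.emotion == "sarcastic" and sentiment == "positive":
        # Same words, opposite meaning once tone is taken into account.
        return "negative"
    return sentiment


sincere = Utterance("Nice job.", "neutral")
sarcastic = Utterance("Nice job.", "sarcastic")

print(surface_sentiment(sarcastic.text), interpret(sarcastic))
print(surface_sentiment(sincere.text), interpret(sincere))
```

A summarizer that only sees `surface_sentiment` reads both utterances as praise. Only the combined `interpret` step distinguishes them, which is the difference between independent characterizations and ones that feed back into each other.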
That's a great way to kind of
illustrate that. And I'm always thinking, you know, not
disruption in a bad way, right, but I'm always looking at different sectors, different departments,
you know, that are maybe ripe for disruption in a good way, right? So we stop, you know,
doing our day-to-day work like it's, you know, 1920. And one thing I always think of,
especially when it comes to voice AI, is customer service, right? I don't know anyone that likes,
you know, waiting on hold for
hours, and, you know, maybe even after that you're still not very happy with the support that you get,
right? So I'm guessing that this is, you know, a space that you've heard a lot about and, you know,
that your products are very, you know, crucial for these companies that want to do this. But
can you tell us a little bit, just, you know, on the topic of today's show, you know, your example, right,
being able to layer, you know, sarcasm with, you know, different tones of voice
and all these different things.
What does this mean for services like customer service
or departments like customer service?
Yeah, customer service is a huge space
that we spend a lot of time in here.
So I'll focus on AI agents.
We can talk about human agents too.
But in the AI agent context,
when people talk to AIs, we don't talk normally.
We immediately know we're talking to an AI
and we regulate ourselves
because we're so worried about being misunderstood.
So there were some great studies that I saw done on this where, you know, if someone asked you, do you own your home?
A human might respond to another human in any of hundreds of different ways.
Oh, well, the bank hasn't repossessed it yet.
If you know an AI is asking you if you own your home, you have like four or five possible answers.
Because you're so worried that if you say something off book, the AI is not going to be able to understand.
So we are restricting ourselves to make it
easier on the AI. And that's part of why the traditional experience talking to an AI agent feels so
stifling. It doesn't feel authentic. It's in that uncanny valley of it's trying to feel natural,
but it's not meeting that bar. Whereas introducing technology like Modulate's allows that AI to
actually understand what's going on. So if you're starting to feel frustrated, it can hear it,
it can notice it.
It can try and resolve that frustration, or it can say, I'm sorry, I clearly can't solve this for you.
Let me escalate to a human.
Right.
Part of deploying these AI agents is you need to be able to trust that they're not going to go off the deep end when something goes wrong.
So using a system like Modulate that can help them effectively introspect and notice when something is going wrong
and can provide you trend analysis of what kinds of things are they doing across thousands and millions of calls,
that's what creates the enterprise trust and conviction that allows people to actually deploy these agents at scale in the first place.
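
As a rough sketch of the "notice the frustration, then escalate to a human" behavior, here is a minimal Python monitor. Everything in it is an assumption for illustration: the `EscalationMonitor` name, the three-turn window, the 0.6 threshold, and the idea that some upstream voice model supplies a per-turn frustration score between 0 and 1.

```python
from collections import deque
from typing import Deque


class EscalationMonitor:
    """Tracks recent per-turn frustration scores and decides when to hand off."""

    def __init__(self, window: int = 3, threshold: float = 0.6) -> None:
        # Only the most recent `window` scores are kept.
        self.scores: Deque[float] = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, frustration: float) -> str:
        """Record one turn's score and return the action the agent should take."""
        self.scores.append(frustration)
        window_full = len(self.scores) == self.scores.maxlen
        average = sum(self.scores) / len(self.scores)
        # Wait for a full window so one noisy turn does not trigger a handoff.
        if window_full and average >= self.threshold:
            return "escalate_to_human"
        return "continue"


monitor = EscalationMonitor()
# Hypothetical frustration scores for five consecutive caller turns.
actions = [monitor.observe(score) for score in [0.2, 0.5, 0.7, 0.9, 0.9]]
print(actions)
```

With these made-up scores the agent continues for the first three turns and escalates on the fourth, once the rolling average crosses the threshold. Logging each decision alongside the scores would also give the kind of trend data across many calls that the conversation mentions.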
Yeah, and I'm glad we went straight to that kind of example or use case because I know from personal conversations and just from, you know, hearing from a lot of our audience that this is something that business leaders are grappling with, right?
Because before when it came to, you know, voice AI agents, it was all about latency, right?
There was too much of a delay and it felt, you know, not human.
But then it was all about, you know, oh, our company's knowledge, it has to be easy.
And before it was, you know, expensive RAG pipelines.
And now, you know, with systems, whether you're talking about Google Gemini's, you know, their voice, OpenAI's, ElevenLabs,
now it's seemingly easier for any company to get, you know, pretty human-sounding, low-latency AI agents that are connected to their knowledge.
But you brought up a good thing about trust
and then understanding and the tone.
You know, so maybe can you walk us through?
What are those considerations for those people that are on the fence, right?
Should I just go ahead and connect all my data and, you know,
go put one of these agents live and see what happens?
What's the right way to roll this out?
Yeah, I mean, the three major considerations that we hear from people
who are thinking about these voice AI agents,
one is, again, just is it going to go rogue?
We've talked to someone in the interviewing space
who we believe misprompted their agent to check if candidates were flexible.
So the agent started asking people if they could try out yoga poses.
Right.
Right.
This is apoccalate's level stuff,
but it happens all the time because it's just so hard to organize.
And so the first need for these customers is just to be able to say,
hey, if the AI doesn't know, it's not going to hallucinate,
which is how LLMs are built.
Or if it does hallucinate, we will get an alert right then and there
and be able to escalate out of that chain
so that we don't end up in that dark place.
We've seen the courts will uphold
if your AI hallucinates a policy
and tells it to your customer,
you are on the hook for that policy,
even if it's not yours.
So this is a serious enterprise consideration.
The second fear here is just about scale.
If you're deploying this at massive scale,
even if you trust it to mostly do a good job,
what is it doing? How do you know? How do you find that out? Every AI agent system claims to have some
kind of reporting or logging. But the way those systems work, they're not actually picking apart the
specifics of what they saw in the conversation because they're too busy participating in it.
Having a system that sits on top of it and actually plucks out the key insights in a structured way
makes it much, much easier for you to actually know what's going on, which then feeds into the third and final,
which is compliance. If you need to be able to justify, why did you decide that Mike was probably
fraudulent? You can't just say, I don't know, the magic box told me. You need to be able to explain
what that logic is. ELMs are fundamentally explainable because we can look at the component models
and we can tell you here are the things that contributed to that assessment and prove that it's not biased,
prove that it's accurate and consistent in a way that black box models can't.
Yeah. And I think probably
the way that you just described it probably really resonated with a lot of people because I
get the appeal of being able to very easily go get an AI agent, right, like that can go and talk
to people in real time, right? But you just hit it on the head right there, like trust,
observability, all of these other things, especially when it's happening at scale, right?
Like, I'm wondering, as, you know, someone, both yourself personally and, you know, leading Modulate and helping to shape the future of this technology, what are some of the things, you know, beyond today, next week, next month, that business leaders need to be thinking about when it comes to, you know, kind of voice-enabled AI agents?
I mean, this is such a prosaic answer, but cost.
Right now so many organizations are coasting on someone high up having said, let's reserve a whole bunch of money to figure out this AI thing.
And we're starting to see a lot of businesses run into that wall of, hey, we're hitting the end of that cash and we haven't been able to prove value yet.
So I didn't list cost earlier as one of the big considerations, because so far people have been much more interested in let's just prove what's possible,
but I think we're going to see a reckoning coming
where people actually look at the economics of these systems
and have to reimagine it.
That's a boring answer though.
So let me give a more fun technical one too.
I think as much as people want to be thinking about
how does the AI take the load off of the platform,
there's also a version of the AI taking the load off of the customer.
And we're already seeing applications come up today
where instead of me having to call
my bank and wait on hold, I can have an AI do it.
I can delegate to the AI, which creates a bunch of fun new challenges, like should the
bank prioritize my AI's call the way they would have prioritized my call
as a human? Can they have two AIs talk to each other? At what point are we just recreating an API?
There's a lot of, like, fundamental design questions of how does any of this impact your brand?
What does it mean for you to try to build trust with your customer if your customer
is actually sending a delegate to interact with you instead of coming in directly?
So I think there's a much bigger question, not just about voice as a mechanism for completing
a transaction, but about voice being the emotional thing that builds brand trust, that builds
relationships. And how do we make sure that in our haste to automate so many pieces of this,
we're not actually crippling that kind of brand trust and loyalty that so many platforms rely on?
So yeah, you just set off so many new questions in my head, but I can't keep you forever.
But, you know, Mike, we've talked about a lot in today's conversation, from, you know, deepfakes and guardrails around, you know, AI voice to, you know, fraud detection and different sectors being disrupted, maybe in a good way.
But as we wrap up here, I want to go back to kind of where we started and just ask you.
So now that, you know, through your guys' technology and some of the new advancements that you've come up with,
AI can actually hear what we mean and, you know, it can be more than just looking at tokens and
text. What does this unlock, right? Like, what should business owners be most excited about?
I mean, I wish I had a punchier way to say it. But I think that the thing that
they should be excited about is actually understanding their customers and being able to solve their
customers' problems. Right. Like, that's the actual job of customer support. That's the actual job of
anyone that's picking up the phone and talking to someone, you want to understand them.
You want to build a relationship.
You want to be able to satisfy something.
Right now, there are so many frictions in front of us.
And we can talk about, hey, we can make AI agents understand people better.
What about humans?
What about all the sort of culture mismatches that we run into all the time where I don't
understand what it is that you meant?
I can't tell you the number of people I've talked to who've said, I'd love to have a little
bird on my shoulder that could tell me, hey, what they meant was this,
and here's how you can communicate
what you want to communicate to them.
I think that actually opens the door
to much richer sort of worldwide conversations overall.
So that's kind of me on my founder perch
talking about large mission stuff,
but that is what really excites me.
And I think any platform that is thinking
just about how do I complete this transaction more efficiently
is thinking a little bit too small.
There's a much greater opportunity here
to be saying, how can I use this technology, not just to complete that one transaction,
but to enrich the relationship that I'm building with my customers and be someone that they can
actually trust to solve their problems in a larger way.
This was a great and fun conversation.
And I think, you know, Mike, like as I talked about at the beginning, as, you know,
voice becomes more and more native and kind of the de facto interface, I think that you
answered a lot of important questions that our audience is going to have, both in 2026
and beyond. So thank you so much for taking time out of your day to join the Everyday AI show.
We really appreciate it.
Thank you so much for having me.
All right. And if you miss anything, y'all, we're going to be recapping it all in today's newsletter.
So make sure, if you haven't already, to please go to YourEverydayAI.com. Thanks for tuning in.
Hope to see you back tomorrow and every day for more Everyday AI. Thanks, y'all.
Meet Firefly AI Assistant, now live in Adobe Firefly, the all-in-one creative AI studio.
Just describe what you want to create in your own words and the assistant handles the rest,
orchestrating multi-step workflows across Adobe Creative Cloud apps,
including Photoshop, Premiere Express, and more in one conversational interface.
You direct the outcome while the assistant accelerates execution.
Stay in control with the ability to step in and refine at any time.
See it today at firefly.adobe.com.
And that's a wrap for today's edition of Everyday AI.
Thanks for joining us.
If you enjoyed this episode, please subscribe and leave us a rating.
It helps keep us going.
For a little more AI magic, visit YourEverydayAI.com and sign up to our daily newsletter so you don't get left behind.
Go break some barriers and we'll see you next time.
