Everyday AI Podcast – An AI and ChatGPT Podcast - EP 435: How 50X cheaper & faster AI transcription is changing enterprise work
Episode Date: January 8, 2025

Meetings. Speeches. Quick thoughts to self. Those words are more than words. That's your company's secret sauce. Philip Kiely, Head of Developer Relations at Baseten, joins us to discuss.

Newsletter: Sign up for our free daily newsletter
More on this Episode: Episode Page
Join the discussion: Ask Jordan and Philip questions on AI transcription
Upcoming Episodes: Check out the upcoming Everyday AI Livestream lineup
Website: YourEverydayAI.com
Email The Show: info@youreverydayai.com
Connect with Jordan on LinkedIn

Topics Covered in This Episode:
1. AI Transcription Benefits
2. Whisper Model by OpenAI
3. Cost of Transcription
4. Business Applications for AI Transcription

Timestamps:
00:00 Conversations are gold; AI makes them valuable.
03:56 NVIDIA advances exceed Moore's Law; Apple's AI inaccurate.
09:48 Text transcription technology error-prone; manual transcription necessary.
11:19 Whisper V3: Low error rate, multilingual accuracy.
14:58 Whisper rapidly transcribes audio with high efficiency.
17:26 Emotion inflection crucial for text-to-speech synthesis.
23:58 AI transcriptions need human verification for accuracy.
25:35 Chain cheap AI models for efficient calls.
30:53 On-device AI less powerful than cloud AI.
33:07 Build prototypes now; technology improving rapidly.

Keywords:
Whisper by OpenAI, Automatic Speech Recognition, Open-source ASR, Accuracy, Multilingual ASR, MIT licensed, Amazon Transcribe, Whisper V3 Turbo, Live transcription, Speech inflection, ChatGPT, Philip Kiely, Jordan Wilson, Everyday AI podcast, Unstructured data, Anthropic funding, NVIDIA AI advancements, Apple AI alerts, AI transcription, Base 10, Searchable data, AI infrastructure platform, AI cost efficiency, Wearable technology, Voice control, On-device inference, Cloud inference, Speech synthesis, Business applications of transcription, Future of work

Send Everyday AI and Jordan a text message.
Transcript
This is the Everyday AI Show, the everyday podcast where we simplify AI and bring its power to your fingertips.
Listen daily for practical advice to boost your career, business, and everyday life.
Every word that you say, every meeting, every speech, that's gold.
I think so oftentimes when we get caught up in implementing generative AI in our business,
we think about other large language models that exist, right?
And we think about like, oh, we're limited by their training data, you know,
hey, hopefully these models get better.
But what about your words?
What about all of those meetings?
What about that big seminar that you're speaking at?
That is unstructured gold.
I think something that we don't talk about enough on this show or in general is how the words that we speak, the conversations that we have, how valuable those are.
And how the AI surrounding that is getting cheaper,
faster, more accurate, and what that really unlocks for businesses of all sizes.
All right, I'm excited to talk about that and a lot more today on Everyday AI.
What's going on, y'all?
My name's Jordan Wilson.
I'm the host of Everyday AI.
This thing's for you.
It is your daily live stream podcast and free daily newsletter, helping people like you and me,
us everyday people, catch up with everything that's happening in the world of AI
and how we can use all this information to grow our companies and our careers.
Is that you?
If so, welcome home.
And your other home is our website, YourEverydayAI.com.
So if you find value in today's conversation with our guests,
we're going to be recapping and sharing a lot more insights in our daily newsletter,
as well as keeping you up with everything else that's happening in the world of AI.
Also, there's like, I don't know, a thousand hours of audio content, text on there,
exclusive interviews from the smartest people in AI in the world,
all for free on our website.
All right, before we get started, let's first go over the AI news.
So Anthropic is set to close a $2 billion funding round as its valuation soars to $60 billion.
So Anthropic, one of the biggest startups in the generative AI space, is reportedly nearing the completion of a $2 billion funding round led by Lightspeed Venture Partners.
So this investment will significantly increase its valuation from $18 billion last year to an impressive $60
billion. So the latest funding round is part of a broader $6 billion initiative for
Anthropic, followed by an earlier $4 billion investment from Amazon. So yeah, they've,
I think, raked in like $8 billion in commitments so far in this round. So Anthropic's annualized
revenue has reached approximately $875 million, driven by its model of selling access to its
advanced API or sorry, its advanced AI systems to enterprises and through platforms like Amazon
Web Services.
So who knows, maybe with this extra cash that Anthropic just, you know, put in its pocket,
maybe their rate limits will go from unusable to kind of usable.
We'll see.
All right.
Next, Nvidia CEO Jensen Huang is claiming that his new AI chip performance on his GPUs
surpasses Moore's law.
Yeah, we're breaking science, breaking science in the face.
So the Nvidia CEO stated in an interview that their latest data center super chip is over 30 times faster for AI inference workloads compared to its predecessor, which could significantly lower the cost of running AI models.
He emphasized that by innovating across the entire stack, architecture, chip design, design systems, libraries, algorithms, etc., Nvidia can achieve advancements at a pace that exceeds
Moore's law. So Huang introduced the concept of hyper Moore's law. Yeah, now we got to learn new
scaling laws, suggesting that AI development is not slowing down, but is instead governed by
three active scaling laws: pre-training, post-training, and test-time compute. So Huang also
claimed that Nvidia's AI chips today are 1,000 times better than those produced a decade ago,
indicating a rapid evolution in technology that could benefit various industries. All right, last but not
least Apple is facing a ton of backlash over its inaccurate AI news alerts and has promised an update.
So Apple is under scrutiny after its AI feature that essentially summarizes news alerts is generating
some false and misleading news headlines, raising concerns about the accuracy of information in its new Apple
Intelligence. So Apple announced that it will release a software update in the coming weeks to clarify
when news notifications are generated by its AI system, known as Apple Intelligence.
So the misleading alerts have sparked criticism from various media organizations,
including the BBC and ProPublica,
which reported similar inaccuracies in AI generated summaries of their content.
All right.
A lot more on those stories and everything else you need to stay ahead,
not just keep up, stay ahead on our website.
So make sure you go check that out and sign up at YourEverydayAI.com.
All right, enough chit-chat. Let's get to the bulk of today's conversation.
AI transcription. You probably don't think about it, but it is a boon for business. So I'm excited to
have this conversation. Hey, live stream audience, help me welcome to our show, we have Philip
Kiely, the head of developer relations at Baseten. Philip, thank you so much for joining the
Everyday AI show. Hey, Jordan. Thanks for having me. Super excited to be here.
Let's chat about transcriptions. Before we
do, can you tell everyone just a little bit about Baseten, what it is you all do? Absolutely. So
Baseten is an AI infrastructure platform. We take open source, fine-tuned, and completely custom
models for our customers, and we help them deploy those models on worldwide auto-scaling GPU
infrastructure. We also assist with the model performance efforts so that we can get them,
you know, lower speeds, higher throughput, lower cost, better quality. Our customers are AI-native
startups and enterprises like Writer, Bland, Patreon. And one thing that we've been working a lot with
recently is the Whisper model. We recently released the world's fastest, most accurate, and cheapest
Whisper inference. So let's, I mean, I do want to dive into Whisper, and I'm sure it's something
that a lot of our audience is familiar with. But before I even go there, what's the main benefits,
right? Like, you know, when people talk about transcription, and I kind of started the show out
on it. I'm a firm believer, right? Every word I speak on this podcast gets instantly transcribed and
fed into a large language model. But what's the benefit of capturing your company's words and
using those? I think sometimes people just overlook it. Yeah. Well, it's just another stream of data.
So if you think about all of the YouTube videos in existence, all of the podcasts in existence,
all of the phone calls that maybe have been made into your company's call center.
There's just tons of data floating around out there that takes a long time to process.
Maybe if you're some sort of super speed listener, you can listen to a podcast on one and a half or two-time speed.
But when you think about how fast a human talks, we only speak at, you know, maybe up to 150 words per minute.
I know I'm not supposed to actually speak that quickly when I'm doing a podcast, so I'm always trying to slow it down a little bit.
Maybe you listen at 2x speed, you're getting, what, 300 words a minute?
But if you think about how fast someone can read, you know, the fastest speed readers can read at 500 or even 1,000 words per minute.
So audio is actually a fairly low signal channel.
There's not a ton of bandwidth in talking.
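The bandwidth comparison above is simple arithmetic, and a short sketch makes the gap concrete. The rates are the ones quoted in the conversation (150 words per minute speaking, 2x listening, 500 to 1,000 words per minute reading); the labels and the one-hour example are illustrative assumptions.

```python
# Speaking rate quoted in the conversation, in words per minute.
SPEAKING_WPM = 150


def minutes_to_consume(word_count: int, words_per_minute: int) -> float:
    """Minutes needed to get through `word_count` words at a given rate."""
    return word_count / words_per_minute


# One hour of speech at 150 words per minute is 9,000 words.
words_in_hour = SPEAKING_WPM * 60

# Consumption rates mentioned in the conversation (labels are illustrative).
rates = {
    "listening at 1x": 150,
    "listening at 2x": 300,
    "fast reading": 500,
    "very fast reading": 1000,
}

for label, wpm in rates.items():
    minutes = minutes_to_consume(words_in_hour, wpm)
    print(f"{label}: {minutes:.0f} minutes")
```

Under these numbers, the same hour of speech takes 30 minutes to hear at 2x but only 9 to 18 minutes to read once it is text.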
But if we can transcribe that audio and then we can get it in text, not only is it much easier for us to process as people, we can read a lot faster, but it's also easier
for machines to process.
You know, we can feed it into large language models, like you said.
Or we can do simple find and replace.
We can do simple search.
There's a ton of things you can do on text that is really hard to do on audio.
It is, you know, I hate floating around the term like game changer, right?
But it is, right?
Being able to capture everything that's said, you know, I like to say that is your first party
or first company gold, all the words that you talk about.
Live stream audience, thank you for joining us.
You know, if you do have any questions on AI transcription,
on what that means for your business, get them in for Philip now.
But maybe let's not Whisper, but let's talk about Whisper.
Philip, what the heck is Whisper?
Yeah, so Whisper is an open source model that was created by OpenAI a couple years ago.
And I'll actually give like a kind of little history lesson here.
So in 2019, I was
working on a blog post about speech to text, which can also be called transcription. It can be called
ASR, which is automatic speech recognition. And I was kind of doing a survey of the state of the art.
And one of the best things I found back in 2019 was something called Amazon Transcribe. It's like an
AWS thing. And it was pretty impressive back then. You know, it was able to take some segments of text,
and it was able to create a reasonably interesting transcript out of them.
But there was definitely a ton of errors, especially around things like names, places, proper nouns,
as well as just if I kind of mumbled a little bit, then it really didn't know what was going on.
And so actually, a year later, I was working on a book.
And when I wrote that book, I did a ton of different interviews with experts in the field.
These were all audio interviews that I needed to transcribe.
and I ended up having to transcribe them by hand because I did all of these, you know, I did the survey of all this technology.
It wasn't really good enough for, you know, publication.
And so I just spent like a month at the keyboard typing out these 50,000 words from these expert interviews.
So, you know, I've always kept my eye on the space since then.
You know, when open source models like wav2vec came out, I was really excited.
I wanted to try it.
But nothing really approached the quality of my
amateur but still human transcription.
So September 21st, 2022, OpenAI released a model called Whisper.
And what's really exciting about this model is it's actually MIT licensed,
which means you don't have to go through the OpenAI platform to get it.
You can run it on your computer.
You can run it on a cloud service.
You can run it wherever you want.
And the first Whisper model was really exciting because it offered much higher accuracy.
Also, it offered that accuracy across a bunch of languages.
So when we talk about an ASR model and accuracy, we want to think about WER, which is word error rate.
So for how many, you know, for a thousand words, how many of those words are going to be wrong?
You want that word error rate to be as low as possible.
And so this model came out, it's got word error rates of like 10, you know, maybe one percent of
the words are going to be wrong versus, you know, much higher for other models.
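Word error rate is conventionally computed as (substitutions + deletions + insertions) divided by the number of words in the reference transcript. The sketch below is a minimal implementation of that standard definition using a word-level edit distance; it is not how Whisper or any vendor reports its numbers, and the two sample sentences are made up for illustration. It lowercases and splits on whitespace only, so punctuation handling is an assumption left to the reader.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: proper nouns are a typical source of errors.
reference = "please send the invoice to philip kiely at baseten"
hypothesis = "please send the invoice to philip keeley at base ten"

print(f"WER: {word_error_rate(reference, hypothesis):.2f}")
```

Here three edits (two substitutions and one insertion) against nine reference words give a WER of about 0.33, which shows how quickly a couple of misheard names push the rate up on a short utterance.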
And since then, these models have gotten better.
Now we're on Whisper V3 here in 2025.
We also have Whisper V3 Turbo, which is a little less accurate than V3, but much faster.
So we're able to get faster and more accurate transcription from these open source models in a lot of different languages.
Yeah.
And what you said there, I don't know if anyone else in our audience that hits them,
but that hit me because I remember, right, I was a journalist back then.
So I literally had taped interviews on a little tape recorder, right?
I had one that was digital, but I think early on it was an actual tape not to date myself.
And I remember hitting play, stop, rewind so many times because especially when you're quoting
people for big news publications, you had to get every single word right.
You know, I'm even curious as someone that did this as well.
What was your first reaction to seeing something like Whisper
back in 2022?
What was your reaction when using it at first?
I mean, my first reaction was, man, I wish I had this a couple of years ago because,
you know, my fingers were hurting.
I had my mouse on the floor so I could kick it with my toe to start and stop the audio
recording.
I was thinking, wow, my life could have been so much easier if this had been released a couple
years ago.
So, you know, when we talk about some recent advancements, right?
Because, yeah, I even remember I used Whisper when it first came out in 2022.
And I didn't think it was slow, right?
But now when I'm using it, because, yeah, I run it locally.
I have, you know, plenty of programs that run it on the back end as well.
Now I'm like, oh, wow, it was slow.
What does the recent speed and the cost, right?
When we look at Whisper V3 turbo, you know, maybe whenever we see a Whisper v4, what do these
advancements actually mean when it's faster and cheaper?
Yeah.
So when we think about speed and cost with Whisper, we talk about real-time factor.
So if you have, say, an hour of audio, how many times faster than real-time, can you
transcribe that. And my real-time factor as a person is like 0.3 or something, 0.2. It takes me
four or five hours to type out an hour of audio because I'm constantly starting and stopping it and
going back. Maybe if I was a faster typist, or maybe if I was a professional, I could go a lot faster.
Out of the box, you know, Whisper might get you to, depending on the hardware you're using,
I don't know, 50 times, 100 times real-time factor. So maybe that hour of
audio, you're able to transcribe it in a minute. And that unlocks a ton. But you're actually able to
take it way further through various optimization techniques that we can get into. And you can get that
real-time factor all the way up to say like a thousand times where instead of that hour of audio
taking a minute to transcribe, it might only take, you know, five or six seconds. And the other factor
in performance optimization is, you know, if you're trying to do some kind of streaming use case,
where you're transcribing the audio not as a file after the fact, but live during the conversation.
And so for that, you care about the round-trip latency for a single 30-second chunk of audio.
And for that, you can get down to about 200 milliseconds.
So I'm a martial artist.
For me, reaction time is super important.
I don't have the best reflexes in the world.
But, you know, the average reaction time for a human is about 200 milliseconds.
And so if you're able to process that audio round trip in the time that it takes someone to sort of like react to something happening, then to your end user, that's going to feel like it's basically instant.
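The real-time factor figures above reduce to one division. This sketch uses the convention from the conversation, where real-time factor means audio duration divided by processing time (so higher is faster); some ASR literature defines it the other way around. The scenario labels and the exact 0.25x and 60x values are illustrative picks from the ranges mentioned.

```python
def transcription_seconds(audio_seconds: float, real_time_factor: float) -> float:
    """Processing time, with real-time factor = audio duration / processing time."""
    return audio_seconds / real_time_factor


ONE_HOUR = 3600  # seconds of audio

# Illustrative values taken from the ranges quoted in the conversation.
scenarios = {
    "typing by hand (0.25x)": 0.25,
    "Whisper out of the box (60x)": 60,
    "heavily optimized (1000x)": 1000,
}

for label, factor in scenarios.items():
    seconds = transcription_seconds(ONE_HOUR, factor)
    print(f"{label}: {seconds:,.1f} seconds ({seconds / 60:,.1f} minutes)")
```

At 0.25x an hour of audio takes four hours, at 60x it takes one minute, and at 1,000x it takes 3.6 seconds. The "five or six seconds" quoted above corresponds to a factor of roughly 600x to 720x.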
A lot of good comments here from our live stream audience and a couple of questions too.
So, you know, Samuel's asking, is there any effort to capture tone and inflection during transcription?
Spoken language has a lot of context components beyond grammar and vocabulary.
That's something I was thinking myself, Sam.
So thanks for that question.
Philip, are we going to see that in future AI transcription, right?
Like, I sometimes talk very quickly.
Sometimes I talk with emotion, right?
Like, is that something that future AI transcription will be able to tackle?
That's a really good question.
Emotion, inflection, that kind of stuff is more of a factor right now when we're going in
the other direction, when we're going from
text to speech and we want an AI model to be able to do speech synthesis. There's a lot of work that's
been put into making that sound much more natural and that's where those context components in spoken
language are super important. Generally right now when we're going the ASR route, when we're going
from speech to text, that is going to be just the sort of raw contents of the file or the raw contents
of the conversation, but that would definitely be super interesting to look at. Like I said,
it's a big area of research going in the other direction, but it's not such a big factor right now
in transcription. What has, you know, what have all of these updates done to cost, right?
Because, yeah, I remember even originally, I was happy to pay, you know, a dollar an hour or
whatever it was, you know, in the earlier days of, you know, kind of AI transcription.
What is the cost now? And, you know, what does that mean in the grand scheme of things as businesses
are trying to leverage all of this data, right? They're recording Zoom meetings. Very common now,
right? I think people have this, you know, gold mine of data that they're maybe sitting on. So can you
walk us through the cost changes and then what that actually means? Absolutely. So, you know, a couple,
a few years ago, you're looking at a dollar or two per hour of audio. And that's generally how
it's measured is. How much input time are you putting in? That's how much you're paying. So if you're
putting in a one hour audio, say like a podcast and you want to get back a transcript, it's going to
cost one or two dollars. But today, it's gotten a lot faster. And when AI models get faster,
they also get cheaper. The sort of thing that makes an AI model expensive to run is that you have to run
it on a GPU, and GPUs are very expensive. So if you use less time on that GPU to accomplish the same
task, then that price goes down. Today, you're able to do these transcription jobs for,
it depends, it depends on exactly how fast you want it to run. It depends on the exact type of
transcript you're trying to generate. But if you're doing the simplest, most basic transcription
and you're okay with, you know, waiting a couple extra seconds for it to generate.
You can get down to just a couple cents per hour.
So we're looking at, you know, a 50 to 100 X reduction in the cost of doing this transcription.
And that's massive, you know, now for the same price that you were transcribing one hour of audio before,
you could transcribe 50 or 100 hours.
And that just unlocks so much for business.
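The cost claim can be checked with a few lines of arithmetic. The old prices (one to two dollars per audio hour) come from the conversation; treating "a couple cents" as exactly 2 cents per hour is an assumption made here for illustration. Prices are kept in whole cents to avoid floating-point rounding.

```python
def cost_reduction(old_cents_per_hour: int, new_cents_per_hour: int) -> float:
    """How many times cheaper the new price is than the old one."""
    return old_cents_per_hour / new_cents_per_hour


def hours_for_budget(budget_cents: int, cents_per_hour: int) -> float:
    """Hours of audio that a budget covers at a given per-hour price."""
    return budget_cents / cents_per_hour


OLD_LOW, OLD_HIGH = 100, 200  # $1 to $2 per audio hour, as quoted
NEW_PRICE = 2                 # assumed value for "a couple cents" per hour

print(f"Reduction: {cost_reduction(OLD_LOW, NEW_PRICE):.0f}x "
      f"to {cost_reduction(OLD_HIGH, NEW_PRICE):.0f}x")
print(f"$1 used to buy {hours_for_budget(100, OLD_LOW):.0f} hour; "
      f"now it buys {hours_for_budget(100, NEW_PRICE):.0f} hours")
```

With those inputs the reduction works out to 50x to 100x, matching the range given in the conversation.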
Yeah.
And speaking of that, let's dive into it because I still think this is one
of those areas, just like I started the show off, I think, you know, so many when we talk about
business use cases and advancements in generative AI and large language models, right?
I think everyone looks at using a ChatGPT, a Gemini, a Meta Llama, right?
Like people look at using these models, but they don't necessarily look from within what they're
creating, which a lot of times is meetings.
It's conversations like this, right?
Can you talk a little bit about maybe some new and exciting
business use cases that have maybe just begun to become a little bit more unlocked because of that
cost and that speed. Absolutely. So a big business is going to generate just so much audio. A lot of
that's going to be internal. Sometimes you might not want to transcribe literally every single thing
that happens, but there are a bunch of places where it is really valuable. So one of those is,
you know, any kind of customer-facing situation.
You know, if you're doing call center, if you're doing a, you know,
teller service, anything where you are interacting with a customer
and from the customer perspective, you know, you get on the line there
and you hear, oh, this call may be, you know, monitored for quality assurance.
So that quality assurance monitoring historically is like a manual process.
You have some supervisors who are maybe listening to a few calls
and making sure that everything's going well.
Now, you could just transcribe every single call that's coming into your business.
And then you have a fully searchable database.
You can do quality assurance.
You can also maybe analyze those transcripts to figure out patterns and what your customers are asking for.
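Once calls are text, a "fully searchable database" can start as something as small as an inverted index. The sketch below is one minimal way to do it, not a description of any product mentioned here: the call IDs and transcripts are invented, and a real system would likely use a search engine or database instead of an in-memory dictionary.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_index(transcripts: dict[str, str]) -> dict[str, set[str]]:
    """Map each word to the set of call IDs whose transcript contains it."""
    index: dict[str, set[str]] = defaultdict(set)
    for call_id, text in transcripts.items():
        for word in tokenize(text):
            index[word].add(call_id)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the call IDs that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Invented transcripts standing in for transcribed support calls.
transcripts = {
    "call-001": "Hi, I'd like a refund for my last order.",
    "call-002": "Can you help me reset my password?",
    "call-003": "The refund never arrived and I want to cancel my order.",
}

index = build_index(transcripts)
print(sorted(search(index, "refund order")))
print(sorted(search(index, "password")))
```

The same index also supports simple pattern analysis, for example counting how many calls mention "refund" per week.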
You can do content moderation at scale.
You know, if I post a, you know, something with text on a platform and it has, you know, some stuff the platform doesn't want on there,
That's super easy to identify and flag because it's in words.
If I'm posting, say, a podcast on Spotify or something, then that's a lot more difficult.
Or if I'm posting a YouTube video because, you know, you can't really just listen to all of the podcasts and all of the YouTube videos.
But if you can get that from audio to text, then you can run it through those same moderation algorithms.
You can also do stuff like media subtitling, closed caption generation.
You can do that in real time.
I know sometimes if I'm watching a sports game on silent, I see the announcer's words,
but it's always like five or six seconds after the play is happening.
It's so far behind.
It's so far behind, right?
And so if we can get that, you know, down to something that's more real time, that's super
awesome.
And you can also do real time translation with that as well.
So, yeah, there's just so many
different use cases where you have these massive volumes of audio being generated that before,
it just wasn't cost efficient to process these or just took too long. And now with this cheaper,
faster AI transcription that's more accurate, you can get a lot more value out of these big audio
corpuses. So, you know, Cecilia brings up a good point because there's entire industries, right,
that for many decades have thrived around just typing words what people are saying, right?
Like she's asking about how is AI transcription disrupting industries like court reporting, right?
Like are we going to see some of these traditional roles where people were just transcribers?
Are they just going to go away?
Well, you know, you do still have to verify these transcripts.
When I talk about accuracy in an AI transcription and that word error,
you know, that word error rate is not zero.
There is a lot that you can do to make your transcripts more accurate.
You know, for example, you can look at, you can have a model analyze them.
You can look at, say, chunks that are silent and, you know, replace them or rerun them.
But, you know, at the end of the day, if you're doing something like court reporting where you need 100% perfect accuracy,
it's important to have systems beyond just a single transcription model that are going to guarantee that accuracy.
And, you know, I think that there's still a major role for human in the loop in these kind of systems where you're able to, you know, go in and verify these transcripts and make sure that they're completely accurate.
Yeah, so you talked a little bit about how this, you know, advancing technology, whisper models,
you know, in general, are helping change how we've done business in the past.
But as we look to how these advancements might change how we work in the future,
what might we see change?
Because everything's going live, right?
You know, you have live Advanced Voice Mode from ChatGPT, you have Gemini Live.
You know, you can talk to Copilot, right?
Like how will more accurate, faster, cheaper transcription change how we work?
So one thing with that like live voice mode from ChatGPT is it's really cool, but it's also really expensive, right?
That, you know, that sort of capability costs what like, you know, 10 plus dollars an hour.
And this transcription is only a few cents an hour.
So if you're a clever developer, you're able to kind of put this model in front of some other models and build these sort of chains of models for these compound AI use cases, where instead of having one gigantic model that costs a ton to run and is able to do it end to end, you chain together a few small cheap models and run the same pipeline much faster and much cheaper.
One place where that's really important right now is AI phone calling.
So if you want to say like have a automatic pizza order taker that you're going to build where a customer can call it up and just say what they want on their pizza and it's going to say, all right, I've got this pizza for you.
That kind of thing.
You can build, you know, that AI phone calling with these faster, cheaper transcription models.
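The chaining idea for the pizza order taker can be sketched as three small stages called in sequence: speech to text, order extraction, and a spoken confirmation. Every function below is a hypothetical stand-in so the example runs on its own: `transcribe` pretends the audio bytes are already text, `extract_order` uses keyword matching where a small language model would go, and `confirm` returns the text that a text-to-speech model would read out.

```python
KNOWN_TOPPINGS = ["pepperoni", "mushroom", "olive"]  # illustrative menu


def transcribe(audio: bytes) -> str:
    """Stand-in for a speech-to-text model such as Whisper."""
    return audio.decode("utf-8")


def extract_order(transcript: str) -> dict:
    """Stand-in for a small language model that pulls out the order."""
    text = transcript.lower()
    return {"toppings": [t for t in KNOWN_TOPPINGS if t in text]}


def confirm(order: dict) -> str:
    """Build the reply that a text-to-speech model would then speak."""
    toppings = " and ".join(order["toppings"]) or "cheese"
    return f"All right, I've got a {toppings} pizza for you."


def handle_call(audio: bytes) -> str:
    """Chain the three cheap stages instead of using one end-to-end model."""
    return confirm(extract_order(transcribe(audio)))


print(handle_call(b"I'd like a pizza with mushroom and pepperoni please"))
```

Because each stage has a plain input and output, any one of them can be swapped for a faster or cheaper model without touching the others, which is the appeal of the compound approach described above.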
Another big aspect is wearables.
So a big trend right now is, you know, having a pin or some speaker microphone combination on your body
that's able to sort of record your daily context so that you have, you know,
better information for your decision making for that kind of stuff.
And so if you're, you know, wanting to record your life 12 or 16 hours a day,
again, if that's, you know, going through that historic transcription algorithm
where it's costing a dollar an hour, well, that's like $16 a day.
That's just not a sustainable business.
But if you're able to do it for, you know, while you're sleeping at night for a couple
pennies and it's costing a few cents a day, then, you know, now we're in the realm where this
can make sense as a consumer product. So wearables, you know, local inference, phone calling,
all these sort of things are these sort of real-time multimodal user experiences that are getting
unlocked by these transcription models. Yeah. And I do think that we are going to see those
in the wild that actually makes sense, right? If you've listened to this show, I'm never one to just,
you know, things like the Humane pin and the Apple Vision Pro, I'm like, no, not really.
But I think some recent advancements, right, the Meta's, uh, Meta's Ray-Bans, uh, you know,
some of Google's new products, you know, I think wearables are going to be a thing, whether you
think they're going to or not.
I do think that is kind of the next iteration.
Uh, but you know, one thing I'm curious about and it's something I've always thought about,
this concept of typing versus talking, right?
I can talk really quickly, but also, I don't blame y'all if you listen to this podcast on 2X.
I would too.
But might we see something in the future where it becomes less and less common to type?
And we're just interfacing with, you know, I don't know, autonomous AI agents and multi-agent
environments.
And all we're really using is our voice.
And if so, you know, what part of this technology has to improve or what advancement
are we kind of waiting on until that future is finally here,
where we're just sitting back, kicking our feet up,
and just talking to our AI agents.
Yeah.
So the future is now, actually, for that.
You have all those agent use cases and stuff that are still coming.
But if you just want to control your computer,
if you want to type an article without using your fingers,
That's actually possible.
I have a colleague who actually had to have surgery recently on their hands.
And so they went and used a voice transcription app for a few days to do writing while they couldn't type as much.
They used something called Wispr Flow, which is an application out there for that.
But yeah, it's, you know, the future is now in terms of controlling your computer with voice.
It's not something that's going to be practical in every situation.
Like if I'm on the train, I don't want to be talking to my computer and everyone else is talking to their computer.
That doesn't sound so good.
But it can definitely be helpful if you have limited typing ability.
I don't type particularly quickly.
I can definitely talk much faster than I can type.
So it's something that I'm super excited about.
Yeah, it's a good point.
And I think, you know, having conversations about these type of things is important
because I, yeah, I do think, yeah, whether we're talking wearables, whether we're talking,
you know, talking to your computer, it is becoming more and more common, more,
I think, part of how we work in the future.
One other thing, you know, what part of this, Philip, like why are, you know, if I'm talking to Siri,
if I'm talking to Alexa, right?
I see a big difference than when I'm talking to, as an example, a Gemini Live or a, you know,
ChatGPT Advanced Voice Mode.
Why is there still this kind of divide, even between the big tech conglomerates on which ones can accurately understand our words?
And sometimes they just can't.
So what you're observing there is the difference between on-device inference and cloud inference.
So if you're taking an AI model and running it on the user's device, that's on-device or edge inference.
And, you know, your user device is not going to be as powerful as like an Nvidia H100 GPU
sitting in a data center somewhere.
It's not going to be able to run as big of a model
or run the same model at as high of a quality.
And so because of that, for these voice transcription things,
you're probably seeing a little bit worse results
when you're using it on a local device
versus when you're using it on the cloud.
However, that's changing really quickly.
These models are pretty small.
They can be just a couple billion parameters.
And so those are actually a really good candidate
for local inference,
even on stuff like smart speakers, or maybe that next generation of smart speakers that has those
upgraded GPUs, upgraded VRAM capabilities so that they can run these small models.
And so I definitely think you'll see that gap close in the transcription space pretty quickly.
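As a rough illustration of why a model with "a couple billion parameters" is a plausible fit for on-device inference, the sketch below estimates the memory needed just to hold the model weights at different numeric precisions. The 2-billion parameter count and the precision options are assumptions chosen for illustration, not figures given in the conversation, and the estimate ignores activations, audio buffers, and runtime overhead.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate memory for model weights only, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Assumption: "a couple billion parameters" is taken as 2 billion.
PARAMS = 2e9

# Bytes per parameter for some common storage precisions.
PRECISIONS = {"fp32": 4, "fp16": 2, "int8": 1}

for label, nbytes in PRECISIONS.items():
    print(f"{label}: {weight_memory_gb(PARAMS, nbytes):.1f} GB of weights")
```

Under these assumptions, the weights alone need roughly 4 GB at 16-bit precision, which is why the amount of memory on the device matters so much for whether a model can run locally.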
All right. So, Philip, we've covered a lot in today's conversation.
I mean, we talked a little bit about Whisper, what this technology is, the cost savings, how
you know, faster and more accurate, you know, voice transcription AI has led to many new use cases.
But, you know, as we wrap up today's show, what is the one most important thing that you want
our audience to know when it comes to how cheaper and faster AI transcription is changing
enterprise work?
I think the most important thing to understand is the trend.
You know, in the last couple years, these models have gotten much more accurate, much
cheaper, much faster. And there was, of course, the massive leap in 2022 compared with maybe like a couple
years before that. I think this is going to keep happening. So even if you see a use case today where it's like,
you know, Philip, actually like five cents per hour, that's a little too expensive for what I'm trying to do.
Or, oh, you can only do 200 millisecond round trip time. Like, yeah, that doesn't cut it. We're not done
optimizing these models. And even in the, you know, last couple of quarters of work
on these models, we've gotten much better at running them, been able to run them much faster and
cheaper. And that's a trend that's continuing. So I would definitely, you know, look at these
use cases that you're considering today and say, okay, does this make sense today? If yes, go for it.
If no, still maybe go for it because it could make sense in three months, six months, nine months,
once the technology gets even better. And you're going to be pretty far ahead. You know, you said, for example,
Jordan, that you don't always love some of these wearables. You know, that's a case where having the
prototype today is what's going to set you up to be able to use the, you know, polished version next
year for those companies. And so I'd say in the same vein, if you're building some kind of
speech use case, if you're building some kind of transcription use case, and if it doesn't work
today, still build that prototype, put it in your back pocket and keep an eye on the technology
as it advances because it's getting better fast.
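One way to act on the "does this make sense today?" question is to compare today's per-hour price against the price at which a use case would fit its budget. The sketch below uses the five-cents-per-hour figure mentioned above as an example price; the monthly audio volume and the budget are hypothetical numbers picked purely for illustration.

```python
def monthly_cost(audio_hours: float, price_per_hour: float) -> float:
    """Total transcription cost for a month of audio at a flat per-hour price."""
    return audio_hours * price_per_hour


def break_even_price(budget: float, audio_hours: float) -> float:
    """Highest per-hour price at which the workload still fits the budget."""
    return budget / audio_hours


# Example price from the conversation; volume and budget are assumptions.
PRICE_PER_HOUR = 0.05      # dollars per hour of audio
HOURS_PER_MONTH = 10_000   # hypothetical workload
BUDGET = 200.0             # hypothetical monthly budget in dollars

cost = monthly_cost(HOURS_PER_MONTH, PRICE_PER_HOUR)
target = break_even_price(BUDGET, HOURS_PER_MONTH)

print(f"Cost today: ${cost:,.2f} per month")
print(f"Fits the budget once the price drops to ${target:.3f} per hour")
print("Go for it now" if cost <= BUDGET else "Prototype now, revisit as prices fall")
```

With these made-up numbers the workload does not fit the budget today, but it would if the price fell from five cents to two cents per hour, which is the kind of gap the prototype-now advice is aimed at.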
That's great advice and I think words that we should all listen to.
All right.
So, Philip, thank you so much for taking time out of your day to join the Everyday AI show.
We appreciate your insights.
Hey, thank you so much for having me.
I had a great time.
All right, y'all.
Quick reminder, we covered a lot and there's a lot more.
So if you found something valuable today, please, if you're listening on the podcast,
make sure to subscribe and rate the platform.
Go back and listen to our library of episodes.
Like we literally have thousands of hours of content on our website,
hundreds of episodes.
Also go to YourEverydayAI.com.
We're going to be recapping today's conversation.
Yeah, I'm going to upload it in 10 seconds.
I'm going to have it all transcribed,
but I'm going to be writing about it,
a real human telling you more info and insights to take away.
So thank you for joining us.
Hope to see you back tomorrow and every day for more everyday AI.
Thanks, y'all.
And that's a wrap for today's edition of Everyday AI.
Thanks for joining us.
If you enjoyed this episode, please subscribe and leave us a rating.
It helps keep us going.
For a little more AI magic, visit YourEverydayAI.com
and sign up to our daily newsletter so you don't get left behind.
Go break some barriers and we'll see you next time.
