The Good Tech Companies - Can Voice Deepfake Detection Keep Up With the 1600% Surge in Fraud Attacks?

Starting point is 00:00:00 This audio is presented by HackerNoon, where anyone can learn anything about any technology. Can voice deepfake detection keep up with the 1,600% surge in fraud attacks? By Modulate. In 2024, there were 105,000 reported deepfake incidents, one every five minutes, using the voices of CEOs and top corporate executives to fool employees into handing over millions, along with highly sensitive data and business assets. That same year, 92% of businesses reported losses tied to deepfake-related incidents, totaling $12.5 billion. From the end of 2024 through the first quarter of 2025, deepfake-enabled vishing attacks surged by 1,600%. Year over year,

Starting point is 00:00:44 we can account for a 680% rise in deepfake activity. By 2027, business losses due to deepfake activity are expected to hit $40 billion, a total increase of 6,566% between 2023 and 2027. While there is no doubt that these attacks will continue, there is doubt over whether deepfake detection systems are capable of handling that surge. What a voice deepfake actually looks like in production. Though some tools are designed for synthetic voice dubbing for filmmaking, and others for voice cloning for use in content creation, video games, podcasts, and audiobooks, those with malicious intent are using them to create fraudulent audio. The process can often look like this. 1. Fraudsters will either use LLMs or manually gain access to recordings of a target speaker,

Starting point is 00:01:34 usually from social media, speaking events, or podcasts. 2. They clean the audio using background removal, equalization, and noise gating, and sometimes a speech-to-speech model, all of which often make recordings that are even cleaner than typical real conversational audio would be. 3. Then, the voice cloning tool converts the audio into spectrograms and feeds them into neural networks, generating voice embeddings, or digital fingerprints, that act as a vocal identity. 4. The attacker might also use a script to produce responses that mimic their target's tone, speaking cadence, and typical inflections. We can see how this one attack process plays out in

Starting point is 00:02:13 various real-life scenarios. Voice cloning a CEO. Say, for example, an attacker voice-cloned a CEO to authorize a major wire transfer with their bank. All the fraudster would have to do is scrape YouTube for things like podcast appearances, conference speeches, or news interviews to get a three-second clip of clean audio. That short clip grants them around 85% similarity with their target. They'd train the model, potentially prepare a script, and may also compromise an email account to align their story with other communications. After that, all they'd have to do is call the bank or treasury to request a high-value transfer to an attacker-controlled account. In some cases, they might invoke a level of urgency to override processes, claiming they'll take responsibility for whatever

Starting point is 00:02:57 happens as fallout. Nexus Flow's CEO faced this scenario, costing the company $2.3 million in deepfake fraud. Why traditional authentication failed. Video and voice calls have been used as an extra verification step for decades, and that was okay before the advent of video deepfakes. Now, we see cases like that of Arup, which lost $25 million when an attacker used a video deepfake of a CFO to confirm a suspicious email. These authentication steps largely rely on the human on the other end to identify the fraud. But when the cadence, accents, and even verbal tics match the real executive, it's incredibly difficult to tell the difference, even if you personally know the victim. In fact, most people are able to correctly

Starting point is 00:03:43 identify deepfakes from real human audio and video less than 25% of the time. Audio-native detection flags what people can't. An audio-native layer isn't asking, "Does this sound like the CEO?" It's asking, "Does this sound like a synthetic model?" Since many fraud detectors are trained on hundreds of generative models, they are able to identify patterns in both human and synthetic speech. Voice-native detection models, like Modulate's deepfake detection API specifically, are trained to look for specific indicators of fraudulent speech, meaning they can flag the following straight from the original audio.

Starting point is 00:04:18 Synthetic voice fingerprints. Model-specific artifacts that don't naturally appear in human speech: spectral regularities, over-smooth formants, unnatural phase coherence. Mismatch versus enrolled (real) voice. Micro-features of a human's voice that cloned audio cannot match, such as micro-prosody, micro-timing, breath patterns, channel characteristics (office mic versus VoIP versus mobile), and long-term spectral stats. This method relies on enrolling executives' voices as baselines for comparison. Contextual risk. Models can create a synthetic likelihood score to warn of high-risk events, typically combining factors like an unusual time of day for the call, an atypical device or number, first-time high-value wires, or new destination accounts. With the

Starting point is 00:05:06 proper mechanisms in place, the system could use that score to trigger a secondary authentication factor (multi-signer approval, password, callback, etc.). Synthetic caller impersonated a policyholder. Just like in the previous example, the general process for cloning the target audio remains the same. This time, though, the attacker breached the company's data first, obtaining details such as a social security number, address, and even prior recorded calls with the customer. Because this is the entry point for the attack, they're easily able to get through the authentication processes. This is most commonly seen among insurance companies. So far, they've experienced a 475% increase in synthetic voice attacks. What makes this fraud more common

Starting point is 00:05:49 is the ability to use a generic TTS voice to portray a calm, confident customer. Attackers are typically after claim payouts, policy loans, and cash value withdrawals. Traditional authentication fails because it relies on knowledge-based authentication (KBA), but as far as the company knows, the fraudster is the customer because they've already obtained all of the customer's PII. What audio-native detection catches. Audio-native fraud detection tools run continuously and silently on every call, allowing them to detect cloned or TTS voices in less than 200 ms. They can flag overly regular pitch contours, a lack of natural micro-variability, model-specific noise floors and harmonics,

Starting point is 00:06:31 and replayed audio (e.g., prerecorded phrases) via abrupt boundary transitions and compression artifacts. These tools can also flag multiple high-risk calls from the same number or IP, bursts of first-time, high-value claims or bank detail changes, and geographic mismatch versus the policyholder's history. By cross-referencing the synthetic likelihood and behavioral risk, the system can trigger a stronger form of authentication and tag the call for fraud investigation. Real-time voice conversion during a support call. This is where we start to see the voice cloning process deviate, because instead of using prerecorded audio, attackers run the conversion in real time. They do this by using a streaming voice conversion tool, like ElevenLabs, that transforms their own live speech

Starting point is 00:07:15 into that of another person with less than 200 ms of latency. Attackers can either make it sound like a specific person or create a generic native-speaking voice. They can call any bank, retailer, or SaaS vendor and completely take over accounts, even changing passwords, addresses, accounts, etc. Since they're deepfaking the voice live, they're able to adjust to the conversation at hand, which makes it much harder for detection software to catch. Humans can't tell the difference, and neither can voice biometrics, which assume that the impostor is another human rather than an adversarial, model-generated voice. Even device checks are weak because attackers can easily spoof numbers. Humans can't tell the difference because the live human on the other end is able to adapt to the

Starting point is 00:08:00 questions and flow of the conversation. Audio-native detection looks for things human and app-layer checks can't. Even with a real-time voice clone, there are typically subtle inconsistencies between articulation and formant movement and the breathing patterns of a real caller versus AI. Background noise on the clone, or lack thereof, can also trigger fraud warnings. Even the consistency of response timing (e.g., responses consistently occur after a two-second delay) and inconsistent jitter patterns may raise flags. The detection system sees them as statistical anomalies in the spectrogram and phase domain. All of these anomalies can trigger on-screen alerts in real time, prompt additional verification steps, or reroute the call to a special fraud line. Everything points to audio

Starting point is 00:08:46 native architecture. In our recent examination of the leading models on Hugging Face's speech deepfake leaderboard, the most comprehensive public benchmark for audio deepfake detection, we recognized that Modulate's Velma 2 tops the leaderboard by a good margin because of its voice-native architecture. The gap between Velma 2 and the next model, Resemble Detect, isn't small. Modulate tops the leaderboard with an average accuracy of 98.9%, or a 1.1% failure rate, meaning that out of every 100 audio files, only about 1.1 are falsely flagged as deepfakes. This is the closest any model has ever come to complete and perfect accuracy. Resemble AI (Resemble Detect 3B) follows with an average EER of 2.570 and 97.9% average accuracy.

Starting point is 00:09:35 anchors the pack with an average EER of 2.113 and 97.4% average accuracy. A difference of 1.466% average EER between Modulate and Resemble seems like a minute difference to anyone outside the industry. However, the difference is rather large, amounting to 60% more deepfakes caught, or 150,000 fewer false positives per 10 million calls. Modulate accomplishes this even as one of the smallest deepfake detection models, thanks to its naturally efficient architecture. Voice-native architecture. Audio-native detection models, or audio-native ensemble learning models (ELM models), ingest raw audio to determine fraud likelihood. By ingesting the raw audio, these audio-native models can create and analyze the following. Spectrograms: time-frequency maps that reveal unnatural spectral envelopes, missing

Starting point is 00:10:28 noise floors, and harmonic smoothness. Most deepfake models are limited to this analysis. MFCCs (Mel-frequency cepstral coefficients): compact representations of vocal tract characteristics, used to detect overly consistent MFCC patterns. Prosodic features: stress patterns, jitter, shimmer, breath timing, and pitch contours. Micro-temporal anomalies: sub-phoneme irregularities, phase coherence issues, glottal pulse artifacts, and unnatural formant transitions. Ensemble architectures run all of the above simultaneously to create a unified prediction, more accurately flagging different attack types, languages, and audio conditions. Voice-native detection could reduce false positives by 150,000 per year. For every false positive, you risk SLA breaches, compliance exposure, wasted agent and investigation time, customer friction,

Starting point is 00:11:22 and churn risk. Voice-native detection systems are less likely to misclassify legitimate, messy, real-world audio as fake because they analyze the audio signals directly (spectrograms, MFCCs, prosody, micro-timing). This could cut false positives by as many as approximately 150,000 per year. Or, rather, the false positive rate could drop to a mere 0.3 to 0.5%. Cost of detection versus cost of fraud. The FBI's IC3 report for 2025 shows losses related to audio and video deepfakes to be $893 million in the U.S. alone. The largest reported loss so far is the Arup case, costing the multinational engineering firm's Hong Kong office $25.6 million. Fintech companies are often hit particularly hard, with losses averaging $630,000 per attack.

Starting point is 00:12:17 Financial services see average losses of $603,000, and banking services see losses of $570,000 per attack. Preventing even 0.5% of the losses seen in the average attack could pay for your entire fraud detection program. Those costs scale linearly, unlike fraud losses, which are catastrophic no matter the scale of your business. Running Modulate's deepfake detection for 10,000 hours each month at $0.25 per hour would come to $2,500 per month. This is a rounding error for any large contact center or enterprise taking calls at scale. What organizations need to do now to invest in fraud detection. Investing in deepfake and fraud detection systems isn't as simple as finding the best tool with the highest accuracy rate. You need to know what your organization

Starting point is 00:13:06 can support and how well you can adopt any system you choose. For that, you need to audit and test your deepfake resilience. Audit your current voice authentication for deepfake resilience. Voice biometrics and knowledge-based authentication are still heavily relied on in authentication practices. But, as we're already seeing, these don't stand up against modern AI-based attacks. You can reveal whether or not this is true of your own systems by asking the following. Can your IVR or agent-assist stack be fooled by a cloned voice? Do your authentication flows rely on linguistic signals instead of acoustic ones? Are you evaluating models against current-generation synthesis, not 2022-era TTS? Test any detection system you choose against state-of-the-art synthesis, not last year's deepfakes. The quality

Starting point is 00:13:53 of deepfakes used in fraud doubles every year. That's why you should question the speed of adaptability for any solution you choose, but that solution should also be tested regularly, not just upon initial inspection or adoption. Performance three months from now could show quite a difference. Don't simply test against publicly available datasets either. Create your own representative datasets that reflect your existing operational conditions. Include as much variety as you can source from your own audio files: accents, several speakers, background noise, mic issues, etc. Document all false positives and negatives within your own tests, testing against both known deepfakes and known authentic content. You can use this data

Starting point is 00:14:34 to calculate true error rates. Measure resource consumption by processing the actual number of calls you take in a day. Ensure adoptability by having actual team members use the tool. Evaluate how they understand the outputs and whether or not they respond appropriately to fraud detection scores. Do scores trigger appropriate secondary authentication layers? Or do you have to redesign workflows? The answers will help you determine both the effectiveness of any tool you select and whether or not your existing systems can handle it. Deploy audio-native detection as a layer, not a replacement. Audio-native detection serves as your first line of defense, using the raw audio to create several fraud likelihood scores. You'll have answers well before transcription, biometrics, or agent interaction ever comes into play and before funds are ever moved. The fraud likelihood scores should then trigger one to three of the following. Out-of-band verification,

Starting point is 00:15:26 such as email confirmation codes, a verified callback, or push approvals through a verified app. Dynamic challenge-response, such as "Tell me the last four digits of the vendor we paid yesterday" or "Tell me the last deposit we made in your account." Device-binding identity that ties customers to a trusted device or cryptographic key. Transaction-level risk scoring that authenticates the action with signals such as whether or not it's an unusual amount, a different beneficiary, or there's a geolocation mismatch. This layered approach helps to reduce your operational load, lower customer friction, and increase fraud catch rates.
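
The layered approach described here can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Modulate's API or any vendor's implementation: the `CallSignals` fields, the step-up names, their ordering, and the 0.3 and 0.7 thresholds are all hypothetical. It treats the audio layer's synthetic likelihood score and the transaction-level signals as inputs, and returns how many extra checks to trigger. Unlike the text's "one to three", it also returns zero checks for a clearly low-risk call.

```python
from dataclasses import dataclass


@dataclass
class CallSignals:
    """Hypothetical per-call inputs; none of these names come from a real API."""
    synthetic_likelihood: float  # 0.0 (human) to 1.0 (synthetic), from the audio layer
    unusual_amount: bool = False
    new_beneficiary: bool = False
    geo_mismatch: bool = False


# Step-up checks named in the text, ordered by assumed customer friction.
STEP_UPS = [
    "out_of_band_verification",
    "dynamic_challenge_response",
    "device_binding_check",
]


def transaction_risk(call: CallSignals) -> int:
    """Count how many transaction-level risk signals are present (0 to 3)."""
    return sum([call.unusual_amount, call.new_beneficiary, call.geo_mismatch])


def choose_step_ups(call: CallSignals, low: float = 0.3, high: float = 0.7) -> list[str]:
    """Return zero to three step-up checks; thresholds are illustrative only."""
    risk = transaction_risk(call)
    if call.synthetic_likelihood < low and risk == 0:
        return []  # low-risk call: add no extra friction
    count = 1  # any elevated signal earns at least one check
    if call.synthetic_likelihood >= high:
        count += 1  # audio strongly suggests a synthetic voice
    if risk >= 2:
        count += 1  # several transaction-level signals agree
    return STEP_UPS[:count]


if __name__ == "__main__":
    print(choose_step_ups(CallSignals(0.1)))
    print(choose_step_ups(CallSignals(0.9, unusual_amount=True, new_beneficiary=True)))
```

Keeping the audio score and the transaction signals as separate inputs means either one alone can trigger a check, while agreement between them escalates further. In practice the thresholds would need tuning against your own documented false positives and negatives.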

Starting point is 00:16:04 Ultimately, that is the goal of any fraud detection tool, but audio-native detection tools like Modulate's deepfake detection API are the only systems getting companies close to complete protection. Thank you for listening to this HackerNoon story, read by artificial intelligence. Visit HackerNoon.com to read, write, learn and publish.
