The Good Tech Companies - Catching 98.9 Out of 100 Deepfakes: What It Takes to Lead Hugging Face's Leaderboard
Episode Date: April 16, 2026. This story was originally published on HackerNoon at: https://hackernoon.com/catching-989-out-of-100-deepfakes-what-it-takes-to-lead-hugging-faces-leaderboard. Modulate ...tops Hugging Face's Speech Deepfake Leaderboard with 98.9% accuracy at $0.25/hr. Here's what voice-native architecture unlocks for fraud teams. Check more stories related to machine-learning at: https://hackernoon.com/c/machine-learning. You can also check exclusive content about #deepfake-detection, #voice-ai, #ai-benchmarks, #audio-deepfakes, #fraud-prevention, #deepfake-benchmarks, #huggingface, #good-company, and more. This story was written by: @modulate. Learn more about this writer by checking @modulate's about page, and for more stories, please visit hackernoon.com. Voice deepfake losses are projected to hit $40B by 2027, a 6,566% jump from 2023. Modulate's velma-2 now ranks #1 on Hugging Face's Speech Deepfake Leaderboard with a 1.104% average EER across 14 datasets and 2M+ audio samples, catching 98.9 out of every 100 deepfakes. This post breaks down why the Hugging Face benchmark is the most credible public standard for detection, how Modulate's voice-native ELM architecture outperforms repurposed models from Hiya and Resemble AI, and why running detection at $0.25/hr (100x cheaper than competitors) lets fraud teams monitor entire calls instead of just the opening seconds where most checks stop today.
Transcript
This audio is presented by Hacker Noon, where anyone can learn anything about any technology.
Catching 98.9 out of 100 deepfakes: what it takes to lead Hugging Face's leaderboard, by Modulate.
$40 billion in projected losses by 2027 for businesses exposed to voice deepfakes. To put that into perspective, total losses from voice deepfakes were $600 million in 2023 and $12.5 billion in 2024. That's an increase of 6,566% in five years. Naturally, financial services were among the hardest hit, with 23% of financial sector organizations reporting losses of over $1 million. Contact centers were also among the hardest hit: it's now reported that these centers encounter a voice deepfake attack every 46 seconds.
Businesses need a deepfake detection API to accurately and reliably detect voice fraud as it happens, to mitigate those hard losses. Vetting such a solution depends on credible industry benchmarks, like the Hugging Face Speech Deepfake Leaderboard, where Modulate ranks number one as of March 26, with an average EER of 1.104%. This translates to Modulate catching 98.9% of all AI-generated deepfake voices across the diverse range of audio used in Hugging Face's 14 benchmarks. Because EER balances missed fakes against false alarms, it also pins down the false positive rate: just 1.1%.
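For readers who want the metric made concrete: EER is the operating point where the rate of missed deepfakes equals the rate of genuine audio falsely flagged, which is why a 1.104% EER implies both the 98.9% catch rate and the roughly 1.1% false positive rate. A minimal sketch of the computation (NumPy only; the scores and labels are toy values, not leaderboard data):

```python
import numpy as np

def equal_error_rate(scores: np.ndarray, labels: np.ndarray) -> float:
    """EER: the threshold where the false rejection rate (genuine audio
    flagged as fake) crosses the false acceptance rate (fakes that slip
    through). scores: higher = more likely deepfake; labels: 1 = fake."""
    best_gap, eer = np.inf, 1.0
    for t in np.unique(scores):
        flagged = scores >= t
        frr = np.mean(flagged[labels == 0])   # genuine clips flagged as fake
        far = np.mean(~flagged[labels == 1])  # fake clips passed as genuine
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# Toy example: four genuine clips (label 0) and four fakes (label 1).
scores = np.array([0.02, 0.10, 0.15, 0.40, 0.55, 0.80, 0.90, 0.97])
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(f"EER = {equal_error_rate(scores, labels):.3%}")  # perfectly separable here
```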
Why everyone looks to the Hugging Face deepfake leaderboard. The Hugging Face deepfake leaderboard refers to a set of public, continuously updated leaderboards that evaluate how well AI systems detect synthetic or manipulated media, especially speech deepfakes. To truly test the mettle of each model, it uses 14 datasets and 2 million-plus audio samples, spanning clean lab audio to real-world telephony, to benchmark each of the models. Because the HF deepfake arena is open to the public and anyone can reproduce the results to test the claims, it's the most transparent benchmark for evaluating the accuracy of detection models on manipulated audio. Though the public may submit their own results, the leaderboard was developed and is continuously maintained by researchers at the Idiap Research Institute (Switzerland), CNRS/IRISA (France), Mohamed bin Zayed University of Artificial Intelligence (UAE), Tallinn University of Technology (Estonia), and Valid Limited (UK). This makes HF one of the most credible and rigorous public benchmarks for evaluating detection systems. The leaderboard: Modulate at number one, with Resemble AI and Hiya rounding out the top three.
The two most important values on the leaderboard are the average result and the pooled result. The average result grants equal weight to the results across all 14 datasets; the pooled result combines all evaluation samples into a single pool. The leaderboard now shows Modulate's velma-2 at number one for both average and pooled, with scores of 1.104 and 1.586, respectively. This translates to an average accuracy of 98.9%: out of every 100 deepfake audio files, Modulate correctly catches 98.9, while only 1.1% of genuine recordings get falsely flagged as deepfakes. It's the closest any model has come to complete accuracy in deepfake detection.
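The distinction between those two numbers matters: averaging treats a small dataset and a half-million-sample dataset as equally important, while pooling lets the big datasets dominate. A minimal sketch of the two aggregations (the dataset names, EERs, and sample counts below are invented for illustration; the real leaderboard computes the pooled figure by re-scoring one combined pool rather than weighting per-dataset EERs):

```python
import numpy as np

# Hypothetical per-dataset results: (EER as a fraction, sample count).
results = {
    "clean_lab_tts":  (0.003, 50_000),
    "telephony_8khz": (0.015, 400_000),
    "social_media":   (0.012, 150_000),
}

# Average EER: every dataset counts equally, regardless of size.
average_eer = np.mean([eer for eer, _ in results.values()])

# Pooled EER, approximated here as a sample-weighted mean.
total = sum(n for _, n in results.values())
pooled_eer = sum(eer * n for eer, n in results.values()) / total

print(f"average: {average_eer:.3%}, pooled: {pooled_eer:.3%}")
# average: 1.000%, pooled: 1.325% -- the two can diverge noticeably.
```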
Continuing with the results: Resemble AI's Resemble Detect 3B follows closely with an average EER of 2.570 (97.4% average accuracy) and a pooled EER of 2.099. Hiya's Authenticity Verification lands in third with an average EER of 2.113 (97.9% average accuracy) and a pooled EER of 2.324. These top three results are rather incredible, especially when you consider how drastically more accurate each model is in comparison to the fourth-place system. To get an even better understanding of why these achievements are so pivotal,
we must look at our two top competitors individually. Resemble AI is primarily a voice generation
company, TTS and voice cloning, which could place them on the opposite side of the detection arena.
However, their detection product, Resemble Detect 3B, is a 3B-parameter model. So while detection isn't their core business, they've shown dedication to combating voice fraud with a solid, accurate model. Hiya is a serious player in telephony fraud, with a model that is three times smaller than Resemble AI's, using only 1 billion parameters and operating at 8x real-time speed in streaming mode. The majority of their business is focused on branded caller ID and voice agents, though they've dedicated a branch of their business to spam and fraud detection and prevention.
Modulate is a different story altogether: voice-native from day one, with detection at the very core of our business. We are built on the ELM architecture, and our offerings are all contained within that architecture: conversation intelligence, speech-to-text, and deepfake detection. This focus has paid off by allowing us to make strides in deepfake detection accuracy. What do Hugging Face's 14 datasets actually test? There are roughly 2 million audio files across the 14 datasets in the HF deepfake arena, all collected to represent real-world attack scenarios across a variety of settings, accents, languages, industries, and technical jargon. Let's take a look at how truly diverse these audio files are and what they test for.
ASVspoof series (2019, 2021 LA, 2021 DF, 2024). Out of all of the datasets in the HF collection, the ASVspoof series is the closest thing to an industry standard for model evaluation under controlled but realistic conditions. That's why it is the longest-running and most widely cited anti-spoofing benchmark series in all of speech security. It measures: LA (Logical Access), TTS and voice conversion attacks injected directly into the system with no channel noise; PA (Physical Access), playback attacks in real rooms with microphones, reverberation, and environmental noise; and DF (DeepFake), modern neural TTS and voice conversion systems, including diffusion-based models. With each new edition, you get new attack types, codecs, and channel conditions. For instance, 2024 expands into VoIP, telephony, compression artifacts, and more realistic channel distortions.
ADD challenges (2022 Tracks 1 and 3; 2023 R1 and R2). The Audio Deepfake Detection (ADD) challenge series was designed with the knowledge that most models are tested on clean audio, with those unchallenging benchmarks flaunted as sales signals. This series focuses on noisy, degraded, real-world audio as a way to punish those models. It measures: Track 1, in-the-wild deepfake detection; Track 3, robustness to channel effects, background noise, and environmental distortions. The 2023 R1 and R2 rounds introduce more diverse languages, codecs, and unseen synthesis methods.
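To get a feel for the degradations the ADD tracks reward robustness to, here is a minimal sketch that mixes background noise into a clip at a chosen signal-to-noise ratio (NumPy only; the synthetic arrays stand in for decoded audio, and this is generic augmentation rather than the challenge's actual pipeline):

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix noise into speech at a target signal-to-noise ratio in dB."""
    noise = np.resize(noise, speech.shape)        # loop or trim noise to length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    # Scale noise so that 10 * log10(p_speech / p_noise_scaled) == snr_db.
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Toy stand-ins: a 1-second tone as "speech" and white noise, at 16 kHz.
rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 220 * np.linspace(0, 1, 16_000))
noise = rng.normal(0, 1, 16_000)
degraded = mix_at_snr(speech, noise, snr_db=5.0)  # noticeably noisy channel
```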
In the Wild (YouTube, social media, uncontrolled noise). In the Wild is not a single dataset; it's an entire category, typically curated from video-based social media like TikTok, live streams, and YouTube, as well as other audio mediums like podcasts and other uncontrolled environments. Essentially, it's audio captured in the wild, our world of shared social experience. The nature of the audio lends itself well to application across every modern platform that ingests user audio. It measures real-world noise, room acoustics, microphone variability, editing artifacts, background music, cross-talk, and overlapping speech. CodecFake (neural codec processing). Neural codecs are now embedded into communication and social apps like WhatsApp, Instagram, TikTok, and Zoom, which increasingly makes real human audio look synthetic to older detectors. Obviously, this decreases the accuracy of deepfake detection models and has negative implications in real-world scenarios. CodecFake aims to identify the models that can discern true human audio from deepfakes by focusing the benchmark on neural codecs (EnCodec, DAC, SoundStream, etc.) and codec-induced artifacts. It measures whether a detector can handle audio that has been encoded → decoded → re-encoded, robustness to neural codec artifacts that resemble TTS artifacts, and sensitivity to bitrate changes.
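CodecFake's round-trip is easy to reproduce in miniature. The sketch below pushes audio through Meta's EnCodec neural codec via the Hugging Face transformers library (this mirrors the documented EncodecModel usage; the random array is a stand-in for a real human clip). A detector robust to codec artifacts should still score the round-tripped human audio as genuine:

```python
import numpy as np
import torch
from transformers import AutoProcessor, EncodecModel

# Load Meta's 24 kHz EnCodec neural codec from the Hugging Face hub.
model = EncodecModel.from_pretrained("facebook/encodec_24khz")
processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")

# Stand-in for one second of real human speech at 24 kHz.
audio = np.random.randn(24_000).astype(np.float32)

inputs = processor(raw_audio=audio, sampling_rate=24_000, return_tensors="pt")
with torch.no_grad():
    # Encode to discrete codec tokens, then decode back to a waveform.
    encoded = model.encode(inputs["input_values"], inputs["padding_mask"])
    decoded = model.decode(encoded.audio_codes, encoded.audio_scales,
                           inputs["padding_mask"])[0]
# `decoded` is human speech carrying the codec artifacts that can
# confuse older detectors into calling it synthetic.
```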
Academic benchmarks (Fake or Real, DFADD, SONAR). Academic benchmarks are another category of research-grade datasets, used in peer-reviewed papers to compare new architectures and evaluate TTS, voice conversion, and deepfake detection. This is research-grade performance. Fake or Real is a dataset with binary classification and diverse TTS and voice conversion systems. The audio is clean and controlled, which makes it perfect for setting a baseline of discriminative ability. DFADD (the DeepFake Audio Detection Dataset) includes multiple languages and synthesis methods and was designed to test generalization to unseen attacks. The SONAR dataset places heavy focus on neural vocoder artifacts by including challenging, borderline-realistic samples and high-quality TTS systems.
LibriSeVoc (neural vocoder synthesis). If you're trying to detect high-quality synthetic speech from commercial TTS systems, then the LibriSeVoc dataset is critical. It is built on LibriSpeech, which on its own is generally too clean to give a complete picture, but re-synthesized using neural vocoders (HiFi-GAN, WaveGlow, WaveRNN, etc.), which are an essential part of detection today. Modern TTS pipelines often use diffusion models for acoustic modeling and neural vocoders for waveform generation, so vocoder detection is a core capability. This dataset measures the ability to detect vocoder-generated speech, sensitivity to subtle phase and spectral artifacts, and generalization across vocoder families.
How voice-native architecture beats repurposed models. When systems repurpose a model, they're typically layering non-voice-specific, generalized ML models to handle new tasks. There are several drawbacks to this. Inefficiency: the extensive post-processing and manual review needed when repurposing a generalized model makes the process incredibly inefficient. Accuracy gaps: voice-native AI tools are purpose-built to take into account tone, cadence, and the other complexities of speech-based communication, which makes them incredibly accurate; repurposed, generalized models may misinterpret conversational nuances. Missed context: the ability to detect tone and intent, as voice-native AI models do, is pertinent to stopping harmful behaviors; repurposed models may even reinforce those behaviors and alienate users. Limited scalability: non-specialized systems struggle to keep up with the growing volume of voice interactions, causing delayed responses at minimum, while also increasing user harm.
You can see how this plays out in the architecture of the top three models on the HF deepfake leaderboard. Hiya is telephony-focused, which means it's strong on phone-call conditions, but their architecture is optimized for a specific channel. Resemble AI comes from the generation side, which means they understand synthesis because, well, they build synthesizers. While this is an essential factor in effective deepfake detection, it's not the only factor necessary. Detection requires architectural priorities that repurposed models often don't have, including adversarial robustness, real-time processing, and false positive management at scale. That is why Modulate's ELMs take the voice-native approach, which generalizes across telephony, VoIP, clean audio, and degraded conditions. Because they're purpose-built for voice, they operate directly on audio features: spectrograms, prosody, formant transitions, and micro-temporal patterns. These architectural differences make all the difference in the accuracy of these models when applied to deepfake detection.
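Modulate hasn't published its feature pipeline, but as a rough illustration of what "operating directly on audio features" means, here is a generic log-mel spectrogram front end of the kind acoustic-only detectors commonly consume (torchaudio; the file name is hypothetical):

```python
import torch
import torchaudio

# Load a mono call segment (hypothetical file name).
waveform, sample_rate = torchaudio.load("call_segment.wav")

# 25 ms windows with a 10 ms hop and 80 mel bands: a common acoustic
# front end that preserves the spectral and micro-temporal detail
# (prosody, formant transitions) that voice-native detectors rely on.
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate,
    n_fft=int(0.025 * sample_rate),
    hop_length=int(0.010 * sample_rate),
    n_mels=80,
)(waveform)

log_mel = torch.log(mel + 1e-6)  # log compression stabilizes the dynamics
print(log_mel.shape)             # (channels, n_mels, frames)
```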
Production performance: beyond the benchmark. The question has never been just, can you detect deepfakes in the lab? It's, can you do it at scale without drowning your fraud team in false alerts? That is why we must go beyond the benchmark to look at deployment. The per-dataset metrics on the deepfake leaderboard help us do just that.
Modulate velma-2, EER (%) by dataset:
Pooled: 1.586
Average: 1.104
In the Wild: 1.271
ASVspoof 2019: 1.29
ASVspoof 2021 LA: 1.330
ASVspoof 2021 DF: 0.331
ASVspoof 2024 eval: 0.384
Fake or Real: 0.133
CodecFake: 1.538
ADD 2022 Track 1: 5.059
ADD 2022 Track 3: 1.174
ADD 2023 R1: 1.441
ADD 2023 R2: 1.742
DFADD: 0.000
LibriSeVoc: 0.265
SONAR: 0.88
Among the top-performing models, the margins for the highest rankings are extremely slim, making for a competitive field. Some models also compensate for their scores with smaller models and fast processing times; Hiya claims a noteworthy 8x real-time processing speed. Still, Modulate has managed to earn top placement, with near-perfect results across six datasets, despite how challenging each is in its own right. In the remaining benchmarks, we remain competitive, steadfastly challenging the results of other submitted systems. With all the success across the individual datasets, though, it is the 1.104% average EER that demonstrates impressive generalization across real-world applications and synthesis methods. For those outside the industry, a 1.466% average EER difference between Modulate and the next system might not seem like it would make much difference in application. In reality, that difference amounts to 60% more deepfakes caught, or 150,000 fewer false positives per 10 million calls, all on a model that is 10x smaller. For banks, insurance companies, and customer service teams, this could account for millions in losses.
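The back-of-the-envelope behind those figures checks out, under the simplifying assumption that the leaderboard EERs carry over directly to call traffic:

```python
# Average EERs from the leaderboard snapshot above, as fractions.
velma2, next_best = 0.01104, 0.02570
calls = 10_000_000

# Genuine calls spared a false flag: the ~1.466-point gap over 10M calls.
print(f"{(next_best - velma2) * calls:,.0f} fewer false positives")  # 146,600, i.e. roughly 150,000

# Among deepfakes, misses fall from 2.570% to 1.104%: about 57% fewer
# fakes slipping through, in line with the "60% more caught" framing.
print(f"{1 - velma2 / next_best:.0%} fewer missed deepfakes")  # 57%
```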
Now consider that this unprecedented level of accuracy costs just $0.25 per hour. You also gain both batch and streaming modes and structured output offered in four confidence segments, meaning you get four separate scores representing levels of certainty about the predictions.
A lower-cost detection model means continuous coverage: that $0.25 per hour is 100x more affordable than our competitors, thanks to an inherently efficient, smaller model. This isn't just nice to have while maintaining the highest level of accuracy in deepfake detection; it's absolutely essential to stopping fraudsters. Most banks, insurance companies, and call centers are checking for fraud. However, they're running these checks at the beginning of calls and hitting the off switch early to avoid the high costs that come with longer run times. Fraudsters know this. It's why those with a little more sophistication open their calls with a real human voice to get through the fraud check, then turn on the AI voice once they're through it. An affordable cost structure makes it possible to check the entire call, not just the opening seconds. You can run every single segment for every speaker, continuously and even in the background.
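What that continuous coverage looks like in practice is a full-call monitoring loop rather than a one-shot check. The sketch below is purely illustrative: this article doesn't document Modulate's API schema, so the client, `detector.score()` call, and response fields are hypothetical stand-ins; the point is the shape of the loop, scoring every segment of every speaker instead of only the greeting:

```python
from dataclasses import dataclass

CHUNK_SECONDS = 4  # hypothetical scoring window


@dataclass
class SegmentVerdict:
    start_s: float
    confidence_band: str  # e.g. one of the four certainty segments above
    is_synthetic: bool


def monitor_call(audio_stream, detector) -> list[SegmentVerdict]:
    """Score every chunk of a live call instead of only the opening seconds.

    `audio_stream` yields fixed-length PCM chunks; `detector.score()` is a
    hypothetical stand-in for a streaming deepfake-detection API call.
    """
    verdicts = []
    for i, chunk in enumerate(audio_stream):
        result = detector.score(chunk)             # hypothetical API call
        verdicts.append(SegmentVerdict(
            start_s=i * CHUNK_SECONDS,
            confidence_band=result["confidence"],  # certainty about the prediction
            is_synthetic=result["deepfake"],
        ))
        # A fraudster who opens with a real voice and switches to an AI
        # voice mid-call is caught here, not missed by an early-exit check.
    return verdicts
```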
The efficiency question: does a smaller model really matter? Hiya has highlighted their one-billion-parameter model as more efficient than other systems. It's a legitimate claim, since model efficiency matters for both deployment cost and latency. However, this claim doesn't hold water when tested against other approaches to efficiency and accuracy.
Modulate is already a smaller model than most, but it gains an additional advantage with its voice-native architecture.
This architecture provides an inherent efficiency advantage because it doesn't need to process the full complexity of language.
The models operate purely on acoustic features, avoiding the computational overhead associated with transformer-driven language processing.
Not to mention the avoidance of the post-processing most repurposed models typically need, which dramatically decreases their efficiency. Voice-native architecture leads them all in accuracy, efficiency, and cost. The three models that top the Hugging Face Speech Deepfake Leaderboard have each taken a different architectural approach to achieving both efficiency and impressive accuracy. But it is Modulate's voice-native architecture that delivers the best results, consistently. We deliver this pivotal performance thanks to our consistent testing and
training with noisy voice data. We built our models on half a billion hours of real audio,
focusing on a diverse range of vocal tones, speech rhythms, and pronunciations that appear in
patterns over longer audio segments. This helps us guarantee accuracy, as seen in the HF benchmarks,
at a price that allows businesses to continually run Modulate without significantly driving total costs higher. As the scale of deepfake attacks grows, you need a solution that can scale with them.
You need modulate. Thank you for listening to this Hackernoon story, read by artificial intelligence.
Visit hackernoon.com to read, write, learn and publish.
