The Good Tech Companies - Catching 98.9 Out of 100 Deepfakes: What It Takes to Lead Hugging Face's Leaderboard
Episode Date: April 16, 2026. This story was originally published on HackerNoon at: https://hackernoon.com/catching-989-out-of-100-deepfakes-what-it-takes-to-lead-hugging-faces-leaderboard. Modulate ...tops Hugging Face's Speech Deepfake Leaderboard with 98.9% accuracy at $0.25/hr. Here's what voice-native architecture unlocks for fraud teams. Check more stories related to machine-learning at: https://hackernoon.com/c/machine-learning. You can also check exclusive content about #deepfake-detection, #voice-ai, #ai-benchmarks, #audio-deepfakes, #fraud-prevention, #deepfake-benchmarks, #huggingface, #good-company, and more. This story was written by: @modulate. Learn more about this writer by checking @modulate's about page, and for more stories, please visit hackernoon.com. Voice deepfake losses are projected to hit $40B by 2027, a 6,566% jump from 2023. Modulate's velma-2 now ranks #1 on Hugging Face's Speech Deepfake Leaderboard with a 1.104% average EER across 14 datasets and 2M+ audio samples, catching 98.9 out of every 100 deepfakes. This post breaks down why the Hugging Face benchmark is the most credible public standard for detection, how Modulate's voice-native ELM architecture outperforms repurposed models from Hiya and Resemble AI, and why running detection at $0.25/hr (100x cheaper than competitors) lets fraud teams monitor entire calls instead of just the opening seconds where most checks stop today.
Transcript
This audio is presented by Hacker Noon, where anyone can learn anything about any technology.
Catching 98.9 out of 100 deepfakes: what it takes to lead Hugging Face's leaderboard, by Modulate.
$40 billion in projected losses by 2027 for businesses exposed to voice deepfakes. To put that into perspective, total losses from voice deepfakes were $600 million in 2023 and $12.5 billion in 2024. That's an increase of 6,566% in five years. Naturally, financial services were among the hardest hit, with 23% of financial sector organizations reporting losses of over $1 million. Contact centers were also among the hardest hit: it's now reported that these centers encounter a voice deepfake attack every 46 seconds.
Businesses need a deepfake detection API to accurately and reliably detect voice fraud as it happens, to mitigate those hard losses. Vetting such a solution depends on credible industry benchmarks, like the Hugging Face Speech Deepfake Leaderboard, where Modulate ranks number one as of March 26, with an average EER of 1.104%. This translates to Modulate catching 98.9% of all AI-generated deepfake voices across the diverse range of audio used in Hugging Face's 14 benchmarks. Because EER balances missed fakes against false alarms, it also pins down the false positive rate: just 1.1%.
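For readers who want the metric made concrete: EER is the operating point where the rate of missed deepfakes equals the rate of genuine audio falsely flagged, which is why a 1.104% EER implies both the 98.9% catch rate and the roughly 1.1% false positive rate. A minimal sketch of the computation (NumPy only; the scores and labels are toy values, not leaderboard data):

```python
import numpy as np

def equal_error_rate(scores: np.ndarray, labels: np.ndarray) -> float:
    """EER: the threshold where the false rejection rate (genuine audio
    flagged as fake) crosses the false acceptance rate (fakes that slip
    through). scores: higher = more likely deepfake; labels: 1 = fake."""
    best_gap, eer = np.inf, 1.0
    for t in np.unique(scores):
        flagged = scores >= t
        frr = np.mean(flagged[labels == 0])   # genuine clips flagged as fake
        far = np.mean(~flagged[labels == 1])  # fake clips passed as genuine
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# Toy example: four genuine clips (label 0) and four fakes (label 1).
scores = np.array([0.02, 0.10, 0.15, 0.40, 0.55, 0.80, 0.90, 0.97])
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(f"EER = {equal_error_rate(scores, labels):.3%}")  # perfectly separable here
```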
Why everyone looks to the Hugging Face deepfake leaderboard. The Hugging Face deepfake leaderboard refers to a set of public, continuously updated leaderboards that evaluate how well AI systems detect synthetic or manipulated media, especially speech deepfakes. To truly test the mettle of each model, it uses 14 datasets and 2 million-plus audio samples, spanning clean lab audio to real-world telephony, to benchmark each of the models. Because the HF deepfake arena is open to the public and anyone can reproduce the results to test the claims, it's the most transparent benchmark for evaluating the accuracy of detection models on manipulated audio. Though the public may submit their own results, the leaderboard was developed and is continuously maintained by researchers at the Idiap Research Institute (Switzerland), CNRS/IRISA (France), Mohamed bin Zayed University of Artificial Intelligence (UAE), Tallinn University of Technology (Estonia), and Valid Limited (UK). This makes HF one of the most credible and rigorous public benchmarks for evaluating detection systems. The leaderboard: Modulate at number one, with Resemble AI and Hiya rounding out the top three.
The two most important values on the leaderboard are the average result and the pooled result. The average result grants equal weight to the results across all 14 datasets; the pooled result combines all evaluation samples into a single pool. The leaderboard now shows Modulate's velma-2 at number one for both average and pooled, with scores of 1.104 and 1.586, respectively. This translates to an average accuracy of 98.9%: out of every 100 deepfake audio files, Modulate correctly catches 98.9, while only 1.1% of genuine recordings get falsely flagged as deepfakes. It's the closest any model has come to complete accuracy in deepfake detection.
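The distinction between those two numbers matters: averaging treats a small dataset and a half-million-sample dataset as equally important, while pooling lets the big datasets dominate. A minimal sketch of the two aggregations (the dataset names, EERs, and sample counts below are invented for illustration; the real leaderboard computes the pooled figure by re-scoring one combined pool rather than weighting per-dataset EERs):

```python
import numpy as np

# Hypothetical per-dataset results: (EER as a fraction, sample count).
results = {
    "clean_lab_tts":  (0.003, 50_000),
    "telephony_8khz": (0.015, 400_000),
    "social_media":   (0.012, 150_000),
}

# Average EER: every dataset counts equally, regardless of size.
average_eer = np.mean([eer for eer, _ in results.values()])

# Pooled EER, approximated here as a sample-weighted mean.
total = sum(n for _, n in results.values())
pooled_eer = sum(eer * n for eer, n in results.values()) / total

print(f"average: {average_eer:.3%}, pooled: {pooled_eer:.3%}")
# average: 1.000%, pooled: 1.325% -- the two can diverge noticeably.
```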
Continuing with the results: Resemble AI's Resemble Detect 3B follows closely with an average EER of 2.570 (97.4% average accuracy) and a pooled EER of 2.099. Hiya's Authenticity Verification lands in third with an average EER of 2.113 (97.9% average accuracy) and a pooled EER of 2.324. These top three results are rather incredible, especially when you consider how drastically more accurate each model is in comparison to the fourth-place system. To get an even better understanding of why these achievements are so pivotal,
we must look at our two top competitors individually. Resemble AI is primarily a voice generation
company, TTS and voice cloning, which could place them on the opposite side of the detection arena.
However, their detection product, Resemble Detect 3B, is a 3B-parameter model. So while detection isn't their core business, they've shown dedication to combating voice fraud with a solid, accurate model. Hiya is a serious player in telephony fraud, with a model that is three times smaller than Resemble AI's, using only 1 billion parameters and operating at 8x real-time speed in streaming mode. The majority of their business is focused on branded caller ID and voice agents, though they've dedicated a branch of their business to spam and fraud detection and prevention.
Modulate is a different story altogether: voice-native from day one, with detection at the very core of our business. We are built on the ELM architecture, and our offerings are all contained within that architecture: conversation intelligence, speech-to-text, and deepfake detection. This focus has paid off by allowing us to make strides in deepfake detection accuracy. What do Hugging Face's 14 datasets actually test? There are roughly 2 million audio files across the 14 datasets in the HF deepfake arena, all collected to represent real-world attack scenarios across a variety of settings, accents, languages, industries, and technical jargon. Let's take a look at how truly diverse these audio files are and what they test for.
ASVspoof series (2019, 2021 LA, 2021 DF, 2024). Out of all of the datasets in the HF collection, the ASVspoof series is the closest thing to an industry standard for model evaluation under controlled but realistic conditions. That's why it is the longest-running and most widely cited anti-spoofing benchmark series in all of speech security. It measures: LA (Logical Access), TTS and voice conversion attacks injected directly into the system with no channel noise; PA (Physical Access), playback attacks in real rooms with microphones, reverberation, and environmental noise; and DF (DeepFake), modern neural TTS and voice conversion systems, including diffusion-based models. With each new edition, you get new attack types, codecs, and channel conditions. For instance, 2024 expands into VoIP, telephony, compression artifacts, and more realistic channel distortions.
ADD challenges (2022 Tracks 1 and 3; 2023 R1 and R2). The Audio Deepfake Detection (ADD) challenge series was designed with the knowledge that most models are tested on clean audio, with those unchallenging benchmarks flaunted as sales signals. This series focuses on noisy, degraded, real-world audio as a way to punish those models. It measures: Track 1, in-the-wild deepfake detection; Track 3, robustness to channel effects, background noise, and environmental distortions. The 2023 R1 and R2 rounds introduce more diverse languages, codecs, and unseen synthesis methods.
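To get a feel for the degradations the ADD tracks reward robustness to, here is a minimal sketch that mixes background noise into a clip at a chosen signal-to-noise ratio (NumPy only; the synthetic arrays stand in for decoded audio, and this is generic augmentation rather than the challenge's actual pipeline):

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix noise into speech at a target signal-to-noise ratio in dB."""
    noise = np.resize(noise, speech.shape)        # loop or trim noise to length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    # Scale noise so that 10 * log10(p_speech / p_noise_scaled) == snr_db.
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Toy stand-ins: a 1-second tone as "speech" and white noise, at 16 kHz.
rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 220 * np.linspace(0, 1, 16_000))
noise = rng.normal(0, 1, 16_000)
degraded = mix_at_snr(speech, noise, snr_db=5.0)  # noticeably noisy channel
```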
In the Wild (YouTube, social media, uncontrolled noise). In the Wild is not a single dataset; it's an entire category, typically curated from video-based social media like TikTok, live streams, and YouTube, as well as other audio mediums like podcasts and other uncontrolled environments. Essentially, it's audio captured in the wild, our world of shared social experience. The nature of the audio lends itself well to application across every modern platform that ingests user audio. It measures real-world noise, room acoustics, microphone variability, editing artifacts, background music, cross-talk, and overlapping speech. CodecFake (neural codec processing). Neural codecs are now embedded into communication and social apps like WhatsApp, Instagram, TikTok, and Zoom, which increasingly makes real human audio look synthetic to older detectors. Obviously, this decreases the accuracy of deepfake detection models and has negative implications in real-world scenarios. CodecFake aims to identify the models that can discern true human audio from deepfakes by focusing the benchmark on neural codecs (EnCodec, DAC, SoundStream, etc.) and codec-induced artifacts. It measures whether a detector can handle audio that has been encoded → decoded → re-encoded, robustness to neural codec artifacts that resemble TTS artifacts, and sensitivity to bitrate changes.
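CodecFake's round-trip is easy to reproduce in miniature. The sketch below pushes audio through Meta's EnCodec neural codec via the Hugging Face transformers library (this mirrors the documented EncodecModel usage; the random array is a stand-in for a real human clip). A detector robust to codec artifacts should still score the round-tripped human audio as genuine:

```python
import numpy as np
import torch
from transformers import AutoProcessor, EncodecModel

# Load Meta's 24 kHz EnCodec neural codec from the Hugging Face hub.
model = EncodecModel.from_pretrained("facebook/encodec_24khz")
processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")

# Stand-in for one second of real human speech at 24 kHz.
audio = np.random.randn(24_000).astype(np.float32)

inputs = processor(raw_audio=audio, sampling_rate=24_000, return_tensors="pt")
with torch.no_grad():
    # Encode to discrete codec tokens, then decode back to a waveform.
    encoded = model.encode(inputs["input_values"], inputs["padding_mask"])
    decoded = model.decode(encoded.audio_codes, encoded.audio_scales,
                           inputs["padding_mask"])[0]
# `decoded` is human speech carrying the codec artifacts that can
# confuse older detectors into calling it synthetic.
```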
Academic benchmarks (Fake or Real, DFADD, SONAR). Academic benchmarks are another category of research-grade datasets, used in peer-reviewed papers to compare new architectures and evaluate TTS, voice conversion, and deepfake detection. This is research-grade performance. Fake or Real is a dataset with binary classification and diverse TTS and voice conversion systems. The audio is clean and controlled, which makes it perfect for setting a baseline of discriminative ability. DFADD (the DeepFake Audio Detection Dataset) includes multiple languages and synthesis methods and was designed to test generalization to unseen attacks. The SONAR dataset places heavy focus on neural vocoder artifacts by including challenging, borderline-realistic samples and high-quality TTS systems.
LibriSeVoc (neural vocoder synthesis). If you're trying to detect high-quality synthetic speech from commercial TTS systems, then the LibriSeVoc dataset is critical. It is built on LibriSpeech, which on its own is generally too clean to give a complete picture, but re-synthesized using neural vocoders (HiFi-GAN, WaveGlow, WaveRNN, etc.), which are an essential part of detection today. Modern TTS pipelines often use diffusion models for acoustic modeling and neural vocoders for waveform generation, so vocoder detection is a core capability. This dataset measures the ability to detect vocoder-generated speech, sensitivity to subtle phase and spectral artifacts, and generalization across vocoder families.
How voice-native architecture beats repurposed models. When systems repurpose a model, they're typically layering non-voice-specific, generalized ML models to handle new tasks. There are several drawbacks to this. Inefficiency: the extensive post-processing and manual review needed when repurposing a generalized model makes the process incredibly inefficient. Accuracy gaps: voice-native AI tools are purpose-built to take into account tone, cadence, and the other complexities of speech-based communication, which makes them incredibly accurate; repurposed, generalized models may misinterpret conversational nuances. Missed context: the ability to detect tone and intent, as voice-native AI models do, is pertinent to stopping harmful behaviors; repurposed models may even reinforce those behaviors and alienate users. Limited scalability: non-specialized systems struggle to keep up with the growing volume of voice interactions, causing delayed responses at minimum, while also increasing user harm.
You can see how this plays out in the architecture of the top three models on the HF deepfake leaderboard. Hiya is telephony-focused, which means it's strong on phone-call conditions, but their architecture is optimized for a specific channel. Resemble AI comes from the generation side, which means they understand synthesis because, well, they build synthesizers. While this is an essential factor in effective deepfake detection, it's not the only factor necessary. Detection requires architectural priorities that repurposed models often don't have, including adversarial robustness, real-time processing, and false positive management at scale. That is why Modulate's ELMs take the voice-native approach, which generalizes across telephony, VoIP, clean audio, and degraded conditions. Because they're purpose-built for voice, they operate directly on audio features: spectrograms, prosody, formant transitions, and micro-temporal patterns. These architectural differences make all the difference in the accuracy of these models when applied to deepfake detection.
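Modulate hasn't published its feature pipeline, but as a rough illustration of what "operating directly on audio features" means, here is a generic log-mel spectrogram front end of the kind acoustic-only detectors commonly consume (torchaudio; the file name is hypothetical):

```python
import torch
import torchaudio

# Load a mono call segment (hypothetical file name).
waveform, sample_rate = torchaudio.load("call_segment.wav")

# 25 ms windows with a 10 ms hop and 80 mel bands: a common acoustic
# front end that preserves the spectral and micro-temporal detail
# (prosody, formant transitions) that voice-native detectors rely on.
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate,
    n_fft=int(0.025 * sample_rate),
    hop_length=int(0.010 * sample_rate),
    n_mels=80,
)(waveform)

log_mel = torch.log(mel + 1e-6)  # log compression stabilizes the dynamics
print(log_mel.shape)             # (channels, n_mels, frames)
```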
Production performance: beyond the benchmark. The question has never been just, can you detect deepfakes in the lab? It's, can you do it at scale without drowning your fraud team in false alerts? That is why we must go beyond the benchmark to look at deployment. The per-dataset metrics on the deepfake leaderboard help us do just that.
Modulate velma-2, EER (%) by dataset:
Pooled: 1.586
Average: 1.104
In the Wild: 1.271
ASVspoof 2019: 1.29
ASVspoof 2021 LA: 1.330
ASVspoof 2021 DF: 0.331
ASVspoof 2024 eval: 0.384
Fake or Real: 0.133
CodecFake: 1.538
ADD 2022 Track 1: 5.059
ADD 2022 Track 3: 1.174
ADD 2023 R1: 1.441
ADD 2023 R2: 1.742
DFADD: 0.000
LibriSeVoc: 0.265
SONAR: 0.88
Among the top-performing models, the margins for the highest rankings are extremely slim, making for a competitive field. Some models also compensate for their scores with smaller models and fast processing times; Hiya claims a noteworthy 8x real-time processing speed. Still, Modulate has managed to earn top placement, with near-perfect results across six datasets, despite how challenging each is in its own right. In the remaining benchmarks, we remain competitive, steadfastly challenging the results of other submitted systems. With all the success across the individual datasets, though, it is the 1.104% average EER that demonstrates impressive generalization across real-world applications and synthesis methods. For those outside the industry, a 1.466% average EER difference between Modulate and the next system might not seem like it would make much difference in application. In reality, that difference amounts to 60% more deepfakes caught, or 150,000 fewer false positives per 10 million calls, all on a model that is 10x smaller. For banks, insurance companies, and customer service teams, this could account for millions in losses.
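The back-of-the-envelope behind those figures checks out, under the simplifying assumption that the leaderboard EERs carry over directly to call traffic:

```python
# Average EERs from the leaderboard snapshot above, as fractions.
velma2, next_best = 0.01104, 0.02570
calls = 10_000_000

# Genuine calls spared a false flag: the ~1.466-point gap over 10M calls.
print(f"{(next_best - velma2) * calls:,.0f} fewer false positives")  # 146,600, i.e. roughly 150,000

# Among deepfakes, misses fall from 2.570% to 1.104%: about 57% fewer
# fakes slipping through, in line with the "60% more caught" framing.
print(f"{1 - velma2 / next_best:.0%} fewer missed deepfakes")  # 57%
```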
Now consider that this unprecedented level of accuracy costs just $0.25 per hour. You also gain both batch and streaming modes and structured output offered in four confidence segments, meaning you get four separate scores representing levels of certainty about the predictions.
A lower-cost detection model means continuous coverage: that $0.25 per hour is 100x more affordable than our competitors, thanks to an inherently efficient, smaller model. This isn't just nice to have while maintaining the highest level of accuracy in deepfake detection; it's absolutely essential to stopping fraudsters. Most banks, insurance companies, and call centers are checking for fraud. However, they're running these checks at the beginning of calls and hitting the off switch early to avoid the high costs that come with longer run times. Fraudsters know this. It's why those with a little more sophistication open their calls with a real human voice to get through the fraud check, then turn on the AI voice once they're through it. An affordable cost structure makes it possible to check the entire call, not just the opening seconds. You can run every single segment for every speaker, continuously and even in the background.
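What that continuous coverage looks like in practice is a full-call monitoring loop rather than a one-shot check. The sketch below is purely illustrative: this article doesn't document Modulate's API schema, so the client, `detector.score()` call, and response fields are hypothetical stand-ins; the point is the shape of the loop, scoring every segment of every speaker instead of only the greeting:

```python
from dataclasses import dataclass

CHUNK_SECONDS = 4  # hypothetical scoring window


@dataclass
class SegmentVerdict:
    start_s: float
    confidence_band: str  # e.g. one of the four certainty segments above
    is_synthetic: bool


def monitor_call(audio_stream, detector) -> list[SegmentVerdict]:
    """Score every chunk of a live call instead of only the opening seconds.

    `audio_stream` yields fixed-length PCM chunks; `detector.score()` is a
    hypothetical stand-in for a streaming deepfake-detection API call.
    """
    verdicts = []
    for i, chunk in enumerate(audio_stream):
        result = detector.score(chunk)             # hypothetical API call
        verdicts.append(SegmentVerdict(
            start_s=i * CHUNK_SECONDS,
            confidence_band=result["confidence"],  # certainty about the prediction
            is_synthetic=result["deepfake"],
        ))
        # A fraudster who opens with a real voice and switches to an AI
        # voice mid-call is caught here, not missed by an early-exit check.
    return verdicts
```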
The efficiency question: does a smaller model really matter? Hiya has highlighted their one-billion-parameter model as more efficient than other systems. It's a legitimate claim, since model efficiency matters for both deployment cost and latency. However, this claim doesn't hold water when tested against other approaches to efficiency and accuracy.
Modulate is already a smaller model than most, but it gains an additional advantage with its voice-native architecture.
This architecture provides an inherent efficiency advantage because it doesn't need to process the full complexity of language.
The models operate purely on acoustic features, avoiding the computational overhead associated with transformer-driven language processing.
Not to mention the avoidance of the post-processing most repurposed models typically need, which dramatically decreases their efficiency. Voice-native architecture leads them all in accuracy, efficiency, and cost. The three models that top the Hugging Face Speech Deepfake Leaderboard have each taken a different architectural approach to achieving both efficiency and impressive accuracy. But it is Modulate's voice-native architecture that delivers the best results, consistently. We deliver this pivotal performance thanks to our consistent testing and
training with noisy voice data. We built our models on half a billion hours of real audio,
focusing on a diverse range of vocal tones, speech rhythms, and pronunciations that appear in
patterns over longer audio segments. This helps us guarantee accuracy, as seen in the HF benchmarks,
at a price that allows businesses to continually run Modulate without significantly driving total costs higher. As the scale of deepfake attacks grows, you need a solution that can scale with them.
You need modulate. Thank you for listening to this Hackernoon story, read by artificial intelligence.
Visit hackernoon.com to read, write, learn and publish.
