Measuring the Robustness of Audio Deepfake Detection under Real-World Corruptions

1Fordham University, 2IBM Research
Overview

Abstract

Deepfakes have emerged as a widespread and rapidly escalating concern in generative AI, spanning media types such as images, audio, and video. Among these, audio deepfakes are particularly alarming due to the growing accessibility of high-quality voice synthesis tools and the ease with which synthetic speech can be distributed via platforms such as social media and robocalls. Consequently, detecting audio deepfakes is critical to combating the misuse of AI-synthesized speech. However, real-world audio is often subject to corruptions, such as noise, modification, and compression, that may significantly degrade detection performance. In this work, we systematically evaluate the robustness of 10 audio deepfake detection models against 18 common corruption types, grouped into three categories: noise perturbation, audio modification, and compression. Using both traditional deep learning models and state-of-the-art foundation models, our study yields four key insights. (1) Most models demonstrate strong robustness to noise but are notably more vulnerable to audio modifications and compression, especially when neural codecs are applied. (2) Speech foundation models generally outperform traditional models across most corruption scenarios, likely owing to their extensive pre-training on large-scale, diverse audio data. (3) Increasing model size improves robustness, though with diminishing returns. (4) Robustness to unseen corruptions can be enhanced by targeted data augmentation during training or by applying speech enhancement at inference time. These findings highlight the importance of evaluating detectors against diverse corruption types and of developing more robust audio deepfake detection frameworks for reliable deployment in practice.
We further advocate that future research on deepfake detection, across all media formats, should account for the diverse and often unpredictable distortions common in real-world environments.

Audio Perturbation Samples

Noise Perturbation

Each noise type is mixed with the original audio at five severity levels:

Gaussian Noise: Original vs. SNR = 5, 10, 20, 30, 40 dB
Background Noise: Original vs. SNR = 5, 10, 20, 30, 40 dB
Background Music: Original vs. SNR = 5, 10, 20, 30, 40 dB
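The noise perturbations mix an interfering signal into the clean audio at a target signal-to-noise ratio. As a minimal sketch (not the paper's actual pipeline), Gaussian noise at a given SNR can be added to a mono float waveform as follows:

```python
import numpy as np

def add_noise_at_snr(signal: np.ndarray, snr_db: float, seed=None) -> np.ndarray:
    """Mix white Gaussian noise into `signal` at a target SNR in dB."""
    rng = np.random.default_rng(seed)
    signal_power = np.mean(signal ** 2)
    # Solve SNR_dB = 10 * log10(P_signal / P_noise) for the noise power.
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise
```

Lower SNR means stronger corruption, so SNR = 5 dB is the harshest setting in the grid above; background noise and music follow the same mixing rule with a recorded interferer scaled to the target noise power instead of sampled Gaussian noise.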

Modification

Highpass Filter: Original vs. Cutoff Ratio = 0.1, 0.2, 0.3, 0.4, 0.5
Lowpass Filter: Original vs. Cutoff Ratio = 0.1, 0.2, 0.3, 0.4, 0.5
Pitch Shift: Original vs. Semitone = -2, -1, -0.5, 0.5, 1, 2
Time Stretch: Original vs. Speed Factor = 0.7, 0.9, 1.1, 1.3, 1.5
Echo: Original vs. Delay = 0.1 s, 0.3 s, 0.5 s, 0.7 s, 0.9 s
Silence Insertion: Original vs. Length Ratio = 0.1, 0.2, 0.3, 0.4, 0.5
Smooth: Original vs. Window Size = 6, 10, 14, 18, 22, 26
Autotune: Original vs. Key = C, D, E, F, G, A, B
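Several of the modifications above are simple signal-level operations. Echo, for example, overlays a delayed, attenuated copy of the waveform on itself. The sketch below illustrates the idea only; it is not the paper's implementation, and the `delay_s` and `decay` parameter names are assumptions:

```python
import numpy as np

def add_echo(signal: np.ndarray, sample_rate: int,
             delay_s: float = 0.3, decay: float = 0.5) -> np.ndarray:
    """Overlay one delayed, attenuated copy of the signal on itself."""
    delay = int(delay_s * sample_rate)
    # Pad the output so the tail of the echo is not cut off.
    out = np.concatenate([signal, np.zeros(delay)])
    out[delay:] += decay * signal
    return out
```

Repeating this with progressively smaller decay factors would approximate a multi-tap echo; the delay grid above (0.1 s to 0.9 s) controls how far the copy is offset.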

Compression

Quantization: Original vs. Bit Level = 2, 4, 6, 8, 10
Opus: Original vs. Bitrate = 16, 32, 64, 128, 256, 496 kbps
MP3: Original vs. Bitrate = 8, 16, 24, 32, 40 kbps
Encodec: Original vs. Bandwidth = 1.5, 3, 6, 12, 24 kHz
Other Neural Codecs: Original vs. AudioDec, FACodec, DAC
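Among the compression corruptions, quantization is the simplest to illustrate: the waveform is rounded onto a small grid of amplitude levels whose count is set by the bit level. A minimal sketch, assuming a waveform normalized to [-1, 1] (the paper's exact quantizer may differ):

```python
import numpy as np

def quantize(signal: np.ndarray, bit_level: int) -> np.ndarray:
    """Round a [-1, 1] waveform to 2**bit_level uniform amplitude steps."""
    step = 2.0 / (2 ** bit_level)
    return np.clip(np.round(signal / step) * step, -1.0, 1.0)
```

At bit level 2 only four steps remain and the distortion is audible; at bit level 10 the grid has 1024 steps and the signal is nearly unchanged, matching the severity ordering in the table above. The codec corruptions (Opus, MP3, Encodec, AudioDec, FACodec, DAC) instead pass the audio through a full encode/decode round trip at the listed bitrate or bandwidth.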

BibTeX

@article{li2025measuring,
  title={Measuring the Robustness of Audio Deepfake Detectors},
  author={Li, Xiang and Chen, Pin-Yu and Wei, Wenqi},
  journal={arXiv preprint arXiv:2503.17577},
  year={2025}
}