Measuring the Robustness of Audio Deepfake Detection under Real-World Corruptions

1Fordham University, 2IBM Research
Overview

Abstract

Deepfakes have emerged as a widespread and rapidly escalating concern in generative AI, spanning media types such as images, audio, and video. Among these, audio deepfakes are particularly alarming due to the growing accessibility of high-quality voice synthesis tools and the ease with which synthetic speech can be distributed via platforms such as social media and robocalls. Consequently, detecting audio deepfakes is critical to combating the misuse of AI-synthesized speech. However, real-world audio is often subject to corruptions, such as noise, modification, and compression, that may significantly degrade detection performance. In this work, we systematically evaluate the robustness of 10 audio deepfake detection models against 18 common corruption types, grouped into three categories: noise perturbation, audio modification, and compression. Using both traditional deep learning models and state-of-the-art foundation models, our study yields four key insights. (1) Most models demonstrate strong robustness to noise but are notably more vulnerable to audio modifications and compression, especially when neural codecs are applied. (2) Speech foundation models generally outperform traditional models across most corruption scenarios, likely owing to their extensive pre-training on large-scale, diverse audio data. (3) Increasing model size improves robustness, though with diminishing returns. (4) Robustness to unseen corruptions can be enhanced by targeted data augmentation during training or by applying speech enhancement at inference time. These findings highlight the importance of evaluating detectors against diverse corruption types and of developing more robust audio deepfake detection frameworks for reliable deployment in practice.
We further advocate that future research on deepfake detection, across all media formats, should account for the diverse and often unpredictable distortions common in real-world environments.

Audio Perturbation Samples

Noise Perturbation

Each noise type is mixed with the original audio at five severity levels:

Gaussian Noise: Original vs. SNR = 5, 10, 20, 30, 40 dB
Background Noise: Original vs. SNR = 5, 10, 20, 30, 40 dB
Background Music: Original vs. SNR = 5, 10, 20, 30, 40 dB
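The noise perturbations mix an interfering signal into the clean audio at a target signal-to-noise ratio. As a minimal sketch (not the paper's actual pipeline), Gaussian noise at a given SNR can be added to a mono float waveform as follows:

```python
import numpy as np

def add_noise_at_snr(signal: np.ndarray, snr_db: float, seed=None) -> np.ndarray:
    """Mix white Gaussian noise into `signal` at a target SNR in dB."""
    rng = np.random.default_rng(seed)
    signal_power = np.mean(signal ** 2)
    # Solve SNR_dB = 10 * log10(P_signal / P_noise) for the noise power.
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise
```

Lower SNR means stronger corruption, so SNR = 5 dB is the harshest setting in the grid above; background noise and music follow the same mixing rule with a recorded interferer scaled to the target noise power instead of sampled Gaussian noise.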

Modification

Highpass Filter: Original vs. Cutoff Ratio = 0.1, 0.2, 0.3, 0.4, 0.5
Lowpass Filter: Original vs. Cutoff Ratio = 0.1, 0.2, 0.3, 0.4, 0.5
Pitch Shift: Original vs. Semitone = -2, -1, -0.5, 0.5, 1, 2
Time Stretch: Original vs. Speed Factor = 0.7, 0.9, 1.1, 1.3, 1.5
Echo: Original vs. Delay = 0.1 s, 0.3 s, 0.5 s, 0.7 s, 0.9 s
Silence Insertion: Original vs. Length Ratio = 0.1, 0.2, 0.3, 0.4, 0.5
Smooth: Original vs. Window Size = 6, 10, 14, 18, 22, 26
Autotune: Original vs. Key = C, D, E, F, G, A, B
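Several of the modifications above are simple signal-level operations. Echo, for example, overlays a delayed, attenuated copy of the waveform on itself. The sketch below illustrates the idea only; it is not the paper's implementation, and the `delay_s` and `decay` parameter names are assumptions:

```python
import numpy as np

def add_echo(signal: np.ndarray, sample_rate: int,
             delay_s: float = 0.3, decay: float = 0.5) -> np.ndarray:
    """Overlay one delayed, attenuated copy of the signal on itself."""
    delay = int(delay_s * sample_rate)
    # Pad the output so the tail of the echo is not cut off.
    out = np.concatenate([signal, np.zeros(delay)])
    out[delay:] += decay * signal
    return out
```

Repeating this with progressively smaller decay factors would approximate a multi-tap echo; the delay grid above (0.1 s to 0.9 s) controls how far the copy is offset.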

Compression

Quantization: Original vs. Bit Level = 2, 4, 6, 8, 10
Opus: Original vs. Bitrate = 16, 32, 64, 128, 256, 496 kbps
MP3: Original vs. Bitrate = 8, 16, 24, 32, 40 kbps
Encodec: Original vs. Bandwidth = 1.5, 3, 6, 12, 24 kHz
Other Neural Codecs: Original vs. AudioDec, FACodec, DAC
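Among the compression corruptions, quantization is the simplest to illustrate: the waveform is rounded onto a small grid of amplitude levels whose count is set by the bit level. A minimal sketch, assuming a waveform normalized to [-1, 1] (the paper's exact quantizer may differ):

```python
import numpy as np

def quantize(signal: np.ndarray, bit_level: int) -> np.ndarray:
    """Round a [-1, 1] waveform to 2**bit_level uniform amplitude steps."""
    step = 2.0 / (2 ** bit_level)
    return np.clip(np.round(signal / step) * step, -1.0, 1.0)
```

At bit level 2 only four steps remain and the distortion is audible; at bit level 10 the grid has 1024 steps and the signal is nearly unchanged, matching the severity ordering in the table above. The codec corruptions (Opus, MP3, Encodec, AudioDec, FACodec, DAC) instead pass the audio through a full encode/decode round trip at the listed bitrate or bandwidth.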

BibTeX

@article{li2025measuring,
  title={Measuring the Robustness of Audio Deepfake Detectors},
  author={Li, Xiang and Chen, Pin-Yu and Wei, Wenqi},
  journal={arXiv preprint arXiv:2503.17577},
  year={2025}
}