Deepfakes have emerged as a widespread and rapidly escalating concern in generative
AI, spanning various media types such as images, audio, and videos. Among these,
audio deepfakes are particularly alarming due to the growing accessibility
of high-quality voice synthesis tools and the ease with which synthetic speech
can be distributed via platforms like social media and robocalls. Consequently,
detecting audio deepfakes is critical in combating the misuse of AI-synthesized
speech. However, real-world audio is often subject to various corruptions, such
as noise, modification, and compression, that may significantly impact detection
performance. In this work, we systematically evaluate the robustness of 10 audio
deepfake detection models against 18 common corruption types, grouped into three
categories: noise perturbation, audio modification, and compression. Using both
traditional deep learning models and state-of-the-art foundation models, our study
yields four key insights. (1) Most models demonstrate strong robustness to noise
but are notably more vulnerable to audio modifications and compression,
especially when neural codecs are applied. (2) Speech foundation models generally
outperform traditional models across most corruption scenarios, likely due to their
extensive pre-training on large-scale and diverse audio datasets. (3) Increasing
model size improves robustness, though with diminishing returns. (4) Robustness to
unseen corruptions can be enhanced by targeted data augmentation during training
or by applying speech enhancement techniques at inference time. These findings
highlight the importance of comprehensive evaluation against diverse corruption
types and of developing more robust audio deepfake detection frameworks to ensure
reliability in practical deployment settings. We further advocate that future research
in deepfake detection, across all media formats, should account for the diverse and
often unpredictable distortions common in real-world environments.
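
As a minimal sketch of the kind of corruption pipeline such a robustness evaluation requires, the Python snippet below applies two illustrative corruptions, additive white noise at a target SNR (noise perturbation) and hard clipping (audio modification), to a waveform before it would be scored by a detector. The code is not taken from the paper; function names such as add_noise_at_snr and the commented detector call are hypothetical and shown only to make the evaluation setup concrete.

    # Illustrative sketch (not the paper's code): corrupt a clean waveform
    # with additive noise at a chosen SNR and with hard clipping, then score
    # both versions with a (hypothetical) deepfake detector.
    import numpy as np

    def add_noise_at_snr(clean: np.ndarray, snr_db: float, rng=None) -> np.ndarray:
        """Add white Gaussian noise so the result has the requested SNR in dB."""
        rng = np.random.default_rng() if rng is None else rng
        noise = rng.standard_normal(clean.shape)
        clean_power = np.mean(clean ** 2)
        noise_power = np.mean(noise ** 2)
        # snr_db = 10 * log10(P_clean / (scale^2 * P_noise))  =>  solve for scale
        scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
        return clean + scale * noise

    def hard_clip(x: np.ndarray, threshold: float = 0.5) -> np.ndarray:
        """Simple 'audio modification' corruption: clip the waveform amplitude."""
        return np.clip(x, -threshold, threshold)

    if __name__ == "__main__":
        sr = 16000
        t = np.linspace(0.0, 1.0, sr, endpoint=False)
        speech = 0.8 * np.sin(2 * np.pi * 220 * t)   # stand-in for a speech clip
        corrupted = hard_clip(add_noise_at_snr(speech, snr_db=10.0))
        # A detector would then be compared on clean vs. corrupted inputs, e.g.:
        # score_clean, score_corrupted = detector(speech), detector(corrupted)
        print(corrupted.shape, float(np.max(np.abs(corrupted))))

In a full evaluation, the same detector would be scored on the clean and corrupted versions of each test clip, and the degradation in detection performance would be reported per corruption type and severity.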