May 2025
Adversarial attacks pose a significant threat to the robustness of neural networks. Recent studies have shown that imperceptible alterations of the input can fool trained models into yielding false predictions. However, the perturbations underlying adversarial attacks are usually hard to interpret. This paper proposes a novel interpretable adversarial attack based on a recently introduced explainability technique known as oriented, modified integrated gradients. We demonstrate the effectiveness of the proposed method on popular image classification datasets and compare it with state-of-the-art adversarial attacks. In addition, we analyze the visual similarity between the original images and the corresponding adversarial perturbations, both qualitatively and quantitatively, and show that the proposed approach generates adversarial perturbations with significantly improved interpretability.
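To make the idea concrete, the sketch below shows one way an attribution map can guide an adversarial perturbation. It is a minimal illustration, not the paper's method: standard integrated gradients stand in for the oriented, modified variant, the guidance rule is a simple FGSM-style step modulated by the attribution magnitude, and `model`, `x`, and `label` (a single image in [0, 1] with its true class) are assumed inputs.

```python
import torch
import torch.nn.functional as F

def integrated_gradients(model, x, label, baseline=None, steps=32):
    """Approximate integrated gradients of the class score w.r.t. x
    along a straight path from the baseline to the input."""
    baseline = torch.zeros_like(x) if baseline is None else baseline
    total_grad = torch.zeros_like(x)
    for alpha in torch.linspace(0.0, 1.0, steps):
        point = (baseline + alpha * (x - baseline)).requires_grad_(True)
        score = model(point)[0, label]          # score of the target class
        grad, = torch.autograd.grad(score, point)
        total_grad += grad
    return (x - baseline) * total_grad / steps  # attribution per input pixel

def attribution_guided_attack(model, x, label, eps=8 / 255, steps=32):
    """FGSM-style perturbation whose per-pixel magnitude is scaled by the
    (normalised) attribution map, so the change concentrates on regions
    the explanation marks as relevant to the prediction."""
    attr = integrated_gradients(model, x, label, steps=steps)
    mask = attr.abs() / (attr.abs().max() + 1e-12)      # scale to [0, 1]
    x_adv = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), torch.tensor([label]))
    grad, = torch.autograd.grad(loss, x_adv)
    # Assumes images in [0, 1]; clamp keeps the result a valid image.
    return (x + eps * mask * grad.sign()).clamp(0.0, 1.0).detach()
```

Restricting the perturbation budget to highly attributed pixels is the design choice that ties the attack to the explanation: the resulting noise tends to trace object-relevant regions rather than spreading uniformly over the image, which is what makes such perturbations easier to interpret.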