A robust adversarial ensemble with causal (feature interaction) interpretations for image classification
Outlet Title
Machine Learning
Document Type
Article
Publication Date
2025
Abstract
Deep learning-based discriminative classifiers, despite their remarkable success, remain vulnerable to adversarial examples that can mislead model predictions. While adversarial training can enhance robustness, it fails to address the intrinsic vulnerability stemming from the opaque nature of these black-box models. In this paper, we present a deep ensemble model that combines discriminative features with generative models to achieve both high classification accuracy and strong adversarial robustness. Our approach integrates a bottom-level pre-trained discriminative network for feature extraction with a top-level generative classification network that models adversarial input distributions through a deep latent variable model. Using variational Bayes, our model achieves superior robustness against diverse white-box adversarial attacks without requiring adversarial training. Extensive experiments on CIFAR-10 and CIFAR-100 demonstrate our model’s superior adversarial robustness. Through evaluations using counterfactual metrics and feature interaction-based metrics, we establish correlations between model interpretability and adversarial robustness. Our architecture’s generative component is generalizable and can serve as an auxiliary network adaptable to various pre-trained discriminative models. We demonstrate this generalizability through experiments on Tiny-ImageNet with different backbone architectures, indicating the potential applicability of our approach to larger-scale classification datasets.
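As a rough illustration of the generative classification step described in the abstract, the sketch below scores features from a frozen extractor under one class-conditional density per class and picks the class via Bayes' rule. It is a minimal sketch under stated assumptions, not the authors' model: the diagonal Gaussian densities are a simple stand-in for the deep latent variable model (whose variational lower bound would take the place of `log_gaussian`), and the names `log_gaussian`, `generative_predict`, `means`, `variances`, and `log_prior` are hypothetical.

```python
import numpy as np

def log_gaussian(feats, mean, var):
    """Log density of a diagonal Gaussian, evaluated for each row of feats."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (feats - mean) ** 2 / var, axis=1)

def generative_predict(feats, means, variances, log_prior):
    """Bayes-rule classification: argmax over y of log p(feats | y) + log p(y).

    ASSUMPTION: log p(feats | y) is a diagonal Gaussian here; in a deep latent
    variable model it would be replaced by a variational lower bound (ELBO).
    """
    class_scores = np.stack(
        [log_gaussian(feats, m, v) for m, v in zip(means, variances)], axis=1
    )
    return (class_scores + log_prior).argmax(axis=1)

# Toy data: two classes in a 2-D feature space (hypothetical values standing in
# for the output of a pre-trained discriminative backbone).
means = np.array([[0.0, 0.0], [3.0, 3.0]])
variances = np.ones((2, 2))
log_prior = np.log(np.array([0.5, 0.5]))
feats = np.array([[0.2, -0.1], [2.8, 3.1]])

preds = generative_predict(feats, means, variances, log_prior)
```

Because the decision depends on how well each class explains the features rather than on a learned decision boundary alone, this kind of classifier can sit on top of any fixed feature extractor, which is the sense in which the abstract calls the generative component an auxiliary network.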
Recommended Citation
Zeng, Chunheng; Pisu, Pierluigi; Comert, Gurcan; Begashaw, Negash; Vaidyan, Varghese; and Hubig, Nina, "A robust adversarial ensemble with causal (feature interaction) interpretations for image classification" (2025). Research & Publications. 146.
https://scholar.dsu.edu/ccspapers/146