A robust adversarial ensemble with causal (feature interaction) interpretations for image classification

Outlet Title

Machine Learning

Document Type

Article

Publication Date

2025

Abstract

Deep learning-based discriminative classifiers, despite their remarkable success, remain vulnerable to adversarial examples that can mislead model predictions. While adversarial training can enhance robustness, it fails to address the intrinsic vulnerability stemming from the opaque nature of these black-box models. In this paper, we present a deep ensemble model that combines discriminative features with generative models to achieve both high classification accuracy and strong adversarial robustness. Our approach integrates a bottom-level pre-trained discriminative network for feature extraction with a top-level generative classification network that models adversarial input distributions through a deep latent variable model. Using variational Bayes, our model achieves superior robustness against diverse white-box adversarial attacks without requiring adversarial training. Extensive experiments on CIFAR-10 and CIFAR-100 demonstrate our model’s superior adversarial robustness. Through evaluations using counterfactual metrics and feature interaction-based metrics, we establish correlations between model interpretability and adversarial robustness. Our architecture’s generative component is generalizable and can serve as an auxiliary network adaptable to various pre-trained discriminative models. We demonstrate this generalizability through experiments on Tiny-ImageNet with different backbone architectures, indicating the potential applicability of our approach to larger-scale classification datasets.
