Generating Deep Networks Explanations with Robust Attribution Alignment

Guohang Zeng (The University of Melbourne)*; Yousef Kowsar (The University of Melbourne); Sarah Erfani (University of Melbourne); James Bailey (THE UNIVERSITY OF MELBOURNE)


Attribution methods play a key role in generating post-hoc explanations on pre-trained models, however it has been shown that existing methods yield unfaithful and noisy explanations. In this paper, we propose a new paradigm of attribution method: we treat the model’s explanations as a part of network’s outputs then generate attribution maps from the underlying deep network. The generated attribution maps are up-sampled from the last convolutional layer of the network to obtain localization information about the target to be explained. Inspired by recent studies that showed adversarially robust models’ saliency map aligns well with human perception, we utilize attribution maps from the robust model to supervise the learned attributions. Our proposed method can produce visually plausible explanations along with the prediction in inference phase. Experiments on real datasets show that our proposed method yields more faithful explanations than post-hoc attribution methods with lighter computational costs.