Despite the great empirical success of adversarial training in defending deep learning models against adversarial perturbations, it remains unclear what principles underlie the existence of adversarial perturbations, and what adversarial training does to the neural network to remove them. In this paper, we present a principle that we call "feature purification": we show that one cause of adversarial examples is the accumulation of certain small "dense mixtures" in the hidden weights during the training process of a neural network. Moreover, we show that one goal of adversarial training is to remove such small mixtures and thereby "purify" the hidden weights, making the network (much) more robust. We present both experiments on standard vision datasets to illustrate this principle, and a theoretical result proving that, for certain natural classification tasks, training a two-layer neural network with ReLU activation using randomly initialized gradient descent indeed satisfies this principle. Our result also sheds light on why, when trained on the original data set, a neural network can learn well-generalizing but non-robust features, and how adversarial training can further robustify these features.
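To make the setup concrete, the following is a minimal sketch of the two training regimes the abstract contrasts: plain gradient descent on a randomly initialized two-layer ReLU network, followed by adversarial training that descends on perturbed inputs. The data model here is a hypothetical random linear labeling rule (not the paper's sparse-coding data model), the inner attack is a single FGSM-style signed-gradient step, and the second layer is held fixed as a simplification; all names and hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary classification task (illustrative only; the paper's theory
# uses a different, sparse-coding data model).
d, n, m = 20, 200, 64                    # input dim, samples, hidden width
w_star = rng.normal(size=d)              # hypothetical ground-truth direction
X = rng.normal(size=(n, d))
y = np.sign(X @ w_star)

# Randomly initialized two-layer ReLU network f(x) = a . relu(W x);
# only W is trained and a stays fixed, as a simplification.
W = rng.normal(size=(m, d)) / np.sqrt(d)
a = rng.normal(size=m) / np.sqrt(m)

def forward(X, W):
    H = np.maximum(X @ W.T, 0.0)         # hidden ReLU activations
    return H, H @ a                      # activations, scalar logits

def loss(out, y):
    # Numerically stable logistic loss: mean log(1 + exp(-y * out)).
    return np.logaddexp(0.0, -y * out).mean()

def grad_W(X, y, W):
    """Gradient of the logistic loss w.r.t. the hidden weights W."""
    H, out = forward(X, W)
    g_out = -y / (1.0 + np.exp(y * out)) / len(y)  # dL/d(logit)
    return (np.outer(g_out, a) * (H > 0)).T @ X    # backprop through ReLU

def perturb(X, y, W, eps=0.1):
    """FGSM-style l_inf attack: one signed gradient step on the inputs."""
    H, out = forward(X, W)
    g_out = -y / (1.0 + np.exp(y * out))
    g_X = (np.outer(g_out, a) * (H > 0)) @ W       # dL/dX
    return X + eps * np.sign(g_X)

init_loss = loss(forward(X, W)[1], y)

# Phase 1: plain gradient descent on clean data.
for _ in range(200):
    W -= 0.5 * grad_W(X, y, W)
clean_loss = loss(forward(X, W)[1], y)

# Phase 2: adversarial training -- descend on adversarially perturbed inputs.
for _ in range(200):
    W -= 0.5 * grad_W(perturb(X, y, W), y, W)
robust_loss = loss(forward(perturb(X, y, W), W)[1], y)
```

In the paper's terminology, phase 1 is where small dense mixtures can accumulate in the rows of `W`, and phase 2 is what "purifies" them; this sketch only mirrors the training procedures, not the analysis.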