This reading list provides a general, though not exhaustive, overview of recent work in the field of interpreting deep learning. The first few papers (tutorials/reviews) provide an entry point to the field and discuss general methodological and practical challenges. The subsequent papers are more specific: they cover interpretation techniques, how to evaluate them, and how to apply these techniques to model validation and to scientific problems.
Other methods interpret individual classification decisions, e.g. in terms of the model's input variables. Some of these methods apply to any black-box classifier, while others assume a particular structure of the decision function:
The interpretation techniques above produce interesting insights into DNN models. But how should these techniques be compared and evaluated? This has become a crucial question as more and more interpretation techniques are proposed.
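For concreteness, one evaluation idea that recurs in this literature is perturbation analysis (also known as "pixel-flipping"): remove the input features an explanation marks as most relevant and measure how quickly the model's prediction degrades. Below is a minimal, illustrative sketch of such a deletion curve; the `predict` function, the `attribution` scores, and the `baseline` value are placeholders, not a reference implementation of any particular paper.

```python
import numpy as np

def deletion_curve(predict, x, attribution, baseline=0.0, steps=20):
    """Track the model score while the most relevant features are removed.

    predict     -- callable mapping a flat input array to a scalar score (placeholder)
    x           -- input to be explained (any array shape)
    attribution -- relevance score per input feature, same size as x
    baseline    -- value used to "remove" a feature (e.g. 0 or a data mean)
    """
    x = x.copy().ravel()
    order = np.argsort(attribution.ravel())[::-1]  # most relevant features first
    scores = [predict(x)]
    chunk = max(1, len(order) // steps)
    for start in range(0, len(order), chunk):
        x[order[start:start + chunk]] = baseline   # flip this batch of features
        scores.append(predict(x))
    return np.array(scores)

# A faster drop of the curve suggests a more faithful explanation,
# which gives a simple quantitative basis for comparing techniques.
```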
Model validation / understanding computer reasoning
If a DNN model achieves high classification accuracy on test data, will it also work "outside the lab"? Will it behave in the same way as humans do? Interpretability can help answer these questions.
Understanding how the model works is especially important in real-world applications, where an incorrect decision can be costly. Examples include medical diagnosis and self-driving cars.
Deep ML models combined with interpretation techniques provide a powerful tool for analyzing scientific data. They can point to previously unknown nonlinear relations in the data, which can be used to generate new scientific hypotheses that can then be tested empirically.