Reading List

This reading list provides a general, though not exhaustive, overview of recent work in the field of interpreting deep learning. The first few papers (tutorials/reviews) provide an entry point to the field and discuss general methodological and practical challenges. The remaining papers are more specific: they discuss interpretation techniques, how to evaluate them, and how to apply them to model validation and to scientific problems.

Tutorial/review papers

Methods for DNN interpretability

Some recent work has focused on generating visually interpretable prototypes that summarize the high-level concepts learned by a DNN.
Other methods interpret individual classification decisions, e.g. in terms of the input variables of the model (a minimal sketch of this idea is given after this list). Some of these methods apply to any black-box classifier, while others assume a particular structure of the decision function.
Recent work also focuses on the interpretation and explanation of recurrent neural networks, especially in the NLP domain.
Deep neural networks can also be trained to transfer concepts from one domain to another, more interpretable, domain.
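
As an illustration of interpreting an individual decision in terms of input variables, the following is a minimal sketch using gradient-times-input in PyTorch. The model, input, and class index are placeholders; methods from the literature (e.g. LRP, integrated gradients) refine this basic idea in various ways.

```python
# A minimal sketch of a gradient-based explanation of a single decision.
# The model and input are placeholders; this is one simple instance of
# attributing a prediction to the input variables, not a specific method
# from the reading list.
import torch
import torch.nn as nn

def gradient_x_input(model: nn.Module, x: torch.Tensor, target_class: int) -> torch.Tensor:
    """Attribute one prediction to the input variables via gradient * input."""
    x = x.clone().requires_grad_(True)
    score = model(x)[0, target_class]   # scalar score of the class to be explained
    score.backward()                    # gradient of that score w.r.t. the input
    return (x.grad * x).detach()        # elementwise relevance per input variable

# Hypothetical usage with a small feed-forward classifier
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 3))
x = torch.randn(1, 10)
relevance = gradient_x_input(model, x, target_class=model(x).argmax().item())
print(relevance)
```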

Evaluating interpretability techniques

The interpretation techniques above produce interesting insights into the DNN model. But how should these techniques be compared and evaluated? This has become a crucial question, as more and more interpretation techniques are being proposed.
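
One common way to compare such techniques is a perturbation-based ("flipping") evaluation: input features are removed in order of decreasing relevance and the resulting drop in the class score is recorded, with a sharper drop suggesting a more faithful explanation. The sketch below assumes a trained classifier, an input, and a relevance map produced by some explanation method; the model, the zero baseline, and the placeholder relevance map are illustrative choices only.

```python
# A minimal sketch of a perturbation-based ("flipping") evaluation.
# Features are zeroed out in order of decreasing relevance and the class
# score is tracked; all names and the zero baseline are illustrative.
import torch
import torch.nn as nn

def flipping_curve(model: nn.Module, x: torch.Tensor, relevance: torch.Tensor,
                   target_class: int, steps: int = 10) -> list:
    order = relevance.flatten().argsort(descending=True)  # most relevant first
    x_pert = x.clone().flatten()
    scores = []
    chunk = max(1, len(order) // steps)
    for i in range(0, len(order), chunk):
        x_pert[order[i:i + chunk]] = 0.0                  # "remove" features (zero baseline)
        with torch.no_grad():
            scores.append(model(x_pert.view_as(x))[0, target_class].item())
    return scores

# Hypothetical usage, comparing an explanation against a random ordering
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 3))
x = torch.randn(1, 10)
target = model(x).argmax().item()
relevance = x * x  # placeholder relevance map; use a real explanation method in practice
print(flipping_curve(model, x, relevance, target))
print(flipping_curve(model, x, torch.rand_like(x), target))
```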

Model validation / understanding computer reasoning

If a DNN model produces high classification accuracy on test data, will it also work "outside the lab"? Will it behave in the same way as humans? Interpretability can provide an answer.
Understanding how the model works is especially important in real-world applications where an incorrect decision can be costly, such as medical diagnosis and self-driving cars.

Interpretability for the Sciences

Deep ML models combined with interpretation techniques provide a powerful tool for analyzing scientific data. They can point at previously unknown nonlinear relations in the data, which can be used to generate new scientific hypotheses that can then be tested empirically.