Deep Neural Networks: how can we interpret what they learned


Louis ten Bosch, Radboud University, Nijmegen, the Netherlands

Hugo Van hamme, University of Leuven, Belgium

Lou Boves, Radboud University, Nijmegen, the Netherlands


This session directly addresses a question that was often asked during the 2017 Interspeech conference and relates to recent papers on DNNs. Its topic bears directly on a fashionable and highly promising route in machine learning and in the modelling of complex (cognitive) processes. Any knowledge about what a DNN actually learns will help to advance the field of deep learning further.

The session addresses the academic question of how human knowledge can benefit from using DNNs as a tool for discovering structure in data.

Background and rationale

Everyone active in speech technology has witnessed the broad adoption of deep learning techniques, in particular Deep Neural Networks (DNNs). At the 2017 Interspeech conference in Stockholm we again saw an increase in the number of DNN-based methods in speech research and technology compared to previous years. DNNs are now applied to many aspects of automatic speech recognition (ASR): to estimate acoustic models, in new designs such as end-to-end ASR systems (in which lexicon, acoustic model (AM) and language model (LM) are optimized jointly in a single framework), and in computational approaches to word learning that do not rely on meta-level descriptive units. The success of DNNs strongly suggests that the parameters of a trained DNN reflect relevant structure in the training data.

Not surprisingly, many presentations of DNN-based research in Stockholm triggered discussions about how exactly knowledge is captured or represented in a DNN; in other words, whether and how a DNN can be interpreted, and whether we can learn from DNNs. It is widely believed that, for the simulation of many cognitive tasks including human speech processing, it is hopeless to rely on simple parametric models (such as linear models). This raises the question whether the classical model of 'proper science', in which knowledge is defined as provable propositions in an axiomatic framework, can be applied in fields such as artificial intelligence in the same way as in logic, mathematics and the physical sciences, where it has been, and still is, impressively powerful.


The objective of this session is to invite papers that investigate how knowledge is encoded or represented in a DNN, and how and to what extent this knowledge can be used to update and reshape our conventional knowledge of speech processing. Topics include, e.g., (phonetic or linguistic) interpretations of the representations at the layers of DNNs, context effects, speaker effects, and the use of phonetic/linguistic knowledge to guide the design and training of DNNs. Recent examples of this type of research deal with levels of abstraction (Russakovsky et al., 2015), with manifold structure (Zhu et al., 2016; Basri and Jacobs, 2017), and with representation learning (Bengio et al., 2014).

Leading questions for papers in this special session are:

  1. how does a trained DNN encapsulate the structure that exists in a data set?
  2. how can we visualize this information?
  3. how can we learn from a DNN, i.e., how can the information in a DNN be used to sharpen our insights?
  4. what type of knowledge can be encoded in a DNN?
  5. can understanding the information encoded in a DNN serve as guidance in designing and training more powerful networks?
  6. which architectures and training techniques are most amenable to interpretation?

Envisaged format

We think the best format for this two-hour special session is a combination of a few oral presentations that provide a high-level general overview, accompanied by a larger number of poster presentations that allow detailed discussion. Oral presentations are best suited to presenting broad ideas and major steps to a wide audience, while posters are best suited to narrowing down the discussion and going into detail. The session will conclude with a short (15-minute) plenary summary of the overall findings and promising future directions.

References (a selection):

Basri, R. and Jacobs, D. (2017). Efficient representation of low-dimensional manifolds using deep networks. International Conference on Learning Representations (ICLR) 2017.

Bengio, Y., Courville, A. and Vincent, P. (2014). Representation Learning: A Review and New Perspectives. arXiv:1206.5538v3.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C. and Fei-Fei, L. (2015). ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV).