Examples - Reliable Visualization for Deep Speaker Recognition

Pengqi Li, Lantian Li, Askar Hamdulla, Dong Wang

Abstract: In spite of the impressive success of convolutional neural networks (CNNs) in speaker recognition, our understanding of CNNs' internal functions is still limited. A major obstacle is that some popular visualization tools are difficult to apply, for example those producing saliency maps. The reason is that speaker information does not show clear spatial patterns in the temporal-frequency space, which makes it hard to interpret the visualization results, and hence hard to confirm the reliability of a visualization tool. In this paper, we conduct an extensive analysis of three popular CAM-based visualization methods, Grad-CAM, Score-CAM and Layer-CAM, to investigate their reliability for speaker recognition tasks. Experiments conducted on a state-of-the-art ResNet34SE model show that the Layer-CAM algorithm can produce reliable visualization, and thus can be used as a promising tool to explain CNN-based speaker models. The source code and examples are available on our project page: http://project.cslt.org/.

Source: paper, code



More Examples

On this page, we use more examples to show the behavior of the three CAM algorithms (Grad-CAM++, Score-CAM, Layer-CAM), and we also show the saliency maps from different layers. Note that S1∼S4 denote ResNetBlock1∼ResNetBlock4, and ‘+’ denotes the element-wise average of multiple saliency maps.
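For concreteness, here is a minimal PyTorch sketch of how a Layer-CAM map can be computed at each block and how the per-layer maps can be fused by element-wise averaging. The names `model`, `blocks`, `spec` (a (1, 1, F, T) log-Mel spectrogram), and `speaker_id` are illustrative assumptions, not identifiers from our released code.

```python
import torch
import torch.nn.functional as F

def layer_cam(feat, grad, out_size):
    """Layer-CAM: location-wise ReLU'd gradients weight the activations."""
    cam = F.relu((grad.clamp(min=0) * feat).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=out_size, mode="bilinear", align_corners=False)
    # Normalize to [0, 1] so maps from different layers are comparable
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)

feats, grads = {}, {}

def hook(name):
    def fn(module, inputs, output):
        feats[name] = output
        output.register_hook(lambda g: grads.__setitem__(name, g))
    return fn

# S1..S4 correspond to ResNetBlock1..ResNetBlock4 of the speaker model
for name, block in zip(["S1", "S2", "S3", "S4"], blocks):
    block.register_forward_hook(hook(name))

logits = model(spec)                      # spec: (1, 1, F, T) log-Mel input
logits[0, speaker_id].backward()          # gradients w.r.t. the target speaker

maps = [layer_cam(feats[n], grads[n], spec.shape[-2:])
        for n in ["S1", "S2", "S3", "S4"]]
fused = torch.stack(maps).mean(dim=0)     # 'S4+S3+S2+S1': element-wise average
```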



Single-Speaker Examples

In this section, we show the behavior of the three CAM algorithms using single-speaker utterances, i.e., only a single target speaker exists in an utterance.

    1. Example 1:
    2. Example 2:
    3. Example 3:

Three examples of single-speaker utterances are shown above. It can be observed that all the saliency maps clearly separate speech from non-speech segments, demonstrating the basic capability of the CAM algorithms.
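One simple way to check this separation numerically is to reduce the saliency map to a per-frame score and compare speech and non-speech frames. In this sketch, `fused` is a (1, 1, F, T) saliency map (e.g., from the sketch above) and `vad_mask` is a hypothetical (T,) 0/1 voice-activity label over frames.

```python
import torch

frame_saliency = fused.squeeze().mean(dim=0)   # average over frequency: (T,)
speech = frame_saliency[vad_mask == 1].mean()
silence = frame_saliency[vad_mask == 0].mean()
print(f"mean saliency: speech {speech:.3f} vs non-speech {silence:.3f}")
```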



Multi-Speaker Examples

In this section, we show the behavior of the three CAM algorithms using multi-speaker utterances. In the multi-speaker experiment, we concatenate an utterance of the target speaker with one or two utterances of other, interfering speakers, and then draw the saliency maps. Note that A denotes the target speaker, while B and C denote interfering speakers.
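Such test utterances can be built by simple waveform concatenation, as in the following sketch. The file names are hypothetical; all files are assumed to share one sample rate.

```python
import torch
import torchaudio

wav_a, sr = torchaudio.load("speakerA.wav")   # A: target speaker
wav_b, _ = torchaudio.load("speakerB.wav")    # B: interfering speaker
wav_c, _ = torchaudio.load("speakerC.wav")    # C: interfering speaker

patterns = {
    "A-B":   [wav_a, wav_b],
    "B-A-B": [wav_b, wav_a, wav_b],
    "A-B-C": [wav_a, wav_b, wav_c],
    "A-B-A": [wav_a, wav_b, wav_a],
}
# Concatenate along the time axis to form the multi-speaker test utterances
tests = {form: torch.cat(parts, dim=-1) for form, parts in patterns.items()}
```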

      Examples 1–12 (figures): each example shows the saliency maps produced at S1, S2, S3, and S4, together with their element-wise average (S4+S3+S2+S1).
More examples of the A-B, B-A-B, A-B-C, and A-B-A forms are shown above. It can be observed that Layer-CAM shows surprisingly good performance: it can accurately locate the segments of the target speaker and mask non-target speakers almost perfectly. In comparison, Grad-CAM++ and Score-CAM are very weak at detecting non-target speakers.
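One simple way to quantify this localization behavior is to measure how much of the saliency mass falls on the target speaker's segment. In this sketch, `fused` is the saliency map for a concatenated A-B utterance and `boundary` is the frame index where speaker B starts; both are hypothetical, not from our released code.

```python
import torch

frame_saliency = fused.squeeze().mean(dim=0)   # (T,) per-frame saliency
on_target = frame_saliency[:boundary].sum()
on_interferer = frame_saliency[boundary:].sum()
print(f"saliency on target: {on_target / (on_target + on_interferer):.2%}")
```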

