[Paper-AST] SkinAugment: Auto-Encoding Speaker Conversions For Automatic Speech Translation

4 minute read


Last Updated: 2020-04-03

This paper, SkinAugment: Auto-Encoding Speaker Conversions For Automatic Speech Translation, is by researchers from JHU and Facebook and was published at ICASSP 2020.

Code: Not available.

1. Introduction

In the Introduction, the authors state that their goal is to propose a data augmentation technique for low-resource AST.

AST methods to leverage ASR and MT data include pretraining [4], multitask learning [5] and weakly supervised data augmentation [6, 7].

The existing SpecAugment [3] modifies the spectrogram with time warping, frequency masking, and time masking. SkinAugment takes a different route: it generates additional audio samples without requiring transcripts, using a recent neural voice conversion technique, "text-to-speech skins" [8].
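For contrast, the masking half of SpecAugment can be sketched in a few lines. This is a simplified illustration (time warping omitted); the mask-width parameters and their defaults here are illustrative, not the paper's settings.

```python
import numpy as np

def spec_augment(spec, num_freq_masks=2, freq_mask_width=8,
                 num_time_masks=2, time_mask_width=20, rng=None):
    """Simplified SpecAugment-style masking on a (time, freq) spectrogram.

    Zeroes out random frequency bands and time spans; time warping
    from the original paper [3] is omitted for brevity.
    """
    rng = rng or np.random.default_rng()
    out = spec.copy()
    n_time, n_freq = out.shape
    for _ in range(num_freq_masks):        # mask random frequency bands
        w = rng.integers(0, freq_mask_width + 1)
        f0 = rng.integers(0, max(1, n_freq - w + 1))
        out[:, f0:f0 + w] = 0.0
    for _ in range(num_time_masks):        # mask random time spans
        w = rng.integers(0, time_mask_width + 1)
        t0 = rng.integers(0, max(1, n_time - w + 1))
        out[t0:t0 + w, :] = 0.0
    return out
```

The key contrast with SkinAugment is that these perturbations act on extracted features, whereas voice conversion produces new waveforms.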

2. SkinAugment: Augmentation with voice conversion by conditioned autoencoding

The structure of the core speaker conversion technique is shown in the figure below:


The authors use this method (TTS Skins [8]) because it shows competitive results on the Voice Conversion Challenge 2018 benchmark [10]; other voice conversion models could also be employed as the submodule here.
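The conversion idea above can be sketched as a conditioned autoencoder: a content encoder (standing in for the pretrained recognition network in TTS Skins [8]) extracts speaker-independent features, and a decoder conditioned on a target-speaker embedding re-synthesizes the frames in that voice. All dimensions and layer choices below are toy values for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class SkinConverter(nn.Module):
    """Toy sketch of a TTS-Skins-style conversion pipeline.

    Encodes input frames into content features, then decodes them
    conditioned on an embedding of the target speaker ("skin").
    """
    def __init__(self, n_mels=40, n_speakers=200, hidden=64, spk_dim=16):
        super().__init__()
        self.content_encoder = nn.GRU(n_mels, hidden, batch_first=True)
        self.speaker_embed = nn.Embedding(n_speakers, spk_dim)
        self.decoder = nn.GRU(hidden + spk_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_mels)

    def forward(self, frames, target_speaker):
        # frames: (batch, time, n_mels); target_speaker: (batch,)
        content, _ = self.content_encoder(frames)
        spk = self.speaker_embed(target_speaker)           # (batch, spk_dim)
        spk = spk.unsqueeze(1).expand(-1, frames.size(1), -1)
        dec, _ = self.decoder(torch.cat([content, spk], dim=-1))
        return self.out(dec)                               # converted frames
```

Swapping the `target_speaker` index while keeping the content features fixed is what yields the additional "voices" used for augmentation.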

3. Experiments

3.1. Datasets and Evaluation

Two standard AST datasets are used in the experiments:

  1. AST LibriSpeech [12] (English–French; the authors use the same setup as [13]). This dataset is also used for low-resource ASR.
  2. MuST-C (English-Romanian; 432 hours) [14].

The voice conversion model is trained on the 200+ voices in LibriSpeech. Therefore, dataset 1 (LibriSpeech) is considered in-domain audio, while dataset 2 (MuST-C) is considered out-of-domain audio.

In later experiments, the authors further augment the training data by translating LibriSpeech's transcripts (removing test set occurrences [7]) with an MT system. The MT system is trained on two standard datasets: WMT16 for En–Ro (600k sentence pairs) and WMT14 for En–Fr (29 million sentence pairs).

Evaluation: BLEU on tokenized output.

3.2. Model Architecture

All experiments use the same mixed convolutional-recurrent end-to-end architecture for conditional sequence generation, since the focus is on data augmentation techniques.

The authors use a speech encoder consisting of two non-linear layers followed by two convolutional layers and three bidirectional LSTM layers, along with a custom LSTM decoder [13, 7]. The encoder takes 40 log-scaled mel spectrogram features as input.

The authors use 3 decoder layers as in [7], which reports the number of parameters in each model.
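The encoder described above can be sketched as follows. Hidden sizes, strides, channel counts, and activations are guesses for illustration; see [13, 7] for the actual configuration.

```python
import torch
import torch.nn as nn

class SpeechEncoder(nn.Module):
    """Sketch of a convolutional-recurrent speech encoder: two
    non-linear layers, two conv layers, and three bidirectional LSTM
    layers over 40 log-mel features. Illustrative dimensions only.
    """
    def __init__(self, n_mels=40, hidden=256):
        super().__init__()
        self.pre = nn.Sequential(                # two non-linear layers
            nn.Linear(n_mels, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.conv = nn.Sequential(               # two conv layers, 4x time reduction
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        conv_out = 16 * ((hidden + 3) // 4)      # channels * reduced feature dim
        self.rnn = nn.LSTM(conv_out, hidden, num_layers=3,
                           bidirectional=True, batch_first=True)

    def forward(self, mels):
        # mels: (batch, time, n_mels)
        x = self.pre(mels).unsqueeze(1)          # (batch, 1, time, hidden)
        x = self.conv(x)                         # (batch, 16, time/4, hidden/4)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        out, _ = self.rnn(x)                     # (batch, time/4, 2*hidden)
        return out
```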

3.3. Baselines

3.3.1. Cascade

An ASR model followed by a Transformer MT model, trained on the WMT14 En–Fr parallel data.

3.3.2. SpecAugment

SpecAugment adds perturbations at the feature level, whereas SkinAugment operates on the raw waveform. The authors use the LibriSpeech double setting [3].

3.3.3. SpecAugment-p

Further, the authors introduce a simple but effective variant, SpecAugment-p, which applies SpecAugment to each batch with probability p (standard SpecAugment thus uses p = 1). They found SpecAugment with p = 0.5 to be effective in their setup.
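The variant is a thin probabilistic wrapper around any SpecAugment implementation; a minimal sketch (function names are illustrative):

```python
import random

def specaugment_p(batch, augment_fn, p=0.5, rng=random):
    """SpecAugment-p: apply SpecAugment to a whole batch with
    probability p. Setting p = 1 recovers standard SpecAugment.

    `augment_fn` stands in for any per-example SpecAugment routine.
    """
    if rng.random() < p:                       # augment the entire batch
        return [augment_fn(x) for x in batch]
    return batch                               # or leave it untouched
```

Because the coin flip is per batch rather than per example, half of all batches (at p = 0.5) are seen completely clean during training.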

3.3.4. Augmentation Settings

The authors perform conditional generation of new data with either 8 or 16 voice conversions, applying them to 10%, 25%, 50%, and 100% of the training corpus. This creates transformed variants of the dataset in distinct (arbitrarily selected) voices.

These settings are compared to standard SpecAugment, as well as to SpecAugment-p with p = 0.5.

3.3.5. Machine-Translation Augmentation

Machine-translated transcripts of large ASR corpora dramatically increase the performance of AST models [6, 7]. The authors therefore translate LibriSpeech transcripts with their Transformer MT system and concatenate the synthetic training instances to the AST data. They apply 16 skins to 25% of the AST training data, which they found to perform best.
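The concatenation step can be sketched as below; `translate` stands in for the trained Transformer MT system, and the test-set filter follows the deduplication mentioned in 3.1 [7].

```python
def mt_augment(ast_data, asr_data, translate, test_sentences):
    """Sketch of MT augmentation: machine-translate ASR transcripts
    (skipping any that occur in the test set) and concatenate the
    synthetic (audio, translation) pairs to the real AST data.
    """
    synthetic = [(audio, translate(text))
                 for audio, text in asr_data
                 if text not in test_sentences]    # avoid test-set leakage
    return ast_data + synthetic
```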

4. Results

The results for 8 or 16 voice conversions applied to 10%, 25%, 50%, and 100% of the training corpus are shown below:

The ablation study examines the effect of incorporating:

  1. speaker conversions
  2. translated transcripts ("+ MT")
  3. both


Note: AST LibriSpeech-AT is the original corpus released by [12], while AST LibriSpeech is the version proposed by [13], created by adding off-the-shelf automatic translations.

5. Conclusion

  1. While this method relies on additional audio data to train the speaker conversion model, it does not rely on transcribed text.
  2. Speaker conversion data can be effectively combined with MT-augmented ASR data.