[Paper-PreTrain] Wav2Vec: Unsupervised Pre-Training For Speech Recognition


Last Updated: 2020-04-29

This post covers the paper Wav2Vec: Unsupervised Pre-Training For Speech Recognition, written by researchers from Facebook.

Code: https://github.com/pytorch/fairseq/tree/master/examples/wav2vec

1. Introduction

In the paper, the authors apply unsupervised pre-training to improve supervised ASR.

The proposed model Wav2Vec is a convolutional neural network that takes raw audio as input and computes a general representation that can be input to a speech recognition system.

The objective is a contrastive loss that requires distinguishing a true future audio sample from negatives (Collobert et al., 2011; Mikolov et al., 2013; van den Oord et al., 2018).

Unlike previous work (CPC by van den Oord et al., 2018), the authors move beyond frame-wise phoneme classification and apply the learned representations to improve strong supervised ASR systems.

Experimental results on the WSJ benchmark demonstrate that pre-trained representations estimated on about 1,000 hours of unlabeled speech can substantially improve a character-based ASR system and outperform the best character-based result in the literature, Deep Speech 2, improving WER from 3.1 % to 2.43 %. On TIMIT, pre-training enables us to match the best reported result in the literature. In a simulated low-resource setup with only eight hours of transcribed audio data, wav2vec reduces WER by up to 36 % compared to a baseline model that relies on labeled data only (§3, §4).

2. Pretraining Approach

Given an audio signal as input, we optimize our model (§2.1) to predict future samples from a given signal context. A common problem with these approaches is the requirement to accurately model the data distribution p(x), which is challenging. We avoid this problem by first encoding raw speech samples x into a feature representation z at a lower temporal frequency and then implicitly modeling a density ratio p(z_{i+k}|z_i…z_{i−r})/p(z_{i+k}), similar to CPC by van den Oord et al. (2018).

2.1. Model

The overall structure is very similar to the Contrastive Predictive Coding model introduced in the last post.


Detailed settings:

Given raw audio samples x_i ∈ X, we apply the encoder network f: X → Z parameterized as a five-layer convolutional network (van den Oord et al., 2018). Alternatively, one could use other architectures such as the trainable frontend of Zeghidour et al. (2018a). The encoder layers have kernel sizes (10,8,4,4,4) and strides (5,4,2,2,2). The output of the encoder is a low-frequency feature representation z_i ∈ Z which encodes about 30 ms of 16 kHz audio, and the striding results in one representation z_i every 10 ms.
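The 30 ms and 10 ms figures follow from the kernel sizes and strides. The sketch below is a minimal check of that arithmetic, assuming plain (undilated) convolutions; the helper name `receptive_field_and_stride` is illustrative and not taken from the paper or from fairseq.

```python
def receptive_field_and_stride(kernels, strides):
    """Return (receptive field, total stride) of a conv stack, in input samples."""
    rf, jump = 1, 1
    for k, s in zip(kernels, strides):
        rf += (k - 1) * jump  # a kernel of size k spans (k - 1) extra steps of the layer below
        jump *= s             # spacing, in raw samples, between neighbouring outputs
    return rf, jump


SAMPLE_RATE = 16_000  # Hz, as stated in the text

rf, hop = receptive_field_and_stride((10, 8, 4, 4, 4), (5, 4, 2, 2, 2))
print(rf, rf / SAMPLE_RATE * 1000)    # 465 samples, roughly 29 ms
print(hop, hop / SAMPLE_RATE * 1000)  # 160 samples, i.e. 10 ms
```

Each z_i therefore sees 465 raw samples (about 29 ms), and consecutive z_i are 160 samples (10 ms) apart.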

Next, we apply the context network g : Z → C to the output of the encoder network to mix multiple latent representations z_i…z_{i−v} into a single contextualized tensor c_i=g(z_i…z_{i−v}) for a receptive field size v. The context network has nine layers with kernel size three and stride one. The total receptive field of the context network is about 210 ms. The layers in both the encoder and context networks consist of a causal convolution with 512 channels, a group normalization layer and a ReLU nonlinearity. We normalize across both the feature and temporal dimensions for each sample, which is equivalent to group normalization with a single normalization group (Wu & He, 2018). We found it important to choose a normalization scheme that is invariant to the scaling and the offset of the input. This choice resulted in representations that generalize well across datasets.
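The 210 ms figure can be checked by continuing the receptive-field arithmetic through the nine context layers. This is a minimal sketch, assuming the encoder kernel sizes and strides quoted in the detailed settings and plain (undilated) convolutions; the helper name is illustrative.

```python
def receptive_field_and_stride(kernels, strides, rf=1, jump=1):
    """Extend a receptive field (in raw samples) through a stack of conv layers."""
    for k, s in zip(kernels, strides):
        rf += (k - 1) * jump
        jump *= s
    return rf, jump


SAMPLE_RATE = 16_000  # Hz

# Encoder: five layers, as listed in the detailed settings.
enc_rf, enc_hop = receptive_field_and_stride((10, 8, 4, 4, 4), (5, 4, 2, 2, 2))

# Context network: nine layers, kernel size 3, stride 1, stacked on the encoder.
total_rf, total_hop = receptive_field_and_stride(
    (3,) * 9, (1,) * 9, rf=enc_rf, jump=enc_hop
)

context_frames = 1 + 9 * (3 - 1)  # number of z_i frames mixed into one c_i
print(context_frames)                       # 19 frames
print(total_rf / SAMPLE_RATE * 1000)        # about 209 ms of raw audio
```

Under these assumptions each c_i covers 19 encoder frames, or 3,345 raw samples, which is consistent with the reported value of about 210 ms.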

2.2. Objective

The objective is to minimize the contrastive loss for each step k = 1, …, K:

\[L_k=-\sum_{i=1}^{T-k}\left(\log\sigma(z_{i+k}^\top h_k(c_i)) + \lambda\,\mathbb{E}_{\tilde{z}\sim p_n}\left[\log\sigma(-\tilde{z}^\top h_k(c_i))\right]\right)\]

where \(\sigma(x) = \frac{1}{1+e^{-x}}\), and \(\sigma(z_{i+k}^\top h_k(c_i))\) is the probability of z_{i+k} being the true sample. \(h_k(c_i)=W_k c_i + b_k\) is a step-specific affine transformation for each step k.

The loss is the sum over different step sizes:

\[L = \sum_{k=1}^KL_k\]

In practice, we approximate the expectation by sampling ten negative examples, choosing distractors uniformly from each audio sequence, i.e.,

\[p_n(z)=\frac{1}{T},\]

where T is the sequence length and we set λ to the number of negatives. (Similar to van den Oord et al. (2018), we found that sampling negatives from different sequences and speakers yields inferior results.)
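To make the objective concrete, here is a minimal pure-Python sketch of L_k for a single sequence, with the expectation replaced by an average over uniformly sampled distractors. It is an illustration, not the fairseq implementation: the function names, the list-of-lists layout, and the choice not to exclude the positive frame from the distractors are assumptions.

```python
import math
import random


def log_sigmoid(x):
    """Numerically stable log(sigma(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def affine(W, b, c):
    """h_k(c) = W_k c + b_k, with W given as a list of rows."""
    return [dot(row, c) + bias for row, bias in zip(W, b)]


def step_loss(z, c, W, b, k, num_negatives=10, rng=random):
    """L_k for one sequence; z and c are lists of T vectors of equal size."""
    T = len(z)
    lam = num_negatives  # lambda is set to the number of negatives
    total = 0.0
    for i in range(T - k):  # 0-based version of i = 1 .. T - k
        h = affine(W, b, c[i])
        positive = log_sigmoid(dot(z[i + k], h))
        # Monte Carlo estimate of the expectation: distractors are drawn
        # uniformly from the same sequence (simplification: may hit z[i + k]).
        negatives = [z[rng.randrange(T)] for _ in range(num_negatives)]
        expectation = sum(log_sigmoid(-dot(n, h)) for n in negatives) / num_negatives
        total += positive + lam * expectation
    return -total


# Toy usage with random vectors; the sizes are arbitrary.
rng = random.Random(0)
dim, T, K = 4, 20, 3
z = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(T)]
c = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(T)]
W = [[[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(dim)] for _ in range(K)]
b = [[0.0] * dim for _ in range(K)]

# L is the sum of L_k over the step sizes k = 1 .. K.
loss = sum(step_loss(z, c, W[k - 1], b[k - 1], k, rng=rng) for k in range(1, K + 1))
print(loss)
```

Because λ equals the number of negatives, multiplying the sample average by λ amounts to summing the log-sigmoid terms over the sampled distractors.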


After training, we input the representations c_i produced by the context network to the acoustic model instead of log-mel filterbank features.

3. Experimental Setup

3.1. Data

We consider the following corpora:

For phoneme recognition on TIMIT (Garofolo et al., 1993b) we use the standard train, dev and test split where the training data contains just over three hours of audio data.

Wall Street Journal (WSJ; Garofolo et al. (1993a); Woodland et al. (1994)) comprises about 81 hours of transcribed audio data. We train on si284, validate on nov93dev and test on nov92. Librispeech (Panayotov et al., 2015) contains a total of 960 hours of clean and noisy speech for training.

For pre-training, we use either the full 81 hours of the WSJ corpus, an 80 hour subset of clean Librispeech, the full 960 hour Librispeech training set or a combination of all of them. To train the baseline acoustic model we compute 80 log-mel filterbank coefficients for a 25 ms sliding window with stride 10 ms.

Evaluation metrics: word error rate (WER) and letter error rate (LER).
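Both metrics are edit distances normalized by the reference length, computed over words for WER and over characters for LER. The sketch below is a minimal single-utterance version; counting spaces as characters in LER is an assumption, and corpus-level scores are normally obtained by summing errors and reference lengths over all utterances.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (or match, cost 0)
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate for a single utterance."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def ler(reference, hypothesis):
    """Letter error rate for a single utterance (spaces counted as letters)."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


print(wer("the cat sat", "the cat sit"))  # one wrong word out of three
print(ler("the cat sat", "the cat sit"))  # one wrong character out of eleven
```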

3.2. Acoustic Models

We use the wav2letter++ toolkit for training and evaluation of acoustic models (Pratap et al., 2018).


For the TIMIT task, we follow the character-based wav2letter++ setup of Zeghidour et al. (2018a), which uses seven consecutive blocks of convolutions (kernel size 5 with 1,000 channels), followed by a PReLU nonlinearity and a dropout rate of 0.7. The final representation is projected to a 39-dimensional phoneme probability. The model is trained using the Auto Segmentation Criterion (ASG; Collobert et al., 2016) using SGD with momentum.


Our baseline for the WSJ benchmark is the wav2letter++ setup described by Collobert et al. (2019) which is a 17 layer model with gated convolutions (Dauphin et al., 2017). The model predicts probabilities for 31 graphemes, including the standard English alphabet, the apostrophe and period, two repetition characters (e.g. the word ann is transcribed as an1), and a silence token (|) used as word boundary.
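The repetition characters can be illustrated with a small encoder and decoder. This is a hypothetical sketch, not the wav2letter++ implementation: it assumes that the labels `1` and `2` stand for one or two extra copies of the preceding letter, which is consistent with the `ann` → `an1` example above but is otherwise an assumption, as is the handling of runs longer than three letters.

```python
def encode_repetitions(word, max_rep=2):
    """Replace repeated letters by repetition labels, e.g. 'ann' -> 'an1'."""
    out = []
    i = 0
    while i < len(word):
        j = i
        while j < len(word) and word[j] == word[i]:
            j += 1
        run = j - i  # length of the run of identical letters starting at i
        while run > 0:
            out.append(word[i])
            extra = min(run - 1, max_rep)  # copies folded into one label
            if extra:
                out.append(str(extra))
            run -= 1 + extra
        i = j
    return "".join(out)


def decode_repetitions(tokens):
    """Inverse of encode_repetitions for words that contain no digits."""
    out = []
    for ch in tokens:
        if ch.isdigit():
            out.extend(out[-1] * int(ch))  # repeat the previous letter
        else:
            out.append(ch)
    return "".join(out)


print(encode_repetitions("ann"))      # an1
print(encode_repetitions("hello"))    # hel1o
print(decode_repetitions("hel1o"))    # hello
```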

Training details:

All acoustic models are trained on 8 NVIDIA V100 GPUs using the distributed training implementations of fairseq and wav2letter++. When training acoustic models on WSJ, we use plain SGD with learning rate 5.6 as well as gradient clipping (Collobert et al., 2019) and train for 1,000 epochs with a total batch size of 64 audio sequences. We use early stopping and choose models based on validation WER after evaluating checkpoints with a 4-gram language model. For TIMIT we use learning rate 0.12, momentum 0.9 and we train for 1,000 epochs on 8 GPUs with a batch size of 16 audio sequences.

3.3. Decoding

Language model:

For decoding the emissions from the acoustic model we use a lexicon as well as a separate language model trained on the WSJ language modeling data only. We consider a 4-gram KenLM language model (Heafield et al., 2013), a word-based convolutional language model (Collobert et al., 2019), and a character based convolutional language model (Likhomanenko et al., 2019).

Decoding process:

We decode the word sequence y from the output of the context network c or log-mel filterbanks using the beam search decoder of Collobert et al. (2019) by maximizing

\[\max_y f_{AM}(y|C)+\alpha \log p_{LM}(y)+\beta|y|-\gamma \sum_{i=1}^T[\pi_i='|']\]

where f_{AM} is the acoustic model, p_{LM} is the language model, π=π_1,…,π_L are the characters of y.

Hyper-parameters α, β and γ are weights for the language model, the word penalty, and the silence penalty.
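As a worked example of the decoding objective, the sketch below scores a few candidate transcriptions and picks the maximum. The acoustic and language model scores are made-up numbers, and the function and argument names are hypothetical; a real beam search decoder builds and prunes hypotheses incrementally rather than scoring a fixed list.

```python
def decoder_score(am_score, lm_log_prob, words, pi, alpha, beta, gamma, silence="|"):
    """f_AM + alpha * log p_LM + beta * |y| - gamma * (number of silence tokens)."""
    num_silence = sum(1 for token in pi if token == silence)
    return am_score + alpha * lm_log_prob + beta * len(words) - gamma * num_silence


# Hypothetical candidates: (acoustic score, LM log-probability, words, characters).
candidates = [
    (-12.0, -4.0, ["the", "cat"], list("the|cat|")),
    (-11.5, -9.0, ["the", "cap"], list("the|cap|")),
    (-13.0, -6.0, ["a", "cat"], list("a|cat|")),
]

alpha, beta, gamma = 1.5, 0.5, 0.25
scores = [decoder_score(am, lm, w, pi, alpha, beta, gamma) for am, lm, w, pi in candidates]
best = max(range(len(candidates)), key=lambda idx: scores[idx])
print(scores[0])           # -12.0 - 6.0 + 1.0 - 0.5 = -17.5
print(candidates[best][2])
```

With these numbers the first candidate wins: its slightly worse acoustic score than the second candidate is outweighed by a much better language model score.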


For decoding WSJ, we tune the hyperparameters α, β and γ using a random search. Finally, we decode the emissions from the acoustic model with the best parameter setting for α, β and γ. We use a beam size of 4,000 and beam score threshold of 250 for the word based language models, and a beam size of 1,500 with beam score threshold 40 for the character based language model.
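A random search over α, β and γ can be sketched as follows. Everything specific here is a placeholder: `toy_wer` stands in for decoding the validation set and measuring WER, and the search ranges and number of trials are hypothetical values, not the ones used by the authors.

```python
import random


def random_search(evaluate_wer, ranges, num_trials=50, seed=0):
    """Sample hyperparameters uniformly and keep the setting with the lowest WER."""
    rng = random.Random(seed)
    best_params, best_wer = None, float("inf")
    for _ in range(num_trials):
        params = {name: rng.uniform(lo, hi) for name, (lo, hi) in ranges.items()}
        wer = evaluate_wer(**params)
        if wer < best_wer:
            best_params, best_wer = params, wer
    return best_params, best_wer


def toy_wer(alpha, beta, gamma):
    """Stand-in objective; a real run would decode the validation set instead."""
    return (alpha - 2.0) ** 2 + (beta - 1.0) ** 2 + (gamma + 0.5) ** 2


ranges = {"alpha": (0.0, 5.0), "beta": (-5.0, 5.0), "gamma": (-5.0, 5.0)}
best_params, best_wer = random_search(toy_wer, ranges)
print(best_params, best_wer)
```

The best setting found on the validation data is then used once for the final decoding pass, as described above.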

3.4. Pre-Training Models (Training details)

  1. Optimizer: Adam + cosine learning rate scheduler annealed over 40k update steps (for both WSJ and the clean Librispeech training datasets) or over 400k steps (for full Librispeech).
  2. Hyperparameters:
    • Learning rate: 1e-7 warmed up for 500 updates to 5e-3, then decayed to 1e-6.
    • Number of negative samples: 10
    • Number of prediction tasks K (the maximum number of future steps to predict): 12
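The learning rate schedule above can be written as a small function. This is a sketch under assumptions the summary does not spell out: the warm-up is taken to be linear, and the cosine annealing is taken to run from the end of warm-up to the 40k-step mark (400k for full Librispeech).

```python
import math


def learning_rate(step, warmup_steps=500, total_steps=40_000,
                  init_lr=1e-7, peak_lr=5e-3, final_lr=1e-6):
    """Linear warm-up from init_lr to peak_lr, then cosine decay to final_lr."""
    if step < warmup_steps:
        return init_lr + (peak_lr - init_lr) * step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 250, 500, 20_000, 40_000):
    print(step, learning_rate(step))
```

For the full Librispeech setting the same function would be called with `total_steps=400_000`.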

We train the first wav2vec variant on 8 GPUs and put audio sequences amounting up to 1.5M frames on each GPU.

Trick for audio length alignment & data augmentation:

Sequences are grouped by length and we crop each to a maximum size of 150k frames, or the length of the shortest sequence in the batch, whichever is smaller. Cropping removes speech signal from either the beginning or the end of the sequence and we randomly decide the cropping offsets for each sample; we re-sample every epoch. This is a form of data augmentation but also ensures equal length of all sequences on a GPU and removes on average 25 % of the training data. After cropping, the total effective batch size across GPUs is about 556 seconds of speech. For the large model variant, we train on 16 GPUs, doubling the effective batch size.
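The cropping step can be sketched in a few lines. This is an illustration of the rule described above, not the fairseq data loader: the function name is hypothetical, and sequences are represented as plain Python lists of samples.

```python
import random

MAX_CROP = 150_000  # maximum crop size in frames, from the text


def crop_batch(batch, max_len=MAX_CROP, rng=random):
    """Crop every sequence to min(max_len, shortest length) at a random offset."""
    target = min(max_len, min(len(seq) for seq in batch))
    cropped = []
    for seq in batch:
        # A random offset removes signal from the start, the end, or both.
        offset = rng.randint(0, len(seq) - target)
        cropped.append(seq[offset:offset + target])
    return cropped


rng = random.Random(0)
batch = [list(range(n)) for n in (1200, 1000, 1500)]
cropped = crop_batch(batch, max_len=1100, rng=rng)
print([len(seq) for seq in cropped])  # every sequence has the same length
```

Calling `crop_batch` again each epoch draws new offsets, which is what makes the cropping act as data augmentation. As a rough consistency check, and assuming "frames" means raw 16 kHz samples, 8 GPUs × 1.5M frames is 750 seconds of audio, and removing about 25 % leaves roughly 560 seconds, in line with the figure of about 556 seconds quoted above.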

4. Results

The authors directly evaluate the performance on downstream tasks including: WSJ (ASR) and TIMIT (phoneme recognition).

4.1. Pre-Training for the WSJ Benchmark

In the pre-training experiments, the output of the context network is fed to the acoustic model instead of log-mel filterbank features. The models are evaluated on the WSJ (nov93dev & nov92) data.

The results are shown below:


We can see from the table that:

  1. The proposed method outperforms the existing character-based approach Deep Speech 2.
  2. Ghahremani et al. (2017) pre-train a phoneme-based model on the labeled version of Librispeech and then fine-tune on WSJ. The proposed wav2vec large still outperforms Ghahremani et al. (2017), despite a weaker baseline model and not using Librispeech transcriptions.

To understand the impact of pre-trained representations with less transcribed data, the authors then train acoustic models with different amounts of labeled training data and measure accuracy with pre-trained representations and without them (i.e., with log-mel filterbanks instead).

  • Note: I am not entirely sure how this experiment works. My understanding is that the labeled si284 data is used to train the acoustic models in both cases, so the amount of speech used for training is the same; the only difference between Baseline and wav2vec is whether the raw audio is converted to pre-trained representations or to log-mel filterbank features.

The curves below show how WER changes; decoding is performed with the 4-gram language model:


Similar to Hénaff et al. (2019), we noticed that fine-tuning the embedding network does not meaningfully improve performance while substantially increasing the acoustic model training time.

4.2. Pre-Training for TIMIT

On the TIMIT task we use a 7-layer wav2letter++ model with high dropout (§3; Synnaeve & Dupoux (2016b)).

Table 2 shows that wav2vec pre-training on Librispeech and WSJ audio data can lead to results matching the state of the art. Accuracy steadily increases with more data for pre-training and the best accuracy is achieved with the largest amount of data for pre-training.


4.3. Ablations

Table 3 shows that increasing the number of negative samples only helps up to ten samples. Thereafter, performance plateaus while training time increases.

We suspect that this is because the training signal from the positive samples decreases as the number of negative samples increases.


Effect of cropping (data augmentation):

When creating batches we crop sequences to a pre-defined maximum length.

Table 4 shows that a crop size of 150k frames results in the best performance. Not restricting the maximum length (None) gives an average sequence length of about 207k frames and results in the worst accuracy. This is most likely because the setting provides the least amount of data augmentation.


Effect of different number of tasks K:

Table 5 also shows that predicting more than 12 steps ahead in the future does not result in better performance, while increasing the number of steps increases training time.

5. Conclusion

We introduce wav2vec, the first application of unsupervised pre-training to speech recognition with a fully convolutional model.

Our approach achieves 2.43 % WER on the test set of WSJ, a result that outperforms the next best known character-based speech recognition model in the literature (Amodei et al., 2016) while using two orders of magnitude less transcribed training data.

We show that more data for pre-training improves performance and that this approach not only improves resource-poor setups, but also settings where all WSJ training data is used.

In future work, we will investigate different architectures, which are likely to further improve performance.