[Paper-PreTrain] wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations


Last Updated: 2020-06-28

This paper, wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations, was proposed by researchers from Facebook AI.

Code: https://github.com/pytorch/fairseq/ (seems not released yet)

In this paper, the authors show that learning powerful representations from speech audio alone, followed by fine-tuning on transcribed speech, can outperform the best semi-supervised methods while being conceptually simpler.

wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations which are jointly learned.

Results: SOTA on the 100-hour subset of Librispeech as well as on TIMIT phoneme recognition.

1. Introduction

Motivation: current speech recognition systems require thousands of hours of transcribed speech to reach acceptable performance which is not available for the vast majority of the nearly 7,000 languages spoken worldwide [30]. Learning purely from labeled examples does not resemble language acquisition in humans: infants learn language by listening to adults around them - a process that requires learning good representations of speech.

Our approach encodes speech audio via a multi-layer convolutional neural network and then masks spans of the resulting latent speech representations [25,54], similar to masked language modeling [9]. The latent representations are fed to a Transformer network to build contextualized representations and the model is trained via a contrastive task where the true latent is to be distinguished from distractors [51, 47, 46, 27].


Pretraining and fine-tuning:

As part of training, we learn discrete linguistic units [51,31,7,17] via a Gumbel softmax [23,5] to represent the latent representations in the contrastive task (Figure 1) which we find to be more effective than non-quantized targets. After pre-training on unlabeled speech, the model is fine-tuned on labeled data with a Connectionist Temporal Classification (CTC) loss [14,4] to be used for downstream speech recognition tasks.

Limitation of previous works (vq-wav2vec):

Previous work learned a quantization of the data followed by contextualized representations with a self-attention model [5,4], whereas our approach solves both problems end-to-end. Masking parts of the input with Transformer networks for speech has been explored [4,25], but prior work either relies on a two-step pipeline or trains the model by reconstructing the filter bank input features. Other related work includes learning representations from auto-encoding the input data [50,11] or directly predicting future timesteps [8].


Our results demonstrate the feasibility of ultra-low resource speech recognition: when using only 10 minutes of labeled data, our approach achieves word error rate (WER) 5.7/10.1 on the clean/noisy test sets of Librispeech. We set a new state of the art on TIMIT phoneme recognition as well as the 100 hour clean subset of Librispeech. Moreover, when we lower the amount of labeled data to just one hour, we still outperform the previous state of the art self-training method of [41] while using 100 times less labeled data and the same amount of unlabeled data. When we use all 960 hours of labeled data from Librispeech, then our model achieves 1.9/3.5 WER which performs competitively to the best published result while using a simpler baseline architecture.

2. Model

The model is composed of a multi-layer convolutional feature encoder f:X → Z which takes as input raw audio X and outputs latent speech representations z_1,…,z_T. They are then fed to a Transformer g:Z → C to build representations c_1,…,c_T capturing information from the entire sequence [9,5,4]. The output of the feature encoder is discretized to q_t with a quantization module Z → Q to represent the targets (Figure 1) in the self-supervised objective.

Design details of model component architectures:

Feature encoder The encoder consists of several blocks containing a temporal convolution followed by a GELU activation function [20]. The first block maps raw audio to a feature representation and to increase robustness, we add a group normalization before the GELU to normalize each output channel over the sequence. We apply layer normalization to the output channels of this network [1].

Contextualized representations with Transformers The output of the feature encoder is fed to a context network which follows the Transformer architecture [53,9,32]. Instead of fixed positional embeddings which encode absolute positional information, we use a convolutional layer with kernel size 128 and 16 groups similar to [36,4,55] which acts as relative positional embedding. We add the output of the convolution followed by a GELU to the inputs and then apply layer normalization.

Quantization module For self-supervised training we discretize the output of the feature encoder z to a finite set of speech representations via product quantization [24,5]. This amounts to choosing quantized representations from multiple codebooks and concatenating them. Given G codebooks, or groups, with V entries e∈R^{V×d/G}, we choose one entry from each codebook and concatenate the resulting vectors e_1,…,e_G and apply a linear transformation R^d→R^f to obtain q∈R^f.
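The lookup-and-concatenate step can be sketched in a few lines. This is only an illustration with toy sizes and random values: `product_quantize`, `codebooks`, `indices`, and `proj` are hypothetical names, and the indices are assumed to be already chosen (in the paper they come from the Gumbel softmax).

```python
import random

def product_quantize(codebooks, indices, proj):
    """Pick one entry per codebook, concatenate them, then map R^d -> R^f."""
    concat = []
    for book, idx in zip(codebooks, indices):
        concat.extend(book[idx])  # each entry has size d/G
    # proj holds f rows of length d (a plain linear map without bias)
    return [sum(w * x for w, x in zip(row, concat)) for row in proj]

rng = random.Random(0)
G, V, d, f = 2, 4, 6, 3  # toy sizes, not the values used in the paper
codebooks = [[[rng.gauss(0, 1) for _ in range(d // G)] for _ in range(V)]
             for _ in range(G)]
proj = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(f)]

q = product_quantize(codebooks, indices=[1, 3], proj=proj)
```

With G groups of V entries each, there are V^G distinct index combinations, which is why a small number of codebooks can still express many codewords.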

Gumbel softmax:

The Gumbel softmax enables choosing discrete codebook entries in a fully differentiable way [15,23,34]. We use the straight-through estimator [25] and set up G hard Gumbel softmax operations [23]. The feature encoder output z is mapped to l∈R^{G×V} logits and the probabilities for choosing the v-th codebook entry for group g are

\[p_{g, v}=\frac{\exp((l_{g, v}+n_v)/\tau)}{\sum_{k=1}^{V}\exp((l_{g,k}+n_k)/\tau)}\]

where τ is a non-negative temperature, n=−log(−log(u)) and u are uniform samples from U(0,1). During the forward pass, codeword i is chosen by i=argmax_j p_{g,j} and in the backward pass, the true gradient of the Gumbel softmax outputs is used.
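A minimal sketch of the forward pass for one group may help: Gumbel noise is added to the logits, the sum is divided by τ, a softmax is taken, and the argmax is the chosen codeword. The function names are hypothetical, the logits are made up, and the straight-through backward pass is not shown because it needs an autodiff framework.

```python
import math
import random

def gumbel_softmax_probs(logits, tau, rng):
    """Return softmax((l_v + n_v) / tau) with Gumbel noise n_v = -log(-log(u))."""
    noisy = []
    for l in logits:
        # Clamp u away from 0 and 1 so both logarithms are defined
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        n = -math.log(-math.log(u))
        noisy.append((l + n) / tau)
    m = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]

def hard_choice(probs):
    """Forward-pass selection: index of the largest probability."""
    return max(range(len(probs)), key=probs.__getitem__)

rng = random.Random(0)
probs = gumbel_softmax_probs([1.0, 0.5, -2.0], tau=2.0, rng=rng)
index = hard_choice(probs)
```

Lower temperatures make the distribution sharper, so the soft probabilities used for the gradient get closer to the hard one-hot choice.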

3. Training

3.1. Masking

The authors mask a proportion of the feature encoder outputs, or time steps, before feeding them to the context network and replace them with a trained feature vector shared between all masked time steps; they do not mask inputs to the quantization module. To mask the latent speech representations output by the encoder, they randomly sample without replacement a proportion p=0.065 of all time steps to be starting indices and then mask the subsequent M=10 consecutive time steps from every sampled index; spans may overlap. This results in approximately 49% of all time steps being masked, with a mean span length of 14.7 time steps, or 299ms (see Appendix A in the original paper for more details on masking).
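The roughly 49% figure can be checked with a small simulation. The sketch below assumes one very long sequence and truncates spans at the end of the sequence; `sample_mask` is a hypothetical helper, not the authors' code.

```python
import random

def sample_mask(num_steps, p=0.065, span=10, rng=None):
    """Sample start indices without replacement and mask `span` steps from each."""
    rng = rng or random.Random()
    num_starts = int(round(p * num_steps))
    starts = rng.sample(range(num_steps), num_starts)
    mask = [False] * num_steps
    for s in starts:
        # Spans may overlap; spans running past the end are truncated
        for t in range(s, min(s + span, num_steps)):
            mask[t] = True
    return mask

mask = sample_mask(100_000, rng=random.Random(0))
masked_fraction = sum(mask) / len(mask)
print(f"masked fraction: {masked_fraction:.3f}")
```

A time step stays unmasked only if none of the 10 positions ending at it was chosen as a start, which happens with probability about (1 − 0.065)^10 ≈ 0.51, so about 49% of steps are masked.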

3.2. Objective

During pre-training, the model learns with multiple objectives: a contrastive task L_m, a codebook diversity loss L_d, and an L2 penalty L_f.

\[L=L_m+\alpha L_d + \beta L_f\]

where \alpha and \beta are tuned hyperparameters.

Contrastive Loss

Given context network output c_t centered over masked time step t, the model needs to identify the true quantized latent speech representation q_t in a set of K+1 quantized candidate representations \tilde{q}∈Q_t which includes q_t and K distractors [22,52]. Distractors are uniformly sampled from other masked time steps of the same utterance. The loss is defined as

\[L_m=-\log\frac{\exp(sim(c_t, q_t)/\kappa)}{\sum_{\tilde{q}\sim Q_t}\exp(sim(c_t, \tilde{q})/\kappa)}\]

where sim(a,b)=a^T b/(‖a‖‖b‖) is the cosine similarity between context representations and quantized latent speech representations [18, 6], and κ is a temperature.
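For a single masked time step, the contrastive loss is a cross-entropy over temperature-scaled cosine similarities with the true target in position 0. The sketch below uses plain Python lists and made-up vectors; `cosine` and `contrastive_loss` are hypothetical names.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def contrastive_loss(c_t, q_t, distractors, kappa=0.1):
    """-log softmax over sim(c_t, candidate) / kappa, evaluated at the true q_t."""
    logits = [cosine(c_t, q) / kappa for q in [q_t] + list(distractors)]
    m = max(logits)  # log-sum-exp with max subtraction for stability
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return -(logits[0] - log_denominator)

c_t = [1.0, 0.0]
# The context vector points at the true target: loss close to zero
loss_good = contrastive_loss(c_t, [2.0, 0.0], [[0.0, 1.0], [0.0, -3.0]])
# The context vector points at a distractor instead: large loss
loss_bad = contrastive_loss(c_t, [0.0, 1.0], [[2.0, 0.0], [0.0, -3.0]])
```

Because cosine similarity ignores vector length, only the direction of c_t relative to the candidates matters; the temperature controls how strongly small similarity gaps are penalized.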

Diversity Loss

The contrastive task depends on the codebook to represent both positive and negative examples and the diversity loss L_d is designed to increase the use of the quantized codebook representations [10]. We encourage the equal use of the V entries in each of the G codebooks by maximizing the entropy of the averaged softmax distribution l over the codebook entries for each codebook \overline{p_g} across a batch of utterances; the softmax distribution does not contain the Gumbel noise nor a temperature.

\[L_d=\frac{1}{GV}\sum_{g=1}^G -H(\overline{p}_g)=\frac{1}{GV} \sum_{g=1}^{G} \sum_{v=1}^{V} \overline{p}_{g, v}\log \overline{p}_{g,v}\]
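To see how this term behaves, the following sketch evaluates it on two extreme cases, assuming the batch-averaged softmax distributions are already available as nested lists. `diversity_loss` is a hypothetical helper name.

```python
import math

def diversity_loss(avg_probs):
    """avg_probs: G lists of V batch-averaged probabilities, one list per codebook."""
    G = len(avg_probs)
    V = len(avg_probs[0])
    total = 0.0
    for p_g in avg_probs:
        # Negative entropy of one codebook; terms with p = 0 contribute 0
        total += sum(p * math.log(p) for p in p_g if p > 0.0)
    return total / (G * V)

uniform = diversity_loss([[0.25] * 4, [0.25] * 4])       # all entries used equally
collapsed = diversity_loss([[1.0, 0.0, 0.0, 0.0]] * 2)   # a single entry always used
```

The loss is lowest, −log(V)/V, when every entry is used equally often and rises to 0 when a codebook collapses to a single entry, so minimizing it pushes toward equal usage.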


Stabilizing the Feature Encoder

The authors found it helpful to apply an L2 penalty to the activations of the final layer of the feature encoder, before the final layer normalization. They also scale down the global learning rate for weight updates to the feature encoder by γ, see §4.2.

3.3. Fine-tuning

Pre-trained models are fine-tuned for speech recognition by adding a randomly initialized linear projection on top of the context network into C classes representing the vocabulary of the task [4]. For Librispeech, we have 29 tokens for character targets plus a word boundary token. Models are optimized by minimizing a CTC loss [14] and we apply a modified version of SpecAugment [40] by masking time steps and channels during training, which delays overfitting and significantly improves the final error rates, especially on the Libri-light subsets with few labeled examples.

4. Experimental Setup

4.1. Datasets

Pretraining: either the 960-hour Librispeech corpus (LS-960) without transcriptions, or the audio data from LibriVox (LV-60k), which yields 53.2k hours of audio after the same preprocessing as Libri-light [26].

Fine-tuning:

  1. 960 hours of transcribed Librispeech
  2. train-clean-100 subset comprising 100 hours (100 hours labeled)
  3. Libri-light limited resource training subsets originally extracted from Librispeech: train-10h (10 hours labeled), train-1h (1 hour labeled), train-10min (10 min labeled)
  4. TIMIT dataset containing five hours of audio recordings with detailed 39 phoneme labels.

4.2. Pre-training

The feature encoder contains seven blocks and the temporal convolutions in each block have 512 channels with strides (5,2,2,2,2,2,2) and kernel widths (10,3,3,3,3,2,2). This results in an encoder output frequency of 49 Hz with a stride of about 20ms between each sample, and a receptive field of 400 input samples or 25 ms of audio.
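These numbers follow directly from the kernel widths and strides. The sketch below assumes 16 kHz audio (consistent with 250k samples being 15.6 sec) and convolutions without padding; the helper names are hypothetical.

```python
def encoder_geometry(kernels, strides):
    """Return (receptive field, hop) of the conv stack, both in input samples."""
    receptive_field, hop = 1, 1
    for k, s in zip(kernels, strides):
        receptive_field += (k - 1) * hop  # each layer widens the field by (k-1) hops
        hop *= s                          # strides multiply through the stack
    return receptive_field, hop

def num_frames(num_samples, kernels, strides):
    """Output length of the stack for unpadded convolutions."""
    n = num_samples
    for k, s in zip(kernels, strides):
        n = (n - k) // s + 1
    return n

kernels = (10, 3, 3, 3, 3, 2, 2)
strides = (5, 2, 2, 2, 2, 2, 2)
sample_rate = 16_000  # assumption: 16 kHz audio

rf, hop = encoder_geometry(kernels, strides)
frames_per_second = num_frames(sample_rate, kernels, strides)
print(rf, hop, frames_per_second)
print(1000 * rf / sample_rate, "ms receptive field,", 1000 * hop / sample_rate, "ms stride")
```

The product of the strides is 320 samples (20 ms), which would give 50 frames per second, but without padding one second of audio produces 49 frames, matching the quoted 49 Hz.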

We experiment with two model configurations which use the same encoder architecture but differ in the Transformer setup: BASE contains 12 transformer blocks, model dimension 768, inner dimension (FFN) 3,072 and 8 attention heads. Batches are built by cropping 250k audio samples, or 15.6 sec, from each example. Crops are batched together to not exceed 1.4M samples per GPU and we train on a total of 64 V100 GPUs for 1.6 days [37]; the total batch size is 1.6h.

The LARGE model contains 24 transformer blocks with model dimension 1,024, inner dimension 4,096 and 16 attention heads. We crop 320K audio samples, or 20sec, with a limit of 1.2M samples per GPU and train on 128 V100 GPUs over 2.3 days for Librispeech and 5.2 days for LibriVox; the total batch size is 2.7h.
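The total batch sizes quoted for BASE and LARGE can be reproduced with simple arithmetic, assuming 16 kHz audio and that every GPU is filled up to its sample limit. `total_batch_hours` is a hypothetical helper.

```python
SAMPLE_RATE = 16_000  # assumption: 16 kHz audio

def total_batch_hours(samples_per_gpu, num_gpus, sample_rate=SAMPLE_RATE):
    """Audio duration in hours processed per update across all GPUs."""
    seconds = samples_per_gpu * num_gpus / sample_rate
    return seconds / 3600

base_hours = total_batch_hours(1_400_000, 64)    # BASE: 1.4M samples on 64 GPUs
large_hours = total_batch_hours(1_200_000, 128)  # LARGE: 1.2M samples on 128 GPUs
print(round(base_hours, 1), round(large_hours, 1))
```

The same conversion gives the crop lengths: 250k samples are 15.6 sec and 320k samples are 20 sec.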

We use dropout 0.1 in the Transformer, at the output of the feature encoder and the input to the quantization module. Layers are dropped at a rate of 0.05 for BASE and 0.2 for LARGE [21, 12]; there is no layer drop for LV-60k.

We optimize with Adam [28], warming up the learning rate for the first 8% of updates to a peak of 5×10^{−3} for BASE and 3×10^{−3} for LARGE, and then linearly decay it. LARGE trains for 250k updates, BASE for 400k updates, and LARGE on LV-60k for 600k updates. We use weight α=0.1 for the diversity loss and β=10 for the feature penalty in Equation 2. For the quantization module we use G=2 and V=320 for both models, resulting in a theoretical maximum of V^G = 320^2 = 102.4k codewords, since each codeword combines one of the V entries from each of the G codebooks. Entries are of size d/G=128 for BASE and d/G=384 for LARGE. The Gumbel softmax temperature τ is annealed from 2 to a minimum of 0.5 for BASE and 0.1 for LARGE by a factor of 0.999995 at every update. The temperature in the contrastive loss (Equation 3) is set to κ=0.1. We set the feature encoder gradient scaling factor to γ=0.1 for Librispeech and γ=0.03 for LibriVox. In the contrastive loss we use K=100 distractors. We choose the training checkpoint with the lowest L_m on the validation set.
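The temperature annealing amounts to a multiplicative decay with a floor. A minimal sketch, assuming the factor is applied once per update starting from update 0; `gumbel_temperature` is a hypothetical name.

```python
import math

def gumbel_temperature(update, start=2.0, minimum=0.5, decay=0.999995):
    """Temperature after `update` steps: start * decay**update, floored at `minimum`."""
    return max(minimum, start * decay ** update)

# Number of updates until the BASE floor (0.5) is reached
updates_to_floor = math.ceil(math.log(0.5 / 2.0) / math.log(0.999995))
print(updates_to_floor, gumbel_temperature(updates_to_floor))
```

For LARGE, pass `minimum=0.1`. Under these assumptions the BASE floor is reached after a bit more than 277k updates, so the temperature keeps decaying for a large part of a 400k-update run.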

4.3. Fine-tuning

After pre-training we fine-tune the learned representations on labeled data and add a randomly initialized output layer on top of the Transformer to predict characters (Librispeech/Libri-light) or phonemes (TIMIT). For Libri-light, we train three seeds with two different learning rates (2e-5 and 3e-5) for all subsets and choose the configuration with lowest WER on the dev-other subset decoded with the official 4-gram language model (LM) with beam 50 and fixed model weights (LM weight 2, word insertion penalty -1). For BASE on the labeled 960h subset we use a learning rate of 1e-4. We optimize with Adam and a tri-state rate schedule where the learning rate is warmed up for the first 10% of updates, held constant for the next 40% and then linearly decayed for the remainder. BASE uses a batch size of 3.2M samples per GPU and we fine-tune on 8 GPUs, giving a total batch size of 1,600 sec. LARGE batches 1.28M samples on each GPU and we fine-tune on 24 GPUs, resulting in an effective batch size of 1,920 sec. For the first 10k updates only the output classifier is trained, after which the Transformer is also updated. The feature encoder is not trained during fine-tuning. We mask the feature encoder representations with a strategy similar to SpecAugment [40] detailed in Appendix B.
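The three-phase learning rate schedule can be written as a small function. This is a sketch under two assumptions the text does not state: the warm-up starts from zero and the linear decay ends at zero. `tri_stage_lr` is a hypothetical name.

```python
def tri_stage_lr(update, total_updates, peak_lr, warmup_frac=0.1, hold_frac=0.4):
    """Warm up for 10% of updates, hold for 40%, then decay linearly."""
    warmup_end = warmup_frac * total_updates
    hold_end = (warmup_frac + hold_frac) * total_updates
    if update < warmup_end:
        return peak_lr * update / warmup_end  # assumption: warm-up starts at 0
    if update < hold_end:
        return peak_lr
    remaining = total_updates - hold_end
    # Assumption: the decay reaches 0 at the final update
    return peak_lr * max(0.0, (total_updates - update) / remaining)

schedule = [tri_stage_lr(u, 1000, 1e-4) for u in (0, 50, 100, 300, 750, 1000)]
print(schedule)
```

Holding the peak rate for a long stretch before decaying gives the randomly initialized output layer time to settle before the rate drops.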

4.4. Language Models and Decoding

We consider two types of language models (LM): a 4-gram model and a Transformer [3] trained on the Librispeech LM corpus. The Transformer LM is identical to [49] and contains 20 blocks, model dimension 1280, inner dimension 6144 and 16 attention heads. We tune the weights of the language model (interval [0,5]) and a word insertion penalty ([−5,5]) via Bayesian optimization: we run 128 trials with beam 500 for the 4-gram LM and beam 50 for the Transformer LM and choose the best set of weights according to performance on dev-other. Test performance is measured with beam 1,500 for the n-gram LM and beam 500 for the Transformer LM. We use the beam search decoder of [43].

5. Results

5.1. Low-Resource Labeled Data Evaluation

WER results on Librispeech dev/test sets:


We can see that:

  1. The LARGE model pre-trained on LV-60k and fine-tuned on only 10 minutes of labeled data achieves a WER of 5.7/10.1 on the clean/other test sets. This demonstrates that ultra-low resource speech recognition is possible with self-supervised learning on unlabeled data. This approach improves over previous pre-training work which did not learn quantized audio units jointly [4], reducing WER by about a third.
  2. A recent iterative self-training approach [41] represents the SOTA on the clean 100 hour subset of Librispeech but it requires multiple iterations of labeling, filtering, and re-training. On the 100 hour subset of Librispeech, their method achieves WER 4.2/8.6 on test-clean/other which compares to WER 2.3/5.0 with the LARGE model in a like for like setup, a relative WER reduction of 45%/42%.
  3. When the LARGE model uses an order of magnitude less labeled data (10h labeled), then it still achieves WER 3.2/6.1, an error reduction of 24%/29% relative to iterative self-training.
  4. Using only a single hour of labeled data, the same model achieves WER 3.9/7.6 which improves on both test-clean and test-other by 7%/12% - with two orders of magnitude less labeled data.
  5. Libri-light data splits contain both clean and noisy data leading to better accuracy on test-other compared to test-clean. (where???)
  6. Increasing model size reduces WER on all setups with the largest improvements on test-other (BASE vs. LARGE both on LS-960) and increasing the amount of unlabeled training data also leads to large improvements (LARGE LS-960 vs. LV-60k).
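The relative reductions quoted in points 2 to 4 can be re-derived from the WER pairs given there. The sketch below only uses numbers from the list; `relative_reduction` is a hypothetical helper.

```python
def relative_reduction(baseline_wer, new_wer):
    """Relative WER reduction in percent."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

self_training = (4.2, 8.6)  # iterative self-training [41], test-clean/other, 100h labeled
large_model = {
    "100h": (2.3, 5.0),
    "10h": (3.2, 6.1),
    "1h": (3.9, 7.6),
}

reductions = {
    setup: tuple(round(relative_reduction(b, w)) for b, w in zip(self_training, wers))
    for setup, wers in large_model.items()
}
print(reductions)
```

Rounded to whole percentages, this reproduces 45%/42%, 24%/29% and 7%/12% for the 100h, 10h and 1h setups.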

5.2. High-Resource Labeled Data Evaluation on Librispeech

WER results on Librispeech with 960 hours labeled data:


We can find that:

  1. This work achieves WER 1.9/3.5 on test-clean/other. This is the first time self-supervised learning achieves results competitive to the state of the art iterative semi-supervised methods in a high-resource labeled data setup.

Authors’ explanation: This is despite a weaker baseline architecture: supervised training of the same architecture achieves WER 2.1/4.6 (LARGE, from scratch) compared to WER 1.9/4.1 for ContextNet [16], the baseline architecture of the SOTA Noisy student [41]. The vocabulary of their acoustic model (characters) does not match the vocabulary of the LM (words), which delays feedback from the LM and is likely to be detrimental (they did not use subwords). Moreover, they did not use any data balancing such as [41]. Finally, self-training is likely complementary to pre-training and their combination may yield even better results. Appendix E presents a detailed error analysis of their pre-trained models in various labeled data setups.

5.3. Phoneme Recognition on TIMIT

The authors fine-tuned with the same setup as for the 10 hour subset of Libri-light but did not use a language model.


We can find:

  1. The proposed approach achieves a new SOTA on this dataset, reducing PER by a relative 23%/29% over the next best result on the dev/test sets.

Appendix D shows an analysis of how the discrete latent speech representations relate to phonemes.

5.4. Ablations

A difference to previous work [5,4] is that we quantize the latent audio representations only for the contrastive loss, i.e., when latents are used as targets, but not when the latents are input to the Transformer network. We motivate this choice with an ablation for which we adopt a reduced training setup to increase experimental turnaround: we pre-train BASE on LS-960 for 250k updates with masking probability p=0.075, fine-tune on train-10h for 60k updates on a single GPU with 640k samples per batch, or 40 sec of speech audio. We report the average WER and standard deviation on the concatenation of dev-clean and dev-other for three seeds of fine-tuning.


Table 4 shows that:

  1. Strategy of continuous inputs with quantized targets (Baseline) performs best. Continuous latent speech representations retain more information to enable better context representations and quantizing the target representations leads to more robust training.

  2. Quantizing the latents both in the input and the targets performs least well, and explains the lower performance of prior work [5,4].

  3. Continuous targets reduce the effectiveness of self-supervised training since targets can capture detailed artifacts of the current sequence, e.g. speaker and background information, which make the task easier and prevent the model from learning general representations beneficial to speech recognition.

  4. Continuous inputs and continuous targets perform second best but various attempts to improve it did not lead to better results (see Appendix F for this experiment and other ablations on various hyperparameters). The training accuracy of identifying the correct latent audio representation increases from 62% to 78.0% when switching from quantized to continuous targets.

6. Conclusion


We presented wav2vec 2.0, a framework for self-supervised learning of speech representations which masks latent representations of the raw waveform and solves a contrastive task over quantized speech representations.


Our experiments show the large potential of pre-training on unlabeled data for speech processing: when using only 10 minutes of labeled training data, or 48 recordings of 12.5 seconds on average, we achieve a WER of 5.7/10.1 on test-clean/other of Librispeech.

Potential improvements:

We expect performance gains by switching to a seq2seq architecture and a word piece vocabulary.