[Paper-Vocoder] WaveNet: A Generative Model for Raw Audio
Published:
Last Updated: 2020-07-04
The paper, WaveNet: A Generative Model for Raw Audio, was written by researchers from Google and DeepMind.
Code: https://github.com/kan-bayashi/PytorchWaveNetVocoder (not official)
Samples: https://www.deepmind.com/blog/article/wavenet-generative-model-raw-audio
This paper introduces WaveNet, a deep neural network for generating raw audio waveforms.
1. Introduction
This paper introduces WaveNet, an audio generative model based on the PixelCNN (van den Oord et al., 2016a;b) architecture. The main contributions of this work are as follows:
- It shows that WaveNets can generate raw speech signals with subjective naturalness never before reported in the field of text-to-speech (TTS), as assessed by human raters.
- In order to deal with long-range temporal dependencies needed for raw audio generation, the authors develop new architectures based on dilated causal convolutions, which exhibit very large receptive fields.
- It is shown that when conditioned on a speaker identity, a single model can be used to generate different voices.
- The same architecture shows strong results when tested on a small speech recognition dataset, and is promising when used to generate other audio modalities such as music.
In general, WaveNets provide a generic and flexible framework for tackling many applications that rely on audio generation (e.g. TTS, music, speech enhancement, voice conversion, source separation).
2. WaveNet
WaveNet is a generative model operating directly on the raw audio waveform. The joint probability of a waveform x={x_1, …, x_T} is factorised as a product of conditional probabilities as follows:
\[p(\textbf{x})=\prod_{t=1}^T p(x_t|x_1, ..., x_{t-1})\]
Each audio sample x_t is therefore conditioned on the samples at all previous timesteps.
Similarly to PixelCNNs (van den Oord et al., 2016a;b), the conditional probability distribution is modelled by a stack of convolutional layers. There are no pooling layers in the network, and the output of the model has the same time dimensionality as the input. The model outputs a categorical distribution over the next value x_t with a softmax layer and it is optimized to maximize the log-likelihood of the data w.r.t. the parameters.
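Under this factorisation, the log-likelihood of a waveform is the sum of the per-sample conditional log-probabilities. The following is a minimal sketch in plain Python, assuming the samples are integer class indices; `uniform_conditional` is a hypothetical stand-in for the network's softmax output, and the 256 classes are an illustrative assumption.

```python
import math

def sequence_log_likelihood(samples, conditional):
    """Return sum_t log p(x_t | x_1, ..., x_{t-1})."""
    total = 0.0
    for t, x_t in enumerate(samples):
        probs = conditional(samples[:t])  # categorical distribution over x_t
        total += math.log(probs[x_t])
    return total

def uniform_conditional(history, num_classes=256):
    # Hypothetical placeholder: ignores the history and predicts uniformly.
    return [1.0 / num_classes] * num_classes

log_lik = sequence_log_likelihood([3, 200, 17, 42], uniform_conditional)
```

Training maximises this quantity; with a real model, `conditional` would be the stack of convolutional layers followed by the softmax.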
2.1. Dilated Causal Convolutions
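Figure 2 shows a stack of causal convolutional layers. Causality means that the prediction at timestep t depends only on samples before t and never on future timesteps, which preserves the ordering required by the factorisation above.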
At training time, the conditional predictions for all timesteps can be made in parallel because all timesteps of ground truth x are known. When generating with the model, the predictions are sequential: after each sample is predicted, it is fed back into the network to predict the next sample.
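The sequential generation loop can be sketched as follows. This is only an illustration: `toy_conditional` is a hypothetical stand-in for the trained network, and the four-class alphabet is chosen for brevity.

```python
import random

def generate(conditional, num_steps, rng):
    """Draw samples one at a time, feeding each back as context."""
    samples = []
    for _ in range(num_steps):
        probs = conditional(samples)  # p(x_t | x_1, ..., x_{t-1})
        x_t = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        samples.append(x_t)           # fed back for the next step
    return samples

def toy_conditional(history, num_classes=4):
    # Hypothetical stand-in for the network: favours repeating the last value.
    if not history:
        return [1.0 / num_classes] * num_classes
    probs = [0.1] * num_classes
    probs[history[-1]] = 0.7
    return probs

waveform = generate(toy_conditional, num_steps=10, rng=random.Random(0))
```

Each step depends on the previous step's output, so the loop cannot be parallelised across time. This is why generation is slower than training.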
Pros and cons of causal convolutions compared to RNNs:
Because models with causal convolutions do not have recurrent connections, they are typically faster to train than RNNs, especially when applied to very long sequences. One of the problems of causal convolutions is that they require many layers or large filters to increase the receptive field. For example, in Fig. 2 the receptive field is only 5 (= #layers + filter/kernel length − 1 = 4 + 2 − 1).
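The receptive-field figure can be checked with a small experiment. The sketch below implements a single-channel causal convolution in plain Python, pushes an impulse through four layers with filter length 2, and counts how many positions it reaches. The all-ones filter weights are an arbitrary choice for illustration.

```python
def causal_conv(x, w):
    """1-D causal convolution: out[t] depends only on x[t-k+1], ..., x[t]."""
    k = len(w)
    out = []
    for t in range(len(x)):
        acc = 0.0
        for i in range(k):
            idx = t - (k - 1 - i)  # w[k-1] multiplies the current sample x[t]
            if idx >= 0:           # positions before the start count as zero
                acc += w[i] * x[idx]
        out.append(acc)
    return out

def receptive_field(num_layers, filter_length):
    # Formula quoted in the text for undilated causal layers.
    return num_layers + filter_length - 1

# An impulse at t=0 spreads one step further per layer; the number of
# positions it reaches after 4 layers equals the receptive field.
signal = [1.0] + [0.0] * 9
for _ in range(4):
    signal = causal_conv(signal, [1.0, 1.0])
reached = sum(1 for v in signal if v != 0.0)
```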
Authors’ improvements:
In this paper dilated convolutions are used to increase the receptive field by orders of magnitude, without greatly increasing computational cost.
A dilated convolution is a convolution where the filter/kernel is applied over an area larger than its length by skipping input values with a certain step. It is equivalent to a convolution with a larger filter derived from the original filter by dilating it with zeros, but is significantly more efficient. A dilated convolution effectively allows the network to operate on a coarser scale than with a normal convolution. This is similar to pooling or strided convolutions, but here the output has the same size as the input. As a special case, dilated convolution with dilation 1 yields the standard convolution.
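The equivalence with a zero-dilated filter can be illustrated directly. In this single-channel sketch, `dilated_causal_conv` and `dilate_filter` are illustrative helper names, not functions from the paper or from any library.

```python
def dilated_causal_conv(x, w, dilation):
    """Causal convolution whose taps are spaced `dilation` samples apart."""
    k = len(w)
    out = []
    for t in range(len(x)):
        acc = 0.0
        for i in range(k):
            idx = t - (k - 1 - i) * dilation
            if idx >= 0:
                acc += w[i] * x[idx]
        out.append(acc)
    return out

def dilate_filter(w, dilation):
    """Insert dilation-1 zeros between neighbouring filter taps."""
    dilated = []
    for i, tap in enumerate(w):
        dilated.append(tap)
        if i < len(w) - 1:
            dilated.extend([0.0] * (dilation - 1))
    return dilated

x = [float(v) for v in range(1, 13)]
w = [0.5, -2.0]
skipping = dilated_causal_conv(x, w, dilation=4)                       # 2 multiplies per output
zero_filled = dilated_causal_conv(x, dilate_filter(w, 4), dilation=1)  # 5 multiplies per output
```

Both calls produce the same output, but the first skips the multiplications by zero, which is where the efficiency gain comes from.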
Figure 3 depicts dilated causal convolutions for dilations 1, 2, 4, and 8.
Stacked dilated convolutions enable networks to have very large receptive fields with just a few layers, while preserving the input resolution throughout the network as well as computational efficiency. In this paper, the dilation is doubled for every layer up to a limit and then repeated: e.g. 1, 2, 4, …, 512, 1, 2, 4, …, 512, 1, 2, 4, …, 512.
Intuition:
- Exponentially increasing the dilation factor results in exponential receptive field growth with depth (Yu & Koltun, 2016). For example, each 1, 2, 4, …, 512 block has a receptive field of size 1024, and can be seen as a more efficient and discriminative (non-linear) counterpart of a 1×1024 convolution.
- Stacking these blocks further increases the model capacity and the receptive field size.
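The receptive-field sizes above follow from a short calculation. The sketch assumes a filter length of 2 (as in the figures), for which each layer extends the receptive field by its dilation.

```python
def stack_receptive_field(dilations, filter_length=2):
    """Receptive field (in samples) of stacked dilated causal convolutions."""
    return 1 + (filter_length - 1) * sum(dilations)

block = [2 ** i for i in range(10)]              # 1, 2, 4, ..., 512
one_block = stack_receptive_field(block)          # 1024
three_blocks = stack_receptive_field(block * 3)   # 3070
```

Under this assumption, repeating the block three times roughly triples the receptive field while only tripling the number of layers.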
2.2. Softmax Distributions
One approach to modeling the conditional distributions p(x_t|x_1, …, x_{t−1}) over the individual audio samples would be to use a mixture model such as a mixture density network (Bishop, 1994) or mixture of conditional Gaussian scale mixtures (MCGSM) (Theis & Bethge, 2015). However, van den Oord et al. (2016a) showed that a softmax distribution tends to work better, even when the data is implicitly continuous (as is the case for image pixel intensities or audio sample values). One of the reasons is that a categorical distribution is more flexible and can more easily model arbitrary distributions because it makes no assumptions about their shape.
Problem:
Because raw audio is typically stored as a sequence of 16-bit integer values (one per timestep), a softmax layer would need to output 65,536 probabilities per timestep to model all possible values.
Solution:
The authors apply a μ-law companding transformation (ITU-T, 1988) to the data, and then quantize it to 256 possible values:
\[f(x_t)=\text{sign}(x_t)\frac{\ln(1+\mu|x_t|)}{\ln(1+\mu)},\]
where −1 < x_t < 1 and μ = 255. This non-linear quantization produces a significantly better reconstruction than a simple linear quantization scheme. For speech in particular, the authors found that the reconstructed signal after quantization sounded very similar to the original.
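A minimal sketch of μ-law companding and its inverse is given below. The companding formula is the one above. The mapping from the companded value to an integer bin (shift to [0, μ], then round to nearest) is an assumed convention, since the text does not spell it out, and the function names are illustrative.

```python
import math

MU = 255

def mu_law_encode(x, mu=MU):
    """Compand x in [-1, 1] and quantise it to an integer in 0..mu."""
    x = max(-1.0, min(1.0, x))
    f = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    # Assumed binning: map [-1, 1] to [0, mu] and round to nearest integer.
    return int((f + 1) / 2 * mu + 0.5)

def mu_law_decode(q, mu=MU):
    """Invert the binning and the companding (approximately)."""
    f = 2 * q / mu - 1
    return math.copysign(math.expm1(abs(f) * math.log1p(mu)) / mu, f)

codes = [mu_law_encode(v) for v in (-1.0, 0.0, 1.0)]
roundtrip = mu_law_decode(mu_law_encode(0.5))
```

Because the logarithm compresses large amplitudes, the 256 bins are spaced finely near zero and coarsely near ±1, which suits the amplitude distribution of speech.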
2.3. Gated Activation Units
The authors use the same gated activation unit as used in the gated PixelCNN (van den Oord et al., 2016b):
\[\textbf{z}=\tanh(W_{f,k}*\textbf{x}) \odot \sigma(W_{g,k}*\textbf{x})\]
where ∗ denotes a convolution operator, ⊙ denotes an element-wise multiplication operator, σ(·) is a sigmoid function, k is the layer index, f and g denote filter and gate, respectively, and W is a learnable convolution filter.
In their initial experiments, the authors observed that this non-linearity worked significantly better than the rectified linear activation function (Nair & Hinton, 2010) for modeling audio signals.
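The gating itself is an element-wise operation. In the sketch below, `filter_pre` and `gate_pre` stand for the outputs of the two convolutions (W_f ∗ x and W_g ∗ x); their values are made up for illustration.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def gated_activation(filter_pre, gate_pre):
    """Element-wise tanh(filter) * sigmoid(gate) over two equal-length lists."""
    return [math.tanh(f) * sigmoid(g) for f, g in zip(filter_pre, gate_pre)]

# A strongly positive gate passes the tanh value, a strongly negative gate
# suppresses it, and a gate of zero halves it.
z = gated_activation([2.0, 2.0, -1.0], [10.0, -10.0, 0.0])
```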
2.4. Residual And Skip Connections
Both residual (He et al., 2015) and parameterised skip connections are used throughout the network, to speed up convergence and enable training of much deeper models. Fig. 4 shows a residual block of the model, which is stacked many times in the network.
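A rough single-channel sketch of one such block follows. It reflects one reading of the paper's Fig. 4 and makes several simplifying assumptions: filter length 2, scalar weights, and a single 1×1 convolution (here just the scalar `w_out`) shared by the residual and skip paths. All names and values are illustrative.

```python
import math

def residual_block(x, w_f, w_g, w_out, dilation):
    """Gated dilated causal conv, then a 1x1 conv feeding residual and skip."""
    def conv(w):
        # Causal convolution with a length-2 filter, taps `dilation` apart.
        return [w[1] * x[t] + (w[0] * x[t - dilation] if t >= dilation else 0.0)
                for t in range(len(x))]
    f, g = conv(w_f), conv(w_g)
    z = [math.tanh(a) / (1.0 + math.exp(-b)) for a, b in zip(f, g)]
    skip = [w_out * v for v in z]                   # 1x1 conv is a scalar here
    residual = [xi + s for xi, s in zip(x, skip)]   # block input added back
    return residual, skip

x = [0.1, -0.2, 0.3, 0.0, 0.5, -0.4, 0.2, 0.1]
skip_sum = [0.0] * len(x)
for d in [1, 2, 4]:
    x, skip = residual_block(x, [0.3, 0.7], [0.2, -0.5], 0.9, d)
    skip_sum = [a + b for a, b in zip(skip_sum, skip)]  # skips are accumulated
```

The residual output feeds the next block, while the accumulated skip outputs would feed the output layers.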
2.5. Conditional WaveNets
Given an additional input h, WaveNets can model the conditional distribution p(x|h) of the audio given this input. Eq. (1) now becomes
\[p(\textbf{x}|\textbf{h})=\prod_{t=1}^T p(x_t|x_1, ..., x_{t-1}, \textbf{h})\]
By conditioning the model on other input variables, we can guide WaveNet’s generation to produce audio with the required characteristics. For example, in a multi-speaker setting we can choose the speaker by feeding the speaker identity to the model as an extra input. Similarly, for TTS we need to feed information about the text as an extra input.
The authors condition the model on other inputs in two different ways: global conditioning and local conditioning.
Global conditioning:
Global conditioning is characterised by a single latent representation h that influences the output distribution across all timesteps, e.g. a speaker embedding in a TTS model. The activation function from Eq. (2) now becomes:
\[\textbf{z}=\tanh(W_{f,k}*\textbf{x}+V_{f,k}^T\textbf{h}) \odot \sigma(W_{g,k}*\textbf{x}+V_{g,k}^T\textbf{h})\]
where V_{∗,k} is a learnable linear projection, and the vector V^T_{∗,k}h is broadcast over the time dimension.
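The broadcast over time can be sketched as follows, assuming a single output channel so that V^T h reduces to one scalar per branch. The pre-activations and projection weights are made-up values.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def global_gate(filter_pre, gate_pre, h, v_f, v_g):
    """Gated unit with a time-invariant bias computed from h."""
    bias_f = sum(v * hi for v, hi in zip(v_f, h))  # V_f^T h, one scalar here
    bias_g = sum(v * hi for v, hi in zip(v_g, h))  # V_g^T h
    # The same two biases are added at every timestep (broadcast over time).
    return [math.tanh(f + bias_f) * sigmoid(g + bias_g)
            for f, g in zip(filter_pre, gate_pre)]

speaker = [0.0, 1.0, 0.0]  # one-hot ID selecting speaker 1 of 3
z = global_gate([0.5, -0.5], [0.0, 1.0], speaker,
                v_f=[0.3, -0.2, 0.1], v_g=[0.0, 0.4, -0.4])
```

With a one-hot h, the projection simply selects one learned bias per speaker.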
Local conditioning:
For local conditioning we have a second time series h_t, possibly with a lower sampling frequency than the audio signal, e.g. linguistic features in a TTS model. We first transform this time series using a transposed convolutional network (learned upsampling) that maps it to a new time series y=f(h) with the same resolution as the audio signal, which is then used in the activation unit as follows:
\[\textbf{z}=\tanh(W_{f,k}*\textbf{x}+V_{f,k} * \textbf{y}) \odot \sigma(W_{g,k}*\textbf{x}+V_{g,k} * \textbf{y})\]
where V_{f, k} ∗ y is now a 1×1 convolution. As an alternative to the transposed convolutional network, it is also possible to use V_{f, k} ∗ h and repeat these values across time. This worked slightly worse in the experiments.
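The sketch below illustrates the simpler repetition alternative rather than the learned transposed-convolution upsampling. It assumes a single channel, so each 1×1 convolution is a scalar multiplication; in that case, repeating before or after the projection gives the same result. The upsampling factor and all values are illustrative.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def repeat_upsample(h, factor):
    """Hold each conditioning frame for `factor` audio samples."""
    return [value for value in h for _ in range(factor)]

def local_gate(filter_pre, gate_pre, y, v_f, v_g):
    """Gated unit with a per-timestep bias from the audio-rate series y."""
    return [math.tanh(f + v_f * yt) * sigmoid(g + v_g * yt)
            for f, g, yt in zip(filter_pre, gate_pre, y)]

h = [0.2, -1.0]                   # two low-rate conditioning frames
y = repeat_upsample(h, factor=3)  # six audio-rate values
z = local_gate([0.1] * 6, [0.0] * 6, y, v_f=0.5, v_g=-0.5)
```

Unlike global conditioning, the bias now changes over time, following the conditioning series.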
2.6. Context Stacks
We have already mentioned several different ways to increase the receptive field size of a WaveNet:
- increasing the number of dilation stages
- using more layers
- larger filters
- greater dilation factors
- a combination thereof.
A complementary approach is to use a separate, smaller context stack that processes a long part of the audio signal and locally conditions a larger WaveNet that processes only a smaller part of the audio signal (cropped at the end). One can use multiple context stacks with varying lengths and numbers of hidden units. Stacks with larger receptive fields have fewer units per layer. Context stacks can also have pooling layers to run at a lower frequency. This keeps the computational requirements at a reasonable level and is consistent with the intuition that less capacity is required to model temporal correlations at longer timescales.
3. Experiments
The authors evaluate WaveNet on three different tasks:
- multi-speaker speech generation (not conditioned on text)
- TTS
- music audio modelling.
3.1. Multi-Speaker Speech Generation
The authors used the English multi-speaker corpus from CSTR voice cloning toolkit (VCTK) (Yamagishi, 2012) and conditioned WaveNet only on the speaker. The conditioning was applied by feeding the speaker ID to the model in the form of a one-hot vector. The dataset consisted of 44 hours of data from 109 different speakers.
Because the model is not conditioned on text, it generates non-existent but human language-like words in a smooth way with realistic sounding intonations. This is similar to generative models of language or images, where samples look realistic at first glance, but are clearly unnatural upon closer inspection. The lack of long range coherence is partly due to the limited size of the model’s receptive field (about 300 milliseconds), which means it can only remember the last 2–3 phonemes it produced.
A single WaveNet was able to model speech from any of the speakers by conditioning it on a one-hot encoding of a speaker. This confirms that it is powerful enough to capture the characteristics of all 109 speakers from the dataset in a single model. The authors observed that adding speakers resulted in better validation set performance compared to training solely on a single speaker. This suggests that WaveNet’s internal representation was shared among multiple speakers.
Finally, the authors observed that the model also picked up on other characteristics in the audio apart from the voice itself. For instance, it also mimicked the acoustics and recording quality, as well as the breathing and mouth movements of the speakers.
3.2. Text-To-Speech
Datasets:
- The North American English dataset contains 24.6 hours of speech data.
- The Mandarin Chinese dataset contains 34.8 hours of speech data.
Both datasets were spoken by professional female speakers.
WaveNets for the TTS task were locally conditioned on linguistic features which were derived from input texts. The authors also trained WaveNets conditioned on the logarithmic fundamental frequency (logF0) values in addition to the linguistic features. External models predicting logF0 values and phone durations from linguistic features were also trained for each language.
Receptive field size and baselines:
The receptive field size of the WaveNets was 240 milliseconds. As example-based and model-based speech synthesis baselines, hidden Markov model (HMM)-driven unit selection concatenative (Gonzalvo et al., 2016) and long short-term memory recurrent neural network (LSTM-RNN)-based statistical parametric (Zen et al., 2016) speech synthesizers were built. Since the same datasets and linguistic features were used to train both the baselines and WaveNets, these speech synthesizers could be fairly compared.
Evaluation:
Subjective paired comparison tests and mean opinion score (MOS) tests were conducted. In the paired comparison tests, after listening to each pair of samples, the subjects were asked to choose which they preferred, though they could choose “neutral” if they did not have any preference. In the MOS tests, after listening to each stimulus, the subjects were asked to rate the naturalness of the stimulus on a five-point Likert scale (1: Bad, 2: Poor, 3: Fair, 4: Good, 5: Excellent). Please refer to Appendix B for details.
Subjective paired comparison tests:
Fig. 5 shows a selection of the subjective paired comparison test results (see Appendix B for the complete table). It can be seen from the results that WaveNet outperformed the baseline statistical parametric and concatenative speech synthesizers in both languages.
The authors found that WaveNet conditioned on linguistic features only could synthesize speech samples with natural segmental quality but sometimes it had unnatural prosody by stressing wrong words in a sentence. This could be due to the long-term dependency of F0 contours: the size of the receptive field of the WaveNet, 240 milliseconds, was not long enough to capture such long-term dependency. WaveNet conditioned on both linguistic features and F0 values did not have this problem: the external F0 prediction model runs at a lower frequency (200 Hz) so it can learn long-range dependencies that exist in F0 contours.
Mean opinion score (MOS) tests:
Table 1 shows the MOS test results. It can be seen from the table that WaveNets achieved 5-scale MOSs in naturalness above 4.0, which were significantly better than those from the baseline systems. They were the highest ever reported MOS values with these training datasets and test sentences. The gap in the MOSs from the best synthetic speech to the natural ones decreased from 0.69 to 0.34 (51%) in US English and 0.42 to 0.13 (69%) in Mandarin Chinese.
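The quoted percentages are relative reductions of the gap, which can be reproduced from the four gap values given above. The helper name is illustrative.

```python
def relative_reduction(old_gap, new_gap):
    """Percentage by which the gap to natural speech shrank."""
    return 100.0 * (old_gap - new_gap) / old_gap

english = relative_reduction(0.69, 0.34)   # about 50.7
mandarin = relative_reduction(0.42, 0.13)  # about 69.0
```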
3.3. Music
Datasets:
- the MagnaTagATune dataset (Law & Von Ahn, 2009), which consists of about 200 hours of music audio. Each 29-second clip is annotated with tags from a set of 188, which describe the genre, instrumentation, tempo, volume and mood of the music.
- the YouTube piano dataset, which consists of about 60 hours of solo piano music obtained from YouTube videos. Because it is constrained to a single instrument, it is considerably easier to model.
The authors found that enlarging the receptive field was crucial to obtain samples that sounded musical. Even with a receptive field of several seconds, the models did not enforce long-range consistency which resulted in second-to-second variations in genre, instrumentation, volume and sound quality. Nevertheless, the samples were often harmonic and aesthetically pleasing, even when produced by unconditional models.
Of particular interest are conditional music models, which can generate music given a set of tags specifying e.g. genre or instruments. Similarly to conditional speech models, the authors insert biases that depend on a binary vector representation of the tags associated with each training clip. This makes it possible to control various aspects of the output of the model when sampling, by feeding in a binary vector that encodes the desired properties of the samples. Such models are trained on the MagnaTagATune dataset; although the tag data bundled with the dataset was relatively noisy and had many omissions, after cleaning it up by merging similar tags and removing those with too few associated clips, this works reasonably well.
3.4. Speech Recognition
Dataset: TIMIT (Garofolo et al., 1993)
WaveNets show that layers of dilated convolutions allow the receptive field to grow longer in a much cheaper way than using LSTM units.
For this task, a mean-pooling layer was added after the dilated convolutions to aggregate the activations into coarser frames spanning 10 milliseconds (160× downsampling). The pooling layer was followed by a few non-causal convolutions. The authors trained WaveNet with two loss terms, one to predict the next sample and one to classify the frame. The model generalized better than with a single loss and achieved 18.8 PER on the test set, which is, to the authors' knowledge, the best score obtained from a model trained directly on raw audio on TIMIT.
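The pooling step can be sketched in a few lines. A frame of 160 samples spanning 10 milliseconds implies a 16 kHz sampling rate. Dropping a trailing partial frame is an assumption made here for simplicity, and the input values are synthetic.

```python
def mean_pool(activations, frame_size=160):
    """Average consecutive, non-overlapping windows of `frame_size` values."""
    usable = len(activations) - len(activations) % frame_size
    return [sum(activations[i:i + frame_size]) / frame_size
            for i in range(0, usable, frame_size)]

SAMPLE_RATE = 16000  # implied by 160 samples per 10 ms
one_second = [float(t % 160) for t in range(SAMPLE_RATE)]
frames = mean_pool(one_second)  # one value per 10 ms frame
```

One second of audio-rate activations is reduced to 100 frame-rate values, which the later non-causal layers can classify frame by frame.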
4. Conclusion
The authors introduced WaveNets, which are autoregressive and combine causal filters with dilated convolutions to allow their receptive fields to grow exponentially with depth, which is important to model the long-range temporal dependencies in audio signals.