<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.9.0">Jekyll</generator><link href="https://houwx.net/feed.xml" rel="self" type="application/atom+xml" /><link href="https://houwx.net/" rel="alternate" type="text/html" /><updated>2023-04-23T19:49:18-07:00</updated><id>https://houwx.net/feed.xml</id><title type="html">Wenxin Hou</title><subtitle>personal description</subtitle><author><name>Wenxin Hou 侯汶昕</name><email>houwx001@gmail.com</email></author><entry><title type="html">[Paper-ASR][中文笔记] A Comparative Study on Neural Architectures and Training Methods for Japanese Speech Recognition</title><link href="https://houwx.net/posts/2020/01/blog-post-31/" rel="alternate" type="text/html" title="[Paper-ASR][中文笔记] A Comparative Study on Neural Architectures and Training Methods for Japanese Speech Recognition" /><published>2021-07-04T00:00:00-07:00</published><updated>2021-07-04T00:00:00-07:00</updated><id>https://houwx.net/posts/2020/01/blog-post-31-japanese-e2e-asr</id><content type="html" xml:base="https://houwx.net/posts/2020/01/blog-post-31/">&lt;p&gt;Last Updated: 2021-07-04&lt;/p&gt;

&lt;p&gt;This is an original article. Please contact me before reposting. Thank you!&lt;/p&gt;

&lt;p&gt;In this post we introduce a paper by a Google team at INTERSPEECH 2021 on Japanese end-to-end (E2E) speech recognition. The paper surveys a range of recent E2E modeling choices, such as LSTM / Conformer encoders and CTC / Transducer / attention-based decoders, and also evaluates several recent training techniques, including SpecAugment, Variational Noise Injection (VNI), and Exponential Moving Average (EMA). The best model and training configuration achieves state-of-the-art character error rates (CER) of 4.1%, 3.2%, and 3.5% on CSJ eval1, eval2, and eval3, respectively.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Paper link: &lt;a href=&quot;https://arxiv.org/abs/2106.05111&quot;&gt;https://arxiv.org/abs/2106.05111&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/posts/2021-07-04-blog-post-31-7.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;背景介绍&quot;&gt;Background&lt;/h2&gt;

&lt;p&gt;Because Japanese, unlike English, has no explicit word delimiters (spaces), conventional HMM systems and pronunciation lexicons need a word segmenter to split Japanese text into words. E2E models, in contrast, can model characters directly, which greatly simplifies the ASR pipeline.&lt;/p&gt;

&lt;p&gt;On the encoder side, BLSTMs with convolution were the first to surpass HMM systems on Japanese speech recognition; Transformer models then outperformed BLSTMs, and Conformer models further reduced CER on Japanese and other languages.&lt;/p&gt;

&lt;p&gt;On the decoder side, CTC and attention have been used extensively in previous work, but the Transducer has not been widely applied to Japanese ASR. Like CTC, the Transducer is well suited to streaming applications, and like attention it can also learn dependencies between outputs.&lt;/p&gt;

&lt;p&gt;Training techniques are also crucial for E2E models: SpecAugment, VNI, and EMA have all been shown to be effective for ASR training, but no prior work has compared them thoroughly.&lt;/p&gt;

&lt;p&gt;The contribution of this paper is a thorough comparison of the architectures and training techniques above, as well as their combinations, evaluated on the CSJ corpus along several dimensions (CER, training throughput, convergence, and inference RTF).&lt;/p&gt;

&lt;h2 id=&quot;神经网络架构&quot;&gt;Neural Network Architectures&lt;/h2&gt;

&lt;p&gt;The model consists of an encoder and a decoder. The input is log-mel filterbank features and the output is a posterior distribution; the training objective is to maximize the posterior probability of the correct target sequence.&lt;/p&gt;

&lt;h3 id=&quot;blstm-encoder&quot;&gt;BLSTM Encoder&lt;/h3&gt;

&lt;p&gt;The input features are first downsampled by stacked convolutions and then processed frame by frame, recurrently, by stacked BLSTM layers.&lt;/p&gt;

&lt;p&gt;The drawback is that BLSTMs are hard to parallelize and cannot fully exploit GPU / TPU acceleration.&lt;/p&gt;

&lt;h3 id=&quot;conformer-encoder&quot;&gt;Conformer Encoder&lt;/h3&gt;

&lt;p&gt;The Conformer encoder differs from the BLSTM encoder only in that Conformer blocks replace the BLSTM layers. Figure 1 shows the structure of a Conformer block. The Conformer models global context with multi-head self-attention based on relative positional encoding; for an input sequence of length T, the computation and memory cost of multi-head self-attention are both O(T^2).&lt;/p&gt;

&lt;p&gt;Figure 2 expands the Convolution Module from Figure 1, which is used to model local features.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/posts/2021-07-04-blog-post-31-1.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Conformer training can be parallelized, so it trains much faster than BLSTM; the downside is the attention complexity mentioned above, which can become a bottleneck for very long input sequences. In the experiments this is measured by training throughput.&lt;/p&gt;

&lt;h3 id=&quot;ctc-decoder&quot;&gt;CTC Decoder&lt;/h3&gt;
&lt;p&gt;The CTC decoder is just a linear layer followed by a softmax. CTC predicts the distribution over output sequences by introducing a variable Z that represents the alignment between the encoder features X and the output sequence Y. The relation between Y and Z is given by a &lt;strong&gt;many-to-one&lt;/strong&gt; mapping B:&lt;/p&gt;

\[Y = B(Z)\]

&lt;p&gt;B(Z) simply removes the blanks and repeated symbols in Z; for example, B(aa&amp;lt;blank&amp;gt;b) = ab. Furthermore, the probability of each symbol $Z_t$ in Z is the $Z_t$-th element of the distribution that the CTC decoder produces from the corresponding frame of X, i.e., $C_t=\text{CTC}(X_t)$ and $P(Z_t) = C_{t, Z_t}$. This also shows that CTC makes a conditional independence assumption: the prediction of $Z_t$ does not depend on its context $Z_{t-1}$ and $Z_{t+1}$.&lt;/p&gt;

&lt;p&gt;How is CTC trained, then? We invert B to obtain a &lt;strong&gt;one-to-many&lt;/strong&gt; mapping $B^{-1}$, for example $B^{-1}(ab) = \{\text{aa&amp;lt;blank&amp;gt;b, aaab, abbb, etc.}\}$, and then maximize the total posterior probability of all alignments that map to Y:&lt;/p&gt;

\[p_{ctc}(Z|X)=\prod_{t=1}^T C_{t, Z_t}\]

\[L_{ctc} = - \log \sum_{Z \in B^{-1}(Y)} p_{ctc}(Z | X)\]
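
&lt;p&gt;To make the mapping B and the CTC objective concrete, here is a minimal PyTorch sketch (our own illustration, not code from the paper; the helper name collapse and the toy shapes are assumptions). It relies on torch.nn.functional.ctc_loss, which performs the sum over all alignments internally:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import torch
import torch.nn.functional as F

def collapse(z, blank=0):
    &quot;&quot;&quot;B(Z): merge repeated symbols and drop blanks.&quot;&quot;&quot;
    out, prev = [], None
    for s in z:
        if s != prev and s != blank:
            out.append(s)
        prev = s
    return out

# toy check: B(a a &amp;lt;blank&amp;gt; b) = a b, with a=1, b=2, blank=0
assert collapse([1, 1, 0, 2]) == [1, 2]

# CTC loss over encoder outputs: log_probs has shape (T, batch, vocab)
T, N, V, U = 50, 2, 30, 10
log_probs = torch.randn(T, N, V).log_softmax(-1)
targets = torch.randint(1, V, (N, U))      # index 0 is reserved for blank
input_lengths = torch.full((N,), T)
target_lengths = torch.full((N,), U)
loss = F.ctc_loss(log_probs, targets, input_lengths, target_lengths, blank=0)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;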

&lt;h3 id=&quot;transducer-decoder&quot;&gt;Transducer Decoder&lt;/h3&gt;

&lt;p&gt;The Transducer is similar in principle to CTC: both are trained by maximizing the probability of the possible alignments. The difference is that the Transducer drops the conditional independence assumption of CTC and models the conditional distribution recurrently.&lt;/p&gt;

\[L_{\text{transducer}} = - \log \sum_{Z \in B^{-1}(Y)} \prod_t p_\text{transducer} (Z_t | X, B(Z_{1:t-1}))\]

&lt;p&gt;The Transducer decoder typically uses a recurrent neural network (RNN) for this recursive encoding.&lt;/p&gt;
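
&lt;p&gt;As an illustration only (not the implementation used in the paper), the sketch below shows a typical prediction + joint network and computes the alignment-summing loss with torchaudio.functional.rnnt_loss. The layer sizes are placeholders, and we assume a torchaudio version (0.10 or later) that provides this function:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import torch
import torch.nn as nn
import torchaudio

class TransducerDecoder(nn.Module):
    &quot;&quot;&quot;Prediction + joint network: unlike CTC, every output step is
    conditioned on the previously emitted labels via the prediction LSTM.&quot;&quot;&quot;
    def __init__(self, vocab, enc_dim=256, pred_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, pred_dim)
        self.pred = nn.LSTM(pred_dim, pred_dim, batch_first=True)
        self.joint = nn.Sequential(
            nn.Linear(enc_dim + pred_dim, 256), nn.Tanh(), nn.Linear(256, vocab))

    def forward(self, enc, prev_labels):
        # enc: (N, T, enc_dim); prev_labels: (N, U + 1) with a leading blank
        p, _ = self.pred(self.embed(prev_labels))             # (N, U+1, pred_dim)
        e = enc.unsqueeze(2).expand(-1, -1, p.size(1), -1)    # (N, T, U+1, enc_dim)
        p = p.unsqueeze(1).expand(-1, enc.size(1), -1, -1)    # (N, T, U+1, pred_dim)
        return self.joint(torch.cat([e, p], dim=-1))          # (N, T, U+1, vocab)

N, T, U, V = 2, 50, 10, 30
decoder = TransducerDecoder(V)
enc = torch.randn(N, T, 256)
targets = torch.randint(1, V, (N, U), dtype=torch.int32)
prev = torch.cat([torch.zeros(N, 1, dtype=torch.long), targets.long()], dim=1)
logits = decoder(enc, prev)
loss = torchaudio.functional.rnnt_loss(
    logits, targets,
    torch.full((N,), T, dtype=torch.int32),
    torch.full((N,), U, dtype=torch.int32),
    blank=0)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;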

&lt;h3 id=&quot;attention-decoder&quot;&gt;Attention Decoder&lt;/h3&gt;

&lt;p&gt;An attention decoder generally consists of two modules: an attention module Attend() and an output module Spell(). Unlike CTC and the Transducer, it does not require an explicit alignment between each speech frame and the output symbols.&lt;/p&gt;

\[L_\text{att} = - \log \prod_t p_\text{att} (Y_t | X, Y_{1:t-1})\]

&lt;p&gt;Attention decoders are widely used in sequence-to-sequence (Seq2Seq) generation tasks in NLP, so we do not describe them in detail here.&lt;/p&gt;
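
&lt;p&gt;For completeness, the attention loss above is simply the teacher-forced cross-entropy over output characters; a minimal sketch with illustrative shapes (not code from the paper):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import torch
import torch.nn.functional as F

# dec_logits: (N, U, vocab) produced by Attend() + Spell() under teacher forcing,
# i.e. step t attends over the encoder output X and consumes the gold prefix Y_{1:t-1}
N, U, V = 2, 10, 30
dec_logits = torch.randn(N, U, V)
targets = torch.randint(0, V, (N, U))
loss_att = F.cross_entropy(dec_logits.reshape(-1, V), targets.reshape(-1))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;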

&lt;h2 id=&quot;训练技巧&quot;&gt;Training Techniques&lt;/h2&gt;

&lt;h3 id=&quot;specaugment&quot;&gt;SpecAugment&lt;/h3&gt;

&lt;p&gt;SpecAugment is a data augmentation method designed for ASR. It mainly consists of time masking and frequency masking (time warping is not used in this paper). For tricks like this, the code explains itself best:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import random

import numpy as np
from PIL import Image
from PIL.Image import BICUBIC

&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;spec_augmentation&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                      &lt;span class=&quot;n&quot;&gt;warp_for_time&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;False&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                      &lt;span class=&quot;n&quot;&gt;num_t_mask&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                      &lt;span class=&quot;n&quot;&gt;num_f_mask&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                      &lt;span class=&quot;n&quot;&gt;max_t&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;50&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                      &lt;span class=&quot;n&quot;&gt;max_f&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                      &lt;span class=&quot;n&quot;&gt;max_w&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;80&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&quot;&quot;&quot; Deep copy x and do spec augmentation then return it
    Args:
        x: input feature, T * F 2D
        num_t_mask: number of time mask to apply
        num_f_mask: number of freq mask to apply
        max_t: max width of time mask
        max_f: max width of freq mask
        max_w: max width of time warp
    Returns:
        augmented feature
    &quot;&quot;&quot;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;copy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;max_frames&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shape&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;max_freq&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shape&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;# time warp
&lt;/span&gt;    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;warp_for_time&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;and&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max_frames&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max_w&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;center&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;random&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;randrange&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;max_w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max_frames&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max_w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;warped&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;random&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;randrange&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;center&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max_w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;center&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max_w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;

        &lt;span class=&quot;n&quot;&gt;left&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Image&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fromarray&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;center&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;resize&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;max_freq&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;warped&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;BICUBIC&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;right&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Image&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fromarray&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;center&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:]).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;resize&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;max_freq&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max_frames&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;warped&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;BICUBIC&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;concatenate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;left&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;right&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;# time mask
&lt;/span&gt;    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;num_t_mask&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;start&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;random&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;randint&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max_frames&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;length&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;random&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;randint&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max_t&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;end&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;min&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;max_frames&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;start&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;length&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;start&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;end&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;:]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;# freq mask
&lt;/span&gt;    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;num_f_mask&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;start&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;random&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;randint&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max_freq&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;length&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;random&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;randint&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max_f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;end&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;min&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;max_freq&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;start&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;length&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[:,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;start&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;end&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;exponential-moving-average-ema&quot;&gt;Exponential Moving Average (EMA)&lt;/h3&gt;

&lt;p&gt;EMA improves training stability and generalization by maintaining a shadow copy of the model weights. The underlying assumption is that during the last n steps the weights oscillate around the actual optimum, so averaging over those steps gives a more robust model. After each training step $k$, the EMA parameters $\theta_k&apos;$ are computed as:&lt;/p&gt;

\[\theta_k&apos; = \gamma \theta_{k-1}&apos; + (1 - \gamma) \theta_k\]

&lt;p&gt;where $\gamma$ is the decay rate. The code is as follows:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;ExponentialMovingAverage&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;object&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;__init__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;decay&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;0.9999&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;decay&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;decay&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shadow&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{}&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;backup&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{}&lt;/span&gt;
    
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;load&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ema_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;param&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ema_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;named_parameters&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;():&lt;/span&gt;
            &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shadow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;param&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;clone&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;register&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;param&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;named_parameters&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;():&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;param&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;requires_grad&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
                &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shadow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;param&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;clone&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;update&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;param&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;named_parameters&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;():&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;param&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;requires_grad&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
                &lt;span class=&quot;k&quot;&gt;assert&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shadow&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;new_average&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;1.0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;decay&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;param&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;decay&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shadow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
                &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shadow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;new_average&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;clone&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
	
    &lt;span class=&quot;c1&quot;&gt;# the methods below swap the EMA parameters in and out when evaluating the model
&lt;/span&gt;    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;apply_shadow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;param&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;named_parameters&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;():&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;param&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;requires_grad&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
                &lt;span class=&quot;k&quot;&gt;assert&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shadow&lt;/span&gt;
                &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;backup&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;param&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;param&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shadow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;restore&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;param&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;named_parameters&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;():&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;param&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;requires_grad&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
                &lt;span class=&quot;k&quot;&gt;assert&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;backup&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;param&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;backup&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;backup&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;variational-noise&quot;&gt;Variational Noise&lt;/h3&gt;

&lt;p&gt;Variational noise also improves the generalization of neural networks. Concretely, a noise sample drawn from a normal distribution is added to the original parameters, and the resulting noisy parameters are used for training. (In this paper it is applied only to the embedding and LSTM layers.)&lt;/p&gt;

\[\theta&apos; = \theta + n, n \sim \text{Normal}(0, \sigma^2)\]

&lt;p&gt;The method is simple and flexible; a minimal sketch is given below.&lt;/p&gt;
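
&lt;p&gt;A minimal sketch in the spirit of the formula above (our own code, not from the paper; the standard deviation and the name filter are placeholders). Fresh noise is sampled before each training step, and the clean weights are restored before the optimizer update:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import torch

class VariationalNoise(object):
    &quot;&quot;&quot;Weight noise: theta_noisy = theta + n, n ~ Normal(0, sigma^2),
    applied only to parameters whose names match the filter.&quot;&quot;&quot;
    def __init__(self, model, std=0.075, name_filter=(&quot;embed&quot;, &quot;lstm&quot;)):
        self.model = model
        self.std = std
        self.name_filter = name_filter
        self.backup = {}

    def apply(self):
        for name, param in self.model.named_parameters():
            if param.requires_grad and any(k in name for k in self.name_filter):
                self.backup[name] = param.data.clone()
                param.data.add_(torch.randn_like(param.data) * self.std)

    def restore(self):
        for name, param in self.model.named_parameters():
            if name in self.backup:
                param.data = self.backup[name]
        self.backup = {}

# per training step:
#   noise.apply(); loss = model(batch); loss.backward(); noise.restore(); optimizer.step()
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;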

&lt;h2 id=&quot;实验结果&quot;&gt;Experimental Results&lt;/h2&gt;

&lt;h3 id=&quot;实验设置&quot;&gt;Experimental Setup&lt;/h3&gt;

&lt;p&gt;A brief description of the CSJ corpus: 581 hours of speech, with 3259 characters plus 3 special symbols (SOS, EOS, UNK) as modeling units. The features are 80-dimensional log mel-filterbanks, with global CMVN (zero mean and unit variance per channel) computed on the training set.&lt;/p&gt;

&lt;p&gt;We will not go through all the hyperparameters; as you would expect from Google, the models are not small. The settings for SpecAugment and the other techniques can be found in the original paper.&lt;/p&gt;

&lt;h3 id=&quot;模型架构-对比&quot;&gt;Comparison of Model Architectures&lt;/h3&gt;

&lt;p&gt;First, Table 1 shows that the Conformer encoder is clearly better than BLSTM in both CER and training speed. On the decoder side, attention and Transducer perform similarly overall and CTC is slightly worse; in terms of training speed, CTC is faster than attention, which is faster than Transducer.&lt;/p&gt;

&lt;p&gt;The choice of encoder affects CER much more than the choice of decoder.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/posts/2021-07-04-blog-post-31-2.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;In terms of convergence, the Conformer again converges faster and to a better point than BLSTM. The attention decoder converges faster early in training, but this effect is weaker than on the encoder side, and the gap narrows, and is sometimes even reversed, later in training.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/posts/2021-07-04-blog-post-31-3.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;h3 id=&quot;训练技巧-对比&quot;&gt;Comparison of Training Techniques&lt;/h3&gt;

&lt;p&gt;In terms of CER, SpecAugment brings a larger gain than EMA, which in turn brings a larger gain than VNI; the three are complementary, and combining them gives the best training result.&lt;/p&gt;

&lt;p&gt;None of these techniques has a noticeable impact on training speed, so they can be used without worry.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/posts/2021-07-04-blog-post-31-4.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The convergence curves show that training without EMA (red) becomes clearly less stable, and using EMA helps speed up convergence.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/posts/2021-07-04-blog-post-31-5.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;h3 id=&quot;计算复杂度-对比&quot;&gt;Comparison of Computational Cost&lt;/h3&gt;

&lt;p&gt;The authors set the batch size to 1, decode on a CPU, and measure the Real-Time Factor (RTF). RTF measures decoding speed: the time needed to decode an utterance divided by the duration of that utterance, so lower is better.&lt;/p&gt;
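
&lt;p&gt;In code form, RTF is simply the following (a trivial sketch, with decode_fn standing in for any decoder):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import time

def real_time_factor(decode_fn, waveform, audio_seconds):
    &quot;&quot;&quot;RTF = wall-clock decoding time / audio duration; lower is better.&quot;&quot;&quot;
    start = time.perf_counter()
    decode_fn(waveform)
    return (time.perf_counter() - start) / audio_seconds
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;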

&lt;p&gt;Again, the Conformer encoder outperforms BLSTM. Among the decoders, the Transducer is the fastest, followed by CTC, with the attention decoder the slowest. Note that attention and Transducer use beam search with a beam width of 8 while CTC uses greedy search. The search algorithms are all implemented in C++, so every decoder listed in the table is in fact quite fast.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/posts/2021-07-04-blog-post-31-6.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;总结&quot;&gt;Summary&lt;/h2&gt;

&lt;p&gt;This paper explores a range of model architectures and training techniques for Japanese end-to-end speech recognition; it is an excellent piece of engineering work.&lt;/p&gt;</content><author><name>Wenxin Hou 侯汶昕</name><email>houwx001@gmail.com</email></author><category term="ASR" /><category term="Japanese" /><summary type="html">Last Updated: 2021-07-04</summary></entry><entry><title type="html">[Paper-ASR][中文笔记] 用于语音识别的无监督字符级分布适配 (CMatch)</title><link href="https://houwx.net/posts/2020/01/blog-post-30/" rel="alternate" type="text/html" title="[Paper-ASR][中文笔记] 用于语音识别的无监督字符级分布适配 (CMatch)" /><published>2021-06-07T00:00:00-07:00</published><updated>2021-06-07T00:00:00-07:00</updated><id>https://houwx.net/posts/2020/01/blog-post-30-CMatch</id><content type="html" xml:base="https://houwx.net/posts/2020/01/blog-post-30/">&lt;p&gt;Last Updated: 2021-07-04&lt;/p&gt;

&lt;p&gt;This paper is joint work I (Wenxin Hou) did with Jindong Wang during my internship at MSRA. This post is reproduced from the Zhihu column of Jindong Wang: &lt;a href=&quot;https://zhuanlan.zhihu.com/p/370691801&quot;&gt;https://zhuanlan.zhihu.com/p/370691801&lt;/a&gt;. Please contact Jindong Wang or me before reposting.&lt;/p&gt;

&lt;p&gt;In this post we introduce &lt;strong&gt;an unsupervised character-level domain adaptation method for speech recognition: CMatch&lt;/strong&gt;. In this work we propose an unsupervised character-level distribution matching method for ASR, CMatch, which performs fine-grained adaptation between each character across two different domains. Experiments on the Libri-Adapt dataset show that CMatch achieves relative word error rate (WER) reductions of 14.39% and 16.50% for cross-device and cross-environment adaptation, respectively. We also provide a comprehensive analysis of frame-level label assignment strategies and Transformer-based domain adaptation. The paper has been submitted to INTERSPEECH 2021.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Author: Wenxin Hou (graduate student at Tokyo Institute of Technology)&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Paper link: &lt;a href=&quot;https://arxiv.org/abs/2104.07491&quot;&gt;https://arxiv.org/abs/2104.07491&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/files/papers/uda/CMatch_Figures/title.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;背景介绍&quot;&gt;Background&lt;/h2&gt;

&lt;p&gt;It is well known that end-to-end automatic speech recognition (ASR) based on deep learning can achieve strong performance given large-scale training data and powerful models. However, the training and test data may follow similar but mismatched distributions due to differences in recording devices or environments, and such distribution (or domain) mismatch usually degrades recognition accuracy at test time. These mismatches are so common and so varied that it is impractical to collect and label large amounts of speech for every domain. In such cases we often need unsupervised domain adaptation to improve performance on the target domain.&lt;/p&gt;

&lt;p&gt;Existing unsupervised domain adaptation methods usually treat each domain as a single distribution and then adapt, for example by domain-adversarial training or feature matching. These methods may therefore ignore finer-grained distributional knowledge within the domains, such as characters, phonemes, or words, which can hurt adaptation to some extent. This was also verified in [1]: compared with conventional methods that align entire domains, aligning subdomains of images (i.e., domains partitioned by class label) usually achieves better adaptation performance.&lt;/p&gt;

&lt;p&gt;Taking the figure below as an example, after running the CMatch algorithm, identical characters from the two domains are pulled closer together in the feature distribution.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/files/papers/uda/CMatch_Figures/Picture1.jpg&quot; alt=&quot;Illustration of CMatch: identical characters from the two domains are pulled closer together in the feature distribution&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;方法介绍&quot;&gt;Method&lt;/h2&gt;

&lt;p&gt;CMatch consists of two steps:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;Frame-level label assignment&lt;/li&gt;
  &lt;li&gt;Character-level distribution matching&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;帧级标签分配&quot;&gt;Frame-Level Label Assignment&lt;/h3&gt;

&lt;p&gt;Frame-level label assignment first requires reasonably accurate label alignments. We consider the three strategies shown in the figure below: CTC forced alignment, dynamic frame averaging, and pseudo CTC labels. CTC forced alignment uses a pre-trained CTC module to compute the most likely CTC path for each transcript (with repeated and blank symbols inserted) and assigns it to the speech frames; it is relatively accurate but computationally expensive. Dynamic frame averaging distributes the frames evenly over the characters, which assumes uniform speaking rates in the source and target domains. The pseudo-CTC-label method reuses the CTC module that is already well trained on the source domain together with confidence-based filtering (e.g., t, e, p in the figure), balancing efficiency and accuracy.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/files/papers/uda/CMatch_Figures/Picture4.jpg&quot; alt=&quot;Three frame-level label assignment strategies&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Note that on the source domain we use the ground-truth transcripts for label assignment, whereas the target domain has no transcripts, so we first pseudo-label the target speech with the source model and then use the predicted transcripts for label assignment.&lt;/p&gt;
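
&lt;p&gt;As an illustration of the pseudo-CTC-label strategy described above (a sketch under our own naming, not the released implementation), one can take the frame-wise argmax of the CTC posterior and keep only confidently non-blank frames:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import torch

def pseudo_ctc_labels(ctc_log_probs, blank=0, threshold=0.9):
    &quot;&quot;&quot;Assign a character label to each encoder frame from CTC posteriors.
    ctc_log_probs: (T, vocab) log-probabilities from the CTC branch.
    Returns (frame_indices, labels) for frames that are confidently non-blank.&quot;&quot;&quot;
    probs = ctc_log_probs.exp()
    conf, labels = probs.max(dim=-1)
    keep = (labels != blank) &amp;amp; (conf &amp;gt; threshold)
    frames = torch.nonzero(keep, as_tuple=False).squeeze(-1)
    return frames, labels[keep]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;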

&lt;h3 id=&quot;字符级别的分布匹配&quot;&gt;Character-Level Distribution Matching&lt;/h3&gt;
&lt;p&gt;Once the frame-level labels are obtained, we perform character-level distribution matching. In this paper we adopt the Maximum Mean Discrepancy (MMD) for feature matching. MMD measures the discrepancy between two distributions and is a common distribution metric in transfer learning. Its formula is:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/files/papers/uda/CMatch_Figures/Picture7.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;In practice, given source and target samples $X_S$ and $X_T$, we compute the biased empirical estimate of MMD:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/files/papers/uda/CMatch_Figures/Picture8.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Averaging the MMD over all characters gives the character-level distribution matching loss:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/files/papers/uda/CMatch_Figures/Picture9.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
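
&lt;p&gt;As a sketch of how the character-level matching term can be estimated in PyTorch (our own illustration; the Gaussian kernel and its bandwidth are placeholders and may differ from the kernel used in the paper):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import torch

def mmd(x_s, x_t, sigma=1.0):
    &quot;&quot;&quot;Biased empirical MMD^2 between feature sets x_s (n_s, d) and x_t (n_t, d).&quot;&quot;&quot;
    def k(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))
    return k(x_s, x_s).mean() + k(x_t, x_t).mean() - 2 * k(x_s, x_t).mean()

def cmatch_loss(feats_s, labels_s, feats_t, labels_t):
    &quot;&quot;&quot;Average MMD over characters observed in both domains (sketch).&quot;&quot;&quot;
    losses = []
    for c in set(labels_s.tolist()) &amp;amp; set(labels_t.tolist()):
        losses.append(mmd(feats_s[labels_s == c], feats_t[labels_t == c]))
    return torch.stack(losses).mean() if losses else feats_s.new_zeros(())
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;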

&lt;h3 id=&quot;最终损失函数和算法流程&quot;&gt;Final Loss Function and Algorithm&lt;/h3&gt;

&lt;p&gt;We adopt a hybrid CTC-Attention model as the base ASR model, jointly learning the CTC module (used for frame-level label assignment) and the Transformer-decoder-based sequence-to-sequence loss, so the speech recognition loss can be written as:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/files/papers/uda/CMatch_Figures/Picture2.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Combining the distribution matching loss with the speech recognition loss gives the final loss function:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/files/papers/uda/CMatch_Figures/Picture10.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The overall algorithm is as follows:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/files/papers/uda/CMatch_Figures/Picture11.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;实验效果&quot;&gt;Experimental Results&lt;/h2&gt;

&lt;p&gt;First, let us look at in-domain recognition performance. The metric is word error rate (WER); lower is better:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/files/papers/uda/CMatch_Figures/Picture12.jpg&quot; alt=&quot;In-domain recognition results&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Next, the cross-device results. Note that the source-only model degrades on speech recorded by other devices compared with the in-domain models. Methods based on global MMD and domain-adversarial training both improve over it, and CMatch achieves the best result in every setting.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/files/papers/uda/CMatch_Figures/Picture13.jpg&quot; alt=&quot;Cross-device recognition results&quot; /&gt;&lt;/p&gt;

&lt;p&gt;CMatch also performs well in cross-environment (noise-robust) speech recognition.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/files/papers/uda/CMatch_Figures/Picture14.jpg&quot; alt=&quot;Cross-environment recognition results&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The ablation study shows that combining self-training with fine-grained distribution matching lets CMatch reach its best performance.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/files/papers/uda/CMatch_Figures/Picture15.jpg&quot; alt=&quot;Ablation study&quot; /&gt;&lt;/p&gt;

&lt;p&gt;We also compared the three label assignment methods. CTC forced alignment gives the best results but also has the highest computational cost; frame averaging also works reasonably well, but it assumes uniform speaking rates in the source and target domains; the pseudo-CTC-label method achieves results close to CTC forced alignment while being much more efficient to compute.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/files/papers/uda/CMatch_Figures/Picture16.jpg&quot; alt=&quot;Comparison of label assignment methods&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Finally, we examined whether the CMatch loss should also be applied on the decoder side. It turns out that, because the decoders in our experiments have no functional difference across domains, the target text being standard English in all cases, reducing the distribution discrepancy there brings no benefit and can even hurt performance.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/files/papers/uda/CMatch_Figures/Picture17.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;总结&quot;&gt;Summary&lt;/h2&gt;

&lt;p&gt;This paper proposes the CMatch algorithm for cross-domain speech recognition. Our main motivation is to match the character-level distributions of the source and target domains so that fine-grained character information can be exploited for better adaptation; experiments on cross-device and cross-environment ASR demonstrate the advantages of CMatch. In the future, we plan to experiment with more tasks and scenarios, such as dataset adaptation and speaker adaptation.&lt;/p&gt;

&lt;h3 id=&quot;references&quot;&gt;References&lt;/h3&gt;

&lt;p&gt;[1] Y. Zhu, F. Zhuang, J. Wang, G. Ke, J. Chen, J. Bian, H. Xiong, and Q. He, “Deep subdomain adaptation network for image classification,” IEEE transactions on neural networks and learning systems, 2020.&lt;/p&gt;</content><author><name>Wenxin Hou 侯汶昕</name><email>houwx001@gmail.com</email></author><category term="ASR" /><category term="Unsupervised Domain Adaptation" /><summary type="html">Last Updated: 2021-07-04</summary></entry><entry><title type="html">[Collection] Papers, Recipes, Toolkits and Interesting Posts</title><link href="https://houwx.net/posts/2020/01/blog-post-29/" rel="alternate" type="text/html" title="[Collection] Papers, Recipes, Toolkits and Interesting Posts" /><published>2020-12-10T00:00:00-08:00</published><updated>2020-12-10T00:00:00-08:00</updated><id>https://houwx.net/posts/2020/01/blog-post-29-reading-list</id><content type="html" xml:base="https://houwx.net/posts/2020/01/blog-post-29/">&lt;p&gt;Last Updated: 2020-12-10&lt;/p&gt;

&lt;p&gt;Zhihu: https://www.zhihu.com/people/liuchengwei/posts&lt;/p&gt;

&lt;p&gt;Tsinghua University speech group wiki: http://cslt.riit.tsinghua.edu.cn/mediawiki/index.php/Weekly_meeting&lt;/p&gt;

&lt;p&gt;NeurIPS2020 SAS workshop: https://neurips-sas-2020.github.io/#papers&lt;/p&gt;

&lt;p&gt;Machine Translation Reading List: https://github.com/THUNLP-MT/MT-Reading-List&lt;/p&gt;

&lt;h1 id=&quot;posts&quot;&gt;Posts&lt;/h1&gt;

&lt;h3 id=&quot;1-nlp中的少样本困境问题探究-chinese&quot;&gt;1. NLP中的少样本困境问题探究 &lt;a href=&quot;https://blog.csdn.net/xixiaoyaoww/article/details/106632300&quot;&gt;[Chinese]&lt;/a&gt;&lt;/h3&gt;

&lt;h3 id=&quot;2-探索孪生神经网络请停止你的梯度传递-chinese&quot;&gt;2. 探索孪生神经网络：请停止你的梯度传递！ &lt;a href=&quot;https://mp.weixin.qq.com/s?__biz=MjM5ODkzMzMwMQ==&amp;amp;mid=2650418672&amp;amp;idx=1&amp;amp;sn=5ed3ecc114277c9560c874c893cda0de&amp;amp;chksm=becdabaa89ba22bca575673024b026e11832b4f49ff1182c9bc209a8e0b5cc50440095eb533f&amp;amp;mpshare=1&amp;amp;scene=1&amp;amp;srcid=1206QhVjd5wTlHxfvflix4g2&amp;amp;sharer_sharetime=1607267785812&amp;amp;sharer_shareid=bfeb459fa3f43a270eff892a8e232c34&amp;amp;key=7537aef5c35d1c87f23ea654285eaa72e26b4b063c8f5d2b51d2de8d77500a17a997acdceeb884a0416ea3499cdc482ac23b902508ed88f29e1006afe9825b1672b82f92f7d02bb73417a953716f85684d4ca4326956eabbd7da85a256dec94d2b3b86e5d6140c6656d0b3cd97411ed63d6b515cdf673dfa5050bc322f5a7c64&amp;amp;ascene=1&amp;amp;uin=NjA0ODE1NTgx&amp;amp;devicetype=Windows+10+x64&amp;amp;version=6300002f&amp;amp;lang=zh_CN&amp;amp;exportkey=AS1effcPMoQ3h8eHHQqLCfI%3D&amp;amp;pass_ticket=mq7HGcsJzQ2p3817kGfQTfjWahWmdup879jk2Tc%2BIJUQg2j4seO3dTgNqJeR315Q&amp;amp;wx_header=0&quot;&gt;[Chinese]&lt;/a&gt;&lt;/h3&gt;

&lt;h3 id=&quot;3-更深的编码器更浅的解码器更快的自回归模型-chinese&quot;&gt;3. 更深的编码器+更浅的解码器=更快的自回归模型 &lt;a href=&quot;https://mp.weixin.qq.com/s?__biz=MzIwMTc4ODE0Mw==&amp;amp;mid=2247508286&amp;amp;idx=2&amp;amp;sn=1dc7eed239ef1a267465ff82c1dcc6a9&amp;amp;chksm=96ea7ebea19df7a8b905a573a8b0ce4cb116eda48ce8cd3a98f5b7b8104dd9356a48a836eb72&amp;amp;scene=132#wechat_redirect&quot;&gt;[Chinese]&lt;/a&gt;&lt;/h3&gt;

&lt;h1 id=&quot;toolkits&quot;&gt;Toolkits&lt;/h1&gt;

&lt;h3 id=&quot;1-pytorch-lightning-bolts-github&quot;&gt;1. PyTorch Lightning Bolts &lt;a href=&quot;https://github.com/PyTorchLightning/pytorch-lightning-bolts&quot;&gt;[GitHub]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Pretrained SOTA Deep Learning models, callbacks and more for research and production with PyTorch Lightning and PyTorch&lt;/li&gt;
  &lt;li&gt;Paper: A Framework For Contrastive Self-Supervised Learning And Designing A New Approach&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;2-s3prl-speech-toolkit-github&quot;&gt;2. S3PRL Speech Toolkit &lt;a href=&quot;https://github.com/andi611/Self-Supervised-Speech-Pretraining-and-Representation-Learning&quot;&gt;[GitHub]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Self-supervised pre-training and representation learning (S3PRL) of Mockingjay, TERA, A-ALBERT, APC, and more to come. With easy-to-use standard downstream evaluation scripts including  phone classification, speaker recognition, and ASR. (All in Pytorch!)&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;3-wavaugment-github&quot;&gt;3. WavAugment &lt;a href=&quot;https://github.com/facebookresearch/WavAugment&quot;&gt;[GitHub]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;WavAugment performs data augmentation on audio data. The audio data is represented as &lt;a href=&quot;https://pytorch.org/&quot;&gt;pytorch&lt;/a&gt; tensors&lt;/li&gt;
  &lt;li&gt;Augmentations include: pitch randomization, reverberation, additive noise, time dropout (temporal masking), band reject, clipping&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Data Augmenting Contrastive Learning of Speech Representations in the Time Domain&lt;/em&gt;, E. Kharitonov, M. Rivière, G. Synnaeve, L. Wolf, P.-E. Mazaré, M. Douze, E. Dupoux. [&lt;a href=&quot;https://arxiv.org/abs/2007.00991&quot;&gt;arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;4-sequence-to-sequence-g2p-toolkit-github&quot;&gt;4. Sequence-to-Sequence G2P Toolkit &lt;a href=&quot;https://github.com/cmusphinx/g2p-seq2seq&quot;&gt;[GitHub]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;The tool does Grapheme-to-Phoneme (G2P) conversion using transformer model from tensor2tensor toolkit based on TensorFlow&lt;/li&gt;
  &lt;li&gt;Lukasz Kaiser. “&lt;em&gt;Accelerating Deep Learning Research with the Tensor2Tensor Library&lt;/em&gt;.” In Google Research Blog, 2017&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;5-abkhazia-github&quot;&gt;5. Abkhazia &lt;a href=&quot;https://github.com/bootphon/abkhazia&quot;&gt;[GitHub]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Online documentation https://docs.cognitive-ml.fr/abkhazia&lt;/li&gt;
  &lt;li&gt;The Abkhazia project makes it easy to obtain simple baselines for supervised ASR (using &lt;a href=&quot;http://kaldi-asr.org&quot;&gt;Kaldi&lt;/a&gt;) and ABX tasks (using &lt;a href=&quot;https://github.com/bootphon/ABXpy&quot;&gt;ABXpy&lt;/a&gt;) on the large corpora of speech recordings typically used in speech engineering, linguistics or cognitive science research&lt;/li&gt;
&lt;/ol&gt;

&lt;h1 id=&quot;datasets&quot;&gt;Datasets&lt;/h1&gt;

&lt;h3 id=&quot;1-libri-adapt-a-new-speech-dataset-for-unsupervised-domain-adaptation-icassp2020-github&quot;&gt;1. Libri-Adapt: a New Speech Dataset for Unsupervised Domain Adaptation &lt;a href=&quot;https://ieeexplore.ieee.org/abstract/document/9053074?casa_token=ljl1YMO-v1sAAAAA:hk53vVqAO9dg8PEJP_qoH982_2bwRF-Otqq9nxP_vJV8yOhBe0b0Vf8Hp0n3TvHE83sjWuPCmw&quot;&gt;[ICASSP2020]&lt;/a&gt; &lt;a href=&quot;https://github.com/akhilmathurs/libriadapt&quot;&gt;[GitHub]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;7200 hours of English speech recorded on mobile and embedded scale microphones, 72 different domains (6 microphones x 3 accents x 4 environments x 100-hour Librispeech corpus) sampled at 16KHz&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;2-mls-a-large-scale-multilingual-dataset-for-speech-research-interspeech2020-newer-arxiv-version-recipepretrained-models-openslr&quot;&gt;2. MLS: A Large-Scale Multilingual Dataset for Speech Research &lt;a href=&quot;https://isca-speech.org/archive/Interspeech_2020/abstracts/2826.html&quot;&gt;[INTERSPEECH2020]&lt;/a&gt; (&lt;a href=&quot;https://arxiv.org/abs/2012.03411&quot;&gt;Newer arxiv version&lt;/a&gt;) &lt;a href=&quot;https://github.com/facebookresearch/wav2letter&quot;&gt;[Recipe/Pretrained Models]&lt;/a&gt; &lt;a href=&quot;http://www.openslr.org/&quot;&gt;[OpenSLR]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Multilingual LibriSpeech (MLS) dataset: derived from LibriVox, 50.5K hours, 8 languages (44.5K hours of English and a total of about 6K hours for other languages)&lt;/li&gt;
  &lt;li&gt;Languages covered: English, German, Dutch, French, Spanish, Italian, Portuguese, Polish&lt;/li&gt;
&lt;/ol&gt;

&lt;h1 id=&quot;end-to-end-asr&quot;&gt;End-to-End ASR&lt;/h1&gt;

&lt;h3 id=&quot;1-listen-attentively-and-spell-once-whole-sentence-generation-via-a-non-autoregressive-architecture-for-low-latency-speech-recognition-interspeech2020&quot;&gt;1. Listen Attentively, and Spell Once: Whole Sentence Generation via a Non-Autoregressive Architecture for Low-Latency Speech Recognition &lt;a href=&quot;https://www.isca-speech.org/archive/Interspeech_2020/abstracts/1600.html&quot;&gt;[INTERSPEECH2020]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Introduce non-autoregressive ASR system LASO (Listen Attentively, and Spell Once)&lt;/li&gt;
  &lt;li&gt;CER 6.4% (SOTA autoregressive Transformer model 6.7%) on AISHELL-1&lt;/li&gt;
  &lt;li&gt;21 ms average inference latency, 1/50 that of the autoregressive Transformer model&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;2-independent-language-modeling-architecture-for-end-to-end-asr-icassp2020&quot;&gt;2. Independent Language Modeling Architecture for End-To-End ASR &lt;a href=&quot;https://arxiv.org/abs/1912.00863&quot;&gt;[ICASSP2020]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Introduce independent language modeling subnet to leverage external text data&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Existing method: Replace encoding with an all-zero vector and freeze the encoder&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;3-conformer-convolution-augmented-transformer-for-speech-recognition-interspeech2020-paper-reading-slide&quot;&gt;3. Conformer: Convolution-augmented Transformer for Speech Recognition &lt;a href=&quot;https://arxiv.org/abs/2005.08100&quot;&gt;[INTERSPEECH2020]&lt;/a&gt; &lt;a href=&quot;https://houwx.net/files/slides/2020_interspeech_conformer.pdf&quot;&gt;[Paper Reading Slide]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Combine CNNs and Transformers to model both local and global dependencies in a parameter-efficient way&lt;/li&gt;
  &lt;li&gt;LibriSpeech WER of 2.1/4.3 without using LM and 1.9/3.9 with an external LM on test clean/other, 2.7/6.3 with a 10M-parameter small model&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;4-exploring-transformers-for-large-scale-speech-recognition-interspeech2020&quot;&gt;4. Exploring Transformers for Large-Scale Speech Recognition &lt;a href=&quot;https://www.isca-speech.org/archive/Interspeech_2020/abstracts/2638.html&quot;&gt;[INTERSPEECH2020]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Depth-scale model initialization to accelerate convergence&lt;/li&gt;
  &lt;li&gt;Pre-LayerNorm instead of Post-LayerNorm to accelerate convergence&lt;/li&gt;
  &lt;li&gt;Chunk-based Transformer-XL for streaming ASR (low computation + GPU memory-saving)&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;5-distilling-the-knowledge-of-bert-for-sequence-to-sequence-asr-interspeech2020-github&quot;&gt;5. Distilling the Knowledge of BERT for Sequence-to-Sequence ASR &lt;a href=&quot;https://www.isca-speech.org/archive/Interspeech_2020/abstracts/1179.html&quot;&gt;[INTERSPEECH2020]&lt;/a&gt; &lt;a href=&quot;https://github.com/hfutami/distill-bert-for-seq2seq-asr&quot;&gt;[GitHub]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Use BERT to generate soft labels by masking and predicting target words for the training of seq2seq ASR&lt;/li&gt;
  &lt;li&gt;Concatenate multiple utterances together to a fixed size for BERT prediction to make pre-training and distillation consistent and improve WER&lt;/li&gt;
&lt;/ol&gt;

&lt;h1 id=&quot;multilingual&quot;&gt;Multilingual&lt;/h1&gt;

&lt;h2 id=&quot;a-asr&quot;&gt;A. ASR&lt;/h2&gt;

&lt;h3 id=&quot;1-massively-multilingual-adversarial-speech-recognition-naacl-hlt2019&quot;&gt;1. Massively Multilingual Adversarial Speech Recognition &lt;a href=&quot;https://arxiv.org/abs/1904.02210&quot;&gt;[NAACL-HLT2019]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Analyze the relative importance of similarity between the target and pre-training languages along the dimensions of phonetics, phonology, language family, geographical location, and orthography&lt;/li&gt;
  &lt;li&gt;Investigate 2 additional objectives for hybrid CTC/Attention architecture: phoneme CTC and language-adversarial during pre-training&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;2-learning-robust-and-multilingual-speech-representations-findings-of-emnlp2020&quot;&gt;2. Learning Robust and Multilingual Speech Representations &lt;a href=&quot;https://www.aclweb.org/anthology/2020.findings-emnlp.106/&quot;&gt;[Findings of EMNLP2020]&lt;/a&gt;&lt;/h3&gt;

&lt;h3 id=&quot;3-language-agnostic-multilingual-modeling-icassp2020&quot;&gt;3. Language-agnostic Multilingual Modeling &lt;a href=&quot;https://arxiv.org/abs/2004.09571&quot;&gt;[ICASSP2020]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Propose a language-agnostic multilingual ASR system by transforming all languages to one writing system through a many-to-one transliteration transducer&lt;/li&gt;
  &lt;li&gt;Obtain 10% relative WER reduction on 4 Indic languages&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;6-end-to-end-multilingual-speech-recognition-system-with-language-supervision-training-ieicetrans2020&quot;&gt;6. End-to-End Multilingual Speech Recognition System with Language Supervision Training &lt;a href=&quot;https://www.jstage.jst.go.jp/article/transinf/E103.D/6/E103.D_2019EDL8214/_article&quot;&gt;[IEICETrans.2020]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Propose a Language Masks Estimation method to constrain the output distribution&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;7-towards-language-universal-mandarin-english-speech-recognition-interspeech2019&quot;&gt;7. Towards Language-Universal Mandarin-English Speech Recognition &lt;a href=&quot;https://www.isca-speech.org/archive/Interspeech_2019/abstracts/1365.html&quot;&gt;[INTERSPEECH2019]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Propose to combine two monolingual models to build a bilingual model&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;8-multilingual-speech-recognition-with-a-single-end-to-end-model-icassp2018&quot;&gt;8. Multilingual Speech Recognition With A Single End-To-End Model &lt;a href=&quot;https://arxiv.org/abs/1711.01694&quot;&gt;[ICASSP2018]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Investigate the joint model, LID multi-task learning, and language-embedding-conditioned cases on 9 Indian languages&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;9-towards-language-universal-end-to-end-speech-recognition-icassp2018&quot;&gt;9. Towards Language-Universal End-to-End Speech Recognition &lt;a href=&quot;https://arxiv.org/abs/1711.02207&quot;&gt;[ICASSP2018]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Explore language-specific/universal output layer&lt;/li&gt;
  &lt;li&gt;Propose language-specific gating units in hidden layers&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;10-large-scale-end-to-end-multilingual-speech-recognition-and-language-identification-with-multi-task-learning-interspeech2020-slide-recipe&quot;&gt;10. Large-Scale End-to-End Multilingual Speech Recognition and Language Identification with Multi-Task Learning &lt;a href=&quot;https://www.isca-speech.org/archive/Interspeech_2020/abstracts/2164.html&quot;&gt;[INTERSPEECH2020]&lt;/a&gt; &lt;a href=&quot;https://houwx.net/files/slides/2020_interspeech_lid-42.pdf&quot;&gt;[Slide]&lt;/a&gt; &lt;a href=&quot;https://github.com/espnet/espnet/tree/master/egs/li42/asr1&quot;&gt;[Recipe]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Present LID-42: very-large-scale Transformer-based hybrid CTC/Attention models based on subwords / characters for 42-lingual ASR, average CER: 27.8/27.2&lt;/li&gt;
  &lt;li&gt;Language-independent architecture with shared vocabulary including language IDs for joint language identification, LID accuracy: 93.5/94.0&lt;/li&gt;
  &lt;li&gt;Relative improvements of 28.1% in WER by transfer learning to low-resource languages&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;11-leveraging-language-id-in-multilingual-end-to-end-speech-recognition-asru2019&quot;&gt;11. Leveraging Language ID in Multilingual End-to-End Speech Recognition &lt;a href=&quot;https://ieeexplore.ieee.org/abstract/document/9003870&quot;&gt;[ASRU2019]&lt;/a&gt;&lt;/h3&gt;

&lt;h3 id=&quot;12-investigating-end-to-end-speech-recognition-for-mandarin-english-code-switching-icassp2019&quot;&gt;12. Investigating End-to-end Speech Recognition for Mandarin-english Code-switching &lt;a href=&quot;http://lxie.npu-aslp.org/papers/2019ICASSP-ChanghaoShan-CS.pdf&quot;&gt;[ICASSP2019]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Introduce MTL where at each time step, the model predicts both the modeling unit and the language ID&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;13-meta-learning-for-end-to-end-low-resource-speech-recognition-icassp2020&quot;&gt;13. Meta Learning for End-to-End Low-Resource Speech Recognition &lt;a href=&quot;https://arxiv.org/abs/1910.12094&quot;&gt;[ICASSP2020]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Apply model-agnostic meta-learning (MAML) to pre-train a CTC multilingual model and transfer to low-resource languages with language-specific head&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;14-language-adaptive-multilingual-ctc-speech-recognition&quot;&gt;14. Language Adaptive Multilingual CTC Speech Recognition&lt;/h3&gt;

&lt;h3 id=&quot;15-multilingual-speech-recognition-with-corpus-relatedness-sampling&quot;&gt;15. Multilingual Speech Recognition with Corpus Relatedness Sampling&lt;/h3&gt;

&lt;h3 id=&quot;16-adversarial-multilingual-training-for-low-resource-speech-recognition-icassp2018&quot;&gt;16. Adversarial Multilingual Training for Low-Resource Speech Recognition &lt;a href=&quot;http://159.226.21.132/file/2018_Speech%20Recognition_ICASSP_EI-Jiangyan%20Yi.pdf&quot;&gt;[ICASSP2018]&lt;/a&gt;&lt;/h3&gt;

&lt;h3 id=&quot;17-bytes-are-all-you-need-end-to-end-multilingual-speech-recognition-and-synthesis-with-bytes-icassp2019&quot;&gt;17. Bytes are All You Need: End-to-End Multilingual Speech Recognition and Synthesis with Bytes &lt;a href=&quot;https://arxiv.org/abs/1811.09021&quot;&gt;[ICASSP2019]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Propose Audio-to-Byte (A2B) and Byte-to-Audio (B2A) models for multilingual ASR and TTS&lt;/li&gt;
  &lt;li&gt;ASR model: LAS, TTS model: Tacotron 2. Input/Output layer: 256 possible byte values.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;18-an-investigation-of-deep-neural-networks-for-multilingual-speech-recognition-training-and-adaptation&quot;&gt;18. An Investigation of Deep Neural Networks for Multilingual Speech Recognition Training and Adaptation&lt;/h3&gt;

&lt;h3 id=&quot;19-bootstrap-an-end-to-end-asr-system-by-multilingual-training-transfer-learning-text-to-text-mapping-and-synthetic-audio-arxiv2020&quot;&gt;19. Bootstrap an End-to-End ASR System by Multilingual Training, Transfer Learning, Text-to-Text Mapping and Synthetic Audio &lt;a href=&quot;https://arxiv.org/abs/2011.12696&quot;&gt;[Arxiv2020]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Demonstrate that post-ASR text-to-text mapping and synthetic TTS data can be effectively combined with approaches such as multilingual training and transfer learning to improve a simulated Italian ASR bootstrapping scenario&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;20-adversarial-meta-sampling-for-multilingual-low-resource-speech-recognition-aaai2021-github&quot;&gt;20. Adversarial Meta Sampling for Multilingual Low-Resource Speech Recognition &lt;a href=&quot;https://arxiv.org/abs/2012.11896&quot;&gt;[AAAI2021]&lt;/a&gt; &lt;a href=&quot;https://github.com/iamxiaoyubei/AMS&quot;&gt;[GitHub]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Propose an Adversarial Meta-Sampling (AMS) approach for multilingual meta-learning ASR (MML-ASR) and multilingual transfer-learning ASR (MTL-ASR)&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;b-stt&quot;&gt;B. TTS&lt;/h2&gt;

&lt;h3 id=&quot;1-phonological-features-for-0-shot-multilingual-speech-synthesis-interspeech2020&quot;&gt;1. Phonological Features for 0-shot Multilingual Speech Synthesis &lt;a href=&quot;https://www.isca-speech.org/archive/Interspeech_2020/abstracts/1821.html&quot;&gt;[INTERSPEECH2020]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Utilize binary phonological features&lt;/li&gt;
  &lt;li&gt;Mapping tables for phoneme to IPA, as well as an IPA-PF lookup dictionary are available at https://github.com/papercup-open-source/phonological-features&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;c-speech-translation&quot;&gt;C. Speech Translation&lt;/h2&gt;

&lt;h3 id=&quot;1-effectively-pretraining-a-speech-translation-decoder-with-machine-translation-data-emnlp2020&quot;&gt;1. Effectively Pretraining a Speech Translation Decoder with Machine Translation Data &lt;a href=&quot;https://www.aclweb.org/anthology/2020.emnlp-main.644.pdf&quot;&gt;[EMNLP2020]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Propose to use an adversarial discriminator to train NMT and ASR systems simultaneously by aligning their encodings in the same latent space –&amp;gt; both the pre-trained ASR encoder and the NMT decoder can be used to improve AST&lt;/li&gt;
  &lt;li&gt;1.5 BLEU improvement on En-De and En-Fr compared with conventional pretraining methods (ASR encoder only / ASR encoder + NMT decoder pre-trained separately)&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;d-analysis&quot;&gt;D. Analysis&lt;/h2&gt;

&lt;h3 id=&quot;1-automatically-identifying-language-family-from-acoustic-examples-in-low-resource-scenarios-arxiv2020&quot;&gt;1. Automatically Identifying Language Family from Acoustic Examples in Low Resource Scenarios &lt;a href=&quot;https://arxiv.org/abs/2012.00876&quot;&gt;[Arxiv2020]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Train a LID model on the Wilderness dataset and analyze the learned embeddings by comparing with classical language family findings (Ethnologue, Glottolog, Wikipedia)&lt;/li&gt;
  &lt;li&gt;Show that languages grouped by learned embeddings perform better than distance-based or phoneme-based approaches on zero-shot TTS&lt;/li&gt;
&lt;/ol&gt;

&lt;h1 id=&quot;curriculum-learning&quot;&gt;Curriculum Learning&lt;/h1&gt;

&lt;h3 id=&quot;1-when-do-curricula-work-iclr2021-submitted&quot;&gt;1. When Do Curricula Work? &lt;a href=&quot;https://openreview.net/forum?id=tW4QEInpni&quot;&gt;[ICLR2021 (submitted)]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Investigate the &lt;strong&gt;implicit curricula&lt;/strong&gt; resulting from architectural and optimization bias and find that samples are learned in a highly consistent order&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Conduct extensive experiments over thousands of orderings spanning three kinds of curricula (curriculum, anti-curriculum, and random curriculum) and find that any benefit is entirely due to the dynamically growing training-set size rather than the order of examples&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Curriculum learning, but not anti-curriculum or random ordering, can indeed improve performance either with a limited training-time budget or in the presence of noisy data&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;h1 id=&quot;semi-supervised-learning&quot;&gt;Semi-Supervised Learning&lt;/h1&gt;

&lt;h3 id=&quot;1-deep-contextualized-acoustic-representations-for-semi-supervised-speech-recognition&quot;&gt;1. Deep Contextualized Acoustic Representations For Semi-Supervised Speech Recognition&lt;/h3&gt;

&lt;h3 id=&quot;2-semi-supervised-development-of-asr-systems-for-multilingual-code-switched-speech-in-under-resourced-languages&quot;&gt;2. Semi-supervised Development of ASR Systems for Multilingual Code-switched Speech in Under-resourced Languages&lt;/h3&gt;

&lt;h3 id=&quot;3-semi-supervised-end-to-end-speech-recognition-interspeech2018-github&quot;&gt;3. Semi-Supervised End-to-End Speech Recognition &lt;a href=&quot;https://isca-speech.org/archive/Interspeech_2018/abstracts/1746.html&quot;&gt;[INTERSPEECH2018]&lt;/a&gt; &lt;a href=&quot;https://github.com/ShigekiKarita/espnet-semi-supervised&quot;&gt;[GitHub]&lt;/a&gt;&lt;/h3&gt;

&lt;h3 id=&quot;4-self-training-for-end-to-end-speech-recognition-icassp2020-recipe&quot;&gt;4. Self-Training for End-to-End Speech Recognition &lt;a href=&quot;https://arxiv.org/abs/1909.09116&quot;&gt;[ICASSP2020]&lt;/a&gt; &lt;a href=&quot;https://github.com/facebookresearch/wav2letter/tree/master/recipes/self_training&quot;&gt;[Recipe]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Train a base AM on limited paired data and an LM on large-scale unpaired text data, then use beam search to generate pseudo-labels for unlabelled speech data&lt;/li&gt;
  &lt;li&gt;Filtering: a) n-grams repeated more than c times (looping); b) hypotheses with an EOS probability below a threshold (early stopping) or with no EOS generated; c) length-normalized log-likelihood as the confidence score for quality ranking: $\text{ConfidenceScore}(\hat{Y_i})=\frac{\log P_{AM}(\hat{Y_i}|X_i)}{|\hat{Y_i}|}$ (see the sketch after this list)&lt;/li&gt;
  &lt;li&gt;Propose sample ensemble: Combine pseudo samples generated by M models and average their loss during optimization&lt;/li&gt;
&lt;/ol&gt;
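
&lt;p&gt;A minimal sketch of the filtering and confidence-ranking step above, assuming each hypothesis is a dict with illustrative keys &lt;code&gt;tokens&lt;/code&gt;, &lt;code&gt;log_prob_sum&lt;/code&gt; and &lt;code&gt;eos_generated&lt;/code&gt; produced by beam search; the thresholds and n-gram length are placeholders, not values from the paper:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;
def confidence_score(log_prob_sum, tokens):
    # length-normalized log-likelihood: log P_AM(Y|X) / |Y|
    return log_prob_sum / max(len(tokens), 1)

def has_ngram_loop(tokens, n=3, max_repeat=2):
    # a) reject hypotheses in which some n-gram repeats more than max_repeat times (looping)
    counts = {}
    for i in range(len(tokens) - n + 1):
        ngram = tuple(tokens[i:i + n])
        counts[ngram] = counts.get(ngram, 0) + 1
    return any(c &amp;gt; max_repeat for c in counts.values())

def filter_pseudo_labels(hypotheses, score_threshold=-1.0):
    kept = []
    for hyp in hypotheses:
        if not hyp['eos_generated']:       # b) early-stopped or no EOS generated
            continue
        if has_ngram_loop(hyp['tokens']):
            continue
        score = confidence_score(hyp['log_prob_sum'], hyp['tokens'])
        if score &amp;gt;= score_threshold:     # c) keep only confident hypotheses
            kept.append((score, hyp))
    kept.sort(key=lambda x: x[0], reverse=True)   # rank the survivors by confidence
    return kept
&lt;/code&gt;&lt;/pre&gt;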

&lt;h3 id=&quot;5-end-to-end-asr-from-supervised-to-semi-supervised-learning-with-modern-architectures-sas-workshopicml2020-recipe&quot;&gt;5. End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures &lt;a href=&quot;https://arxiv.org/abs/1911.08460&quot;&gt;[SAS Workshop@ICML2020]&lt;/a&gt; &lt;a href=&quot;https://github.com/facebookresearch/wav2letter/tree/master/recipes/sota/2019&quot;&gt;[Recipe]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Train an AM + LM on Librispeech (960 hours) to generate pseudo-labels for LibriVox (53.8k hours) –&amp;gt; as shown in the paper, around 10k hours of pseudo-labelled audio already yields promising results&lt;/li&gt;
  &lt;li&gt;E2E AMs are implicit LMs, thus with enough unlabeled audio, decoding with an external LM doesn’t improve performance&lt;/li&gt;
  &lt;li&gt;On Librispeech: achieve 2.27/4.8 WER on test-clean/other sets without LM, 2.09/4.11 with LM&lt;/li&gt;
&lt;/ol&gt;

&lt;h1 id=&quot;self-supervised-learning&quot;&gt;Self-Supervised Learning&lt;/h1&gt;

&lt;h3 id=&quot;1-unsupervised-pretraining-transfers-well-across-languages-icassp2020-github&quot;&gt;1. Unsupervised Pretraining Transfers Well Across Languages &lt;a href=&quot;https://arxiv.org/abs/2002.02848&quot;&gt;[ICASSP2020]&lt;/a&gt; &lt;a href=&quot;https://github.com/facebookresearch/CPC_audio&quot;&gt;[GitHub]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Investigate CPC for cross-lingual tasks and evaluate the linear separability of the learned phoneme representations (Common Voice / LibriSpeech phoneme classification, Zerospeech2017)&lt;/li&gt;
  &lt;li&gt;Introduce two modifications to improve CPC: a) replace batch normalization with channel-wise normalization to avoid information leakage and stabilize training; b) replace the linear classifier with a 1-layer Transformer to make the future prediction target more reasonable&lt;/li&gt;
  &lt;li&gt;PER of modified CPC pretrained on LS-360 is comparable to the supervised model pretrained on LS-100&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;2-wav2vec-20-a-framework-for-self-supervised-learning-of-speech-representations-neurips2020-paper-reading-slide&quot;&gt;2. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations &lt;a href=&quot;https://arxiv.org/abs/2006.11477&quot;&gt;[NeurIPS2020]&lt;/a&gt; &lt;a href=&quot;https://houwx.net/files/slides/2020_neurips_wav2vec2.0.pdf&quot;&gt;[Paper Reading Slide]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Self-supervised learning by masking + contrastive loss + quantized representations&lt;/li&gt;
  &lt;li&gt;On Librispeech: achieve 1.8/3.3 WER on the clean/other test sets, and 4.8/8.2 WER with only 10 minutes of labeled data and 53k hours of unlabeled data. All experiments use an external LM&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;3-unsupervised-cross-lingual-representation-learning-for-speech-recognition-arxiv2020&quot;&gt;3. Unsupervised Cross-lingual Representation Learning for Speech Recognition &lt;a href=&quot;https://arxiv.org/abs/2006.13979&quot;&gt;[Arxiv2020]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Investigate self-supervised pre-training (wav2vec 2.0) for multilingual / cross-lingual ASR&lt;/li&gt;
  &lt;li&gt;Publish XLSR-53, a large-scale multilingual wav2vec 2.0 pre-trained on 56K-hour combined corpora of Common Voice (38 languages) + BABEL (14 languages) + Multilingual LibriSpeech (MLS) (8 languages)&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;4-unsupervised-learning-of-disentangled-representations-for-speech-with-neural-variational-inference-models-mit-phd-dissertation&quot;&gt;4. Unsupervised Learning of Disentangled Representations for Speech with Neural Variational Inference Models &lt;a href=&quot;https://groups.csail.mit.edu/sls/publications/2018/Wei-NingHsu_MS-Thesis.pdf&quot;&gt;[MIT PhD Dissertation]&lt;/a&gt;&lt;/h3&gt;

&lt;h3 id=&quot;5-a-further-study-of-unsupervised-pre-training-for-transformer-based-speech-recognition&quot;&gt;5. A Further Study of Unsupervised Pre-training for Transformer Based Speech Recognition&lt;/h3&gt;

&lt;h3 id=&quot;6-leveraging-text-data-using-hybrid-transformer-lstm-based-end-to-end-asr-in-transfer-learning&quot;&gt;6. Leveraging Text Data Using Hybrid Transformer-LSTM Based End-to-End ASR in Transfer Learning&lt;/h3&gt;

&lt;h3 id=&quot;7-self-training-and-pre-training-are-complementary-for-speech-recognition-arxiv2020&quot;&gt;7. Self-training and Pre-training are Complementary for Speech Recognition &lt;a href=&quot;https://arxiv.org/abs/2010.11430&quot;&gt;[Arxiv2020]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Combine self-training (paper 5 in semi-supervised learning section) and unsupervised pre-training (wav2vec 2.0) to achieve new SOTA&lt;/li&gt;
  &lt;li&gt;Use pre-trained wav2vec 2.0 for pseudo-labelling, then fine-tune wav2vec 2.0 (CTC) or train randomly-initialized Transformer model (S2S) as final model&lt;/li&gt;
  &lt;li&gt;On Librispeech: achieve 1.8/3.3 (CTC) or 1.5/3.1 (S2S) WER on the clean/other test sets, 3.0/5.2 (CTC) or 3.1/5.4 (S2S) WER on 10 minutes of labeled data and 53k hours of unlabeled data. All experiments are with external LM&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;8-vector-quantized-autoregressive-predictive-coding-interspeech2020-github&quot;&gt;8. Vector-Quantized Autoregressive Predictive Coding &lt;a href=&quot;https://www.isca-speech.org/archive/Interspeech_2020/abstracts/1228.html&quot;&gt;[INTERSPEECH2020]&lt;/a&gt; &lt;a href=&quot;https://github.com/iamyuanchung/VQ-APC&quot;&gt;[GitHub]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Introduce one or more vector quantization layers to the APC model to explicitly control the amount of encoded information&lt;/li&gt;
  &lt;li&gt;Probing tasks show that the APC model prefers to retain speaker information over phonetic information when the capacity is limited&lt;/li&gt;
  &lt;li&gt;When the phonetic information is present, the learned VQ codes correspond well with English phones&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;9-speech-xlnet-unsupervised-acoustic-model-pretraining-for-self-attention-networks-interspeech2020&quot;&gt;9. Speech-XLNet: Unsupervised Acoustic Model Pretraining for Self-Attention Networks &lt;a href=&quot;https://www.isca-speech.org/archive/Interspeech_2020/abstracts/1511.html&quot;&gt;[INTERSPEECH2020]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Apply XLNet (see paper 13 in the NLP section) to ASR tasks (without the segment recurrence mechanism)&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;10-mockingjay-unsupervised-speech-representation-learning-with-deep-bidirectional-transformer-encoders-icassp2020-github&quot;&gt;10. Mockingjay: Unsupervised Speech Representation Learning with Deep Bidirectional Transformer Encoders &lt;a href=&quot;https://arxiv.org/abs/1910.12638&quot;&gt;[ICASSP2020]&lt;/a&gt; &lt;a href=&quot;https://github.com/andi611/Self-Supervised-Speech-Pretraining-and-Representation-Learning&quot;&gt;[GitHub]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Apply BERT to speech with the proposed consecutive masking (mask C consecutive frames to 0; see the sketch after this list) and evaluate on phoneme classification, sentiment classification and speaker recognition&lt;/li&gt;
&lt;/ol&gt;
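
&lt;p&gt;A small sketch of the consecutive-masking idea, assuming torch tensors of shape (batch, time, dim); the span length and masking ratio below are illustrative, and only the zero-masking case is shown:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;
import torch

def mask_consecutive_frames(features, span=7, mask_ratio=0.15):
    # features: (batch, time, dim) acoustic frames; returns a masked copy and the boolean mask
    batch, time, _ = features.shape
    masked = features.clone()
    mask = torch.zeros(batch, time, dtype=torch.bool)
    num_spans = max(1, int(time * mask_ratio / span))
    for b in range(batch):
        for _ in range(num_spans):
            start = torch.randint(0, max(1, time - span), (1,)).item()
            mask[b, start:start + span] = True   # select C consecutive frames
    masked[mask] = 0.0                           # zero them out for reconstruction training
    return masked, mask

feats = torch.randn(2, 100, 80)
masked_feats, mask = mask_consecutive_frames(feats)
&lt;/code&gt;&lt;/pre&gt;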

&lt;h3 id=&quot;11-tera-self-supervised-learning-of-transformer-encoder-representation-for-speech-taslp2020-submitted-s3prl&quot;&gt;11. TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech &lt;a href=&quot;https://arxiv.org/abs/2007.06028&quot;&gt;[TASLP2020 (submitted)]&lt;/a&gt; &lt;a href=&quot;https://github.com/andi611/Self-Supervised-Speech-Pretraining-and-Representation-Learning&quot;&gt;[S3PRL]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Propose TERA (Transformer Encoder Representations from Alteration)&lt;/li&gt;
  &lt;li&gt;Multi-target 3 auxiliary L1 reconstruction objectives: time (time masking, Mockingjay) / channel (frequency masking) / magnitude (+ Gaussian noise) alteration&lt;/li&gt;
  &lt;li&gt;Comprehensive analysis on its application to ASR (representation / fine-tune) / phoneme classification / speaker classification&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;12-unsupervised-pre-training-of-bidirectional-speech-encoders-via-masked-reconstruction-icassp2020&quot;&gt;12. Unsupervised Pre-training of Bidirectional Speech Encoders via Masked Reconstruction &lt;a href=&quot;https://arxiv.org/abs/2001.10603&quot;&gt;[ICASSP2020]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Inspired by BERT, pre-train bidirectional RNNs via a masked reconstruction loss to improve ASR&lt;/li&gt;
  &lt;li&gt;Inspired by SpecAugment, mask segments of sufficient width in both time and frequency&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;13-improving-transformer-based-speech-recognition-using-unsupervised-pre-training-arxiv2019&quot;&gt;13. Improving Transformer-based Speech Recognition Using Unsupervised Pre-training &lt;a href=&quot;https://arxiv.org/abs/1910.09932&quot;&gt;[Arxiv2019]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Almost the same as paper 9, except that it is evaluated on the HKUST and AISHELL Mandarin ASR datasets&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;14-a-simple-framework-for-contrastive-learning-of-visual-representations-icml2020-github&quot;&gt;14. A Simple Framework for Contrastive Learning of Visual Representations &lt;a href=&quot;https://arxiv.org/abs/2002.05709&quot;&gt;[ICML2020]&lt;/a&gt; &lt;a href=&quot;https://github.com/google-research/simclr&quot;&gt;[GitHub]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Propose the SimCLR framework: input image $x$ –&amp;gt; two data augmentations giving $x_i$, $x_j$ –&amp;gt; same feature extractor –&amp;gt; $h_i$, $h_j$ –&amp;gt; same two-layer projection DNN –&amp;gt; $z_i$, $z_j$ –&amp;gt; contrastive loss (positive samples are $(i, j)$ and $(j, i)$; negative samples are the other $2(N-1)$ augmented examples); see the sketch after this list&lt;/li&gt;
  &lt;li&gt;Important factors: composing multiple data augmentation operations, larger batch sizes and longer training, a normalized temperature in the contrastive cross-entropy, and a nonlinear transformation between the representation and the contrastive loss&lt;/li&gt;
&lt;/ol&gt;
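
&lt;p&gt;A minimal sketch of the contrastive loss described above for a batch of N image pairs, where each row treats its augmented counterpart as the positive and the remaining 2(N-1) examples as negatives; the temperature value is illustrative:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;
import torch
import torch.nn.functional as F

def nt_xent_loss(z_i, z_j, temperature=0.5):
    # z_i, z_j: (N, d) projections of two augmented views of the same N images
    n = z_i.size(0)
    z = F.normalize(torch.cat([z_i, z_j], dim=0), dim=1)     # (2N, d)
    sim = torch.matmul(z, z.t()) / temperature                # pairwise cosine similarities
    sim.fill_diagonal_(float('-inf'))                         # exclude self-similarity
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])  # index of each row's positive
    return F.cross_entropy(sim, targets)

loss = nt_xent_loss(torch.randn(8, 128), torch.randn(8, 128))
&lt;/code&gt;&lt;/pre&gt;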

&lt;h3 id=&quot;15-speech-simclr-combining-contrastive-and-reconstruction-objective-for-self-supervised-speech-representation-learning-arxiv2020-github&quot;&gt;15. Speech SimCLR: Combining Contrastive and Reconstruction Objective for Self-supervised Speech Representation Learning &lt;a href=&quot;https://arxiv.org/abs/2010.13991&quot;&gt;[Arxiv2020]&lt;/a&gt; &lt;a href=&quot;https://github.com/athena-team/athena/tree/simclr&quot;&gt;[GitHub]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Data augmentation: random pitch shift,  speed perturbation, room reverberation and additive noise to the waveform; time and frequency masking to the spectrogram implemented with &lt;a href=&quot;https://github.com/facebookresearch/WavAugment&quot;&gt;WavAugment toolkit&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Final loss = reconstruction loss (TERA) + contrastive loss (SimCLR)&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;16-bootstrap-your-own-latent-a-new-approach-to-self-supervised-learning-arxiv2020&quot;&gt;16. Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning &lt;a href=&quot;http://arxiv.org/abs/2006.07733&quot;&gt;[Arxiv2020]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Propose BYOL, which trains an online network to predict a slowly updated target network’s representation of another augmented view of the same image, without using negative pairs&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;17-the-zero-resource-speech-benchmark-2021-metrics-and-baselines-for-unsupervised-spoken-language-modeling-neurips2020-sas&quot;&gt;17. The Zero Resource Speech Benchmark 2021: Metrics and Baselines for Unsupervised Spoken Language Modeling &lt;a href=&quot;https://arxiv.org/abs/2011.11588&quot;&gt;[NeurIPS2020 SAS]&lt;/a&gt;&lt;/h3&gt;


&lt;h3 id=&quot;18-multi-format-contrastive-learning-of-audio-representations-neurips2020-sas&quot;&gt;18. Multi-Format Contrastive Learning of Audio Representations &lt;a href=&quot;https://drive.google.com/file/d/1PfzgtuCU36Wd2Fi3TCwKWeabbRaav64X/view?usp=sharing&quot;&gt;[NeurIPS2020 SAS]&lt;/a&gt;&lt;/h3&gt;

&lt;h3 id=&quot;19-similarity-analysis-of-self-supervised-speech-representations-neurips2020-sas&quot;&gt;19. Similarity Analysis of Self-Supervised Speech Representations &lt;a href=&quot;https://arxiv.org/abs/2010.11481&quot;&gt;[NeurIPS2020 SAS]&lt;/a&gt;&lt;/h3&gt;


&lt;h1 id=&quot;transfer-learning--domain-adaptation&quot;&gt;Transfer Learning / Domain Adaptation&lt;/h1&gt;

&lt;h3 id=&quot;1-co-tuning-for-transfer-learning--neurips2020-github&quot;&gt;1. Co-Tuning for Transfer Learning  &lt;a href=&quot;https://proceedings.neurips.cc//paper/2020/hash/c8067ad1937f728f51288b3eb986afaa-Abstract.html&quot;&gt;[NeurIPS2020]&lt;/a&gt; &lt;a href=&quot;https://github.com/thuml/CoTuning&quot;&gt;[GitHub]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Propose Co-Tuning to fully transfer pre-trained models by utilizing pretraining-task-specific parameters&lt;/li&gt;
  &lt;li&gt;Loss of Co-Tuning: $\text{CE}(\text{prediction of target head}, y_t) + \lambda \cdot \text{CE}(\text{prediction of pretrained head}, p(y_s|y_t))$ (see the sketch after this list)&lt;/li&gt;
  &lt;li&gt;Propose two category relationship learning approaches to translate target labels into probabilistic source labels&lt;/li&gt;
&lt;/ol&gt;
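
&lt;p&gt;A small sketch of the two-term objective above, assuming the category relationship $p(y_s|y_t)$ has already been estimated as a (target classes, source classes) matrix; all variable names are illustrative:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;
import torch
import torch.nn.functional as F

def co_tuning_loss(target_logits, source_logits, target_labels, p_source_given_target, lam=1.0):
    # target_logits: (B, C_t) predictions of the new target head
    # source_logits: (B, C_s) predictions of the retained pre-trained head
    # p_source_given_target: (C_t, C_s) learned category relationship, rows sum to 1
    ce_target = F.cross_entropy(target_logits, target_labels)
    soft_source_labels = p_source_given_target[target_labels]        # (B, C_s) translated labels
    log_probs = F.log_softmax(source_logits, dim=1)
    ce_source = -(soft_source_labels * log_probs).sum(dim=1).mean()  # cross-entropy with soft labels
    return ce_target + lam * ce_source
&lt;/code&gt;&lt;/pre&gt;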

&lt;h3 id=&quot;2-self-training-for-few-shot-transfer-across-extreme-task-differences-iclr2021-submitted&quot;&gt;2. Self-training For Few-shot Transfer Across Extreme Task Differences &lt;a href=&quot;https://openreview.net/forum?id=O3Y56aqpChA&quot;&gt;[ICLR2021 (submitted)]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Propose “Self Training to Adapt Representations To Unseen Problems (STARTUP)”&lt;/li&gt;
  &lt;li&gt;Three stages to learn representations: train a teacher model on the base dataset –&amp;gt; construct a softly-labeled set on the target unlabeled set –&amp;gt; train the student model&lt;/li&gt;
  &lt;li&gt;Loss for training student model: CE loss on base dataset + KL divergence on softly-labeled set + self-supervised loss on unlabeled set (SimCLR in this paper)&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;3-adaptation-algorithms-for-speech-recognition-an-overview-ieee-ojsp2021-submitted&quot;&gt;3. Adaptation Algorithms for Speech Recognition: An Overview &lt;a href=&quot;https://arxiv.org/abs/2008.06580&quot;&gt;[IEEE OJSP2021 (submitted)]&lt;/a&gt;&lt;/h3&gt;

&lt;h3 id=&quot;4-unsupervised-domain-adaptation-for-speech-recognition-via-uncertainty-driven-self-training-icassp2021-submitted&quot;&gt;4. Unsupervised Domain Adaptation for Speech Recognition via Uncertainty Driven Self-Training &lt;a href=&quot;https://arxiv.org/abs/2011.13439&quot;&gt;[ICASSP2021 (submitted)]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Propose Dropout-based Uncertainty-driven Self-Training (DUST), which filters uncertain predictions out of the pseudo-labelled samples (see the sketch after this list)&lt;/li&gt;
  &lt;li&gt;Confidence is computed from the largest edit distance between the model output without dropout and the outputs with dropout under different random seeds&lt;/li&gt;
&lt;/ol&gt;
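
&lt;p&gt;A rough sketch of this disagreement-based filter, assuming a hypothetical &lt;code&gt;decode(model, x, dropout, seed)&lt;/code&gt; helper that returns a token sequence; the agreement threshold is illustrative:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;
def edit_distance(a, b):
    # standard Levenshtein distance between two token sequences
    dp = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(b) + 1):
            cur = min(dp[j] + 1, dp[j - 1] + 1, prev + (a[i - 1] != b[j - 1]))
            prev, dp[j] = dp[j], cur
    return dp[-1]

def dust_keep(model, x, decode, num_dropout_samples=3, threshold=0.1):
    # decode(model, x, dropout, seed) is a hypothetical stand-in for the ASR decoder
    reference = decode(model, x, dropout=False, seed=0)
    worst = 0.0
    for seed in range(num_dropout_samples):
        hyp = decode(model, x, dropout=True, seed=seed)
        worst = max(worst, edit_distance(reference, hyp) / max(len(reference), 1))
    # keep the pseudo-label only if the largest normalized disagreement stays small
    return worst &amp;lt;= threshold, reference
&lt;/code&gt;&lt;/pre&gt;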

&lt;h3 id=&quot;5-supervised-contrastive-learning-for-pre-trained-language-model-fine-tuning-iclr2021-submitted&quot;&gt;5. Supervised Contrastive Learning for Pre-trained Language Model Fine-tuning &lt;a href=&quot;https://arxiv.org/abs/2011.01403&quot;&gt;[ICLR2021 (submitted)]&lt;/a&gt;&lt;/h3&gt;


&lt;h3 id=&quot;6-domain-adaptation-using-class-similarity-for-robust-speech-recognition-interspeech2020&quot;&gt;6. Domain Adaptation Using Class Similarity for Robust Speech Recognition &lt;a href=&quot;https://arxiv.org/abs/2011.02782&quot;&gt;[INTERSPEECH2020]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Propose a novel adaptation method composed of two stages: a) train the source model and compute mean soft labels of every class over the source samples; b) use the soft labels as a regularization term when training the target model on target-domain data&lt;/li&gt;
  &lt;li&gt;Experiment on a) accent adaptation (Common Voice English) and b) noise adaptation (CHiME-3). The proposed method is more robust and performs even better on the latter task&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;7-unks-everywhere-adapting-multilingual-language-models-to-new-scripts-arxiv2020&quot;&gt;7. UNKs Everywhere: Adapting Multilingual Language Models to New Scripts &lt;a href=&quot;https://arxiv.org/abs/2012.15562&quot;&gt;[Arxiv2020]&lt;/a&gt;&lt;/h3&gt;


&lt;h1 id=&quot;nlp-nmt-plm&quot;&gt;NLP (NMT, PLM)&lt;/h1&gt;

&lt;h3 id=&quot;1-cross-lingual-language-model-pretraining-neurips2019&quot;&gt;1. Cross-lingual Language Model Pretraining &lt;a href=&quot;https://papers.nips.cc/paper/2019/file/c04c19c2c2474dbf5f7ac4372c5b9af1-Paper.pdf&quot;&gt;[NeurIPS2019]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Investigate the cross-lingual language model (XLM) pretrained with Causal LM (classic LM), Masked LM (unsupervised) and Translation LM (supervised) objectives&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;2-unicoder-a-universal-language-encoder-by-pre-training-with-multiple-cross-lingual-tasks-emnlp2019&quot;&gt;2. Unicoder: A Universal Language Encoder by Pre-training with Multiple Cross-lingual Tasks &lt;a href=&quot;https://arxiv.org/abs/1909.00964&quot;&gt;[EMNLP2019]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Based on 2 tasks (MLM, TLM) in XLM, further introduces 3 new cross-lingual pretraining tasks: Cross-lingual Word Recovery (based on attention), Cross-lingual Paraphrase Classification (similar to Next Sentence Prediction but predicting meaning), Cross-lingual MLM (code-switch sentences)&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;3-learning-deep-transformer-models-for-machine-translation-acl2019&quot;&gt;3. Learning Deep Transformer Models for Machine Translation &lt;a href=&quot;https://arxiv.org/abs/1906.01787&quot;&gt;[ACL2019]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Study the effects of Pre-Norm and Post-Norm in deep Transformers; the gradients of Post-Norm pose a higher risk of vanishing or exploding&lt;/li&gt;
  &lt;li&gt;Propose Dynamic Linear Combination of Layers (DLCL) to memorize the features extracted from all preceding layers&lt;/li&gt;
  &lt;li&gt;Successfully train a 30-layer Transformer encoder (the deepest reported at the time) and a 6-layer decoder for NMT&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;4-massively-multilingual-sentence-embeddings-for-zero-shot-cross-lingual-transfer-and-beyond-tacl2019&quot;&gt;4. Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond &lt;a href=&quot;https://arxiv.org/abs/1812.10464&quot;&gt;[TACL2019]&lt;/a&gt;&lt;/h3&gt;


&lt;h3 id=&quot;5-a-study-of-cross-lingual-ability-and-language-specific-information-in-multilingual-bert&quot;&gt;5. A Study of Cross-Lingual Ability and Language-specific Information in Multilingual BERT&lt;/h3&gt;

&lt;h3 id=&quot;6-zero-shot-reading-comprehension-by-cross-lingual-transfer-learning-with-multi-lingual-language-representation-model&quot;&gt;6. Zero-shot Reading Comprehension by Cross-lingual Transfer Learning with Multi-lingual Language Representation Model&lt;/h3&gt;

&lt;h3 id=&quot;7-are-all-languages-created-equal-in-multilingual-bert&quot;&gt;7. Are All Languages Created Equal in Multilingual BERT?&lt;/h3&gt;

&lt;h3 id=&quot;8-multilingual-neural-machine-translation-with-language-clustering&quot;&gt;8. Multilingual Neural Machine Translation with Language Clustering&lt;/h3&gt;

&lt;h3 id=&quot;9-multilingual-unsupervised-nmt-using-shared-encoder-and-language-specific-decoders&quot;&gt;9. Multilingual Unsupervised NMT using Shared Encoder and Language-Specific Decoders&lt;/h3&gt;

&lt;h3 id=&quot;10-multilingual-nmt-with-a-language-independent-attention-bridge-httpsgithubcomhelsinki-nlpopennmt-pytreeatt-brg&quot;&gt;10. Multilingual NMT with a Language-Independent Attention Bridge &lt;a href=&quot;https://github.com/Helsinki-NLP/OpenNMT-py/tree/att-brg&quot;&gt;[GitHub]&lt;/a&gt;&lt;/h3&gt;

&lt;h3 id=&quot;11-cross-lingual-spoken-language-understanding-with-regularized-representation-alignment&quot;&gt;11. Cross-lingual Spoken Language Understanding with Regularized Representation Alignment&lt;/h3&gt;

&lt;h3 id=&quot;12-transformer-xl-attentive-language-models-beyond-a-fixed-length-context-acl2019-github&quot;&gt;12. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context &lt;a href=&quot;https://arxiv.org/abs/1901.02860&quot;&gt;[ACL2019]&lt;/a&gt; &lt;a href=&quot;https://github.com/kimiyoung/transformer-xl&quot;&gt;[GitHub]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Propose Transformer-XL (extra long) to capture longer-term dependency and resolve the context fragmentation problem&lt;/li&gt;
  &lt;li&gt;Introduce segment-level recurrence mechanism to reuse previous segments as context&lt;/li&gt;
  &lt;li&gt;Introduce relative positional encodings for the proposed recurrence mechanism&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;13-xlnet-generalized-autoregressive-pretraining-for-language-understanding-neurips2019-github-mask分析&quot;&gt;13. XLNet: Generalized Autoregressive Pretraining for Language Understanding &lt;a href=&quot;https://arxiv.org/abs/1906.08237&quot;&gt;[NeurIPS2019]&lt;/a&gt; &lt;a href=&quot;https://github.com/zihangdai/xlnet&quot;&gt;[GitHub]&lt;/a&gt; &lt;a href=&quot;https://mp.weixin.qq.com/s?__biz=MzIwMTc4ODE0Mw==&amp;amp;mid=2247514606&amp;amp;idx=2&amp;amp;sn=bf45236870c8e04ddd0bda6a8ab43337&amp;amp;chksm=96ea666ea19def7834e45e0b1aca52dc324ba1cbc42aee7d712e40fae346e8bc3c73b8992d2c&amp;amp;scene=126&amp;amp;sessionid=1606654762&amp;amp;key=81b7b09de2e501a29d938e698bd05dd4b729c8c891b4e013ba75be9fa8ea638a6ddeb88736ba9e0976afcb18fda158f33671e81cdc7186a69e73286a25ce4857a06ad7e427bd941fc425400f8caf7099c0df19da6085f209e1561ee69b296a016d46254042c45f07a57b58ea62b3b9f902605653249bb4e29239b1d0ecac040d&amp;amp;ascene=1&amp;amp;uin=NjA0ODE1NTgx&amp;amp;devicetype=Windows+10+x64&amp;amp;version=6300002f&amp;amp;lang=zh_CN&amp;amp;exportkey=Aa4iNLil%2FDw6FReJr1imHpY%3D&amp;amp;pass_ticket=FT97m6EOkfyNaHm4C46Lu8nVrxTVO%2BCKeGa%2F3JbVyQIB0jydRsW%2BwAL2s%2FUeZ6hD&amp;amp;wx_header=0&quot;&gt;[Mask分析]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Introduce a permutation language modeling (PLM) pre-training objective&lt;/li&gt;
  &lt;li&gt;Two-stream self-attention + partial prediction (only predict the last tokens in a factorization order)&lt;/li&gt;
  &lt;li&gt;Integrate Transformer-XL (relative positional encoding + segment recurrence mechanism)&lt;/li&gt;
&lt;/ol&gt;

&lt;h1 id=&quot;multi-modal&quot;&gt;Multi-Modal&lt;/h1&gt;

&lt;h3 id=&quot;1-speech-to-text-adaptation-towards-an-efficient-cross-modal-distillation&quot;&gt;1. Speech to Text Adaptation: Towards an Efficient Cross-Modal Distillation&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Knowledge distillation from BERT to ASR pretrained module for Spoken Language Understanding&lt;/li&gt;
&lt;/ol&gt;

&lt;h1 id=&quot;deep-learning-architecture&quot;&gt;Deep Learning Architecture&lt;/h1&gt;

&lt;h3 id=&quot;1-pay-less-attention-with-lightweight-and-dynamic-convolutions-iclr2019&quot;&gt;1. Pay Less Attention with Lightweight and Dynamic Convolutions &lt;a href=&quot;https://openreview.net/forum?id=SkVhlh09tX&quot;&gt;[ICLR2019]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Depth-wise convolution: convolution performed independently on every channel&lt;/li&gt;
  &lt;li&gt;Lightweight convolution (depth-wise + weight sharing across channel groups + GLU); a simplified sketch follows this list&lt;/li&gt;
  &lt;li&gt;Dynamic convolution: compute the convolution weights dynamically from the input $X$&lt;/li&gt;
&lt;/ol&gt;
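
&lt;p&gt;A simplified sketch of a lightweight convolution layer (depth-wise, with softmax-normalized kernels shared across groups of channels); the GLU, weight dropout and the dynamic variant are omitted, and all shapes and hyper-parameters are illustrative:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightweightConv1d(nn.Module):
    # Depth-wise 1-D convolution whose softmax-normalized kernels are shared
    # across groups of channels (a simplified sketch of the lightweight variant).
    def __init__(self, channels, kernel_size=3, num_heads=4):
        super().__init__()
        assert channels % num_heads == 0
        self.num_heads = num_heads
        self.kernel_size = kernel_size
        self.weight = nn.Parameter(torch.randn(num_heads, 1, kernel_size))

    def forward(self, x):
        # x: (batch, channels, time)
        b, c, t = x.shape
        w = F.softmax(self.weight, dim=-1)                   # normalize each kernel over its taps
        w = w.repeat_interleave(c // self.num_heads, dim=0)  # (channels, 1, kernel_size)
        return F.conv1d(x, w, padding=self.kernel_size // 2, groups=c)

y = LightweightConv1d(8)(torch.randn(2, 8, 16))   # output keeps the input shape
&lt;/code&gt;&lt;/pre&gt;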

&lt;h3 id=&quot;2-dynamic-convolution-attention-over-convolution-kernels-cvpr2020&quot;&gt;2. Dynamic Convolution: Attention over Convolution Kernels &lt;a href=&quot;https://arxiv.org/abs/1912.03458&quot;&gt;[CVPR2020]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Aggregate multiple convolution kernels dynamically based on their attention weights&lt;/li&gt;
  &lt;li&gt;Tricks: constrain the attention weights to sum to 1 + near-uniform attention in early epochs (softmax with a large temperature)&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;3-rethinking-the-value-of-transformer-components-coling2020&quot;&gt;3. Rethinking the Value of Transformer Components &lt;a href=&quot;https://arxiv.org/abs/2011.03803&quot;&gt;[COLING2020]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;For the decoder, self-attention is the least important and the FFN is the most important; higher encoder-attention layers (closer to the output layer) matter more than lower ones&lt;/li&gt;
  &lt;li&gt;For encoder, the lower components (self-attention, FFN) are more important.&lt;/li&gt;
  &lt;li&gt;Two methods to improve Transformer NMT: a) Prune unimportant components and retrain the model. b) Rewind unimportant components and finetune the model&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;4-understanding-the-difficulty-of-training-transformers-emnlp2020&quot;&gt;4. Understanding the Difficulty of Training Transformers &lt;a href=&quot;https://arxiv.org/abs/2004.08249&quot;&gt;[EMNLP2020]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Strong dependencies of Post-Norm amplify fluctuations brought by parameter changes and destabilize the training&lt;/li&gt;
  &lt;li&gt;The loose reliance on residual branches in Pre-Norm generally limits the model’s potential and often produces models inferior to Post-Norm&lt;/li&gt;
  &lt;li&gt;Propose Admin: an &lt;strong&gt;Ad&lt;/strong&gt;aptive &lt;strong&gt;m&lt;/strong&gt;odel &lt;strong&gt;in&lt;/strong&gt;itialization method to stabilize the early stage of training&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;5-made-masked-autoencoder-for-distribution-estimation-icml2015&quot;&gt;5. MADE: Masked Autoencoder for Distribution Estimation &lt;a href=&quot;https://arxiv.org/abs/1502.03509&quot;&gt;[ICML2015]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Modify the autoencoder to make it autoregressive by using masks to change the hidden-layer connectivity structure&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;6-lite-transformer&quot;&gt;6. Lite Transformer&lt;/h3&gt;

&lt;h3 id=&quot;7-object-centric-learning-with-slot-attention&quot;&gt;7. Object-Centric Learning with Slot Attention&lt;/h3&gt;

&lt;h3 id=&quot;8-deep-learning-recommendation-model-for-personalization-and-recommendation-systems&quot;&gt;8. Deep Learning Recommendation Model for Personalization and Recommendation Systems&lt;/h3&gt;

&lt;h3 id=&quot;9-multi-head-attentioncollaborate-instead-of-concatenat&quot;&gt;9. Multi-Head Attention: Collaborate Instead of Concatenate&lt;/h3&gt;

&lt;h1 id=&quot;meta-learning&quot;&gt;Meta-Learning&lt;/h1&gt;

&lt;h3 id=&quot;1-on-first-order-meta-learning-algorithms&quot;&gt;1. On First-Order Meta-Learning Algorithms&lt;/h3&gt;

&lt;h3 id=&quot;2-reptile--a-scalable-metalearning-algorithm&quot;&gt;2. Reptile:  a Scalable Metalearning Algorithm&lt;/h3&gt;

&lt;h3 id=&quot;3-rapid-learning-or-feature-reuse-towards-understanding-the-effectiveness-of-maml-iclr2020&quot;&gt;3. Rapid Learning or Feature Reuse? Towards Understanding the Effectiveness of MAML &lt;a href=&quot;https://openreview.net/forum?id=rkgMkCEtPB&quot;&gt;[ICLR2020]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Question: is the effectiveness of MAML due to the meta-initialization being primed for rapid learning (large, efficient changes in the representations) or due to feature reuse, with the meta-initialization already containing high quality features? Answer: feature reuse is the dominant factor&lt;/li&gt;
  &lt;li&gt;Propose Almost No Inner Loop (ANIL) algorithm, a competitive simplification of MAML by removing the inner loop for all but the (task-specific) head of the underlying neural network&lt;/li&gt;
  &lt;li&gt;Propose No Inner Loop (NIL) algorithm to classify the test sample based on cosine similarities of penultimate layer representations with the k labelled examples (support set)&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;4-convergence-of-meta-learning-with-task-specific-adaptation-over-partial-parameters&quot;&gt;4. Convergence of Meta-Learning with Task-Specific Adaptation over Partial Parameters&lt;/h3&gt;

&lt;h1 id=&quot;others&quot;&gt;Others&lt;/h1&gt;

&lt;h3 id=&quot;1-scaling-hidden-markov-language-models--emnlp2020&quot;&gt;1. Scaling Hidden Markov Language Models  &lt;a href=&quot;https://arxiv.org/abs/2011.04640&quot;&gt;[EMNLP2020]&lt;/a&gt;&lt;/h3&gt;

&lt;h3 id=&quot;2-revisit-knowledge-distillation-a-teacher-free-framework&quot;&gt;2. Revisit Knowledge Distillation: A Teacher-free Framework&lt;/h3&gt;

&lt;h3 id=&quot;3-pegasus-pre-training-with-extracted-gap-sentences-forabstractive-summarization&quot;&gt;3. PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization&lt;/h3&gt;

&lt;h3 id=&quot;4-adapterfusionnon-destructive-task-composition-for-transfer-learning&quot;&gt;4. AdapterFusion: Non-Destructive Task Composition for Transfer Learning&lt;/h3&gt;

&lt;h3 id=&quot;5-improving-massively-multilingual-neural-machine-translation-and-zero-shot-translation&quot;&gt;5. Improving Massively Multilingual Neural Machine Translation and Zero-Shot Translation&lt;/h3&gt;

&lt;h3 id=&quot;6-dynamic-fusion-network-for-multi-domain-end-to-end-task-oriented-dialog&quot;&gt;6. Dynamic Fusion Network for Multi-Domain End-to-end Task-Oriented Dialog&lt;/h3&gt;

&lt;h3 id=&quot;7-multi-source-domain-adaptation-with-mixture-of-experts&quot;&gt;7. Multi-Source Domain Adaptation with Mixture of Experts&lt;/h3&gt;

&lt;h3 id=&quot;8-meta-learning-for-few-shot-nmt-adaptation&quot;&gt;8. Meta-Learning for Few-Shot NMT Adaptation&lt;/h3&gt;

&lt;h3 id=&quot;8-simple-scalable-adaptation-for-neural-machine-translation&quot;&gt;9. Simple, Scalable Adaptation for Neural Machine Translation&lt;/h3&gt;

&lt;h3 id=&quot;9-parameter-efficient-transfer-learning-for-nlp&quot;&gt;10. Parameter-Efficient Transfer Learning for NLP&lt;/h3&gt;

&lt;h3 id=&quot;11-improving-target-side-lexical-transfer-in-multilingual-neural-machine-translation&quot;&gt;11. Improving Target-side Lexical Transfer in Multilingual Neural Machine Translation&lt;/h3&gt;

&lt;h3 id=&quot;12-large-memory-layers-with-product-keys&quot;&gt;12. Large Memory Layers with Product Keys&lt;/h3&gt;

&lt;h3 id=&quot;13-an-analysis-of-massively-multilingual-neural-machine-translation-forlow-resource-languages&quot;&gt;13. An Analysis of Massively Multilingual Neural Machine Translation forLow-Resource Languages&lt;/h3&gt;

&lt;h3 id=&quot;14-large-product-key-memory-for-pretrained-language-models&quot;&gt;14. Large Product Key Memory for Pretrained Language Models&lt;/h3&gt;

&lt;h3 id=&quot;15-contextual-parameter-generation-for-universal-neural-machine-translation&quot;&gt;15. Contextual Parameter Generation for Universal Neural Machine Translation&lt;/h3&gt;

&lt;h3 id=&quot;16-multilingual-speech-recognition-with-self-attention-structured-parameterization&quot;&gt;16. Multilingual Speech Recognition with Self-Attention Structured Parameterization&lt;/h3&gt;

&lt;h3 id=&quot;17-experience-grounds-language-emnlp2020&quot;&gt;17. Experience Grounds Language &lt;a href=&quot;https://www.aclweb.org/anthology/2020.emnlp-main.703/&quot;&gt;[EMNLP2020]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Propose the notion of a World Scope (WS) as a lens to audit progress in NLP&lt;/li&gt;
  &lt;li&gt;Five levels of WS: Corpus, Internet, Perception, Embodiment, Social&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;18-monolingual-adapters-for-zero-shot-neural-machine-translation-emnlp2020&quot;&gt;18. Monolingual Adapters for Zero-Shot Neural Machine Translation &lt;a href=&quot;https://www.aclweb.org/anthology/2020.emnlp-main.361/&quot;&gt;[EMNLP2020]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Propose language-specific adapter layers, which require only $2n$ adapters ($O(n)$) instead of the $n\cdot (n-1)$ ($O(n^2)$) needed for bilingual adapters, and which enable combining any encoder adapter with any decoder adapter&lt;/li&gt;
&lt;/ol&gt;</content><author><name>Wenxin Hou 侯汶昕</name><email>houwx001@gmail.com</email></author><category term="Deep Learning" /><category term="ASR" /><category term="Meta-Learning" /><category term="Curriculum Learning" /><category term="Multilingual" /><category term="NLP" /><summary type="html">Last Updated: 2020-12-10</summary></entry><entry><title type="html">PyTorch模型部署踩坑记录</title><link href="https://houwx.net/posts/2020/01/blog-post-27/" rel="alternate" type="text/html" title="PyTorch模型部署踩坑记录" /><published>2020-08-14T00:00:00-07:00</published><updated>2020-08-14T00:00:00-07:00</updated><id>https://houwx.net/posts/2020/01/blog-post-27-%E6%A8%A1%E5%9E%8B%E9%83%A8%E7%BD%B2%E8%B8%A9%E5%9D%91%E8%AE%B0%E5%BD%95</id><content type="html" xml:base="https://houwx.net/posts/2020/01/blog-post-27/">&lt;p&gt;Last Updated: 2020-08-14&lt;/p&gt;


&lt;h2 id=&quot;0-基础知识&quot;&gt;0. Background&lt;/h2&gt;

&lt;ol&gt;
  &lt;li&gt;ONNX is a common exchange format for neural networks that unifies models written in different frameworks (PyTorch, TensorFlow, etc.) to simplify testing and deployment; currently ONNX only supports inference. https://github.com/onnx/onnx&lt;/li&gt;
  &lt;li&gt;Exporting PyTorch models to ONNX: https://pytorch.org/docs/stable/onnx.html (a minimal export example follows this list)&lt;/li&gt;
  &lt;li&gt;volksdep, an open-source library built on top of torch.onnx: https://github.com/Media-Smart/volksdep&lt;/li&gt;
  &lt;li&gt;Related reading: has ONNX, the model exchange format open-sourced for over a year, already unified the framework landscape? https://zhuanlan.zhihu.com/p/51387600&lt;/li&gt;
&lt;/ol&gt;
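
&lt;p&gt;A minimal export example using the official torch.onnx API; the toy model, output file name, opset version and axis names below are illustrative:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 10))
model.eval()

dummy_input = torch.randn(1, 80)
torch.onnx.export(
    model,
    dummy_input,
    'model.onnx',                  # output file
    opset_version=11,
    input_names=['feats'],
    output_names=['logits'],
    dynamic_axes={'feats': {0: 'batch'}, 'logits': {0: 'batch'}},  # allow a variable batch size
)
&lt;/code&gt;&lt;/pre&gt;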

&lt;h1 id=&quot;1-pytorch模型转onnx&quot;&gt;1. Converting a PyTorch Model to ONNX&lt;/h1&gt;

&lt;h3 id=&quot;11-部分operator不支持问题&quot;&gt;1.1. Unsupported Operators&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. torch.tril: RuntimeError: Exporting operator tril to ONNX opset version 11 is not supported.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;: temporarily fall back to the NumPy tril function, i.e. torch.tril() –&amp;gt; torch.from_numpy(np.tril(...)); the same applies to triu. A sketch follows below.&lt;/p&gt;
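
&lt;p&gt;A small sketch of this workaround, building the lower-triangular mask with NumPy so that tracing never calls the unsupported operator:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;
import numpy as np
import torch

def subsequent_mask(size):
    # torch.tril is not exportable for this opset, so build the mask with NumPy instead
    mask = np.tril(np.ones((size, size), dtype=np.float32))
    return torch.from_numpy(mask)

print(subsequent_mask(4))   # lower-triangular attention mask
&lt;/code&gt;&lt;/pre&gt;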

&lt;p&gt;&lt;strong&gt;2. KLDivLoss:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;: support was only recently added in the latest PyTorch; copying the code from this PR into the corresponding file resolves the issue: https://github.com/pytorch/pytorch/pull/41858/files&lt;/p&gt;

&lt;h3 id=&quot;12-tracerwarning问题&quot;&gt;1.2. TracerWarning Issues&lt;/h3&gt;

&lt;p&gt;TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect.&lt;/p&gt;

&lt;h4 id=&quot;原因以及解决方案&quot;&gt;Causes and Solutions&lt;/h4&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;if/else statements.&lt;/strong&gt; If the network structure defined in PyTorch is too flexible, the export to ONNX is likely to go wrong. This warning usually means that the network contains if/else statements. (https://blog.csdn.net/Einstellung/article/details/105886873)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Parameter overwriting (in-place modification / computation).&lt;/strong&gt; For example, the in-place update p[:, 0:2] = torch.sigmoid(p[:, 0:2]); assigning the result to a temporary variable instead solves it. (https://blog.csdn.net/weixin_39908946/article/details/106855482)&lt;/li&gt;
&lt;/ol&gt;</content><author><name>Wenxin Hou 侯汶昕</name><email>houwx001@gmail.com</email></author><category term="NLP" /><summary type="html">Last Updated: 2020-08-14</summary></entry><entry><title type="html">[Paper-NLP] Meta-Learning for Low-Resource Neural Machine Translation</title><link href="https://houwx.net/posts/2020/01/blog-post-25/" rel="alternate" type="text/html" title="[Paper-NLP] Meta-Learning for Low-Resource Neural Machine Translation" /><published>2020-08-11T00:00:00-07:00</published><updated>2020-08-11T00:00:00-07:00</updated><id>https://houwx.net/posts/2020/01/blog-post-25-maml-nmt</id><content type="html" xml:base="https://houwx.net/posts/2020/01/blog-post-25/">&lt;p&gt;Last Updated: 2020-08-12&lt;/p&gt;

&lt;p&gt;This paper: &lt;a href=&quot;https://arxiv.org/pdf/1808.08437.pdf&quot;&gt;Meta-Learning for Low-Resource Neural Machine Translation&lt;/a&gt; is published at EMNLP 2018. Authors include Jiatao Gu, Yong Wang, Yun Chen, Kyunghyun Cho and Victor O.K.Li from The University of Hong Kong and New York University.&lt;/p&gt;

&lt;p&gt;This paper introduces the model-agnostic meta-learning (MAML) algorithm to the low-resource neural machine translation (NMT) task. The proposed approach significantly outperforms the multilingual, transfer-learning-based method.&lt;/p&gt;

&lt;h1 id=&quot;1-introduction&quot;&gt;1. Introduction&lt;/h1&gt;

&lt;p&gt;To address the problem of low-resource language pairs, various approaches have been presented including:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Utilizing monolingual corpora (multi-task learning, back-translation, dual learning, unsupervised machine translation with monolingual corpora for both sides)&lt;/li&gt;
  &lt;li&gt;Exploiting knowledge from high-resource language pairs (auxiliary translations/tasks, &lt;strong&gt;multilingual translation&lt;/strong&gt;, universal lexical representation)&lt;/li&gt;
  &lt;li&gt;Pre-training the NMT model on a high-resource language pair and transferring it to the target low-resource pair&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The authors follow up on the latest multilingual NMT approaches and &lt;strong&gt;introduce MAML to low-resource NMT&lt;/strong&gt; by regarding different language pairs as separate tasks. They further &lt;strong&gt;incorporate the universal lexical representation&lt;/strong&gt; to overcome the problem that vanilla MAML cannot handle mismatched input and output spaces.&lt;/p&gt;

&lt;h1 id=&quot;2-background&quot;&gt;2. Background&lt;/h1&gt;

&lt;h4 id=&quot;neural-machine-translation-nmt&quot;&gt;Neural Machine Translation (NMT)&lt;/h4&gt;

&lt;p&gt;Given a source sentence:&lt;/p&gt;

\[X=\{x_1, ..., x_{T&apos;}\}.\]

&lt;p&gt;and a target sentence $Y$, the NMT task is to model:&lt;/p&gt;

\[p(Y|X;\theta)=\prod_{t=1}^{T+1}p(y_t|y_{0:t-1}, x_{1:T&apos;};\theta).\]

&lt;h3 id=&quot;meta-learning&quot;&gt;Meta Learning&lt;/h3&gt;

&lt;p&gt;Meta-learning aims to solve the problem of “fast adaptation on new training data”; there are two categories of meta-learning:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Learning a meta-policy for updating model parameters&lt;/li&gt;
  &lt;li&gt;Learning a good parameter initialization for fast adaptation&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Model-agnostic meta-learning (MAML) belongs to the second category.&lt;/p&gt;

&lt;h1 id=&quot;3-meta-learning-for-low-resource-nmt&quot;&gt;3. Meta Learning for Low-Resource NMT&lt;/h1&gt;

&lt;p&gt;MAML aims to find a proper initialization of parameters $\theta^0$ based on a set of tasks ${T^1, T^2, …, T^K}$ so that the model can learn a new target task $T^0$ with only a small amount of training data.&lt;/p&gt;

&lt;p&gt;The process can be understood as:&lt;/p&gt;

\[\theta^*=\text{Learn}(T^0; \text{MetaLearn}(T^1, ..., T^K)).\]

&lt;p&gt;In the context of NMT, each language pair is regarded as a different task. &lt;strong&gt;The objective is to find an initialization from high-resource language-pairs to fast adapt the model to low-resource pairs.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The overall illustration is shown in Figure 1.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/houwx.net/posts/2020-08-11-blog-post-25-1.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;


&lt;h2 id=&quot;31-learn-language-specific-learning&quot;&gt;3.1. Learn: language-specific learning&lt;/h2&gt;

&lt;p&gt;The language-specific learning process $\text{Learn}(D_T;\theta^0)$ is formulated to maximize the log-posterior given data $D_T$ and randomly initialized or meta-learned parameters $\theta^0$ :&lt;/p&gt;

\[\text{Learn}(D_T;\theta^0)=\text{argmax}_\theta \mathcal{L}^{D_T}(\theta)=\text{argmax}_\theta \sum_{(X, Y)\in D_T}\log p(Y|X, \theta)-\beta||\theta -\theta^0||^2,\]

&lt;p&gt;note that &lt;strong&gt;the second term is used to discourage the newly learned parameters from deviating too much from the initialization $\theta^0$, alleviating the overfitting issue&lt;/strong&gt;.&lt;/p&gt;

&lt;h2 id=&quot;32-metalearn&quot;&gt;3.2. MetaLearn&lt;/h2&gt;

&lt;p&gt;The meta-objective to find the initialization $\theta^0$ is given by:&lt;/p&gt;

\[\mathcal{L}(\theta)=\mathbb{E}_k\mathbb{E}_{D_{T^k}, D&apos;_{T^k}}\left[ \sum_{(X, Y)\in D&apos;_{T^k}}\log p(Y|X;Learn(D_{T^k}; \theta))\right],\]

&lt;p&gt;where $k\sim\mathcal{U}({1, …, K})$ refers to the $k$-th meta-learning episode. For each episode, task $T^k$ is uniformly chosen at random. &lt;strong&gt;$D_{T^k}$ and $D’_{T^k}$ are subsets of training examples for learning and evaluating, respectively.&lt;/strong&gt; They are sampled independently from the chosen task $T^k$.&lt;/p&gt;

&lt;p&gt;In learning process, the model parameters are updated by:&lt;/p&gt;

\[\theta_k&apos;=\text{Learn}(D_{T^k};\theta)=\theta-\eta \nabla_\theta \mathcal{L}^{D_{T^k}}(\theta),\]

&lt;p&gt;note that this update is not actually applied to the meta-model $\theta$; it is only a simulated (inner-loop) step.&lt;/p&gt;

&lt;p&gt;By applying the updated parameters $\theta&apos;_k$ to the evaluation set $D&apos;_{T^k}$, &lt;strong&gt;the meta-model $\theta$ is updated with the meta-gradient computed on the evaluation set.&lt;/strong&gt; As shown in the formula below, it is possible to aggregate multiple episodes before updating:&lt;/p&gt;

\[\theta \leftarrow \theta-\eta&apos; \sum_k \nabla_\theta \mathcal{L}^{D&apos;_{T^k}}(\theta&apos;_k),\]

&lt;p&gt;where $\eta’$ is the meta-learning rate.&lt;/p&gt;
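
&lt;p&gt;A minimal first-order sketch of one meta-learning episode (sample a task, simulate the inner Learn step on $D_{T^k}$, then update the meta-parameters with gradients from $D&apos;_{T^k}$), assuming a hypothetical &lt;code&gt;loss_fn(model, batch)&lt;/code&gt; that returns the NMT loss; the proximal regularization term is omitted and the simplified first-order rule described in the next subsection is used in place of the full meta-gradient:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;
import copy
import random
import torch

def meta_episode(meta_model, tasks, loss_fn, inner_lr=1e-3, meta_lr=1e-4, inner_steps=1):
    # tasks: list of (train_batches, eval_batches), one entry per source language pair
    train_batches, eval_batches = random.choice(tasks)

    # Learn: simulate language-specific training starting from the meta-parameters
    learner = copy.deepcopy(meta_model)
    inner_opt = torch.optim.SGD(learner.parameters(), lr=inner_lr)
    for step in range(inner_steps):
        inner_opt.zero_grad()
        loss_fn(learner, train_batches[step % len(train_batches)]).backward()
        inner_opt.step()

    # MetaLearn: evaluate the adapted learner on held-out data D' and use its
    # gradients as first-order meta-gradients for the meta-parameters
    learner.zero_grad()
    loss_fn(learner, random.choice(eval_batches)).backward()
    with torch.no_grad():
        for meta_p, p in zip(meta_model.parameters(), learner.parameters()):
            if p.grad is not None:
                meta_p -= meta_lr * p.grad
&lt;/code&gt;&lt;/pre&gt;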

&lt;h3 id=&quot;meta-gradient&quot;&gt;Meta-Gradient&lt;/h3&gt;

&lt;p&gt;Based on the property below:&lt;/p&gt;

\[H(x)v \approx \frac{\nabla (x+uv)-\nabla (x)}{u},\]

&lt;p&gt;meta-gradient is approximated as follows:&lt;/p&gt;

\[\nabla_\theta \mathcal{L}^{D&apos;}(\theta&apos;)=\nabla_{\theta&apos;} \mathcal{L}^{D&apos;}(\theta&apos;) \nabla_{\theta} \left(\theta -\eta \nabla \mathcal{L}^{D}(\theta)\right)=\nabla_{\theta&apos;} \mathcal{L}^{D&apos;}(\theta&apos;) -\eta \nabla_{\theta&apos;}\mathcal{L}^{D&apos;}(\theta&apos;)H_\theta(\mathcal{L}^D(\theta))\\
\approx \nabla_{\theta&apos;} \mathcal{L}^{D&apos;}(\theta&apos;) -\frac{\eta}{u} \left[ \left .\nabla_{\theta}\mathcal{L}^{D}(\theta)\right|_{\hat{\theta}}-	\left .\nabla_{\theta}\mathcal{L}^{D}(\theta)\right|_{\theta} \right],\]

&lt;p&gt;where $u$ is a small constant and&lt;/p&gt;

\[\hat{\theta}=\theta + u\nabla_{\theta&apos;}\mathcal{L}^{D&apos;}(\theta&apos;)\]

&lt;p&gt;&lt;strong&gt;In practice, the authors omitted the second term by using the simplified rule&lt;/strong&gt;:&lt;/p&gt;

\[\nabla_\theta \mathcal{L}^{D&apos;}(\theta&apos;)\approx \nabla_{\theta&apos;} \mathcal{L}^{D&apos;}(\theta&apos;)\]

&lt;h3 id=&quot;comparison-with-related-works&quot;&gt;Comparison with related works&lt;/h3&gt;

&lt;p&gt;As we can see from Figure 2, the difference between transfer learning and meta learning lies in that the former aims at directly solving the source tasks. On the other hand, &lt;strong&gt;meta-learning is to be useful for fine-tuning on various tasks including the source and target tasks.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/houwx.net/posts/2020-08-11-blog-post-25-2.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;


&lt;h2 id=&quot;33-unified-lexical-representation&quot;&gt;3.3. Unified Lexical Representation&lt;/h2&gt;

&lt;p&gt;One limitation of meta-learning is that it assumes the input and output spaces are shared across all the tasks.&lt;/p&gt;

&lt;h3 id=&quot;unified-lexical-representation-ulr&quot;&gt;Unified Lexical Representation (ULR)&lt;/h3&gt;

&lt;p&gt;ULR starts with multilingual word embedding matrices $\epsilon^k_{\text{query}}\in \mathbb{R}^{|V_k|\times d}$ pretrained on monolingual corpora, where $V_k$ is the vocabulary of the $k$-th language. One of these languages is used to build the universal lexical representation, consisting of a universal embedding matrix $\epsilon_u \in \mathbb{R}^{M\times d}$ and a corresponding key matrix $\epsilon_{\text{key}} \in \mathbb{R}^{M\times d}$, where $M&amp;lt;|V’_k|$.&lt;/p&gt;

&lt;p&gt;Both $\epsilon^k_{\text{query}}$ and $\epsilon_{\text{key}}$ are fixed during meta-learning.&lt;/p&gt;

&lt;p&gt;The language-specific embedding of token $x$ from language $k$ is computed as the convex sum of universal embedding vectors:&lt;/p&gt;

\[\epsilon^0[x]=\sum_{i=1}^M \alpha_i \epsilon_u[i],\]

&lt;p&gt;where&lt;/p&gt;

\[\alpha_i \propto \exp\{-\frac{1}{\tau}\epsilon_{\text{key}}[i]^T A \epsilon^k_{\text{query}}[x]\}\]

&lt;p&gt;and $\tau=0.05$. This approach lets the model represent the vocabulary of any language with a fixed number of shared parameters ($\epsilon_u$, $\epsilon_{\text{key}}$ and $A$).&lt;/p&gt;
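
&lt;p&gt;A small sketch of this attention-based lookup, assuming fixed per-language query embeddings and fixed keys; all dimensions are illustrative:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;
import torch
import torch.nn.functional as F

def ulr_embedding(token_ids, query_emb, key_emb, universal_emb, A, tau=0.05):
    # query_emb: (|V_k|, d) fixed monolingual embeddings of language k
    # key_emb, universal_emb: (M, d) keys and universal embedding slots
    # A: (d, d) shared projection learned during meta-learning
    q = query_emb[token_ids]                                # (T, d) query vectors for the tokens
    scores = torch.matmul(torch.matmul(key_emb, A), q.t())  # (M, T) unnormalized attention
    alpha = F.softmax(scores / tau, dim=0)                  # convex weights over the M slots
    return torch.matmul(alpha.t(), universal_emb)           # (T, d) token embeddings

emb = ulr_embedding(torch.tensor([3, 7, 1]), torch.randn(100, 16),
                    torch.randn(40, 16), torch.randn(40, 16), torch.eye(16))
&lt;/code&gt;&lt;/pre&gt;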

&lt;h3 id=&quot;learning-of-ulr&quot;&gt;Learning of ULR&lt;/h3&gt;

&lt;p&gt;During language-specific learning, the authors estimate the change to each embedding vector by a separate parameter $\triangle \epsilon^k[x]$ &lt;strong&gt;to avoid directly updating the universal embedding&lt;/strong&gt;:&lt;/p&gt;

\[\epsilon^k[x]=\epsilon^0[x] + \triangle \epsilon^k[x]\]

&lt;p&gt;During language-specific learning, the first term is kept fixed and only $\triangle \epsilon^k[x]$ is updated, while during the meta-learning stage only the first term ($\epsilon_u$ and $A$) is updated.&lt;/p&gt;

&lt;h1 id=&quot;4-experiments&quot;&gt;4. Experiments&lt;/h1&gt;

&lt;h2 id=&quot;41-dataset&quot;&gt;4.1. Dataset&lt;/h2&gt;

&lt;h3 id=&quot;source-tasks&quot;&gt;Source Tasks&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Europarl&lt;/strong&gt;: Bulgarian (Bg), Czech (Cs), Danish (Da), German (De), Greek (El), Spanish (Es), Estonian (Et), French (Fr), Hungarian (Hu), Italian (It), Lithuanian (Lt), Dutch (Nl), Polish (Pl), Portuguese (Pt), Slovak (Sk), Slovene (Sl) and Swedish (Sv)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WMT’17&lt;/strong&gt;: Russian (Ru) (2M pairs subset)&lt;/p&gt;

&lt;h3 id=&quot;target-tasks&quot;&gt;Target Tasks&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;WMT’16&lt;/strong&gt;: Romanian (Ro)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WMT’17&lt;/strong&gt;: Latvian (Lv), Finnish (Fi), Turkish (Tr)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Korean Parallel Dataset&lt;/strong&gt;: Korean (Ko)&lt;/p&gt;

&lt;p&gt;Ro-En or Lv-En is used as a validation set for meta-learning.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/houwx.net/posts/2020-08-11-blog-post-25-3.jpg&quot; alt=&quot;2020-08-11-blog-post-25-3&quot; style=&quot;zoom:45%;&quot; /&gt;&lt;/p&gt;


&lt;h2 id=&quot;42-model-and-learning&quot;&gt;4.2. Model and Learning&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Model&lt;/strong&gt;: Transformer with the default hyper-parameter setting&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Training &amp;amp; Fine-tuning&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;During meta-learning, all the parameters are updated, but during fine-tuning, three strategies are considered for updating the model (see the sketch after this list):&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;updating all the modules (all)&lt;/li&gt;
  &lt;li&gt;updating the embedding and encoder only (emb+enc)&lt;/li&gt;
  &lt;li&gt;updating the embedding only (emb)&lt;/li&gt;
&lt;/ol&gt;
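
&lt;p&gt;A small sketch of switching between these strategies by freezing parameters, assuming the model exposes child modules named &lt;code&gt;embedding&lt;/code&gt;, &lt;code&gt;encoder&lt;/code&gt; and &lt;code&gt;decoder&lt;/code&gt; (the names are illustrative, not from the paper):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;
def set_finetune_strategy(model, strategy='all'):
    # strategy: 'all', 'emb+enc' (embedding and encoder) or 'emb' (embedding only)
    trainable = {
        'all': ['embedding', 'encoder', 'decoder'],
        'emb+enc': ['embedding', 'encoder'],
        'emb': ['embedding'],
    }[strategy]
    for name, module in model.named_children():
        keep = name in trainable
        for p in module.parameters():
            p.requires_grad = keep   # frozen modules get requires_grad = False
&lt;/code&gt;&lt;/pre&gt;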

&lt;h1 id=&quot;5-results&quot;&gt;5. Results&lt;/h1&gt;

&lt;h3 id=&quot;vs-multilingual-transfer-learning&quot;&gt;vs. Multilingual Transfer Learning&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;From Figure 3 below, we can observe significant improvement of meta-learning &lt;strong&gt;compared with multilingual transfer learning strategy&lt;/strong&gt;.&lt;/li&gt;
  &lt;li&gt;Note that the training sets are only subsampled sets with around 16,000 English tokens. However, the best fine-tuned MetaNMT achieves 2/3 (Ro-En) and 1/2 (rest) of the BLEU score achieved by the supervised model trained on full training sets (as shown in Table 1 above).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;img src=&quot;/houwx.net/posts/2020-08-11-blog-post-25-5.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;


&lt;h3 id=&quot;impact-of-validation-tasks&quot;&gt;Impact of Validation Tasks&lt;/h3&gt;

&lt;p&gt;Furthermore, we can notice the impact of validation tasks. &lt;strong&gt;Fi-En benefits more when Ro-En is used for validation ((c) in Figure 3)&lt;/strong&gt;, while the opposite happens with Tr-En. The relationship between the task similarity and the impact of a validation task remains to be further investigated in the future.&lt;/p&gt;

&lt;h3 id=&quot;training-size&quot;&gt;Training Size&lt;/h3&gt;

&lt;p&gt;From Figure 4, we can also observe that the BLEU curve of MetaNMT is flatter. As the target task’s training set grows, the gap between MetaNMT and MultiNMT shrinks, &lt;strong&gt;indicating the robustness of MetaNMT in handling low-resource language pairs&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/houwx.net/posts/2020-08-11-blog-post-25-6.jpg&quot; alt=&quot;2020-08-11-blog-post-25-6&quot; style=&quot;zoom:70%;&quot; /&gt;&lt;/p&gt;


&lt;h3 id=&quot;impact-of-source-tasks&quot;&gt;Impact of Source Tasks&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;It can be inferred from Table 2 that using more source tasks is always beneficial to MetaNMT, there is up to 2x improvement from one source task (Es) to 18 source tasks (All).&lt;/li&gt;
  &lt;li&gt;The choice of source languages also has an impact on the target languages. For instance, comparing {De Ru} and {Es Fr It Pt}, the latter benefits Ro-En more, but the former benefits all the other pairs more.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;img src=&quot;/houwx.net/posts/2020-08-11-blog-post-25-4.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;


&lt;h3 id=&quot;training-curves&quot;&gt;Training Curves&lt;/h3&gt;

&lt;p&gt;Compared with MetaNMT, we can observe from Figure 5 that MultiNMT saturates rapidly and eventually degrades (overfitting), whereas MetaNMT continues to improve and never degrades.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/houwx.net/posts/2020-08-11-blog-post-25-7.jpg&quot; alt=&quot;2020-08-11-blog-post-25-7&quot; style=&quot;zoom:50%;&quot; /&gt;&lt;/p&gt;


&lt;h3 id=&quot;sample-translations&quot;&gt;Sample Translations&lt;/h3&gt;

&lt;p&gt;Table 3 presents zero-shot and meta-learned examples for Tr-En and Ko-En.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Zero-shot examples provide a word-by-word translation without re-ordering&lt;/strong&gt;, demonstrating the success of applying universal lexical representation and meta-learned initialization.&lt;/li&gt;
  &lt;li&gt;After 600 sentence pairs (16,000 English tokens), the model rapidly learns to re-order tokens and produces better translation.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;img src=&quot;/houwx.net/posts/2020-08-11-blog-post-25-8.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;


&lt;h1 id=&quot;6-conclusion&quot;&gt;6. Conclusion&lt;/h1&gt;

&lt;p&gt;Contributions of this paper include:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Proposes MetaNMT, a &lt;strong&gt;meta-learning&lt;/strong&gt; approach for low-resource neural machine translation.&lt;/li&gt;
  &lt;li&gt;Applies a &lt;strong&gt;universal lexical representation&lt;/strong&gt; to tackle the I/O mismatch problem across language pairs.&lt;/li&gt;
  &lt;li&gt;Shows that MetaNMT significantly &lt;strong&gt;outperforms the multilingual transfer learning&lt;/strong&gt; based method on low-resource tasks.&lt;/li&gt;
&lt;/ol&gt;</content><author><name>Wenxin Hou 侯汶昕</name><email>houwx001@gmail.com</email></author><category term="NLP" /><category term="Text Summarization" /><category term="Attention" /><summary type="html">Last Updated: 2020-08-12</summary></entry><entry><title type="html">[Paper-NLP] Get To The Point: Summarization with Pointer-Generator Networks</title><link href="https://houwx.net/posts/2020/01/blog-post-22/" rel="alternate" type="text/html" title="[Paper-NLP] Get To The Point: Summarization with Pointer-Generator Networks" /><published>2020-07-09T00:00:00-07:00</published><updated>2020-07-09T00:00:00-07:00</updated><id>https://houwx.net/posts/2020/01/blog-post-22-pointer-network</id><content type="html" xml:base="https://houwx.net/posts/2020/01/blog-post-22/">&lt;p&gt;Last Updated: 2020-08-11&lt;/p&gt;

&lt;p&gt;Slides used for my final presentation in Language Engineering at Tokyo Tech: &lt;a href=&quot;/houwx.net/files/slides/2017_acl_pointer-network.pdf&quot;&gt;[slides]&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This paper:  &lt;a href=&quot;https://arxiv.org/pdf/1704.04368.pdf&quot;&gt;Get To The Point: Summarization with Pointer-Generator Networks&lt;/a&gt; is proposed by researchers from Stanford and Google.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code&lt;/strong&gt;: https://github.com/rohithreddy024/Text-Summarizer-Pytorch (PyTorch, not official), https://github.com/abisee/pointer-generator (TensorFlow 1.x, official)&lt;/p&gt;

&lt;p&gt;This paper introduces Pointer-Generator Network which solves the two &lt;strong&gt;shortcomings&lt;/strong&gt; of current models for &lt;strong&gt;abstractive&lt;/strong&gt; text summarization:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;they are liable to reproduce factual details inaccurately&lt;/li&gt;
  &lt;li&gt;they tend to repeat themselves&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Novelty&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;use a hybrid pointer-generator network that can &lt;strong&gt;copy words from the source text&lt;/strong&gt; via &lt;strong&gt;pointing&lt;/strong&gt;,  which aids accurate reproduction of information, while retaining the ability to produce novel words through the generator.&lt;/li&gt;
  &lt;li&gt;use coverage to keep track of what has been summarized, which discourages repetition&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;1-introduction&quot;&gt;1. Introduction&lt;/h2&gt;

&lt;p&gt;Two approaches to summarization: extractive and abstractive.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Extractive methods assemble summaries exclusively from passages (usually whole sentences) taken directly from the source text&lt;/li&gt;
  &lt;li&gt;abstractive methods may generate novel words and phrases not featured in the source text – as a human-written abstract usually does.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A comparison of the proposed method and the baseline Seq2Seq model with attention, from which we can see that the &lt;strong&gt;pointer-generator&lt;/strong&gt; solves the out-of-vocabulary (&lt;strong&gt;OOV&lt;/strong&gt;) problem while &lt;strong&gt;coverage&lt;/strong&gt; eliminates the &lt;strong&gt;repetition&lt;/strong&gt; problem:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/houwx.net/posts/2020-07-09-blog-post-22-1.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;


&lt;h2 id=&quot;2-our-models&quot;&gt;2. Our Models&lt;/h2&gt;

&lt;h3 id=&quot;21-sequence-to-sequence-attentional-model-baseline&quot;&gt;2.1. Sequence-to-sequence attentional model (baseline)&lt;/h3&gt;

&lt;p&gt;The model architecture is shown below:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/houwx.net/posts/2020-07-09-blog-post-22-2.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;


&lt;p&gt;&lt;strong&gt;Explanation&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Attention distribution/weights (Blue)&lt;/strong&gt;: &lt;strong&gt;a^t&lt;/strong&gt; at &lt;strong&gt;decoding&lt;/strong&gt; time step &lt;strong&gt;t&lt;/strong&gt; over the encoder hidden states &lt;strong&gt;h_i&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

\[a^t=\text{softmax}(e^t), \quad \text{where} \quad e_i^t=v^T \tanh(W_h h_i + W_s s_t + b_{attn}),\]


&lt;p&gt;where &lt;strong&gt;v, W_h, W_s, b_{attn}&lt;/strong&gt; are learnable parameters.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Context Vector&lt;/strong&gt;: weighted average of encoder hidden states &lt;strong&gt;h_i&lt;/strong&gt; with attention weights &lt;strong&gt;a^t&lt;/strong&gt; used at decoding time step &lt;strong&gt;t&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

\[h^*_t=\sum_{i}{a_i^t * h_i}\]

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Vocabulary distribution (Green)&lt;/strong&gt;: the probability distribution over all words in the vocabulary.&lt;/li&gt;
&lt;/ol&gt;

\[P_{vocab}=\text{softmax}(V&apos;(V[s_t; h_t^*]+b)+b&apos;)\]

&lt;p&gt;More intuitively,&lt;/p&gt;

\[P_{vocab}=\text{softmax}(\text{Linear}(\text{Linear}([s_t; h_t^*])))\]


&lt;p&gt;We can see that the input is just the &lt;strong&gt;concatenation of context vector and decoder state&lt;/strong&gt; at time step &lt;strong&gt;t&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Loss function&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;During training, the loss for time step &lt;strong&gt;t&lt;/strong&gt; is the &lt;strong&gt;negative log likelihood&lt;/strong&gt; (NLLLoss) of the target word &lt;strong&gt;w^∗_t&lt;/strong&gt; for that timestep:&lt;/p&gt;

\[loss_t = -\log P(w_t^*)\]

&lt;p&gt;and the &lt;strong&gt;overall loss for the whole sequence&lt;/strong&gt; is:&lt;/p&gt;

\[loss=\frac{1}{T}\sum_{t=0}^{T-1} loss_t=-\frac{1}{T}\sum_{t=0}^{T-1}\log P(w^*_t)\]
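
&lt;p&gt;To make the components above concrete, here is a minimal PyTorch-style sketch (my own, not the authors’ code) of one decoding step; &lt;strong&gt;proj1&lt;/strong&gt; and &lt;strong&gt;proj2&lt;/strong&gt; stand in for the two stacked linear layers producing &lt;strong&gt;P_vocab&lt;/strong&gt;, and all shapes are illustrative.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import torch
import torch.nn.functional as F

def attention_decoder_step(h, s_t, W_h, W_s, b_attn, v, proj1, proj2):
    # h:   encoder hidden states, shape (T_src, hidden)
    # s_t: decoder state at decoding step t, shape (hidden,)
    # attention scores: e_i = v^T tanh(W_h h_i + W_s s_t + b_attn)
    e = torch.tanh(h @ W_h.T + s_t @ W_s.T + b_attn) @ v
    a = F.softmax(e, dim=0)                    # attention distribution a^t
    context = (a.unsqueeze(1) * h).sum(dim=0)  # context vector h*_t
    # vocabulary distribution from the two stacked linear layers (proj1, proj2)
    p_vocab = F.softmax(proj2(proj1(torch.cat([s_t, context]))), dim=0)
    return a, context, p_vocab
&lt;/code&gt;&lt;/pre&gt;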

&lt;h3 id=&quot;22-pointer-generator-network&quot;&gt;2.2. Pointer-generator network&lt;/h3&gt;

&lt;p&gt;The model architecture is shown below:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/houwx.net/posts/2020-07-09-blog-post-22-3.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;


&lt;p&gt;&lt;strong&gt;Explanation&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;The first three components are the same as the baseline model as explained in section 2.1.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Attention distribution/weights (Blue)&lt;/strong&gt;: &lt;strong&gt;a^t&lt;/strong&gt; at &lt;strong&gt;decoding&lt;/strong&gt; time step &lt;strong&gt;t&lt;/strong&gt; over the encoder hidden states &lt;strong&gt;h_i&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

\[a^t=\text{softmax}(e^t), \quad \text{where} \quad e_i^t=v^T \tanh(W_h h_i + W_s s_t + b_{attn}),\]

&lt;p&gt;where &lt;strong&gt;v, W_h, W_s, b_{attn}&lt;/strong&gt; are learnable parameters.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Context Vector&lt;/strong&gt;: weighted average of encoder hidden states &lt;strong&gt;h_i&lt;/strong&gt; with attention weights &lt;strong&gt;a^t&lt;/strong&gt; used at decoding time step &lt;strong&gt;t&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

\[h^*_t=\sum_{i}{a_i^t * h_i}\]

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Vocabulary distribution (Green)&lt;/strong&gt;: the probability distribution over all words in the vocabulary.&lt;/li&gt;
&lt;/ol&gt;

\[P_{vocab}=\text{softmax}(V&apos;(V[s_t; h_t^*]+b)+b&apos;)\]

&lt;p&gt;More intuitively,&lt;/p&gt;

\[P_{vocab}=\text{softmax}(\text{Linear}(\text{Linear}([s_t; h_t^*])))\]


&lt;p&gt;which serves as the final distribution of the model output:&lt;/p&gt;

\[P(w)=P_{vocab}(w)\]

&lt;p&gt;The components below are introduced by pointer-generator to bring improvements:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Generation probability (Yellow)&lt;/strong&gt;: the probability of generating a word from the vocabulary at decoding time step &lt;strong&gt;t&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

\[p_{gen}=\text{sigmoid}(w_{h^*}^T h_t^*+w_s^T s_t+w_x^T x_t + b_{ptr})\]

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Final distribution&lt;/strong&gt;: probability distribution over the extended vocabulary composed of the &lt;strong&gt;vocabulary&lt;/strong&gt; and &lt;strong&gt;words from the source text&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

\[P(w)=p_{gen}P_{vocab}(w)+(1-p_{gen})\sum_{i:w_i=w}a_i^t\]

&lt;p&gt;&lt;strong&gt;Loss function:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The loss function is as described in equations (5) and (6), but with respect to the modified probability distribution &lt;strong&gt;P(w)&lt;/strong&gt; given in equation (13).&lt;/p&gt;
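
&lt;p&gt;A minimal sketch (based on the equations above, not the official code) of how the final distribution over the extended vocabulary can be assembled with a scatter-add of the attention weights onto the source token indices:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import torch

def final_distribution(p_vocab, a, p_gen, src_ids, extended_size):
    # p_vocab: (V,) softmax over the fixed vocabulary
    # a:       (T_src,) attention distribution at this decoding step
    # p_gen:   scalar in (0, 1), the sigmoid gate of equation (12)
    # src_ids: (T_src,) long tensor of source-token indices in the extended vocabulary
    p = torch.zeros(extended_size)
    p[: p_vocab.size(0)] = p_gen * p_vocab
    # route the copy probability (1 - p_gen) * a_i onto each source token
    return p.scatter_add(0, src_ids, (1 - p_gen) * a)
&lt;/code&gt;&lt;/pre&gt;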

&lt;h3 id=&quot;23-coverage-mechanism&quot;&gt;2.3. Coverage mechanism&lt;/h3&gt;

&lt;p&gt;Coverage mechanism is proposed to solve the problem of &lt;strong&gt;repetition&lt;/strong&gt; in Seq2Seq models.&lt;/p&gt;

&lt;p&gt;A coverage vector &lt;strong&gt;c^t&lt;/strong&gt; is maintained, which is the sum of attention distributions &lt;strong&gt;over all previous decoder timesteps&lt;/strong&gt;:&lt;/p&gt;

\[c^t=\sum_{t&apos;=0}^{t-1}a^{t&apos;}\]

&lt;p&gt;Note that &lt;strong&gt;c^0 is a zero vector&lt;/strong&gt;, because on the first timestep, none of the source document has been covered.&lt;/p&gt;

&lt;p&gt;The coverage vector is used as &lt;strong&gt;extra input to the attention mechanism&lt;/strong&gt;:&lt;/p&gt;

\[a^t=\text{softmax}(e^t), \quad \text{where} \quad e_i^t=v^T \tanh(W_h h_i + W_s s_t + w_c c_i^t + b_{attn})\]

&lt;p&gt;where &lt;strong&gt;w_c&lt;/strong&gt; is a learnable parameter vector of same length as &lt;strong&gt;v&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The authors also find it necessary (see section 5) to additionally define a &lt;strong&gt;coverage loss&lt;/strong&gt; to penalize repeatedly attending to the same locations:&lt;/p&gt;

\[\text{covloss}_t=\sum_i \min(a_i^t, c_i^t)\]

&lt;p&gt;Note that the coverage loss is bounded; in particular &lt;strong&gt;equal to or less than 1&lt;/strong&gt; (sum of softmax attention).&lt;/p&gt;

&lt;p&gt;Finally, the coverage loss, reweighted by some hyperparameter &lt;strong&gt;λ&lt;/strong&gt;, is added to the primary loss function to yield a new composite loss function:&lt;/p&gt;

\[loss=\frac{1}{T}\sum_{t=0}^{T-1} loss_t=\frac{1}{T}\sum_{t=0}^{T-1}\left(-\log P(w^*_t) +\lambda \sum_i \min(a_i^t, c_i^t) \right)\]
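
&lt;p&gt;A small sketch of the coverage bookkeeping (my own illustration), assuming the per-step attention distributions are already available; it accumulates &lt;strong&gt;c^t&lt;/strong&gt; and sums &lt;strong&gt;min(a_i^t, c_i^t)&lt;/strong&gt; as in the composite loss above:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import torch

def coverage_loss(attn_per_step):
    # attn_per_step: list of (T_src,) attention distributions a^0, ..., a^{T-1}
    c = torch.zeros_like(attn_per_step[0])    # c^0 is the zero vector
    total = torch.zeros(())
    for a in attn_per_step:
        total = total + torch.minimum(a, c).sum()   # covloss_t, bounded by 1
        c = c + a                                   # accumulate attention into c^{t+1}
    return total / len(attn_per_step)
&lt;/code&gt;&lt;/pre&gt;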

&lt;h2 id=&quot;3-related-work&quot;&gt;3. Related Work&lt;/h2&gt;

&lt;p&gt;Skipped&lt;/p&gt;

&lt;h2 id=&quot;4-dataset&quot;&gt;4. Dataset&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;CNN/Daily Mail dataset&lt;/strong&gt; (Hermann et al., 2015; Nallapati et al., 2016), which contains online news articles (781 tokens on average) paired with multi-sentence summaries (3.75 sentences or 56 tokens on average).&lt;/p&gt;

&lt;p&gt;We use the scripts supplied by Nallapati et al. (2016) to obtain the same version of the data, which has &lt;strong&gt;287,226 training pairs, 13,368 validation pairs and 11,490 test pairs&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Both of the dataset’s published results (Nallapati et al., 2016, 2017) use the anonymized version of the data, which has been pre-processed to replace each named entity, e.g., The United Nations, with its own unique identifier for the example pair, e.g., @entity5. By contrast, we operate directly on the original text (or non-anonymized version of the data), which we believe is the favorable problem to solve because it requires no pre-processing.&lt;/p&gt;

&lt;h2 id=&quot;5-experiments&quot;&gt;5. Experiments&lt;/h2&gt;

&lt;p&gt;For all experiments, our model has 256-dimensional hidden states and 128-dimensional word embeddings. For the pointer-generator models, we use a vocabulary of &lt;strong&gt;50k words&lt;/strong&gt; for both source and target – note that due to the pointer network’s ability to handle OOV words, we can use &lt;strong&gt;a smaller vocabulary size than Nallapati et al.’s (2016) 150k source and 60k target vocabularies&lt;/strong&gt;. For the baseline model, we also try a larger vocabulary size of 150k.&lt;/p&gt;

&lt;p&gt;Note that the pointer and the coverage mechanism &lt;strong&gt;introduce very few additional parameters to the network&lt;/strong&gt;:  for the models with vocabulary size 50k, the baseline model has 21,499,600 parameters, the pointer-generator adds 1153 (&lt;strong&gt;256 * 2 * 2 + 128 + 1&lt;/strong&gt;) extra parameters (&lt;strong&gt;w_{h^∗}&lt;/strong&gt;, &lt;strong&gt;w_s&lt;/strong&gt;, &lt;strong&gt;w_x&lt;/strong&gt;, and &lt;strong&gt;b_{ptr}&lt;/strong&gt; in equation 12), and coverage adds 512 (&lt;strong&gt;256 * 2 directions&lt;/strong&gt;) extra parameters (&lt;strong&gt;w_c&lt;/strong&gt; in equation 15).&lt;/p&gt;

&lt;p&gt;Training details can be found in the paper.&lt;/p&gt;

&lt;p&gt;About the coverage mechanism:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;To obtain our final coverage model, we added the coverage mechanism with coverage loss weighted to λ=1 (as described in equation 17), and trained for a further 3000 iterations (about 2 hours).&lt;/p&gt;

  &lt;p&gt;We tried training  the  coverage  model  without the loss function but found this to be ineffective, with no discernible reduction in repetition. We also tried training with coverage from the first iteration rather than as a separate training phase,but found that in the early phase of training, the coverage objective &lt;strong&gt;interfered with the main objective, reducing overall performance&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;6-results&quot;&gt;6. Results&lt;/h2&gt;

&lt;h3 id=&quot;61-preliminaries&quot;&gt;6.1. Preliminaries&lt;/h3&gt;

&lt;p&gt;Results are given in Table 1. The models are evaluated with the standard &lt;strong&gt;ROUGE metric&lt;/strong&gt;, reporting the F1 scores for ROUGE-1, ROUGE-2 and ROUGE-L (which respectively measure the word-overlap, bigram-overlap, and longest common subsequence between the reference summary and the summary to be evaluated). ROUGE scores are obtained using the &lt;strong&gt;pyrouge&lt;/strong&gt; package (pypi.python.org/pypi/pyrouge/0.1.3).&lt;/p&gt;

&lt;p&gt;We also evaluate with the &lt;strong&gt;METEOR metric&lt;/strong&gt;, both in exact match mode (rewarding only exact matches between words) and full mode (which additionally rewards matching stems, synonyms and para-phrases). (http://www.cs.cmu.edu/~alavie/METEOR)&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/houwx.net/posts/2020-07-09-blog-post-22-4.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;


&lt;h3 id=&quot;62-observations&quot;&gt;6.2. Observations&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Baselines&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;Both baseline models perform poorly with respect to ROUGE and METEOR, and the larger vocabulary size (150k) does not seem to help (the 50k baseline even does slightly better).&lt;/p&gt;

&lt;p&gt;Factual details are frequently reproduced incorrectly, often replacing an uncommon (but in-vocabulary) word with a more common alternative. For example, in Figure 1 the baseline model appears to struggle with the rare word &lt;strong&gt;&lt;em&gt;thwart&lt;/em&gt;&lt;/strong&gt;, producing &lt;strong&gt;&lt;em&gt;destabilize&lt;/em&gt;&lt;/strong&gt; instead, which leads to the fabricated phrase &lt;strong&gt;&lt;em&gt;destabilize nigeria’s economy&lt;/em&gt;&lt;/strong&gt;. Even more catastrophically, the summaries sometimes devolve into &lt;strong&gt;repetitive nonsense, such as the third sentence&lt;/strong&gt; produced by the baseline model in Figure 1. In addition, the baseline model cannot reproduce OOV words (such as &lt;strong&gt;&lt;em&gt;muhammadu buhari&lt;/em&gt;&lt;/strong&gt; in Figure 1). Further examples of all these problems are provided in the supplementary material.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pointer-generator&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;Pointer-generator model achieves much better ROUGE and METEOR scores than the baseline, despite many &lt;strong&gt;fewer training epochs&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;OOV words are handled  easily, factual details are almost always copied correctly, and there are no fabrications (see Figure 1). However, &lt;strong&gt;repetition&lt;/strong&gt; is still very common.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pointer-generator with coverage&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;The pointer-generator model with coverage improves the ROUGE and METEOR scores further, convincingly surpassing the best abstractive model of Nallapati et al. (2016).&lt;/p&gt;

&lt;p&gt;Despite the brevity of the coverage training phase (about 1% of the total training time), the repetition problem is almost completely eliminated, which can be seen both qualitatively (Figure 1) and quantitatively (Figure 4). However, our best model does not quite surpass the ROUGE scores of the lead-3 baseline, nor the current best extractive model (Nallapati et al., 2017).&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/houwx.net/posts/2020-07-09-blog-post-22-1.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;


&lt;p&gt;&lt;img src=&quot;/houwx.net/posts/2020-07-09-blog-post-22-5.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;


&lt;h2 id=&quot;7-discussion&quot;&gt;7. Discussion&lt;/h2&gt;

&lt;h3 id=&quot;71-comparison-with-extractive-systems&quot;&gt;7.1. Comparison with extractive systems&lt;/h3&gt;

&lt;p&gt;Table 1 shows stronger performance by the extractive systems (including the lead-3 baseline, which summarizes using only the first three sentences).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Explanation&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;News articles tend to be structured with the most important information at the start&lt;/li&gt;
  &lt;li&gt;ROUGE metric naturally prefers extractive approaches, but abstractive summaries are subjective&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;72-how-abstractive-is-our-model&quot;&gt;7.2. How abstractive is our model?&lt;/h3&gt;

&lt;p&gt;Figure 6 shows that the final model copies whole article sentences 35% of the time; by comparison, the reference summaries do so only 1.3% of the time.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/houwx.net/posts/2020-07-09-blog-post-22-6.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;


&lt;h2 id=&quot;8-conclusion&quot;&gt;8. Conclusion&lt;/h2&gt;

&lt;p&gt;This paper introduces the copying mechanism to the abstractive summarization task and further achieves improvements by proposing the coverage mechanism. It is an interesting attempt and obtains relatively promising results.&lt;/p&gt;

&lt;p&gt;In my view, however, there are still several open issues:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;For abstractive text generation, what is a more suitable metric apart from the existing BLEU, ROUGE &amp;amp; METEOR, etc.?&lt;/li&gt;
  &lt;li&gt;The copying mechanism improves model performance but degrades the abstractive capability, which remains a problem to solve.&lt;/li&gt;
&lt;/ol&gt;</content><author><name>Wenxin Hou 侯汶昕</name><email>houwx001@gmail.com</email></author><category term="NLP" /><category term="Classic" /><category term="Text Summarization" /><category term="Attention" /><summary type="html">Last Updated: 2020-08-11</summary></entry><entry><title type="html">[Paper-Vocoder] WaveNet: A Generative Model for Raw Audio</title><link href="https://houwx.net/posts/2020/01/blog-post-21/" rel="alternate" type="text/html" title="[Paper-Vocoder] WaveNet: A Generative Model for Raw Audio" /><published>2020-07-04T00:00:00-07:00</published><updated>2020-07-04T00:00:00-07:00</updated><id>https://houwx.net/posts/2020/01/blog-post-21-wavenet</id><content type="html" xml:base="https://houwx.net/posts/2020/01/blog-post-21/">&lt;p&gt;Last Updated: 2020-07-04&lt;/p&gt;

&lt;p&gt;This paper:  &lt;a href=&quot;https://arxiv.org/pdf/1609.03499.pdf&quot;&gt;WaveNet: A Generative Model for Raw Audio&lt;/a&gt; is proposed by researchers from Google and DeepMind.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code&lt;/strong&gt;: https://github.com/kan-bayashi/PytorchWaveNetVocoder (not official)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Samples&lt;/strong&gt;: https://www.deepmind.com/blog/article/wavenet-generative-model-raw-audio&lt;/p&gt;

&lt;p&gt;This paper introduces WaveNet, a deep neural network for generating raw audio waveforms.&lt;/p&gt;

&lt;h2 id=&quot;1-introduction&quot;&gt;1. Introduction&lt;/h2&gt;

&lt;p&gt;This paper introduces WaveNet, an audio generative model based on the PixelCNN (van den Oordet al., 2016a;b) architecture. The main contributions of this work are as follows:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;It shows that WaveNets can generate raw speech signals with subjective naturalness never before reported in the field of text-to-speech (TTS), as assessed by human raters.&lt;/li&gt;
  &lt;li&gt;In order to deal with &lt;strong&gt;long-range temporal dependencies&lt;/strong&gt; needed for raw audio generation, the authors develop new architectures based on &lt;strong&gt;dilated causal convolutions&lt;/strong&gt;, which &lt;strong&gt;exhibit very large receptive fields&lt;/strong&gt;.&lt;/li&gt;
  &lt;li&gt;It is shown that when conditioned on a speaker identity, a single model can be used to generate different voices.&lt;/li&gt;
  &lt;li&gt;The same architecture shows strong results when tested on a small &lt;strong&gt;speech recognition&lt;/strong&gt; dataset, and is promising when used to generate other audio modalities such as music.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;img src=&quot;/houwx.net/posts/2020-07-04-blog-post-21-1.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;


&lt;p&gt;In general, &lt;strong&gt;WaveNets provide a generic and flexible framework for tackling many applications that rely on audio generation&lt;/strong&gt; (e.g. TTS, music, speech enhancement, voice conversion, source separation).&lt;/p&gt;

&lt;h2 id=&quot;2-wavenet&quot;&gt;2. WaveNet&lt;/h2&gt;

&lt;p&gt;WaveNet is a generative model operating directly on the &lt;strong&gt;raw audio waveform&lt;/strong&gt;. The joint probability of a waveform &lt;strong&gt;x&lt;/strong&gt;={x_1, …, x_T} is factorised as a product of conditional probabilities as follows:&lt;/p&gt;

\[p(\textbf{x})=\prod_{t=1}^T p(x_t|x_1, ..., x_{t-1})\]

&lt;p&gt;Each audio sample x_t is therefore conditioned on the samples at all previous timesteps.&lt;/p&gt;

&lt;p&gt;Similarly to PixelCNNs (van den Oord et al., 2016a;b), the conditional probability distribution is modelled by a stack of convolutional layers.  There are no pooling layers in the network, and the output of the model has the same time dimensionality as the input. The model &lt;strong&gt;outputs a categorical distribution over the next value x_t with a softmax layer&lt;/strong&gt; and it is optimized to maximize the log-likelihood of the data w.r.t.  the parameters.&lt;/p&gt;

&lt;h3 id=&quot;21-dilated-causal-convolutions&quot;&gt;2.1. Dilated Causal Convolutions&lt;/h3&gt;

&lt;p&gt;Illustration of causal convolutional layers are shown in Figure 2.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/houwx.net/posts/2020-07-04-blog-post-21-2.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;


&lt;p&gt;At training time, the conditional predictions for all timesteps can be made in parallel because all timesteps of ground truth &lt;strong&gt;x&lt;/strong&gt; are known.  When generating with the model, the predictions are sequential: &lt;strong&gt;after each sample is predicted, it is fed back into the network to predict the next sample&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Pros and cons of causal convolutions compared to RNNs:&lt;/p&gt;

&lt;p&gt;Because models with causal convolutions do not have recurrent connections, they are typically &lt;strong&gt;faster to train than RNNs&lt;/strong&gt;, especially when applied to very long sequences. &lt;strong&gt;One of the problems of causal convolutions is that they require many layers&lt;/strong&gt;, or large filters to increase the receptive field. For example, in Fig. 2 the receptive field is only 5 (= #layers + filter/kernel length - 1 &lt;strong&gt;= 4 + 2 - 1&lt;/strong&gt;).&lt;/p&gt;

&lt;p&gt;Authors’ improvements:&lt;/p&gt;

&lt;p&gt;In this paper &lt;strong&gt;dilated convolutions are used to increase the receptive field by orders of magnitude&lt;/strong&gt;, without greatly increasing computational cost.&lt;/p&gt;

&lt;p&gt;A dilated convolution is a convolution where the filter/kernel is applied over an area larger than its length &lt;strong&gt;by skipping input values with a certain step&lt;/strong&gt;.  It is &lt;strong&gt;equivalent to a convolution with a larger filter derived from the original filter by dilating it with zeros&lt;/strong&gt;, but is significantly more efficient. A dilated convolution effectively allows the network to operate on a coarser scale than with a normal convolution. This is similar to pooling or strided convolutions, but here &lt;strong&gt;the output has the same size as the input&lt;/strong&gt;.  As a special case, &lt;strong&gt;dilated convolution with dilation 1 yields the standard convolution&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Figure 3 depicts dilated causal convolutions for &lt;strong&gt;dilations 1, 2, 4,and 8&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/houwx.net/posts/2020-07-04-blog-post-21-3.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;


&lt;p&gt;&lt;strong&gt;Stacked dilated convolutions enable networks to have very large receptive fields with just a few layers&lt;/strong&gt;, while preserving the input resolution throughout the network as well as computational efficiency. &lt;strong&gt;In this paper, the dilation is doubled for every layer up to a limit and then repeated&lt;/strong&gt;: e.g. 1, 2, 4, …, 512, 1, 2, 4, …, 512, 1, 2, 4, …, 512.&lt;/p&gt;
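&lt;p&gt;A minimal PyTorch-style sketch (my own illustration) of a dilated causal convolution implemented with left-padding, together with the 1, 2, 4, …, 512 dilation schedule described above; the last lines check the per-block receptive field.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    # causal convolution: pad only on the left so output t never sees inputs after t
    def __init__(self, channels, kernel_size, dilation):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                                 # x: (batch, channels, time)
        return self.conv(nn.functional.pad(x, (self.left_pad, 0)))

block = [2 ** i for i in range(10)]   # dilations 1, 2, 4, ..., 512
dilations = block * 3                 # the schedule is repeated, as in the paper
# receptive field of one block of kernel-size-2 dilated causal convolutions:
# 1 + sum over layers of (kernel_size - 1) * dilation = 1 + 1023 = 1024
receptive_field_per_block = 1 + sum(block)
&lt;/code&gt;&lt;/pre&gt;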

&lt;p&gt;&lt;strong&gt;Intuition:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Exponentially increasing the dilation factor results in exponential receptive field growth with depth (Yu &amp;amp; Koltun, 2016).  For example each 1, 2, 4, …, 512 block has receptive field of size 1024, and can be seen as a more efficient and discriminative (non-linear) counterpart of a 1×1024 convolution.&lt;/li&gt;
  &lt;li&gt;Stacking these blocks further increases the model capacity and the receptive field size.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;22-softmax-distributions&quot;&gt;2.2. Softmax Distributions&lt;/h3&gt;

&lt;p&gt;One approach to modeling the conditional distributions &lt;strong&gt;p(x_t|x_1, …, x_t−1)&lt;/strong&gt; over  the  individual audio samples would be to &lt;strong&gt;use a mixture model such as a mixture density network&lt;/strong&gt; (Bishop, 1994) or mixture of conditional Gaussian scale mixtures (MCGSM) (Theis &amp;amp; Bethge, 2015).  However,van den Oord et al. (2016a) showed that &lt;strong&gt;a softmax distribution tends to work better&lt;/strong&gt;, even when the data is implicitly continuous (as is the case for image pixel intensities or audio sample values). One of the reasons is that a categorical distribution is more flexible and can more easily model arbitrary distributions because it makes no assumptions about their shape.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Raw audio is typically stored as a sequence of 16-bit integer values (one per timestep), so a softmax layer would need to output 65,536 probabilities per timestep to model all possible values.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The authors apply a μ-law companding transformation (ITU-T, 1988) to the data, and then quantize it to 256 possible values:&lt;/p&gt;

\[f(x_t)=\text{sign}(x_t)\frac{\ln(1+\mu|x_t|)}{\ln(1+\mu)},\]

&lt;p&gt;where &lt;strong&gt;−1&amp;lt; x_t&amp;lt;1&lt;/strong&gt; and &lt;strong&gt;μ = 255&lt;/strong&gt;. This non-linear quantization produces a significantly better reconstruction than a simple linear quantization scheme.  Especially &lt;strong&gt;for speech&lt;/strong&gt;, it is found that &lt;strong&gt;the reconstructed signal after quantization sounded very similar to the original&lt;/strong&gt;.&lt;/p&gt;
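&lt;p&gt;A minimal sketch of the μ-law companding and 256-level quantization described above (my own illustration; the exact binning convention is an assumption):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import torch

def mu_law_encode(x, mu=255.0, num_bins=256):
    # x is a waveform with values in (-1, 1)
    # companding: f(x) = sign(x) * ln(1 + mu * |x|) / ln(1 + mu)
    f = torch.sign(x) * torch.log1p(mu * torch.abs(x)) / torch.log1p(torch.tensor(mu))
    # map (-1, 1) to integer bins 0 .. 255 used as softmax targets
    return ((f + 1.0) / 2.0 * (num_bins - 1)).round().long()
&lt;/code&gt;&lt;/pre&gt;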

&lt;h3 id=&quot;23-gated-activation-units&quot;&gt;2.3. Gated Activation Units&lt;/h3&gt;

&lt;p&gt;The authors use the same gated activation unit as used in the gated PixelCNN (van den Oord et al., 2016b):&lt;/p&gt;

\[\textbf{z}=\tanh(W_{f,k}*\textbf{x}) \odot \sigma(W_{g,k}*\textbf{x})\]

&lt;p&gt;where &lt;strong&gt;∗ denotes a convolution operator&lt;/strong&gt;, &lt;strong&gt;\odot&lt;/strong&gt; denotes an element-wise multiplication operator, &lt;strong&gt;σ(·)&lt;/strong&gt; is a sigmoid function, &lt;strong&gt;k&lt;/strong&gt; is the layer index, &lt;strong&gt;f&lt;/strong&gt; and &lt;strong&gt;g&lt;/strong&gt; denote filter and gate, respectively, and &lt;strong&gt;W&lt;/strong&gt; is a learnable convolution filter.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;In our initial experiments, &lt;strong&gt;we observed that this non-linearity worked significantly better than the rectified linear activation function (Nair &amp;amp; Hinton, 2010) for modeling audio signals&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3 id=&quot;24-residual-and-skip-connections&quot;&gt;2.4. Residual And Skip Connections&lt;/h3&gt;

&lt;blockquote&gt;
  &lt;p&gt;Both residual (He et al., 2015) and parameterised skip connections are used throughout the network,to speed up convergence and enable training of much deeper models.  In Fig. 4 we show a residual block of our model, which is stacked many times in the network.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;img src=&quot;/houwx.net/posts/2020-07-04-blog-post-21-4.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
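&lt;p&gt;Putting 2.3 and 2.4 together, here is a sketch of one residual block in the spirit of Fig. 4 (my own simplification; channel sizes and the placement of the 1×1 convolutions are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import torch
import torch.nn as nn

class WaveNetBlock(nn.Module):
    def __init__(self, residual_ch, skip_ch, kernel_size, dilation):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        # W_{f,k} and W_{g,k}: dilated convolutions for the filter and the gate
        self.filter_conv = nn.Conv1d(residual_ch, residual_ch, kernel_size, dilation=dilation)
        self.gate_conv = nn.Conv1d(residual_ch, residual_ch, kernel_size, dilation=dilation)
        self.res_proj = nn.Conv1d(residual_ch, residual_ch, 1)   # 1x1 conv back to the residual path
        self.skip_proj = nn.Conv1d(residual_ch, skip_ch, 1)      # 1x1 conv to the skip path

    def forward(self, x):                                        # x: (batch, residual_ch, time)
        h = nn.functional.pad(x, (self.left_pad, 0))             # left-pad to stay causal
        z = torch.tanh(self.filter_conv(h)) * torch.sigmoid(self.gate_conv(h))
        return self.res_proj(z) + x, self.skip_proj(z)           # residual output, skip output
&lt;/code&gt;&lt;/pre&gt;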


&lt;h3 id=&quot;25-conditional-wavenets&quot;&gt;2.5. Conditional WaveNets&lt;/h3&gt;

&lt;p&gt;Given an additional input &lt;strong&gt;h&lt;/strong&gt;, WaveNets can model the conditional distribution &lt;strong&gt;p(x|h)&lt;/strong&gt; of the audio given this input. Eq. (1) now becomes&lt;/p&gt;

\[p(\textbf{x}|\textbf{h})=\prod_{t=1}^T p(x_t|x_1, ..., x_{t-1}, \textbf{h})\]

&lt;p&gt;By conditioning the model on other input variables, we can guide WaveNet’s generation to produce audio with the required characteristics.  For example, in a multi-speaker setting we can choose the speaker by feeding the speaker identity to the model as an extra input.  Similarly, for TTS we need to feed information about the text as an extra input.&lt;/p&gt;

&lt;p&gt;The authors condition the model on other inputs in two different ways: &lt;strong&gt;global conditioning&lt;/strong&gt; and &lt;strong&gt;local conditioning&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Global conditioning:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Global conditioning is characterised by a single latent representation &lt;strong&gt;h&lt;/strong&gt; that influences the output distribution &lt;strong&gt;across all timesteps&lt;/strong&gt;, e.g. a speaker embedding in a TTS model.  The activation function from Eq. (2) now becomes:&lt;/p&gt;

\[\textbf{z}=\tanh(W_{f,k}*\textbf{x}+V_{f,k}^T\textbf{h}) \odot \sigma(W_{g,k}*\textbf{x}+V_{g,k}^T\textbf{h})\]

&lt;p&gt;where &lt;strong&gt;V_{∗,k}&lt;/strong&gt; is a learnable linear projection, and the vector &lt;strong&gt;V^T_{∗,k}h&lt;/strong&gt; is broadcast over the time dimension.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Local conditioning:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For local conditioning we have a &lt;strong&gt;second time series&lt;/strong&gt; &lt;strong&gt;h_t&lt;/strong&gt;, possibly &lt;strong&gt;with a lower sampling frequency than the audio signal, e.g. linguistic features in a TTS model&lt;/strong&gt;.  We first transform this time series using a transposed convolutional network (learned upsampling) that maps it to a new time series &lt;strong&gt;y=f(h)&lt;/strong&gt; with the same resolution as the audio signal, which is then used in the activation unit as follows:&lt;/p&gt;

\[\textbf{z}=\tanh(W_{f,k}*\textbf{x}+V_{f,k} * \textbf{y}) \odot \sigma(W_{g,k}*\textbf{x}+V_{g,k} * \textbf{y})\]

&lt;p&gt;where &lt;strong&gt;V_{f, k} ∗ y&lt;/strong&gt; is now a 1×1 convolution. As an alternative to the transposed convolutional network, it is also possible to use &lt;strong&gt;V_{f, k} ∗ h&lt;/strong&gt; and repeat these values across time. This worked slightly worse in the experiments.&lt;/p&gt;
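
&lt;p&gt;A small sketch of local conditioning as described above: a transposed convolution upsamples the slower feature series &lt;strong&gt;h&lt;/strong&gt; to &lt;strong&gt;y&lt;/strong&gt;, and 1×1 convolutions of &lt;strong&gt;y&lt;/strong&gt; are added inside the gated activation (the 80-dimensional features and the factor-160 upsampling are assumptions for illustration):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import torch
import torch.nn as nn

# hypothetical sizes: 80-dim conditioning features at 1/160 of the audio frame rate
upsample = nn.ConvTranspose1d(80, 80, kernel_size=160, stride=160)  # learned upsampling of h
cond_f = nn.Conv1d(80, 64, 1)    # V_{f,k} * y as a 1x1 convolution
cond_g = nn.Conv1d(80, 64, 1)    # V_{g,k} * y
wave_f = nn.Conv1d(64, 64, 2)    # W_{f,k} * x (kernel size 2, dilation 1 for brevity)
wave_g = nn.Conv1d(64, 64, 2)    # W_{g,k} * x

def locally_conditioned_activation(x, h):
    # x: (batch, 64, T) audio-path features, h: (batch, 80, T // 160) conditioning series
    y = upsample(h)                   # now has the same time resolution T as x
    x = nn.functional.pad(x, (1, 0))  # causal padding for the kernel-size-2 convolutions
    return torch.tanh(wave_f(x) + cond_f(y)) * torch.sigmoid(wave_g(x) + cond_g(y))
&lt;/code&gt;&lt;/pre&gt;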

&lt;h3 id=&quot;26-context-stacks&quot;&gt;2.6. Context Stacks&lt;/h3&gt;

&lt;blockquote&gt;
  &lt;p&gt;We have already mentioned several different ways to increase the receptive field size of a WaveNet:&lt;/p&gt;

  &lt;ol&gt;
    &lt;li&gt;increasing the number of dilation stages&lt;/li&gt;
    &lt;li&gt;using more layers&lt;/li&gt;
    &lt;li&gt;larger filters&lt;/li&gt;
    &lt;li&gt;greater dilation factors&lt;/li&gt;
    &lt;li&gt;a combination thereof.&lt;/li&gt;
  &lt;/ol&gt;

  &lt;p&gt;A complementary approach is to use a separate, smaller context stack that processes a long part of the audio signal and locally conditions a larger WaveNet that processes only a smaller part of the audio signal (cropped at the end).  One can use multiple context stacks with varying lengths and numbers of hidden units.  Stacks with larger receptive fields have fewer units per layer. Context stacks can also have pooling layers to run at a lower frequency. This keeps the computational requirements at a reasonable level and is consistent with the intuition that less capacity is required to model temporal correlations at longer timescales.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;3-experiments&quot;&gt;3. Experiments&lt;/h2&gt;

&lt;p&gt;The authors evaluate WaveNet on three different tasks:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;multi-speaker speech generation (&lt;strong&gt;not conditioned on text&lt;/strong&gt;)&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;TTS&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;music audio modelling.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;31-multi-speaker-speech-generation&quot;&gt;3.1. Multi-Speaker Speech Generation&lt;/h3&gt;

&lt;p&gt;The authors used &lt;strong&gt;the English multi-speaker corpus from CSTR voice cloning toolkit (VCTK)&lt;/strong&gt; (Yamagishi, 2012) and conditioned WaveNet only on the speaker. The conditioning was applied by &lt;strong&gt;feeding the speaker ID to the model in the form of a one-hot vector&lt;/strong&gt;. The dataset consisted of 44 hours of data from 109 different speakers.&lt;/p&gt;

&lt;p&gt;Because the model is &lt;strong&gt;not conditioned on text&lt;/strong&gt;,  it generates non-existent but human language-like words in a smooth way with realistic sounding intonations.  This is similar to generative models of language or images, where samples look realistic at first glance, but are clearly unnatural upon closer inspection.  &lt;strong&gt;The lack of long range coherence is partly due to the limited size of the model’s receptive field (about 300 milliseconds), which means it can only remember the last 2–3 phonemes it produced.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A single WaveNet was able to model speech from any of the speakers by conditioning it on a one-hot encoding of a speaker. This confirms that it is powerful enough to capture the characteristics of all 109 speakers from the dataset in a single model.  We observed that adding speakers resulted in better validation set performance compared to training solely on a single speaker. This suggests that WaveNet’s internal representation was shared among multiple speakers.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Finally, we observed that the model also picked up on other characteristics in the audio apart from the voice itself.  For instance, it also mimicked the acoustics and recording quality, as well as the breathing and mouth movements of the speakers.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3 id=&quot;32-text-to-speech&quot;&gt;3.2. Text-To-Speech&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Datasets:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;The North American English dataset contains 24.6 hours of speech data&lt;/li&gt;
  &lt;li&gt;The Mandarin Chinese dataset contains 34.8 hours;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Both datasets were spoken by professional female speakers.&lt;/p&gt;

&lt;p&gt;WaveNets for the TTS task were &lt;strong&gt;locally conditioned on linguistic features&lt;/strong&gt; which were derived from input texts. The authors also trained WaveNets conditioned on the &lt;strong&gt;logarithmic fundamental frequency&lt;/strong&gt; (logF0) values in addition to the linguistic features.  External models predicting &lt;strong&gt;logF0&lt;/strong&gt; values and phone durations from linguistic features were also trained for each language.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Receptive field size and baselines:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The receptive field size of the WaveNets was 240 milliseconds. As example-based and model-based speech synthesis baselines, hidden Markov model (HMM)-driven unit selection concatenative (Gonzalvo et al., 2016) and long short-term memory recurrent neural network (LSTM-RNN)-based statistical parametric (Zen et al., 2016) speech synthesizers were built. Since the same datasets and linguistic features were used to train both the baselines and WaveNets, these speech synthesizers could be fairly compared.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Evaluation:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Subjective paired comparison tests&lt;/strong&gt; and &lt;strong&gt;mean opinion score (MOS) tests&lt;/strong&gt; were conducted. In the paired comparison tests, after listening to each pair of samples, the subjects were asked to choose which they preferred, though they could choose “neutral” if they did not have any preference. In the MOS tests, after listening to each stimulus, the subjects were asked to rate its naturalness on a five-point Likert scale (1: Bad, 2: Poor, 3: Fair, 4: Good, 5: Excellent). Please refer to Appendix B for details.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Subjective paired comparison tests:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Fig. 5 shows a selection of the subjective paired comparison test results (see Appendix B for the complete table). It can be seen from the results that WaveNet outperformed the baseline statistical parametric and concatenative speech synthesizers in both languages.&lt;/p&gt;

&lt;p&gt;The authors found that &lt;strong&gt;WaveNet conditioned on linguistic features only&lt;/strong&gt; could synthesize speech samples with natural segmental quality but &lt;strong&gt;sometimes it had unnatural prosody by stressing wrong words in a sentence&lt;/strong&gt;. This could be due to the long-term dependency of &lt;strong&gt;F0&lt;/strong&gt; contours: the size of the receptive field of the WaveNet, 240 milliseconds, was not long enough to capture such long-term dependency. &lt;strong&gt;WaveNet conditioned on both linguistic features and F0 values did not have this problem&lt;/strong&gt;:  the external &lt;strong&gt;F0&lt;/strong&gt; prediction model runs at a lower frequency (200 Hz) so it can learn long-range dependencies that exist in &lt;strong&gt;F0&lt;/strong&gt; contours.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/houwx.net/posts/2020-07-04-blog-post-21-5.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;


&lt;p&gt;&lt;strong&gt;Mean opinion score (MOS) tests:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Table 1 shows the MOS test results. It can be seen from the table that WaveNets achieved 5-scale MOSs in naturalness above 4.0, which were significantly better than those from the baseline systems. They were the highest ever reported MOS values with these training datasets and test sentences. The gap in the MOSs from the best synthetic speech to the natural ones decreased from 0.69 to 0.34 (51%) in US English and 0.42 to 0.13 (69%) in Mandarin Chinese.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/houwx.net/posts/2020-07-04-blog-post-21-6.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;


&lt;h3 id=&quot;33-music&quot;&gt;3.3. Music&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Datasets:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;the MagnaTagATune dataset (Law &amp;amp; Von Ahn, 2009), which consists of about 200 hours of music audio. Each 29-second clip is annotated with tags from a set of 188, which describe the genre, instrumentation, tempo, volume and mood of the music.&lt;/li&gt;
  &lt;li&gt;the YouTube piano dataset, which consists of about 60 hours of solo piano music obtained from YouTube videos. Because it is constrained to a single instrument, it is considerably easier to model.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The authors found that &lt;strong&gt;enlarging the receptive field was crucial&lt;/strong&gt; to obtain samples that sounded musical. Even with a receptive field of several seconds, the models did not enforce long-range consistency which resulted in second-to-second variations in genre, instrumentation, volume and sound quality.  Nevertheless, the samples were often harmonic and aesthetically pleasing, even when produced by unconditional models.&lt;/p&gt;

&lt;p&gt;Of particular interest are &lt;strong&gt;conditional music models, which can generate music given a set of tags specifying e.g. genre or instruments&lt;/strong&gt;.  Similarly to conditional speech models, the authors &lt;strong&gt;insert biases that depend on a binary vector representation of the tags&lt;/strong&gt; associated with each training clip. This makes it possible to control various aspects of the output of the model when sampling, by feeding in a binary vector that encodes the desired properties of the samples.  Such models are trained on the &lt;strong&gt;MagnaTagATune&lt;/strong&gt; dataset; although the tag data bundled with the dataset was relatively noisy and had many omissions, after cleaning it up by merging similar tags and removing those with too few associated clips, this works reasonably well.&lt;/p&gt;

&lt;h3 id=&quot;34-speech-recognition&quot;&gt;3.4. Speech Recognition&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Dataset&lt;/strong&gt;: TIMIT (Garofolo et al.,1993) dataset&lt;/p&gt;

&lt;p&gt;WaveNets show that layers of dilated convolutions allow the receptive field to grow longer in a much cheaper way than using LSTM units.&lt;/p&gt;

&lt;p&gt;For this task a mean-pooling layer is added after the dilated convolutions that aggregated the activations to coarser frames spanning 10 milliseconds (160×downsampling).  The pooling layer was followed by a few non-causal convolutions.  The authors trained WaveNet with &lt;strong&gt;two loss terms, one to predict the next sample and one to classify the frame&lt;/strong&gt;, the model generalized better than with a single loss and achieved &lt;strong&gt;18.8 PER&lt;/strong&gt; on the test set, which is to our knowledge the best score obtained from a model trained directly on raw audio on TIMIT.&lt;/p&gt;

&lt;h2 id=&quot;4-conclusion&quot;&gt;4. Conclusion&lt;/h2&gt;

&lt;p&gt;The authors introduced WaveNets,  which are autoregressive and combine causal filters with dilated convolutions to allow their receptive fields to grow exponentially with depth, which is important to model the long-range temporal dependencies in audio signals.&lt;/p&gt;</content><author><name>Wenxin Hou 侯汶昕</name><email>houwx001@gmail.com</email></author><category term="Generative Model" /><category term="Speech Synthesis" /><category term="TTS" /><summary type="html">Last Updated: 2020-07-04</summary></entry><entry><title type="html">[Paper-PreTrain] wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations</title><link href="https://houwx.net/posts/2020/01/blog-post-19/" rel="alternate" type="text/html" title="[Paper-PreTrain] wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations" /><published>2020-06-27T00:00:00-07:00</published><updated>2020-06-27T00:00:00-07:00</updated><id>https://houwx.net/posts/2020/01/blog-post-19-wav2vec2</id><content type="html" xml:base="https://houwx.net/posts/2020/01/blog-post-19/">&lt;p&gt;Last Updated: 2020-06-28&lt;/p&gt;

&lt;p&gt;This paper:  &lt;a href=&quot;https://arxiv.org/pdf/2006.11477.pdf&quot;&gt;wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations&lt;/a&gt; is proposed by researchers from Facebook AI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code&lt;/strong&gt;: https://github.com/pytorch/fairseq/ (seems not released yet)&lt;/p&gt;

&lt;p&gt;In this paper, the authors show that wav2vec 2.0, which learns powerful representations from speech audio alone and is then fine-tuned on transcribed speech, can outperform the best semi-supervised methods while being conceptually simpler.&lt;/p&gt;

&lt;p&gt;wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations which are jointly learned.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Results&lt;/strong&gt;: SOTA on 100 hour subset of Librispeech as well as on TIMIT phoneme recognition.&lt;/p&gt;

&lt;h2 id=&quot;1-introduction&quot;&gt;1. Introduction&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Motivation&lt;/strong&gt;: current speech recognition systems require thousands of hours of transcribed speech to reach acceptable performance which is not available for the vast majority of the nearly 7,000 languages spoken worldwide [30]. Learning purely from labeled examples does not resemble language acquisition in humans: &lt;strong&gt;infants learn language by listening to adults around them - a process that requires learning good representations of speech.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Our approach encodes speech audio via a multi-layer convolutional neural network and then masks spans of the resulting latent speech representations [25,54], similar to masked language modeling [9]. The latent representations are fed to a Transformer network to build contextualized representations and the model is trained via a contrastive task where the true latent is to be distinguished from distractors [51, 47, 46, 27] .&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/houwx.net/posts/2020-06-27-blog-post-19-1.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;


&lt;p&gt;Pretraining and fine-tuning:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;As part of training, we learn discrete linguistic units [51,31,7,17] &lt;strong&gt;via a Gumbel softmax [23,5] to represent the latent representations in the contrastive task (Figure 1) which we find to be more effective than non-quantized targets&lt;/strong&gt;. After pre-training on unlabeled speech, the model is fine-tuned on labeled data with a Connectionist Temporal Classification (CTC) loss [14,4] to be used for downstream speech recognition tasks.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Limitation of previous works (vq-wav2vec):&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Previous work learned a quantization of the data followed by a contextualized representations with a self-attention model [5,4], whereas our approach solves both problems &lt;strong&gt;end-to-end&lt;/strong&gt;. Masking parts of the input with Transformer networks for speech has been explored  [4,25], but prior work relies either on a two-step pipeline or their model is trained by reconstructing the filter bank input features. Other related work includes learning representations from auto-encoding the input data [50,11] or directly predicting future timesteps [8].&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Results:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Our results demonstrate the feasibility of &lt;strong&gt;ultra-low resource speech recognition&lt;/strong&gt;: when using only &lt;strong&gt;10 minutes of labeled data&lt;/strong&gt;, our approach achieves word error rate (WER) 5.7/10.1 on the clean/noisy test sets of Librispeech. We set a new state of the art on TIMIT phoneme recognition as well as the 100 hour clean subset of Librispeech. Moreover, when we lower the amount of labeled data to just &lt;strong&gt;one hour&lt;/strong&gt;, we still &lt;strong&gt;outperform the previous state of the art self-training method of [41] while using 100 times less labeled data and the same amount of unlabeled data&lt;/strong&gt;. When we use all 960 hours of labeled data from Librispeech, then our model achieves 1.9/3.5 WER which performs competitively to the best published result while using a simpler baseline architecture.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;2-model&quot;&gt;2. Model&lt;/h2&gt;

&lt;p&gt;The model is composed of a multi-layer convolutional feature encoder &lt;strong&gt;f:X → Z&lt;/strong&gt; which takes as input raw audio &lt;strong&gt;X&lt;/strong&gt; and outputs latent speech representations &lt;strong&gt;z_1,…,z_T&lt;/strong&gt;. They are then fed to a Transformer &lt;strong&gt;g:Z → C&lt;/strong&gt; to build representations &lt;strong&gt;c_1,…,c_T&lt;/strong&gt; capturing information from the entire sequence [9,5,4]. The output of the feature encoder is discretized to &lt;strong&gt;q_t&lt;/strong&gt; with a quantization module &lt;strong&gt;Z → Q&lt;/strong&gt; to represent the targets (Figure 1) in the self-supervised objective.&lt;/p&gt;

&lt;p&gt;Design details of model component architectures:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Feature encoder&lt;/strong&gt; The encoder consists of several blocks containing a temporal convolution followed by a GELU activation function [20]. The first block maps raw audio to a feature representation and to increase robustness, we add a group normalization before the GELU to normalize each output channel over the sequence. We apply layer normalization to the output channels of this network [1].&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Contextualized representations with Transformers&lt;/strong&gt; The output of the feature encoder is fed to a context network which follows the Transformer architecture [53,9,32]. &lt;strong&gt;Instead of fixed positional embeddings which encode absolute positional information, we use a convolutional layer with kernel size 128 and 16 groups similar to [36,4,55] which acts as relative positional embedding.&lt;/strong&gt; We add the output of the convolution followed by a GELU to the inputs and then apply layer normalization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quantization module&lt;/strong&gt; For self-supervised training we discretize the output of the feature encoder &lt;strong&gt;z&lt;/strong&gt; to a finite set of speech representations via product quantization [24,5]. This amounts to choosing quantized representations from multiple codebooks and concatenating them. Given &lt;strong&gt;G&lt;/strong&gt; codebooks, or groups, with &lt;strong&gt;V&lt;/strong&gt; entries &lt;strong&gt;e∈R^{V×d/G}&lt;/strong&gt;, we choose one entry from each codebook and concatenate the resulting vectors &lt;strong&gt;e_1,…,e_G&lt;/strong&gt; and apply a linear transformation &lt;strong&gt;R^d→R^f&lt;/strong&gt; to obtain &lt;strong&gt;q∈R^f&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Gumbel softmax:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The Gumbel softmax enables choosing discrete codebook entries in a fully differentiable way [15,23,34]. We use the straight-through estimator [25] and setup &lt;strong&gt;G&lt;/strong&gt; hard Gumbel softmax operations [23]. The feature encoder output &lt;strong&gt;z&lt;/strong&gt; is mapped to &lt;strong&gt;l∈R^{G×V}&lt;/strong&gt; logits and the probabilities for choosing the &lt;strong&gt;v&lt;/strong&gt;-th codebook entry for group &lt;strong&gt;g&lt;/strong&gt; are&lt;/p&gt;

\[p_{g, v}=\frac{\exp((l_{g, v}+n_v)/\tau)}{\sum_{k=1}^{V}\exp((l_{g,k}+n_k)/\tau)}\]

  &lt;p&gt;where &lt;strong&gt;τ&lt;/strong&gt; is a non-negative temperature, &lt;strong&gt;n=−log(−log(u))&lt;/strong&gt; and &lt;strong&gt;u&lt;/strong&gt; are uniform samples from &lt;strong&gt;U(0,1)&lt;/strong&gt;. During the forward pass, code word &lt;strong&gt;i&lt;/strong&gt; is chosen by &lt;strong&gt;i=argmax_j p_{g,j}&lt;/strong&gt; and in the backward pass, the true gradient of the Gumbel softmax outputs is used.&lt;/p&gt;
&lt;/blockquote&gt;
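
&lt;p&gt;A minimal PyTorch-style sketch of the quantization module (my own reading of the description above, not the fairseq code): per-group logits, a hard Gumbel-softmax choice with straight-through gradients, and concatenation of the selected codebook entries. The sizes G=2, V=320 and the output dimension are illustrative.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import torch
import torch.nn as nn
import torch.nn.functional as F

class GumbelQuantizer(nn.Module):
    def __init__(self, dim, groups=2, entries=320, out_dim=256):
        super().__init__()
        self.G, self.V = groups, entries
        self.to_logits = nn.Linear(dim, groups * entries)   # maps z to logits l in R^{G x V}
        self.codebook = nn.Parameter(torch.randn(groups, entries, out_dim // groups))
        self.proj = nn.Linear(out_dim, out_dim)              # final linear projection to q

    def forward(self, z, tau=2.0):                           # z: (batch, T, dim)
        b, t = z.size(0), z.size(1)
        logits = self.to_logits(z).view(b, t, self.G, self.V)
        # hard one-hot choice per group with straight-through gradients
        onehot = F.gumbel_softmax(logits, tau=tau, hard=True, dim=-1)
        # select one entry per codebook and concatenate the G sub-vectors
        picked = (onehot.unsqueeze(-1) * self.codebook).sum(dim=3)   # (b, t, G, d/G)
        return self.proj(picked.reshape(b, t, -1))
&lt;/code&gt;&lt;/pre&gt;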

&lt;h2 id=&quot;3-training&quot;&gt;3. Training&lt;/h2&gt;

&lt;h3 id=&quot;31-masking&quot;&gt;3.1. Masking&lt;/h3&gt;

&lt;p&gt;The authors &lt;strong&gt;mask a proportion of the feature encoder outputs, or time steps before feeding them to the context network and replace them with a trained feature vector shared between all masked time steps&lt;/strong&gt;; they do &lt;strong&gt;not mask inputs to the quantization module&lt;/strong&gt;. To mask the latent speech representations output by the encoder, they randomly sample without replacement &lt;strong&gt;p=0.065&lt;/strong&gt; of all time steps to be starting indices and then mask the subsequent &lt;strong&gt;M=10&lt;/strong&gt; consecutive time steps from every sampled index; spans may overlap. This results in approximately 49% of all time steps to be masked with a mean span length of 14.7, or 299ms (see Appendix A in the original paper for more details on masking) .&lt;/p&gt;
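
&lt;p&gt;A small sketch of the span masking (my own approximation: independent Bernoulli sampling of starting indices stands in for the without-replacement sampling described above):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import torch

def compute_span_mask(batch, time_steps, p=0.065, span=10):
    # sample starting indices (Bernoulli with probability p approximates the paper),
    # then mask the next span consecutive time steps; spans may overlap
    mask = torch.zeros(batch, time_steps, dtype=torch.bool)
    starts = torch.rand(batch, time_steps).lt(p).nonzero()
    for b, t in starts.tolist():
        mask[b, t : min(t + span, time_steps)] = True
    return mask
# masked positions are then replaced by a single learned vector before the Transformer
&lt;/code&gt;&lt;/pre&gt;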

&lt;h3 id=&quot;32-objective&quot;&gt;3.2. Objective&lt;/h3&gt;

&lt;p&gt;During pre-training, the model is trained with multiple objectives: a contrastive task L_m, a codebook diversity loss L_d, and an L2 penalty L_f:&lt;/p&gt;

\[L=L_m+\alpha L_d + \beta L_f\]

&lt;p&gt;where &lt;strong&gt;α&lt;/strong&gt; and &lt;strong&gt;β&lt;/strong&gt; are tuned hyperparameters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Contrastive Loss&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Given context network output &lt;strong&gt;c_t&lt;/strong&gt; centered over &lt;strong&gt;masked&lt;/strong&gt; time step &lt;strong&gt;t&lt;/strong&gt;, the model needs to identify the true quantized latent speech representation &lt;strong&gt;q_t&lt;/strong&gt; in a set of &lt;strong&gt;K+ 1&lt;/strong&gt; quantized candidate representations &lt;strong&gt;\tilde{q}∈Q_t&lt;/strong&gt; which includes &lt;strong&gt;q_t&lt;/strong&gt; and &lt;strong&gt;K&lt;/strong&gt; distractors [22,52]. &lt;strong&gt;Distractors are uniformly sampled from other masked time steps of the same utterance&lt;/strong&gt;. The loss is defined as&lt;/p&gt;

\[L_m=-\log\frac{\exp(\mathrm{sim}(c_t, q_t)/\kappa)}{\sum_{\tilde{q}\sim Q_t}\exp(\mathrm{sim}(c_t, \tilde{q})/\kappa)}\]

&lt;p&gt;where &lt;strong&gt;sim(a,b) =a^T b/‖a‖‖b‖&lt;/strong&gt; is the cosine similarity between context representations and quantized latent speech representations [18, 6].&lt;/p&gt;
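
&lt;p&gt;A minimal PyTorch sketch of this contrastive loss for a single masked time step (placing the positive at index 0 and using cross entropy is my own formulation of the same negative log-softmax, not the authors’ code):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch
import torch.nn.functional as F

def contrastive_loss(c_t, q_pos, q_neg, kappa=0.1):
    # c_t: (dim,) context vector at a masked step; q_pos: (dim,) true quantized latent;
    # q_neg: (K, dim) distractors sampled from other masked steps of the same utterance.
    candidates = torch.cat([q_pos.unsqueeze(0), q_neg], dim=0)        # (K+1, dim)
    sims = F.cosine_similarity(c_t.unsqueeze(0), candidates, dim=-1)  # sim(c_t, q)
    logits = (sims / kappa).unsqueeze(0)                              # temperature kappa
    target = torch.zeros(1, dtype=torch.long)                         # index 0 is q_pos
    return F.cross_entropy(logits, target)                            # equals the -log softmax above
&lt;/code&gt;&lt;/pre&gt;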

&lt;p&gt;&lt;strong&gt;Diversity Loss&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The contrastive task depends on the codebook to represent both positive and negative examples, and the diversity loss &lt;strong&gt;L_d&lt;/strong&gt; is designed to increase the use of the quantized codebook representations [10]. We encourage the equal use of the &lt;strong&gt;V&lt;/strong&gt; entries in each of the &lt;strong&gt;G&lt;/strong&gt; codebooks by &lt;strong&gt;maximizing the entropy of the averaged softmax distribution&lt;/strong&gt; &lt;strong&gt;l&lt;/strong&gt; over the codebook entries &lt;strong&gt;p̄_g&lt;/strong&gt; for each codebook across a batch of utterances; this softmax distribution &lt;strong&gt;contains neither the Gumbel noise nor a temperature&lt;/strong&gt;.&lt;/p&gt;

\[L_d=\frac{1}{GV}\sum_{g=1}^G -H(\overline{p}_g)=\frac{1}{GV} \sum_{g=1}^{G} \sum_{v=1}^{V} \overline{p}_{g, v}\log \overline{p}_{g,v}\]
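
&lt;p&gt;A short PyTorch sketch of this diversity term (the batch flattening and the epsilon inside the log are my own choices):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch

def diversity_loss(logits):
    # logits: (num_masked_steps, G, V) codebook logits without Gumbel noise or temperature.
    probs = torch.softmax(logits.float(), dim=-1)     # softmax over the V entries
    avg_probs = probs.mean(dim=0)                     # averaged distribution per codebook, (G, V)
    G, V = avg_probs.shape
    neg_entropy = (avg_probs * torch.log(avg_probs + 1e-7)).sum()
    return neg_entropy / (G * V)                      # minimized when codebook usage is uniform
&lt;/code&gt;&lt;/pre&gt;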

&lt;p&gt;&lt;img src=&quot;https://houwx.net/posts/2020-06-27-blog-post-19-2.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stabilizing the Feature Encoder&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The authors found it helpful to apply an &lt;strong&gt;L2 penalty to the activations of the final layer of the feature encoder but before the final layer normalization&lt;/strong&gt;. They also scale down the global learning rate for weight updates to the feature encoder by &lt;strong&gt;γ&lt;/strong&gt;; see §4.2.&lt;/p&gt;

&lt;h3 id=&quot;33-fine-tuning&quot;&gt;3.3. Fine-tuning&lt;/h3&gt;

&lt;blockquote&gt;
  &lt;p&gt;Pre-trained models are fine-tuned for speech recognition by adding a randomly initialized linear projection on top of the context network into &lt;strong&gt;C&lt;/strong&gt; classes representing the vocabulary of the task [4]. For Librispeech, we have 29 tokens for character targets plus a word boundary token. Models are optimized by &lt;strong&gt;minimizing a CTC loss&lt;/strong&gt; [14] and we apply a modified version of SpecAugment [40] by masking to time-steps and channels during training, which delays overfitting and significantly improves the final error rates, especially on the Libri-light subsets with few labeled examples.&lt;/p&gt;
&lt;/blockquote&gt;
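
&lt;p&gt;A minimal PyTorch sketch of this fine-tuning objective (the 31-class output layer, i.e. 29 character tokens + word boundary + CTC blank, and all tensor shapes are my own assumptions for illustration):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch
import torch.nn as nn

proj = nn.Linear(768, 31)                        # randomly initialized projection into C classes
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

context = torch.randn(200, 4, 768)               # stand-in for the pre-trained context vectors
targets = torch.randint(1, 31, (4, 40))          # character indices of the reference transcripts
input_lengths = torch.full((4,), 200, dtype=torch.long)
target_lengths = torch.full((4,), 40, dtype=torch.long)

log_probs = proj(context).log_softmax(dim=-1)    # (time, batch, classes)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()
&lt;/code&gt;&lt;/pre&gt;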

&lt;h2 id=&quot;4-experimental-setup&quot;&gt;4. Experimental Setup&lt;/h2&gt;

&lt;h3 id=&quot;41-datasets&quot;&gt;4.1. Datasets&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pretraining&lt;/strong&gt;: the 960-hour Librispeech corpus (LS-960) without transcriptions, or the audio data from LibriVox (LV-60k) with the same preprocessing as Libri-light [26], giving 53.2k hours of audio.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fine-tuning&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;960 hours of transcribed Librispeech&lt;/li&gt;
  &lt;li&gt;train-clean-100 subset comprising 100 hours (100 hours labeled)&lt;/li&gt;
  &lt;li&gt;Libri-light limited resource training subsets originally extracted from Librispeech: train-10h (10 hours labeled), train-1h (1 hour labeled), train-10min (10 min labeled)&lt;/li&gt;
  &lt;li&gt;TIMIT dataset containing five hours of audio recordings with detailed 39 phoneme labels.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;42-pre-training&quot;&gt;4.2. Pre-training&lt;/h3&gt;

&lt;blockquote&gt;
  &lt;p&gt;The feature encoder contains seven blocks and the temporal convolutions in each block have 512 channels with strides (5,2,2,2,2,2,2) and kernel widths(10,3,3,3,3,2,2). This results in an encoder output frequency of 49 Hz with a stride of about 20ms between each sample, and a receptive field of 400 input samples or 25 ms of audio.&lt;/p&gt;

  &lt;p&gt;We experiment with two model configurations which use the same encoder architecture but differ in the Transformer setup: BASE contains 12 transformer blocks, model dimension 768, inner dimension (FFN) 3,072 and 8 attention heads. Batches are built by cropping 250k audio samples, or 15.6 sec, from each example. Crops are batched together to not exceed 1.4m samples per GPU and we train on a total of 64 V100 GPUs for 1.6 days [37]; the total batch size is 1.6h.&lt;/p&gt;

  &lt;p&gt;The LARGE model contains 24 transformer blocks with model dimension 1,024, inner dimension 4,096 and 16 attention heads. We crop 320K audio samples, or 20sec, with a limit of 1.2M samples per GPU and train on 128 V100 GPUs over 2.3 days for Librispeech and 5.2 days for LibriVox; the total batch size is 2.7h.&lt;/p&gt;

  &lt;p&gt;We use dropout 0.1 in the Transformer, at the output of the feature encoder and the input to the quantization module. Layers are dropped at a rate of 0.05 for BASE and 0.2 for LARGE [21, 12]; there is no layer drop for LV-60k.&lt;/p&gt;

  &lt;p&gt;We optimize with Adam [28], warming up the learning rate for the first 8% of updates to a peak of 5×10^{−3} for BASE and 3×10^{−3} for LARGE, and then linearly decay it. LARGE trains for 250k updates, BASE for 400k updates, and LARGE on LV-60k for 600k updates. We use weight &lt;strong&gt;α= 0.1&lt;/strong&gt; for the diversity loss and &lt;strong&gt;β=10&lt;/strong&gt; for the feature penalty in Equation 2. For the quantization module we use G= 2 and V= 320 for both models, &lt;strong&gt;resulting in a theoretical maximum of 102.4k codewords (How does it come??? This is just the number of possible codeword combinations across the two codebooks: V^G = 320^2 = 102,400 ≈ 102.4k.)&lt;/strong&gt;. Entries are of size d/G=128 for BASE and d/G= 384 for LARGE. The Gumbel softmax temperature &lt;strong&gt;τ&lt;/strong&gt; is annealed from 2 to a minimum of 0.5 for BASE and 0.1 for LARGE by a factor of 0.999995 at every update. The temperature in the contrastive loss (Equation 3) is set to κ=0.1. We set the feature encoder gradient scaling factor to γ= 0.1 for Librispeech and γ= 0.03 for LibriVox. In the contrastive loss we use K=100 distractors. &lt;strong&gt;We choose the training checkpoint with the lowest L_m on the validation set.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
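
&lt;p&gt;A quick sanity check of the quoted encoder geometry (16 kHz input audio assumed):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;strides = [5, 2, 2, 2, 2, 2, 2]
kernels = [10, 3, 3, 3, 3, 2, 2]

receptive_field, hop = 1, 1
for k, s in zip(kernels, strides):
    receptive_field += (k - 1) * hop   # input samples seen by one output frame
    hop *= s                           # input samples between consecutive output frames

print(receptive_field, 1000 * receptive_field / 16000)   # 400 samples = 25.0 ms
print(hop, 1000 * hop / 16000)                           # 320 samples = 20.0 ms stride
&lt;/code&gt;&lt;/pre&gt;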

&lt;h3 id=&quot;43-fine-tuning&quot;&gt;4.3. Fine-tuning&lt;/h3&gt;

&lt;blockquote&gt;
  &lt;p&gt;After pre-training we fine-tune the learned representations on labeled data and add a randomly initialized output layer on top of the Transformer to predict characters (Librispeech/Libri-light) or phonemes (TIMIT). For Libri-light, we train three seeds with two different learning rates (2e-5 and 3e-5) for all subsets and choose the configuration with the lowest WER on the dev-other subset decoded with the official 4-gram language model (LM) with beam 50 and fixed model weights (LM weight 2, word insertion penalty -1). For BASE on the labeled 960h subset we use a learning rate of 1e-4. We optimize with Adam and a tri-state rate schedule where the learning rate is warmed up for the first 10% of updates, held constant for the next 40% and then linearly decayed for the remainder. BASE uses a batch size of 3.2M samples per GPU and we fine-tune on 8 GPUs, giving a total batch size of 1,600 sec. LARGE batches 1.28M samples on each GPU and we fine-tune on 24 GPUs, resulting in an effective batch size of 1,920 sec. For the first 10k updates only the output classifier is trained, after which the Transformer is also updated. The feature encoder is not trained during fine-tuning. We mask the feature encoder representations with a strategy similar to SpecAugment [40] detailed in Appendix B.&lt;/p&gt;
&lt;/blockquote&gt;
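
&lt;p&gt;A small sketch of the tri-state learning-rate schedule described above (warm up for the first 10% of updates, hold for the next 40%, then decay linearly; written from the text, with made-up step counts, not from the released training code):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def tri_state_lr(step, total_steps, peak_lr, warmup=0.1, hold=0.4):
    # Linear warm-up, constant plateau, then linear decay to zero.
    warm_end = warmup * total_steps
    hold_end = (warmup + hold) * total_steps
    warm_factor = min(1.0, step / warm_end)
    decay_factor = min(1.0, max(0.0, (total_steps - step) / (total_steps - hold_end)))
    return peak_lr * min(warm_factor, decay_factor)

print(tri_state_lr(5_000, 80_000, 3e-5))    # still warming up
print(tri_state_lr(30_000, 80_000, 3e-5))   # held at the peak
print(tri_state_lr(79_000, 80_000, 3e-5))   # almost fully decayed
&lt;/code&gt;&lt;/pre&gt;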

&lt;h3 id=&quot;44--language-models-and-decoding&quot;&gt;4.4.  Language Models and Decoding&lt;/h3&gt;

&lt;blockquote&gt;
  &lt;p&gt;We consider two types of language models (LM): a 4-gram model and a Transformer [3] trained on the Librispeech LM corpus. The Transformer LM is identical to [49] and contains 20 blocks, model dimension 1280, inner dimension 6144 and 16 attention heads. We tune the weights of the language model (interval [0, 5]) and a word insertion penalty ([−5, 5]) via Bayesian optimization: we run 128 trials with beam 500 for the 4-gram LM and beam 50 for the Transformer LM and choose the best set of weights according to performance on dev-other. Test performance is measured with beam 1,500 for the n-gram LM and beam 500 for the Transformer LM. We use the beam search decoder of [43].&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;5-results&quot;&gt;5. Results&lt;/h2&gt;

&lt;h3 id=&quot;51-low-resource-labeled-data-evaluation&quot;&gt;5.1. Low-Resource Labeled Data Evaluation&lt;/h3&gt;

&lt;p&gt;WER results on Librispeech dev/test sets:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/posts/2020-06-27-blog-post-19-3.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;We can see that:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;The LARGE model pre-trained on LV-60k and &lt;strong&gt;fine-tuned on only 10 minutes of labeled data&lt;/strong&gt; achieves a WER of &lt;strong&gt;5.7/10.1 on clean/other test sets&lt;/strong&gt;. This demonstrates that ultra-low resource speech recognition is possible with self-supervised learning on unlabeled data. This approach improves over previous pre-training work which did not learn quantized audio units jointly [4], reducing WER by about a third.&lt;/li&gt;
  &lt;li&gt;A recent &lt;strong&gt;iterative self-training approach [41]&lt;/strong&gt; represents the &lt;strong&gt;SOTA on the clean 100 hour subset of Librispeech&lt;/strong&gt; but it requires multiple iterations of labeling, filtering, and re-training. On the 100 hour subset of Librispeech, &lt;strong&gt;their method achieves WER 4.2/8.6&lt;/strong&gt; on test-clean/other which compares to &lt;strong&gt;WER 2.3/5.0 with the LARGE model&lt;/strong&gt; in a like for like setup, &lt;strong&gt;a relative WER reduction of 45%/42%&lt;/strong&gt;.&lt;/li&gt;
  &lt;li&gt;When the LARGE model uses an order of magnitude less labeled data (&lt;strong&gt;10h labeled&lt;/strong&gt;), it still achieves &lt;strong&gt;WER 3.2/6.1&lt;/strong&gt;, an error &lt;strong&gt;reduction of 24%/29% relative to iterative self-training&lt;/strong&gt;.&lt;/li&gt;
  &lt;li&gt;Using only &lt;strong&gt;a single hour of labeled data&lt;/strong&gt;, the same model achieves &lt;strong&gt;WER 3.9/7.6&lt;/strong&gt; which improves on both test-clean and test-other by &lt;strong&gt;7%/12%&lt;/strong&gt; - with two orders of magnitude less labeled data.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Libri-light data splits contain both clean and noisy data leading to better accuracy&lt;/strong&gt; on test-other compared to test-clean. (&lt;strong&gt;where???&lt;/strong&gt;)&lt;/li&gt;
  &lt;li&gt;Increasing &lt;strong&gt;model size&lt;/strong&gt; reduces WER on all setups with the largest improvements on test-other (BASE vs. LARGE both on LS-960) and increasing the &lt;strong&gt;amount of unlabeled training data&lt;/strong&gt; also leads to large improvements (LARGE LS-960 vs. LV-60k).&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;52-high-resource-labeled-data-evaluation-on-librispeech&quot;&gt;5.2. High-Resource Labeled Data Evaluation on Librispeech&lt;/h3&gt;

&lt;p&gt;WER results on Librispeech with 960 hours labeled data:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/posts/2020-06-27-blog-post-19-4.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;We can find that:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;This work achieves &lt;strong&gt;WER 1.9/3.5&lt;/strong&gt; on test-clean/other. This is the first time self-supervised learning achieves results competitive to the state of the art iterative semi-supervised methods in a high-resource labeled data setup.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Authors’ explanation&lt;/strong&gt;: This is despite a weaker baseline architecture: supervised training of the same architecture achieves &lt;strong&gt;WER 2.1/4.6&lt;/strong&gt; (LARGE trained from scratch) compared to &lt;strong&gt;WER 1.9/4.1 for ContextNet&lt;/strong&gt; [16], the baseline architecture of the &lt;strong&gt;SOTA Noisy student&lt;/strong&gt; [41]. The &lt;strong&gt;vocabulary&lt;/strong&gt; of their acoustic model (characters) does not match the vocabulary of the LM (words), which delays feedback from the LM and is likely to be detrimental (they did not use subwords). Moreover, they did not use any &lt;strong&gt;data balancing&lt;/strong&gt; such as in [41]. Finally, self-training is likely complementary to pre-training and their combination may yield even better results. Appendix E presents a detailed error analysis of their pre-trained models in various labeled data setups.&lt;/p&gt;

&lt;h3 id=&quot;53-phoneme-recognition-on-timit&quot;&gt;5.3. Phoneme Recognition on TIMIT&lt;/h3&gt;

&lt;p&gt;The authors fine-tuned as for the 10 hour subset of Libri-light but did not use a language model.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/posts/2020-06-27-blog-post-19-5.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;We can find:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;The proposed approach achieves a new SOTA on this dataset, reducing PER by &lt;strong&gt;a relative 23%/29% over the next best result&lt;/strong&gt; on the dev/test sets.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Appendix D shows an analysis of how the discrete latent speech representations are related to phonemes.&lt;/p&gt;

&lt;h3 id=&quot;54-ablations&quot;&gt;5.4. Ablations&lt;/h3&gt;

&lt;blockquote&gt;
  &lt;p&gt;A difference to previous work [5,4] is that &lt;strong&gt;we quantize the latent audio representations only for the contrastive loss, i.e., when latents are used as targets, but not when the latents are input to the Transformer network&lt;/strong&gt;. We motivate this choice by an ablation for which we adopt a reduced training setup to increase experimental turnaround: we pre-train BASE on LS-960 for 250k updates with masking probability p= 0.075, fine-tune on train-10h for 60k updates on a single GPU with 640k samples per batch, or 40 sec of speech audio. We &lt;strong&gt;report the average WER and standard deviation on the concatenation of dev-clean and dev-other for three seeds of fine-tuning&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/posts/2020-06-27-blog-post-19-6.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Table 4 shows that:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Strategy of continuous inputs with quantized targets (Baseline) performs best. Continuous latent speech representations retain more information to enable better context representations and quantizing the target representations leads to more robust training.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Quantizing the latents both in the input and the targets performs least well, and explains the lower performance of prior work [5,4].&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Continuous targets reduce the effectiveness of self-supervised training &lt;strong&gt;since targets can capture detailed artifacts of the current sequence, e.g.  speaker and background information,which make the task easier&lt;/strong&gt; and prevent the model from learning general representations beneficial to speech recognition.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Continuous inputs and continuous targets perform second best but various attempts to improve it did not lead to better results (see Appendix F for this experiment and other ablations on various hyperparameters). The &lt;strong&gt;training accuracy of identifying the correct latent audio representation increases from 62% to 78.0% when switching from quantized to continuous targets&lt;/strong&gt;.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;6-conclusion&quot;&gt;6. Conclusion&lt;/h2&gt;

&lt;p&gt;Contribution:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;We presented wav2vec 2.0, a framework for self-supervised learning of speech representations which &lt;strong&gt;masks latent representations of the raw waveform&lt;/strong&gt; and solves a contrastive task over quantized speech representations.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Results:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Our experiments show the large potential of pre-training on unlabeled data for speech processing: when using &lt;strong&gt;only 10 minutes of labeled training data&lt;/strong&gt;, or 48 recordings of 12.5 seconds on average, we achieve &lt;strong&gt;a WER of 5.7/10.1 on test-clean/other of Librispeech&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Potential improvements:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;We expect performance gains by switching to a seq2seq architecture and a word piece vocabulary.&lt;/p&gt;
&lt;/blockquote&gt;</content><author><name>Wenxin Hou 侯汶昕</name><email>houwx001@gmail.com</email></author><category term="Speech Pretrain Model" /><category term="ASR" /><summary type="html">Last Updated: 2020-06-28</summary></entry><entry><title type="html">[Paper-NLP] URIEL and lang2vec: Representing languages as typological, geographical, and phylogenetic vectors</title><link href="https://houwx.net/posts/2020/01/blog-post-18/" rel="alternate" type="text/html" title="[Paper-NLP] URIEL and lang2vec: Representing languages as typological, geographical, and phylogenetic vectors" /><published>2020-06-24T00:00:00-07:00</published><updated>2020-06-24T00:00:00-07:00</updated><id>https://houwx.net/posts/2020/01/blog-post-18-lang2vec</id><content type="html" xml:base="https://houwx.net/posts/2020/01/blog-post-18/">&lt;p&gt;Last Updated: 2020-06-24&lt;/p&gt;

&lt;p&gt;This paper: &lt;a href=&quot;https://www.aclweb.org/anthology/E17-2002.pdf&quot;&gt;URIEL and lang2vec: Representing languages as typological, geographical, and phylogenetic vectors&lt;/a&gt; is proposed by researchers from CMU and the University of Pittsburgh. It is accepted by EACL 2017. This paper is recommended for introducing &lt;strong&gt;lang2vec&lt;/strong&gt;, which packages language information that helps multilingual NLP research.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code&lt;/strong&gt;: https://github.com/antonisa/lang2vec&lt;/p&gt;

&lt;p&gt;In this paper, the authors introduced the URIEL knowledge base for massively multilingual NLP and the &lt;strong&gt;lang2vec&lt;/strong&gt; utility which provides information-rich vector identifications of languages drawn from typological,  geographical, and phylogenetic databases that are normalized to have straightforward and consistent formats, naming, and semantics.&lt;/p&gt;

&lt;h2 id=&quot;1-introduction&quot;&gt;1. Introduction&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;lang2vec&lt;/strong&gt; features primarily represent binary language facts (e.g., that negation precedes the verb or is represented as a suffix, that the language is part of the Germanic family, etc.) and are sourced and predicted from a variety of linguistic resources including WALS (Dryer and Haspelmath, 2013), PHOIBLE (Moran et al., 2014), Ethnologue (Lewis et al., 2015), and Glottolog (Hammarström et al., 2015).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;lang2vec&lt;/strong&gt; takes as its input a list of ISO 639-3 codes and outputs a matrix of [0.0, 1.0] feature values (like those in Table 1):&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/posts/2020-06-24-blog-post-18-1.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
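
&lt;p&gt;A minimal usage sketch with the released Python package (pip install lang2vec); the exact feature-set name syntax_knn and the return format follow my reading of the repository README, so treat them as assumptions:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import lang2vec.lang2vec as l2v

# Query three languages by ISO 639-3 code; values are in [0.0, 1.0],
# and the *_knn feature sets have missing entries imputed (see Section 4).
feats = l2v.get_features('eng spa jpn', 'syntax_knn')
print(len(feats['eng']))     # number of syntactic features per language
print(feats['jpn'][:10])     # first few feature values for Japanese
&lt;/code&gt;&lt;/pre&gt;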

&lt;h2 id=&quot;2-motivation&quot;&gt;2. Motivation&lt;/h2&gt;

&lt;p&gt;The recent success of “polyglot” models (Hermann and Blunsom, 2014; Faruqui and Dyer, 2014; Ammar et al., 2016; Tsvetkov et al., 2016; Daiber et al.,  2016),  in which a language model is trained on multiple languages and shares representations across languages, represents a promising avenue for NLP, especially for less-resourced languages, as &lt;strong&gt;these models appear to be able to learn useful patterns from better-resourced languages even when training data in the target language is limited&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Tsvetkov et al. (2016) shows that vectors that represent information about the language outperform a simple “one-hot” representation where each language is represented by a 1 in a single dimension. Sample results from Tsvetkov et al. (2016) are reproduced in Table 2.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/posts/2020-06-24-blog-post-18-2.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;We can see that training on a set of three similar languages, and a set of four similar and dissimilar languages, raises perplexity above the baseline monolingual model, even when the language is identified to the model by a one-hot (id) vector. However, &lt;strong&gt;perplexity is lowered by the introduction of phonological feature vectors for each language&lt;/strong&gt; (the phonology and inventory vector types described in §3.1), &lt;strong&gt;giving consistently lower perplexity than even the monolingual baseline&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The initial &lt;strong&gt;motivation&lt;/strong&gt; for the URIEL knowledge base and the &lt;strong&gt;lang2vec&lt;/strong&gt; utility is to make such research  easier, &lt;strong&gt;allowing different sources of information to be easily used together or as different experimental conditions  (e.g.,  is it better to provide this model information about the syntactic features of the language, or the phylogenetic relationships between the  languages?).&lt;/strong&gt; Standardizing the use of this kind of information also makes it easier to replicate and expand on previous work, without needing to know how the authors processed, for example, WALS feature classes or PHOIBLE inventories into model input.&lt;/p&gt;

&lt;h2 id=&quot;3-vector-types&quot;&gt;3. Vector types&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;General composition: binary vectors&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;lang2vec&lt;/strong&gt; offers a variety of vector representations of languages, of different types and derived from different sources, but all reporting feature values between 0.0 (generally representing the absence of a phenomenon or non-membership in a class) and 1.0 (generally representing the presence of a phenomenon or membership in a class). This normalization makes vectors from different sources more easily interchangeable and more easily predictable for each other (§4).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Different features are not mutually exclusive&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As in SSWL (Collins and Kayne, 2011), different features are not held to be mutually exclusive; the features S_SVO and S_SOV can both be 1 if both orders are normally encountered in the language.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Missing values:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Phylogeny&lt;/strong&gt;, &lt;strong&gt;geography&lt;/strong&gt;, and &lt;strong&gt;identity&lt;/strong&gt; vectors are complete—they have no missing values.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The typological features (&lt;strong&gt;syntax&lt;/strong&gt;, &lt;strong&gt;phonology&lt;/strong&gt;, and &lt;strong&gt;inventory&lt;/strong&gt;) have missing values, reflecting the coverage of the original sources; missing values are represented in the output as “–”. Predicted typological vectors (§4) attempt to impute these values based on related, neighboring, and typologically similar languages.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Dimensionality&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;All vectors within the syntax, phonology, and inventory categories &lt;strong&gt;have the same dimensionality as other types of vectors in the same category&lt;/strong&gt;, even though the sources themselves may only represent a subset of these values, to allow straightforward element-wise comparison of values.&lt;/p&gt;

&lt;h3 id=&quot;31-typological-vectors&quot;&gt;3.1. Typological vectors&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;syntax&lt;/strong&gt; features are adapted (after conversion to binary features) from the World Atlas of Language Structures (&lt;strong&gt;WALS&lt;/strong&gt;) (Dryer and Haspelmath, 2013), directly from Syntactic Structures of the World’s Languages (Collins and Kayne, 2011) (whose features are already binary), and indirectly by text-mining the short prose descriptions of typological features in Ethnologue (Lewis et al., 2015).&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;phonology&lt;/strong&gt; features are adapted in the same manner from WALS and Ethnologue.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;phonetic inventory&lt;/strong&gt; features are adapted from the PHOIBLE database, itself a collection and normalization of seven phonological databases (Moran et al., 2014; Chanard, 2006; Crothers et al., 1979; Hartell, 1993; Michael et al., 2012; Maddieson and Precoda, 1990; Ramaswami, 1999). The PHOIBLE-based features in &lt;strong&gt;lang2vec&lt;/strong&gt; primarily &lt;strong&gt;represent the presence or absence of natural classes of features (e.g., interdental fricatives, voiced uvulars, etc.)&lt;/strong&gt;, with 1 representing the presence of at least one sound of that class and 0 representing absence. They are derived from PHOIBLE’s phonetic inventories by extracting each segment’s articulatory features using the &lt;strong&gt;PanPhon*&lt;/strong&gt; feature extractor (Mortensen et al., 2016), and using these features to determine the presence or absence of the relevant natural classes.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/posts/2020-06-24-blog-post-18-3.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;* About &lt;strong&gt;PanPhon&lt;/strong&gt;: https://github.com/dmort27/panphon&lt;/p&gt;

&lt;h3 id=&quot;32-phylogeny-vectors&quot;&gt;3.2. Phylogeny vectors&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;fam&lt;/strong&gt; vectors express shared membership in language families, according to the world language family tree in Glottolog (Hammarström et al., 2015). &lt;strong&gt;Each dimension represents a language family or branch thereof&lt;/strong&gt; (such as “Indo-European” or “West Germanic” in Table 4).&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/posts/2020-06-24-blog-post-18-4.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;h3 id=&quot;33-geography-vectors&quot;&gt;3.3. Geography vectors&lt;/h3&gt;

&lt;p&gt;Although another component of &lt;strong&gt;URIEL&lt;/strong&gt; (to be described in a future publication) provides geographical distances between languages, &lt;strong&gt;geo&lt;/strong&gt; vectors express geographical location with a fixed number of dimensions and each dimension representing the same feature even when different sets of languages are considered. &lt;strong&gt;Each dimension represents the orthodromic distance&lt;/strong&gt;—that is, the “great circle” distance—from the language in question to a fixed point on the Earth’s surface. These distances are expressed as a fraction of the Earth’s antipodal distance, so that values will always be between 0.0 (directly at the fixed point) and 1.0 (at the antipode of the fixed point).&lt;/p&gt;
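
&lt;p&gt;A small sketch of what one &lt;strong&gt;geo&lt;/strong&gt; dimension computes under these definitions (spherical-Earth approximation; the reference coordinates are made up for illustration):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import math

def orthodromic_fraction(lat1, lon1, lat2, lon2):
    # Great-circle (orthodromic) distance between two points, expressed as a
    # fraction of the antipodal distance: 0.0 at the point itself, 1.0 at its antipode.
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    cos_angle = (math.sin(phi1) * math.sin(phi2)
                 + math.cos(phi1) * math.cos(phi2) * math.cos(dlon))
    central_angle = math.acos(max(-1.0, min(1.0, cos_angle)))
    return central_angle / math.pi

print(orthodromic_fraction(52.5, 13.4, 35.7, 139.7))   # Berlin vs. Tokyo, about 0.45
&lt;/code&gt;&lt;/pre&gt;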

&lt;h3 id=&quot;34-identity-vectors&quot;&gt;3.4. Identity vectors&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;id&lt;/strong&gt; vector is simply a one-hot vector identifying each language. These vectors can serve as simple identifiers of languages to a system, serve as the control in an experiment in introducing typological information to a system, as in Tsvetkov et al. (2016), or serve in combination with other vectors (such as &lt;strong&gt;fam&lt;/strong&gt;) that do not always identify a language uniquely.&lt;/p&gt;

&lt;h2 id=&quot;4-feature-prediction&quot;&gt;4. Feature prediction&lt;/h2&gt;

&lt;p&gt;One of the major difficulties in using typological features in multilingual processing is that &lt;strong&gt;many languages, and many features of individual languages, happen to be missing from the databases&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The authors’ efforts toward filling in missing values using &lt;strong&gt;KNN&lt;/strong&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The question of how we can best predict unknown typological features is a larger question (Daumé III and Campbell, 2007; Daumé III, 2009; Coke et al., 2016) than this article can capture in detail, but nonetheless we can offer a preliminary attempt at providing practically useful approximations of missing features &lt;strong&gt;by a k-nearest-neighbors approach&lt;/strong&gt;.&lt;/p&gt;

  &lt;p&gt;By taking an average of &lt;strong&gt;genetic&lt;/strong&gt;, &lt;strong&gt;geographical&lt;/strong&gt;, and &lt;strong&gt;feature distances&lt;/strong&gt; between languages, and calculating a weighted 10-nearest-neighbors classification, we can predict feature missing values with an &lt;strong&gt;accuracy of 92.93%&lt;/strong&gt; in 10-fold cross-validation.&lt;/p&gt;
&lt;/blockquote&gt;
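
&lt;p&gt;A toy sketch of this kind of weighted k-nearest-neighbour imputation for a single missing binary feature (the inverse-distance weighting is my own simplification, not necessarily the weighting the authors used):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np

def knn_impute(distances, neighbor_values, k=10):
    # distances: averaged genetic/geographic/feature distances to languages that do
    # have this feature annotated; neighbor_values: their 0/1 values for the feature.
    order = np.argsort(distances)[:k]
    weights = 1.0 / (distances[order] + 1e-6)            # closer languages count more
    score = np.average(neighbor_values[order], weights=weights)
    return float(np.round(score))                        # predicted binary value

distances = np.array([0.1, 0.4, 0.2, 0.9, 0.3])
values = np.array([1, 0, 1, 0, 1])
print(knn_impute(distances, values, k=3))                # 1.0
&lt;/code&gt;&lt;/pre&gt;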

&lt;h2 id=&quot;5-conclusion&quot;&gt;5. Conclusion&lt;/h2&gt;

&lt;p&gt;While there are many language-information resources available to NLP, their heterogeneity in format, semantics, language naming, and feature naming makes it difficult to combine them, compare them, and use them to predict missing values from each other. &lt;strong&gt;lang2vec&lt;/strong&gt; aims to make cross-source and cross-information-type experiments straightforward by providing standardized, normalized vectors representing a variety of information types.&lt;/p&gt;</content><author><name>Wenxin Hou 侯汶昕</name><email>houwx001@gmail.com</email></author><category term="multilingual" /><category term="NLP" /><summary type="html">Last Updated: 2020-06-24</summary></entry><entry><title type="html">[Paper-ST] Phone Features Improve Speech Translation</title><link href="https://houwx.net/posts/2020/01/blog-post-16/" rel="alternate" type="text/html" title="[Paper-ST] Phone Features Improve Speech Translation" /><published>2020-06-05T00:00:00-07:00</published><updated>2020-06-05T00:00:00-07:00</updated><id>https://houwx.net/posts/2020/01/blog-post-16-phoneST</id><content type="html" xml:base="https://houwx.net/posts/2020/01/blog-post-16/">&lt;p&gt;Last Updated: 2020-06-09&lt;/p&gt;

&lt;p&gt;This paper:  &lt;a href=&quot;https://arxiv.org/pdf/2005.13681.pdf&quot;&gt;Phone Features Improve Speech Translation&lt;/a&gt; is proposed by researchers from JHU and CMU. It is accepted by ACL 2020. This paper is recommended for its comprehensive experiments and analysis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code&lt;/strong&gt;: github.com/esalesky/xnmt-devel (seems not released yet)&lt;/p&gt;

&lt;p&gt;The authors compared cascaded and end-to-end models across &lt;strong&gt;high, medium, and low-resource&lt;/strong&gt; conditions, and showed that cascades remain stronger baselines.&lt;/p&gt;

&lt;p&gt;Further, the authors introduced two methods to &lt;strong&gt;incorporate phone features into ST models&lt;/strong&gt; which improves both architectures and closes the gap between end-to-end models and cascades.&lt;/p&gt;

&lt;h2 id=&quot;1-introduction&quot;&gt;1. Introduction&lt;/h2&gt;

&lt;p&gt;The authors propose two simple  heuristics to integrate phoneme-level information into neural speech translation models:&lt;/p&gt;

&lt;p&gt;(1) as a more robust intermediate representation in a cascade&lt;/p&gt;

&lt;p&gt;(2) as a concatenated embedding factor.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data&lt;/strong&gt;: Fisher Spanish–English dataset&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;We compare to recent work using &lt;strong&gt;phone segmentation&lt;/strong&gt; for end-to-end speech translation (Salesky et al., 2019), and show that our methods outperform this model by up to 20 BLEU on our lowest-resource condition.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;Finally, we test model &lt;strong&gt;robustness&lt;/strong&gt; by varying the quality of our phone features, which may indicate which models will better generalize across differently-resourced conditions&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;2-models-with-phone-supervision&quot;&gt;2. Models with Phone Supervision&lt;/h2&gt;

&lt;p&gt;Two proposed methods to incorporate phone features into cascaded and end-to-end models:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/posts/2020-06-05-blog-post-16-1.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Phone cascade&lt;/strong&gt;: uses phone labels  as the ASR output and the machine translation input&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Phone end-to-end&lt;/strong&gt;: concatenates trainable phone embeddings to typical speech feature vector input. Note that this method maintains the same source sequence length as the original speech feature sequence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Phone Segmentation (E2E baseline)&lt;/strong&gt;: uses phone boundaries to segment consecutive speech frames by averaging a variable number of features with the same phone label. This significantly reduces source sequence lengths (by ~80%), which in turn reduces the number of model parameters and the memory footprint.&lt;/p&gt;
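
&lt;p&gt;A minimal PyTorch sketch of the &lt;strong&gt;phone end-to-end&lt;/strong&gt; factor (the embedding size and tensor shapes are my own assumptions; the paper only specifies that trainable phone embeddings are concatenated to the 40-dim filterbank frames):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch
import torch.nn as nn

num_phones, phone_dim, fbank_dim = 50, 32, 40
phone_emb = nn.Embedding(num_phones, phone_dim)      # trainable phone label embeddings

fbank = torch.randn(1, 500, fbank_dim)               # (batch, frames, filterbank features)
phone_ids = torch.randint(0, num_phones, (1, 500))   # frame-level phone alignment
encoder_input = torch.cat([fbank, phone_emb(phone_ids)], dim=-1)
print(encoder_input.shape)                           # (1, 500, 72): same length, wider frames
&lt;/code&gt;&lt;/pre&gt;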

&lt;h2 id=&quot;3-data&quot;&gt;3. Data&lt;/h2&gt;

&lt;p&gt;Fisher Spanish-English Corpus containing 160 hours of Spanish telephone speech, split into 138k utterances. Standard dev/test sets are used. For medium / low resource experiments, 40 / 20 hours subsets of the data are randomly selected.&lt;/p&gt;

&lt;h2 id=&quot;4-generating-phone-supervision&quot;&gt;4. Generating Phone Supervision&lt;/h2&gt;

&lt;blockquote&gt;
  &lt;p&gt;We extract 40-dimensional Mel filter bank features with per-speaker mean and variance normalization using Kaldi (Povey et al., 2011). We train an HMM/GMM system on the full Fisher Spanish dataset with the Kaldi recipe (Povey et al., 2011), using the Spanish CALLHOME Lexicon (LDC96L16), and compute per-frame phone alignments with the triphone model (tri3a) with LDA+MLLT features. This yields 50 phone labels, including silence (&amp;lt;sil&amp;gt;), noise, and laughter.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;To leverage our better-performing neural ASR models for phone generation, we create essentially a ‘2-pass’ alignment procedure:&lt;/p&gt;

  &lt;ol&gt;
    &lt;li&gt;generating a transcript&lt;/li&gt;
    &lt;li&gt;using this transcript to force align phones.&lt;/li&gt;
  &lt;/ol&gt;

  &lt;p&gt;Table 1 shows the mapping between phone quality and the ASR models used for phone feature generation. This  procedure  enables us  to  both  improve phone alignment quality and also match training and inference procedures for phone generation for our translation models.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/posts/2020-06-05-blog-post-16-2.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;5-model--training-procedure&quot;&gt;5. Model &amp;amp; Training Procedure&lt;/h2&gt;

&lt;p&gt;The authors used &lt;strong&gt;xnmt&lt;/strong&gt; (Neubig et al., 2018) to build encoder-decoder models.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Our pyramidal encoder uses 3-layer BiLSTMs with linear network-in-network (NiN) projections and batch normalization between layers (Sperber et al., 2019; Zhang et al., 2017).&lt;/p&gt;

  &lt;p&gt;We  use  single layer MLP attention (Bahdanau et al., 2015) with 128 units and &lt;strong&gt;1 decoder layer&lt;/strong&gt; as opposed to 3 or 4 in previous work – &lt;strong&gt;we did not see consistent benefits from additional depth&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Please refer to the original paper for other details.&lt;/p&gt;

&lt;h2 id=&quot;6-prior-work-cascaded-vs-end-to-end-models-on-fisher-spanish-english&quot;&gt;6. Prior Work: Cascaded vs End-to-End Models on Fisher Spanish-English&lt;/h2&gt;

&lt;p&gt;The results are shown as Table 2. Please refer to the paper for detailed baseline settings (e.g. Parameter, Additional Data, etc.)&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/posts/2020-06-05-blog-post-16-3.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;7-results-using-phone-features&quot;&gt;7. Results Using Phone Features&lt;/h2&gt;

&lt;p&gt;From Table 3, we can find that &lt;strong&gt;phone cascade&lt;/strong&gt; is better than &lt;strong&gt;phone end-to-end&lt;/strong&gt;, but &lt;strong&gt;phone end-to-end&lt;/strong&gt; performs better than &lt;strong&gt;baseline cascade&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hybrid cascade&lt;/strong&gt; uses an ASR model with phone-informed downsampling and BPE targets (Salesky et al., 2019). This improves the WER of the ASR model to 28.1 on dev and 23.2 on test, matching Weiss et al. (2017)’s state-of-the-art on test (23.2) and approaching it on dev (25.7). It is the best-performing model on the full dataset. However, under lower-resource conditions, it does not perform as favorably as the phone-featured models, as shown in Figure 2. &lt;strong&gt;This suggests improving ASR may enable cascades to perform better at high-resource conditions, but under lower-resource conditions it is not as effective as utilizing phone features.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/posts/2020-06-08-blog-post-16-4.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/posts/2020-06-08-blog-post-16-6.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Training time&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/posts/2020-06-08-blog-post-16-5.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Comparing to previous work using additional data&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Incorporating phone information makes the models more data-efficient.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;We note that our phone models further outperform previous work trained with additional corpora. The attention-passing model of Sperber et al. (2019) trained on additional parallel Spanish-English text yields 38.8 on test on the full dataset, which Salesky et al. (2019) matches on the full dataset and our proposed models exceed, with the phone cascade yielding a similar result (37.4) trained on only 40 hours.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;Pre-training with 300 hours of English ASR data and fine-tuning on 20 hours of Spanish-English data, Stoian et al. (2020) and Bansal et al. (2019) improve their end-to-end models from ≈10 BLEU to 20.2. All three of our proposed models exceed this mark when trained on 20 hours of Fisher.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;8-model-robustness--further-analysis&quot;&gt;8. Model Robustness &amp;amp; Further Analysis&lt;/h2&gt;

&lt;h3 id=&quot;81-phone-cascade&quot;&gt;8.1. Phone Cascade&lt;/h3&gt;

&lt;p&gt;Figure 3 compares the impact of different phone qualities on downstream MT. Note that with gold alignments, translation performance is similar to text-based translation.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/posts/2020-06-09-blog-post-16-7.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Redundancy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The authors collapsed adjacent consecutive phones with the same label in phone cascaded models.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;For the phone cascade models compared in Figure 3, we collapse adjacent consecutive phones with the same label, i.e. &lt;strong&gt;when three consecutive frames have been aligned to the same phone label ‘B B B’ we have reduced the sequence to a single phone ‘B’ for translation&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
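
&lt;p&gt;For example, this collapsing step is just a run-length reduction over the frame-level labels (a sketch, with made-up labels):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from itertools import groupby

frame_phones = ['B', 'B', 'B', 'IY', 'IY', 'T', 'T', 'T', 'T']
collapsed = [label for label, _ in groupby(frame_phones)]
print(collapsed)   # ['B', 'IY', 'T'] -- adjacent repeats reduced to a single phone
&lt;/code&gt;&lt;/pre&gt;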

&lt;p&gt;Translating with the full sequence of phones hurts performance.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Translating the full sequence of redundant frame-level phone labels  (e.g. the same sequence length as the number of frames), for the full 160hr dataset, all models performed on average 0.6 BLEU worse; for 40hr, 1.8 BLEU worse; and with 20 hours, 4.1 BLEU worse – a 13% decrease in performance solely from non-uniqued sequences.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3 id=&quot;82-phone-end-to-end&quot;&gt;8.2. Phone End-to-End&lt;/h3&gt;

&lt;blockquote&gt;
  &lt;p&gt;Our  phone  end-to-end  model  concatenates  trainable embeddings for phone labels to frame-level filterbank features,  associating similar feature vectors globally across the corpus, as opposed to locally within an utterance with the phone-averaged embeddings.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/posts/2020-06-09-blog-post-16-8.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The model’s performance degradation compared to the phone cascade in lower-resource conditions is likely due in part to these sequence lengths, as shown by our additional experiments with input redundancy for the cascade.  The greater reduction in performance here using lower quality phones suggests the noise of the labels and concatenated filterbank features compound, further detracting from performance. Perhaps further investigation into the relative weights placed on the two embedding factors over the training process could close this additional gap.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3 id=&quot;83-phone-segmentation-salesky-et-al-2019&quot;&gt;8.3. Phone Segmentation: Salesky et al. (2019)&lt;/h3&gt;

&lt;blockquote&gt;
  &lt;p&gt;That work introduced downsampling informed by phone segmentation– unlike our other models, &lt;strong&gt;the value of the phone label is not used&lt;/strong&gt;, but rather, phone alignments are used only to determine the boundary between adjacent phones for variable-length downsampling. We hypothesize that the primary reason for their BLEU improvements is &lt;strong&gt;the reduction in local redundancy between similar frames&lt;/strong&gt;, as discovered in the previous section.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3 id=&quot;84-quality-of-phone-labels&quot;&gt;8.4. Quality of Phone Labels&lt;/h3&gt;

&lt;p&gt;Two examples of phone sequences:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/posts/2020-06-09-blog-post-16-9.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;We see the primary difference in produced phones between different models is the label values, rather than the boundaries.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;We note that &lt;strong&gt;differences in frame-level phone boundaries would not affect our phone cascaded models&lt;/strong&gt;, where the speech features are discarded, while they would affect our phone end-to-end models, where the phone labels are concatenated to speech feature vectors and associate them across the corpus.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;9-related-works&quot;&gt;9. Related Works&lt;/h2&gt;

&lt;p&gt;Skipped. Please refer to the paper.&lt;/p&gt;

&lt;h2 id=&quot;10-conclusion&quot;&gt;10. Conclusion&lt;/h2&gt;

&lt;blockquote&gt;
  &lt;p&gt;We show that phone features significantly improve the performance and data efficiency of neural speech translation models.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Greatest improvements in &lt;strong&gt;low-resource settings&lt;/strong&gt; (20 hours):&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;End-to-end: 5 BLEU above the baseline cascade&lt;/li&gt;
  &lt;li&gt;Cascade: 9 BLEU above prior work&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
  &lt;p&gt;Generating phone features uses the same data as auxiliary speech recognition tasks from prior work; our experiments suggest these features are a more effective use of this data, with our models matching previous works’ performance without additional training data.&lt;/p&gt;
&lt;/blockquote&gt;</content><author><name>Wenxin Hou 侯汶昕</name><email>houwx001@gmail.com</email></author><category term="ST" /><category term="low resource" /><summary type="html">Last Updated: 2020-06-09</summary></entry></feed>