<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.9.0">Jekyll</generator><link href="https://houwx.net/feed.xml" rel="self" type="application/atom+xml" /><link href="https://houwx.net/" rel="alternate" type="text/html" /><updated>2023-04-23T19:49:18-07:00</updated><id>https://houwx.net/feed.xml</id><title type="html">Wenxin Hou</title><subtitle>personal description</subtitle><author><name>Wenxin Hou 侯汶昕</name><email>houwx001@gmail.com</email></author><entry><title type="html">[Paper-ASR][中文笔记] A Comparative Study on Neural Architectures and Training Methods for Japanese Speech Recognition</title><link href="https://houwx.net/posts/2020/01/blog-post-31/" rel="alternate" type="text/html" title="[Paper-ASR][中文笔记] A Comparative Study on Neural Architectures and Training Methods for Japanese Speech Recognition" /><published>2021-07-04T00:00:00-07:00</published><updated>2021-07-04T00:00:00-07:00</updated><id>https://houwx.net/posts/2020/01/blog-post-31-japanese-e2e-asr</id><content type="html" xml:base="https://houwx.net/posts/2020/01/blog-post-31/">&lt;p&gt;Last Updated: 2021-07-04&lt;/p&gt;

&lt;p&gt;This is an original article. Please contact me before reposting. Thank you!&lt;/p&gt;

&lt;p&gt;In this post we introduce a paper by a Google team at INTERSPEECH 2021 on Japanese end-to-end (E2E) speech recognition. The paper surveys a range of recent E2E modeling choices, such as LSTM / Conformer encoders and CTC / Transducer / attention-based decoders, and also evaluates several recent training techniques, including SpecAugment, Variational Noise Injection (VNI), and Exponential Moving Average (EMA). The best model and training configuration achieves state-of-the-art character error rates (CER) of 4.1%, 3.2%, and 3.5% on CSJ eval1, eval2, and eval3, respectively.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Paper link: &lt;a href=&quot;https://arxiv.org/abs/2106.05111&quot;&gt;https://arxiv.org/abs/2106.05111&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/posts/2021-07-04-blog-post-31-7.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;背景介绍&quot;&gt;Background&lt;/h2&gt;

&lt;p&gt;Because Japanese, unlike English, has no explicit word delimiters (spaces), conventional HMM systems and pronunciation lexicons need a word segmenter to split Japanese text into words. E2E models, in contrast, can model characters directly, which greatly simplifies the ASR pipeline.&lt;/p&gt;

&lt;p&gt;On the encoder side, BLSTMs with convolution were the first to surpass HMM systems on Japanese speech recognition; Transformer models then outperformed BLSTMs, and Conformer models further reduced CER on Japanese and other languages.&lt;/p&gt;

&lt;p&gt;On the decoder side, CTC and attention have been used extensively in previous work, but the Transducer has not been widely applied to Japanese ASR. Like CTC, the Transducer is well suited to streaming applications, and like attention it can also learn dependencies between outputs.&lt;/p&gt;

&lt;p&gt;Training techniques are also crucial for E2E models: SpecAugment, VNI, and EMA have all been shown to be effective for ASR training, but no prior work has compared them thoroughly.&lt;/p&gt;

&lt;p&gt;The contribution of this paper is a thorough comparison of the architectures and training techniques above, as well as their combinations, evaluated on the CSJ corpus along several dimensions (CER, training throughput, convergence, and inference RTF).&lt;/p&gt;

&lt;h2 id=&quot;神经网络架构&quot;&gt;Neural Network Architectures&lt;/h2&gt;

&lt;p&gt;The model consists of an encoder and a decoder. The input is log-mel filterbank features and the output is a posterior distribution; the training objective is to maximize the posterior probability of the correct target sequence.&lt;/p&gt;

&lt;h3 id=&quot;blstm-encoder&quot;&gt;BLSTM Encoder&lt;/h3&gt;

&lt;p&gt;The input features are first downsampled by stacked convolutions and then processed frame by frame, recurrently, by stacked BLSTM layers.&lt;/p&gt;

&lt;p&gt;The drawback is that BLSTMs are hard to parallelize and cannot fully exploit GPU / TPU acceleration.&lt;/p&gt;

&lt;h3 id=&quot;conformer-encoder&quot;&gt;Conformer Encoder&lt;/h3&gt;

&lt;p&gt;The Conformer encoder differs from the BLSTM encoder only in that Conformer blocks replace the BLSTM layers. Figure 1 shows the structure of a Conformer block. The Conformer models global context with multi-head self-attention based on relative positional encoding; for an input sequence of length T, the computation and memory cost of multi-head self-attention are both O(T^2).&lt;/p&gt;

&lt;p&gt;Figure 2 expands the Convolution Module from Figure 1, which is used to model local features.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/posts/2021-07-04-blog-post-31-1.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Conformer training can be parallelized, so it trains much faster than BLSTM; the downside is the attention complexity mentioned above, which can become a bottleneck for very long input sequences. In the experiments this is measured by training throughput.&lt;/p&gt;

&lt;h3 id=&quot;ctc-decoder&quot;&gt;CTC Decoder&lt;/h3&gt;
&lt;p&gt;The CTC decoder is just a linear layer followed by a softmax. CTC predicts the distribution over output sequences by introducing a variable Z that represents the alignment between the encoder features X and the output sequence Y. The relation between Y and Z is given by a &lt;strong&gt;many-to-one&lt;/strong&gt; mapping B:&lt;/p&gt;

\[Y = B(Z)\]

&lt;p&gt;B(Z) simply removes the blanks and repeated symbols in Z; for example, B(aa&amp;lt;blank&amp;gt;b) = ab. Furthermore, the probability of each symbol $Z_t$ in Z is the $Z_t$-th element of the distribution that the CTC decoder produces from the corresponding frame of X, i.e., $C_t=\text{CTC}(X_t)$ and $P(Z_t) = C_{t, Z_t}$. This also shows that CTC makes a conditional independence assumption: the prediction of $Z_t$ does not depend on its context $Z_{t-1}$ and $Z_{t+1}$.&lt;/p&gt;

&lt;p&gt;How is CTC trained, then? We invert B to obtain a &lt;strong&gt;one-to-many&lt;/strong&gt; mapping $B^{-1}$, for example $B^{-1}(ab) = \{\text{aa&amp;lt;blank&amp;gt;b, aaab, abbb, etc.}\}$, and then maximize the total posterior probability of all alignments that map to Y:&lt;/p&gt;

\[p_{ctc}(Z|X)=\prod_{t=1}^T C_{t, Z_t}\]

\[L_{ctc} = - \log \sum_{Z \in B^{-1}(Y)} p_{ctc}(Z | X)\]
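
&lt;p&gt;To make the mapping B and the CTC objective concrete, here is a minimal PyTorch sketch (our own illustration, not code from the paper; the helper name collapse and the toy shapes are assumptions). It relies on torch.nn.functional.ctc_loss, which performs the sum over all alignments internally:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import torch
import torch.nn.functional as F

def collapse(z, blank=0):
    &quot;&quot;&quot;B(Z): merge repeated symbols and drop blanks.&quot;&quot;&quot;
    out, prev = [], None
    for s in z:
        if s != prev and s != blank:
            out.append(s)
        prev = s
    return out

# toy check: B(a a &amp;lt;blank&amp;gt; b) = a b, with a=1, b=2, blank=0
assert collapse([1, 1, 0, 2]) == [1, 2]

# CTC loss over encoder outputs: log_probs has shape (T, batch, vocab)
T, N, V, U = 50, 2, 30, 10
log_probs = torch.randn(T, N, V).log_softmax(-1)
targets = torch.randint(1, V, (N, U))      # index 0 is reserved for blank
input_lengths = torch.full((N,), T)
target_lengths = torch.full((N,), U)
loss = F.ctc_loss(log_probs, targets, input_lengths, target_lengths, blank=0)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;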

&lt;h3 id=&quot;transducer-decoder&quot;&gt;Transducer Decoder&lt;/h3&gt;

&lt;p&gt;The Transducer is similar in principle to CTC: both are trained by maximizing the probability of the possible alignments. The difference is that the Transducer drops the conditional independence assumption of CTC and models the conditional distribution recurrently.&lt;/p&gt;

\[L_{\text{transducer}} = - \log \sum_{Z \in B^{-1}(Y)} \prod_t p_\text{transducer} (Z_t | X, B(Z_{1:t-1}))\]

&lt;p&gt;The Transducer decoder typically uses a recurrent neural network (RNN) for this recursive encoding.&lt;/p&gt;
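
&lt;p&gt;As an illustration only (not the implementation used in the paper), the sketch below shows a typical prediction + joint network and computes the alignment-summing loss with torchaudio.functional.rnnt_loss. The layer sizes are placeholders, and we assume a torchaudio version (0.10 or later) that provides this function:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import torch
import torch.nn as nn
import torchaudio

class TransducerDecoder(nn.Module):
    &quot;&quot;&quot;Prediction + joint network: unlike CTC, every output step is
    conditioned on the previously emitted labels via the prediction LSTM.&quot;&quot;&quot;
    def __init__(self, vocab, enc_dim=256, pred_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, pred_dim)
        self.pred = nn.LSTM(pred_dim, pred_dim, batch_first=True)
        self.joint = nn.Sequential(
            nn.Linear(enc_dim + pred_dim, 256), nn.Tanh(), nn.Linear(256, vocab))

    def forward(self, enc, prev_labels):
        # enc: (N, T, enc_dim); prev_labels: (N, U + 1) with a leading blank
        p, _ = self.pred(self.embed(prev_labels))             # (N, U+1, pred_dim)
        e = enc.unsqueeze(2).expand(-1, -1, p.size(1), -1)    # (N, T, U+1, enc_dim)
        p = p.unsqueeze(1).expand(-1, enc.size(1), -1, -1)    # (N, T, U+1, pred_dim)
        return self.joint(torch.cat([e, p], dim=-1))          # (N, T, U+1, vocab)

N, T, U, V = 2, 50, 10, 30
decoder = TransducerDecoder(V)
enc = torch.randn(N, T, 256)
targets = torch.randint(1, V, (N, U), dtype=torch.int32)
prev = torch.cat([torch.zeros(N, 1, dtype=torch.long), targets.long()], dim=1)
logits = decoder(enc, prev)
loss = torchaudio.functional.rnnt_loss(
    logits, targets,
    torch.full((N,), T, dtype=torch.int32),
    torch.full((N,), U, dtype=torch.int32),
    blank=0)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;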

&lt;h3 id=&quot;attention-decoder&quot;&gt;Attention Decoder&lt;/h3&gt;

&lt;p&gt;An attention decoder generally consists of two modules: an attention module Attend() and an output module Spell(). Unlike CTC and the Transducer, it does not require an explicit alignment between each speech frame and the output symbols.&lt;/p&gt;

\[L_\text{att} = - \log \prod_t p_\text{att} (Y_t | X, Y_{1:t-1})\]

&lt;p&gt;Attention decoders are widely used in sequence-to-sequence (Seq2Seq) generation tasks in NLP, so we do not describe them in detail here.&lt;/p&gt;
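
&lt;p&gt;For completeness, the attention loss above is simply the teacher-forced cross-entropy over output characters; a minimal sketch with illustrative shapes (not code from the paper):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import torch
import torch.nn.functional as F

# dec_logits: (N, U, vocab) produced by Attend() + Spell() under teacher forcing,
# i.e. step t attends over the encoder output X and consumes the gold prefix Y_{1:t-1}
N, U, V = 2, 10, 30
dec_logits = torch.randn(N, U, V)
targets = torch.randint(0, V, (N, U))
loss_att = F.cross_entropy(dec_logits.reshape(-1, V), targets.reshape(-1))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;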

&lt;h2 id=&quot;训练技巧&quot;&gt;Training Techniques&lt;/h2&gt;

&lt;h3 id=&quot;specaugment&quot;&gt;SpecAugment&lt;/h3&gt;

&lt;p&gt;SpecAugment is a data augmentation method designed for ASR. It mainly consists of time masking and frequency masking (time warping is not used in this paper). For tricks like this, the code explains itself best:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import random

import numpy as np
from PIL import Image
from PIL.Image import BICUBIC

&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;spec_augmentation&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                      &lt;span class=&quot;n&quot;&gt;warp_for_time&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;False&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                      &lt;span class=&quot;n&quot;&gt;num_t_mask&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                      &lt;span class=&quot;n&quot;&gt;num_f_mask&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                      &lt;span class=&quot;n&quot;&gt;max_t&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;50&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                      &lt;span class=&quot;n&quot;&gt;max_f&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                      &lt;span class=&quot;n&quot;&gt;max_w&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;80&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&quot;&quot;&quot; Deep copy x and do spec augmentation then return it
    Args:
        x: input feature, T * F 2D
        num_t_mask: number of time mask to apply
        num_f_mask: number of freq mask to apply
        max_t: max width of time mask
        max_f: max width of freq mask
        max_w: max width of time warp
    Returns:
        augmented feature
    &quot;&quot;&quot;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;copy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;max_frames&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shape&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;max_freq&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shape&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;# time warp
&lt;/span&gt;    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;warp_for_time&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;and&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max_frames&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max_w&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;center&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;random&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;randrange&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;max_w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max_frames&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max_w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;warped&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;random&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;randrange&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;center&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max_w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;center&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max_w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;

        &lt;span class=&quot;n&quot;&gt;left&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Image&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fromarray&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;center&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;resize&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;max_freq&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;warped&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;BICUBIC&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;right&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Image&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fromarray&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;center&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:]).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;resize&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;max_freq&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max_frames&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;warped&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;BICUBIC&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;concatenate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;left&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;right&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;# time mask
&lt;/span&gt;    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;num_t_mask&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;start&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;random&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;randint&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max_frames&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;length&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;random&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;randint&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max_t&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;end&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;min&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;max_frames&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;start&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;length&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;start&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;end&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;:]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;# freq mask
&lt;/span&gt;    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;num_f_mask&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;start&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;random&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;randint&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max_freq&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;length&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;random&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;randint&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max_f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;end&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;min&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;max_freq&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;start&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;length&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[:,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;start&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;end&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;exponential-moving-average-ema&quot;&gt;Exponential Moving Average (EMA)&lt;/h3&gt;

&lt;p&gt;EMA improves training stability and generalization by maintaining a shadow copy of the model weights. The underlying assumption is that during the last n steps the weights oscillate around the actual optimum, so averaging over those steps gives a more robust model. After each training step $k$, the EMA parameters $\theta_k&apos;$ are computed as:&lt;/p&gt;

\[\theta_k&apos; = \gamma \theta_{k-1}&apos; + (1 - \gamma) \theta_k\]

&lt;p&gt;where $\gamma$ is the decay rate. The code is as follows:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;ExponentialMovingAverage&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;object&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;__init__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;decay&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;0.9999&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;decay&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;decay&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shadow&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{}&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;backup&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{}&lt;/span&gt;
    
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;load&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ema_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;param&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ema_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;named_parameters&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;():&lt;/span&gt;
            &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shadow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;param&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;clone&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;register&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;param&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;named_parameters&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;():&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;param&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;requires_grad&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
                &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shadow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;param&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;clone&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;update&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;param&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;named_parameters&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;():&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;param&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;requires_grad&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
                &lt;span class=&quot;k&quot;&gt;assert&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shadow&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;new_average&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;1.0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;decay&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;param&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;decay&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shadow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
                &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shadow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;new_average&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;clone&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
	
    &lt;span class=&quot;c1&quot;&gt;# the methods below swap the EMA parameters in and out when evaluating the model
&lt;/span&gt;    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;apply_shadow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;param&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;named_parameters&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;():&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;param&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;requires_grad&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
                &lt;span class=&quot;k&quot;&gt;assert&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shadow&lt;/span&gt;
                &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;backup&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;param&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;param&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shadow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;restore&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;param&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;named_parameters&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;():&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;param&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;requires_grad&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
                &lt;span class=&quot;k&quot;&gt;assert&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;backup&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;param&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;backup&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;backup&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;variational-noise&quot;&gt;Variational Noise&lt;/h3&gt;

&lt;p&gt;Variational noise also improves the generalization of neural networks. Concretely, a noise sample drawn from a normal distribution is added to the original parameters, and the resulting noisy parameters are used for training. (In this paper it is applied only to the embedding and LSTM layers.)&lt;/p&gt;

\[\theta&apos; = \theta + n, n \sim \text{Normal}(0, \sigma^2)\]

&lt;p&gt;The method is simple and flexible; a minimal sketch is given below.&lt;/p&gt;
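
&lt;p&gt;A minimal sketch in the spirit of the formula above (our own code, not from the paper; the standard deviation and the name filter are placeholders). Fresh noise is sampled before each training step, and the clean weights are restored before the optimizer update:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import torch

class VariationalNoise(object):
    &quot;&quot;&quot;Weight noise: theta_noisy = theta + n, n ~ Normal(0, sigma^2),
    applied only to parameters whose names match the filter.&quot;&quot;&quot;
    def __init__(self, model, std=0.075, name_filter=(&quot;embed&quot;, &quot;lstm&quot;)):
        self.model = model
        self.std = std
        self.name_filter = name_filter
        self.backup = {}

    def apply(self):
        for name, param in self.model.named_parameters():
            if param.requires_grad and any(k in name for k in self.name_filter):
                self.backup[name] = param.data.clone()
                param.data.add_(torch.randn_like(param.data) * self.std)

    def restore(self):
        for name, param in self.model.named_parameters():
            if name in self.backup:
                param.data = self.backup[name]
        self.backup = {}

# per training step:
#   noise.apply(); loss = model(batch); loss.backward(); noise.restore(); optimizer.step()
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;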

&lt;h2 id=&quot;实验结果&quot;&gt;Experimental Results&lt;/h2&gt;

&lt;h3 id=&quot;实验设置&quot;&gt;Experimental Setup&lt;/h3&gt;

&lt;p&gt;A brief description of the CSJ corpus: 581 hours of speech, with 3259 characters plus 3 special symbols (SOS, EOS, UNK) as modeling units. The features are 80-dimensional log mel-filterbanks, with global CMVN (zero mean and unit variance per channel) computed on the training set.&lt;/p&gt;

&lt;p&gt;We will not go through all the hyperparameters; as you would expect from Google, the models are not small. The settings for SpecAugment and the other techniques can be found in the original paper.&lt;/p&gt;

&lt;h3 id=&quot;模型架构-对比&quot;&gt;Comparison of Model Architectures&lt;/h3&gt;

&lt;p&gt;First, Table 1 shows that the Conformer encoder is clearly better than BLSTM in both CER and training speed. On the decoder side, attention and Transducer perform similarly overall and CTC is slightly worse; in terms of training speed, CTC is faster than attention, which is faster than Transducer.&lt;/p&gt;

&lt;p&gt;The choice of encoder affects CER much more than the choice of decoder.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/posts/2021-07-04-blog-post-31-2.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;In terms of convergence, the Conformer again converges faster and to a better point than BLSTM. The attention decoder converges faster early in training, but this effect is weaker than on the encoder side, and the gap narrows, and is sometimes even reversed, later in training.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/posts/2021-07-04-blog-post-31-3.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;h3 id=&quot;训练技巧-对比&quot;&gt;Comparison of Training Techniques&lt;/h3&gt;

&lt;p&gt;In terms of CER, SpecAugment brings a larger gain than EMA, which in turn brings a larger gain than VNI; the three are complementary, and combining them gives the best training result.&lt;/p&gt;

&lt;p&gt;None of these techniques has a noticeable impact on training speed, so they can be used without worry.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/posts/2021-07-04-blog-post-31-4.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The convergence curves show that training without EMA (red) becomes clearly less stable, and using EMA helps speed up convergence.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/posts/2021-07-04-blog-post-31-5.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;h3 id=&quot;计算复杂度-对比&quot;&gt;Comparison of Computational Cost&lt;/h3&gt;

&lt;p&gt;The authors set the batch size to 1, decode on a CPU, and measure the Real-Time Factor (RTF). RTF measures decoding speed: the time needed to decode an utterance divided by the duration of that utterance, so lower is better.&lt;/p&gt;
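
&lt;p&gt;In code form, RTF is simply the following (a trivial sketch, with decode_fn standing in for any decoder):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import time

def real_time_factor(decode_fn, waveform, audio_seconds):
    &quot;&quot;&quot;RTF = wall-clock decoding time / audio duration; lower is better.&quot;&quot;&quot;
    start = time.perf_counter()
    decode_fn(waveform)
    return (time.perf_counter() - start) / audio_seconds
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;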

&lt;p&gt;Again, the Conformer encoder outperforms BLSTM. Among the decoders, the Transducer is the fastest, followed by CTC, with the attention decoder the slowest. Note that attention and Transducer use beam search with a beam width of 8 while CTC uses greedy search. The search algorithms are all implemented in C++, so every decoder listed in the table is in fact quite fast.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/posts/2021-07-04-blog-post-31-6.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;总结&quot;&gt;Summary&lt;/h2&gt;

&lt;p&gt;This paper explores a range of model architectures and training techniques for Japanese end-to-end speech recognition; it is an excellent piece of engineering work.&lt;/p&gt;</content><author><name>Wenxin Hou 侯汶昕</name><email>houwx001@gmail.com</email></author><category term="ASR" /><category term="Japanese" /><summary type="html">Last Updated: 2021-07-04</summary></entry><entry><title type="html">[Paper-ASR][中文笔记] 用于语音识别的无监督字符级分布适配 (CMatch)</title><link href="https://houwx.net/posts/2020/01/blog-post-30/" rel="alternate" type="text/html" title="[Paper-ASR][中文笔记] 用于语音识别的无监督字符级分布适配 (CMatch)" /><published>2021-06-07T00:00:00-07:00</published><updated>2021-06-07T00:00:00-07:00</updated><id>https://houwx.net/posts/2020/01/blog-post-30-CMatch</id><content type="html" xml:base="https://houwx.net/posts/2020/01/blog-post-30/">&lt;p&gt;Last Updated: 2021-07-04&lt;/p&gt;

&lt;p&gt;This paper is joint work I (Wenxin Hou) did with Jindong Wang during my internship at MSRA. This post is reproduced from the Zhihu column of Jindong Wang: &lt;a href=&quot;https://zhuanlan.zhihu.com/p/370691801&quot;&gt;https://zhuanlan.zhihu.com/p/370691801&lt;/a&gt;. Please contact Jindong Wang or me before reposting.&lt;/p&gt;

&lt;p&gt;In this post we introduce &lt;strong&gt;an unsupervised character-level domain adaptation method for speech recognition: CMatch&lt;/strong&gt;. In this work we propose an unsupervised character-level distribution matching method for ASR, CMatch, which performs fine-grained adaptation between each character across two different domains. Experiments on the Libri-Adapt dataset show that CMatch achieves relative word error rate (WER) reductions of 14.39% and 16.50% for cross-device and cross-environment adaptation, respectively. We also provide a comprehensive analysis of frame-level label assignment strategies and Transformer-based domain adaptation. The paper has been submitted to INTERSPEECH 2021.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Author: Wenxin Hou (graduate student at Tokyo Institute of Technology)&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Paper link: &lt;a href=&quot;https://arxiv.org/abs/2104.07491&quot;&gt;https://arxiv.org/abs/2104.07491&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/files/papers/uda/CMatch_Figures/title.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;背景介绍&quot;&gt;Background&lt;/h2&gt;

&lt;p&gt;It is well known that end-to-end automatic speech recognition (ASR) based on deep learning can achieve strong performance given large-scale training data and powerful models. However, the training and test data may follow similar but mismatched distributions due to differences in recording devices or environments, and such distribution (or domain) mismatch usually degrades recognition accuracy at test time. These mismatches are so common and so varied that it is impractical to collect and label large amounts of speech for every domain. In such cases we often need unsupervised domain adaptation to improve performance on the target domain.&lt;/p&gt;

&lt;p&gt;Existing unsupervised domain adaptation methods usually treat each domain as a single distribution and then adapt, for example by domain-adversarial training or feature matching. These methods may therefore ignore finer-grained distributional knowledge within the domains, such as characters, phonemes, or words, which can hurt adaptation to some extent. This was also verified in [1]: compared with conventional methods that align entire domains, aligning subdomains of images (i.e., domains partitioned by class label) usually achieves better adaptation performance.&lt;/p&gt;

&lt;p&gt;Taking the figure below as an example, after running the CMatch algorithm, identical characters from the two domains are pulled closer together in the feature distribution.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/files/papers/uda/CMatch_Figures/Picture1.jpg&quot; alt=&quot;Illustration of CMatch: identical characters from the two domains are pulled closer together in the feature distribution&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;方法介绍&quot;&gt;Method&lt;/h2&gt;

&lt;p&gt;CMatch consists of two steps:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;Frame-level label assignment&lt;/li&gt;
  &lt;li&gt;Character-level distribution matching&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;帧级标签分配&quot;&gt;Frame-Level Label Assignment&lt;/h3&gt;

&lt;p&gt;Frame-level label assignment first requires reasonably accurate label alignments. We consider the three strategies shown in the figure below: CTC forced alignment, dynamic frame averaging, and pseudo CTC labels. CTC forced alignment uses a pre-trained CTC module to compute the most likely CTC path for each transcript (with repeated and blank symbols inserted) and assigns it to the speech frames; it is relatively accurate but computationally expensive. Dynamic frame averaging distributes the frames evenly over the characters, which assumes uniform speaking rates in the source and target domains. The pseudo-CTC-label method reuses the CTC module that is already well trained on the source domain together with confidence-based filtering (e.g., t, e, p in the figure), balancing efficiency and accuracy.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/files/papers/uda/CMatch_Figures/Picture4.jpg&quot; alt=&quot;Three frame-level label assignment strategies&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Note that on the source domain we use the ground-truth transcripts for label assignment, whereas the target domain has no transcripts, so we first pseudo-label the target speech with the source model and then use the predicted transcripts for label assignment.&lt;/p&gt;
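
&lt;p&gt;As an illustration of the pseudo-CTC-label strategy described above (a sketch under our own naming, not the released implementation), one can take the frame-wise argmax of the CTC posterior and keep only confidently non-blank frames:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import torch

def pseudo_ctc_labels(ctc_log_probs, blank=0, threshold=0.9):
    &quot;&quot;&quot;Assign a character label to each encoder frame from CTC posteriors.
    ctc_log_probs: (T, vocab) log-probabilities from the CTC branch.
    Returns (frame_indices, labels) for frames that are confidently non-blank.&quot;&quot;&quot;
    probs = ctc_log_probs.exp()
    conf, labels = probs.max(dim=-1)
    keep = (labels != blank) &amp;amp; (conf &amp;gt; threshold)
    frames = torch.nonzero(keep, as_tuple=False).squeeze(-1)
    return frames, labels[keep]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;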

&lt;h3 id=&quot;字符级别的分布匹配&quot;&gt;Character-Level Distribution Matching&lt;/h3&gt;
&lt;p&gt;Once the frame-level labels are obtained, we perform character-level distribution matching. In this paper we adopt the Maximum Mean Discrepancy (MMD) for feature matching. MMD measures the discrepancy between two distributions and is a common distribution metric in transfer learning. Its formula is:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/files/papers/uda/CMatch_Figures/Picture7.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;In practice, given source and target samples $X_S$ and $X_T$, we compute the biased empirical estimate of MMD:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/files/papers/uda/CMatch_Figures/Picture8.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Averaging the MMD over all characters gives the character-level distribution matching loss:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/files/papers/uda/CMatch_Figures/Picture9.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
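
&lt;p&gt;As a sketch of how the character-level matching term can be estimated in PyTorch (our own illustration; the Gaussian kernel and its bandwidth are placeholders and may differ from the kernel used in the paper):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import torch

def mmd(x_s, x_t, sigma=1.0):
    &quot;&quot;&quot;Biased empirical MMD^2 between feature sets x_s (n_s, d) and x_t (n_t, d).&quot;&quot;&quot;
    def k(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))
    return k(x_s, x_s).mean() + k(x_t, x_t).mean() - 2 * k(x_s, x_t).mean()

def cmatch_loss(feats_s, labels_s, feats_t, labels_t):
    &quot;&quot;&quot;Average MMD over characters observed in both domains (sketch).&quot;&quot;&quot;
    losses = []
    for c in set(labels_s.tolist()) &amp;amp; set(labels_t.tolist()):
        losses.append(mmd(feats_s[labels_s == c], feats_t[labels_t == c]))
    return torch.stack(losses).mean() if losses else feats_s.new_zeros(())
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;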

&lt;h3 id=&quot;最终损失函数和算法流程&quot;&gt;Final Loss Function and Algorithm&lt;/h3&gt;

&lt;p&gt;We adopt a hybrid CTC-Attention model as the base ASR model, jointly learning the CTC module (used for frame-level label assignment) and the Transformer-decoder-based sequence-to-sequence loss, so the speech recognition loss can be written as:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/files/papers/uda/CMatch_Figures/Picture2.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Combining the distribution matching loss with the speech recognition loss gives the final loss function:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/files/papers/uda/CMatch_Figures/Picture10.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The overall algorithm is as follows:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/files/papers/uda/CMatch_Figures/Picture11.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;实验效果&quot;&gt;Experimental Results&lt;/h2&gt;

&lt;p&gt;First, let us look at in-domain recognition performance. The metric is word error rate (WER); lower is better:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/files/papers/uda/CMatch_Figures/Picture12.jpg&quot; alt=&quot;In-domain recognition results&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Next, the cross-device results. Note that the source-only model degrades on speech recorded by other devices compared with the in-domain models. Methods based on global MMD and domain-adversarial training both improve over it, and CMatch achieves the best result in every setting.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/files/papers/uda/CMatch_Figures/Picture13.jpg&quot; alt=&quot;Cross-device recognition results&quot; /&gt;&lt;/p&gt;

&lt;p&gt;CMatch also performs well in cross-environment (noise-robust) speech recognition.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/files/papers/uda/CMatch_Figures/Picture14.jpg&quot; alt=&quot;Cross-environment recognition results&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The ablation study shows that combining self-training with fine-grained distribution matching lets CMatch reach its best performance.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/files/papers/uda/CMatch_Figures/Picture15.jpg&quot; alt=&quot;Ablation study&quot; /&gt;&lt;/p&gt;

&lt;p&gt;We also compared the three label assignment methods. CTC forced alignment gives the best results but also has the highest computational cost; frame averaging also works reasonably well, but it assumes uniform speaking rates in the source and target domains; the pseudo-CTC-label method achieves results close to CTC forced alignment while being much more efficient to compute.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/files/papers/uda/CMatch_Figures/Picture16.jpg&quot; alt=&quot;Comparison of label assignment methods&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Finally, we examined whether the CMatch loss should also be applied on the decoder side. It turns out that, because the decoders in our experiments have no functional difference across domains, the target text being standard English in all cases, reducing the distribution discrepancy there brings no benefit and can even hurt performance.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/files/papers/uda/CMatch_Figures/Picture17.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;总结&quot;&gt;Summary&lt;/h2&gt;

&lt;p&gt;This paper proposes the CMatch algorithm for cross-domain speech recognition. Our main motivation is to match the character-level distributions of the source and target domains so that fine-grained character information can be exploited for better adaptation; experiments on cross-device and cross-environment ASR demonstrate the advantages of CMatch. In the future, we plan to experiment with more tasks and scenarios, such as dataset adaptation and speaker adaptation.&lt;/p&gt;

&lt;h3 id=&quot;references&quot;&gt;References&lt;/h3&gt;

&lt;p&gt;[1] Y. Zhu, F. Zhuang, J. Wang, G. Ke, J. Chen, J. Bian, H. Xiong, and Q. He, “Deep subdomain adaptation network for image classification,” IEEE transactions on neural networks and learning systems, 2020.&lt;/p&gt;</content><author><name>Wenxin Hou 侯汶昕</name><email>houwx001@gmail.com</email></author><category term="ASR" /><category term="Unsupervised Domain Adaptation" /><summary type="html">Last Updated: 2021-07-04</summary></entry><entry><title type="html">[Collection] Papers, Recipes, Toolkits and Interesting Posts</title><link href="https://houwx.net/posts/2020/01/blog-post-29/" rel="alternate" type="text/html" title="[Collection] Papers, Recipes, Toolkits and Interesting Posts" /><published>2020-12-10T00:00:00-08:00</published><updated>2020-12-10T00:00:00-08:00</updated><id>https://houwx.net/posts/2020/01/blog-post-29-reading-list</id><content type="html" xml:base="https://houwx.net/posts/2020/01/blog-post-29/">&lt;p&gt;Last Updated: 2020-12-10&lt;/p&gt;

&lt;p&gt;Zhihu: https://www.zhihu.com/people/liuchengwei/posts&lt;/p&gt;

&lt;p&gt;Tsinghua University speech group wiki: http://cslt.riit.tsinghua.edu.cn/mediawiki/index.php/Weekly_meeting&lt;/p&gt;

&lt;p&gt;NeurIPS2020 SAS workshop: https://neurips-sas-2020.github.io/#papers&lt;/p&gt;

&lt;p&gt;Machine Translation Reading List: https://github.com/THUNLP-MT/MT-Reading-List&lt;/p&gt;

&lt;h1 id=&quot;posts&quot;&gt;Posts&lt;/h1&gt;

&lt;h3 id=&quot;1-nlp中的少样本困境问题探究-chinese&quot;&gt;1. NLP中的少样本困境问题探究 &lt;a href=&quot;https://blog.csdn.net/xixiaoyaoww/article/details/106632300&quot;&gt;[Chinese]&lt;/a&gt;&lt;/h3&gt;

&lt;h3 id=&quot;2-探索孪生神经网络请停止你的梯度传递-chinese&quot;&gt;2. 探索孪生神经网络：请停止你的梯度传递！ &lt;a href=&quot;https://mp.weixin.qq.com/s?__biz=MjM5ODkzMzMwMQ==&amp;amp;mid=2650418672&amp;amp;idx=1&amp;amp;sn=5ed3ecc114277c9560c874c893cda0de&amp;amp;chksm=becdabaa89ba22bca575673024b026e11832b4f49ff1182c9bc209a8e0b5cc50440095eb533f&amp;amp;mpshare=1&amp;amp;scene=1&amp;amp;srcid=1206QhVjd5wTlHxfvflix4g2&amp;amp;sharer_sharetime=1607267785812&amp;amp;sharer_shareid=bfeb459fa3f43a270eff892a8e232c34&amp;amp;key=7537aef5c35d1c87f23ea654285eaa72e26b4b063c8f5d2b51d2de8d77500a17a997acdceeb884a0416ea3499cdc482ac23b902508ed88f29e1006afe9825b1672b82f92f7d02bb73417a953716f85684d4ca4326956eabbd7da85a256dec94d2b3b86e5d6140c6656d0b3cd97411ed63d6b515cdf673dfa5050bc322f5a7c64&amp;amp;ascene=1&amp;amp;uin=NjA0ODE1NTgx&amp;amp;devicetype=Windows+10+x64&amp;amp;version=6300002f&amp;amp;lang=zh_CN&amp;amp;exportkey=AS1effcPMoQ3h8eHHQqLCfI%3D&amp;amp;pass_ticket=mq7HGcsJzQ2p3817kGfQTfjWahWmdup879jk2Tc%2BIJUQg2j4seO3dTgNqJeR315Q&amp;amp;wx_header=0&quot;&gt;[Chinese]&lt;/a&gt;&lt;/h3&gt;

&lt;h3 id=&quot;3-更深的编码器更浅的解码器更快的自回归模型-chinese&quot;&gt;3. 更深的编码器+更浅的解码器=更快的自回归模型 &lt;a href=&quot;https://mp.weixin.qq.com/s?__biz=MzIwMTc4ODE0Mw==&amp;amp;mid=2247508286&amp;amp;idx=2&amp;amp;sn=1dc7eed239ef1a267465ff82c1dcc6a9&amp;amp;chksm=96ea7ebea19df7a8b905a573a8b0ce4cb116eda48ce8cd3a98f5b7b8104dd9356a48a836eb72&amp;amp;scene=132#wechat_redirect&quot;&gt;[Chinese]&lt;/a&gt;&lt;/h3&gt;

&lt;h1 id=&quot;toolkits&quot;&gt;Toolkits&lt;/h1&gt;

&lt;h3 id=&quot;1-pytorch-lightning-bolts-github&quot;&gt;1. PyTorch Lightning Bolts &lt;a href=&quot;https://github.com/PyTorchLightning/pytorch-lightning-bolts&quot;&gt;[GitHub]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Pretrained SOTA Deep Learning models, callbacks and more for research and production with PyTorch Lightning and PyTorch&lt;/li&gt;
  &lt;li&gt;Paper: A Framework For Contrastive Self-Supervised Learning And Designing A New Approach&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;2-s3prl-speech-toolkit-github&quot;&gt;2. S3PRL Speech Toolkit &lt;a href=&quot;https://github.com/andi611/Self-Supervised-Speech-Pretraining-and-Representation-Learning&quot;&gt;[GitHub]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Self-supervised pre-training and representation learning (S3PRL) of Mockingjay, TERA, A-ALBERT, APC, and more to come. With easy-to-use standard downstream evaluation scripts including  phone classification, speaker recognition, and ASR. (All in Pytorch!)&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;3-wavaugment-github&quot;&gt;3. WavAugment &lt;a href=&quot;https://github.com/facebookresearch/WavAugment&quot;&gt;[GitHub]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;WavAugment performs data augmentation on audio data. The audio data is represented as &lt;a href=&quot;https://pytorch.org/&quot;&gt;pytorch&lt;/a&gt; tensors&lt;/li&gt;
  &lt;li&gt;Augmentations include: pitch randomization, reverberation, additive noise, time dropout (temporal masking), band reject, clipping&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Data Augmenting Contrastive Learning of Speech Representations in the Time Domain&lt;/em&gt;, E. Kharitonov, M. Rivière, G. Synnaeve, L. Wolf, P.-E. Mazaré, M. Douze, E. Dupoux. [&lt;a href=&quot;https://arxiv.org/abs/2007.00991&quot;&gt;arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;4-sequence-to-sequence-g2p-toolkit-github&quot;&gt;4. Sequence-to-Sequence G2P Toolkit &lt;a href=&quot;https://github.com/cmusphinx/g2p-seq2seq&quot;&gt;[GitHub]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;The tool does Grapheme-to-Phoneme (G2P) conversion using transformer model from tensor2tensor toolkit based on TensorFlow&lt;/li&gt;
  &lt;li&gt;Lukasz Kaiser. “&lt;em&gt;Accelerating Deep Learning Research with the Tensor2Tensor Library&lt;/em&gt;.” In Google Research Blog, 2017&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;5-abkhazia-github&quot;&gt;5. Abkhazia &lt;a href=&quot;https://github.com/bootphon/abkhazia&quot;&gt;[GitHub]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Online documentation https://docs.cognitive-ml.fr/abkhazia&lt;/li&gt;
  &lt;li&gt;The Abkhazia project makes it easy to obtain simple baselines for supervised ASR (using &lt;a href=&quot;http://kaldi-asr.org&quot;&gt;Kaldi&lt;/a&gt;) and ABX tasks (using &lt;a href=&quot;https://github.com/bootphon/ABXpy&quot;&gt;ABXpy&lt;/a&gt;) on the large corpora of speech recordings typically used in speech engineering, linguistics or cognitive science research&lt;/li&gt;
&lt;/ol&gt;

&lt;h1 id=&quot;datasets&quot;&gt;Datasets&lt;/h1&gt;

&lt;h3 id=&quot;1-libri-adapt-a-new-speech-dataset-for-unsupervised-domain-adaptation-icassp2020-github&quot;&gt;1. Libri-Adapt: a New Speech Dataset for Unsupervised Domain Adaptation &lt;a href=&quot;https://ieeexplore.ieee.org/abstract/document/9053074?casa_token=ljl1YMO-v1sAAAAA:hk53vVqAO9dg8PEJP_qoH982_2bwRF-Otqq9nxP_vJV8yOhBe0b0Vf8Hp0n3TvHE83sjWuPCmw&quot;&gt;[ICASSP2020]&lt;/a&gt; &lt;a href=&quot;https://github.com/akhilmathurs/libriadapt&quot;&gt;[GitHub]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;7200 hours of English speech recorded on mobile and embedded scale microphones, 72 different domains (6 microphones x 3 accents x 4 environments x 100-hour Librispeech corpus) sampled at 16KHz&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;2-mls-a-large-scale-multilingual-dataset-for-speech-research-interspeech2020-newer-arxiv-version-recipepretrained-models-openslr&quot;&gt;2. MLS: A Large-Scale Multilingual Dataset for Speech Research &lt;a href=&quot;https://isca-speech.org/archive/Interspeech_2020/abstracts/2826.html&quot;&gt;[INTERSPEECH2020]&lt;/a&gt; (&lt;a href=&quot;https://arxiv.org/abs/2012.03411&quot;&gt;Newer arxiv version&lt;/a&gt;) &lt;a href=&quot;https://github.com/facebookresearch/wav2letter&quot;&gt;[Recipe/Pretrained Models]&lt;/a&gt; &lt;a href=&quot;http://www.openslr.org/&quot;&gt;[OpenSLR]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Multilingual LibriSpeech (MLS) dataset: derived from LibriVox, 50.5K hours, 8 languages (44.5K hours of English and a total of about 6K hours for other languages)&lt;/li&gt;
  &lt;li&gt;Languages covered: English, German, Dutch, French, Spanish, Italian, Portuguese, Polish&lt;/li&gt;
&lt;/ol&gt;

&lt;h1 id=&quot;end-to-end-asr&quot;&gt;End-to-End ASR&lt;/h1&gt;

&lt;h3 id=&quot;1-listen-attentively-and-spell-once-whole-sentence-generation-via-a-non-autoregressive-architecture-for-low-latency-speech-recognition-interspeech2020&quot;&gt;1. Listen Attentively, and Spell Once: Whole Sentence Generation via a Non-Autoregressive Architecture for Low-Latency Speech Recognition &lt;a href=&quot;https://www.isca-speech.org/archive/Interspeech_2020/abstracts/1600.html&quot;&gt;[INTERSPEECH2020]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Introduce non-autoregressive ASR system LASO (Listen Attentively, and Spell Once)&lt;/li&gt;
  &lt;li&gt;CER 6.4% (SOTA autoregressive Transformer model 6.7%) on AISHELL-1&lt;/li&gt;
  &lt;li&gt;21 ms average inference latency, 1/50 that of the autoregressive Transformer model&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;2-independent-language-modeling-architecture-for-end-to-end-asr-icassp2020&quot;&gt;2. Independent Language Modeling Architecture for End-To-End ASR &lt;a href=&quot;https://arxiv.org/abs/1912.00863&quot;&gt;[ICASSP2020]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Introduce independent language modeling subnet to leverage external text data&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Existing method: Replace encoding with an all-zero vector and freeze the encoder&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;3-conformer-convolution-augmented-transformer-for-speech-recognition-interspeech2020-paper-reading-slide&quot;&gt;3. Conformer: Convolution-augmented Transformer for Speech Recognition &lt;a href=&quot;https://arxiv.org/abs/2005.08100&quot;&gt;[INTERSPEECH2020]&lt;/a&gt; &lt;a href=&quot;https://houwx.net/files/slides/2020_interspeech_conformer.pdf&quot;&gt;[Paper Reading Slide]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Combine CNNs and Transformers to model both local and global dependencies in a parameter-efficient way&lt;/li&gt;
  &lt;li&gt;LibriSpeech WER of 2.1/4.3 without using LM and 1.9/3.9 with an external LM on test clean/other, 2.7/6.3 with a 10M-parameter small model&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;4-exploring-transformers-for-large-scale-speech-recognition-interspeech2020&quot;&gt;4. Exploring Transformers for Large-Scale Speech Recognition &lt;a href=&quot;https://www.isca-speech.org/archive/Interspeech_2020/abstracts/2638.html&quot;&gt;[INTERSPEECH2020]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Depth-scale model initialization to accelerate convergence&lt;/li&gt;
  &lt;li&gt;Pre-LayerNorm instead of Post-LayerNorm to accelerate convergence&lt;/li&gt;
  &lt;li&gt;Chunk-based Transformer-XL for streaming ASR (low computation + GPU memory-saving)&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;5-distilling-the-knowledge-of-bert-for-sequence-to-sequence-asr-interspeech2020-github&quot;&gt;5. Distilling the Knowledge of BERT for Sequence-to-Sequence ASR &lt;a href=&quot;https://www.isca-speech.org/archive/Interspeech_2020/abstracts/1179.html&quot;&gt;[INTERSPEECH2020]&lt;/a&gt; &lt;a href=&quot;https://github.com/hfutami/distill-bert-for-seq2seq-asr&quot;&gt;[GitHub]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Use BERT to generate soft labels by masking and predicting target words for the training of seq2seq ASR&lt;/li&gt;
  &lt;li&gt;Concatenate multiple utterances together to a fixed size for BERT prediction to make pre-training and distillation consistent and improve WER&lt;/li&gt;
&lt;/ol&gt;

&lt;h1 id=&quot;multilingual&quot;&gt;Multilingual&lt;/h1&gt;

&lt;h2 id=&quot;a-asr&quot;&gt;A. ASR&lt;/h2&gt;

&lt;h3 id=&quot;1-massively-multilingual-adversarial-speech-recognition-naacl-hlt2019&quot;&gt;1. Massively Multilingual Adversarial Speech Recognition &lt;a href=&quot;https://arxiv.org/abs/1904.02210&quot;&gt;[NAACL-HLT2019]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Analyze the relative importance of similarity between the target and pre-training languages along the dimensions of phonetics, phonology, language family, geographical location, and orthography&lt;/li&gt;
  &lt;li&gt;Investigate 2 additional objectives for hybrid CTC/Attention architecture: phoneme CTC and language-adversarial during pre-training&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;2-learning-robust-and-multilingual-speech-representations-findings-of-emnlp2020&quot;&gt;2. Learning Robust and Multilingual Speech Representations &lt;a href=&quot;https://www.aclweb.org/anthology/2020.findings-emnlp.106/&quot;&gt;[Findings of EMNLP2020]&lt;/a&gt;&lt;/h3&gt;

&lt;h3 id=&quot;3-language-agnostic-multilingual-modeling-icassp2020&quot;&gt;3. Language-agnostic Multilingual Modeling &lt;a href=&quot;https://arxiv.org/abs/2004.09571&quot;&gt;[ICASSP2020]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Propose a language-agnostic multilingual ASR system by transforming all languages to one writing system through a many-to-one transliteration transducer&lt;/li&gt;
  &lt;li&gt;Obtain 10% relative WER reduction on 4 Indic languages&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;6-end-to-end-multilingual-speech-recognition-system-with-language-supervision-training-ieicetrans2020&quot;&gt;6. End-to-End Multilingual Speech Recognition System with Language Supervision Training &lt;a href=&quot;https://www.jstage.jst.go.jp/article/transinf/E103.D/6/E103.D_2019EDL8214/_article&quot;&gt;[IEICETrans.2020]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Propose a Language Masks Estimation method to constrain the output distribution&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;7-towards-language-universal-mandarin-english-speech-recognition-interspeech2019&quot;&gt;7. Towards Language-Universal Mandarin-English Speech Recognition &lt;a href=&quot;https://www.isca-speech.org/archive/Interspeech_2019/abstracts/1365.html&quot;&gt;[INTERSPEECH2019]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Propose to combine two monolingual models to build a bilingual model&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;8-multilingual-speech-recognition-with-a-single-end-to-end-model-icassp2018&quot;&gt;8. Multilingual Speech Recognition With A Single End-To-End Model &lt;a href=&quot;https://arxiv.org/abs/1711.01694&quot;&gt;[ICASSP2018]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Investigate the joint model, LID multi-task learning, and language-embedding-conditioned cases on 9 Indian languages&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;9-towards-language-universal-end-to-end-speech-recognition-icassp2018&quot;&gt;9. Towards Language-Universal End-to-End Speech Recognition &lt;a href=&quot;https://arxiv.org/abs/1711.02207&quot;&gt;[ICASSP2018]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Explore language-specific/universal output layer&lt;/li&gt;
  &lt;li&gt;Propose language-specific gating units in hidden layers&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;10-large-scale-end-to-end-multilingual-speech-recognition-and-language-identification-with-multi-task-learning-interspeech2020-slide-recipe&quot;&gt;10. Large-Scale End-to-End Multilingual Speech Recognition and Language Identification with Multi-Task Learning &lt;a href=&quot;https://www.isca-speech.org/archive/Interspeech_2020/abstracts/2164.html&quot;&gt;[INTERSPEECH2020]&lt;/a&gt; &lt;a href=&quot;https://houwx.net/files/slides/2020_interspeech_lid-42.pdf&quot;&gt;[Slide]&lt;/a&gt; &lt;a href=&quot;https://github.com/espnet/espnet/tree/master/egs/li42/asr1&quot;&gt;[Recipe]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Present LID-42: very-large-scale Transformer-based hybrid CTC/Attention models based on subwords / characters for 42-lingual ASR, average CER: 27.8/27.2&lt;/li&gt;
  &lt;li&gt;Language-independent architecture with shared vocabulary including language IDs for joint language identification, LID accuracy: 93.5/94.0&lt;/li&gt;
  &lt;li&gt;Relative improvements of 28.1% in WER by transfer learning to low-resource languages&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;11-leveraging-language-id-in-multilingual-end-to-end-speech-recognition-asru2019&quot;&gt;11. Leveraging Language ID in Multilingual End-to-End Speech Recognition &lt;a href=&quot;https://ieeexplore.ieee.org/abstract/document/9003870&quot;&gt;[ASRU2019]&lt;/a&gt;&lt;/h3&gt;

&lt;h3 id=&quot;12-investigating-end-to-end-speech-recognition-for-mandarin-english-code-switching-icassp2019&quot;&gt;12. Investigating End-to-end Speech Recognition for Mandarin-english Code-switching &lt;a href=&quot;http://lxie.npu-aslp.org/papers/2019ICASSP-ChanghaoShan-CS.pdf&quot;&gt;[ICASSP2019]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Introduce MTL where at each time step, the model predicts both the modeling unit and the language ID&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;13-meta-learning-for-end-to-end-low-resource-speech-recognition-icassp2020&quot;&gt;13. Meta Learning for End-to-End Low-Resource Speech Recognition &lt;a href=&quot;https://arxiv.org/abs/1910.12094&quot;&gt;[ICASSP2020]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Apply model-agnostic meta-learning (MAML) to pre-train a CTC multilingual model and transfer to low-resource languages with language-specific head&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;14-language-adaptive-multilingual-ctc-speech-recognition&quot;&gt;14. Language Adaptive Multilingual CTC Speech Recognition&lt;/h3&gt;

&lt;h3 id=&quot;15-multilingual-speech-recognition-with-corpus-relatedness-sampling&quot;&gt;15. Multilingual Speech Recognition with Corpus Relatedness Sampling&lt;/h3&gt;

&lt;h3 id=&quot;16-adversarial-multilingual-training-for-low-resource-speech-recognition-icassp2018&quot;&gt;16. Adversarial Multilingual Training for Low-Resource Speech Recognition &lt;a href=&quot;http://159.226.21.132/file/2018_Speech%20Recognition_ICASSP_EI-Jiangyan%20Yi.pdf&quot;&gt;[ICASSP2018]&lt;/a&gt;&lt;/h3&gt;

&lt;h3 id=&quot;17-bytes-are-all-you-need-end-to-end-multilingual-speech-recognition-and-synthesis-with-bytes-icassp2019&quot;&gt;17. Bytes are All You Need: End-to-End Multilingual Speech Recognition and Synthesis with Bytes &lt;a href=&quot;https://arxiv.org/abs/1811.09021&quot;&gt;[ICASSP2019]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Propose Audio-to-Byte (A2B) and Byte-to-Audio (B2A) models for multilingual ASR and TTS&lt;/li&gt;
  &lt;li&gt;ASR model: LAS, TTS model: Tacotron 2. Input/Output layer: 256 possible byte values.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;18-an-investigation-of-deep-neural-networks-for-multilingual-speech-recognition-training-and-adaptation&quot;&gt;18. An Investigation of Deep Neural Networks for Multilingual Speech Recognition Training and Adaptation&lt;/h3&gt;

&lt;h3 id=&quot;19-bootstrap-an-end-to-end-asr-system-by-multilingual-training-transfer-learning-text-to-text-mapping-and-synthetic-audio-arxiv2020&quot;&gt;19. Bootstrap an End-to-End ASR System by Multilingual Training, Transfer Learning, Text-to-Text Mapping and Synthetic Audio &lt;a href=&quot;https://arxiv.org/abs/2011.12696&quot;&gt;[Arxiv2020]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Demonstrate that post-ASR text-to-text mapping and synthetic TTS data can be effectively combined with approaches such as multilingual training and transfer learning to improve a simulated Italian ASR bootstrapping scenario&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;20-adversarial-meta-sampling-for-multilingual-low-resource-speech-recognition-aaai2021-github&quot;&gt;20. Adversarial Meta Sampling for Multilingual Low-Resource Speech Recognition &lt;a href=&quot;https://arxiv.org/abs/2012.11896&quot;&gt;[AAAI2021]&lt;/a&gt; &lt;a href=&quot;https://github.com/iamxiaoyubei/AMS&quot;&gt;[GitHub]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Propose an Adversarial Meta-Sampling (AMS) approach for multilingual meta-learning ASR (MML-ASR) and multilingual transfer-learning ASR (MTL-ASR)&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;b-stt&quot;&gt;B. TTS&lt;/h2&gt;

&lt;h3 id=&quot;1-phonological-features-for-0-shot-multilingual-speech-synthesis-interspeech2020&quot;&gt;1. Phonological Features for 0-shot Multilingual Speech Synthesis &lt;a href=&quot;https://www.isca-speech.org/archive/Interspeech_2020/abstracts/1821.html&quot;&gt;[INTERSPEECH2020]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Utilize binary phonological features&lt;/li&gt;
  &lt;li&gt;Mapping tables for phoneme to IPA, as well as an IPA-PF lookup dictionary are available at https://github.com/papercup-open-source/phonological-features&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;c-speech-translation&quot;&gt;C. Speech Translation&lt;/h2&gt;

&lt;h3 id=&quot;1-effectively-pretraining-a-speech-translation-decoder-with-machine-translation-data-emnlp2020&quot;&gt;1. Effectively Pretraining a Speech Translation Decoder with Machine Translation Data &lt;a href=&quot;https://www.aclweb.org/anthology/2020.emnlp-main.644.pdf&quot;&gt;[EMNLP2020]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Propose to use an adversarial discriminator to train NMT and ASR systems simultaneously by aligning their encodings in the same latent space –&amp;gt; both the pre-trained ASR encoder and the NMT decoder can be used to improve AST&lt;/li&gt;
  &lt;li&gt;1.5 BLEU improvement on En-De and En-Fr compared with conventional pretraining methods (ASR encoder only / ASR encoder + NMT decoder pre-trained separately)&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;d-analysis&quot;&gt;D. Analysis&lt;/h2&gt;

&lt;h3 id=&quot;1-automatically-identifying-language-family-from-acoustic-examples-in-low-resource-scenarios-arxiv2020&quot;&gt;1. Automatically Identifying Language Family from Acoustic Examples in Low Resource Scenarios &lt;a href=&quot;https://arxiv.org/abs/2012.00876&quot;&gt;[Arxiv2020]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Train a LID model on the Wilderness dataset and analyze the learned embeddings by comparing with classical language family findings (Ethnologue, Glottolog, Wikipedia)&lt;/li&gt;
  &lt;li&gt;Show that languages grouped by learned embeddings perform better than distance-based or phoneme-based approaches on zero-shot TTS&lt;/li&gt;
&lt;/ol&gt;

&lt;h1 id=&quot;curriculum-learning&quot;&gt;Curriculum Learning&lt;/h1&gt;

&lt;h3 id=&quot;1-when-do-curricula-work-iclr2021-submitted&quot;&gt;1. When Do Curricula Work? &lt;a href=&quot;https://openreview.net/forum?id=tW4QEInpni&quot;&gt;[ICLR2021 (submitted)]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Investigate the &lt;strong&gt;implicit curricula&lt;/strong&gt; resulting from architectural and optimization bias and find that samples are learned in a highly consistent order&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Conduct extensive experiments over thousands of orderings spanning three kinds of curricula (curriculum, anti-curriculum, and random curriculum) and find that any benefit is entirely due to the dynamically growing training-set size rather than the order of examples&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Curriculum learning, but not anti-curriculum or random ordering, can indeed improve performance either with a limited training-time budget or in the presence of noisy data&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;h1 id=&quot;semi-supervised-learning&quot;&gt;Semi-Supervised Learning&lt;/h1&gt;

&lt;h3 id=&quot;1-deep-contextualized-acoustic-representations-for-semi-supervised-speech-recognition&quot;&gt;1. Deep Contextualized Acoustic Representations For Semi-Supervised Speech Recognition&lt;/h3&gt;

&lt;h3 id=&quot;2-semi-supervised-development-of-asr-systems-for-multilingual-code-switched-speech-in-under-resourced-languages&quot;&gt;2. Semi-supervised Development of ASR Systems for Multilingual Code-switched Speech in Under-resourced Languages&lt;/h3&gt;

&lt;h3 id=&quot;3-semi-supervised-end-to-end-speech-recognition-interspeech2018-github&quot;&gt;3. Semi-Supervised End-to-End Speech Recognition &lt;a href=&quot;https://isca-speech.org/archive/Interspeech_2018/abstracts/1746.html&quot;&gt;[INTERSPEECH2018]&lt;/a&gt; &lt;a href=&quot;https://github.com/ShigekiKarita/espnet-semi-supervised&quot;&gt;[GitHub]&lt;/a&gt;&lt;/h3&gt;

&lt;h3 id=&quot;4-self-training-for-end-to-end-speech-recognition-icassp2020-recipe&quot;&gt;4. Self-Training for End-to-End Speech Recognition &lt;a href=&quot;https://arxiv.org/abs/1909.09116&quot;&gt;[ICASSP2020]&lt;/a&gt; &lt;a href=&quot;https://github.com/facebookresearch/wav2letter/tree/master/recipes/self_training&quot;&gt;[Recipe]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Train a base AM on limited paired data and an LM on large-scale unpaired text data, then use beam search to generate pseudo-labels for unlabelled speech data&lt;/li&gt;
  &lt;li&gt;Filtering: a) n-grams repeated more than c times (looping); b) hypotheses with an EOS probability below a threshold (early stopping) or with no EOS generated; c) length-normalized log-likelihood as the confidence score for quality ranking: $\text{ConfidenceScore}(\hat{Y_i})=\frac{\log P_{AM}(\hat{Y_i}|X_i)}{|\hat{Y_i}|}$ (see the sketch after this list)&lt;/li&gt;
  &lt;li&gt;Propose sample ensemble: Combine pseudo samples generated by M models and average their loss during optimization&lt;/li&gt;
&lt;/ol&gt;
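
&lt;p&gt;A minimal sketch of the filtering and confidence-ranking step above, assuming each hypothesis is a dict with illustrative keys &lt;code&gt;tokens&lt;/code&gt;, &lt;code&gt;log_prob_sum&lt;/code&gt; and &lt;code&gt;eos_generated&lt;/code&gt; produced by beam search; the thresholds and n-gram length are placeholders, not values from the paper:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;
def confidence_score(log_prob_sum, tokens):
    # length-normalized log-likelihood: log P_AM(Y|X) / |Y|
    return log_prob_sum / max(len(tokens), 1)

def has_ngram_loop(tokens, n=3, max_repeat=2):
    # a) reject hypotheses in which some n-gram repeats more than max_repeat times (looping)
    counts = {}
    for i in range(len(tokens) - n + 1):
        ngram = tuple(tokens[i:i + n])
        counts[ngram] = counts.get(ngram, 0) + 1
    return any(c &amp;gt; max_repeat for c in counts.values())

def filter_pseudo_labels(hypotheses, score_threshold=-1.0):
    kept = []
    for hyp in hypotheses:
        if not hyp['eos_generated']:       # b) early-stopped or no EOS generated
            continue
        if has_ngram_loop(hyp['tokens']):
            continue
        score = confidence_score(hyp['log_prob_sum'], hyp['tokens'])
        if score &amp;gt;= score_threshold:     # c) keep only confident hypotheses
            kept.append((score, hyp))
    kept.sort(key=lambda x: x[0], reverse=True)   # rank the survivors by confidence
    return kept
&lt;/code&gt;&lt;/pre&gt;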

&lt;h3 id=&quot;5-end-to-end-asr-from-supervised-to-semi-supervised-learning-with-modern-architectures-sas-workshopicml2020-recipe&quot;&gt;5. End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures &lt;a href=&quot;https://arxiv.org/abs/1911.08460&quot;&gt;[SAS Workshop@ICML2020]&lt;/a&gt; &lt;a href=&quot;https://github.com/facebookresearch/wav2letter/tree/master/recipes/sota/2019&quot;&gt;[Recipe]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Train an AM + LM on Librispeech (960 hours) to generate pseudo-labels for LibriVox (53.8k hours) –&amp;gt; as shown in the paper, around 10k hours of pseudo-labelled audio already yields promising results&lt;/li&gt;
  &lt;li&gt;E2E AMs are implicit LMs, thus with enough unlabeled audio, decoding with an external LM doesn’t improve performance&lt;/li&gt;
  &lt;li&gt;On Librispeech: achieve 2.27/4.8 WER on test-clean/other sets without LM, 2.09/4.11 with LM&lt;/li&gt;
&lt;/ol&gt;

&lt;h1 id=&quot;self-supervised-learning&quot;&gt;Self-Supervised Learning&lt;/h1&gt;

&lt;h3 id=&quot;1-unsupervised-pretraining-transfers-well-across-languages-icassp2020-github&quot;&gt;1. Unsupervised Pretraining Transfers Well Across Languages &lt;a href=&quot;https://arxiv.org/abs/2002.02848&quot;&gt;[ICASSP2020]&lt;/a&gt; &lt;a href=&quot;https://github.com/facebookresearch/CPC_audio&quot;&gt;[GitHub]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Investigate CPC for cross-lingual tasks and evaluate the linear separability of the learned phoneme representations (Common Voice / LibriSpeech phoneme classification, Zerospeech2017)&lt;/li&gt;
  &lt;li&gt;Introduce two modifications to improve CPC: a) replace batch normalization with channel-wise normalization to avoid information leakage and stabilize training; b) replace the linear classifier with a 1-layer Transformer to make the future prediction target more reasonable&lt;/li&gt;
  &lt;li&gt;PER of modified CPC pretrained on LS-360 is comparable to the supervised model pretrained on LS-100&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;2-wav2vec-20-a-framework-for-self-supervised-learning-of-speech-representations-neurips2020-paper-reading-slide&quot;&gt;2. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations &lt;a href=&quot;https://arxiv.org/abs/2006.11477&quot;&gt;[NeurIPS2020]&lt;/a&gt; &lt;a href=&quot;https://houwx.net/files/slides/2020_neurips_wav2vec2.0.pdf&quot;&gt;[Paper Reading Slide]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Self-supervised learning by masking + contrastive loss + quantized representations&lt;/li&gt;
  &lt;li&gt;On Librispeech: achieve 1.8/3.3 WER on the clean/other test sets, and 4.8/8.2 WER with only 10 minutes of labeled data and 53k hours of unlabeled data. All experiments use an external LM&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;3-unsupervised-cross-lingual-representation-learning-for-speech-recognition-arxiv2020&quot;&gt;3. Unsupervised Cross-lingual Representation Learning for Speech Recognition &lt;a href=&quot;https://arxiv.org/abs/2006.13979&quot;&gt;[Arxiv2020]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Investigate self-supervised pre-training (wav2vec 2.0) for multilingual / cross-lingual ASR&lt;/li&gt;
  &lt;li&gt;Publish XLSR-53, a large-scale multilingual wav2vec 2.0 pre-trained on 56K-hour combined corpora of Common Voice (38 languages) + BABEL (14 languages) + Multilingual LibriSpeech (MLS) (8 languages)&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;4-unsupervised-learning-of-disentangled-representations-for-speech-with-neural-variational-inference-models-mit-phd-dissertation&quot;&gt;4. Unsupervised Learning of Disentangled Representations for Speech with Neural Variational Inference Models &lt;a href=&quot;https://groups.csail.mit.edu/sls/publications/2018/Wei-NingHsu_MS-Thesis.pdf&quot;&gt;[MIT PhD Dissertation]&lt;/a&gt;&lt;/h3&gt;

&lt;h3 id=&quot;5-a-further-study-of-unsupervised-pre-training-for-transformer-based-speech-recognition&quot;&gt;5. A Further Study of Unsupervised Pre-training for Transformer Based Speech Recognition&lt;/h3&gt;

&lt;h3 id=&quot;6-leveraging-text-data-using-hybrid-transformer-lstm-based-end-to-end-asr-in-transfer-learning&quot;&gt;6. Leveraging Text Data Using Hybrid Transformer-LSTM Based End-to-End ASR in Transfer Learning&lt;/h3&gt;

&lt;h3 id=&quot;7-self-training-and-pre-training-are-complementary-for-speech-recognition-arxiv2020&quot;&gt;7. Self-training and Pre-training are Complementary for Speech Recognition &lt;a href=&quot;https://arxiv.org/abs/2010.11430&quot;&gt;[Arxiv2020]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Combine self-training (paper 5 in semi-supervised learning section) and unsupervised pre-training (wav2vec 2.0) to achieve new SOTA&lt;/li&gt;
  &lt;li&gt;Use pre-trained wav2vec 2.0 for pseudo-labelling, then fine-tune wav2vec 2.0 (CTC) or train randomly-initialized Transformer model (S2S) as final model&lt;/li&gt;
  &lt;li&gt;On Librispeech: achieve 1.8/3.3 (CTC) or 1.5/3.1 (S2S) WER on the clean/other test sets, 3.0/5.2 (CTC) or 3.1/5.4 (S2S) WER on 10 minutes of labeled data and 53k hours of unlabeled data. All experiments are with external LM&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;8-vector-quantized-autoregressive-predictive-coding-interspeech2020-github&quot;&gt;8. Vector-Quantized Autoregressive Predictive Coding &lt;a href=&quot;https://www.isca-speech.org/archive/Interspeech_2020/abstracts/1228.html&quot;&gt;[INTERSPEECH2020]&lt;/a&gt; &lt;a href=&quot;https://github.com/iamyuanchung/VQ-APC&quot;&gt;[GitHub]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Introduce one or more vector quantization layers to the APC model to explicitly control the amount of encoded information&lt;/li&gt;
  &lt;li&gt;Probing tasks show that the APC model prefers to retain speaker information over phonetic information when the capacity is limited&lt;/li&gt;
  &lt;li&gt;When the phonetic information is present, the learned VQ codes correspond well with English phones&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;9-speech-xlnet-unsupervised-acoustic-model-pretraining-for-self-attention-networks-interspeech2020&quot;&gt;9. Speech-XLNet: Unsupervised Acoustic Model Pretraining for Self-Attention Networks &lt;a href=&quot;https://www.isca-speech.org/archive/Interspeech_2020/abstracts/1511.html&quot;&gt;[INTERSPEECH2020]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Apply XLNet (see paper 13 in the NLP section) to ASR tasks (without the segment recurrence mechanism)&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;10-mockingjay-unsupervised-speech-representation-learning-with-deep-bidirectional-transformer-encoders-icassp2020-github&quot;&gt;10. Mockingjay: Unsupervised Speech Representation Learning with Deep Bidirectional Transformer Encoders &lt;a href=&quot;https://arxiv.org/abs/1910.12638&quot;&gt;[ICASSP2020]&lt;/a&gt; &lt;a href=&quot;https://github.com/andi611/Self-Supervised-Speech-Pretraining-and-Representation-Learning&quot;&gt;[GitHub]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Apply BERT to speech with the proposed consecutive masking (mask C consecutive frames to 0; see the sketch after this list) and evaluate on phoneme classification, sentiment classification and speaker recognition&lt;/li&gt;
&lt;/ol&gt;
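
&lt;p&gt;A small sketch of the consecutive-masking idea, assuming torch tensors of shape (batch, time, dim); the span length and masking ratio below are illustrative, and only the zero-masking case is shown:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;
import torch

def mask_consecutive_frames(features, span=7, mask_ratio=0.15):
    # features: (batch, time, dim) acoustic frames; returns a masked copy and the boolean mask
    batch, time, _ = features.shape
    masked = features.clone()
    mask = torch.zeros(batch, time, dtype=torch.bool)
    num_spans = max(1, int(time * mask_ratio / span))
    for b in range(batch):
        for _ in range(num_spans):
            start = torch.randint(0, max(1, time - span), (1,)).item()
            mask[b, start:start + span] = True   # select C consecutive frames
    masked[mask] = 0.0                           # zero them out for reconstruction training
    return masked, mask

feats = torch.randn(2, 100, 80)
masked_feats, mask = mask_consecutive_frames(feats)
&lt;/code&gt;&lt;/pre&gt;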

&lt;h3 id=&quot;11-tera-self-supervised-learning-of-transformer-encoder-representation-for-speech-taslp2020-submitted-s3prl&quot;&gt;11. TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech &lt;a href=&quot;https://arxiv.org/abs/2007.06028&quot;&gt;[TASLP2020 (submitted)]&lt;/a&gt; &lt;a href=&quot;https://github.com/andi611/Self-Supervised-Speech-Pretraining-and-Representation-Learning&quot;&gt;[S3PRL]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Propose TERA (Transformer Encoder Representations from Alteration)&lt;/li&gt;
  &lt;li&gt;Multi-target 3 auxiliary L1 reconstruction objectives: time (time masking, Mockingjay) / channel (frequency masking) / magnitude (+ Gaussian noise) alteration&lt;/li&gt;
  &lt;li&gt;Comprehensive analysis on its application to ASR (representation / fine-tune) / phoneme classification / speaker classification&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;12-unsupervised-pre-training-of-bidirectional-speech-encoders-via-masked-reconstruction-icassp2020&quot;&gt;12. Unsupervised Pre-training of Bidirectional Speech Encoders via Masked Reconstruction &lt;a href=&quot;https://arxiv.org/abs/2001.10603&quot;&gt;[ICASSP2020]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Inspired by BERT, pre-train bidirectional RNNs via a masked reconstruction loss to improve ASR&lt;/li&gt;
  &lt;li&gt;Inspired by SpecAugment, mask segments of sufficient width in both time and frequency&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;13-improving-transformer-based-speech-recognition-using-unsupervised-pre-training-arxiv2019&quot;&gt;13. Improving Transformer-based Speech Recognition Using Unsupervised Pre-training &lt;a href=&quot;https://arxiv.org/abs/1910.09932&quot;&gt;[Arxiv2019]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Almost the same as paper 9, except that it is evaluated on the HKUST and AISHELL Mandarin ASR datasets&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;14-a-simple-framework-for-contrastive-learning-of-visual-representations-icml2020-github&quot;&gt;14. A Simple Framework for Contrastive Learning of Visual Representations &lt;a href=&quot;https://arxiv.org/abs/2002.05709&quot;&gt;[ICML2020]&lt;/a&gt; &lt;a href=&quot;https://github.com/google-research/simclr&quot;&gt;[GitHub]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Propose the SimCLR framework: input image $x$ –&amp;gt; two data augmentations giving $x_i$, $x_j$ –&amp;gt; same feature extractor –&amp;gt; $h_i$, $h_j$ –&amp;gt; same two-layer projection DNN –&amp;gt; $z_i$, $z_j$ –&amp;gt; contrastive loss (positive samples are $(i, j)$ and $(j, i)$; negative samples are the other $2(N-1)$ augmented examples); see the sketch after this list&lt;/li&gt;
  &lt;li&gt;Important factors: composing multiple data augmentation operations, larger batch sizes and longer training, a normalized temperature in the contrastive cross-entropy, and a nonlinear transformation between the representation and the contrastive loss&lt;/li&gt;
&lt;/ol&gt;
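
&lt;p&gt;A minimal sketch of the contrastive loss described above for a batch of N image pairs, where each row treats its augmented counterpart as the positive and the remaining 2(N-1) examples as negatives; the temperature value is illustrative:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;
import torch
import torch.nn.functional as F

def nt_xent_loss(z_i, z_j, temperature=0.5):
    # z_i, z_j: (N, d) projections of two augmented views of the same N images
    n = z_i.size(0)
    z = F.normalize(torch.cat([z_i, z_j], dim=0), dim=1)     # (2N, d)
    sim = torch.matmul(z, z.t()) / temperature                # pairwise cosine similarities
    sim.fill_diagonal_(float('-inf'))                         # exclude self-similarity
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])  # index of each row's positive
    return F.cross_entropy(sim, targets)

loss = nt_xent_loss(torch.randn(8, 128), torch.randn(8, 128))
&lt;/code&gt;&lt;/pre&gt;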

&lt;h3 id=&quot;15-speech-simclr-combining-contrastive-and-reconstruction-objective-for-self-supervised-speech-representation-learning-arxiv2020-github&quot;&gt;15. Speech SimCLR: Combining Contrastive and Reconstruction Objective for Self-supervised Speech Representation Learning &lt;a href=&quot;https://arxiv.org/abs/2010.13991&quot;&gt;[Arxiv2020]&lt;/a&gt; &lt;a href=&quot;https://github.com/athena-team/athena/tree/simclr&quot;&gt;[GitHub]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Data augmentation: random pitch shift,  speed perturbation, room reverberation and additive noise to the waveform; time and frequency masking to the spectrogram implemented with &lt;a href=&quot;https://github.com/facebookresearch/WavAugment&quot;&gt;WavAugment toolkit&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Final loss = reconstruction loss (TERA) + contrastive loss (SimCLR)&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;16-bootstrap-your-own-latent-a-new-approach-to-self-supervised-learning-arxiv2020&quot;&gt;16. Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning &lt;a href=&quot;http://arxiv.org/abs/2006.07733&quot;&gt;[Arxiv2020]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Propose BYOL, which trains an online network to predict a slowly updated target network’s representation of another augmented view of the same image, without using negative pairs&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;17-the-zero-resource-speech-benchmark-2021-metrics-and-baselines-for-unsupervised-spoken-language-modeling-neurips2020-sas&quot;&gt;17. The Zero Resource Speech Benchmark 2021: Metrics and Baselines for Unsupervised Spoken Language Modeling &lt;a href=&quot;https://arxiv.org/abs/2011.11588&quot;&gt;[NeurIPS2020 SAS]&lt;/a&gt;&lt;/h3&gt;


&lt;h3 id=&quot;18-multi-format-contrastive-learning-of-audio-representations-neurips2020-sas&quot;&gt;18. Multi-Format Contrastive Learning of Audio Representations &lt;a href=&quot;https://drive.google.com/file/d/1PfzgtuCU36Wd2Fi3TCwKWeabbRaav64X/view?usp=sharing&quot;&gt;[NeurIPS2020 SAS]&lt;/a&gt;&lt;/h3&gt;

&lt;h3 id=&quot;19-similarity-analysis-of-self-supervised-speech-representations-neurips2020-sas&quot;&gt;19. Similarity Analysis of Self-Supervised Speech Representations &lt;a href=&quot;https://arxiv.org/abs/2010.11481&quot;&gt;[NeurIPS2020 SAS]&lt;/a&gt;&lt;/h3&gt;


&lt;h1 id=&quot;transfer-learning--domain-adaptation&quot;&gt;Transfer Learning / Domain Adaptation&lt;/h1&gt;

&lt;h3 id=&quot;1-co-tuning-for-transfer-learning--neurips2020-github&quot;&gt;1. Co-Tuning for Transfer Learning  &lt;a href=&quot;https://proceedings.neurips.cc//paper/2020/hash/c8067ad1937f728f51288b3eb986afaa-Abstract.html&quot;&gt;[NeurIPS2020]&lt;/a&gt; &lt;a href=&quot;https://github.com/thuml/CoTuning&quot;&gt;[GitHub]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Propose Co-Tuning to fully transfer pre-trained models by utilizing pretraining-task-specific parameters&lt;/li&gt;
  &lt;li&gt;Loss of Co-Tuning: $\text{CE}(\text{prediction of target head}, y_t) + \lambda \cdot \text{CE}(\text{prediction of pretrained head}, p(y_s|y_t))$ (see the sketch after this list)&lt;/li&gt;
  &lt;li&gt;Propose two category relationship learning approaches to translate target labels into probabilistic source labels&lt;/li&gt;
&lt;/ol&gt;
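
&lt;p&gt;A small sketch of the two-term objective above, assuming the category relationship $p(y_s|y_t)$ has already been estimated as a (target classes, source classes) matrix; all variable names are illustrative:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;
import torch
import torch.nn.functional as F

def co_tuning_loss(target_logits, source_logits, target_labels, p_source_given_target, lam=1.0):
    # target_logits: (B, C_t) predictions of the new target head
    # source_logits: (B, C_s) predictions of the retained pre-trained head
    # p_source_given_target: (C_t, C_s) learned category relationship, rows sum to 1
    ce_target = F.cross_entropy(target_logits, target_labels)
    soft_source_labels = p_source_given_target[target_labels]        # (B, C_s) translated labels
    log_probs = F.log_softmax(source_logits, dim=1)
    ce_source = -(soft_source_labels * log_probs).sum(dim=1).mean()  # cross-entropy with soft labels
    return ce_target + lam * ce_source
&lt;/code&gt;&lt;/pre&gt;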

&lt;h3 id=&quot;2-self-training-for-few-shot-transfer-across-extreme-task-differences-iclr2021-submitted&quot;&gt;2. Self-training For Few-shot Transfer Across Extreme Task Differences &lt;a href=&quot;https://openreview.net/forum?id=O3Y56aqpChA&quot;&gt;[ICLR2021 (submitted)]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Propose “Self Training to Adapt Representations To Unseen Problems (STARTUP)”&lt;/li&gt;
  &lt;li&gt;Three stages to learn representations: train a teacher model on the base dataset –&amp;gt; construct a softly-labeled set on the target unlabeled set –&amp;gt; train the student model&lt;/li&gt;
  &lt;li&gt;Loss for training student model: CE loss on base dataset + KL divergence on softly-labeled set + self-supervised loss on unlabeled set (SimCLR in this paper)&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;3-adaptation-algorithms-for-speech-recognition-an-overview-ieee-ojsp2021-submitted&quot;&gt;3. Adaptation Algorithms for Speech Recognition: An Overview &lt;a href=&quot;https://arxiv.org/abs/2008.06580&quot;&gt;[IEEE OJSP2021 (submitted)]&lt;/a&gt;&lt;/h3&gt;

&lt;h3 id=&quot;4-unsupervised-domain-adaptation-for-speech-recognition-via-uncertainty-driven-self-training-icassp2021-submitted&quot;&gt;4. Unsupervised Domain Adaptation for Speech Recognition via Uncertainty Driven Self-Training &lt;a href=&quot;https://arxiv.org/abs/2011.13439&quot;&gt;[ICASSP2021 (submitted)]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Propose Dropout-based Uncertainty-driven Self-Training (DUST), which filters uncertain predictions out of the pseudo-labelled samples (see the sketch after this list)&lt;/li&gt;
  &lt;li&gt;Confidence is computed from the largest edit distance between the model output without dropout and the outputs with dropout under different random seeds&lt;/li&gt;
&lt;/ol&gt;
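
&lt;p&gt;A rough sketch of this disagreement-based filter, assuming a hypothetical &lt;code&gt;decode(model, x, dropout, seed)&lt;/code&gt; helper that returns a token sequence; the agreement threshold is illustrative:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;
def edit_distance(a, b):
    # standard Levenshtein distance between two token sequences
    dp = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(b) + 1):
            cur = min(dp[j] + 1, dp[j - 1] + 1, prev + (a[i - 1] != b[j - 1]))
            prev, dp[j] = dp[j], cur
    return dp[-1]

def dust_keep(model, x, decode, num_dropout_samples=3, threshold=0.1):
    # decode(model, x, dropout, seed) is a hypothetical stand-in for the ASR decoder
    reference = decode(model, x, dropout=False, seed=0)
    worst = 0.0
    for seed in range(num_dropout_samples):
        hyp = decode(model, x, dropout=True, seed=seed)
        worst = max(worst, edit_distance(reference, hyp) / max(len(reference), 1))
    # keep the pseudo-label only if the largest normalized disagreement stays small
    return worst &amp;lt;= threshold, reference
&lt;/code&gt;&lt;/pre&gt;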

&lt;h3 id=&quot;5-supervised-contrastive-learning-for-pre-trained-language-model-fine-tuning-iclr2021-submitted&quot;&gt;5. Supervised Contrastive Learning for Pre-trained Language Model Fine-tuning &lt;a href=&quot;https://arxiv.org/abs/2011.01403&quot;&gt;[ICLR2021 (submitted)]&lt;/a&gt;&lt;/h3&gt;


&lt;h3 id=&quot;6-domain-adaptation-using-class-similarity-for-robust-speech-recognition-interspeech2020&quot;&gt;6. Domain Adaptation Using Class Similarity for Robust Speech Recognition &lt;a href=&quot;https://arxiv.org/abs/2011.02782&quot;&gt;[INTERSPEECH2020]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Propose a novel adaptation method composed of two stages: a) train the source model and compute mean soft labels of every class over the source samples; b) use the soft labels as a regularization term when training the target model on target-domain data&lt;/li&gt;
  &lt;li&gt;Experiment on a) accent adaptation (Common Voice English) and b) noise adaptation (CHiME-3). The proposed method is more robust and performs even better on the latter task&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;7-unks-everywhere-adapting-multilingual-language-models-to-new-scripts-arxiv2020&quot;&gt;7. UNKs Everywhere: Adapting Multilingual Language Models to New Scripts &lt;a href=&quot;https://arxiv.org/abs/2012.15562&quot;&gt;[Arxiv2020]&lt;/a&gt;&lt;/h3&gt;


&lt;h1 id=&quot;nlp-nmt-plm&quot;&gt;NLP (NMT, PLM)&lt;/h1&gt;

&lt;h3 id=&quot;1-cross-lingual-language-model-pretraining-neurips2019&quot;&gt;1. Cross-lingual Language Model Pretraining &lt;a href=&quot;https://papers.nips.cc/paper/2019/file/c04c19c2c2474dbf5f7ac4372c5b9af1-Paper.pdf&quot;&gt;[NeurIPS2019]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Investigate the cross-lingual language model (XLM) pretrained with Causal LM (classic LM), Masked LM (unsupervised) and Translation LM (supervised) objectives&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;2-unicoder-a-universal-language-encoder-by-pre-training-with-multiple-cross-lingual-tasks-emnlp2019&quot;&gt;2. Unicoder: A Universal Language Encoder by Pre-training with Multiple Cross-lingual Tasks &lt;a href=&quot;https://arxiv.org/abs/1909.00964&quot;&gt;[EMNLP2019]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Based on 2 tasks (MLM, TLM) in XLM, further introduces 3 new cross-lingual pretraining tasks: Cross-lingual Word Recovery (based on attention), Cross-lingual Paraphrase Classification (similar to Next Sentence Prediction but predicting meaning), Cross-lingual MLM (code-switch sentences)&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;3-learning-deep-transformer-models-for-machine-translation-acl2019&quot;&gt;3. Learning Deep Transformer Models for Machine Translation &lt;a href=&quot;https://arxiv.org/abs/1906.01787&quot;&gt;[ACL2019]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Study the effects of Pre-Norm and Post-Norm in deep Transformers; the gradients of Post-Norm pose a higher risk of vanishing or exploding&lt;/li&gt;
  &lt;li&gt;Propose Dynamic Linear Combination of Layers (DLCL) to memorize the features extracted from all preceding layers&lt;/li&gt;
  &lt;li&gt;Successfully train a 30-layer Transformer encoder (the deepest reported at the time) and a 6-layer decoder for NMT&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;4-massively-multilingual-sentence-embeddings-for-zero-shot-cross-lingual-transfer-and-beyond-tacl2019&quot;&gt;4. Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond &lt;a href=&quot;https://arxiv.org/abs/1812.10464&quot;&gt;[TACL2019]&lt;/a&gt;&lt;/h3&gt;


&lt;h3 id=&quot;5-a-study-of-cross-lingual-ability-and-language-specific-information-in-multilingual-bert&quot;&gt;5. A Study of Cross-Lingual Ability and Language-specific Information in Multilingual BERT&lt;/h3&gt;

&lt;h3 id=&quot;6-zero-shot-reading-comprehension-by-cross-lingual-transfer-learning-with-multi-lingual-language-representation-model&quot;&gt;6. Zero-shot Reading Comprehension by Cross-lingual Transfer Learning with Multi-lingual Language Representation Model&lt;/h3&gt;

&lt;h3 id=&quot;7-are-all-languages-created-equal-in-multilingual-bert&quot;&gt;7. Are All Languages Created Equal in Multilingual BERT?&lt;/h3&gt;

&lt;h3 id=&quot;8-multilingual-neural-machine-translation-with-language-clustering&quot;&gt;8. Multilingual Neural Machine Translation with Language Clustering&lt;/h3&gt;

&lt;h3 id=&quot;9-multilingual-unsupervised-nmt-using-shared-encoder-and-language-specific-decoders&quot;&gt;9. Multilingual Unsupervised NMT using Shared Encoder and Language-Specific Decoders&lt;/h3&gt;

&lt;h3 id=&quot;10-multilingual-nmt-with-a-language-independent-attention-bridge-httpsgithubcomhelsinki-nlpopennmt-pytreeatt-brg&quot;&gt;10. Multilingual NMT with a Language-Independent Attention Bridge &lt;a href=&quot;https://github.com/Helsinki-NLP/OpenNMT-py/tree/att-brg&quot;&gt;[GitHub]&lt;/a&gt;&lt;/h3&gt;

&lt;h3 id=&quot;11-cross-lingual-spoken-language-understanding-with-regularized-representation-alignment&quot;&gt;11. Cross-lingual Spoken Language Understanding with Regularized Representation Alignment&lt;/h3&gt;

&lt;h3 id=&quot;12-transformer-xl-attentive-language-models-beyond-a-fixed-length-context-acl2019-github&quot;&gt;12. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context &lt;a href=&quot;https://arxiv.org/abs/1901.02860&quot;&gt;[ACL2019]&lt;/a&gt; &lt;a href=&quot;https://github.com/kimiyoung/transformer-xl&quot;&gt;[GitHub]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Propose Transformer-XL (extra long) to capture longer-term dependency and resolve the context fragmentation problem&lt;/li&gt;
  &lt;li&gt;Introduce segment-level recurrence mechanism to reuse previous segments as context&lt;/li&gt;
  &lt;li&gt;Introduce relative positional encodings for the proposed recurrence mechanism&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;13-xlnet-generalized-autoregressive-pretraining-for-language-understanding-neurips2019-github-mask分析&quot;&gt;13. XLNet: Generalized Autoregressive Pretraining for Language Understanding &lt;a href=&quot;https://arxiv.org/abs/1906.08237&quot;&gt;[NeurIPS2019]&lt;/a&gt; &lt;a href=&quot;https://github.com/zihangdai/xlnet&quot;&gt;[GitHub]&lt;/a&gt; &lt;a href=&quot;https://mp.weixin.qq.com/s?__biz=MzIwMTc4ODE0Mw==&amp;amp;mid=2247514606&amp;amp;idx=2&amp;amp;sn=bf45236870c8e04ddd0bda6a8ab43337&amp;amp;chksm=96ea666ea19def7834e45e0b1aca52dc324ba1cbc42aee7d712e40fae346e8bc3c73b8992d2c&amp;amp;scene=126&amp;amp;sessionid=1606654762&amp;amp;key=81b7b09de2e501a29d938e698bd05dd4b729c8c891b4e013ba75be9fa8ea638a6ddeb88736ba9e0976afcb18fda158f33671e81cdc7186a69e73286a25ce4857a06ad7e427bd941fc425400f8caf7099c0df19da6085f209e1561ee69b296a016d46254042c45f07a57b58ea62b3b9f902605653249bb4e29239b1d0ecac040d&amp;amp;ascene=1&amp;amp;uin=NjA0ODE1NTgx&amp;amp;devicetype=Windows+10+x64&amp;amp;version=6300002f&amp;amp;lang=zh_CN&amp;amp;exportkey=Aa4iNLil%2FDw6FReJr1imHpY%3D&amp;amp;pass_ticket=FT97m6EOkfyNaHm4C46Lu8nVrxTVO%2BCKeGa%2F3JbVyQIB0jydRsW%2BwAL2s%2FUeZ6hD&amp;amp;wx_header=0&quot;&gt;[Mask分析]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Introduce a permutation language modeling (PLM) pre-training objective&lt;/li&gt;
  &lt;li&gt;Two-stream self-attention + partial prediction (only predict the last tokens in a factorization order)&lt;/li&gt;
  &lt;li&gt;Integrate Transformer-XL (relative positional encoding + segment recurrence mechanism)&lt;/li&gt;
&lt;/ol&gt;

&lt;h1 id=&quot;multi-modal&quot;&gt;Multi-Modal&lt;/h1&gt;

&lt;h3 id=&quot;1-speech-to-text-adaptation-towards-an-efficient-cross-modal-distillation&quot;&gt;1. Speech to Text Adaptation: Towards an Efficient Cross-Modal Distillation&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Knowledge distillation from BERT to ASR pretrained module for Spoken Language Understanding&lt;/li&gt;
&lt;/ol&gt;

&lt;h1 id=&quot;deep-learning-architecture&quot;&gt;Deep Learning Architecture&lt;/h1&gt;

&lt;h3 id=&quot;1-pay-less-attention-with-lightweight-and-dynamic-convolutions-iclr2019&quot;&gt;1. Pay Less Attention with Lightweight and Dynamic Convolutions &lt;a href=&quot;https://openreview.net/forum?id=SkVhlh09tX&quot;&gt;[ICLR2019]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Depth-wise convolution: convolution performed independently on every channel&lt;/li&gt;
  &lt;li&gt;Lightweight convolution (depth-wise + weight sharing across channel groups + GLU); a simplified sketch follows this list&lt;/li&gt;
  &lt;li&gt;Dynamic convolution: compute the convolution weights dynamically from the input $X$&lt;/li&gt;
&lt;/ol&gt;
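
&lt;p&gt;A simplified sketch of a lightweight convolution layer (depth-wise, with softmax-normalized kernels shared across groups of channels); the GLU, weight dropout and the dynamic variant are omitted, and all shapes and hyper-parameters are illustrative:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightweightConv1d(nn.Module):
    # Depth-wise 1-D convolution whose softmax-normalized kernels are shared
    # across groups of channels (a simplified sketch of the lightweight variant).
    def __init__(self, channels, kernel_size=3, num_heads=4):
        super().__init__()
        assert channels % num_heads == 0
        self.num_heads = num_heads
        self.kernel_size = kernel_size
        self.weight = nn.Parameter(torch.randn(num_heads, 1, kernel_size))

    def forward(self, x):
        # x: (batch, channels, time)
        b, c, t = x.shape
        w = F.softmax(self.weight, dim=-1)                   # normalize each kernel over its taps
        w = w.repeat_interleave(c // self.num_heads, dim=0)  # (channels, 1, kernel_size)
        return F.conv1d(x, w, padding=self.kernel_size // 2, groups=c)

y = LightweightConv1d(8)(torch.randn(2, 8, 16))   # output keeps the input shape
&lt;/code&gt;&lt;/pre&gt;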

&lt;h3 id=&quot;2-dynamic-convolution-attention-over-convolution-kernels-cvpr2020&quot;&gt;2. Dynamic Convolution: Attention over Convolution Kernels &lt;a href=&quot;https://arxiv.org/abs/1912.03458&quot;&gt;[CVPR2020]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Aggregate multiple convolution kernels dynamically based on their attention weights&lt;/li&gt;
  &lt;li&gt;Tricks: constrain the attention weights to sum to 1 + near-uniform attention in early epochs (softmax with a large temperature)&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;3-rethinking-the-value-of-transformer-components-coling2020&quot;&gt;3. Rethinking the Value of Transformer Components &lt;a href=&quot;https://arxiv.org/abs/2011.03803&quot;&gt;[COLING2020]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;For the decoder, self-attention is the least important and the FFN is the most important; higher encoder-attention layers (closer to the output layer) matter more than lower ones&lt;/li&gt;
  &lt;li&gt;For encoder, the lower components (self-attention, FFN) are more important.&lt;/li&gt;
  &lt;li&gt;Two methods to improve Transformer NMT: a) Prune unimportant components and retrain the model. b) Rewind unimportant components and finetune the model&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;4-understanding-the-difficulty-of-training-transformers-emnlp2020&quot;&gt;4. Understanding the Difficulty of Training Transformers &lt;a href=&quot;https://arxiv.org/abs/2004.08249&quot;&gt;[EMNLP2020]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Strong dependencies of Post-Norm amplify fluctuations brought by parameter changes and destabilize the training&lt;/li&gt;
  &lt;li&gt;The loose reliance on residual branches in Pre-Norm generally limits the model’s potential and often produces models inferior to Post-Norm&lt;/li&gt;
  &lt;li&gt;Propose Admin: an &lt;strong&gt;Ad&lt;/strong&gt;aptive &lt;strong&gt;m&lt;/strong&gt;odel &lt;strong&gt;in&lt;/strong&gt;itialization method to stabilize the early stage of training&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;5-made-masked-autoencoder-for-distribution-estimation-icml2015&quot;&gt;5. MADE: Masked Autoencoder for Distribution Estimation &lt;a href=&quot;https://arxiv.org/abs/1502.03509&quot;&gt;[ICML2015]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Modify the autoencoder to make it autoregressive by using masks to change the hidden-layer connectivity structure&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;6-lite-transformer&quot;&gt;6. Lite Transformer&lt;/h3&gt;

&lt;h3 id=&quot;7-object-centric-learning-with-slot-attention&quot;&gt;7. Object-Centric Learning with Slot Attention&lt;/h3&gt;

&lt;h3 id=&quot;8-deep-learning-recommendation-model-for-personalization-and-recommendation-systems&quot;&gt;8. Deep Learning Recommendation Model for Personalization and Recommendation Systems&lt;/h3&gt;

&lt;h3 id=&quot;9-multi-head-attentioncollaborate-instead-of-concatenat&quot;&gt;9. Multi-Head Attention: Collaborate Instead of Concatenate&lt;/h3&gt;

&lt;h1 id=&quot;meta-learning&quot;&gt;Meta-Learning&lt;/h1&gt;

&lt;h3 id=&quot;1-on-first-order-meta-learning-algorithms&quot;&gt;1. On First-Order Meta-Learning Algorithms&lt;/h3&gt;

&lt;h3 id=&quot;2-reptile--a-scalable-metalearning-algorithm&quot;&gt;2. Reptile:  a Scalable Metalearning Algorithm&lt;/h3&gt;

&lt;h3 id=&quot;3-rapid-learning-or-feature-reuse-towards-understanding-the-effectiveness-of-maml-iclr2020&quot;&gt;3. Rapid Learning or Feature Reuse? Towards Understanding the Effectiveness of MAML &lt;a href=&quot;https://openreview.net/forum?id=rkgMkCEtPB&quot;&gt;[ICLR2020]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Question: is the effectiveness of MAML due to the meta-initialization being primed for rapid learning (large, efficient changes in the representations) or due to feature reuse, with the meta-initialization already containing high quality features? Answer: feature reuse is the dominant factor&lt;/li&gt;
  &lt;li&gt;Propose Almost No Inner Loop (ANIL) algorithm, a competitive simplification of MAML by removing the inner loop for all but the (task-specific) head of the underlying neural network&lt;/li&gt;
  &lt;li&gt;Propose No Inner Loop (NIL) algorithm to classify the test sample based on cosine similarities of penultimate layer representations with the k labelled examples (support set)&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;4-convergence-of-meta-learning-with-task-specific-adaptation-over-partial-parameters&quot;&gt;4. Convergence of Meta-Learning with Task-Specific Adaptation over Partial Parameters&lt;/h3&gt;

&lt;h1 id=&quot;others&quot;&gt;Others&lt;/h1&gt;

&lt;h3 id=&quot;1-scaling-hidden-markov-language-models--emnlp2020&quot;&gt;1. Scaling Hidden Markov Language Models  &lt;a href=&quot;https://arxiv.org/abs/2011.04640&quot;&gt;[EMNLP2020]&lt;/a&gt;&lt;/h3&gt;

&lt;h3 id=&quot;2-revisit-knowledge-distillation-a-teacher-free-framework&quot;&gt;2. Revisit Knowledge Distillation: A Teacher-free Framework&lt;/h3&gt;

&lt;h3 id=&quot;3-pegasus-pre-training-with-extracted-gap-sentences-forabstractive-summarization&quot;&gt;3. PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization&lt;/h3&gt;

&lt;h3 id=&quot;4-adapterfusionnon-destructive-task-composition-for-transfer-learning&quot;&gt;4. AdapterFusion: Non-Destructive Task Composition for Transfer Learning&lt;/h3&gt;

&lt;h3 id=&quot;5-improving-massively-multilingual-neural-machine-translation-and-zero-shot-translation&quot;&gt;5. Improving Massively Multilingual Neural Machine Translation and Zero-Shot Translation&lt;/h3&gt;

&lt;h3 id=&quot;6-dynamic-fusion-network-for-multi-domain-end-to-end-task-oriented-dialog&quot;&gt;6. Dynamic Fusion Network for Multi-Domain End-to-end Task-Oriented Dialog&lt;/h3&gt;

&lt;h3 id=&quot;7-multi-source-domain-adaptation-with-mixture-of-experts&quot;&gt;7. Multi-Source Domain Adaptation with Mixture of Experts&lt;/h3&gt;

&lt;h3 id=&quot;8-meta-learning-for-few-shot-nmt-adaptation&quot;&gt;8. Meta-Learning for Few-Shot NMT Adaptation&lt;/h3&gt;

&lt;h3 id=&quot;8-simple-scalable-adaptation-for-neural-machine-translation&quot;&gt;9. Simple, Scalable Adaptation for Neural Machine Translation&lt;/h3&gt;

&lt;h3 id=&quot;9-parameter-efficient-transfer-learning-for-nlp&quot;&gt;10. Parameter-Efficient Transfer Learning for NLP&lt;/h3&gt;

&lt;h3 id=&quot;11-improving-target-side-lexical-transfer-in-multilingual-neural-machine-translation&quot;&gt;11. Improving Target-side Lexical Transfer in Multilingual Neural Machine Translation&lt;/h3&gt;

&lt;h3 id=&quot;12-large-memory-layers-with-product-keys&quot;&gt;12. Large Memory Layers with Product Keys&lt;/h3&gt;

&lt;h3 id=&quot;13-an-analysis-of-massively-multilingual-neural-machine-translation-forlow-resource-languages&quot;&gt;13. An Analysis of Massively Multilingual Neural Machine Translation forLow-Resource Languages&lt;/h3&gt;

&lt;h3 id=&quot;14-large-product-key-memory-for-pretrained-language-models&quot;&gt;14. Large Product Key Memory for Pretrained Language Models&lt;/h3&gt;

&lt;h3 id=&quot;15-contextual-parameter-generation-for-universal-neural-machine-translation&quot;&gt;15. Contextual Parameter Generation for Universal Neural Machine Translation&lt;/h3&gt;

&lt;h3 id=&quot;16-multilingual-speech-recognition-with-self-attention-structured-parameterization&quot;&gt;16. Multilingual Speech Recognition with Self-Attention Structured Parameterization&lt;/h3&gt;

&lt;h3 id=&quot;17-experience-grounds-language-emnlp2020&quot;&gt;17. Experience Grounds Language &lt;a href=&quot;https://www.aclweb.org/anthology/2020.emnlp-main.703/&quot;&gt;[EMNLP2020]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Propose the notion of a World Scope (WS) as a lens to audit progress in NLP&lt;/li&gt;
  &lt;li&gt;Five levels of WS: Corpus, Internet, Perception, Embodiment, Social&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;18-monolingual-adapters-for-zero-shot-neural-machine-translation-emnlp2020&quot;&gt;18. Monolingual Adapters for Zero-Shot Neural Machine Translation &lt;a href=&quot;https://www.aclweb.org/anthology/2020.emnlp-main.361/&quot;&gt;[EMNLP2020]&lt;/a&gt;&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Propose language-specific adapter layers, which require only $2n$ adapters ($O(n)$) instead of the $n\cdot (n-1)$ ($O(n^2)$) needed for bilingual adapters, and which enable combining any encoder adapter with any decoder adapter&lt;/li&gt;
&lt;/ol&gt;</content><author><name>Wenxin Hou 侯汶昕</name><email>houwx001@gmail.com</email></author><category term="Deep Learning" /><category term="ASR" /><category term="Meta-Learning" /><category term="Curriculum Learning" /><category term="Multilingual" /><category term="NLP" /><summary type="html">Last Updated: 2020-12-10</summary></entry><entry><title type="html">PyTorch模型部署踩坑记录</title><link href="https://houwx.net/posts/2020/01/blog-post-27/" rel="alternate" type="text/html" title="PyTorch模型部署踩坑记录" /><published>2020-08-14T00:00:00-07:00</published><updated>2020-08-14T00:00:00-07:00</updated><id>https://houwx.net/posts/2020/01/blog-post-27-%E6%A8%A1%E5%9E%8B%E9%83%A8%E7%BD%B2%E8%B8%A9%E5%9D%91%E8%AE%B0%E5%BD%95</id><content type="html" xml:base="https://houwx.net/posts/2020/01/blog-post-27/">&lt;p&gt;Last Updated: 2020-08-14&lt;/p&gt;


&lt;h2 id=&quot;0-基础知识&quot;&gt;0. Background&lt;/h2&gt;

&lt;ol&gt;
  &lt;li&gt;ONNX is a common exchange format for neural networks that unifies models written in different frameworks (PyTorch, TensorFlow, etc.) to simplify testing and deployment; currently ONNX only supports inference. https://github.com/onnx/onnx&lt;/li&gt;
  &lt;li&gt;Exporting PyTorch models to ONNX: https://pytorch.org/docs/stable/onnx.html (a minimal export example follows this list)&lt;/li&gt;
  &lt;li&gt;volksdep, an open-source library built on top of torch.onnx: https://github.com/Media-Smart/volksdep&lt;/li&gt;
  &lt;li&gt;Related reading: has ONNX, the model exchange format open-sourced for over a year, already unified the framework landscape? https://zhuanlan.zhihu.com/p/51387600&lt;/li&gt;
&lt;/ol&gt;
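
&lt;p&gt;A minimal export example using the official torch.onnx API; the toy model, output file name, opset version and axis names below are illustrative:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 10))
model.eval()

dummy_input = torch.randn(1, 80)
torch.onnx.export(
    model,
    dummy_input,
    'model.onnx',                  # output file
    opset_version=11,
    input_names=['feats'],
    output_names=['logits'],
    dynamic_axes={'feats': {0: 'batch'}, 'logits': {0: 'batch'}},  # allow a variable batch size
)
&lt;/code&gt;&lt;/pre&gt;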

&lt;h1 id=&quot;1-pytorch模型转onnx&quot;&gt;1. Converting a PyTorch Model to ONNX&lt;/h1&gt;

&lt;h3 id=&quot;11-部分operator不支持问题&quot;&gt;1.1. Unsupported Operators&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. torch.tril: RuntimeError: Exporting operator tril to ONNX opset version 11 is not supported.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;: temporarily fall back to the NumPy tril function, i.e. torch.tril() –&amp;gt; torch.from_numpy(np.tril(...)); the same applies to triu. A sketch follows below.&lt;/p&gt;
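
&lt;p&gt;A small sketch of this workaround, building the lower-triangular mask with NumPy so that tracing never calls the unsupported operator:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;
import numpy as np
import torch

def subsequent_mask(size):
    # torch.tril is not exportable for this opset, so build the mask with NumPy instead
    mask = np.tril(np.ones((size, size), dtype=np.float32))
    return torch.from_numpy(mask)

print(subsequent_mask(4))   # lower-triangular attention mask
&lt;/code&gt;&lt;/pre&gt;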

&lt;p&gt;&lt;strong&gt;2. KLDivLoss:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;: support was only recently added in the latest PyTorch; copying the code from this PR into the corresponding file resolves the issue: https://github.com/pytorch/pytorch/pull/41858/files&lt;/p&gt;

&lt;h3 id=&quot;12-tracerwarning问题&quot;&gt;1.2. TracerWarning Issues&lt;/h3&gt;

&lt;p&gt;TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect.&lt;/p&gt;

&lt;h4 id=&quot;原因以及解决方案&quot;&gt;Causes and Solutions&lt;/h4&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;if/else statements.&lt;/strong&gt; If the network structure defined in PyTorch is too flexible, the export to ONNX is likely to go wrong. This warning usually means that the network contains if/else statements. (https://blog.csdn.net/Einstellung/article/details/105886873)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Parameter overwriting (in-place modification / computation).&lt;/strong&gt; For example, the in-place update p[:, 0:2] = torch.sigmoid(p[:, 0:2]); assigning the result to a temporary variable instead solves it. (https://blog.csdn.net/weixin_39908946/article/details/106855482)&lt;/li&gt;
&lt;/ol&gt;</content><author><name>Wenxin Hou 侯汶昕</name><email>houwx001@gmail.com</email></author><category term="NLP" /><summary type="html">Last Updated: 2020-08-14</summary></entry><entry><title type="html">[Paper-NLP] Meta-Learning for Low-Resource Neural Machine Translation</title><link href="https://houwx.net/posts/2020/01/blog-post-25/" rel="alternate" type="text/html" title="[Paper-NLP] Meta-Learning for Low-Resource Neural Machine Translation" /><published>2020-08-11T00:00:00-07:00</published><updated>2020-08-11T00:00:00-07:00</updated><id>https://houwx.net/posts/2020/01/blog-post-25-maml-nmt</id><content type="html" xml:base="https://houwx.net/posts/2020/01/blog-post-25/">&lt;p&gt;Last Updated: 2020-08-12&lt;/p&gt;

&lt;p&gt;This paper: &lt;a href=&quot;https://arxiv.org/pdf/1808.08437.pdf&quot;&gt;Meta-Learning for Low-Resource Neural Machine Translation&lt;/a&gt; is published at EMNLP 2018. Authors include Jiatao Gu, Yong Wang, Yun Chen, Kyunghyun Cho and Victor O.K.Li from The University of Hong Kong and New York University.&lt;/p&gt;

&lt;p&gt;This paper introduces the model-agnostic meta-learning (MAML) algorithm to the low-resource neural machine translation (NMT) task. The proposed approach significantly outperforms the multilingual, transfer-learning-based method.&lt;/p&gt;

&lt;h1 id=&quot;1-introduction&quot;&gt;1. Introduction&lt;/h1&gt;

&lt;p&gt;To address the problem of low-resource language pairs, various approaches have been presented including:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Utilizing monolingual corpora (multi-task learning, back-translation, dual learning, unsupervised machine translation with monolingual corpora for both sides)&lt;/li&gt;
  &lt;li&gt;Exploiting knowledge from high-resource language pairs (auxiliary translations/tasks, &lt;strong&gt;multilingual translation&lt;/strong&gt;, universal lexical representation)&lt;/li&gt;
  &lt;li&gt;Pre-training the NMT model on a high-resource language pair and transferring it to the target low-resource pair&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The authors follow up on the latest multilingual NMT approaches and &lt;strong&gt;introduce MAML to low-resource NMT&lt;/strong&gt; by regarding different language pairs as separate tasks. They further &lt;strong&gt;incorporate the universal lexical representation&lt;/strong&gt; to overcome the problem that vanilla MAML cannot handle mismatched input and output spaces.&lt;/p&gt;

&lt;h1 id=&quot;2-background&quot;&gt;2. Background&lt;/h1&gt;

&lt;h4 id=&quot;neural-machine-translation-nmt&quot;&gt;Neural Machine Translation (NMT)&lt;/h4&gt;

&lt;p&gt;Given a source sentence:&lt;/p&gt;

\[X=\{x_1, ..., x_{T&apos;}\}.\]

&lt;p&gt;and a target sentence $Y$, the NMT task is to model:&lt;/p&gt;

\[p(Y|X;\theta)=\prod_{t=1}^{T+1}p(y_t|y_{0:t-1}, x_{1:T&apos;};\theta).\]

&lt;h3 id=&quot;meta-learning&quot;&gt;Meta Learning&lt;/h3&gt;

&lt;p&gt;Meta-learning aims to solve the problem of “fast adaptation on new training data”; there are two categories of meta-learning:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Learning a meta-policy for updating model parameters&lt;/li&gt;
  &lt;li&gt;Learning a good parameter initialization for fast adaptation&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Model-agnostic meta-learning (MAML) belongs to the second category.&lt;/p&gt;

&lt;h1 id=&quot;3-meta-learning-for-low-resource-nmt&quot;&gt;3. Meta Learning for Low-Resource NMT&lt;/h1&gt;

&lt;p&gt;MAML aims to find a proper initialization of parameters $\theta^0$ based on a set of tasks ${T^1, T^2, …, T^K}$ so that the model can learn a new target task $T^0$ with only a small amount of training data.&lt;/p&gt;

&lt;p&gt;The process can be understood as:&lt;/p&gt;

\[\theta^*=\text{Learn}(T^0; \text{MetaLearn}(T^1, ..., T^K)).\]

&lt;p&gt;In the context of NMT, each language pair is regarded as a different task. &lt;strong&gt;The objective is to find an initialization from high-resource language-pairs to fast adapt the model to low-resource pairs.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The overall illustration is shown in Figure 1.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/houwx.net/posts/2020-08-11-blog-post-25-1.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;


&lt;h2 id=&quot;31-learn-language-specific-learning&quot;&gt;3.1. Learn: language-specific learning&lt;/h2&gt;

&lt;p&gt;The language-specific learning process $\text{Learn}(D_T;\theta^0)$ is formulated to maximize the log-posterior given data $D_T$ and randomly initialized or meta-learned parameters $\theta^0$ :&lt;/p&gt;

\[\text{Learn}(D_T;\theta^0)=\text{argmax}_\theta \mathcal{L}^{D_T}(\theta)=\text{argmax}_\theta \sum_{(X, Y)\in D_T}\log p(Y|X, \theta)-\beta||\theta -\theta^0||^2,\]

&lt;p&gt;note that &lt;strong&gt;the second term is used to discourage the newly learned parameters from deviating too much from the initialization $\theta^0$, alleviating the overfitting issue&lt;/strong&gt;.&lt;/p&gt;

&lt;h2 id=&quot;32-metalearn&quot;&gt;3.2. MetaLearn&lt;/h2&gt;

&lt;p&gt;The meta-objective to find the initialization $\theta^0$ is given by:&lt;/p&gt;

\[\mathcal{L}(\theta)=\mathbb{E}_k\mathbb{E}_{D_{T^k}, D&apos;_{T^k}}\left[ \sum_{(X, Y)\in D&apos;_{T^k}}\log p(Y|X;Learn(D_{T^k}; \theta))\right],\]

&lt;p&gt;where $k\sim\mathcal{U}({1, …, K})$ refers to the $k$-th meta-learning episode. For each episode, task $T^k$ is uniformly chosen at random. &lt;strong&gt;$D_{T^k}$ and $D’_{T^k}$ are subsets of training examples for learning and evaluating, respectively.&lt;/strong&gt; They are sampled independently from the chosen task $T^k$.&lt;/p&gt;

&lt;p&gt;In learning process, the model parameters are updated by:&lt;/p&gt;

\[\theta_k&apos;=\text{Learn}(D_{T^k};\theta)=\theta-\eta \nabla_\theta \mathcal{L}^{D_{T^k}}(\theta),\]

&lt;p&gt;note that this update is not actually applied to the meta-model $\theta$; it is only a simulated (inner-loop) step.&lt;/p&gt;

&lt;p&gt;By applying the updated parameters $\theta&apos;_k$ to the evaluation set $D&apos;_{T^k}$, &lt;strong&gt;the meta-model $\theta$ is updated with the meta-gradient computed on the evaluation set.&lt;/strong&gt; As shown in the formula below, it is possible to aggregate multiple episodes before updating:&lt;/p&gt;

\[\theta \leftarrow \theta-\eta&apos; \sum_k \nabla_\theta \mathcal{L}^{D&apos;_{T^k}}(\theta&apos;_k),\]

&lt;p&gt;where $\eta’$ is the meta-learning rate.&lt;/p&gt;
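
&lt;p&gt;A minimal first-order sketch of one meta-learning episode (sample a task, simulate the inner Learn step on $D_{T^k}$, then update the meta-parameters with gradients from $D&apos;_{T^k}$), assuming a hypothetical &lt;code&gt;loss_fn(model, batch)&lt;/code&gt; that returns the NMT loss; the proximal regularization term is omitted and the simplified first-order rule described in the next subsection is used in place of the full meta-gradient:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;
import copy
import random
import torch

def meta_episode(meta_model, tasks, loss_fn, inner_lr=1e-3, meta_lr=1e-4, inner_steps=1):
    # tasks: list of (train_batches, eval_batches), one entry per source language pair
    train_batches, eval_batches = random.choice(tasks)

    # Learn: simulate language-specific training starting from the meta-parameters
    learner = copy.deepcopy(meta_model)
    inner_opt = torch.optim.SGD(learner.parameters(), lr=inner_lr)
    for step in range(inner_steps):
        inner_opt.zero_grad()
        loss_fn(learner, train_batches[step % len(train_batches)]).backward()
        inner_opt.step()

    # MetaLearn: evaluate the adapted learner on held-out data D' and use its
    # gradients as first-order meta-gradients for the meta-parameters
    learner.zero_grad()
    loss_fn(learner, random.choice(eval_batches)).backward()
    with torch.no_grad():
        for meta_p, p in zip(meta_model.parameters(), learner.parameters()):
            if p.grad is not None:
                meta_p -= meta_lr * p.grad
&lt;/code&gt;&lt;/pre&gt;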

&lt;h3 id=&quot;meta-gradient&quot;&gt;Meta-Gradient&lt;/h3&gt;

&lt;p&gt;Based on the property below:&lt;/p&gt;

\[H(x)v \approx \frac{\nabla (x+uv)-\nabla (x)}{u},\]

&lt;p&gt;meta-gradient is approximated as follows:&lt;/p&gt;

\[\nabla_\theta \mathcal{L}^{D&apos;}(\theta&apos;)=\nabla_{\theta&apos;} \mathcal{L}^{D&apos;}(\theta&apos;) \nabla_{\theta} \left(\theta -\eta \nabla \mathcal{L}^{D}(\theta)\right)=\nabla_{\theta&apos;} \mathcal{L}^{D&apos;}(\theta&apos;) -\eta \nabla_{\theta&apos;}\mathcal{L}^{D&apos;}(\theta&apos;)H_\theta(\mathcal{L}^D(\theta))\\
\approx \nabla_{\theta&apos;} \mathcal{L}^{D&apos;}(\theta&apos;) -\frac{\eta}{u} \left[ \left .\nabla_{\theta}\mathcal{L}^{D}(\theta)\right|_{\hat{\theta}}-	\left .\nabla_{\theta}\mathcal{L}^{D}(\theta)\right|_{\theta} \right],\]

&lt;p&gt;where $u$ is a small constant and&lt;/p&gt;

\[\hat{\theta}=\theta + u\nabla_{\theta&apos;}\mathcal{L}^{D&apos;}(\theta&apos;)\]

&lt;p&gt;&lt;strong&gt;In practice, the authors omitted the second term by using the simplified rule&lt;/strong&gt;:&lt;/p&gt;

\[\nabla_\theta \mathcal{L}^{D&apos;}(\theta&apos;)\approx \nabla_{\theta&apos;} \mathcal{L}^{D&apos;}(\theta&apos;)\]

&lt;h3 id=&quot;comparison-with-related-works&quot;&gt;Comparison with related works&lt;/h3&gt;

&lt;p&gt;As we can see from Figure 2, the difference between transfer learning and meta learning lies in that the former aims at directly solving the source tasks. On the other hand, &lt;strong&gt;meta-learning is to be useful for fine-tuning on various tasks including the source and target tasks.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/houwx.net/posts/2020-08-11-blog-post-25-2.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;


&lt;h2 id=&quot;33-unified-lexical-representation&quot;&gt;3.3. Unified Lexical Representation&lt;/h2&gt;

&lt;p&gt;One limitation of meta-learning is that it assumes the input and output spaces are shared across all the tasks.&lt;/p&gt;

&lt;h3 id=&quot;unified-lexical-representation-ulr&quot;&gt;Unified Lexical Representation (ULR)&lt;/h3&gt;

&lt;p&gt;ULR starts with multilingual word embedding matrices $\epsilon^k_{\text{query}}\in \mathbb{R}^{|V_k|\times d}$ pretrained on monolingual corpora, where $V_k$ is the vocabulary of the $k$-th language. One of these languages is used to build the universal lexical representation, consisting of a universal embedding matrix $\epsilon_u \in \mathbb{R}^{M\times d}$ and a corresponding key matrix $\epsilon_{\text{key}} \in \mathbb{R}^{M\times d}$, where $M&amp;lt;|V’_k|$.&lt;/p&gt;

&lt;p&gt;Both $\epsilon^k_{\text{query}}$ and $\epsilon_{\text{key}}$ are fixed during meta-learning.&lt;/p&gt;

&lt;p&gt;The language-specific embedding of token $x$ from language $k$ is computed as the convex sum of universal embedding vectors:&lt;/p&gt;

\[\epsilon^0[x]=\sum_{i=1}^M \alpha_i \epsilon_u[i],\]

&lt;p&gt;where&lt;/p&gt;

\[\alpha_i \propto \exp\{-\frac{1}{\tau}\epsilon_{\text{key}}[i]^T A \epsilon^k_{\text{query}}[x]\}\]

&lt;p&gt;and $\tau=0.05$. This approach lets the model represent the vocabulary of any language with a fixed number of shared parameters ($\epsilon_u$, $\epsilon_{\text{key}}$ and $A$).&lt;/p&gt;
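
&lt;p&gt;A small sketch of this attention-based lookup, assuming fixed per-language query embeddings and fixed keys; all dimensions are illustrative:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;
import torch
import torch.nn.functional as F

def ulr_embedding(token_ids, query_emb, key_emb, universal_emb, A, tau=0.05):
    # query_emb: (|V_k|, d) fixed monolingual embeddings of language k
    # key_emb, universal_emb: (M, d) keys and universal embedding slots
    # A: (d, d) shared projection learned during meta-learning
    q = query_emb[token_ids]                                # (T, d) query vectors for the tokens
    scores = torch.matmul(torch.matmul(key_emb, A), q.t())  # (M, T) unnormalized attention
    alpha = F.softmax(scores / tau, dim=0)                  # convex weights over the M slots
    return torch.matmul(alpha.t(), universal_emb)           # (T, d) token embeddings

emb = ulr_embedding(torch.tensor([3, 7, 1]), torch.randn(100, 16),
                    torch.randn(40, 16), torch.randn(40, 16), torch.eye(16))
&lt;/code&gt;&lt;/pre&gt;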

&lt;h3 id=&quot;learning-of-ulr&quot;&gt;Learning of ULR&lt;/h3&gt;

&lt;p&gt;During language-specific learning, the authors estimate the change to each embedding vector by a separate parameter $\triangle \epsilon^k[x]$ &lt;strong&gt;to avoid directly updating the universal embedding&lt;/strong&gt;:&lt;/p&gt;

\[\epsilon^k[x]=\epsilon^0[x] + \triangle \epsilon^k[x]\]

&lt;p&gt;During language-specific learning, the first term is kept fixed and only $\triangle \epsilon^k[x]$ is updated, while during the meta-learning stage only the first term ($\epsilon_u$ and $A$) is updated.&lt;/p&gt;

&lt;h1 id=&quot;4-experiments&quot;&gt;4. Experiments&lt;/h1&gt;

&lt;h2 id=&quot;41-dataset&quot;&gt;4.1. Dataset&lt;/h2&gt;

&lt;h3 id=&quot;source-tasks&quot;&gt;Source Tasks&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Europarl&lt;/strong&gt;: Bulgarian (Bg), Czech (Cs), Danish (Da), German (De), Greek (El), Spanish (Es), Estonian (Et), French (Fr), Hungarian (Hu), Italian (It), Lithuanian (Lt), Dutch (Nl), Polish (Pl), Portuguese (Pt), Slovak (Sk), Slovene (Sl) and Swedish (Sv)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WMT’17&lt;/strong&gt;: Russian (Ru) (2M pairs subset)&lt;/p&gt;

&lt;h3 id=&quot;target-tasks&quot;&gt;Target Tasks&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;WMT’16&lt;/strong&gt;: Romanian (Ro)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WMT’17&lt;/strong&gt;: Latvian (Lv), Finnish (Fi), Turkish (Tr)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Korean Parallel Dataset&lt;/strong&gt;: Korean (Ko)&lt;/p&gt;

&lt;p&gt;Ro-En or Lv-En is used as a validation set for meta-learning.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/houwx.net/posts/2020-08-11-blog-post-25-3.jpg&quot; alt=&quot;2020-08-11-blog-post-25-3&quot; style=&quot;zoom:45%;&quot; /&gt;&lt;/p&gt;


&lt;h2 id=&quot;42-model-and-learning&quot;&gt;4.2. Model and Learning&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Model&lt;/strong&gt;: Transformer with the default hyper-parameter setting&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Training &amp;amp; Fine-tuning&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;During meta-learning, all the parameters are updated, but during fine-tuning, three strategies are considered for updating the model (see the sketch after this list):&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;updating all the modules (all)&lt;/li&gt;
  &lt;li&gt;updating the embedding and encoder only (emb+enc)&lt;/li&gt;
  &lt;li&gt;updating the embedding only (emb)&lt;/li&gt;
&lt;/ol&gt;
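
&lt;p&gt;A small sketch of switching between these strategies by freezing parameters, assuming the model exposes child modules named &lt;code&gt;embedding&lt;/code&gt;, &lt;code&gt;encoder&lt;/code&gt; and &lt;code&gt;decoder&lt;/code&gt; (the names are illustrative, not from the paper):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;
def set_finetune_strategy(model, strategy='all'):
    # strategy: 'all', 'emb+enc' (embedding and encoder) or 'emb' (embedding only)
    trainable = {
        'all': ['embedding', 'encoder', 'decoder'],
        'emb+enc': ['embedding', 'encoder'],
        'emb': ['embedding'],
    }[strategy]
    for name, module in model.named_children():
        keep = name in trainable
        for p in module.parameters():
            p.requires_grad = keep   # frozen modules get requires_grad = False
&lt;/code&gt;&lt;/pre&gt;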

&lt;h1 id=&quot;5-results&quot;&gt;5. Results&lt;/h1&gt;

&lt;h3 id=&quot;vs-multilingual-transfer-learning&quot;&gt;vs. Multilingual Transfer Learning&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;From Figure 3 below, we can observe significant improvement of meta-learning &lt;strong&gt;compared with multilingual transfer learning strategy&lt;/strong&gt;.&lt;/li&gt;
  &lt;li&gt;Note that the training sets are only subsampled sets with around 16,000 English tokens. However, the best fine-tuned MetaNMT achieves 2/3 (Ro-En) and 1/2 (rest) of the BLEU score achieved by the supervised model trained on full training sets (as shown in Table 1 above).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;img src=&quot;/houwx.net/posts/2020-08-11-blog-post-25-5.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;


&lt;h3 id=&quot;impact-of-validation-tasks&quot;&gt;Impact of Validation Tasks&lt;/h3&gt;

&lt;p&gt;Furthermore, we can notice the impact of validation tasks. &lt;strong&gt;Fi-En benefits more when Ro-En is used for validation ((c) in Figure 3)&lt;/strong&gt;, while the opposite happens with Tr-En. The relationship between the task similarity and the impact of a validation task remains to be further investigated in the future.&lt;/p&gt;

&lt;h3 id=&quot;training-size&quot;&gt;Training Size&lt;/h3&gt;

&lt;p&gt;From Figure 4, we can also observe that the BLEU curve of MetaNMT is flatter. As the target task’s training set grows, the gap between MetaNMT and MultiNMT shrinks, &lt;strong&gt;indicating the robustness of MetaNMT in handling low-resource language pairs&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/houwx.net/posts/2020-08-11-blog-post-25-6.jpg&quot; alt=&quot;2020-08-11-blog-post-25-6&quot; style=&quot;zoom:70%;&quot; /&gt;&lt;/p&gt;


&lt;h3 id=&quot;impact-of-source-tasks&quot;&gt;Impact of Source Tasks&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;It can be inferred from Table 2 that using more source tasks is always beneficial to MetaNMT, there is up to 2x improvement from one source task (Es) to 18 source tasks (All).&lt;/li&gt;
  &lt;li&gt;The choice of source languages also has an impact on the target languages. For instance, comparing {De Ru} and {Es Fr It Pt}, the latter benefits Ro-En more, but the former benefits all the other pairs more.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;img src=&quot;/houwx.net/posts/2020-08-11-blog-post-25-4.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;


&lt;h3 id=&quot;training-curves&quot;&gt;Training Curves&lt;/h3&gt;

&lt;p&gt;Compared with MetaNMT, we can observe from Figure 5 that MultiNMT saturates rapidly and eventually degrades (overfitting), whereas MetaNMT continues to improve and never degrades.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/houwx.net/posts/2020-08-11-blog-post-25-7.jpg&quot; alt=&quot;2020-08-11-blog-post-25-7&quot; style=&quot;zoom:50%;&quot; /&gt;&lt;/p&gt;


&lt;h3 id=&quot;sample-translations&quot;&gt;Sample Translations&lt;/h3&gt;

&lt;p&gt;Table 3 presents zero-shot and meta-learned examples for Tr-En and Ko-En.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Zero-shot examples provide a word-by-word translation without re-ordering&lt;/strong&gt;, demonstrating the success of applying universal lexical representation and meta-learned initialization.&lt;/li&gt;
  &lt;li&gt;After 600 sentence pairs (16,000 English tokens), the model rapidly learns to re-order tokens and produces better translation.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;img src=&quot;/houwx.net/posts/2020-08-11-blog-post-25-8.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;


&lt;h1 id=&quot;6-conclusion&quot;&gt;6. Conclusion&lt;/h1&gt;

&lt;p&gt;Contributions of this paper include:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Proposes MetaNMT, a &lt;strong&gt;meta-learning&lt;/strong&gt; approach for low-resource neural machine translation.&lt;/li&gt;
  &lt;li&gt;Applies a &lt;strong&gt;universal lexical representation&lt;/strong&gt; to tackle the I/O mismatch problem across language pairs.&lt;/li&gt;
  &lt;li&gt;Shows that MetaNMT significantly &lt;strong&gt;outperforms the multilingual transfer learning&lt;/strong&gt; based method on low-resource tasks.&lt;/li&gt;
&lt;/ol&gt;</content><author><name>Wenxin Hou 侯汶昕</name><email>houwx001@gmail.com</email></author><category term="NLP" /><category term="Text Summarization" /><category term="Attention" /><summary type="html">Last Updated: 2020-08-12</summary></entry><entry><title type="html">[Paper-NLP] Get To The Point: Summarization with Pointer-Generator Networks</title><link href="https://houwx.net/posts/2020/01/blog-post-22/" rel="alternate" type="text/html" title="[Paper-NLP] Get To The Point: Summarization with Pointer-Generator Networks" /><published>2020-07-09T00:00:00-07:00</published><updated>2020-07-09T00:00:00-07:00</updated><id>https://houwx.net/posts/2020/01/blog-post-22-pointer-network</id><content type="html" xml:base="https://houwx.net/posts/2020/01/blog-post-22/">&lt;p&gt;Last Updated: 2020-08-11&lt;/p&gt;

&lt;p&gt;Slides used for my final presentation in Language Engineering at Tokyo Tech: &lt;a href=&quot;/houwx.net/files/slides/2017_acl_pointer-network.pdf&quot;&gt;[slides]&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This paper:  &lt;a href=&quot;https://arxiv.org/pdf/1704.04368.pdf&quot;&gt;Get To The Point: Summarization with Pointer-Generator Networks&lt;/a&gt; is proposed by researchers from Stanford and Google.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code&lt;/strong&gt;: https://github.com/rohithreddy024/Text-Summarizer-Pytorch (PyTorch, not official), https://github.com/abisee/pointer-generator (TensorFlow 1.x, official)&lt;/p&gt;

&lt;p&gt;This paper introduces Pointer-Generator Network which solves the two &lt;strong&gt;shortcomings&lt;/strong&gt; of current models for &lt;strong&gt;abstractive&lt;/strong&gt; text summarization:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;they are liable to reproduce factual details inaccurately&lt;/li&gt;
  &lt;li&gt;they tend to repeat themselves&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Novelty&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;use a hybrid pointer-generator network that can &lt;strong&gt;copy words from the source text&lt;/strong&gt; via &lt;strong&gt;pointing&lt;/strong&gt;,  which aids accurate reproduction of information, while retaining the ability to produce novel words through the generator.&lt;/li&gt;
  &lt;li&gt;use coverage to keep track of what has been summarized, which discourages repetition&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;1-introduction&quot;&gt;1. Introduction&lt;/h2&gt;

&lt;p&gt;Two approaches to summarization: extractive and abstractive.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Extractive methods assemble summaries exclusively from passages (usually whole sentences) taken directly from the source text&lt;/li&gt;
  &lt;li&gt;abstractive methods may generate novel words and phrases not featured in the source text – as a human-written abstract usually does.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A comparison of the proposed method and the baseline Seq2Seq model with attention, from which we can see that the &lt;strong&gt;pointer-generator&lt;/strong&gt; solves the out-of-vocabulary (&lt;strong&gt;OOV&lt;/strong&gt;) problem while &lt;strong&gt;coverage&lt;/strong&gt; eliminates the &lt;strong&gt;repetition&lt;/strong&gt; problem:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/houwx.net/posts/2020-07-09-blog-post-22-1.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;


&lt;h2 id=&quot;2-our-models&quot;&gt;2. Our Models&lt;/h2&gt;

&lt;h3 id=&quot;21-sequence-to-sequence-attentional-model-baseline&quot;&gt;2.1. Sequence-to-sequence attentional model (baseline)&lt;/h3&gt;

&lt;p&gt;The model architecture is shown below:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/houwx.net/posts/2020-07-09-blog-post-22-2.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;


&lt;p&gt;&lt;strong&gt;Explanation&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Attention distribution/weights (Blue)&lt;/strong&gt;: &lt;strong&gt;a^t&lt;/strong&gt; at &lt;strong&gt;decoding&lt;/strong&gt; time step &lt;strong&gt;t&lt;/strong&gt; over the encoder hidden states &lt;strong&gt;h_i&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

\[a^t=\text{softmax}(e^t), \quad \text{where} \quad e_i^t=v^T \tanh(W_h h_i + W_s s_t + b_{attn}),\]


&lt;p&gt;where &lt;strong&gt;v, W_h, W_s, b_{attn}&lt;/strong&gt; are learnable parameters.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Context Vector&lt;/strong&gt;: weighted average of encoder hidden states &lt;strong&gt;h_i&lt;/strong&gt; with attention weights &lt;strong&gt;a^t&lt;/strong&gt; used at decoding time step &lt;strong&gt;t&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

\[h^*_t=\sum_{i}{a_i^t * h_i}\]

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Vocabulary distribution (Green)&lt;/strong&gt;: the probability distribution over all words in the vocabulary.&lt;/li&gt;
&lt;/ol&gt;

\[P_{vocab}=\text{softmax}(V&apos;(V[s_t; h_t^*]+b)+b&apos;)\]

&lt;p&gt;More intuitively,&lt;/p&gt;

\[P_{vocab}=\text{softmax}(\text{Linear}(\text{Linear}([s_t; h_t^*])))\]


&lt;p&gt;We can see that the input is just the &lt;strong&gt;concatenation of context vector and decoder state&lt;/strong&gt; at time step &lt;strong&gt;t&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Loss function&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;During training, the loss for time step &lt;strong&gt;t&lt;/strong&gt; is the &lt;strong&gt;negative log likelihood&lt;/strong&gt; (NLLLoss) of the target word &lt;strong&gt;w^∗_t&lt;/strong&gt; for that timestep:&lt;/p&gt;

\[loss_t = -\log P(w_t^*)\]

&lt;p&gt;and the &lt;strong&gt;overall loss for the whole sequence&lt;/strong&gt; is:&lt;/p&gt;

\[loss=\frac{1}{T}\sum_{t=0}^{T-1} loss_t=-\frac{1}{T}\sum_{t=0}^{T-1}\log P(w^*_t)\]
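
&lt;p&gt;To make the components above concrete, here is a minimal PyTorch-style sketch (my own, not the authors’ code) of one decoding step; &lt;strong&gt;proj1&lt;/strong&gt; and &lt;strong&gt;proj2&lt;/strong&gt; stand in for the two stacked linear layers producing &lt;strong&gt;P_vocab&lt;/strong&gt;, and all shapes are illustrative.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import torch
import torch.nn.functional as F

def attention_decoder_step(h, s_t, W_h, W_s, b_attn, v, proj1, proj2):
    # h:   encoder hidden states, shape (T_src, hidden)
    # s_t: decoder state at decoding step t, shape (hidden,)
    # attention scores: e_i = v^T tanh(W_h h_i + W_s s_t + b_attn)
    e = torch.tanh(h @ W_h.T + s_t @ W_s.T + b_attn) @ v
    a = F.softmax(e, dim=0)                    # attention distribution a^t
    context = (a.unsqueeze(1) * h).sum(dim=0)  # context vector h*_t
    # vocabulary distribution from the two stacked linear layers (proj1, proj2)
    p_vocab = F.softmax(proj2(proj1(torch.cat([s_t, context]))), dim=0)
    return a, context, p_vocab
&lt;/code&gt;&lt;/pre&gt;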

&lt;h3 id=&quot;22-pointer-generator-network&quot;&gt;2.2. Pointer-generator network&lt;/h3&gt;

&lt;p&gt;The model architecture is shown below:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/houwx.net/posts/2020-07-09-blog-post-22-3.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;


&lt;p&gt;&lt;strong&gt;Explanation&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;The first three components are the same as the baseline model as explained in section 2.1.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Attention distribution/weights (Blue)&lt;/strong&gt;: &lt;strong&gt;a^t&lt;/strong&gt; at &lt;strong&gt;decoding&lt;/strong&gt; time step &lt;strong&gt;t&lt;/strong&gt; over the encoder hidden states &lt;strong&gt;h_i&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

\[a^t=\text{softmax}(e^t), \quad \text{where} \quad e_i^t=v^T \tanh(W_h h_i + W_s s_t + b_{attn}),\]

&lt;p&gt;where &lt;strong&gt;v, W_h, W_s, b_{attn}&lt;/strong&gt; are learnable parameters.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Context Vector&lt;/strong&gt;: weighted average of encoder hidden states &lt;strong&gt;h_i&lt;/strong&gt; with attention weights &lt;strong&gt;a^t&lt;/strong&gt; used at decoding time step &lt;strong&gt;t&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

\[h^*_t=\sum_{i}{a_i^t * h_i}\]

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Vocabulary distribution (Green)&lt;/strong&gt;: the probability distribution over all words in the vocabulary.&lt;/li&gt;
&lt;/ol&gt;

\[P_{vocab}=\text{softmax}(V&apos;(V[s_t; h_t^*]+b)+b&apos;)\]

&lt;p&gt;More intuitively,&lt;/p&gt;

\[P_{vocab}=\text{softmax}(\text{Linear}(\text{Linear}([s_t; h_t^*])))\]


&lt;p&gt;which serves as the final distribution of the model output:&lt;/p&gt;

\[P(w)=P_{vocab}(w)\]

&lt;p&gt;The components below are introduced by pointer-generator to bring improvements:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Generation probability (Yellow)&lt;/strong&gt;: the probability of generating a word from the vocabulary at decoding time step &lt;strong&gt;t&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

\[p_{gen}=\text{sigmoid}(w_{h^*}^T h_t^*+w_s^T s_t+w_x^T x_t + b_{ptr})\]

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Final distribution&lt;/strong&gt;: probability distribution over the extended vocabulary composed of the &lt;strong&gt;vocabulary&lt;/strong&gt; and &lt;strong&gt;words from the source text&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

\[P(w)=p_{gen}P_{vocab}(w)+(1-p_{gen})\sum_{i:w_i=w}a_i^t\]

&lt;p&gt;&lt;strong&gt;Loss function:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The loss function is as described in equations (5) and (6), but with respect to the modified probability distribution &lt;strong&gt;P(w)&lt;/strong&gt; given in equation (13).&lt;/p&gt;
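
&lt;p&gt;A minimal sketch (based on the equations above, not the official code) of how the final distribution over the extended vocabulary can be assembled with a scatter-add of the attention weights onto the source token indices:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import torch

def final_distribution(p_vocab, a, p_gen, src_ids, extended_size):
    # p_vocab: (V,) softmax over the fixed vocabulary
    # a:       (T_src,) attention distribution at this decoding step
    # p_gen:   scalar in (0, 1), the sigmoid gate of equation (12)
    # src_ids: (T_src,) long tensor of source-token indices in the extended vocabulary
    p = torch.zeros(extended_size)
    p[: p_vocab.size(0)] = p_gen * p_vocab
    # route the copy probability (1 - p_gen) * a_i onto each source token
    return p.scatter_add(0, src_ids, (1 - p_gen) * a)
&lt;/code&gt;&lt;/pre&gt;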

&lt;h3 id=&quot;23-coverage-mechanism&quot;&gt;2.3. Coverage mechanism&lt;/h3&gt;

&lt;p&gt;Coverage mechanism is proposed to solve the problem of &lt;strong&gt;repetition&lt;/strong&gt; in Seq2Seq models.&lt;/p&gt;

&lt;p&gt;A coverage vector &lt;strong&gt;c^t&lt;/strong&gt; is maintained, which is the sum of attention distributions &lt;strong&gt;over all previous decoder timesteps&lt;/strong&gt;:&lt;/p&gt;

\[c^t=\sum_{t&apos;=0}^{t-1}a^{t&apos;}\]

&lt;p&gt;Note that &lt;strong&gt;c^0 is a zero vector&lt;/strong&gt;, because on the first timestep, none of the source document has been covered.&lt;/p&gt;

&lt;p&gt;The coverage vector is used as &lt;strong&gt;extra input to the attention mechanism&lt;/strong&gt;:&lt;/p&gt;

\[a^t=\text{softmax}(e^t), \quad \text{where} \quad e_i^t=v^T \tanh(W_h h_i + W_s s_t + w_c c_i^t + b_{attn})\]

&lt;p&gt;where &lt;strong&gt;w_c&lt;/strong&gt; is a learnable parameter vector of same length as &lt;strong&gt;v&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The authors also find it necessary (see section 5) to additionally define a &lt;strong&gt;coverage loss&lt;/strong&gt; to penalize repeatedly attending to the same locations:&lt;/p&gt;

\[\text{covloss}_t=\sum_i \min(a_i^t, c_i^t)\]

&lt;p&gt;Note that the coverage loss is bounded; in particular &lt;strong&gt;equal to or less than 1&lt;/strong&gt; (sum of softmax attention).&lt;/p&gt;

&lt;p&gt;Finally, the coverage loss, reweighted by some hyperparameter &lt;strong&gt;λ&lt;/strong&gt;, is added to the primary loss function to yield a new composite loss function:&lt;/p&gt;

\[loss=\frac{1}{T}\sum_{t=0}^{T-1} loss_t=\frac{1}{T}\sum_{t=0}^{T-1}\left(-\log P(w^*_t) +\lambda \sum_i \min(a_i^t, c_i^t) \right)\]
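
&lt;p&gt;A small sketch of the coverage bookkeeping (my own illustration), assuming the per-step attention distributions are already available; it accumulates &lt;strong&gt;c^t&lt;/strong&gt; and sums &lt;strong&gt;min(a_i^t, c_i^t)&lt;/strong&gt; as in the composite loss above:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import torch

def coverage_loss(attn_per_step):
    # attn_per_step: list of (T_src,) attention distributions a^0, ..., a^{T-1}
    c = torch.zeros_like(attn_per_step[0])    # c^0 is the zero vector
    total = torch.zeros(())
    for a in attn_per_step:
        total = total + torch.minimum(a, c).sum()   # covloss_t, bounded by 1
        c = c + a                                   # accumulate attention into c^{t+1}
    return total / len(attn_per_step)
&lt;/code&gt;&lt;/pre&gt;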

&lt;h2 id=&quot;3-related-work&quot;&gt;3. Related Work&lt;/h2&gt;

&lt;p&gt;Skipped&lt;/p&gt;

&lt;h2 id=&quot;4-dataset&quot;&gt;4. Dataset&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;CNN/Daily Mail dataset&lt;/strong&gt; (Hermann et al., 2015; Nallapati et al., 2016), which contains online news articles (781 tokens on average) paired with multi-sentence summaries (3.75 sentences or 56 tokens on average).&lt;/p&gt;

&lt;p&gt;We use the scripts supplied by Nallapati et al. (2016) to obtain the same version of the data, which has &lt;strong&gt;287,226 training pairs, 13,368 validation pairs and 11,490 test pairs&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Both of the dataset’s published results (Nallapati et al., 2016, 2017) use the anonymized version of the data, which has been pre-processed to replace each named entity, e.g., The United Nations, with its own unique identifier for the example pair, e.g., @entity5. By contrast, we operate directly on the original text (or non-anonymized version of the data), which we believe is the favorable problem to solve because it requires no pre-processing.&lt;/p&gt;

&lt;h2 id=&quot;5-experiments&quot;&gt;5. Experiments&lt;/h2&gt;

&lt;p&gt;For all experiments, our model has 256-dimensional hidden states and 128-dimensional word embeddings. For the pointer-generator models, we use a vocabulary of &lt;strong&gt;50k words&lt;/strong&gt; for both source and target – note that due to the pointer network’s ability to handle OOV words, we can use &lt;strong&gt;a smaller vocabulary size than Nallapati et al.’s (2016) 150k source and 60k target vocabularies&lt;/strong&gt;. For the baseline model, we also try a larger vocabulary size of 150k.&lt;/p&gt;

&lt;p&gt;Note that the pointer and the coverage mechanism &lt;strong&gt;introduce very few additional parameters to the network&lt;/strong&gt;:  for the models with vocabulary size 50k, the baseline model has 21,499,600 parameters, the pointer-generator adds 1153 (&lt;strong&gt;256 * 2 * 2 + 128 + 1&lt;/strong&gt;) extra parameters (&lt;strong&gt;w_{h^∗}&lt;/strong&gt;, &lt;strong&gt;w_s&lt;/strong&gt;, &lt;strong&gt;w_x&lt;/strong&gt;, and &lt;strong&gt;b_{ptr}&lt;/strong&gt; in equation 12), and coverage adds 512 (&lt;strong&gt;256 * 2 directions&lt;/strong&gt;) extra parameters (&lt;strong&gt;w_c&lt;/strong&gt; in equation 15).&lt;/p&gt;

&lt;p&gt;Training details can be found in the paper.&lt;/p&gt;

&lt;p&gt;About the coverage mechanism:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;To obtain our final coverage model, we added the coverage mechanism with coverage loss weighted to λ=1 (as described in equation 17), and trained for a further 3000 iterations (about 2 hours).&lt;/p&gt;

  &lt;p&gt;We tried training  the  coverage  model  without the loss function but found this to be ineffective, with no discernible reduction in repetition. We also tried training with coverage from the first iteration rather than as a separate training phase,but found that in the early phase of training, the coverage objective &lt;strong&gt;interfered with the main objective, reducing overall performance&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;6-results&quot;&gt;6. Results&lt;/h2&gt;

&lt;h3 id=&quot;61-preliminaries&quot;&gt;6.1. Preliminaries&lt;/h3&gt;

&lt;p&gt;Results are given in Table 1. The models are evaluated with the standard &lt;strong&gt;ROUGE metric&lt;/strong&gt;, reporting the F1 scores for ROUGE-1, ROUGE-2 and ROUGE-L (which respectively measure the word-overlap, bigram-overlap, and longest common subsequence between the reference summary and the summary to be evaluated). ROUGE scores are obtained using the &lt;strong&gt;pyrouge&lt;/strong&gt; package (pypi.python.org/pypi/pyrouge/0.1.3).&lt;/p&gt;

&lt;p&gt;We also evaluate with the &lt;strong&gt;METEOR metric&lt;/strong&gt;, both in exact match mode (rewarding only exact matches between words) and full mode (which additionally rewards matching stems, synonyms and para-phrases). (http://www.cs.cmu.edu/~alavie/METEOR)&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/houwx.net/posts/2020-07-09-blog-post-22-4.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;


&lt;h3 id=&quot;62-observations&quot;&gt;6.2. Observations&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Baselines&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;Both baseline models perform poorly with respect to ROUGE and METEOR, and the larger vocabulary size (150k) does not seem to help (the 50k baseline even does slightly better).&lt;/p&gt;

&lt;p&gt;Factual details are frequently reproduced incorrectly, often replacing an uncommon (but in-vocabulary) word with a more common alternative. For example, in Figure 1 the baseline model appears to struggle with the rare word &lt;strong&gt;&lt;em&gt;thwart&lt;/em&gt;&lt;/strong&gt;, producing &lt;strong&gt;&lt;em&gt;destabilize&lt;/em&gt;&lt;/strong&gt; instead, which leads to the fabricated phrase &lt;strong&gt;&lt;em&gt;destabilize nigeria’s economy&lt;/em&gt;&lt;/strong&gt;. Even more catastrophically, the summaries sometimes devolve into &lt;strong&gt;repetitive nonsense, such as the third sentence&lt;/strong&gt; produced by the baseline model in Figure 1. In addition, the baseline model cannot reproduce OOV words (such as &lt;strong&gt;&lt;em&gt;muhammadu buhari&lt;/em&gt;&lt;/strong&gt; in Figure 1). Further examples of all these problems are provided in the supplementary material.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pointer-generator&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;Pointer-generator model achieves much better ROUGE and METEOR scores than the baseline, despite many &lt;strong&gt;fewer training epochs&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;OOV words are handled  easily, factual details are almost always copied correctly, and there are no fabrications (see Figure 1). However, &lt;strong&gt;repetition&lt;/strong&gt; is still very common.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pointer-generator with coverage&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;The pointer-generator model with coverage improves the ROUGE and METEOR scores further, convincingly surpassing the best abstractive model of Nallapati et al. (2016).&lt;/p&gt;

&lt;p&gt;Despite the brevity of the coverage training phase (about 1% of the total training time), the repetition problem is almost completely eliminated, which can be seen both qualitatively (Figure 1) and quantitatively (Figure 4). However, our best model does not quite surpass the ROUGE scores of the lead-3 baseline, nor the current best extractive model (Nallapati et al., 2017).&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/houwx.net/posts/2020-07-09-blog-post-22-1.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;


&lt;p&gt;&lt;img src=&quot;/houwx.net/posts/2020-07-09-blog-post-22-5.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;


&lt;h2 id=&quot;7-discussion&quot;&gt;7. Discussion&lt;/h2&gt;

&lt;h3 id=&quot;71-comparison-with-extractive-systems&quot;&gt;7.1. Comparison with extractive systems&lt;/h3&gt;

&lt;p&gt;Table 1 shows stronger performance by the extractive systems (including the lead-3 baseline, which summarizes using only the first three sentences).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Explanation&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;News articles tend to be structured with the most important information at the start&lt;/li&gt;
  &lt;li&gt;ROUGE metric naturally prefers extractive approaches, but abstractive summaries are subjective&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;72-how-abstractive-is-our-model&quot;&gt;7.2. How abstractive is our model?&lt;/h3&gt;

&lt;p&gt;Figure 6 shows that the final model copies whole article sentences 35% of the time; by comparison, the reference summaries do so only 1.3% of the time.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/houwx.net/posts/2020-07-09-blog-post-22-6.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;


&lt;h2 id=&quot;8-conclusion&quot;&gt;8. Conclusion&lt;/h2&gt;

&lt;p&gt;This paper introduces the copying mechanism to the abstractive summarization task and further achieves improvements by proposing the coverage mechanism. It is an interesting attempt and obtains relatively promising results.&lt;/p&gt;

&lt;p&gt;In my view, however, there are still several open issues:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;For abstractive text generation, what is a more suitable metric apart from the existing BLEU, ROUGE &amp;amp; METEOR, etc.?&lt;/li&gt;
  &lt;li&gt;The copying mechanism improves model performance but degrades the abstractive capability, which remains a problem to solve.&lt;/li&gt;
&lt;/ol&gt;</content><author><name>Wenxin Hou 侯汶昕</name><email>houwx001@gmail.com</email></author><category term="NLP" /><category term="Classic" /><category term="Text Summarization" /><category term="Attention" /><summary type="html">Last Updated: 2020-08-11</summary></entry><entry><title type="html">[Paper-Vocoder] WaveNet: A Generative Model for Raw Audio</title><link href="https://houwx.net/posts/2020/01/blog-post-21/" rel="alternate" type="text/html" title="[Paper-Vocoder] WaveNet: A Generative Model for Raw Audio" /><published>2020-07-04T00:00:00-07:00</published><updated>2020-07-04T00:00:00-07:00</updated><id>https://houwx.net/posts/2020/01/blog-post-21-wavenet</id><content type="html" xml:base="https://houwx.net/posts/2020/01/blog-post-21/">&lt;p&gt;Last Updated: 2020-07-04&lt;/p&gt;

&lt;p&gt;This paper:  &lt;a href=&quot;https://arxiv.org/pdf/1609.03499.pdf&quot;&gt;WaveNet: A Generative Model for Raw Audio&lt;/a&gt; is proposed by researchers from Google and DeepMind.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code&lt;/strong&gt;: https://github.com/kan-bayashi/PytorchWaveNetVocoder (not official)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Samples&lt;/strong&gt;: https://www.deepmind.com/blog/article/wavenet-generative-model-raw-audio&lt;/p&gt;

&lt;p&gt;This paper introduces WaveNet, a deep neural network for generating raw audio waveforms.&lt;/p&gt;

&lt;h2 id=&quot;1-introduction&quot;&gt;1. Introduction&lt;/h2&gt;

&lt;p&gt;This paper introduces WaveNet, an audio generative model based on the PixelCNN (van den Oordet al., 2016a;b) architecture. The main contributions of this work are as follows:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;It shows that WaveNets can generate raw speech signals with subjective naturalness never before reported in the field of text-to-speech (TTS), as assessed by human raters.&lt;/li&gt;
  &lt;li&gt;In order to deal with &lt;strong&gt;long-range temporal dependencies&lt;/strong&gt; needed for raw audio generation, the authors develop new architectures based on &lt;strong&gt;dilated causal convolutions&lt;/strong&gt;, which &lt;strong&gt;exhibit very large receptive fields&lt;/strong&gt;.&lt;/li&gt;
  &lt;li&gt;It is shown that when conditioned on a speaker identity, a single model can be used to generate different voices.&lt;/li&gt;
  &lt;li&gt;The same architecture shows strong results when tested on a small &lt;strong&gt;speech recognition&lt;/strong&gt; dataset, and is promising when used to generate other audio modalities such as music.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;img src=&quot;/houwx.net/posts/2020-07-04-blog-post-21-1.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;


&lt;p&gt;In general, &lt;strong&gt;WaveNets provide a generic and flexible framework for tackling many applications that rely on audio generation&lt;/strong&gt; (e.g. TTS, music, speech enhancement, voice conversion, source separation).&lt;/p&gt;

&lt;h2 id=&quot;2-wavenet&quot;&gt;2. WaveNet&lt;/h2&gt;

&lt;p&gt;WaveNet is a generative model operating directly on the &lt;strong&gt;raw audio waveform&lt;/strong&gt;. The joint probability of a waveform &lt;strong&gt;x&lt;/strong&gt;={x_1, …, x_T} is factorised as a product of conditional probabilities as follows:&lt;/p&gt;

\[p(\textbf{x})=\prod_{t=1}^T p(x_t|x_1, ..., x_{t-1})\]

&lt;p&gt;Each audio sample x_t is therefore conditioned on the samples at all previous timesteps.&lt;/p&gt;

&lt;p&gt;Similarly to PixelCNNs (van den Oord et al., 2016a;b), the conditional probability distribution is modelled by a stack of convolutional layers.  There are no pooling layers in the network, and the output of the model has the same time dimensionality as the input. The model &lt;strong&gt;outputs a categorical distribution over the next value x_t with a softmax layer&lt;/strong&gt; and it is optimized to maximize the log-likelihood of the data w.r.t.  the parameters.&lt;/p&gt;

&lt;h3 id=&quot;21-dilated-causal-convolutions&quot;&gt;2.1. Dilated Causal Convolutions&lt;/h3&gt;

&lt;p&gt;Illustration of causal convolutional layers are shown in Figure 2.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/houwx.net/posts/2020-07-04-blog-post-21-2.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;


&lt;p&gt;At training time, the conditional predictions for all timesteps can be made in parallel because all timesteps of ground truth &lt;strong&gt;x&lt;/strong&gt; are known.  When generating with the model, the predictions are sequential: &lt;strong&gt;after each sample is predicted, it is fed back into the network to predict the next sample&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Pros and cons of causal convolutions compared to RNNs:&lt;/p&gt;

&lt;p&gt;Because models with causal convolutions do not have recurrent connections, they are typically &lt;strong&gt;faster to train than RNNs&lt;/strong&gt;, especially when applied to very long sequences. &lt;strong&gt;One of the problems of causal convolutions is that they require many layers&lt;/strong&gt;, or large filters to increase the receptive field. For example, in Fig. 2 the receptive field is only 5 (= #layers + filter/kernel length - 1 &lt;strong&gt;= 4 + 2 - 1&lt;/strong&gt;).&lt;/p&gt;

&lt;p&gt;Authors’ improvements:&lt;/p&gt;

&lt;p&gt;In this paper &lt;strong&gt;dilated convolutions are used to increase the receptive field by orders of magnitude&lt;/strong&gt;, without greatly increasing computational cost.&lt;/p&gt;

&lt;p&gt;A dilated convolution is a convolution where the filter/kernel is applied over an area larger than its length &lt;strong&gt;by skipping input values with a certain step&lt;/strong&gt;.  It is &lt;strong&gt;equivalent to a convolution with a larger filter derived from the original filter by dilating it with zeros&lt;/strong&gt;, but is significantly more efficient. A dilated convolution effectively allows the network to operate on a coarser scale than with a normal convolution. This is similar to pooling or strided convolutions, but here &lt;strong&gt;the output has the same size as the input&lt;/strong&gt;.  As a special case, &lt;strong&gt;dilated convolution with dilation 1 yields the standard convolution&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Figure 3 depicts dilated causal convolutions for &lt;strong&gt;dilations 1, 2, 4,and 8&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/houwx.net/posts/2020-07-04-blog-post-21-3.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;


&lt;p&gt;&lt;strong&gt;Stacked dilated convolutions enable networks to have very large receptive fields with just a few layers&lt;/strong&gt;, while preserving the input resolution throughout the network as well as computational efficiency. &lt;strong&gt;In this paper, the dilation is doubled for every layer up to a limit and then repeated&lt;/strong&gt;: e.g. 1, 2, 4, …, 512, 1, 2, 4, …, 512, 1, 2, 4, …, 512.&lt;/p&gt;
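&lt;p&gt;A minimal PyTorch-style sketch (my own illustration) of a dilated causal convolution implemented with left-padding, together with the 1, 2, 4, …, 512 dilation schedule described above; the last lines check the per-block receptive field.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    # causal convolution: pad only on the left so output t never sees inputs after t
    def __init__(self, channels, kernel_size, dilation):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                                 # x: (batch, channels, time)
        return self.conv(nn.functional.pad(x, (self.left_pad, 0)))

block = [2 ** i for i in range(10)]   # dilations 1, 2, 4, ..., 512
dilations = block * 3                 # the schedule is repeated, as in the paper
# receptive field of one block of kernel-size-2 dilated causal convolutions:
# 1 + sum over layers of (kernel_size - 1) * dilation = 1 + 1023 = 1024
receptive_field_per_block = 1 + sum(block)
&lt;/code&gt;&lt;/pre&gt;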

&lt;p&gt;&lt;strong&gt;Intuition:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Exponentially increasing the dilation factor results in exponential receptive field growth with depth (Yu &amp;amp; Koltun, 2016).  For example each 1, 2, 4, …, 512 block has receptive field of size 1024, and can be seen as a more efficient and discriminative (non-linear) counterpart of a 1×1024 convolution.&lt;/li&gt;
  &lt;li&gt;Stacking these blocks further increases the model capacity and the receptive field size.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;22-softmax-distributions&quot;&gt;2.2. Softmax Distributions&lt;/h3&gt;

&lt;p&gt;One approach to modeling the conditional distributions &lt;strong&gt;p(x_t|x_1, …, x_t−1)&lt;/strong&gt; over  the  individual audio samples would be to &lt;strong&gt;use a mixture model such as a mixture density network&lt;/strong&gt; (Bishop, 1994) or mixture of conditional Gaussian scale mixtures (MCGSM) (Theis &amp;amp; Bethge, 2015).  However,van den Oord et al. (2016a) showed that &lt;strong&gt;a softmax distribution tends to work better&lt;/strong&gt;, even when the data is implicitly continuous (as is the case for image pixel intensities or audio sample values). One of the reasons is that a categorical distribution is more flexible and can more easily model arbitrary distributions because it makes no assumptions about their shape.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Raw audio is typically stored as a sequence of 16-bit integer values (one per timestep), so a softmax layer would need to output 65,536 probabilities per timestep to model all possible values.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The authors apply a μ-law companding transformation (ITU-T, 1988) to the data, and then quantize it to 256 possible values:&lt;/p&gt;

\[f(x_t)=\text{sign}(x_t)\frac{\ln(1+\mu|x_t|)}{\ln(1+\mu)},\]

&lt;p&gt;where &lt;strong&gt;−1&amp;lt; x_t&amp;lt;1&lt;/strong&gt; and &lt;strong&gt;μ = 255&lt;/strong&gt;. This non-linear quantization produces a significantly better reconstruction than a simple linear quantization scheme.  Especially &lt;strong&gt;for speech&lt;/strong&gt;, it is found that &lt;strong&gt;the reconstructed signal after quantization sounded very similar to the original&lt;/strong&gt;.&lt;/p&gt;
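&lt;p&gt;A minimal sketch of the μ-law companding and 256-level quantization described above (my own illustration; the exact binning convention is an assumption):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import torch

def mu_law_encode(x, mu=255.0, num_bins=256):
    # x is a waveform with values in (-1, 1)
    # companding: f(x) = sign(x) * ln(1 + mu * |x|) / ln(1 + mu)
    f = torch.sign(x) * torch.log1p(mu * torch.abs(x)) / torch.log1p(torch.tensor(mu))
    # map (-1, 1) to integer bins 0 .. 255 used as softmax targets
    return ((f + 1.0) / 2.0 * (num_bins - 1)).round().long()
&lt;/code&gt;&lt;/pre&gt;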

&lt;h3 id=&quot;23-gated-activation-units&quot;&gt;2.3. Gated Activation Units&lt;/h3&gt;

&lt;p&gt;The authors use the same gated activation unit as used in the gated PixelCNN (van den Oord et al., 2016b):&lt;/p&gt;

\[\textbf{z}=\tanh(W_{f,k}*\textbf{x}) \odot \sigma(W_{g,k}*\textbf{x})\]

&lt;p&gt;where &lt;strong&gt;∗ denotes a convolution operator&lt;/strong&gt;, &lt;strong&gt;\odot&lt;/strong&gt; denotes an element-wise multiplication operator, &lt;strong&gt;σ(·)&lt;/strong&gt; is a sigmoid function, &lt;strong&gt;k&lt;/strong&gt; is the layer index, &lt;strong&gt;f&lt;/strong&gt; and &lt;strong&gt;g&lt;/strong&gt; denote filter and gate, respectively, and &lt;strong&gt;W&lt;/strong&gt; is a learnable convolution filter.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;In our initial experiments, &lt;strong&gt;we observed that this non-linearity worked significantly better than the rectified linear activation function (Nair &amp;amp; Hinton, 2010) for modeling audio signals&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3 id=&quot;24-residual-and-skip-connections&quot;&gt;2.4. Residual And Skip Connections&lt;/h3&gt;

&lt;blockquote&gt;
  &lt;p&gt;Both residual (He et al., 2015) and parameterised skip connections are used throughout the network,to speed up convergence and enable training of much deeper models.  In Fig. 4 we show a residual block of our model, which is stacked many times in the network.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;img src=&quot;/houwx.net/posts/2020-07-04-blog-post-21-4.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
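&lt;p&gt;Putting 2.3 and 2.4 together, here is a sketch of one residual block in the spirit of Fig. 4 (my own simplification; channel sizes and the placement of the 1×1 convolutions are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import torch
import torch.nn as nn

class WaveNetBlock(nn.Module):
    def __init__(self, residual_ch, skip_ch, kernel_size, dilation):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        # W_{f,k} and W_{g,k}: dilated convolutions for the filter and the gate
        self.filter_conv = nn.Conv1d(residual_ch, residual_ch, kernel_size, dilation=dilation)
        self.gate_conv = nn.Conv1d(residual_ch, residual_ch, kernel_size, dilation=dilation)
        self.res_proj = nn.Conv1d(residual_ch, residual_ch, 1)   # 1x1 conv back to the residual path
        self.skip_proj = nn.Conv1d(residual_ch, skip_ch, 1)      # 1x1 conv to the skip path

    def forward(self, x):                                        # x: (batch, residual_ch, time)
        h = nn.functional.pad(x, (self.left_pad, 0))             # left-pad to stay causal
        z = torch.tanh(self.filter_conv(h)) * torch.sigmoid(self.gate_conv(h))
        return self.res_proj(z) + x, self.skip_proj(z)           # residual output, skip output
&lt;/code&gt;&lt;/pre&gt;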


&lt;h3 id=&quot;25-conditional-wavenets&quot;&gt;2.5. Conditional WaveNets&lt;/h3&gt;

&lt;p&gt;Given an additional input &lt;strong&gt;h&lt;/strong&gt;, WaveNets can model the conditional distribution &lt;strong&gt;p(x|h)&lt;/strong&gt; of the audio given this input. Eq. (1) now becomes&lt;/p&gt;

\[p(\textbf{x}|\textbf{h})=\prod_{t=1}^T p(x_t|x_1, ..., x_{t-1}, \textbf{h})\]

&lt;p&gt;By conditioning the model on other input variables, we can guide WaveNet’s generation to produce audio with the required characteristics.  For example, in a multi-speaker setting we can choose the speaker by feeding the speaker identity to the model as an extra input.  Similarly, for TTS we need to feed information about the text as an extra input.&lt;/p&gt;

&lt;p&gt;The authors condition the model on other inputs in two different ways: &lt;strong&gt;global conditioning&lt;/strong&gt; and &lt;strong&gt;local conditioning&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Global conditioning:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Global conditioning is characterised by a single latent representation &lt;strong&gt;h&lt;/strong&gt; that influences the output distribution &lt;strong&gt;across all timesteps&lt;/strong&gt;, e.g. a speaker embedding in a TTS model.  The activation function from Eq. (2) now becomes:&lt;/p&gt;

\[\textbf{z}=\tanh(W_{f,k}*\textbf{x}+V_{f,k}^T\textbf{h}) \odot \sigma(W_{g,k}*\textbf{x}+V_{g,k}^T\textbf{h})\]

&lt;p&gt;where &lt;strong&gt;V_{∗,k}&lt;/strong&gt; is a learnable linear projection, and the vector &lt;strong&gt;V^T_{∗,k}h&lt;/strong&gt; is broadcast over the time dimension.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Local conditioning:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For local conditioning we have a &lt;strong&gt;second time series&lt;/strong&gt; &lt;strong&gt;h_t&lt;/strong&gt;, possibly &lt;strong&gt;with a lower sampling frequency than the audio signal, e.g. linguistic features in a TTS model&lt;/strong&gt;.  We first transform this time series using a transposed convolutional network (learned upsampling) that maps it to a new time series &lt;strong&gt;y=f(h)&lt;/strong&gt; with the same resolution as the audio signal, which is then used in the activation unit as follows:&lt;/p&gt;

\[\textbf{z}=\tanh(W_{f,k}*\textbf{x}+V_{f,k} * \textbf{y}) \odot \sigma(W_{g,k}*\textbf{x}+V_{g,k} * \textbf{y})\]

&lt;p&gt;where &lt;strong&gt;V_{f, k} ∗ y&lt;/strong&gt; is now a 1×1 convolution. As an alternative to the transposed convolutional network, it is also possible to use &lt;strong&gt;V_{f, k} ∗ h&lt;/strong&gt; and repeat these values across time. This worked slightly worse in the experiments.&lt;/p&gt;
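
&lt;p&gt;A small sketch of local conditioning as described above: a transposed convolution upsamples the slower feature series &lt;strong&gt;h&lt;/strong&gt; to &lt;strong&gt;y&lt;/strong&gt;, and 1×1 convolutions of &lt;strong&gt;y&lt;/strong&gt; are added inside the gated activation (the 80-dimensional features and the factor-160 upsampling are assumptions for illustration):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import torch
import torch.nn as nn

# hypothetical sizes: 80-dim conditioning features at 1/160 of the audio frame rate
upsample = nn.ConvTranspose1d(80, 80, kernel_size=160, stride=160)  # learned upsampling of h
cond_f = nn.Conv1d(80, 64, 1)    # V_{f,k} * y as a 1x1 convolution
cond_g = nn.Conv1d(80, 64, 1)    # V_{g,k} * y
wave_f = nn.Conv1d(64, 64, 2)    # W_{f,k} * x (kernel size 2, dilation 1 for brevity)
wave_g = nn.Conv1d(64, 64, 2)    # W_{g,k} * x

def locally_conditioned_activation(x, h):
    # x: (batch, 64, T) audio-path features, h: (batch, 80, T // 160) conditioning series
    y = upsample(h)                   # now has the same time resolution T as x
    x = nn.functional.pad(x, (1, 0))  # causal padding for the kernel-size-2 convolutions
    return torch.tanh(wave_f(x) + cond_f(y)) * torch.sigmoid(wave_g(x) + cond_g(y))
&lt;/code&gt;&lt;/pre&gt;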

&lt;h3 id=&quot;26-context-stacks&quot;&gt;2.6. Context Stacks&lt;/h3&gt;

&lt;blockquote&gt;
  &lt;p&gt;We have already mentioned several different ways to increase the receptive field size of a WaveNet:&lt;/p&gt;

  &lt;ol&gt;
    &lt;li&gt;increasing the number of dilation stages&lt;/li&gt;
    &lt;li&gt;using more layers&lt;/li&gt;
    &lt;li&gt;larger filters&lt;/li&gt;
    &lt;li&gt;greater dilation factors&lt;/li&gt;
    &lt;li&gt;a combination thereof.&lt;/li&gt;
  &lt;/ol&gt;

  &lt;p&gt;A complementary approach is to use a separate, smaller context stack that processes a long part of the audio signal and locally conditions a larger WaveNet that processes only a smaller part of the audio signal (cropped at the end).  One can use multiple context stacks with varying lengths and numbers of hidden units.  Stacks with larger receptive fields have fewer units per layer. Context stacks can also have pooling layers to run at a lower frequency. This keeps the computational requirements at a reasonable level and is consistent with the intuition that less capacity is required to model temporal correlations at longer timescales.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;3-experiments&quot;&gt;3. Experiments&lt;/h2&gt;

&lt;p&gt;The authors evaluate WaveNet on three different tasks:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;multi-speaker speech generation (&lt;strong&gt;not conditioned on text&lt;/strong&gt;)&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;TTS&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;music audio modelling.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;31-multi-speaker-speech-generation&quot;&gt;3.1. Multi-Speaker Speech Generation&lt;/h3&gt;

&lt;p&gt;The authors used &lt;strong&gt;the English multi-speaker corpus from CSTR voice cloning toolkit (VCTK)&lt;/strong&gt; (Yamagishi, 2012) and conditioned WaveNet only on the speaker. The conditioning was applied by &lt;strong&gt;feeding the speaker ID to the model in the form of a one-hot vector&lt;/strong&gt;. The dataset consisted of 44 hours of data from 109 different speakers.&lt;/p&gt;

&lt;p&gt;Because the model is &lt;strong&gt;not conditioned on text&lt;/strong&gt;,  it generates non-existent but human language-like words in a smooth way with realistic sounding intonations.  This is similar to generative models of language or images, where samples look realistic at first glance, but are clearly unnatural upon closer inspection.  &lt;strong&gt;The lack of long range coherence is partly due to the limited size of the model’s receptive field (about 300 milliseconds), which means it can only remember the last 2–3 phonemes it produced.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A single WaveNet was able to model speech from any of the speakers by conditioning it on a one-hot encoding of a speaker. This confirms that it is powerful enough to capture the characteristics of all 109 speakers from the dataset in a single model.  We observed that adding speakers resulted in better validation set performance compared to training solely on a single speaker. This suggests that WaveNet’s internal representation was shared among multiple speakers.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Finally, we observed that the model also picked up on other characteristics in the audio apart from the voice itself.  For instance, it also mimicked the acoustics and recording quality, as well as the breathing and mouth movements of the speakers.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3 id=&quot;32-text-to-speech&quot;&gt;3.2. Text-To-Speech&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Datasets:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;The North American English dataset contains 24.6 hours of speech data&lt;/li&gt;
  &lt;li&gt;The Mandarin Chinese dataset contains 34.8 hours;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Both datasets were spoken by professional female speakers.&lt;/p&gt;

&lt;p&gt;WaveNets for the TTS task were &lt;strong&gt;locally conditioned on linguistic features&lt;/strong&gt; which were derived from input texts. The authors also trained WaveNets conditioned on the &lt;strong&gt;logarithmic fundamental frequency&lt;/strong&gt; (logF0) values in addition to the linguistic features.  External models predicting &lt;strong&gt;logF0&lt;/strong&gt; values and phone durations from linguistic features were also trained for each language.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Receptive field size and baselines:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The receptive field size of the WaveNets was 240 milliseconds. As example-based and model-based speech synthesis baselines, hidden Markov model (HMM)-driven unit selection concatenative (Gonzalvo et al., 2016) and long short-term memory recurrent neural network (LSTM-RNN)-based statistical parametric (Zen et al., 2016) speech synthesizers were built. Since the same datasets and linguistic features were used to train both the baselines and WaveNets, these speech synthesizers could be fairly compared.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Evaluation:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Subjective paired comparison tests&lt;/strong&gt; and &lt;strong&gt;mean opinion score (MOS) tests&lt;/strong&gt; were conducted. In the paired comparison tests, after listening to each pair of samples, the subjects were asked to choose which they preferred, though they could choose “neutral” if they did not have any preference. In the MOS tests, after listening to each stimulus, the subjects were asked to rate its naturalness on a five-point Likert scale (1: Bad, 2: Poor, 3: Fair, 4: Good, 5: Excellent). Please refer to Appendix B for details.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Subjective paired comparison tests:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Fig. 5 shows a selection of the subjective paired comparison test results (see Appendix B for the complete table). It can be seen from the results that WaveNet outperformed the baseline statistical parametric and concatenative speech synthesizers in both languages.&lt;/p&gt;

&lt;p&gt;The authors found that &lt;strong&gt;WaveNet conditioned on linguistic features only&lt;/strong&gt; could synthesize speech samples with natural segmental quality but &lt;strong&gt;sometimes it had unnatural prosody by stressing wrong words in a sentence&lt;/strong&gt;. This could be due to the long-term dependency of &lt;strong&gt;F0&lt;/strong&gt; contours: the size of the receptive field of the WaveNet, 240 milliseconds, was not long enough to capture such long-term dependency. &lt;strong&gt;WaveNet conditioned on both linguistic features and F0 values did not have this problem&lt;/strong&gt;:  the external &lt;strong&gt;F0&lt;/strong&gt; prediction model runs at a lower frequency (200 Hz) so it can learn long-range dependencies that exist in &lt;strong&gt;F0&lt;/strong&gt; contours.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/houwx.net/posts/2020-07-04-blog-post-21-5.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;


&lt;p&gt;&lt;strong&gt;Mean opinion score (MOS) tests:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Table 1 shows the MOS test results. It can be seen from the table that WaveNets achieved 5-scale MOSs in naturalness above 4.0, which were significantly better than those from the baseline systems. They were the highest ever reported MOS values with these training datasets and test sentences. The gap in the MOSs from the best synthetic speech to the natural ones decreased from 0.69 to 0.34 (51%) in US English and 0.42 to 0.13 (69%) in Mandarin Chinese.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/houwx.net/posts/2020-07-04-blog-post-21-6.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;


&lt;h3 id=&quot;33-music&quot;&gt;3.3. Music&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Datasets:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;the MagnaTagATune dataset (Law &amp;amp; Von Ahn, 2009), which consists of about 200 hours of music audio. Each 29-second clip is annotated with tags from a set of 188, which describe the genre, instrumentation, tempo, volume and mood of the music.&lt;/li&gt;
  &lt;li&gt;the YouTube piano dataset, which consists of about 60 hours of solo piano music obtained from YouTube videos. Because it is constrained to a single instrument, it is considerably easier to model.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The authors found that &lt;strong&gt;enlarging the receptive field was crucial&lt;/strong&gt; to obtain samples that sounded musical. Even with a receptive field of several seconds, the models did not enforce long-range consistency which resulted in second-to-second variations in genre, instrumentation, volume and sound quality.  Nevertheless, the samples were often harmonic and aesthetically pleasing, even when produced by unconditional models.&lt;/p&gt;

&lt;p&gt;Of particular interest are &lt;strong&gt;conditional music models, which can generate music given a set of tags specifying e.g. genre or instruments&lt;/strong&gt;.  Similarly to conditional speech models, the authors &lt;strong&gt;insert biases that depend on a binary vector representation of the tags&lt;/strong&gt; associated with each training clip. This makes it possible to control various aspects of the output of the model when sampling, by feeding in a binary vector that encodes the desired properties of the samples.  Such models are trained on the &lt;strong&gt;MagnaTagATune&lt;/strong&gt; dataset; although the tag data bundled with the dataset was relatively noisy and had many omissions, after cleaning it up by merging similar tags and removing those with too few associated clips, this works reasonably well.&lt;/p&gt;

&lt;h3 id=&quot;34-speech-recognition&quot;&gt;3.4. Speech Recognition&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Dataset&lt;/strong&gt;: TIMIT (Garofolo et al.,1993) dataset&lt;/p&gt;

&lt;p&gt;WaveNets show that layers of dilated convolutions allow the receptive field to grow longer in a much cheaper way than using LSTM units.&lt;/p&gt;

&lt;p&gt;For this task a mean-pooling layer is added after the dilated convolutions that aggregated the activations to coarser frames spanning 10 milliseconds (160×downsampling).  The pooling layer was followed by a few non-causal convolutions.  The authors trained WaveNet with &lt;strong&gt;two loss terms, one to predict the next sample and one to classify the frame&lt;/strong&gt;, the model generalized better than with a single loss and achieved &lt;strong&gt;18.8 PER&lt;/strong&gt; on the test set, which is to our knowledge the best score obtained from a model trained directly on raw audio on TIMIT.&lt;/p&gt;

&lt;h2 id=&quot;4-conclusion&quot;&gt;4. Conclusion&lt;/h2&gt;

&lt;p&gt;The authors introduced WaveNets,  which are autoregressive and combine causal filters with dilated convolutions to allow their receptive fields to grow exponentially with depth, which is important to model the long-range temporal dependencies in audio signals.&lt;/p&gt;</content><author><name>Wenxin Hou 侯汶昕</name><email>houwx001@gmail.com</email></author><category term="Generative Model" /><category term="Speech Synthesis" /><category term="TTS" /><summary type="html">Last Updated: 2020-07-04</summary></entry><entry><title type="html">[Paper-PreTrain] wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations</title><link href="https://houwx.net/posts/2020/01/blog-post-19/" rel="alternate" type="text/html" title="[Paper-PreTrain] wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations" /><published>2020-06-27T00:00:00-07:00</published><updated>2020-06-27T00:00:00-07:00</updated><id>https://houwx.net/posts/2020/01/blog-post-19-wav2vec2</id><content type="html" xml:base="https://houwx.net/posts/2020/01/blog-post-19/">&lt;p&gt;Last Updated: 2020-06-28&lt;/p&gt;

&lt;p&gt;This paper:  &lt;a href=&quot;https://arxiv.org/pdf/2006.11477.pdf&quot;&gt;wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations&lt;/a&gt; is proposed by researchers from Facebook AI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code&lt;/strong&gt;: https://github.com/pytorch/fairseq/ (seems not released yet)&lt;/p&gt;

&lt;p&gt;In this paper, the authors show that wav2vec 2.0, which learns powerful representations from speech audio alone and is then fine-tuned on transcribed speech, can outperform the best semi-supervised methods while being conceptually simpler.&lt;/p&gt;

&lt;p&gt;wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations which are jointly learned.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Results&lt;/strong&gt;: SOTA on 100 hour subset of Librispeech as well as on TIMIT phoneme recognition.&lt;/p&gt;

&lt;h2 id=&quot;1-introduction&quot;&gt;1. Introduction&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Motivation&lt;/strong&gt;: current speech recognition systems require thousands of hours of transcribed speech to reach acceptable performance which is not available for the vast majority of the nearly 7,000 languages spoken worldwide [30]. Learning purely from labeled examples does not resemble language acquisition in humans: &lt;strong&gt;infants learn language by listening to adults around them - a process that requires learning good representations of speech.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Our approach encodes speech audio via a multi-layer convolutional neural network and then masks spans of the resulting latent speech representations [25,54], similar to masked language modeling [9]. The latent representations are fed to a Transformer network to build contextualized representations and the model is trained via a contrastive task where the true latent is to be distinguished from distractors [51, 47, 46, 27] .&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/houwx.net/posts/2020-06-27-blog-post-19-1.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;


&lt;p&gt;Pretraining and fine-tuning:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;As part of training, we learn discrete linguistic units [51,31,7,17] &lt;strong&gt;via a Gumbel softmax [23,5] to represent the latent representations in the contrastive task (Figure 1) which we find to be more effective than non-quantized targets&lt;/strong&gt;. After pre-training on unlabeled speech, the model is fine-tuned on labeled data with a Connectionist Temporal Classification (CTC) loss [14,4] to be used for downstream speech recognition tasks.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Limitation of previous works (vq-wav2vec):&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Previous work learned a quantization of the data followed by a contextualized representations with a self-attention model [5,4], whereas our approach solves both problems &lt;strong&gt;end-to-end&lt;/strong&gt;. Masking parts of the input with Transformer networks for speech has been explored  [4,25], but prior work relies either on a two-step pipeline or their model is trained by reconstructing the filter bank input features. Other related work includes learning representations from auto-encoding the input data [50,11] or directly predicting future timesteps [8].&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Results:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Our results demonstrate the feasibility of &lt;strong&gt;ultra-low resource speech recognition&lt;/strong&gt;: when using only &lt;strong&gt;10 minutes of labeled data&lt;/strong&gt;, our approach achieves word error rate (WER) 5.7/10.1 on the clean/noisy test sets of Librispeech. We set a new state of the art on TIMIT phoneme recognition as well as the 100 hour clean subset of Librispeech. Moreover, when we lower the amount of labeled data to just &lt;strong&gt;one hour&lt;/strong&gt;, we still &lt;strong&gt;outperform the previous state of the art self-training method of [41] while using 100 times less labeled data and the same amount of unlabeled data&lt;/strong&gt;. When we use all 960 hours of labeled data from Librispeech, then our model achieves 1.9/3.5 WER which performs competitively to the best published result while using a simpler baseline architecture.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;2-model&quot;&gt;2. Model&lt;/h2&gt;

&lt;p&gt;The model is composed of a multi-layer convolutional feature encoder &lt;strong&gt;f:X → Z&lt;/strong&gt; which takes as input raw audio &lt;strong&gt;X&lt;/strong&gt; and outputs latent speech representations &lt;strong&gt;z_1,…,z_T&lt;/strong&gt;. They are then fed to a Transformer &lt;strong&gt;g:Z → C&lt;/strong&gt; to build representations &lt;strong&gt;c_1,…,c_T&lt;/strong&gt; capturing information from the entire sequence [9,5,4]. The output of the feature encoder is discretized to &lt;strong&gt;q_t&lt;/strong&gt; with a quantization module &lt;strong&gt;Z → Q&lt;/strong&gt; to represent the targets (Figure 1) in the self-supervised objective.&lt;/p&gt;

&lt;p&gt;Design details of model component architectures:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Feature encoder&lt;/strong&gt; The encoder consists of several blocks containing a temporal convolution followed by a GELU activation function [20]. The first block maps raw audio to a feature representation and to increase robustness, we add a group normalization before the GELU to normalize each output channel over the sequence. We apply layer normalization to the output channels of this network [1].&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Contextualized representations with Transformers&lt;/strong&gt; The output of the feature encoder is fed to a context network which follows the Transformer architecture [53,9,32]. &lt;strong&gt;Instead of fixed positional embeddings which encode absolute positional information, we use a convolutional layer with kernel size 128 and 16 groups similar to [36,4,55] which acts as relative positional embedding.&lt;/strong&gt; We add the output of the convolution followed by a GELU to the inputs and then apply layer normalization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quantization module&lt;/strong&gt; For self-supervised training we discretize the output of the feature encoder &lt;strong&gt;z&lt;/strong&gt; to a finite set of speech representations via product quantization [24,5]. This amounts to choosing quantized representations from multiple codebooks and concatenating them. Given &lt;strong&gt;G&lt;/strong&gt; codebooks, or groups, with &lt;strong&gt;V&lt;/strong&gt; entries &lt;strong&gt;e∈R^{V×d/G}&lt;/strong&gt;, we choose one entry from each codebook and concatenate the resulting vectors &lt;strong&gt;e_1,…,e_G&lt;/strong&gt; and apply a linear transformation &lt;strong&gt;R^d→R^f&lt;/strong&gt; to obtain &lt;strong&gt;q∈R^f&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Gumbel softmax:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The Gumbel softmax enables choosing discrete codebook entries in a fully differentiable way [15,23,34]. We use the straight-through estimator [25] and setup &lt;strong&gt;G&lt;/strong&gt; hard Gumbel softmax operations [23]. The feature encoder output &lt;strong&gt;z&lt;/strong&gt; is mapped to &lt;strong&gt;l∈R^{G×V}&lt;/strong&gt; logits and the probabilities for choosing the &lt;strong&gt;v&lt;/strong&gt;-th codebook entry for group &lt;strong&gt;g&lt;/strong&gt; are&lt;/p&gt;

\[p_{g, v}=\frac{\exp((l_{g, v}+n_v)/\tau)}{\sum_{k=1}^{V}\exp((l_{g,k}+n_k)/\tau)}\]

  &lt;p&gt;where &lt;strong&gt;τ&lt;/strong&gt; is a non-negative temperature, &lt;strong&gt;n=−log(−log(u))&lt;/strong&gt; and &lt;strong&gt;u&lt;/strong&gt; are uniform samples from &lt;strong&gt;U(0,1)&lt;/strong&gt;. During the forward pass, code word &lt;strong&gt;i&lt;/strong&gt; is chosen by &lt;strong&gt;i=argmax_j p_{g,j}&lt;/strong&gt; and in the backward pass, the true gradient of the Gumbel softmax outputs is used.&lt;/p&gt;
&lt;/blockquote&gt;
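
&lt;p&gt;A minimal PyTorch-style sketch of the quantization module (my own reading of the description above, not the fairseq code): per-group logits, a hard Gumbel-softmax choice with straight-through gradients, and concatenation of the selected codebook entries. The sizes G=2, V=320 and the output dimension are illustrative.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import torch
import torch.nn as nn
import torch.nn.functional as F

class GumbelQuantizer(nn.Module):
    def __init__(self, dim, groups=2, entries=320, out_dim=256):
        super().__init__()
        self.G, self.V = groups, entries
        self.to_logits = nn.Linear(dim, groups * entries)   # maps z to logits l in R^{G x V}
        self.codebook = nn.Parameter(torch.randn(groups, entries, out_dim // groups))
        self.proj = nn.Linear(out_dim, out_dim)              # final linear projection to q

    def forward(self, z, tau=2.0):                           # z: (batch, T, dim)
        b, t = z.size(0), z.size(1)
        logits = self.to_logits(z).view(b, t, self.G, self.V)
        # hard one-hot choice per group with straight-through gradients
        onehot = F.gumbel_softmax(logits, tau=tau, hard=True, dim=-1)
        # select one entry per codebook and concatenate the G sub-vectors
        picked = (onehot.unsqueeze(-1) * self.codebook).sum(dim=3)   # (b, t, G, d/G)
        return self.proj(picked.reshape(b, t, -1))
&lt;/code&gt;&lt;/pre&gt;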

&lt;h2 id=&quot;3-training&quot;&gt;3. Training&lt;/h2&gt;

&lt;h3 id=&quot;31-masking&quot;&gt;3.1. Masking&lt;/h3&gt;

&lt;p&gt;The authors &lt;strong&gt;mask a proportion of the feature encoder outputs, or time steps before feeding them to the context network and replace them with a trained feature vector shared between all masked time steps&lt;/strong&gt;; they do &lt;strong&gt;not mask inputs to the quantization module&lt;/strong&gt;. To mask the latent speech representations output by the encoder, they randomly sample without replacement &lt;strong&gt;p=0.065&lt;/strong&gt; of all time steps to be starting indices and then mask the subsequent &lt;strong&gt;M=10&lt;/strong&gt; consecutive time steps from every sampled index; spans may overlap. This results in approximately 49% of all time steps to be masked with a mean span length of 14.7, or 299ms (see Appendix A in the original paper for more details on masking) .&lt;/p&gt;
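
&lt;p&gt;A small sketch of the span masking (my own approximation: independent Bernoulli sampling of starting indices stands in for the without-replacement sampling described above):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import torch

def compute_span_mask(batch, time_steps, p=0.065, span=10):
    # sample starting indices (Bernoulli with probability p approximates the paper),
    # then mask the next span consecutive time steps; spans may overlap
    mask = torch.zeros(batch, time_steps, dtype=torch.bool)
    starts = torch.rand(batch, time_steps).lt(p).nonzero()
    for b, t in starts.tolist():
        mask[b, t : min(t + span, time_steps)] = True
    return mask
# masked positions are then replaced by a single learned vector before the Transformer
&lt;/code&gt;&lt;/pre&gt;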

&lt;h3 id=&quot;32-objective&quot;&gt;3.2. Objective&lt;/h3&gt;

&lt;p&gt;During pre-training, the model is trained with multiple objectives: a contrastive task L_m, a codebook diversity loss L_d, and an L2 penalty L_f:&lt;/p&gt;

\[L=L_m+\alpha L_d + \beta L_f\]

&lt;p&gt;where &lt;strong&gt;α&lt;/strong&gt; and &lt;strong&gt;β&lt;/strong&gt; are tuned hyperparameters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Contrastive Loss&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Given context network output &lt;strong&gt;c_t&lt;/strong&gt; centered over &lt;strong&gt;masked&lt;/strong&gt; time step &lt;strong&gt;t&lt;/strong&gt;, the model needs to identify the true quantized latent speech representation &lt;strong&gt;q_t&lt;/strong&gt; in a set of &lt;strong&gt;K+ 1&lt;/strong&gt; quantized candidate representations &lt;strong&gt;\tilde{q}∈Q_t&lt;/strong&gt; which includes &lt;strong&gt;q_t&lt;/strong&gt; and &lt;strong&gt;K&lt;/strong&gt; distractors [22,52]. &lt;strong&gt;Distractors are uniformly sampled from other masked time steps of the same utterance&lt;/strong&gt;. The loss is defined as&lt;/p&gt;

\[L_m=-\log\frac{\exp(\mathrm{sim}(c_t, q_t)/\kappa)}{\sum_{\tilde{q}\sim Q_t}\exp(\mathrm{sim}(c_t, \tilde{q})/\kappa)}\]

&lt;p&gt;where &lt;strong&gt;sim(a,b) =a^T b/‖a‖‖b‖&lt;/strong&gt; is the cosine similarity between context representations and quantized latent speech representations [18, 6].&lt;/p&gt;
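
&lt;p&gt;A minimal PyTorch sketch of this contrastive loss for a single masked time step (placing the positive at index 0 and using cross entropy is my own formulation of the same negative log-softmax, not the authors’ code):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch
import torch.nn.functional as F

def contrastive_loss(c_t, q_pos, q_neg, kappa=0.1):
    # c_t: (dim,) context vector at a masked step; q_pos: (dim,) true quantized latent;
    # q_neg: (K, dim) distractors sampled from other masked steps of the same utterance.
    candidates = torch.cat([q_pos.unsqueeze(0), q_neg], dim=0)        # (K+1, dim)
    sims = F.cosine_similarity(c_t.unsqueeze(0), candidates, dim=-1)  # sim(c_t, q)
    logits = (sims / kappa).unsqueeze(0)                              # temperature kappa
    target = torch.zeros(1, dtype=torch.long)                         # index 0 is q_pos
    return F.cross_entropy(logits, target)                            # equals the -log softmax above
&lt;/code&gt;&lt;/pre&gt;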

&lt;p&gt;&lt;strong&gt;Diversity Loss&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The contrastive task depends on the codebook to represent both positive and negative examples, and the diversity loss &lt;strong&gt;L_d&lt;/strong&gt; is designed to increase the use of the quantized codebook representations [10]. We encourage the equal use of the &lt;strong&gt;V&lt;/strong&gt; entries in each of the &lt;strong&gt;G&lt;/strong&gt; codebooks by &lt;strong&gt;maximizing the entropy of the averaged softmax distribution&lt;/strong&gt; &lt;strong&gt;l&lt;/strong&gt; over the codebook entries &lt;strong&gt;p̄_g&lt;/strong&gt; for each codebook across a batch of utterances; this softmax distribution &lt;strong&gt;contains neither the Gumbel noise nor a temperature&lt;/strong&gt;.&lt;/p&gt;

\[L_d=\frac{1}{GV}\sum_{g=1}^G -H(\overline{p}_g)=\frac{1}{GV} \sum_{g=1}^{G} \sum_{v=1}^{V} \overline{p}_{g, v}\log \overline{p}_{g,v}\]
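
&lt;p&gt;A short PyTorch sketch of this diversity term (the batch flattening and the epsilon inside the log are my own choices):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch

def diversity_loss(logits):
    # logits: (num_masked_steps, G, V) codebook logits without Gumbel noise or temperature.
    probs = torch.softmax(logits.float(), dim=-1)     # softmax over the V entries
    avg_probs = probs.mean(dim=0)                     # averaged distribution per codebook, (G, V)
    G, V = avg_probs.shape
    neg_entropy = (avg_probs * torch.log(avg_probs + 1e-7)).sum()
    return neg_entropy / (G * V)                      # minimized when codebook usage is uniform
&lt;/code&gt;&lt;/pre&gt;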

&lt;p&gt;&lt;img src=&quot;https://houwx.net/posts/2020-06-27-blog-post-19-2.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stabilizing the Feature Encoder&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The authors found it helpful to apply an &lt;strong&gt;L2 penalty to the activations of the final layer of the feature encoder but before the final layer normalization&lt;/strong&gt;. They also scale down the global learning rate for weight updates to the feature encoder by &lt;strong&gt;γ&lt;/strong&gt;; see §4.2.&lt;/p&gt;

&lt;h3 id=&quot;33-fine-tuning&quot;&gt;3.3. Fine-tuning&lt;/h3&gt;

&lt;blockquote&gt;
  &lt;p&gt;Pre-trained models are fine-tuned for speech recognition by adding a randomly initialized linear projection on top of the context network into &lt;strong&gt;C&lt;/strong&gt; classes representing the vocabulary of the task [4]. For Librispeech, we have 29 tokens for character targets plus a word boundary token. Models are optimized by &lt;strong&gt;minimizing a CTC loss&lt;/strong&gt; [14] and we apply a modified version of SpecAugment [40] by masking to time-steps and channels during training, which delays overfitting and significantly improves the final error rates, especially on the Libri-light subsets with few labeled examples.&lt;/p&gt;
&lt;/blockquote&gt;
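
&lt;p&gt;A minimal PyTorch sketch of this fine-tuning objective (the 31-class output layer, i.e. 29 character tokens + word boundary + CTC blank, and all tensor shapes are my own assumptions for illustration):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch
import torch.nn as nn

proj = nn.Linear(768, 31)                        # randomly initialized projection into C classes
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

context = torch.randn(200, 4, 768)               # stand-in for the pre-trained context vectors
targets = torch.randint(1, 31, (4, 40))          # character indices of the reference transcripts
input_lengths = torch.full((4,), 200, dtype=torch.long)
target_lengths = torch.full((4,), 40, dtype=torch.long)

log_probs = proj(context).log_softmax(dim=-1)    # (time, batch, classes)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()
&lt;/code&gt;&lt;/pre&gt;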

&lt;h2 id=&quot;4-experimental-setup&quot;&gt;4. Experimental Setup&lt;/h2&gt;

&lt;h3 id=&quot;41-datasets&quot;&gt;4.1. Datasets&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pretraining&lt;/strong&gt;: the 960-hour Librispeech corpus (LS-960) without transcriptions, or the audio data from LibriVox (LV-60k) with the same preprocessing as Libri-light [26], giving 53.2k hours of audio.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fine-tuning&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;960 hours of transcribed Librispeech&lt;/li&gt;
  &lt;li&gt;train-clean-100 subset comprising 100 hours (100 hours labeled)&lt;/li&gt;
  &lt;li&gt;Libri-light limited resource training subsets originally extracted from Librispeech: train-10h (10 hours labeled), train-1h (1 hour labeled), train-10min (10 min labeled)&lt;/li&gt;
  &lt;li&gt;TIMIT dataset containing five hours of audio recordings with detailed 39 phoneme labels.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;42-pre-training&quot;&gt;4.2. Pre-training&lt;/h3&gt;

&lt;blockquote&gt;
  &lt;p&gt;The feature encoder contains seven blocks and the temporal convolutions in each block have 512 channels with strides (5,2,2,2,2,2,2) and kernel widths(10,3,3,3,3,2,2). This results in an encoder output frequency of 49 Hz with a stride of about 20ms between each sample, and a receptive field of 400 input samples or 25 ms of audio.&lt;/p&gt;

  &lt;p&gt;We experiment with two model configurations which use the same encoder architecture but differ in the Transformer setup: BASE contains 12 transformer blocks, model dimension 768, inner dimension (FFN) 3,072 and 8 attention heads. Batches are built by cropping 250k audio samples, or 15.6 sec, from each example. Crops are batched together to not exceed 1.4m samples per GPU and we train on a total of 64 V100 GPUs for 1.6 days [37]; the total batch size is 1.6h.&lt;/p&gt;

  &lt;p&gt;The LARGE model contains 24 transformer blocks with model dimension 1,024, inner dimension 4,096 and 16 attention heads. We crop 320K audio samples, or 20sec, with a limit of 1.2M samples per GPU and train on 128 V100 GPUs over 2.3 days for Librispeech and 5.2 days for LibriVox; the total batch size is 2.7h.&lt;/p&gt;

  &lt;p&gt;We use dropout 0.1 in the Transformer, at the output of the feature encoder and the input to the quantization module. Layers are dropped at a rate of 0.05 for BASE and 0.2 for LARGE [21, 12]; there is no layer drop for LV-60k.&lt;/p&gt;

  &lt;p&gt;We optimize with Adam [28], warming up the learning rate for the first 8% of updates to a peak of 5×10^{−3} for BASE and 3×10^{−3} for LARGE, and then linearly decay it. LARGE trains for 250k updates, BASE for 400k updates, and LARGE on LV-60k for 600k updates. We use weight &lt;strong&gt;α= 0.1&lt;/strong&gt; for the diversity loss and &lt;strong&gt;β=10&lt;/strong&gt; for the feature penalty in Equation 2. For the quantization module we use G= 2 and V= 320 for both models, &lt;strong&gt;resulting in a theoretical maximum of 102.4k codewords (How does it come??? This is just the number of possible codeword combinations across the two codebooks: V^G = 320^2 = 102,400 ≈ 102.4k.)&lt;/strong&gt;. Entries are of size d/G=128 for BASE and d/G= 384 for LARGE. The Gumbel softmax temperature &lt;strong&gt;τ&lt;/strong&gt; is annealed from 2 to a minimum of 0.5 for BASE and 0.1 for LARGE by a factor of 0.999995 at every update. The temperature in the contrastive loss (Equation 3) is set to κ=0.1. We set the feature encoder gradient scaling factor to γ= 0.1 for Librispeech and γ= 0.03 for LibriVox. In the contrastive loss we use K=100 distractors. &lt;strong&gt;We choose the training checkpoint with the lowest L_m on the validation set.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
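
&lt;p&gt;A quick sanity check of the quoted encoder geometry (16 kHz input audio assumed):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;strides = [5, 2, 2, 2, 2, 2, 2]
kernels = [10, 3, 3, 3, 3, 2, 2]

receptive_field, hop = 1, 1
for k, s in zip(kernels, strides):
    receptive_field += (k - 1) * hop   # input samples seen by one output frame
    hop *= s                           # input samples between consecutive output frames

print(receptive_field, 1000 * receptive_field / 16000)   # 400 samples = 25.0 ms
print(hop, 1000 * hop / 16000)                           # 320 samples = 20.0 ms stride
&lt;/code&gt;&lt;/pre&gt;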

&lt;h3 id=&quot;43-fine-tuning&quot;&gt;4.3. Fine-tuning&lt;/h3&gt;

&lt;blockquote&gt;
  &lt;p&gt;After pre-training we fine-tune the learned representations on labeled data and add a randomly initialized output layer on top of the Transformer to predict characters (Librispeech/Libri-light) or phonemes (TIMIT). For Libri-light, we train three seeds with two different learning rates (2e-5 and 3e-5) for all subsets and choose the configuration with the lowest WER on the dev-other subset decoded with the official 4-gram language model (LM) with beam 50 and fixed model weights (LM weight 2, word insertion penalty -1). For BASE on the labeled 960h subset we use a learning rate of 1e-4. We optimize with Adam and a tri-state rate schedule where the learning rate is warmed up for the first 10% of updates, held constant for the next 40% and then linearly decayed for the remainder. BASE uses a batch size of 3.2M samples per GPU and we fine-tune on 8 GPUs, giving a total batch size of 1,600 sec. LARGE batches 1.28M samples on each GPU and we fine-tune on 24 GPUs, resulting in an effective batch size of 1,920 sec. For the first 10k updates only the output classifier is trained, after which the Transformer is also updated. The feature encoder is not trained during fine-tuning. We mask the feature encoder representations with a strategy similar to SpecAugment [40] detailed in Appendix B.&lt;/p&gt;
&lt;/blockquote&gt;
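
&lt;p&gt;A small sketch of the tri-state learning-rate schedule described above (warm up for the first 10% of updates, hold for the next 40%, then decay linearly; written from the text, with made-up step counts, not from the released training code):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def tri_state_lr(step, total_steps, peak_lr, warmup=0.1, hold=0.4):
    # Linear warm-up, constant plateau, then linear decay to zero.
    warm_end = warmup * total_steps
    hold_end = (warmup + hold) * total_steps
    warm_factor = min(1.0, step / warm_end)
    decay_factor = min(1.0, max(0.0, (total_steps - step) / (total_steps - hold_end)))
    return peak_lr * min(warm_factor, decay_factor)

print(tri_state_lr(5_000, 80_000, 3e-5))    # still warming up
print(tri_state_lr(30_000, 80_000, 3e-5))   # held at the peak
print(tri_state_lr(79_000, 80_000, 3e-5))   # almost fully decayed
&lt;/code&gt;&lt;/pre&gt;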

&lt;h3 id=&quot;44--language-models-and-decoding&quot;&gt;4.4.  Language Models and Decoding&lt;/h3&gt;

&lt;blockquote&gt;
  &lt;p&gt;We consider two types of language models (LM): a 4-gram model and a Transformer [3] trained on the Librispeech LM corpus. The Transformer LM is identical to [49] and contains 20 blocks, model dimension 1280, inner dimension 6144 and 16 attention heads. We tune the weights of the language model (interval [0, 5]) and a word insertion penalty ([−5, 5]) via Bayesian optimization: we run 128 trials with beam 500 for the 4-gram LM and beam 50 for the Transformer LM and choose the best set of weights according to performance on dev-other. Test performance is measured with beam 1,500 for the n-gram LM and beam 500 for the Transformer LM. We use the beam search decoder of [43].&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;5-results&quot;&gt;5. Results&lt;/h2&gt;

&lt;h3 id=&quot;51-low-resource-labeled-data-evaluation&quot;&gt;5.1. Low-Resource Labeled Data Evaluation&lt;/h3&gt;

&lt;p&gt;WER results on Librispeech dev/test sets:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/posts/2020-06-27-blog-post-19-3.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;We can see that:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;The LARGE model pre-trained on LV-60k and &lt;strong&gt;fine-tuned on only 10 minutes of labeled data&lt;/strong&gt; achieves a WER of &lt;strong&gt;5.7/10.1 on clean/other test sets&lt;/strong&gt;. This demonstrates that ultra-low resource speech recognition is possible with self-supervised learning on unlabeled data. This approach improves over previous pre-training work which did not learn quantized audio units jointly [4], reducing WER by about a third.&lt;/li&gt;
  &lt;li&gt;A recent &lt;strong&gt;iterative self-training approach [41]&lt;/strong&gt; represents the &lt;strong&gt;SOTA on the clean 100 hour subset of Librispeech&lt;/strong&gt; but it requires multiple iterations of labeling, filtering, and re-training. On the 100 hour subset of Librispeech, &lt;strong&gt;their method achieves WER 4.2/8.6&lt;/strong&gt; on test-clean/other which compares to &lt;strong&gt;WER 2.3/5.0 with the LARGE model&lt;/strong&gt; in a like for like setup, &lt;strong&gt;a relative WER reduction of 45%/42%&lt;/strong&gt;.&lt;/li&gt;
  &lt;li&gt;When the LARGE model uses an order of magnitude less labeled data (&lt;strong&gt;10h labeled&lt;/strong&gt;), it still achieves &lt;strong&gt;WER 3.2/6.1&lt;/strong&gt;, an error &lt;strong&gt;reduction of 24%/29% relative to iterative self-training&lt;/strong&gt;.&lt;/li&gt;
  &lt;li&gt;Using only &lt;strong&gt;a single hour of labeled data&lt;/strong&gt;, the same model achieves &lt;strong&gt;WER 3.9/7.6&lt;/strong&gt; which improves on both test-clean and test-other by &lt;strong&gt;7%/12%&lt;/strong&gt; - with two orders of magnitude less labeled data.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Libri-light data splits contain both clean and noisy data leading to better accuracy&lt;/strong&gt; on test-other compared to test-clean. (&lt;strong&gt;where???&lt;/strong&gt;)&lt;/li&gt;
  &lt;li&gt;Increasing &lt;strong&gt;model size&lt;/strong&gt; reduces WER on all setups with the largest improvements on test-other (BASE vs. LARGE both on LS-960) and increasing the &lt;strong&gt;amount of unlabeled training data&lt;/strong&gt; also leads to large improvements (LARGE LS-960 vs. LV-60k).&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;52-high-resource-labeled-data-evaluation-on-librispeech&quot;&gt;5.2. High-Resource Labeled Data Evaluation on Librispeech&lt;/h3&gt;

&lt;p&gt;WER results on Librispeech with 960 hours labeled data:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/posts/2020-06-27-blog-post-19-4.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;We can find that:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;This work achieves &lt;strong&gt;WER 1.9/3.5&lt;/strong&gt; on test-clean/other. This is the first time self-supervised learning achieves results competitive to the state of the art iterative semi-supervised methods in a high-resource labeled data setup.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Authors’ explanation&lt;/strong&gt;: This is despite a weaker baseline architecture: supervised training of the same architecture achieves &lt;strong&gt;WER 2.1/4.6&lt;/strong&gt; (LARGE trained from scratch) compared to &lt;strong&gt;WER 1.9/4.1 for ContextNet&lt;/strong&gt; [16], the baseline architecture of the &lt;strong&gt;SOTA Noisy student&lt;/strong&gt; [41]. The &lt;strong&gt;vocabulary&lt;/strong&gt; of their acoustic model (characters) does not match the vocabulary of the LM (words), which delays feedback from the LM and is likely to be detrimental (they did not use subwords). Moreover, they did not use any &lt;strong&gt;data balancing&lt;/strong&gt; such as in [41]. Finally, self-training is likely complementary to pre-training and their combination may yield even better results. Appendix E presents a detailed error analysis of their pre-trained models in various labeled data setups.&lt;/p&gt;

&lt;h3 id=&quot;53-phoneme-recognition-on-timit&quot;&gt;5.3. Phoneme Recognition on TIMIT&lt;/h3&gt;

&lt;p&gt;The authors fine-tuned as for the 10 hour subset of Libri-light but did not use a language model.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/posts/2020-06-27-blog-post-19-5.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;We can find:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;The proposed approach achieves a new SOTA on this dataset, reducing PER by &lt;strong&gt;a relative 23%/29% over the next best result&lt;/strong&gt; on the dev/test sets.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Appendix D shows an analysis of how the discrete latent speech representations are related to phonemes.&lt;/p&gt;

&lt;h3 id=&quot;54-ablations&quot;&gt;5.4. Ablations&lt;/h3&gt;

&lt;blockquote&gt;
  &lt;p&gt;A difference to previous work [5,4] is that &lt;strong&gt;we quantize the latent audio representations only for the contrastive loss, i.e., when latents are used as targets, but not when the latents are input to the Transformer network&lt;/strong&gt;. We motivate this choice by an ablation for which we adopt a reduced training setup to increase experimental turnaround: we pre-train BASE on LS-960 for 250k updates with masking probability p= 0.075, fine-tune on train-10h for 60k updates on a single GPU with 640k samples per batch, or 40 sec of speech audio. We &lt;strong&gt;report the average WER and standard deviation on the concatenation of dev-clean and dev-other for three seeds of fine-tuning&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/posts/2020-06-27-blog-post-19-6.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Table 4 shows that:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Strategy of continuous inputs with quantized targets (Baseline) performs best. Continuous latent speech representations retain more information to enable better context representations and quantizing the target representations leads to more robust training.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Quantizing the latents both in the input and the targets performs least well, and explains the lower performance of prior work [5,4].&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Continuous targets reduce the effectiveness of self-supervised training &lt;strong&gt;since targets can capture detailed artifacts of the current sequence, e.g.  speaker and background information,which make the task easier&lt;/strong&gt; and prevent the model from learning general representations beneficial to speech recognition.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Continuous inputs and continuous targets perform second best but various attempts to improve it did not lead to better results (see Appendix F for this experiment and other ablations on various hyperparameters). The &lt;strong&gt;training accuracy of identifying the correct latent audio representation increases from 62% to 78.0% when switching from quantized to continuous targets&lt;/strong&gt;.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;6-conclusion&quot;&gt;6. Conclusion&lt;/h2&gt;

&lt;p&gt;Contribution:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;We presented wav2vec 2.0, a framework for self-supervised learning of speech representations which &lt;strong&gt;masks latent representations of the raw waveform&lt;/strong&gt; and solves a contrastive task over quantized speech representations.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Results:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Our experiments show the large potential of pre-training on unlabeled data for speech processing: when using &lt;strong&gt;only 10 minutes of labeled training data&lt;/strong&gt;, or 48 recordings of 12.5 seconds on average, we achieve &lt;strong&gt;a WER of 5.7/10.1 on test-clean/other of Librispeech&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Potential improvements:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;We expect performance gains by switching to a seq2seq architecture and a word piece vocabulary.&lt;/p&gt;
&lt;/blockquote&gt;</content><author><name>Wenxin Hou 侯汶昕</name><email>houwx001@gmail.com</email></author><category term="Speech Pretrain Model" /><category term="ASR" /><summary type="html">Last Updated: 2020-06-28</summary></entry><entry><title type="html">[Paper-NLP] URIEL and lang2vec: Representing languages as typological, geographical, and phylogenetic vectors</title><link href="https://houwx.net/posts/2020/01/blog-post-18/" rel="alternate" type="text/html" title="[Paper-NLP] URIEL and lang2vec: Representing languages as typological, geographical, and phylogenetic vectors" /><published>2020-06-24T00:00:00-07:00</published><updated>2020-06-24T00:00:00-07:00</updated><id>https://houwx.net/posts/2020/01/blog-post-18-lang2vec</id><content type="html" xml:base="https://houwx.net/posts/2020/01/blog-post-18/">&lt;p&gt;Last Updated: 2020-06-24&lt;/p&gt;

&lt;p&gt;This paper: &lt;a href=&quot;https://www.aclweb.org/anthology/E17-2002.pdf&quot;&gt;URIEL and lang2vec: Representing languages as typological, geographical, and phylogenetic vectors&lt;/a&gt; is proposed by researchers from CMU and the University of Pittsburgh. It is accepted by EACL 2017. This paper is recommended for introducing &lt;strong&gt;lang2vec&lt;/strong&gt;, which packages language information that helps multilingual NLP research.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code&lt;/strong&gt;: https://github.com/antonisa/lang2vec&lt;/p&gt;

&lt;p&gt;In this paper, the authors introduced the URIEL knowledge base for massively multilingual NLP and the &lt;strong&gt;lang2vec&lt;/strong&gt; utility which provides information-rich vector identifications of languages drawn from typological,  geographical, and phylogenetic databases that are normalized to have straightforward and consistent formats, naming, and semantics.&lt;/p&gt;

&lt;h2 id=&quot;1-introduction&quot;&gt;1. Introduction&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;lang2vec&lt;/strong&gt; features primarily represent binary language facts (e.g., that negation precedes the verb or is represented as a suffix, that the language is part of the Germanic family, etc.) and are sourced and predicted from a variety of linguistic resources including WALS (Dryer and Haspelmath, 2013), PHOIBLE (Moran et al., 2014), Ethnologue (Lewis et al., 2015), and Glottolog (Hammarström et al., 2015).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;lang2vec&lt;/strong&gt; takes as its input a list of ISO 639-3 codes and outputs a matrix of [0.0, 1.0] feature values (like those in Table 1):&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/posts/2020-06-24-blog-post-18-1.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
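
&lt;p&gt;A minimal usage sketch with the released Python package (pip install lang2vec); the exact feature-set name syntax_knn and the return format follow my reading of the repository README, so treat them as assumptions:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import lang2vec.lang2vec as l2v

# Query three languages by ISO 639-3 code; values are in [0.0, 1.0],
# and the *_knn feature sets have missing entries imputed (see Section 4).
feats = l2v.get_features('eng spa jpn', 'syntax_knn')
print(len(feats['eng']))     # number of syntactic features per language
print(feats['jpn'][:10])     # first few feature values for Japanese
&lt;/code&gt;&lt;/pre&gt;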

&lt;h2 id=&quot;2-motivation&quot;&gt;2. Motivation&lt;/h2&gt;

&lt;p&gt;The recent success of “polyglot” models (Hermann and Blunsom, 2014; Faruqui and Dyer, 2014; Ammar et al., 2016; Tsvetkov et al., 2016; Daiber et al.,  2016),  in which a language model is trained on multiple languages and shares representations across languages, represents a promising avenue for NLP, especially for less-resourced languages, as &lt;strong&gt;these models appear to be able to learn useful patterns from better-resourced languages even when training data in the target language is limited&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Tsvetkov et al. (2016) shows that vectors that represent information about the language outperform a simple “one-hot” representation where each language is represented by a 1 in a single dimension. Sample results from Tsvetkov et al. (2016) are reproduced in Table 2.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/posts/2020-06-24-blog-post-18-2.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;We can see that training on a set of three similar languages, and a set of four similar and dissimilar languages, raises perplexity above the baseline monolingual model, even when the language is identified to the model by a one-hot (id) vector. However, &lt;strong&gt;perplexity is lowered by the introduction of phonological feature vectors for each language&lt;/strong&gt; (the phonology and inventory vector types described in §3.1), &lt;strong&gt;giving consistently lower perplexity than even the monolingual baseline&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The initial &lt;strong&gt;motivation&lt;/strong&gt; for the URIEL knowledge base and the &lt;strong&gt;lang2vec&lt;/strong&gt; utility is to make such research  easier, &lt;strong&gt;allowing different sources of information to be easily used together or as different experimental conditions  (e.g.,  is it better to provide this model information about the syntactic features of the language, or the phylogenetic relationships between the  languages?).&lt;/strong&gt; Standardizing the use of this kind of information also makes it easier to replicate and expand on previous work, without needing to know how the authors processed, for example, WALS feature classes or PHOIBLE inventories into model input.&lt;/p&gt;

&lt;h2 id=&quot;3-vector-types&quot;&gt;3. Vector types&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;General composition: binary vectors&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;lang2vec&lt;/strong&gt; offers a variety of vector representations of languages, of different types and derived from different sources, but all reporting feature values between 0.0 (generally representing the absence of a phenomenon or non-membership in a class) and 1.0 (generally representing the presence of a phenomenon or membership in a class). This normalization makes vectors from different sources more easily interchangeable and more easily predictable for each other (§4).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Different features are not mutually exclusive&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As in SSWL (Collins and Kayne, 2011), different features are not held to be mutually exclusive; the features S_SVO and S_SOV can both be 1 if both orders are normally encountered in the language.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Missing values:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Phylogeny&lt;/strong&gt;, &lt;strong&gt;geography&lt;/strong&gt;, and &lt;strong&gt;identity&lt;/strong&gt; vectors are complete—they have no missing values.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The typological features (&lt;strong&gt;syntax&lt;/strong&gt;, &lt;strong&gt;phonology&lt;/strong&gt;, and &lt;strong&gt;inventory&lt;/strong&gt;) have missing values, reflecting the coverage of the original sources; missing values are represented in the output as “–”. Predicted typological vectors (§4) attempt to impute these values based on related, neighboring, and typologically similar languages.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Dimensionality&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;All vectors within the syntax, phonology, and inventory categories &lt;strong&gt;have the same dimensionality as other types of vectors in the same category&lt;/strong&gt;, even though the sources themselves may only represent a subset of these values, to allow straightforward element-wise comparison of values.&lt;/p&gt;

&lt;h3 id=&quot;31-typological-vectors&quot;&gt;3.1. Typological vectors&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;syntax&lt;/strong&gt; features are adapted (after conversion to binary features) from the World Atlas of Language Structures (&lt;strong&gt;WALS&lt;/strong&gt;) (Dryer and Haspelmath, 2013), directly from Syntactic Structures of the World’s Languages (Collins and Kayne, 2011) (whose features are already binary), and indirectly by text-mining the short prose descriptions of typological features in Ethnologue (Lewis et al., 2015).&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;phonology&lt;/strong&gt; features are adapted in the same manner from WALS and Ethnologue.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;phonetic inventory&lt;/strong&gt; features are adapted from the PHOIBLE database, itself a collection and normalization of seven phonological databases (Moran et al., 2014; Chanard, 2006; Crothers et al., 1979; Hartell, 1993; Michael et al., 2012; Maddieson and Precoda, 1990; Ramaswami, 1999). The PHOIBLE-based features in &lt;strong&gt;lang2vec&lt;/strong&gt; primarily &lt;strong&gt;represent the presence or absence of natural classes of features (e.g., interdental fricatives, voiced uvulars, etc.)&lt;/strong&gt;, with 1 representing the presence of at least one sound of that class and 0 representing absence. They are derived from PHOIBLE’s phonetic inventories by extracting each segment’s articulatory features using the &lt;strong&gt;PanPhon*&lt;/strong&gt; feature extractor (Mortensen et al., 2016), and using these features to determine the presence or absence of the relevant natural classes.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/posts/2020-06-24-blog-post-18-3.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;* About &lt;strong&gt;PanPhon&lt;/strong&gt;: https://github.com/dmort27/panphon&lt;/p&gt;

&lt;h3 id=&quot;32-phylogeny-vectors&quot;&gt;3.2. Phylogeny vectors&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;fam&lt;/strong&gt; vectors express shared membership in language families, according to the world language family tree in Glottolog (Hammarström et al., 2015). &lt;strong&gt;Each dimension represents a language family or branch thereof&lt;/strong&gt; (such as “Indo-European” or “West Germanic” in Table 4).&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/posts/2020-06-24-blog-post-18-4.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;h3 id=&quot;33-geography-vectors&quot;&gt;3.3. Geography vectors&lt;/h3&gt;

&lt;p&gt;Although another component of &lt;strong&gt;URIEL&lt;/strong&gt; (to be described in a future publication) provides geographical distances between languages, &lt;strong&gt;geo&lt;/strong&gt; vectors express geographical location with a fixed number of dimensions and each dimension representing the same feature even when different sets of languages are considered. &lt;strong&gt;Each dimension represents the orthodromic distance&lt;/strong&gt;—that is, the “great circle” distance—from the language in question to a fixed point on the Earth’s surface. These distances are expressed as a fraction of the Earth’s antipodal distance, so that values will always be between 0.0 (directly at the fixed point) and 1.0 (at the antipode of the fixed point).&lt;/p&gt;
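
&lt;p&gt;A small sketch of what one &lt;strong&gt;geo&lt;/strong&gt; dimension computes under these definitions (spherical-Earth approximation; the reference coordinates are made up for illustration):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import math

def orthodromic_fraction(lat1, lon1, lat2, lon2):
    # Great-circle (orthodromic) distance between two points, expressed as a
    # fraction of the antipodal distance: 0.0 at the point itself, 1.0 at its antipode.
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    cos_angle = (math.sin(phi1) * math.sin(phi2)
                 + math.cos(phi1) * math.cos(phi2) * math.cos(dlon))
    central_angle = math.acos(max(-1.0, min(1.0, cos_angle)))
    return central_angle / math.pi

print(orthodromic_fraction(52.5, 13.4, 35.7, 139.7))   # Berlin vs. Tokyo, about 0.45
&lt;/code&gt;&lt;/pre&gt;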

&lt;h3 id=&quot;34-identity-vectors&quot;&gt;3.4. Identity vectors&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;id&lt;/strong&gt; vector is simply a one-hot vector identifying each language. These vectors can serve as simple identifiers of languages to a system, serve as the control in an experiment in introducing typological information to a system, as in Tsvetkov et al. (2016), or serve in combination with other vectors (such as &lt;strong&gt;fam&lt;/strong&gt;) that do not always identify a language uniquely.&lt;/p&gt;

&lt;h2 id=&quot;4-feature-prediction&quot;&gt;4. Feature prediction&lt;/h2&gt;

&lt;p&gt;One of the major difficulties in using typological features in multilingual processing is that &lt;strong&gt;many languages, and many features of individual languages, happen to be missing from the databases&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The authors’ efforts toward filling in missing values using &lt;strong&gt;KNN&lt;/strong&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The question of how we can best predict unknown typological features is a larger question (Daumé III and Campbell, 2007; Daumé III, 2009; Coke et al., 2016) than this article can capture in detail, but nonetheless we can offer a preliminary attempt at providing practically useful approximations of missing features &lt;strong&gt;by a k-nearest-neighbors approach&lt;/strong&gt;.&lt;/p&gt;

  &lt;p&gt;By taking an average of &lt;strong&gt;genetic&lt;/strong&gt;, &lt;strong&gt;geographical&lt;/strong&gt;, and &lt;strong&gt;feature distances&lt;/strong&gt; between languages, and calculating a weighted 10-nearest-neighbors classification, we can predict feature missing values with an &lt;strong&gt;accuracy of 92.93%&lt;/strong&gt; in 10-fold cross-validation.&lt;/p&gt;
&lt;/blockquote&gt;
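
&lt;p&gt;A toy sketch of this kind of weighted k-nearest-neighbour imputation for a single missing binary feature (the inverse-distance weighting is my own simplification, not necessarily the weighting the authors used):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np

def knn_impute(distances, neighbor_values, k=10):
    # distances: averaged genetic/geographic/feature distances to languages that do
    # have this feature annotated; neighbor_values: their 0/1 values for the feature.
    order = np.argsort(distances)[:k]
    weights = 1.0 / (distances[order] + 1e-6)            # closer languages count more
    score = np.average(neighbor_values[order], weights=weights)
    return float(np.round(score))                        # predicted binary value

distances = np.array([0.1, 0.4, 0.2, 0.9, 0.3])
values = np.array([1, 0, 1, 0, 1])
print(knn_impute(distances, values, k=3))                # 1.0
&lt;/code&gt;&lt;/pre&gt;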

&lt;h2 id=&quot;5-conclusion&quot;&gt;5. Conclusion&lt;/h2&gt;

&lt;p&gt;While there are many language-information resources available to NLP, their heterogeneity in format, semantics, language naming, and feature naming makes it difficult to combine them, compare them, and use them to predict missing values from each other. &lt;strong&gt;lang2vec&lt;/strong&gt; aims to make cross-source and cross-information-type experiments straightforward by providing standardized, normalized vectors representing a variety of information types.&lt;/p&gt;</content><author><name>Wenxin Hou 侯汶昕</name><email>houwx001@gmail.com</email></author><category term="multilingual" /><category term="NLP" /><summary type="html">Last Updated: 2020-06-24</summary></entry><entry><title type="html">[Paper-ST] Phone Features Improve Speech Translation</title><link href="https://houwx.net/posts/2020/01/blog-post-16/" rel="alternate" type="text/html" title="[Paper-ST] Phone Features Improve Speech Translation" /><published>2020-06-05T00:00:00-07:00</published><updated>2020-06-05T00:00:00-07:00</updated><id>https://houwx.net/posts/2020/01/blog-post-16-phoneST</id><content type="html" xml:base="https://houwx.net/posts/2020/01/blog-post-16/">&lt;p&gt;Last Updated: 2020-06-09&lt;/p&gt;

&lt;p&gt;This paper:  &lt;a href=&quot;https://arxiv.org/pdf/2005.13681.pdf&quot;&gt;Phone Features Improve Speech Translation&lt;/a&gt; is proposed by researchers from JHU and CMU. It is accepted by ACL 2020. This paper is recommended for its comprehensive experiments and analysis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code&lt;/strong&gt;: github.com/esalesky/xnmt-devel (seems not released yet)&lt;/p&gt;

&lt;p&gt;The authors compared cascaded and end-to-end models across &lt;strong&gt;high, medium, and low-resource&lt;/strong&gt; conditions, and showed that cascades remain stronger baselines.&lt;/p&gt;

&lt;p&gt;Further, the authors introduced two methods to &lt;strong&gt;incorporate phone features into ST models&lt;/strong&gt; which improves both architectures and closes the gap between end-to-end models and cascades.&lt;/p&gt;

&lt;h2 id=&quot;1-introduction&quot;&gt;1. Introduction&lt;/h2&gt;

&lt;p&gt;The authors propose two simple  heuristics to integrate phoneme-level information into neural speech translation models:&lt;/p&gt;

&lt;p&gt;(1) as a more robust intermediate representation in a cascade&lt;/p&gt;

&lt;p&gt;(2) as a concatenated embedding factor.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data&lt;/strong&gt;: Fisher Spanish–English dataset&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;We compare to recent work using &lt;strong&gt;phone segmentation&lt;/strong&gt; for end-to-end speech translation (Salesky et al., 2019), and show that our methods outperform this model by up to 20 BLEU on our lowest-resource condition.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;Finally, we test model &lt;strong&gt;robustness&lt;/strong&gt; by varying the quality of our phone features, which may indicate which models will better generalize across differently-resourced conditions&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;2-models-with-phone-supervision&quot;&gt;2. Models with Phone Supervision&lt;/h2&gt;

&lt;p&gt;Two proposed methods to incorporate phone features into cascaded and end-to-end models:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/posts/2020-06-05-blog-post-16-1.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Phone cascade&lt;/strong&gt;: uses phone labels  as the ASR output and the machine translation input&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Phone end-to-end&lt;/strong&gt;: concatenates trainable phone embeddings to typical speech feature vector input. Note that this method maintains the same source sequence length as the original speech feature sequence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Phone Segmentation (E2E baseline)&lt;/strong&gt;: uses phone boundaries to segment consecutive speech frames by averaging a variable number of features with the same phone label. This significantly reduces source sequence lengths (by ~80%), which in turn reduces the number of model parameters and the memory footprint.&lt;/p&gt;
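
&lt;p&gt;A minimal PyTorch sketch of the &lt;strong&gt;phone end-to-end&lt;/strong&gt; factor (the embedding size and tensor shapes are my own assumptions; the paper only specifies that trainable phone embeddings are concatenated to the 40-dim filterbank frames):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch
import torch.nn as nn

num_phones, phone_dim, fbank_dim = 50, 32, 40
phone_emb = nn.Embedding(num_phones, phone_dim)      # trainable phone label embeddings

fbank = torch.randn(1, 500, fbank_dim)               # (batch, frames, filterbank features)
phone_ids = torch.randint(0, num_phones, (1, 500))   # frame-level phone alignment
encoder_input = torch.cat([fbank, phone_emb(phone_ids)], dim=-1)
print(encoder_input.shape)                           # (1, 500, 72): same length, wider frames
&lt;/code&gt;&lt;/pre&gt;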

&lt;h2 id=&quot;3-data&quot;&gt;3. Data&lt;/h2&gt;

&lt;p&gt;Fisher Spanish-English Corpus containing 160 hours of Spanish telephone speech, split into 138k utterances. Standard dev/test sets are used. For medium / low resource experiments, 40 / 20 hours subsets of the data are randomly selected.&lt;/p&gt;

&lt;h2 id=&quot;4-generating-phone-supervision&quot;&gt;4. Generating Phone Supervision&lt;/h2&gt;

&lt;blockquote&gt;
  &lt;p&gt;We extract 40-dimensional Mel filter bank features with per-speaker mean and variance normalization using Kaldi (Povey et al., 2011). We train an HMM/GMM system on the full Fisher Spanish dataset with the Kaldi recipe (Povey et al., 2011), using the Spanish CALLHOME Lexicon (LDC96L16), and compute per-frame phone alignments with the triphone model (tri3a) with LDA+MLLT features. This yields 50 phone labels, including silence (&amp;lt;sil&amp;gt;), noise, and laughter.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;To leverage our better-performing neural ASR models for phone generation, we create essentially a ‘2-pass’ alignment procedure:&lt;/p&gt;

  &lt;ol&gt;
    &lt;li&gt;generating a transcript&lt;/li&gt;
    &lt;li&gt;using this transcript to force align phones.&lt;/li&gt;
  &lt;/ol&gt;

  &lt;p&gt;Table 1 shows the mapping between phone quality and the ASR models used for phone feature generation. This  procedure  enables us  to  both  improve phone alignment quality and also match training and inference procedures for phone generation for our translation models.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/posts/2020-06-05-blog-post-16-2.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;5-model--training-procedure&quot;&gt;5. Model &amp;amp; Training Procedure&lt;/h2&gt;

&lt;p&gt;The authors used &lt;strong&gt;xnmt&lt;/strong&gt; (Neubig et al., 2018) to build encoder-decoder models.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Our pyramidal encoder uses 3-layer BiLSTMs with linear network-in-network (NiN) projections and batch normalization between layers (Sperber et al., 2019; Zhang et al., 2017).&lt;/p&gt;

  &lt;p&gt;We  use  single layer MLP attention (Bahdanau et al., 2015) with 128 units and &lt;strong&gt;1 decoder layer&lt;/strong&gt; as opposed to 3 or 4 in previous work – &lt;strong&gt;we did not see consistent benefits from additional depth&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Please refer to the original paper for other details.&lt;/p&gt;

&lt;h2 id=&quot;6-prior-work-cascaded-vs-end-to-end-models-on-fisher-spanish-english&quot;&gt;6. Prior Work: Cascaded vs End-to-End Models on Fisher Spanish-English&lt;/h2&gt;

&lt;p&gt;The results are shown as Table 2. Please refer to the paper for detailed baseline settings (e.g. Parameter, Additional Data, etc.)&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/posts/2020-06-05-blog-post-16-3.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;7-results-using-phone-features&quot;&gt;7. Results Using Phone Features&lt;/h2&gt;

&lt;p&gt;From Table 3, we can find that &lt;strong&gt;phone cascade&lt;/strong&gt; is better than &lt;strong&gt;phone end-to-end&lt;/strong&gt;, but &lt;strong&gt;phone end-to-end&lt;/strong&gt; performs better than &lt;strong&gt;baseline cascade&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hybrid cascade&lt;/strong&gt; uses an ASR model with phone-informed downsampling and BPE targets (Salesky et al., 2019). This improves the WER of the ASR model to 28.1 on dev and 23.2 on test, matching Weiss et al. (2017)’s state-of-the-art on test (23.2) and approaching it on dev (25.7). It is the best-performing model on the full dataset. However, under lower-resource conditions, it does not perform as favorably as the phone-featured models, as shown in Figure 2. &lt;strong&gt;This suggests improving ASR may enable cascades to perform better at high-resource conditions, but under lower-resource conditions it is not as effective as utilizing phone features.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/posts/2020-06-08-blog-post-16-4.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/posts/2020-06-08-blog-post-16-6.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Training time&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/posts/2020-06-08-blog-post-16-5.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Comparing to previous work using additional data&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Incorporating phone information makes the models more data-efficient.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;We note that our phone models further outperform previous work trained with additional corpora. The attention-passing model of Sperber et al. (2019) trained on additional parallel Spanish-English text yields 38.8 on test on the full dataset, which Salesky et al. (2019) matches on the full dataset and our proposed models exceed, with the phone cascade yielding a similar result (37.4) trained on only 40 hours.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;Pre-training with 300 hours of English ASR data and fine-tuning on 20 hours of Spanish-English data, Stoian et al. (2020) and Bansal et al. (2019) improve their end-to-end models from ≈10 BLEU to 20.2. All three of our proposed models exceed this mark when trained on 20 hours of Fisher.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;8-model-robustness--further-analysis&quot;&gt;8. Model Robustness &amp;amp; Further Analysis&lt;/h2&gt;

&lt;h3 id=&quot;81-phone-cascade&quot;&gt;8.1. Phone Cascade&lt;/h3&gt;

&lt;p&gt;Figure 3 compares the impact of different phone qualities on downstream MT. Note that with gold alignments, translation performance is similar to text-based translation.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/posts/2020-06-09-blog-post-16-7.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Redundancy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The authors collapsed adjacent consecutive phones with the same label in phone cascaded models.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;For the phone cascade models compared in Figure 3, we collapse adjacent consecutive phones with the same label, i.e. &lt;strong&gt;when three consecutive frames have been aligned to the same phone label ‘B B B’ we have reduced the sequence to a single phone ‘B’ for translation&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
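
&lt;p&gt;For example, this collapsing step is just a run-length reduction over the frame-level labels (a sketch, with made-up labels):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from itertools import groupby

frame_phones = ['B', 'B', 'B', 'IY', 'IY', 'T', 'T', 'T', 'T']
collapsed = [label for label, _ in groupby(frame_phones)]
print(collapsed)   # ['B', 'IY', 'T'] -- adjacent repeats reduced to a single phone
&lt;/code&gt;&lt;/pre&gt;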

&lt;p&gt;Translating with the full sequence of phones hurts performance.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Translating the full sequence of redundant frame-level phone labels  (e.g. the same sequence length as the number of frames), for the full 160hr dataset, all models performed on average 0.6 BLEU worse; for 40hr, 1.8 BLEU worse; and with 20 hours, 4.1 BLEU worse – a 13% decrease in performance solely from non-uniqued sequences.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3 id=&quot;82-phone-end-to-end&quot;&gt;8.2. Phone End-to-End&lt;/h3&gt;

&lt;blockquote&gt;
  &lt;p&gt;Our  phone  end-to-end  model  concatenates  trainable embeddings for phone labels to frame-level filterbank features,  associating similar feature vectors globally across the corpus, as opposed to locally within an utterance with the phone-averaged embeddings.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/posts/2020-06-09-blog-post-16-8.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The model’s performance degradation compared to the phone cascade in lower-resource conditions is likely due in part to these sequence lengths, as shown by our additional experiments with input redundancy for the cascade.  The greater reduction in performance here using lower quality phones suggests the noise of the labels and concatenated filterbank features compound, further detracting from performance. Perhaps further investigation into the relative weights placed on the two embedding factors over the training process could close this additional gap.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3 id=&quot;83-phone-segmentation-salesky-et-al-2019&quot;&gt;8.3. Phone Segmentation: Salesky et al. (2019)&lt;/h3&gt;

&lt;blockquote&gt;
  &lt;p&gt;That work introduced downsampling informed by phone segmentation– unlike our other models, &lt;strong&gt;the value of the phone label is not used&lt;/strong&gt;, but rather, phone alignments are used only to determine the boundary between adjacent phones for variable-length downsampling. We hypothesize that the primary reason for their BLEU improvements is &lt;strong&gt;the reduction in local redundancy between similar frames&lt;/strong&gt;, as discovered in the previous section.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3 id=&quot;84-quality-of-phone-labels&quot;&gt;8.4. Quality of Phone Labels&lt;/h3&gt;

&lt;p&gt;Two examples of phone sequences:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://houwx.net/posts/2020-06-09-blog-post-16-9.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;We see the primary difference in produced phones between different models is the label values, rather than the boundaries.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;We note that &lt;strong&gt;differences in frame-level phone boundaries would not affect our phone cascaded models&lt;/strong&gt;, where the speech features are discarded, while they would affect our phone end-to-end models, where the phone labels are concatenated to speech feature vectors and associate them across the corpus.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;9-related-works&quot;&gt;9. Related Works&lt;/h2&gt;

&lt;p&gt;Skipped. Please refer to the paper.&lt;/p&gt;

&lt;h2 id=&quot;10-conclusion&quot;&gt;10. Conclusion&lt;/h2&gt;

&lt;blockquote&gt;
  &lt;p&gt;We show that phone features significantly improve the performance and data efficiency of neural speech translation models.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Greatest improvements in &lt;strong&gt;low-resource settings&lt;/strong&gt; (20 hours):&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;End-to-end: 5 BLEU above the baseline cascade&lt;/li&gt;
  &lt;li&gt;Cascade: 9 BLEU above prior work&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
  &lt;p&gt;Generating phone features uses the same data as auxiliary speech recognition tasks from prior work; our experiments suggest these features are a more effective use of this data, with our models matching previous works’ performance without additional training data.&lt;/p&gt;
&lt;/blockquote&gt;</content><author><name>Wenxin Hou 侯汶昕</name><email>houwx001@gmail.com</email></author><category term="ST" /><category term="low resource" /><summary type="html">Last Updated: 2020-06-09</summary></entry></feed>