# [Paper-NLP] Meta-Learning for Low-Resource Neural Machine Translation

Last Updated: 2020-08-12

The paper "Meta-Learning for Low-Resource Neural Machine Translation" was published at EMNLP 2018. Its authors are Jiatao Gu, Yong Wang, Yun Chen, Kyunghyun Cho and Victor O.K. Li, from The University of Hong Kong and New York University.

This paper applies the model-agnostic meta-learning (MAML) algorithm to the low-resource neural machine translation (NMT) task. The proposed approach significantly outperforms the multilingual, transfer-learning-based method.

# 1. Introduction

To address the problem of low-resource language pairs, various approaches have been proposed, including:

1. Utilizing monolingual corpora (multi-task learning, back-translation, dual learning, unsupervised machine translation with monolingual corpora for both sides)
2. Exploiting knowledge from high-resource language pairs (auxiliary translations/tasks, multilingual translation, universal lexical representation)
3. Pre-training the NMT model on a high-resource language pair and transferring it to the target low-resource pair

The authors build on the latest multilingual NMT approaches and introduce MAML to low-resource NMT by treating different language pairs as separate tasks. They further incorporate the universal lexical representation to overcome a limitation of vanilla MAML: it cannot handle mismatched input and output spaces across tasks.

# 2. Background

### Neural Machine Translation (NMT)

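Given a source sentence

$X=\{x_1, ..., x_{T'}\},$

an NMT model factorizes the probability of a target sentence $Y=\{y_1, ..., y_T\}$ into per-token conditional probabilities:

$p(Y|X;\theta)=\prod_{t=1}^{T+1}p(y_t|y_{0:t-1}, x_{1:T'};\theta),$

where $y_0$ and $y_{T+1}$ are special tokens marking the beginning and the end of the target sentence.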
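As a small illustration of this factorization, the sketch below sums per-step log-probabilities to obtain $\log p(Y|X;\theta)$. The function name `sentence_log_prob` and the per-step probabilities are made up for illustration, not outputs of a real model, and treating the final step as an end-of-sentence token is an assumption.

```python
import math


def sentence_log_prob(step_probs):
    """Return log p(Y|X) as the sum of log p(y_t | y_{0:t-1}, X) over all steps.

    step_probs[t-1] is the probability assigned to the reference token y_t;
    the last entry is assumed to belong to the end-of-sentence token.
    """
    return sum(math.log(p) for p in step_probs)


# Hypothetical per-step probabilities for a 3-token target plus end-of-sentence.
step_probs = [0.5, 0.8, 0.9, 0.95]
log_p = sentence_log_prob(step_probs)
prob = math.exp(log_p)  # equals the product of the per-step probabilities
```

Working in log space avoids numerical underflow for long sentences, where the raw product of many small probabilities quickly becomes tiny.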

### Meta Learning

Meta-learning addresses the problem of "fast adaptation on new training data". There are two categories of meta-learning:

1. Learning a meta-policy for updating model parameters
2. Learning a good parameter initialization for fast adaptation

Model-agnostic meta-learning (MAML) belongs to the second category.

# 3. Meta Learning for Low-Resource NMT

MAML seeks an initialization of the parameters $\theta^0$ from a set of source tasks $\{T^1, T^2, \dots, T^K\}$ such that the model can learn a new target task $T^0$ from only a small amount of training data.

The process can be understood as:

$\theta^*=\text{Learn}(T^0; \text{MetaLearn}(T^1, ..., T^K)).$

In the context of NMT, each language pair is regarded as a separate task. The objective is to find, from high-resource language pairs, an initialization that lets the model adapt quickly to low-resource pairs.

The overall procedure is illustrated in Figure 1.

## 3.1. Learn: language-specific learning

The language-specific learning process $\text{Learn}(D_T;\theta^0)$ is formulated as maximizing the log-posterior given data $D_T$, starting from randomly initialized or meta-learned parameters $\theta^0$:

$\text{Learn}(D_T;\theta^0)=\text{argmax}_\theta \mathcal{L}^{D_T}(\theta)=\text{argmax}_\theta \sum_{(X, Y)\in D_T}\log p(Y|X;\theta)-\beta\|\theta -\theta^0\|^2,$

where the second term discourages the newly learned parameters from deviating too far from $\theta^0$, which alleviates overfitting.
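To make the effect of the proximity penalty concrete, here is a minimal sketch that maximizes a toy objective by gradient ascent. The quadratic log-likelihood, the function names, the learning rate and the step count are all assumptions chosen for illustration; they are not the paper's training setup.

```python
def learn(log_lik_grad, theta0, beta, lr=0.1, steps=200):
    """Gradient ascent on log_lik(theta) - beta * ||theta - theta0||^2."""
    theta = list(theta0)
    for _ in range(steps):
        grad = log_lik_grad(theta)
        theta = [
            # The penalty contributes -2 * beta * (theta - theta0) to the gradient.
            t + lr * (g - 2.0 * beta * (t - t0))
            for t, g, t0 in zip(theta, grad, theta0)
        ]
    return theta


def toy_log_lik_grad(theta):
    """Gradient of the hypothetical log-likelihood -(theta - 2)^2."""
    return [-2.0 * (t - 2.0) for t in theta]


unregularized = learn(toy_log_lik_grad, [0.0], beta=0.0)  # approaches 2.0
regularized = learn(toy_log_lik_grad, [0.0], beta=1.0)    # approaches 1.0
```

With $\beta=1$ the solution settles halfway between the data optimum (2.0) and the initialization (0.0), which is the pull toward $\theta^0$ that the second term is meant to provide.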

## 3.2. MetaLearn

The meta-objective to find the initialization $\theta^0$ is given by:

$\mathcal{L}(\theta)=\mathbb{E}_k\mathbb{E}_{D_{T^k}, D'_{T^k}}\left[ \sum_{(X, Y)\in D'_{T^k}}\log p(Y|X;\text{Learn}(D_{T^k}; \theta))\right],$

where $k\sim\mathcal{U}(\{1, \dots, K\})$ indexes the meta-learning episode: for each episode, a task $T^k$ is chosen uniformly at random. $D_{T^k}$ and $D'_{T^k}$ are subsets of training examples used for learning and evaluation, respectively, and are sampled independently from the chosen task $T^k$.

In the learning step, the model parameters are updated by a gradient-descent step, so $\mathcal{L}^{D_{T^k}}$ is read here as a loss to be minimized:

$\theta_k'=\text{Learn}(D_{T^k};\theta)=\theta-\eta \nabla_\theta \mathcal{L}^{D_{T^k}}(\theta).$

Note that this update is only simulated: it is not actually applied to the meta-model $\theta$.

The updated parameters $\theta'_k$ are then evaluated on $D'_{T^k}$, and the meta-model $\theta$ is updated with the meta-gradient computed on this evaluation set. Multiple episodes can be aggregated before each update:

$\theta \leftarrow \theta-\eta' \sum_k \nabla_\theta \mathcal{L}^{D'_{T^k}}(\theta'_k),$

where $\eta'$ is the meta-learning rate.

The meta-gradient involves a second-order term. It can be approximated without forming the Hessian explicitly, using the finite-difference property of the Hessian-vector product for a function $f$ with Hessian $H$:

$H(x)v \approx \frac{\nabla f(x+uv)-\nabla f(x)}{u}.$

Applying this to the meta-gradient gives

$\begin{aligned}\nabla_\theta \mathcal{L}^{D'}(\theta')&=\nabla_{\theta'} \mathcal{L}^{D'}(\theta') \nabla_{\theta} \left(\theta -\eta \nabla_\theta \mathcal{L}^{D}(\theta)\right)\\ &=\nabla_{\theta'} \mathcal{L}^{D'}(\theta') -\eta \nabla_{\theta'}\mathcal{L}^{D'}(\theta')H_\theta(\mathcal{L}^D(\theta))\\ &\approx \nabla_{\theta'} \mathcal{L}^{D'}(\theta') -\frac{\eta}{u} \left[ \left .\nabla_{\theta}\mathcal{L}^{D}(\theta)\right|_{\hat{\theta}}- \left .\nabla_{\theta}\mathcal{L}^{D}(\theta)\right|_{\theta} \right],\end{aligned}$

where $u$ is a small constant and

$\hat{\theta}=\theta + u\nabla_{\theta'}\mathcal{L}^{D'}(\theta').$
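The finite-difference property used in this derivation can be checked numerically. The sketch below compares the approximation against the exact Hessian-vector product for a hand-picked toy function; the function $f(x_0, x_1)=x_0^2 x_1 + x_1^3$, the evaluation point and the step size are assumptions made purely for illustration.

```python
def grad_f(x):
    """Gradient of the toy function f(x0, x1) = x0^2 * x1 + x1^3."""
    x0, x1 = x
    return [2.0 * x0 * x1, x0 ** 2 + 3.0 * x1 ** 2]


def hvp_finite_difference(grad, x, v, u=1e-5):
    """Approximate H(x) v as (grad(x + u*v) - grad(x)) / u."""
    shifted = [xi + u * vi for xi, vi in zip(x, v)]
    return [(gs - g) / u for gs, g in zip(grad(shifted), grad(x))]


def hvp_exact(x, v):
    """Exact H(x) v for the same toy function, for comparison."""
    x0, x1 = x
    return [2.0 * x1 * v[0] + 2.0 * x0 * v[1],
            2.0 * x0 * v[0] + 6.0 * x1 * v[1]]


x, v = [1.0, 2.0], [1.0, -1.0]
approx = hvp_finite_difference(grad_f, x, v)
exact = hvp_exact(x, v)  # [2.0, -10.0]
```

Only two gradient evaluations are needed, so the cost stays comparable to ordinary backpropagation instead of growing with the size of the Hessian.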

In practice, the authors drop the second-order term and use the first-order approximation:

$\nabla_\theta \mathcal{L}^{D'}(\theta')\approx \nabla_{\theta'} \mathcal{L}^{D'}(\theta').$
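The first-order meta-update can be sketched on a toy problem. In the example below each task is a hypothetical one-dimensional quadratic loss $(\theta - c_k)^2$, the learning and evaluation sets are collapsed into the same loss, and all tasks are aggregated in every meta-step. These simplifications, along with the function names and learning rates, are assumptions for illustration and not the paper's implementation.

```python
def loss_grad(theta, center):
    """Gradient of the toy task loss (theta - center)^2."""
    return 2.0 * (theta - center)


def first_order_meta_step(theta, task_centers, eta=0.1, eta_meta=0.05):
    """One meta-update aggregated over all tasks, using the first-order rule."""
    meta_grad = 0.0
    for center in task_centers:
        # Simulated language-specific step: theta itself is not modified here.
        theta_k = theta - eta * loss_grad(theta, center)
        # First-order approximation: take the gradient at theta_k as the meta-gradient.
        meta_grad += loss_grad(theta_k, center)
    return theta - eta_meta * meta_grad


def meta_learn(task_centers, theta=0.0, steps=200):
    """Repeat the meta-update, starting from an arbitrary initial theta."""
    for _ in range(steps):
        theta = first_order_meta_step(theta, task_centers)
    return theta


theta0 = meta_learn([-1.0, 0.0, 4.0])  # settles near the mean of the centers, 1.0
```

In this symmetric toy setting the learned initialization ends up at the point from which one gradient step helps every task equally, which is the intuition behind learning an initialization rather than a single-task solution.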

As Figure 2 illustrates, transfer learning and meta-learning differ in their goal: the former aims at directly solving the source tasks, whereas the latter aims at an initialization that is useful for fine-tuning on various tasks, including both the source and target tasks.

## 3.3. Unified Lexical Representation

One limitation of meta-learning is that it assumes the input and output spaces are shared across all tasks, which does not hold when each language pair has its own vocabulary.

### Universal Lexical Representation (ULR)

ULR starts with multilingual word embedding matrices $\epsilon^k_{\text{query}}\in \mathbb{R}^{|V_k|\times d}$ pretrained on monolingual corpora, where $V_k$ is the vocabulary of the $k$-th language. One of these languages is used to build the universal lexical representation, which consists of a universal embedding matrix $\epsilon_u \in \mathbb{R}^{M\times d}$ and a corresponding key matrix $\epsilon_{\text{key}} \in \mathbb{R}^{M\times d}$, where $M<|V'_k|$.

Both $\epsilon^k_{\text{query}}$ and $\epsilon_{\text{key}}$ are fixed during meta-learning.

The language-specific embedding of token $x$ from language $k$ is computed as the convex sum of universal embedding vectors:

$\epsilon^0[x]=\sum_{i=1}^M \alpha_i \epsilon_u[i],$

where

$\alpha_i \propto \exp\{-\frac{1}{\tau}\epsilon_{\text{key}}[i]^T A \epsilon^k_{\text{query}}[x]\}$

and $\tau=0.05$. This approach makes it possible to handle languages with different vocabularies using a fixed number of shared parameters ($\epsilon_u$, $\epsilon_{\text{key}}$ and $A$).
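The embedding lookup described above can be sketched in a few lines of plain Python. The function name `ulr_embedding` and the tiny 2-dimensional matrices are hypothetical, and the sign inside the exponent follows the formula exactly as written above.

```python
import math


def ulr_embedding(query_vec, keys, universal, A, tau=0.05):
    """Return (embedding, alphas) for one token.

    alphas[i] is proportional to exp(-(1/tau) * keys[i]^T A query_vec),
    and the embedding is the convex sum of the rows of `universal`.
    """
    # A @ query_vec
    a_q = [sum(a * q for a, q in zip(row, query_vec)) for row in A]
    scores = [-sum(k * v for k, v in zip(key, a_q)) / tau for key in keys]
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    alphas = [w / total for w in weights]
    dim = len(universal[0])
    embedding = [sum(alphas[i] * universal[i][j] for i in range(len(universal)))
                 for j in range(dim)]
    return embedding, alphas


# Hypothetical toy setup with M = 2 universal tokens and d = 2.
identity = [[1.0, 0.0], [0.0, 1.0]]
embedding, alphas = ulr_embedding([1.0, 0.0], keys=identity,
                                  universal=identity, A=identity)
```

Because the weights are non-negative and sum to one, every token embedding lies inside the convex hull of the universal embedding vectors, whatever the size of the language's own vocabulary.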

### Learning of ULR

During language-specific learning, the authors avoid updating the universal embedding directly and instead capture the change to each embedding vector in a separate parameter $\Delta \epsilon^k[x]$:

$\epsilon^k[x]=\epsilon^0[x] + \Delta \epsilon^k[x].$

During language-specific learning, the first term is held fixed and only $\Delta \epsilon^k[x]$ is updated; during the meta-learning stage, only the first term (through $\epsilon_u$ and $A$) is updated.
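The split between the shared term and the per-language residual can be sketched as follows. The vectors, the gradient values, the learning rate and the zero initialization of the residual are assumptions for illustration only.

```python
def token_embedding(eps0_x, delta_x):
    """eps^k[x] = eps^0[x] + delta eps^k[x]."""
    return [e + d for e, d in zip(eps0_x, delta_x)]


def language_specific_update(delta_x, grad_wrt_embedding, lr=0.1):
    """Update only the residual; eps^0[x] is left untouched.

    Because eps^k[x] is a plain sum, the gradient with respect to the
    residual equals the gradient with respect to eps^k[x].
    """
    return [d - lr * g for d, g in zip(delta_x, grad_wrt_embedding)]


eps0_x = [0.2, -0.4]   # hypothetical ULR embedding of token x, held fixed
delta_x = [0.0, 0.0]   # residual, zero-initialized here as an assumption
delta_x = language_specific_update(delta_x, grad_wrt_embedding=[1.0, -2.0])
embedding = token_embedding(eps0_x, delta_x)  # roughly [0.1, -0.2]
```

Keeping the residual separate means a handful of fine-tuning examples can move individual token embeddings without disturbing the shared parameters that every other token depends on.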

# 4. Experiments

## 4.1. Dataset

Europarl: Bulgarian (Bg), Czech (Cs), Danish (Da), German (De), Greek (El), Spanish (Es), Estonian (Et), French (Fr), Hungarian (Hu), Italian (It), Lithuanian (Lt), Dutch (Nl), Polish (Pl), Portuguese (Pt), Slovak (Sk), Slovene (Sl) and Swedish (Sv)

WMT’17: Russian (Ru) (2M pairs subset)

WMT’16: Romanian (Ro)

WMT’17: Latvian (Lv), Finnish (Fi), Turkish (Tr)

Korean Parallel Dataset: Korean (Ko)

Ro-En or Lv-En is used as a validation set for meta-learning.

## 4.2. Model and Learning

Model: Transformer with the default hyper-parameter settings

Training & Fine-tuning:

During meta-learning, all the parameters are updated, but during fine-tuning, three strategies are considered to update the model:

1. updating all the modules (all)
2. updating the embedding and encoder only (emb+enc)
3. updating the embedding only (emb)
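The three fine-tuning strategies amount to choosing which parameter groups stay trainable. A minimal sketch, assuming parameters are identified by dotted names whose first component is the module (`embedding`, `encoder` or `decoder`); these names and the helper function are hypothetical and not taken from the paper's code.

```python
# Modules that remain trainable under each fine-tuning strategy.
STRATEGIES = {
    "all": ("embedding", "encoder", "decoder"),
    "emb+enc": ("embedding", "encoder"),
    "emb": ("embedding",),
}


def trainable_parameters(param_names, strategy):
    """Return the parameter names that are updated under the given strategy."""
    modules = STRATEGIES[strategy]
    return [name for name in param_names if name.split(".")[0] in modules]


# Hypothetical parameter names for a small encoder-decoder model.
names = ["embedding.weight", "encoder.layer0.weight", "decoder.layer0.weight"]
emb_only = trainable_parameters(names, "emb")        # embedding only
emb_enc = trainable_parameters(names, "emb+enc")     # embedding and encoder
everything = trainable_parameters(names, "all")      # all modules
```

All parameters outside the returned list would be frozen during fine-tuning, while meta-learning itself updates everything.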

# 5. Results

### vs. Multilingual Transfer Learning

1. Figure 3 shows a significant improvement of meta-learning over the multilingual transfer learning strategy.
2. The training sets are subsampled to only around 16,000 English tokens. Even so, the best fine-tuned MetaNMT achieves 2/3 (Ro-En) and 1/2 (the remaining pairs) of the BLEU score achieved by the supervised model trained on the full training sets (Table 1).

The choice of validation task also matters. Fi-En benefits more when Ro-En is used for validation (panel (c) of Figure 3), while the opposite holds for Tr-En. The relationship between task similarity and the impact of a validation task remains to be investigated.

### Training Size

Figure 4 shows that the BLEU curve of MetaNMT is flatter. As the target task's training set grows, the gap between MetaNMT and MultiNMT shrinks, **indicating the robustness of MetaNMT in handling low-resource language pairs**.

### Impact of Source Tasks

1. Table 2 suggests that using more source tasks is always beneficial to MetaNMT: there is up to a 2x improvement when going from one source task (Es) to 18 source tasks (All).
2. The choice of source languages also affects the target languages. For instance, comparing {De Ru} with {Es Fr It Pt}, the latter benefits Ro-En more, while the former benefits all the other pairs more.

### Training Curves

Figure 5 shows that MultiNMT saturates rapidly and eventually degrades (overfitting), whereas MetaNMT continues to improve and never degrades.

### Sample Translations

Table 3 presents zero-shot and meta-learned examples for Tr-En and Ko-En.

1. Zero-shot examples provide a word-by-word translation without re-ordering, demonstrating the success of applying universal lexical representation and meta-learned initialization.
2. After seeing around 600 sentence pairs (16,000 English tokens), the model rapidly learns to re-order tokens and produces better translations.

# 6. Conclusion

The contributions of this paper include:

1. Proposing MetaNMT, a meta-learning approach for low-resource neural machine translation.
2. Applying the universal lexical representation to tackle the I/O mismatch problem across language pairs.
3. Showing that MetaNMT significantly outperforms the multilingual, transfer-learning-based method on low-resource tasks.
