[Collection] Papers, Recipes, Toolkits and Interesting Posts



Last Updated: 2020-12-10



NeurIPS2020 SAS workshop: https://neurips-sas-2020.github.io/#papers

Machine Translation Reading List: https://github.com/THUNLP-MT/MT-Reading-List


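1. An Exploration of the Few-Shot Dilemma in NLP [Chinese]

2. Exploring Siamese Neural Networks: Please Stop Your Gradient Propagation! [Chinese]

3. Deeper Encoder + Shallower Decoder = Faster Autoregressive Model [Chinese]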


1. PyTorch Lightning Bolts [GitHub]

  1. Pretrained SOTA Deep Learning models, callbacks and more for research and production with PyTorch Lightning and PyTorch
  2. Paper: A Framework For Contrastive Self-Supervised Learning And Designing A New Approach

2. S3PRL Speech Toolkit [GitHub]

  1. Self-supervised pre-training and representation learning (S3PRL) toolkit covering Mockingjay, TERA, A-ALBERT, APC, and more to come, with easy-to-use standard downstream evaluation scripts for phone classification, speaker recognition, and ASR (all in PyTorch)

3. WavAugment [GitHub]

  1. WavAugment performs data augmentation on audio data, which is represented as PyTorch tensors
  2. Augmentations include: pitch randomization, reverberation, additive noise, time dropout (temporal masking), band reject, clipping
  3. Data Augmenting Contrastive Learning of Speech Representations in the Time Domain, E. Kharitonov, M. Rivière, G. Synnaeve, L. Wolf, P.-E. Mazaré, M. Douze, E. Dupoux. [arxiv]
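
Three of the augmentations listed above (additive noise, time dropout, clipping) can be sketched on a plain Python list of samples. This is only an illustration and does not use the WavAugment API; the function names, the `scale` mixing rule, and the parameter values are assumptions.

```python
import random

def additive_noise(wave, noise, scale=0.1):
    """Mix a noise signal into the waveform (illustrative fixed-scale mixing)."""
    return [w + scale * n for w, n in zip(wave, noise)]

def time_dropout(wave, max_len, rng):
    """Zero out one random contiguous span of at most max_len samples (temporal masking)."""
    span = rng.randint(0, min(max_len, len(wave)))
    start = rng.randint(0, len(wave) - span)
    return wave[:start] + [0.0] * span + wave[start + span:]

def clip(wave, factor):
    """Clamp samples to [-factor * peak, factor * peak]."""
    peak = max(abs(w) for w in wave)
    limit = factor * peak
    return [max(-limit, min(limit, w)) for w in wave]

rng = random.Random(0)
wave = [rng.uniform(-1.0, 1.0) for _ in range(16)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16)]
# Chain the augmentations; the length of the signal is unchanged
augmented = clip(time_dropout(additive_noise(wave, noise), max_len=4, rng=rng), factor=0.5)
```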

4. Sequence-to-Sequence G2P Toolkit [GitHub]

  1. The tool performs Grapheme-to-Phoneme (G2P) conversion using a Transformer model from the tensor2tensor toolkit, which is based on TensorFlow
  2. Lukasz Kaiser. “Accelerating Deep Learning Research with the Tensor2Tensor Library.” In Google Research Blog, 2017

5. Abkhazia [GitHub]

  1. Online documentation https://docs.cognitive-ml.fr/abkhazia
  2. The Abkhazia project makes it easy to obtain simple baselines for supervised ASR (using Kaldi) and ABX tasks (using ABXpy) on the large corpora of speech recordings typically used in speech engineering, linguistics or cognitive science research


1. Libri-Adapt: a New Speech Dataset for Unsupervised Domain Adaptation [ICASSP2020] [GitHub]

  1. 7200 hours of English speech recorded on mobile and embedded-scale microphones, 72 different domains (6 microphones x 3 accents x 4 environments x 100-hour Librispeech corpus) sampled at 16 kHz

2. MLS: A Large-Scale Multilingual Dataset for Speech Research [INTERSPEECH2020] (Newer arxiv version) [Recipe/Pretrained Models] [OpenSLR]

  1. Multilingual LibriSpeech (MLS) dataset: derived from LibriVox, 50.5K hours, 8 languages (44.5K hours of English and a total of about 6K hours for other languages)
  2. Languages covered: English, German, Dutch, French, Spanish, Italian, Portuguese, Polish

End-to-End ASR

1. Listen Attentively, and Spell Once: Whole Sentence Generation via a Non-Autoregressive Architecture for Low-Latency Speech Recognition [INTERSPEECH2020]

  1. Introduce non-autoregressive ASR system LASO (Listen Attentively, and Spell Once)
  2. CER 6.4% (SOTA autoregressive Transformer model 6.7%) on AISHELL-1
  3. Average inference latency of 21 ms, 1/50 that of the autoregressive Transformer model

2. Independent Language Modeling Architecture for End-To-End ASR [ICASSP2020]

  1. Introduce independent language modeling subnet to leverage external text data

  2. Existing method: Replace encoding with an all-zero vector and freeze the encoder

3. Conformer: Convolution-augmented Transformer for Speech Recognition [INTERSPEECH2020] [Paper Reading Slide]

  1. Combine CNNs and Transformers to model both local and global dependencies in a parameter-efficient way
  2. LibriSpeech WER of 2.1/4.3 without using LM and 1.9/3.9 with an external LM on test clean/other, 2.7/6.3 with a 10M-parameter small model

4. Exploring Transformers for Large-Scale Speech Recognition [INTERSPEECH2020]

  1. Depth-scale model initialization to accelerate convergence
  2. Pre-LayerNorm instead of Post-LayerNorm to accelerate convergence
  3. Chunk-based Transformer-XL for streaming ASR (low computation + GPU memory-saving)
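
The difference between Pre-LayerNorm and Post-LayerNorm in item 2 comes down to where the normalization sits relative to the residual connection. A minimal sketch, assuming a parameter-free layer norm and a hypothetical `toy_sublayer` standing in for attention or the feed-forward network:

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def post_ln_block(x, sublayer):
    # Post-LayerNorm: normalize after the residual addition
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_ln_block(x, sublayer):
    # Pre-LayerNorm: normalize only the sublayer input; the residual path stays an identity
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def toy_sublayer(x):
    # Hypothetical stand-in for self-attention or a feed-forward network
    return [0.5 * v for v in x]

x = [1.0, 2.0, 3.0]
post = post_ln_block(x, toy_sublayer)
pre = pre_ln_block(x, toy_sublayer)
```

In the Pre-LayerNorm variant the input `x` reaches the output unchanged through the residual path, which is the usual explanation for its easier optimization in deep stacks.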

5. Distilling the Knowledge of BERT for Sequence-to-Sequence ASR [INTERSPEECH2020] [GitHub]

  1. Use BERT to generate soft labels by masking and predicting target words for the training of seq2seq ASR
  2. Concatenate multiple utterances together to a fixed size for BERT prediction to make pre-training and distillation consistent and improve WER



1. Massively Multilingual Adversarial Speech Recognition [NAACL-HLT2019]

  1. Analyze the relative importance of similarity between the target and pre-training languages along the dimensions of phonetics, phonology, language family, geographical location, and orthography
  2. Investigate 2 additional objectives for hybrid CTC/Attention architecture: phoneme CTC and language-adversarial during pre-training

2. Learning Robust and Multilingual Speech Representations [Findings of EMNLP2020]

3. Language-agnostic Multilingual Modeling [ICASSP2020]

  1. Propose a language-agnostic multilingual ASR system by transforming all languages to one writing system through a many-to-one transliteration transducer
  2. Obtain 10% relative WER reduction on 4 Indic languages

6. End-to-End Multilingual Speech Recognition System with Language Supervision Training [IEICETrans.2020]

  1. Propose a Language Masks Estimation method to constrain the output distribution

7. Towards Language-Universal Mandarin-English Speech Recognition [INTERSPEECH2019]

  1. Propose to combine two monolingual models to build a bilingual model

8. Multilingual Speech Recognition With A Single End-To-End Model [ICASSP2018]

  1. Investigate the joint/encoding LID multi-task learning/language embedding conditioned cases on 9 Indian languages

9. Towards Language-Universal End-to-End Speech Recognition [ICASSP2018]

  1. Explore language-specific/universal output layer
  2. Propose language-specific gating units in hidden layers

10. Large-Scale End-to-End Multilingual Speech Recognition and Language Identification with Multi-Task Learning [INTERSPEECH2020] [Slide] [Recipe]

  1. Present LID-42: very-large-scale Transformer-based hybrid CTC/Attention models based on subwords / characters for 42-lingual ASR, average CER: 27.8/27.2
  2. Language-independent architecture with shared vocabulary including language IDs for joint language identification, LID accuracy: 93.5/94.0
  3. Relative improvements of 28.1% in WER by transfer learning to low-resource languages

11. Leveraging Language ID in Multilingual End-to-End Speech Recognition [ASRU2019]

12. Investigating End-to-end Speech Recognition for Mandarin-english Code-switching [ICASSP2019]

  1. Introduce MTL where at each time step, the model predicts both the modeling unit and the language ID

13. Meta Learning for End-to-End Low-Resource Speech Recognition [ICASSP2020]

  1. Apply model-agnostic meta-learning (MAML) to pre-train a CTC multilingual model and transfer to low-resource languages with language-specific head

14. Language Adaptive Multilingual CTC Speech Recognition

15. Multilingual Speech Recognition with Corpus Relatedness Sampling

16. Adversarial Multilingual Training for Low-Resource Speech Recognition [ICASSP2018]

17. Bytes are All You Need: End-to-End Multilingual Speech Recognition and Synthesis with Bytes [ICASSP2019]

  1. Propose Audio-to-Byte (A2B) and Byte-to-Audio (B2A) models for multilingual ASR and TTS
  2. ASR model: LAS, TTS model: Tacotron 2. Input/Output layer: 256 possible byte values.
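
The byte-level vocabulary can be illustrated with Python's built-in UTF-8 codec: any text in any script maps to a sequence of values in 0 to 255. This sketch covers only the text side; the helper names are hypothetical and no model is involved.

```python
def text_to_bytes(text):
    """Encode text as a sequence of UTF-8 byte values (0-255)."""
    return list(text.encode("utf-8"))

def bytes_to_text(byte_ids):
    """Decode byte values back to text; invalid sequences are replaced instead of raising."""
    return bytes(byte_ids).decode("utf-8", errors="replace")

ids = text_to_bytes("caf\u00e9")   # ids == [99, 97, 102, 195, 169]
restored = bytes_to_text(ids)
```

The non-ASCII character takes two bytes, so output sequences get longer for scripts outside ASCII, while the output layer stays fixed at 256 classes.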

18. An Investigation of Deep Neural Networks for Multilingual Speech Recognition Training and Adaptation

19. Bootstrap an End-to-End ASR System by Multilingual Training, Transfer Learning, Text-to-Text Mapping and Synthetic Audio [Arxiv2020]

  1. Demonstrate that post-ASR text-to-text mapping and synthetic TTS data can be effectively combined with approaches such as multilingual training and transfer learning to improve a simulated Italian ASR bootstrapping scenario

20. Adversarial Meta Sampling for Multilingual Low-Resource Speech Recognition [AAAI2021] [GitHub]

  1. Propose Adversarial Meta-Sampling (AMS) approach for multilingual meta-learning ASR (MML-ASR) and multilingual transfer learning ASR (MTL-ASR) by


1. Phonological Features for 0-shot Multilingual Speech Synthesis [INTERSPEECH2020]

  1. Utilize binary phonological features
  2. Mapping tables for phoneme to IPA, as well as an IPA-PF lookup dictionary are available at https://github.com/papercup-open-source/phonological-features

C. Speech Translation

1. Effectively Pretraining a Speech Translation Decoder with Machine Translation Data [EMNLP2020]

  1. Propose to use an adversarial discriminator to train NMT and ASR systems simultaneously by aligning their encodings to the same latent space –> both the pre-trained ASR encoder and NMT decoder can be used to improve AST
  2. 1.5 BLEU improvements on En-De and En-Fr compared with conventional pretraining methods (ASR encoder only / ASR encoder + NMT decoder pre-trained separately)

D. Analysis

1. Automatically Identifying Language Family from Acoustic Examples in Low Resource Scenarios [Arxiv2020]

  1. Train a LID model on the Wilderness dataset and analyze the learned embeddings by comparing with classical language family findings (Ethnologue, Glottolog, Wikipedia)
  2. Show that languages grouped by learned embeddings perform better than distance-based or phoneme-based approaches on zero-shot TTS

Curriculum Learning

1. When Do Curricula Work? [ICLR2021 (submitted)]

  1. Investigate the implicit curricula resulting from architectural and optimization bias and find that samples are learned in a highly consistent order

  2. Conduct extensive experiments over thousands of orderings spanning three kinds of learning: curriculum, anti-curriculum, and random-curriculum and find that benefit is entirely due to the dynamic training set size rather than the order of examples

  3. Curriculum, but not anti-curriculum or random ordering, can indeed improve performance either with a limited training time budget or in the presence of noisy data
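
The three kinds of ordering and the dynamic training set size can be sketched as follows. The difficulty scores, the linear pacing function, and its `start_frac` parameter are illustrative assumptions, not the exact setup of the paper.

```python
import random

def order_examples(scores, mode, rng):
    """Return example indices ordered by a per-example difficulty score."""
    idx = list(range(len(scores)))
    if mode == "curriculum":      # easy to hard
        return sorted(idx, key=lambda i: scores[i])
    if mode == "anti":            # hard to easy
        return sorted(idx, key=lambda i: -scores[i])
    rng.shuffle(idx)              # random ordering
    return idx

def linear_pacing(step, total_steps, n, start_frac=0.2):
    """Number of examples available at this step (hypothetical linear pacing function)."""
    frac = start_frac + (1.0 - start_frac) * step / total_steps
    return max(1, min(n, int(frac * n)))

def sample_batch(order, step, total_steps, batch_size, rng):
    """Draw a batch from the prefix of the ordering that the pacing function has unlocked."""
    available = order[:linear_pacing(step, total_steps, len(order))]
    return [rng.choice(available) for _ in range(batch_size)]
```

With this split, the ordering (`order_examples`) and the growing training set size (`linear_pacing`) can be varied independently, which is the kind of comparison the paper relies on.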

Semi-Supervised Learning

1. Deep Contextualized Acoustic Representations For Semi-Supervised Speech Recognition

2. Semi-supervised Development of ASR Systems for Multilingual Code-switched Speech in Under-resourced Languages

3. Semi-Supervised End-to-End Speech Recognition [INTERSPEECH2018] [GitHub]

4. Self-Training for End-to-End Speech Recognition [ICASSP2020] [Recipe]

  1. Train a base AM on limited paired data and an LM on large-scale unpaired text data, then use beam search to generate pseudo-labels for unlabelled speech data
  2. Filtering a) n-gram repeated more than c times (looping). b) hypotheses with an EOS probability below a threshold (early-stopping) or no EOS generated. c) length-normalized log likelihood as the confidence score for quality ranking: $\text{ConfidenceScore}(\hat{Y_i})=\frac{\log P_{AM}(\hat{Y_i}|X_i)}{|\hat{Y_i}|}$
  3. Propose sample ensemble: Combine pseudo samples generated by M models and average their loss during optimization
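
The three filtering heuristics in item 2 can be sketched as below. The hypothesis format (a dict with `tokens`, `log_prob`, `eos_prob`), the default n-gram size, and the `keep_fraction` ranking cut-off are assumptions made for illustration.

```python
from collections import Counter

def has_looping(tokens, n=3, c=2):
    """True if any n-gram occurs more than c times in the hypothesis."""
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return any(count > c for count in grams.values())

def confidence(log_prob, tokens):
    """Length-normalized log likelihood used as the confidence score."""
    return log_prob / max(1, len(tokens))

def filter_pseudo_labels(hyps, eos_threshold, keep_fraction, n=3, c=2):
    """hyps: dicts with 'tokens', 'log_prob' and 'eos_prob' (None if no EOS was generated)."""
    kept = [h for h in hyps
            if not has_looping(h["tokens"], n, c)
            and h["eos_prob"] is not None
            and h["eos_prob"] >= eos_threshold]
    # Rank the survivors by confidence and keep the top fraction
    kept.sort(key=lambda h: confidence(h["log_prob"], h["tokens"]), reverse=True)
    return kept[:int(len(kept) * keep_fraction)]
```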

5. End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures [SAS Workshop@ICML2020] [Recipe]

  1. Train AM + LM on Librispeech (960 hours) to generate pseudo labels for LibriVox (53.8k hours) –> (10k hours) can already obtain promising results as shown in the paper
  2. E2E AMs are implicit LMs, thus with enough unlabeled audio, decoding with an external LM doesn’t improve performance
  3. On Librispeech: achieve 2.27/4.8 WER on test-clean/other sets without LM, 2.09/4.11 with LM

Self-Supervised Learning

1. Unsupervised Pretraining Transfers Well Across Languages [ICASSP2020] [GitHub]

  1. Investigate CPC for cross-lingual tasks and evaluate the linear separability of the learned phoneme representation (Common Voice / LibriSpeech phoneme classification, Zerospeech2017)
  2. Introduce two modifications to improve CPC: a) replace batch normalization with a channel-wise normalization to avoid information leakage and stabilize training; b) replace the linear classifier with a 1-layer Transformer to make the future prediction target more reasonable
  3. PER of modified CPC pretrained on LS-360 is comparable to the supervised model pretrained on LS-100

2. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations [NeurIPS2020] [Paper Reading Slide]

  1. Self-supervised learning by masking + contrastive loss + quantized representations
  2. On Librispeech: achieve 1.8/3.3 WER on the clean/other test sets, 4.8/8.2 WER on 10 minutes of labeled data and 53k hours of unlabeled data. All experiments are with external LM

3. Unsupervised Cross-lingual Representation Learning for Speech Recognition [Arxiv2020]

  1. Investigate self-supervised pre-training (wav2vec 2.0) for multilingual / cross-lingual ASR
  2. Publish XLSR-53, a large-scale multilingual wav2vec 2.0 pre-trained on 56K-hour combined corpora of Common Voice (38 languages) + BABEL (14 languages) + Multilingual LibriSpeech (MLS) (8 languages)

4. Unsupervised Learning of Disentangled Representations for Speech with Neural Variational Inference Models [MIT PhD Dissertation]

5. A Further Study of Unsupervised Pre-training for Transformer Based Speech Recognition

6. Leveraging Text Data Using Hybrid Transformer-LSTM Based End-to-End ASR in Transfer Learning

7. Self-training and Pre-training are Complementary for Speech Recognition [Arxiv2020]

  1. Combine self-training (paper 5 in semi-supervised learning section) and unsupervised pre-training (wav2vec 2.0) to achieve new SOTA
  2. Use pre-trained wav2vec 2.0 for pseudo-labelling, then fine-tune wav2vec 2.0 (CTC) or train randomly-initialized Transformer model (S2S) as final model
  3. On Librispeech: achieve 1.8/3.3 (CTC) or 1.5/3.1 (S2S) WER on the clean/other test sets, 3.0/5.2 (CTC) or 3.1/5.4 (S2S) WER on 10 minutes of labeled data and 53k hours of unlabeled data. All experiments are with external LM

8. Vector-Quantized Autoregressive Predictive Coding [INTERSPEECH2020] [GitHub]

  1. Introduce one or more vector quantization layers to the APC model to explicitly control the amount of encoded information
  2. Probing tasks show that the APC model prefers to retain speaker information over phonetic information when the capacity is limited
  3. When the phonetic information is present, the learned VQ codes correspond well with English phones
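
To make the idea of a vector quantization layer concrete, here is a generic nearest-neighbour quantizer. It is a sketch under simplifying assumptions: the codebook and frames are made up, gradient handling is omitted, and the paper's exact code selection mechanism may differ.

```python
def quantize(vector, codebook):
    """Return (index, code) of the nearest codebook entry under squared Euclidean distance."""
    def sq_dist(code):
        return sum((v - c) ** 2 for v, c in zip(vector, code))
    index = min(range(len(codebook)), key=lambda k: sq_dist(codebook[k]))
    return index, codebook[index]

codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]   # 3 codes: a small information bottleneck
frames = [[0.9, 1.2], [0.1, -0.2], [-0.8, 0.7]]
codes = [quantize(f, codebook)[0] for f in frames]  # codes == [1, 0, 2]
```

The codebook size bounds how much information each frame can carry, which is the knob the paper uses to control the encoded information.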

9. Speech-XLNet: Unsupervised Acoustic Model Pretraining for Self-Attention Networks [INTERSPEECH2020]

  1. Apply XLNet (see item 13 in the NLP list) to ASR tasks (w/o segment recurrence mechanism)

10. Mockingjay: Unsupervised Speech Representation Learning with Deep Bidirectional Transformer Encoders [ICASSP2020] [GitHub]

  1. Apply BERT to speech with proposed additional consecutive masking (mask consecutive C frames to 0) and evaluate on phoneme classification, sentiment classification and speaker recognition
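
Consecutive masking can be sketched as follows: spans of `span` consecutive frames are set to zero, and the model is then trained to reconstruct the original frames. The span-selection rule (`mask_prob` per candidate start) and the function name are assumptions for illustration.

```python
import random

def mask_consecutive(frames, span, mask_prob, rng):
    """Zero out spans of `span` consecutive frames; each candidate start is chosen with mask_prob."""
    masked = [list(f) for f in frames]
    is_masked = [False] * len(frames)
    t = 0
    while t < len(frames):
        if rng.random() < mask_prob:
            for u in range(t, min(t + span, len(frames))):
                masked[u] = [0.0] * len(frames[u])
                is_masked[u] = True
            t += span            # skip past the masked span
        else:
            t += 1
    return masked, is_masked

def l1_on_masked(pred, target, is_masked):
    """L1 reconstruction error summed over masked frames only."""
    return sum(abs(p - q)
               for pf, tf, m in zip(pred, target, is_masked) if m
               for p, q in zip(pf, tf))
```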

11. TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech [TASLP2020 (submitted)] [S3PRL]

  1. Propose TERA (Transformer Encoder Representations from Alteration)
  2. Multi-target training with 3 auxiliary L1 reconstruction objectives: time (time masking, as in Mockingjay) / channel (frequency masking) / magnitude (+ Gaussian noise) alteration
  3. Comprehensive analysis on its application to ASR (representation / fine-tune) / phoneme classification / speaker classification

12. Unsupervised Pre-training of Bidirectional Speech Encoders via Masked Reconstruction [ICASSP2020]

  1. Inspired by BERT, pre-train bidirectional RNNs via masked reconstruction loss to improve ASR
  2. Inspired by SpecAugment, masking segments of sufficient width in both time and frequency

13. Improving Transformer-based Speech Recognition Using Unsupervised Pre-training [Arxiv2019]

  1. Almost the same as paper 9, except that it is evaluated on the HKUST and AISHELL Mandarin ASR datasets

14. A Simple Framework for Contrastive Learning of Visual Representations [ICML2020] [GitHub]

  1. Propose the SimCLR framework: input image $x$ –> two data augmentations $x_i$, $x_j$ –> same feature extractor –> $h_i$, $h_j$ –> same two-layer DNN –> $z_i$, $z_j$ –> contrastive loss (positive samples are $(i, j)$ and $(j, i)$; negative samples are the other $2(N-1)$ augmented examples)
  2. Important factors: composition of multiple data augmentation operations, larger batch sizes and longer training, a normalized temperature-scaled cross-entropy loss, and a nonlinear transformation between the representation and the contrastive loss
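
The contrastive loss over the projections $z$ can be sketched in plain Python. The pairing convention (views of the same example stored at adjacent indices) and the default temperature are assumptions; a practical implementation would be batched on tensors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def nt_xent(z, temperature=0.5):
    """z holds 2N projections; z[2k] and z[2k+1] are the two views of example k."""
    total = 0.0
    for i in range(len(z)):
        j = i + 1 if i % 2 == 0 else i - 1   # index of the positive view
        # Similarities to the positive and to the other 2(N-1) augmented examples
        logits = {k: cosine(z[i], z[k]) / temperature for k in range(len(z)) if k != i}
        log_denom = math.log(sum(math.exp(s) for s in logits.values()))
        total += log_denom - logits[j]       # negative log-softmax of the positive
    return total / len(z)

aligned = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
misaligned = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
```

The loss is lower for `aligned`, where each pair of views points the same way, than for `misaligned`, where each view is closest to a negative.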

15. Speech SimCLR: Combining Contrastive and Reconstruction Objective for Self-supervised Speech Representation Learning [Arxiv2020] [GitHub]

  1. Data augmentation: random pitch shift, speed perturbation, room reverberation and additive noise to the waveform; time and frequency masking to the spectrogram implemented with WavAugment toolkit
  2. Final loss = reconstruction loss (TERA) + contrastive loss (SimCLR)

16. Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning [Arxiv2020]

  1. Propose BYOL,

17. The Zero Resource Speech Benchmark 2021: Metrics and Baselines for Unsupervised Spoken Language Modeling [NeurIPS2020 SAS]


18. Multi-Format Contrastive Learning of Audio Representations [NeurIPS2020 SAS]

19. Similarity Analysis of Self-Supervised Speech Representations [NeurIPS2020 SAS]


Transfer Learning / Domain Adaptation

1. Co-Tuning for Transfer Learning [NeurIPS2020] [GitHub]

  1. Propose Co-Tuning to fully transfer pre-trained models by utilizing pretraining-task-specific parameters
  2. Loss of Co-Tuning: $\text{CE}(\text{prediction of target head}, y_t) + \lambda \cdot \text{CE}(\text{prediction of pretrained head}, p(y_s|y_t))$
  3. Propose two category relationship learning approaches to translate target labels into probabilistic source labels
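
The loss in item 2 can be sketched with plain lists of probabilities. The relationship $p(y_s|y_t)$ is estimated here by averaging the pretrained head's outputs over the target samples of each class, which is one simple way to do it; the function names and the toy data format are assumptions.

```python
import math

def cross_entropy(pred, target):
    """Cross entropy between a predicted distribution and a (possibly soft) target distribution."""
    return -sum(t * math.log(p) for p, t in zip(pred, target) if t > 0)

def category_relationship(source_probs, target_labels, num_target_classes):
    """Estimate p(y_s | y_t) by averaging pretrained-head outputs per target class."""
    sums = [None] * num_target_classes
    counts = [0] * num_target_classes
    for probs, y in zip(source_probs, target_labels):
        sums[y] = list(probs) if sums[y] is None else [a + b for a, b in zip(sums[y], probs)]
        counts[y] += 1
    return [[v / counts[c] for v in sums[c]] if counts[c] else None
            for c in range(num_target_classes)]

def co_tuning_loss(target_pred, source_pred, y_t, relationship, lam=1.0):
    """CE on the target head plus lambda * CE of the pretrained head against p(y_s | y_t)."""
    one_hot = [1.0 if c == y_t else 0.0 for c in range(len(target_pred))]
    return (cross_entropy(target_pred, one_hot)
            + lam * cross_entropy(source_pred, relationship[y_t]))
```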

2. Self-training For Few-shot Transfer Across Extreme Task Differences [ICLR2021 (submitted)]

  1. Propose “Self Training to Adapt Representations To Unseen Problems (STARTUP)”
  2. Three stages to learn representations: train teacher model on base dataset –> construct softly-labeled set on target unlabeled set –> train student model
  3. Loss for training student model: CE loss on base dataset + KL divergence on softly-labeled set + self-supervised loss on unlabeled set (SimCLR in this paper)

3. Adaptation Algorithms for Speech Recognition: An Overview [IEEE OJSP2021 (submitted)]

4. Unsupervised Domain Adaptation for Speech Recognition via Uncertainty Driven Self-Training [ICASSP2021 (submitted)]

  1. Propose Dropout-based Uncertainty-driven Self-Training (DUST) by filtering uncertain predictions from pseudo-labelled samples
  2. Confidence is computed based on the maximum edit distance between model outputs w/o dropout vs. model outputs w/ dropout with different random seeds
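
The filtering rule in item 2 can be sketched as below: decode once without dropout, decode several times with dropout, and keep the utterance only if the hypotheses agree closely. Normalizing by the reference length and the `<=` comparison against a threshold are assumptions of this sketch.

```python
def edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]

def dust_accept(reference, dropout_hyps, threshold):
    """Keep the pseudo-label if the largest normalized edit distance stays within the threshold.

    reference: hypothesis decoded without dropout.
    dropout_hyps: non-empty list of hypotheses decoded with dropout under different seeds.
    """
    worst = max(edit_distance(reference, h) for h in dropout_hyps)
    return worst / max(1, len(reference)) <= threshold
```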

5. Supervised Contrastive Learning for Pre-trained Language Model Fine-tuning [ICLR2021 (submitted)]


6. Domain Adaptation Using Class Similarity for Robust Speech Recognition [INTERSPEECH2020]

  1. Propose a novel adaptation method composed of two stages: a) train the source model and computing mean soft labels of every class over source samples; b) use the soft labels as a regularization term to train the target model on the target domain data
  2. Experiment on a) accent adaptation (Common Voice English) and b) noise adaptation (CHiME-3). The proposed method is more robust and performs even better on the latter task

7. UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [Arxiv2020]



1. Cross-lingual Language Model Pretraining [NeurIPS2019]

  1. Investigate the cross-lingual language model (XLM) pretrained with Causal LM (classic LM), Masked LM (unsupervised) and Translation LM (supervised) tasks

2. Unicoder: A Universal Language Encoder by Pre-training with Multiple Cross-lingual Tasks [EMNLP2019]

  1. Based on 2 tasks (MLM, TLM) in XLM, further introduces 3 new cross-lingual pretraining tasks: Cross-lingual Word Recovery (based on attention), Cross-lingual Paraphrase Classification (similar to Next Sentence Prediction but predicting meaning), Cross-lingual MLM (code-switch sentences)

3. Learning Deep Transformer Models for Machine Translation [ACL2019]

  1. Study the effects of Pre-Norm and Post-Norm in deep Transformers; the gradient of Post-Norm poses a higher risk of vanishing or exploding
  2. Propose Dynamic Linear Combination of Layers (DLCL) to memorize the features extracted from all preceding layers
  3. Successfully train a 30-layer Transformer encoder, the deepest reported in the paper, with a 6-layer decoder for NMT

4. Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond [TACL2019]


5. A Study of Cross-Lingual Ability and Language-specific Information in Multilingual BERT

6. Zero-shot Reading Comprehension by Cross-lingual Transfer Learning with Multi-lingual Language Representation Model

7. Are All Languages Created Equal in Multilingual BERT?

8. Multilingual Neural Machine Translation with Language Clustering

9. Multilingual Unsupervised NMT using Shared Encoder and Language-Specific Decoders

10. Multilingual NMT with a Language-Independent Attention Bridge https://github.com/Helsinki-NLP/OpenNMT-py/tree/att-brg

11. Cross-lingual Spoken Language Understanding with Regularized Representation Alignment

12. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context [ACL2019] [GitHub]

  1. Propose Transformer-XL (extra long) to capture longer-term dependency and resolve the context fragmentation problem
  2. Introduce segment-level recurrence mechanism to reuse previous segments as context
  3. Introduce relative positional encodings for the proposed recurrence mechanism
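
Segment-level recurrence and the need for relative positions can be sketched without a model. Here tokens stand in for the cached hidden states, and `segment_len` and `mem_len` are illustrative parameters.

```python
def process_with_memory(tokens, segment_len, mem_len):
    """Yield (memory, segment) pairs; memory holds cached states from earlier segments."""
    memory = []
    for start in range(0, len(tokens), segment_len):
        segment = tokens[start:start + segment_len]
        yield list(memory), segment
        # Cache the most recent mem_len states for the next segment
        # (in the real model these are hidden states, reused without gradient)
        memory = (memory + segment)[-mem_len:] if mem_len > 0 else []

def relative_distances(mem_size, seg_size):
    """Distance from each query in the segment to every visible key (memory + causal prefix)."""
    return [[(mem_size + q) - k for k in range(mem_size + q + 1)] for q in range(seg_size)]

pairs = list(process_with_memory(list(range(6)), segment_len=2, mem_len=3))
```

Because the same cached state appears at different absolute offsets in successive segments, only the relative distance between query and key is well defined, which motivates the relative positional encoding.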

13. XLNet: Generalized Autoregressive Pretraining for Language Understanding [NeurIPS2019] [GitHub] [Mask Analysis]

  1. Introduce a permutation language modeling (PLM) pre-training objective
  2. Two-stream self-attention + partial prediction (only predict the last tokens in a factorization order)
  3. Integrate Transformer-XL (relative positional encoding + segment recurrence mechanism)
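
The permutation objective is implemented through attention masks rather than by shuffling the input. A minimal sketch, assuming the mask of the query stream (a position cannot see its own content) and hypothetical function names:

```python
def permutation_mask(order):
    """mask[i][j] is True if position i may attend to position j under the factorization order."""
    rank = {pos: r for r, pos in enumerate(order)}
    n = len(order)
    # Position i sees only positions that come strictly earlier in the factorization order
    return [[rank[j] < rank[i] for j in range(n)] for i in range(n)]

def prediction_targets(order, num_predict):
    """Partial prediction: only the last num_predict (> 0) positions in the order are predicted."""
    return order[-num_predict:]

order = [2, 0, 1]                 # factorization order over positions 0..2
mask = permutation_mask(order)
targets = prediction_targets(order, 1)
```

For `order = [2, 0, 1]`, position 2 sees nothing, position 0 sees position 2, and position 1 sees both.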


1. Speech to Text Adaptation: Towards an Efficient Cross-Modal Distillation

  1. Knowledge distillation from BERT to ASR pretrained module for Spoken Language Understanding

Deep Learning Architecture

1. Pay Less Attention with Lightweight and Dynamic Convolutions [ICLR2019]

  1. Depth-wise convolution independently performed on every channel
  2. Lightweight convolution (depth-wise + weight sharing + GLU)
  3. Dynamic convolution: compute weight based on input X

2. Dynamic Convolution: Attention over Convolution Kernels [CVPR2020]

  1. Aggregate multiple convolution kernels dynamically based on their attentions
  2. Tricks: Sum the attention to 1 + Near-uniform attention in early epochs (softmax + large temperature)
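
Both tricks in item 2 follow from using a temperature softmax for the kernel attention. In this sketch the kernels are flat weight lists and the attention logits are given directly; in the paper they come from a small network on the input, and the temperature value used here is only illustrative.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature; the outputs always sum to 1."""
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate_kernels(kernels, logits, temperature=1.0):
    """Weighted sum of K kernels using input-dependent attention weights."""
    attn = softmax(logits, temperature)
    size = len(kernels[0])
    return [sum(a * k[i] for a, k in zip(attn, kernels)) for i in range(size)], attn

kernels = [[1.0, 0.0], [0.0, 1.0]]
sharp_kernel, sharp_attn = aggregate_kernels(kernels, [2.0, 0.0], temperature=1.0)
flat_kernel, flat_attn = aggregate_kernels(kernels, [2.0, 0.0], temperature=30.0)
```

A large temperature keeps the attention near uniform, so all kernels receive gradient early in training.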

3. Rethinking the Value of Transformer Components [COLING2020]

  1. For the decoder, self-attention is the least important and FFN the most important; higher encoder-attention layers (closer to the output layer) are more important than lower ones
  2. For encoder, the lower components (self-attention, FFN) are more important.
  3. Two methods to improve Transformer NMT: a) Prune unimportant components and retrain the model. b) Rewind unimportant components and finetune the model

4. Understanding the Difficulty of Training Transformers [EMNLP2020]

  1. Strong dependencies of Post-Norm amplify fluctuations brought by parameter changes and destabilize the training
  2. Loose reliance on residual branches in Pre-Norm generally limits the algorithm’s potential and often produces inferior models (than Post-Norm)
  3. Propose Admin: an Adaptive model initialization method to stabilize the early stage of training

5. MADE: Masked Autoencoder for Distribution Estimation [ICML2015]

  1. Modify the autoencoder to make it autoregressive by using masks to change hidden layer connectivity structures
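
The masks can be built by assigning each unit a degree and allowing only connections that respect the input ordering. This sketch assumes a single hidden layer, the natural input ordering, and at least two inputs; the helper names are hypothetical.

```python
import random

def made_masks(num_inputs, num_hidden, rng):
    """Build input-to-hidden and hidden-to-output masks for a one-hidden-layer MADE."""
    in_deg = list(range(1, num_inputs + 1))   # input d has degree d (natural ordering)
    hid_deg = [rng.randint(1, num_inputs - 1) for _ in range(num_hidden)]
    # Hidden unit k may see input d only if hid_deg[k] >= in_deg[d]
    mask_in = [[int(hk >= d) for d in in_deg] for hk in hid_deg]
    # Output d may see hidden unit k only if in_deg[d] > hid_deg[k]
    mask_out = [[int(d > hk) for hk in hid_deg] for d in in_deg]
    return mask_in, mask_out

def depends_on(mask_in, mask_out, out_d, in_d):
    """True if output out_d has any path to input in_d through the masked network."""
    return any(mask_out[out_d][k] and mask_in[k][in_d] for k in range(len(mask_in)))

mask_in, mask_out = made_masks(num_inputs=4, num_hidden=8, rng=random.Random(0))
```

With these masks, output `d` can depend only on inputs that come before `d` in the ordering, so the outputs can be read as the conditionals of an autoregressive factorization.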

6. Lite Transformer

7. Object-Centric Learning with Slot Attention

8. Deep Learning Recommendation Model for Personalization and Recommendation Systems

9. Multi-Head Attention: Collaborate Instead of Concatenate


1. On First-Order Meta-Learning Algorithms

2. Reptile: a Scalable Metalearning Algorithm

3. Rapid Learning or Feature Reuse? Towards Understanding the Effectiveness of MAML [ICLR2020]

  1. Question: is the effectiveness of MAML due to the meta-initialization being primed for rapid learning (large, efficient changes in the representations) or due to feature reuse, with the meta-initialization already containing high quality features? Answer: feature reuse is the dominant factor
  2. Propose Almost No Inner Loop (ANIL) algorithm, a competitive simplification of MAML by removing the inner loop for all but the (task-specific) head of the underlying neural network
  3. Propose No Inner Loop (NIL) algorithm to classify the test sample based on cosine similarities of penultimate layer representations with the k labelled examples (support set)
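
The NIL idea in item 3 needs no gradient steps at test time. In this sketch the feature vectors are assumed to be penultimate-layer representations from a trained network, and summing the similarities per class is one simple way to weight the support labels.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def nil_predict(query, support_feats, support_labels):
    """Label a query by summing its cosine similarities to the support features of each class."""
    scores = {}
    for feat, label in zip(support_feats, support_labels):
        scores[label] = scores.get(label, 0.0) + cosine(query, feat)
    return max(scores, key=scores.get)

support_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
support_labels = ["a", "a", "b"]
pred_a = nil_predict([0.8, 0.2], support_feats, support_labels)
pred_b = nil_predict([0.1, 1.0], support_feats, support_labels)
```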

4. Convergence of Meta-Learning with Task-Specific Adaptation over Partial Parameters


1. Scaling Hidden Markov Language Models [EMNLP2020]


3. PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization

4. AdapterFusion: Non-Destructive Task Composition for Transfer Learning

5. Improving Massively Multilingual Neural Machine Translation and Zero-Shot Translation

6. Dynamic Fusion Network for Multi-Domain End-to-end Task-Oriented Dialog

7. Multi-Source Domain Adaptation with Mixture of Experts

8. Meta-Learning for Few-Shot NMT Adaptation

8. Simple, Scalable Adaptation for Neural Machine Translation

9. Parameter-Efficient Transfer Learning for NLP

11. Improving Target-side Lexical Transfer in Multilingual Neural Machine Translation

12. Large Memory Layers with Product Keys

13. An Analysis of Massively Multilingual Neural Machine Translation for Low-Resource Languages

14. Large Product Key Memory for Pretrained Language Models

15. Contextual Parameter Generation for Universal Neural Machine Translation

16. Multilingual Speech Recognition with Self-Attention Structured Parameterization

17. Experience Grounds Language [EMNLP2020]

  1. Propose the notion of a World Scope (WS) as a lens to audit progress in NLP
  2. Five levels of WS: Corpus, Internet, Perception, Embodiment, Social

18. Monolingual Adapters for Zero-Shot Neural Machine Translation [EMNLP2020]

  1. Propose language-specific adapter layers which are more parameter-efficient $2n$, $O(n)$ than bilingual ones $n\cdot (n-1)$, $O(n^2)$ and enable combining any encoder adapter with other decoder adapters
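
The parameter-efficiency argument is simple counting, sketched below for $n$ languages. The adapter lookup tables and the `compose` helper are hypothetical and only illustrate how any encoder adapter can be paired with any decoder adapter.

```python
def monolingual_adapter_count(n):
    """One encoder-side and one decoder-side adapter per language: 2n."""
    return 2 * n

def bilingual_adapter_count(n):
    """One adapter set per ordered language pair: n * (n - 1)."""
    return n * (n - 1)

def compose(encoder_adapters, decoder_adapters, src, tgt):
    """Pick the source-language encoder adapter and the target-language decoder adapter."""
    return encoder_adapters[src], decoder_adapters[tgt]

langs = ["en", "de", "fr"]
encoder_adapters = {lang: f"enc-{lang}" for lang in langs}
decoder_adapters = {lang: f"dec-{lang}" for lang in langs}
pair = compose(encoder_adapters, decoder_adapters, "de", "fr")
```

A pair such as `("enc-de", "dec-fr")` can be formed even if no de-fr data was seen when training the adapters, which is what makes the zero-shot setting possible.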