[Paper-PreTrain] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding



Last Updated: 2020-06-13

The paper, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, was written by researchers at Google AI and published at NAACL-HLT 2019.

Code: https://github.com/google-research/bert

This paper introduces BERT: Bidirectional Encoder Representations from Transformers. It reports new state-of-the-art results on eleven NLP tasks, including:


  1. GLUE score: 80.5%
  2. MultiNLI accuracy: 86.7%
  3. SQuAD v1.1 Test F1: 93.2
  4. SQuAD v2.0 Test F1: 83.1

1. Introduction

There are two existing strategies for applying pre-trained language representations to downstream tasks: feature-based (e.g., ELMo) and fine-tuning (e.g., GPT). Both use the same objective function during pre-training: unidirectional language models that learn general language representations.

Such unidirectional restrictions are sub-optimal for sentence-level tasks, and can be very harmful when applying fine-tuning based approaches to token-level tasks such as question answering, where context from both directions matters.

The main contributions are:


  1. It demonstrates the importance of bidirectional pre-training for language representations, in contrast to ELMo and GPT.
  2. It shows that pre-trained representations reduce the need for many heavily engineered task-specific architectures.
  3. BERT advances the state of the art for 11 NLP tasks.



3. BERT

The framework has two steps: pre-training and fine-tuning. During pre-training, the model is trained on unlabeled data over different pre-training tasks.

For fine-tuning, the BERT model is first initialized with the pre-trained parameters, and all of the parameters are fine-tuned using labeled data from the downstream tasks.

The overall architecture is shown in Figure 1:


Model Architecture

In this work, we denote the number of layers (i.e., Transformer blocks) as L, the hidden size as H, and the number of self-attention heads as A. In all cases we set the feed-forward/filter size to be 4H, i.e., 3072 for H = 768 and 4096 for H = 1024. We primarily report results on two model sizes: BERT_BASE (L=12, H=768, A=12, Total Parameters=110M) and BERT_LARGE (L=24, H=1024, A=16, Total Parameters=340M).

Note that BERT_BASE was chosen to have the same model size as OpenAI GPT for comparison purposes.
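As a rough sanity check on the sizes quoted above, the parameter totals can be approximated from L and H alone. The sketch below is not the authors' accounting: the 512 positions, 2 segment types, bias terms, layer norms, and the extra dense layer on [CLS] are assumptions, and the helper name `approx_bert_params` is made up for illustration. The vocabulary size of 30,000 is the figure given in this post.

```python
def approx_bert_params(num_layers, hidden, vocab_size=30_000,
                       max_positions=512, num_segments=2):
    """Rough parameter count for a BERT-style encoder (illustrative only)."""
    ffn = 4 * hidden          # feed-forward/filter size is 4H
    layer_norm = 2 * hidden   # scale and bias (assumed)

    # Token, position, and segment embedding tables plus one layer norm.
    embeddings = (vocab_size + max_positions + num_segments) * hidden + layer_norm

    # Query, key, value, and output projections, each H x H with a bias.
    attention = 4 * (hidden * hidden + hidden)
    # Two dense layers, H -> 4H and 4H -> H, each with a bias.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    per_layer = attention + feed_forward + 2 * layer_norm

    # Dense layer applied to the [CLS] vector (assumed, not from the text).
    pooler = hidden * hidden + hidden
    return embeddings + num_layers * per_layer + pooler


base = approx_bert_params(num_layers=12, hidden=768)
large = approx_bert_params(num_layers=24, hidden=1024)
print(f"BERT_BASE  ~ {base / 1e6:.0f}M parameters")
print(f"BERT_LARGE ~ {large / 1e6:.0f}M parameters")
```

Under these assumptions the estimates land within a few percent of the 110M and 340M totals. The number of heads A does not appear because splitting H across heads does not change the size of the projection matrices.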

Input/Output Representations


We use WordPiece embeddings (Wu et al., 2016) with a 30,000 token vocabulary.

Use of [CLS] token:

The first token of every sequence is always a special classification token ([CLS]). The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks.

Sentence pairs are packed together into a single sequence. We differentiate the sentences in two ways. First, we separate them with a special token ([SEP]). Second, we add a learned embedding to every token indicating whether it belongs to sentence A or sentence B.

For a given token, its input representation is constructed by summing the corresponding token, segment, and position embeddings. A visualization of this construction can be seen in Figure 2.
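The packing and summing steps can be sketched in a few lines. This is a minimal illustration, not the reference implementation: the function names are hypothetical, and assigning the first [SEP] to sentence A and the final [SEP] to sentence B is an assumption about the convention.

```python
def pack_pair(tokens_a, tokens_b=None):
    """Pack one or two tokenized sentences into a single BERT-style input."""
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    # Sentence A segment, assumed to cover [CLS] and the first [SEP].
    segment_ids = [0] * len(tokens)
    if tokens_b:
        tokens += list(tokens_b) + ["[SEP]"]
        # Sentence B segment, assumed to cover the final [SEP].
        segment_ids += [1] * (len(tokens_b) + 1)
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


def input_representation(token_vec, segment_vec, position_vec):
    """Sum the token, segment, and position embeddings element-wise."""
    return [t + s + p for t, s, p in zip(token_vec, segment_vec, position_vec)]


tokens, segment_ids, position_ids = pack_pair(
    ["my", "dog"], ["he", "likes", "play", "##ing"]
)
for tok, seg, pos in zip(tokens, segment_ids, position_ids):
    print(pos, seg, tok)
```

In a real model each of the three ids indexes its own learned embedding table, and `input_representation` is applied to the three looked-up vectors for every token.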


3.1. Pre-training BERT

Task #1: Masked LM

In order to train a deep bidirectional representation, we simply mask some percentage of the input tokens at random, and then predict those masked tokens. The final hidden vectors corresponding to the mask tokens are fed into an output softmax over the vocabulary, as in a standard LM. In all of our experiments, we mask 15% of all WordPiece tokens in each sequence at random. In contrast to denoising auto-encoders (Vincent et al., 2008), we only predict the masked words rather than reconstructing the entire input.

Masking has a downside:

Although this allows us to obtain a bidirectional pre-trained model, a downside is that we are creating a mismatch between pre-training and fine-tuning, since the [MASK] token does not appear during fine-tuning.

To mitigate this:

The training data generator chooses 15% of the token positions at random for prediction. If the i-th token is chosen, we replace the i-th token with

(1) the [MASK] token 80% of the time

(2) a random token 10% of the time

(3) the unchanged i-th token 10% of the time.

Then T_i, the final hidden vector of the i-th input token, is used to predict the original token with cross-entropy loss.
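The 80/10/10 replacement rule can be sketched as a small data-generation function. This is an illustration under stated assumptions rather than the released generator: the function name, the toy vocabulary, excluding [CLS] and [SEP] from selection, and rounding the 15% to a whole number of positions are all choices made here.

```python
import random

MASK = "[MASK]"
SPECIAL = {"[CLS]", "[SEP]"}


def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return (corrupted, labels); labels[i] is None where nothing is predicted."""
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    # Assumption: special tokens are never chosen for prediction.
    candidates = [i for i, tok in enumerate(tokens) if tok not in SPECIAL]
    if not candidates:
        return corrupted, labels
    num_to_predict = max(1, round(len(candidates) * mask_prob))
    for i in rng.sample(candidates, num_to_predict):
        labels[i] = tokens[i]  # the target is always the original token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return corrupted, labels


rng = random.Random(0)
vocab = ["the", "dog", "is", "hairy", "apple", "runs"]
tokens = ["[CLS]"] + [rng.choice(vocab) for _ in range(100)] + ["[SEP]"]
corrupted, labels = mask_tokens(tokens, vocab, rng)
print(sum(label is not None for label in labels), "positions selected for prediction")
```

Only positions with a non-None label contribute to the loss, which matches the point above that BERT predicts the selected tokens rather than reconstructing the whole input.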

The authors ablate these masking variations in Appendix C.2:


From Table 8 it can be seen that fine-tuning is surprisingly robust to different masking strategies. However, as expected, using only the MASK strategy was problematic when applying the feature-based approach to NER.

Task #2: Next Sentence Prediction (NSP)

Motivation: Many important downstream tasks such as Question Answering (QA) and Natural Language Inference (NLI) are based on understanding the relationship between two sentences, which is not directly captured by language modeling.

When choosing the sentences A and B for each pre-training example, 50% of the time B is the actual next sentence that follows A (labeled as IsNext), and 50% of the time it is a random sentence from the corpus (labeled as NotNext). C in Figure 1, the final hidden vector of the [CLS] token, is used for next sentence prediction (NSP).
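A minimal sketch of how such pairs could be generated is shown below. The function name and the toy corpus are hypothetical, and drawing the random sentence from a different document is an assumption made here so that a NotNext example cannot accidentally be the true next sentence.

```python
import random


def make_nsp_example(documents, doc_idx, sent_idx, rng):
    """Build one (sentence A, sentence B, label) triple for NSP.

    Assumes documents[doc_idx] has a sentence at sent_idx + 1 and that
    the corpus contains at least two documents.
    """
    doc = documents[doc_idx]
    sentence_a = doc[sent_idx]
    if rng.random() < 0.5:
        # 50%: B is the sentence that actually follows A.
        return sentence_a, doc[sent_idx + 1], "IsNext"
    # 50%: B is a random sentence, drawn here from another document.
    other_docs = [i for i in range(len(documents)) if i != doc_idx]
    other = documents[rng.choice(other_docs)]
    return sentence_a, rng.choice(other), "NotNext"


rng = random.Random(0)
documents = [
    ["the man went to the store", "he bought a gallon of milk", "then he walked home"],
    ["penguins are flightless birds", "they live in the southern hemisphere"],
]
examples = [make_nsp_example(documents, 0, 0, rng) for _ in range(200)]
print(examples[0])
```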

Pre-training Data

The pre-training corpus combines BookCorpus (800M words) and English Wikipedia (2,500M words).

For Wikipedia we extract only the text passages and ignore lists, tables, and headers. It is critical to use a document-level corpus rather than a shuffled sentence-level corpus such as the Billion Word Benchmark (Chelba et al., 2013) in order to extract long contiguous sequences.

3.2. Fine-tuning BERT

For applications involving text pairs, a common pattern is to independently encode text pairs before applying bidirectional cross attention, as in Parikh et al. (2016) and Seo et al. (2017). BERT instead uses the self-attention mechanism to unify these two stages, as encoding a concatenated text pair with self-attention effectively includes bidirectional cross attention between two sentences.

At the input, sentence A and sentence B from pre-training are analogous to

(1) sentence pairs in paraphrasing,

(2) hypothesis-premise pairs in entailment,

(3) question-passage pairs in question answering,

(4) a degenerate text-∅ pair in text classification or sequence tagging.

At the output, the token representations are fed into an output layer for token-level tasks, such as sequence tagging or question answering, and the [CLS] representation is fed into an output layer for classification, such as entailment or sentiment analysis.

4. Experiments

In this section, we present BERT fine-tuning results on 11 NLP tasks.

4.1. GLUE

The General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018a) is a collection of diverse natural language understanding tasks. Detailed descriptions of GLUE datasets are included in Appendix B.1. Most of them are classification tasks; STS-B is instead scored on a 1-to-5 similarity scale.

To fine-tune on GLUE, we represent the input sequence (for single sentence or sentence pairs) as described in Section 3, and use the final hidden vector C∈R^H corresponding to the first input token ([CLS]) as the aggregate representation. The only new parameters introduced during fine-tuning are classification layer weights W∈R^{K×H}, where K is the number of labels.
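The paper computes a standard classification loss from C and W as log(softmax(C W^T)). A minimal pure-Python sketch of that computation follows; the toy values for C and W are made up, and a real implementation would use a tensor library and batch the computation.

```python
import math


def classification_log_probs(c, w):
    """Compute log(softmax(C W^T)) for one example.

    c: [CLS] vector of length H.  w: K rows of length H.
    """
    logits = [sum(w_kh * c_h for w_kh, c_h in zip(row, c)) for row in w]
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def cross_entropy(c, w, label):
    """Negative log-probability of the correct label."""
    return -classification_log_probs(c, w)[label]


c = [0.5, -1.0, 2.0]      # toy [CLS] vector, H = 3
w = [[1.0, 0.0, 0.0],     # toy weights, K = 2 labels
     [0.0, 0.0, 1.0]]
log_probs = classification_log_probs(c, w)
print(log_probs, cross_entropy(c, w, label=1))
```

During fine-tuning the gradient of this loss flows into W and into every BERT parameter, since all parameters are updated jointly.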

We use a batch size of 32 and fine-tune for 3 epochs over the data for all GLUE tasks. For each task, we selected the best fine-tuning learning rate (among 5e-5, 4e-5, 3e-5, and 2e-5) on the Dev set. Additionally, for BERT_LARGE we found that fine-tuning was sometimes unstable on small datasets, so we ran several random restarts and selected the best model on the Dev set. With random restarts, we use the same pre-trained checkpoint but perform different fine-tuning data shuffling and classifier layer initialization.

Results are presented in Table 1.


We find:

  1. BERT_LARGE > BERT_BASE > all other systems on all tasks
  2. Note that BERT_BASE and OpenAI GPT are nearly identical in terms of model architecture apart from the attention masking.

4.2. SQuAD v1.1

The Stanford Question Answering Dataset (SQuAD v1.1) is a collection of 100k crowd-sourced question/answer pairs (Rajpurkar et al., 2016). Given a question and a passage from Wikipedia containing the answer, the task is to predict the answer text span in the passage.

As shown in Figure 1, in the question answering task, we represent the input question and passage as a single packed sequence, with the question using the A embedding and the passage using the B embedding.

How the answer span is chosen:

We only introduce a start vector S∈R^H and an end vector E∈R^H during fine-tuning. The probability of word i being the start of the answer span is computed as a dot product between T_i and S followed by a softmax over all of the words in the paragraph. The analogous formula is used for the end of the answer span. The score of a candidate span from position i to position j is defined as S·T_i+E·T_j, and the maximum scoring span where j≥i is used as a prediction.
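The span scoring rule can be written out directly. The sketch below is a brute-force illustration with made-up toy vectors and hypothetical function names; a practical implementation would typically also cap the span length and work on batched tensors.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def best_span(token_vectors, start_vec, end_vec):
    """Return (i, j, score) maximizing S·T_i + E·T_j subject to j >= i."""
    start_scores = [dot(start_vec, t) for t in token_vectors]
    end_scores = [dot(end_vec, t) for t in token_vectors]
    best = None
    for i, s in enumerate(start_scores):
        for j in range(i, len(end_scores)):
            score = s + end_scores[j]
            if best is None or score > best[2]:
                best = (i, j, score)
    return best


T = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0], [0.0, 2.0]]  # toy T_i vectors, H = 2
S, E = [1.0, 0.0], [0.0, 1.0]                         # toy start and end vectors
start_probs = softmax([dot(S, t) for t in T])         # P(word i starts the answer)
span = best_span(T, S, E)
print(start_probs, span)
```

The j >= i constraint matters: without it, the highest start score and the highest end score could be combined into a span that ends before it begins.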

Training details:

The training objective is the sum of the log-likelihoods of the correct start and end positions. We fine-tune for 3 epochs with a learning rate of 5e-5 and a batch size of 32.



In the results table, "TriviaQA" marks systems that use modest data augmentation by first fine-tuning on TriviaQA (Joshi et al., 2017) before fine-tuning on SQuAD.

4.3. SQuAD v2.0

The SQuAD 2.0 task extends the SQuAD 1.1 problem definition by allowing for the possibility that no short answer exists in the provided paragraph, making the problem more realistic.

We treat questions that do not have an answer as having an answer span with start and end at the [CLS] token. The probability space for the start and end answer span positions is extended to include the position of the [CLS] token. For prediction, we compare the score of the no-answer span, s_null = S·C + E·C, to the score of the best non-null span, s_{i,j} = max_{j≥i} S·T_i + E·T_j. We predict a non-null answer when s_{i,j} > s_null + τ, where the threshold τ is selected on the dev set to maximize F1.
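The no-answer decision can be sketched as follows. The function name and toy scores are hypothetical, and placing the [CLS] scores at index 0 is an assumption that matches the input layout described earlier.

```python
def predict_with_null(start_scores, end_scores, tau):
    """Return the best (i, j) span, or None for "no answer".

    start_scores[k] = S·T_k and end_scores[k] = E·T_k, with index 0
    assumed to hold the [CLS] position (so T_0 = C).
    """
    s_null = start_scores[0] + end_scores[0]  # S·C + E·C
    best_span, best_score = None, float("-inf")
    for i in range(1, len(start_scores)):
        for j in range(i, len(end_scores)):
            score = start_scores[i] + end_scores[j]
            if score > best_score:
                best_span, best_score = (i, j), score
    # Answer only when the best span beats the null score by more than tau.
    if best_span is not None and best_score > s_null + tau:
        return best_span
    return None


start_scores = [2.0, 0.0, 3.0, 0.0]
end_scores = [2.0, 0.0, 1.0, 2.5]
print(predict_with_null(start_scores, end_scores, tau=0.0))
print(predict_with_null(start_scores, end_scores, tau=2.0))
```

Raising τ makes the model abstain more often, which is why it is tuned on the dev set against F1.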

Training details:

We did not use TriviaQA data for this model. We fine-tuned for 2 epochs with a learning rate of 5e-5 and a batch size of 48.


4.4. SWAG

The Situations With Adversarial Generations (SWAG) dataset contains 113k sentence-pair completion examples that evaluate grounded commonsense inference (Zellers et al., 2018). Given a sentence, the task is to choose the most plausible continuation among four choices.

When fine-tuning on the SWAG dataset, we construct four input sequences, each containing the concatenation of the given sentence (sentence A) and a possible continuation (sentence B). The only task-specific parameter introduced is a vector whose dot product with the [CLS] token representation C denotes a score for each choice, which is normalized with a softmax layer.
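A minimal sketch of the multiple-choice scoring is given below. The four toy [CLS] vectors and the task vector `v` are made-up values, and the function name is hypothetical; in practice each C comes from running BERT on one of the four packed sequences.

```python
import math


def swag_choice_probs(cls_vectors, v):
    """Softmax over one score per choice, where score = v · C."""
    scores = [sum(a * b for a, b in zip(v, c)) for c in cls_vectors]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


cls_vectors = [[0.1, 0.2], [0.9, 0.4], [0.3, 0.3], [0.0, 0.5]]  # one C per choice
v = [1.0, 2.0]                                                  # task-specific vector
probs = swag_choice_probs(cls_vectors, v)
predicted = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted)
```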

Training details:

We fine-tune the model for 3 epochs with a learning rate of 2e-5 and a batch size of 16.


5. Ablation Studies

In this section, we perform ablation experiments over a number of facets of BERT in order to better understand their relative importance. Additional ablation studies can be found in Appendix C.

5.1. Effect of Pre-training Tasks

No NSP: A bidirectional model which is trained using the “masked LM” (MLM) but without the “next sentence prediction” (NSP) task.

LTR & No NSP: A left-context-only model which is trained using a standard Left-to-Right (LTR) LM, rather than an MLM. The left-only constraint was also applied at fine-tuning, because removing it introduced a pre-train/fine-tune mismatch that degraded downstream performance. Additionally, this model was pre-trained without the NSP task.
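One way to picture the difference between the bidirectional and LTR settings is as a difference in the self-attention mask. The sketch below is only an illustration of that idea with hypothetical helper names, where 1 means "position i may attend to position j".

```python
def bidirectional_mask(n):
    """Every position may attend to every position (MLM-style encoder)."""
    return [[1] * n for _ in range(n)]


def left_to_right_mask(n):
    """Position i may attend only to positions 0..i (LTR language model)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


for row in left_to_right_mask(4):
    print(row)
```

With the lower-triangular mask, the representation of a token cannot depend on anything to its right, which is the restriction the LTR ablation keeps at fine-tuning time as well.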


We find:

  1. Removing NSP hurts performance significantly on QNLI, MNLI, and SQuAD 1.1.
  2. The LTR model performs worse than the MLM model on all tasks, with large drops on MRPC and SQuAD.
  3. Adding a randomly initialized BiLSTM on top of the LTR model significantly improves results on SQuAD, but the results are still far worse than those of the pre-trained bidirectional model. The BiLSTM hurts performance on the GLUE tasks.

5.2. Effect of Model Size

Results on selected GLUE tasks are shown in Table 6. In this table, the average Dev Set accuracy from 5 random restarts of fine-tuning is reported.


We can see:

  1. Larger models lead to a strict accuracy improvement across all four datasets, even for MRPC, which has only 3,600 labeled training examples and is substantially different from the pre-training tasks.
  2. The authors believe this is the first work to demonstrate convincingly that scaling to extreme model sizes also leads to large improvements on very small scale tasks.

5.3. Feature-based Approach with BERT

All of the BERT results presented so far have used the fine-tuning approach, where a simple classification layer is added to the pre-trained model, and all parameters are jointly fine-tuned on a downstream task. However, the feature-based approach, where fixed features are extracted from the pre-trained model, has certain advantages:

  1. Not all tasks can be easily represented by a Transformer encoder architecture, so some require a task-specific model architecture to be added.
  2. There are major computational benefits to pre-computing an expensive representation of the training data once and then running many experiments with cheaper models on top of this representation.

In this section, we compare the two approaches by applying BERT to the CoNLL-2003 Named Entity Recognition (NER) task (Tjong Kim Sang and De Meulder, 2003).

In the input to BERT, we use a case-preserving WordPiece model, and we include the maximal document context provided by the data. Following standard practice, we formulate this as a tagging task but do not use a CRF layer in the output. We use the representation of the first sub-token as the input to the token-level classifier over the NER label set.
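Selecting the first sub-token of each word can be sketched as below. The `##` prefix for continuation pieces is an assumption about the WordPiece output format, and the function name, tokens, and labels are toy values for illustration.

```python
def first_subtoken_indices(wordpieces):
    """Indices of the first sub-token of each word.

    Assumes continuation pieces are marked with a leading '##'.
    """
    return [i for i, piece in enumerate(wordpieces) if not piece.startswith("##")]


pieces = ["Jim", "Hen", "##son", "was", "a", "puppet", "##eer"]
word_labels = ["I-PER", "I-PER", "O", "O", "O"]  # one toy label per word
idx = first_subtoken_indices(pieces)
aligned = list(zip([pieces[i] for i in idx], word_labels))
print(idx)
print(aligned)
```

Only the hidden vectors at these indices are passed to the token-level classifier, so continuation pieces receive no prediction of their own.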

To ablate the fine-tuning approach, we apply the feature-based approach by extracting the activations from one or more layers without fine-tuning any parameters of BERT. These contextual embeddings are used as input to a randomly initialized two-layer 768-dimensional BiLSTM before the classification layer.

Results are presented in Table 7.


We find:

  1. BERT_LARGE performs competitively with state-of-the-art methods.
  2. The best performing method concatenates the token representations from the top four hidden layers of the pre-trained Transformer, which is only 0.3 F1 behind fine-tuning the entire model.
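The "concatenate the top four hidden layers" feature can be sketched with plain lists. The function name, the nested-list layout `hidden_states[layer][token]`, and the bottom-to-top concatenation order are assumptions for illustration; the toy sizes are far smaller than a real model.

```python
def concat_top_layers(hidden_states, num_layers=4):
    """Concatenate each token's vectors from the top `num_layers` layers.

    hidden_states[l][t] is the vector for token t at layer l,
    with the last entry being the top layer.
    """
    top = hidden_states[-num_layers:]
    num_tokens = len(top[0])
    return [[x for layer in top for x in layer[t]] for t in range(num_tokens)]


# Toy activations: 6 layers, 3 tokens, H = 2; layer l is filled with the value l.
hidden_states = [[[float(l)] * 2 for _ in range(3)] for l in range(6)]
features = concat_top_layers(hidden_states)
print(len(features), len(features[0]))
```

Each token ends up with a 4H-dimensional fixed feature vector, which is then fed to the BiLSTM while the BERT parameters stay frozen.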

6. Conclusion

Recent empirical improvements due to transfer learning with language models have demonstrated that rich, unsupervised pre-training is an integral part of many language understanding systems. Our major contribution is further generalizing these findings to deep bidirectional architectures, allowing the same pre-trained model to successfully tackle a broad set of NLP tasks.