[Paper-NLP] Get To The Point: Summarization with Pointer-Generator Networks

9 minute read

Published:

Last Updated: 2020-08-11

Slides used for my final presentation in Language Engineering at Tokyo Tech: [slides].

This paper, Get To The Point: Summarization with Pointer-Generator Networks, was written by researchers from Stanford and Google.

Code: https://github.com/rohithreddy024/Text-Summarizer-Pytorch (PyTorch, not official), https://github.com/abisee/pointer-generator (TensorFlow 1.x, official)

This paper introduces the Pointer-Generator Network, which addresses two shortcomings of current models for abstractive text summarization:

  1. they are liable to reproduce factual details inaccurately
  2. they tend to repeat themselves

Novelty:

  1. use a hybrid pointer-generator network that can copy words from the source text via pointing, which aids accurate reproduction of information, while retaining the ability to produce novel words through the generator.
  2. use coverage to keep track of what has been summarized, which discourages repetition

1. Introduction

Two approaches to summarization: extractive and abstractive.

  1. Extractive methods assemble summaries exclusively from passages (usually whole sentences) taken directly from the source text.
  2. Abstractive methods may generate novel words and phrases not featured in the source text, as a human-written abstract usually does.

A comparison of the proposed method and the baseline Seq2Seq model with attention is shown below; from it we can see that the pointer-generator solves the out-of-vocabulary (OOV) problem, while coverage eliminates the repetition problem:

2020-07-09-blog-post-22-1

2. Our Models

2.1. Sequence-to-sequence attentional model (baseline)

The model architecture is shown below:

2020-07-09-blog-post-22-2

Explanation:

  1. Attention distribution/weights (Blue): a^t at decoding time step t over the encoder hidden states h_i.
\[a^t=\text{softmax}(e^t), \quad \text{where} \quad e_i^t=v^T \tanh(W_h h_i + W_s s_t + b_{attn}),\]

where v, W_h, W_s, b_{attn} are learnable parameters.

  2. Context Vector: weighted average of encoder hidden states h_i with attention weights a^t used at decoding time step t.
\[h^*_t=\sum_{i}{a_i^t * h_i}\]
  3. Vocabulary distribution (Green): the probability distribution over all words in the vocabulary.
\[P_{vocab}=\text{softmax}(V'(V[s_t; h_t^*]+b)+b')\]

​ More intuitively,

\[P_{vocab}=\text{softmax}(\text{Linear}(\text{Linear}([s_t; h_t^*])))\]

​ We can see that the input is simply the concatenation of the context vector and the decoder state at time step t; a short sketch of these components follows.
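Here is a minimal PyTorch sketch of the attention distribution, context vector, and vocabulary distribution above; the module layout, dimension arguments, and variable names are my own illustrative choices rather than the official implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BahdanauAttention(nn.Module):
    def __init__(self, enc_dim, dec_dim, attn_dim):
        super().__init__()
        self.W_h = nn.Linear(enc_dim, attn_dim, bias=False)  # applied to encoder states h_i
        self.W_s = nn.Linear(dec_dim, attn_dim, bias=True)   # applied to s_t; bias acts as b_attn
        self.v = nn.Linear(attn_dim, 1, bias=False)          # v^T

    def forward(self, h, s_t):
        # h: (batch, src_len, enc_dim), s_t: (batch, dec_dim)
        e_t = self.v(torch.tanh(self.W_h(h) + self.W_s(s_t).unsqueeze(1))).squeeze(-1)  # (batch, src_len)
        a_t = F.softmax(e_t, dim=-1)                          # attention distribution a^t
        context = torch.bmm(a_t.unsqueeze(1), h).squeeze(1)   # h*_t = sum_i a_i^t h_i
        return a_t, context

class VocabDist(nn.Module):
    def __init__(self, enc_dim, dec_dim, hidden_dim, vocab_size):
        super().__init__()
        self.linear1 = nn.Linear(dec_dim + enc_dim, hidden_dim)  # V, b
        self.linear2 = nn.Linear(hidden_dim, vocab_size)         # V', b'

    def forward(self, s_t, context):
        # P_vocab = softmax(V'(V[s_t; h*_t] + b) + b')
        return F.softmax(self.linear2(self.linear1(torch.cat([s_t, context], dim=-1))), dim=-1)
```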

Loss function:

During training, the loss for time step t is the negative log likelihood (NLLLoss) of the target word w^∗_t for that timestep:

\[loss_t = -\log P(w_t^*)\]

and the overall loss for the whole sequence is:

\[loss=\frac{1}{T}\sum_{t=0}^{T-1} loss_t=-\frac{1}{T}\sum_{t=0}^{T-1}\log P(w^*_t)\]
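As a small companion sketch (same hedged PyTorch style, with illustrative tensor shapes and no padding mask), the sequence loss can be computed as:

```python
import torch

def sequence_nll(log_probs, targets):
    """log_probs: (T, batch, vocab) log-probabilities at each decoder step;
    targets: (T, batch) indices of the reference words w*_t.
    Returns (1/T) * sum_t -log P(w*_t) for each example in the batch.
    (A real implementation would also mask padded positions.)"""
    step_losses = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # (T, batch)
    return step_losses.mean(dim=0)
```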

2.2. Pointer-generator network

The model architecture is shown below:

2020-07-09-blog-post-22-3

Explanation:

The first three components are the same as in the baseline model, as explained in section 2.1.

  1. Attention distribution/weights (Blue): a^t at decoding time step t over the encoder hidden states h_i.
\[a^t=\text{softmax}(e^t), \quad \text{where} \quad e_i^t=v^T \tanh(W_h h_i + W_s s_t + b_{attn}),\]

where v, W_h, W_s, b_{attn} are learnable parameters.

  2. Context Vector: weighted average of encoder hidden states h_i with attention weights a^t used at decoding time step t.
\[h^*_t=\sum_{i}{a_i^t * h_i}\]
  3. Vocabulary distribution (Green): the probability distribution over all words in the vocabulary.
\[P_{vocab}=\text{softmax}(V'(V[s_t; h_t^*]+b)+b')\]

​ More intuitively,

\[P_{vocab}=\text{softmax}(\text{Linear}(\text{Linear}([s_t; h_t^*])))\]

​ which, in the baseline model, serves directly as the final output distribution:

\[P(w)=P_{vocab}(w)\]

The components below are introduced by the pointer-generator to bring improvements:

  4. Generation probability (Yellow): the probability of generating a word from the vocabulary at decoding time step t.
\[p_{gen}=\text{sigmoid}(w_{h^*}^T h_t^*+w_s^T s_t+w_x^T x_t + b_{ptr})\]
  5. Final distribution: probability distribution over the extended vocabulary, composed of the fixed vocabulary plus words from the source text (see the sketch after this list).
\[P(w)=p_{gen}P_{vocab}(w)+(1-p_{gen})\sum_{i:w_i=w}a_i^t\]
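Below is a hedged PyTorch sketch of the generation probability and the final distribution over the extended vocabulary. The scatter-based accumulation and all variable names are my own choices for illustration; the official code assembles this distribution differently in detail.

```python
import torch

def generation_prob(context, s_t, x_t, w_hstar, w_s, w_x, b_ptr):
    # p_gen = sigmoid(w_h*^T h*_t + w_s^T s_t + w_x^T x_t + b_ptr), shape (batch, 1)
    return torch.sigmoid(context @ w_hstar + s_t @ w_s + x_t @ w_x + b_ptr).unsqueeze(-1)

def final_distribution(p_vocab, a_t, p_gen, src_ids_ext, n_src_oov):
    """p_vocab: (batch, vocab) generator distribution P_vocab;
    a_t: (batch, src_len) attention distribution a^t;
    p_gen: (batch, 1) generation probability;
    src_ids_ext: (batch, src_len) source token ids in the extended vocabulary
    (in-vocabulary ids plus temporary ids for source-side OOV words);
    n_src_oov: number of distinct OOV words in this batch's source texts."""
    vocab_part = p_gen * p_vocab                                      # p_gen * P_vocab(w)
    oov_part = torch.zeros(p_vocab.size(0), n_src_oov, device=p_vocab.device)
    extended = torch.cat([vocab_part, oov_part], dim=-1)              # extend to vocab + OOV slots
    # add (1 - p_gen) * sum_{i: w_i = w} a_i^t by accumulating attention onto source word ids
    return extended.scatter_add_(-1, src_ids_ext, (1.0 - p_gen) * a_t)
```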

Loss function:

The loss function is the same as for the baseline (the average negative log-likelihood), but computed with respect to the modified final distribution P(w) defined above.

2.3. Coverage mechanism

The coverage mechanism is proposed to address the repetition problem of Seq2Seq models.

A coverage vector c^t is maintained, which is the sum of attention distributions over all previous decoder timesteps:

\[c^t=\sum_{t'=0}^{t-1}a^{t'}\]

Note that c^0 is a zero vector, because on the first timestep, none of the source document has been covered.

The coverage vector is used as extra input to the attention mechanism:

\[a^t=\text{softmax}(e^t), \quad \text{where} \quad e_i^t=v^T \tanh(W_h h_i + W_s s_t + w_c c_i^t + b_{attn})\]

where w_c is a learnable parameter vector of same length as v.
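Building on the earlier attention sketch, the coverage term would enter the attention scores roughly as follows; again the dimensions and names are illustrative assumptions, not the official code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoverageAttention(nn.Module):
    """Attention with the extra w_c * c_i^t term inside the tanh."""
    def __init__(self, enc_dim, dec_dim, attn_dim):
        super().__init__()
        self.W_h = nn.Linear(enc_dim, attn_dim, bias=False)  # applied to encoder states h_i
        self.W_s = nn.Linear(dec_dim, attn_dim, bias=True)   # applied to s_t; bias acts as b_attn
        self.w_c = nn.Linear(1, attn_dim, bias=False)        # maps the scalar coverage c_i^t
        self.v = nn.Linear(attn_dim, 1, bias=False)          # v^T

    def forward(self, h, s_t, coverage):
        # h: (batch, src_len, enc_dim), s_t: (batch, dec_dim), coverage: (batch, src_len)
        scores = self.W_h(h) + self.W_s(s_t).unsqueeze(1) + self.w_c(coverage.unsqueeze(-1))
        e_t = self.v(torch.tanh(scores)).squeeze(-1)          # (batch, src_len)
        return F.softmax(e_t, dim=-1)                         # a^t
```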

The authors also find it necessary (see section 5) to additionally define a coverage loss to penalize repeatedly attending to the same locations:

\[\text{covloss}_t=\sum_i \min(a_i^t, c_i^t)\]

Note that the coverage loss is bounded; in particular, it is at most 1, since min(a_i^t, c_i^t) ≤ a_i^t and the attention distribution a^t sums to 1.

Finally, the coverage loss, weighted by a hyperparameter λ, is added to the primary loss function to yield a new composite loss function (a short sketch of the coverage bookkeeping follows the equation):

\[loss=\frac{1}{T}\sum_{t=0}^{T-1} loss_t=\frac{1}{T}\sum_{t=0}^{T-1}\left(-\log P(w^*_t) +\lambda \sum_i \min(a_i^t, c_i^t) \right)\]
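A minimal sketch of the coverage bookkeeping and the composite per-step loss, with the decoding-loop wiring omitted and all variable names chosen for illustration:

```python
import torch

def coverage_step(coverage, a_t):
    """coverage: (batch, src_len) coverage vector c^t (zeros at t = 0);
    a_t: (batch, src_len) current attention distribution a^t."""
    # covloss_t = sum_i min(a_i^t, c_i^t): penalizes re-attending to already-covered positions
    covloss_t = torch.minimum(a_t, coverage).sum(dim=-1)
    new_coverage = coverage + a_t    # c^{t+1} = c^t + a^t
    return covloss_t, new_coverage

def composite_step_loss(log_p_target, covloss_t, cov_lambda=1.0):
    # per-step loss: -log P(w*_t) + lambda * covloss_t
    return -log_p_target + cov_lambda * covloss_t
```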

3. Related Work

Skipped.

4. Dataset

CNN/Daily Mail dataset (Hermann et al., 2015; Nallapati et al., 2016), which contains online news articles (781 tokens on average) paired with multi-sentence summaries (3.75 sentences or 56 tokens on average).

We use the scripts supplied by Nallapati et al. (2016) to obtain the same version of the data, which has 287,226 training pairs, 13,368 validation pairs and 11,490 test pairs.

Both of the previously published results on this dataset (Nallapati et al., 2016, 2017) use the anonymized version of the data, which has been pre-processed to replace each named entity, e.g., The United Nations, with a unique identifier for the example pair, e.g., @entity5. By contrast, we operate directly on the original text (the non-anonymized version of the data), which we believe is the favorable problem to solve because it requires no pre-processing.

5. Experiments

For all experiments, our model has 256-dimensional hidden states and 128-dimensional word embeddings. For the pointer-generator models, we use a vocabulary of 50k words for both source and target – note that due to the pointer network’s ability to handle OOV words, we can use a smaller vocabulary size than Nallapati et al.’s (2016) 150k source and 60k target vocabularies. For the baseline model, we also try a larger vocabulary size of 150k.

Note that the pointer and the coverage mechanism introduce very few additional parameters to the network: for the models with vocabulary size 50k, the baseline model has 21,499,600 parameters, the pointer-generator adds 1153 (256 * 2 * 2 + 128 + 1) extra parameters (w_{h^∗}, w_s, w_x, and b_{ptr} in the generation probability p_{gen}), and coverage adds 512 (256 * 2 directions) extra parameters (w_c in the coverage-augmented attention).
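The extra-parameter arithmetic can be sanity-checked in a few lines. Note that interpreting the 256 * 2 * 2 term as a 512-dimensional context vector plus a 512-dimensional decoder state (e.g. LSTM hidden and cell states concatenated) is an assumption about the implementation on my part, not something stated in the paper.

```python
hidden_dim, emb_dim = 256, 128
context_dim = 2 * hidden_dim      # bidirectional encoder -> 512-dim h*_t
dec_state_dim = 2 * hidden_dim    # assumed 512-dim decoder state (see note above)

pointer_extra = context_dim + dec_state_dim + emb_dim + 1   # w_h*, w_s, w_x, b_ptr
coverage_extra = 2 * hidden_dim                             # w_c, same length as v

print(pointer_extra)   # 1153
print(coverage_extra)  # 512
```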

Training details can be found in the paper.

About the coverage mechanism:

To obtain our final coverage model, we added the coverage mechanism with the coverage loss weighted by λ=1 (as in the composite loss above), and trained for a further 3000 iterations (about 2 hours).

We tried training the coverage model without the loss function but found this to be ineffective, with no discernible reduction in repetition. We also tried training with coverage from the first iteration rather than as a separate training phase, but found that in the early phase of training, the coverage objective interfered with the main objective, reducing overall performance.

6. Results

6.1. Preliminaries

Results are given in Table 1. The models are evaluated with the standard ROUGE metric, reporting the F1 scores for ROUGE-1, ROUGE-2 and ROUGE-L (which respectively measure the word overlap, bigram overlap, and longest common subsequence between the reference summary and the summary to be evaluated). ROUGE scores are obtained using the pyrouge package (pypi.python.org/pypi/pyrouge/0.1.3).

We also evaluate with the METEOR metric, both in exact match mode (rewarding only exact matches between words) and full mode (which additionally rewards matching stems, synonyms and paraphrases). (http://www.cs.cmu.edu/~alavie/METEOR)
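For reference, scoring with pyrouge typically looks like the hedged sketch below (pyrouge wraps the original ROUGE-1.5.5 Perl script, which must be installed separately); the directory paths and filename patterns here are illustrative assumptions.

```python
from pyrouge import Rouge155

r = Rouge155()
r.system_dir = "decoded/"                      # model-generated summaries (assumed path)
r.model_dir = "reference/"                     # reference (gold) summaries (assumed path)
r.system_filename_pattern = r"(\d+)_decoded.txt"
r.model_filename_pattern = "#ID#_reference.txt"

output = r.convert_and_evaluate()
scores = r.output_to_dict(output)
print(scores["rouge_1_f_score"], scores["rouge_2_f_score"], scores["rouge_l_f_score"])
```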

2020-07-09-blog-post-22-4

6.2. Observations

Baselines:

Both baseline models perform poorly with respect to ROUGE and METEOR, and the larger vocabulary size (150k) does not seem to help; the 50k baseline actually performs slightly better.

Factual details are frequently reproduced incorrectly, often replacing an uncommon (but in-vocabulary) word with a more common alternative. For example in Figure 1, the baseline model appears to struggle with the rare word thwart, producing destabilize instead, which leads to the fabricated phrase destabilize nigeria's economy. Even more catastrophically, the summaries sometimes devolve into repetitive nonsense, such as the third sentence produced by the baseline model in Figure 1. In addition, the baseline model cannot reproduce OOV words (such as muhammadu buhari in Figure 1). Further examples of all these problems are provided in the supplementary material.

Pointer-generator:

The pointer-generator model achieves much better ROUGE and METEOR scores than the baseline, despite many fewer training epochs.

OOV words are handled easily, factual details are almost always copied correctly, and there are no fabrications (see Figure 1). However, repetition is still very common.

Pointer-generator with coverage:

The pointer-generator model with coverage improves the ROUGE and METEOR scores further, convincingly surpassing the best abstractive model of Nallapati et al. (2016).

Despite the brevity of the coverage training phase (about 1% of the total training time), the repetition problem is almost completely eliminated, which can be seen both qualitatively (Figure 1) and quantitatively (Figure 4). However, our best model does not quite surpass the ROUGE scores of the lead-3 baseline, nor the current best extractive model (Nallapati et al., 2017).

2020-07-09-blog-post-22-1

2020-07-09-blog-post-22-5

7. Discussion

7.1. Comparison with extractive systems

Table 1 shows stronger performance by extractive systems (including the lead-3 baseline, which simply uses the first three sentences of the article as the summary).

Explanation:

  1. News articles tend to be structured with the most important information at the start
  2. The ROUGE metric naturally favors extractive approaches, while the quality of abstractive summaries is more subjective.

7.2. How abstractive is our model?

Figure 6 shows that the final model copies whole article sentences 35% of the time; by comparison, the reference summaries do so only 1.3% of the time.

2020-07-09-blog-post-22-6

8. Conclusion

This paper introduces a copying mechanism to the abstractive summarization task and achieves further improvements with a coverage mechanism. It is an interesting attempt and obtains relatively promising results.

For me there are, however, still several drawbacks to overcome:

  1. For abstractive text generation, what would be a more suitable metric than the existing ones such as BLEU, ROUGE, and METEOR?
  2. The copying mechanism improves model performance but degrades the abstractive capability, which remains a problem to solve.