[Paper-ST] Phone Features Improve Speech Translation



Last Updated: 2020-06-09

This paper, Phone Features Improve Speech Translation, is by researchers from JHU and CMU and was accepted at ACL 2020. It is recommended for its comprehensive experiments and analysis.

Code: github.com/esalesky/xnmt-devel (seems not released yet)

The authors compared cascaded and end-to-end models across high, medium, and low-resource conditions, and showed that cascades remain stronger baselines.

Further, the authors introduced two methods to incorporate phone features into ST models, which improve both architectures and close the gap between end-to-end models and cascades.

1. Introduction

The authors propose two simple heuristics to integrate phoneme-level information into neural speech translation models:

(1) as a more robust intermediate representation in a cascade

(2) as a concatenated embedding factor.

Data: Fisher Spanish–English dataset

We compare to recent work using phone segmentation for end-to-end speech translation (Salesky et al., 2019), and show that our methods outperform this model by up to 20 BLEU on our lowest-resource condition.

Finally, we test model robustness by varying the quality of our phone features, which may indicate which models will better generalize across differently-resourced conditions.

2. Models with Phone Supervision

Two proposed methods to incorporate phone features into cascaded and end-to-end models (items 1 and 2), followed by the phone segmentation baseline they are compared against (item 3):


1. Phone cascade: uses phone labels as the ASR output and the machine translation input

2. Phone end-to-end: concatenates trainable phone embeddings to typical speech feature vector input. Note that this method maintains the same source sequence length as the original speech feature sequence.

3. Phone Segmentation (E2E baseline): uses phone boundaries to segment consecutive speech frames by averaging a variable number of features with the same phone label. This significantly reduces source sequence lengths (by ∼80%), reducing the number of model parameters and memory.
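
To make the phone segmentation baseline concrete, here is a minimal sketch of averaging runs of consecutive frames that share a phone label. It is an illustration only, not the authors' code: the function name `average_by_phone_segment` and the toy 2-dimensional frames are assumptions, and plain Python lists stand in for real filterbank arrays.

```python
from itertools import groupby
from typing import List, Sequence


def average_by_phone_segment(
    frames: Sequence[Sequence[float]], phone_labels: Sequence[str]
) -> List[List[float]]:
    """Average each run of consecutive frames that share a phone label."""
    if len(frames) != len(phone_labels):
        raise ValueError("need exactly one phone label per frame")
    averaged = []
    start = 0
    for _label, run in groupby(phone_labels):
        run_length = len(list(run))
        segment = frames[start:start + run_length]
        # Column-wise mean over the frames in this segment.
        averaged.append([sum(col) / run_length for col in zip(*segment)])
        start += run_length
    return averaged


# Toy example (hypothetical values): 5 frames, 2 features each.
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
labels = ["a", "a", "a", "s", "s"]
segments = average_by_phone_segment(frames, labels)
print(len(frames), "->", len(segments))  # 5 frames become 2 segments
```

Only the positions where the label changes matter here; the label values themselves are discarded after segmentation.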

3. Data

The Fisher Spanish-English Corpus contains 160 hours of Spanish telephone speech, split into 138k utterances. Standard dev/test sets are used. For the medium- and low-resource experiments, 40-hour and 20-hour subsets of the data are randomly selected.

4. Generating Phone Supervision

We extract 40-dimensional Mel filter bank features with per-speaker mean and variance normalization using Kaldi (Povey et al., 2011). We train an HMM/GMM system on the full Fisher Spanish dataset with the Kaldi recipe (Povey et al., 2011), using the Spanish CALLHOME Lexicon (LDC96L16), and compute per-frame phone alignments with the triphone model (tri3a) with LDA+MLLT features. This yields 50 phone labels, including silence, noise, and laughter.
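
As a rough illustration of per-speaker mean and variance normalization, the sketch below pools all frames from each speaker, computes per-dimension statistics, and standardizes every frame. This is not Kaldi's implementation; the function name `per_speaker_cmvn`, the `(speaker_id, frames)` input layout, and the choice to leave constant dimensions unscaled are assumptions made for the example.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Frames = List[List[float]]


def per_speaker_cmvn(
    utterances: Sequence[Tuple[str, Frames]]
) -> List[Tuple[str, Frames]]:
    """Standardize each feature dimension using statistics pooled per speaker."""
    # Pool every frame belonging to the same speaker.
    pooled: Dict[str, Frames] = defaultdict(list)
    for speaker, frames in utterances:
        pooled[speaker].extend(frames)

    # Per-speaker, per-dimension mean and (population) standard deviation.
    stats = {}
    for speaker, frames in pooled.items():
        n = len(frames)
        columns = list(zip(*frames))
        means = [sum(col) / n for col in columns]
        # Assumption: a dimension with zero variance is left unscaled (std = 1).
        stds = [
            math.sqrt(sum((x - m) ** 2 for x in col) / n) or 1.0
            for col, m in zip(columns, means)
        ]
        stats[speaker] = (means, stds)

    normalized = []
    for speaker, frames in utterances:
        means, stds = stats[speaker]
        normalized.append(
            (speaker, [[(x - m) / s for x, m, s in zip(f, means, stds)] for f in frames])
        )
    return normalized
```

After this step, each speaker's features have zero mean and unit variance per dimension, which removes speaker- and channel-level offsets before alignment and model training.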

To leverage our better-performing neural ASR models for phone generation, we create essentially a ‘2-pass’ alignment procedure:

  1. generating a transcript
  2. using this transcript to force align phones.

Table 1 shows the mapping between phone quality and the ASR models used for phone feature generation. This procedure enables us to both improve phone alignment quality and also match training and inference procedures for phone generation for our translation models.


5. Model & Training Procedure

The authors used xnmt (Neubig et al., 2018) to build encoder-decoder models.

Our pyramidal encoder uses 3-layer BiLSTMs with linear network-in-network (NiN) projections and batch normalization between layers (Sperber et al., 2019; Zhang et al., 2017).

We use single layer MLP attention (Bahdanau et al., 2015) with 128 units and 1 decoder layer as opposed to 3 or 4 in previous work – we did not see consistent benefits from additional depth.

Please refer to the original paper for other details.

6. Prior Work: Cascaded vs End-to-End Models on Fisher Spanish-English

The results are shown in Table 2. Please refer to the paper for the detailed baseline settings (e.g. parameters, additional data, etc.).


7. Results Using Phone Features

Table 3 shows that the phone cascade outperforms the phone end-to-end model, which in turn outperforms the baseline cascade.

The hybrid cascade uses an ASR model with phone-informed downsampling and BPE targets (Salesky et al., 2019). This improves the WER of the ASR model to 28.1 on dev and 23.2 on test, matching Weiss et al. (2017)’s state-of-the-art on test (23.2) and approaching it on dev (25.7). It is the best-performing model on the full dataset. However, in lower-resource conditions, it does not perform as favorably as the phone-featured models, as shown in Figure 2. This suggests improving ASR may enable cascades to perform better in high-resource conditions, but in lower-resource conditions it is not as effective as utilizing phone features.



Training time


Comparing to previous work using additional data

Incorporating phone information makes the models more efficient.

We note that our phone models further outperform previous work trained with additional corpora. The attention-passing model of Sperber et al. (2019) trained on additional parallel Spanish-English text yields 38.8 on test on the full dataset, which Salesky et al. (2019) matches on the full dataset and our proposed models exceed, with the phone cascade yielding a similar result (37.4) trained on only 40 hours.

Pre-training with 300 hours of English ASR data and fine-tuning on 20 hours of Spanish-English data, Stoian et al. (2020); Bansal et al. (2019) improve their end-to-end models from ≈10 BLEU to 20.2. All three of our proposed models exceed this mark trained on 20 hours of Fisher.

8. Model Robustness & Further Analysis

8.1. Phone Cascade

Figure 3 compares the impact of different phone qualities for downstream MT. Note that with gold alignments, translation performance is similar to text-based translation.



The authors collapse consecutive phones with the same label in the phone cascade models.

For the phone cascade models compared in Figure 3, we collapse adjacent consecutive phones with the same label, i.e. when three consecutive frames have been aligned to the same phone label ‘B B B’ we have reduced the sequence to a single phone ‘B’ for translation.
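
The collapsing step can be sketched in a few lines. This is an illustrative version using Python's `itertools.groupby`, not the authors' code; the function name `collapse_repeated_phones` and the example labels are hypothetical.

```python
from itertools import groupby
from typing import List, Sequence


def collapse_repeated_phones(frame_labels: Sequence[str]) -> List[str]:
    """Keep one label per run of identical consecutive frame-level labels."""
    return [label for label, _run in groupby(frame_labels)]


# Three frames aligned to 'B' collapse to a single 'B'; a later,
# non-adjacent 'B' is a separate run and is therefore kept.
print(collapse_repeated_phones(["B", "B", "B", "a", "a", "B"]))  # ['B', 'a', 'B']
```

Only adjacent duplicates are merged, so a phone that recurs later in the utterance still appears again in the MT input.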

Translating the full, uncollapsed sequence of phones hurts performance.

When translating the full sequence of redundant frame-level phone labels (i.e. the same sequence length as the number of frames), all models performed on average 0.6 BLEU worse on the full 160hr dataset, 1.8 BLEU worse with 40 hours, and 4.1 BLEU worse with 20 hours – a 13% decrease in performance solely from non-uniqued sequences.

8.2. Phone End-to-End

Our phone end-to-end model concatenates trainable embeddings for phone labels to frame-level filterbank features, associating similar feature vectors globally across the corpus, as opposed to locally within an utterance with the phone-averaged embeddings.
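
A minimal sketch of the concatenation follows, assuming a lookup table with one embedding vector per phone label. In the actual model the embeddings are trainable parameters; here they are fixed random vectors purely for illustration, and the names `build_embedding_table` and `concat_phone_embeddings` are hypothetical.

```python
import random
from typing import Dict, List, Sequence


def build_embedding_table(
    phone_inventory: Sequence[str], dim: int, seed: int = 0
) -> Dict[str, List[float]]:
    """One vector per phone label (random here; trainable in a real model)."""
    rng = random.Random(seed)
    return {p: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for p in phone_inventory}


def concat_phone_embeddings(
    frames: Sequence[Sequence[float]],
    phone_labels: Sequence[str],
    table: Dict[str, List[float]],
) -> List[List[float]]:
    """Append each frame's phone embedding to its filterbank features."""
    if len(frames) != len(phone_labels):
        raise ValueError("need exactly one phone label per frame")
    return [list(frame) + table[label] for frame, label in zip(frames, phone_labels)]


# Toy example: 3 frames of 2 features, 4-dimensional phone embeddings.
table = build_embedding_table(["a", "s"], dim=4)
frames = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
inputs = concat_phone_embeddings(frames, ["a", "a", "s"], table)
print(len(inputs), len(inputs[0]))  # 3 frames kept, each now 2 + 4 = 6 wide
```

The sequence length is unchanged; only the feature dimension grows. Every frame labeled with the same phone shares the same embedding, which is what ties similar frames together across the whole corpus.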


The model’s performance degradation compared to the phone cascade in lower-resource conditions is likely due in part to these sequence lengths, as shown by our additional experiments with input redundancy for the cascade. The greater reduction in performance here using lower quality phones suggests the noise of the labels and concatenated filterbank features compound, further detracting from performance. Perhaps further investigation into the relative weights placed on the two embedding factors over the training process could close this additional gap.

8.3. Phone Segmentation: Salesky et al. (2019)

That work introduced downsampling informed by phone segmentation. Unlike our other models, the value of the phone label is not used; rather, phone alignments are used only to determine the boundary between adjacent phones for variable-length downsampling. We hypothesize that the primary reason for their BLEU improvements is the reduction in local redundancy between similar frames, as discovered in the previous section.

8.4. Quality of Phone Labels

Two examples of phone sequences:


We see the primary difference in produced phones between different models is the label values, rather than the boundaries.

We note that differences in frame-level phone boundaries would not affect our phone cascaded models, where the speech features are discarded, while they would affect our phone end-to-end models, where the phone labels are concatenated to speech feature vectors and associate them across the corpus.

9. Related Work

Skipped. Please refer to the paper.

10. Conclusion

We show that phone features significantly improve the performance and data efficiency of neural speech translation models.

Greatest improvements in low-resource settings (20 hours):

  1. E2E: 5 BLEU better than the baseline cascade
  2. cascade: 9 BLEU better than prior work

Generating phone features uses the same data as auxiliary speech recognition tasks from prior work; our experiments suggest these features are a more effective use of this data, with our models matching the performance of previous work without additional training data.