[Paper-S2ST] Direct speech-to-speech translation with a sequence-to-sequence model



Last Updated: 2020-04-02

In this blog post, I would like to introduce the paper Direct speech-to-speech translation with a sequence-to-sequence model, proposed by Google.

This paper was published at Interspeech 2019.

Code: Not Available.

1. Background

There have been many systems focusing on Speech-To-Speech Translation (S2ST), and they are mostly broken down into three components: ASR, MT, and TTS synthesis. Quite a few of them have achieved promising results, so why do we still need an end-to-end S2ST model?

1.1 Why do we need end-to-end S2ST?

I have been thinking about the following scenario. As claimed by Ethnologue.com, there are over 7,000 languages in the world, but over 3,000 of them do not have a writing system, which makes speech recognition hard.

The exact number of unwritten languages is hard to determine. Ethnologue (21st edition) has data to indicate that of the currently listed 7,117 living languages, 3,995 have a developed writing system. We don’t always know, however, if the existing writing systems are widely used. That is, while an alphabet may exist there may not be very many people who are literate and actually using the alphabet. The remaining 3,116 are likely unwritten.

Therefore, we can conclude two advantages of E2E model here.

  1. It can handle languages without a writing system, while systems that rely on ASR for intermediate text representations cannot.
  2. It reduces inference latency, since a single model replaces the cascaded pipeline.

1.2 What is challenging in S2ST?

  1. It’s hard to collect parallel input/output speech pairs.
  2. Uncertain alignment between two spectrograms.

1.3 Previous research

The authors also investigated previous research on S2ST systems built by combining different sub-systems:

  1. [5, 6] gave MT access to the lattice of the ASR.
  2. [7,8] integrated acoustic and translation models using a stochastic finite-state transducer which can decode the translated text directly using Viterbi search.
  3. [9] used unsupervised clustering to find F0-based prosody features and transfer intonation from the source speech to the target.
  4. [10] augmented MT to jointly predict translated words and emphasis, in order to improve expressiveness of the synthesized speech.
  5. [11] used a neural network to transfer duration and power from the source speech to the target.
  6. [12] transferred the source speaker’s voice to the synthesized translated speech by mapping hidden Markov model states from ASR to TTS.
  7. Recent work on neural TTS has focused on adapting to new voices with limited reference data [13–16].

as well as previous research on end-to-end Speech-To-Text translation (ST):

  1. Initial approaches [17,18] performed worse than a cascade of an ASR model and an MT model.
  2. [19,20] achieved better end-to-end performance by leveraging weakly supervised data with multitask learning.
  3. [21] further showed that use of synthetic training data can work better than multitask training.

Finally, there is one existing research on end-to-end S2ST:

  1. [25] demonstrated an attention-based direct S2ST model on a toy dataset with a 100-word vocabulary.

In this work, the authors take advantage of both synthetic training targets and multitask training. And the model is trained on real speech, including spontaneous telephone conversations, at a much larger scale.

2. Translatotron (proposed model)

The authors leverage high-level representations of the source/target text by applying multi-task learning.

The model does not perform as well as a baseline cascaded system.

2.1 Model Architecture

Let’s first have an overview of the model architecture.


The model is composed of the following components (the first three are trained separately; the auxiliary decoders are trained jointly with the sequence-to-sequence model):

  1. (Blue) an attention-based sequence-to-sequence network to generate target spectrograms.

    The sequence-to-sequence encoder stack maps 80-channel log-mel spectrogram input features into hidden states which are passed through an attention-based alignment mechanism to condition an autoregressive decoder, which predicts 1025-dim log spectrogram frames corresponding to the translated speech.

    The spectrogram decoder uses an architecture similar to Tacotron 2 TTS model [26], including pre-net, autoregressive LSTM stack, and post-net components.

  2. (Red) a vocoder to convert spectrograms to time-domain waveforms.

  3. (Green, optional) a speaker encoder to enable cross-language voice conversion simultaneously with translation.

  4. (Light blue, optional) two individual auxiliary decoders to predict source and target phoneme sequences.
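To make the attention-based alignment in the first component more concrete, here is a minimal numpy sketch of a single dot-product attention step. Note this is illustrative only: the paper's model uses a deeper BLSTM encoder stack with multihead attention, and all the sizes below are made up for the example.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(decoder_state, encoder_states):
    """One attention step: score each encoder frame against the
    current decoder state, then return the weighted context vector."""
    scores = encoder_states @ decoder_state   # (T,) one score per input frame
    weights = softmax(scores)                 # attention distribution over frames
    context = weights @ encoder_states        # (d,) summary of the input
    return context, weights

# Toy example: 5 encoder frames, hidden size 4 (illustrative sizes only)
rng = np.random.default_rng(0)
enc = rng.normal(size=(5, 4))
dec = rng.normal(size=(4,))
context, weights = attend(dec, enc)
```

At each decoder step the context vector is concatenated with the decoder state to predict the next spectrogram frame, which is how the model handles the uncertain alignment between the two spectrograms mentioned in Section 1.2.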

2.2 Experiments

The authors conduct experiments on 3 tasks:

  1. Conversational Spanish-to-English. (synthesized speech)
  2. Fisher Spanish-to-English. (synthesized speech)
  3. Cross language voice transfer. (real speech)


The models are evaluated with two kinds of metrics:

  1. Objective: waveform -> pretrained ASR model -> BLEU against the reference text.
  2. MOS: speech naturalness & speaker similarity (for task 3).
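The objective metric transcribes the synthesized waveform with a pretrained ASR model and scores the transcript with BLEU. As a rough illustration, here is a minimal sentence-level BLEU (clipped n-gram precision, geometric mean, brevity penalty); real evaluations use standard tooling such as multi-bleu or sacreBLEU rather than this unsmoothed sketch.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    """Minimal sentence-level BLEU: clipped n-gram precisions for
    n=1..max_n, geometric mean, multiplied by a brevity penalty."""
    hyp, ref = hypothesis.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((h & r).values())   # clipped n-gram matches
        total = max(sum(h.values()), 1)
        if overlap == 0:
            return 0.0                    # no smoothing in this sketch
        log_prec += math.log(overlap / total) / max_n
    bp = min(1.0, math.exp(1 - len(ref) / len(hyp)))  # brevity penalty
    return bp * math.exp(log_prec)

print(bleu("hello how are you", "hello how are you"))  # identical -> 1.0
```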

Pretrained ASR model: a 16k Word-Piece attention-based ASR model from [41] trained on the 960-hour LibriSpeech corpus [42], which obtained word error rates of 4.7% and 13.4% on the test-clean and test-other sets, respectively.
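For reference, word error rate (WER) is the word-level Levenshtein distance normalized by the reference length. A small self-contained implementation (my own sketch, not from the paper):

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + insertions + deletions)
    between the word sequences, divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[-1][-1] / len(ref)

print(wer("the cat sat", "the cat sat on"))  # 1 insertion / 3 words ~ 0.333
```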

2.2.1 Conversational Spanish-to-English

In this experiment, the authors use English Tacotron 2 + Griffin-Lim synthesized target speech with background noise and reverberation in the same manner as [21].

The resulting dataset contains 979k parallel utterance pairs, containing 1.4k hours of source speech and 619 hours of synthesized target speech. The total target speech duration is much smaller because the TTS output is better endpointed, and contains fewer pauses. 9.6k pairs are held out for testing.

Input feature frames are created by stacking 3 adjacent frames of an 80-channel log-mel spectrogram as in [21].
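Frame stacking simply concatenates neighboring spectrogram frames along the feature axis, shortening the time axis and widening each step's receptive field. A minimal numpy sketch of one plausible reading (the exact striding in [21] may differ):

```python
import numpy as np

def stack_frames(spectrogram, k=3):
    """Stack k adjacent frames along the feature axis, so each output
    step carries k*80 features (drops any trailing remainder frames)."""
    t, d = spectrogram.shape
    t = (t // k) * k                      # truncate to a multiple of k
    return spectrogram[:t].reshape(t // k, k * d)

mel = np.zeros((300, 80))                 # 300 frames of 80-channel log-mel
stacked = stack_frames(mel, k=3)
print(stacked.shape)                      # (100, 240)
```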

The table below shows the results of the model with and without different auxiliary losses (multitask learning). We can see that training without any auxiliary losses leads to extremely poor performance (first row).

The model correctly synthesizes common words and simple phrases, e.g. translating “hola” to “hello”. However, it does not consistently translate full utterances. While it always generates plausible speech sounds in the target voice, the output can be independent of the input, composed of a string of nonsense syllables. This is consistent with failure to learn to attend to the input, and reflects the difficulty of the direct S2ST task.

Training using both auxiliary tasks achieved the best quality, but the performance difference between different combinations is small.


2.2.2 Fisher Spanish-to-English

This dataset contains about 120k parallel utterance pairs, spanning 127 hours of source speech.

Target speech is synthesized using Parallel WaveNet [43] with the same voice as in the previous section. The resulting dataset contains 96 hours of synthetic target speech. Following [19], input features were constructed by stacking 80-channel log-mel spectrograms, with deltas and accelerations.

  • Note that the data size is much smaller than the previous experiment, so the authors tune the parameter settings for regularization. Please refer to the paper for details.
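Deltas and accelerations are first- and second-order differences of the features over time, appended along the feature axis. A simple numpy sketch using plain finite differences (real front ends typically use a smoothed regression window instead):

```python
import numpy as np

def add_deltas(features):
    """Append first-order ("delta") and second-order ("acceleration")
    time differences along the feature axis: (T, d) -> (T, 3*d)."""
    delta = np.diff(features, axis=0, prepend=features[:1])
    accel = np.diff(delta, axis=0, prepend=delta[:1])
    return np.concatenate([features, delta, accel], axis=1)

mel = np.zeros((100, 80))  # 100 frames of 80-channel log-mel
out = add_deltas(mel)
print(out.shape)           # (100, 240)
```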

The results are as follows:


2.2.3 Subjective evaluation of speech naturalness

To evaluate the synthesis quality of the best-performing models from Tables 2 and 3, the authors use the framework from [15] to crowdsource 5-point MOS evaluations based on subjective listening tests. 1k examples were rated for each dataset, each by a single rater.

Here is the result:


2.2.4 Cross language voice transfer

This is not the core section of the paper, so I will just briefly introduce the results here.


3. Conclusion

In this paper, the authors found that it is important to use speech transcripts during training. The model achieves high translation quality on two Spanish-to-English datasets, although performance is not as good as a baseline cascade of ST and TTS models.

In addition, the authors demonstrate a variant that simultaneously transfers the source speaker’s voice to the translated speech.

The authors also state that potential strategies to improve voice transfer performance include improving the speaker encoder by adding a language adversarial loss, or by incorporating a cycle-consistency term [13] into the S2ST loss.

Other future work includes utilizing weak supervision to scale up training with synthetic data [21] or multitask learning [19,20], and transferring prosody and other acoustic factors from the source speech to the translated speech following [45–47].

Finally, here is a paper list for speech translation: https://github.com/dqqcasia/speech_translation-papers.

I will keep doing my best in my research from now on.