Google has introduced ‘Translatotron’, its first direct speech-to-speech translation system, which can maintain a speaker’s voice and tempo while converting speech into a different language.
According to Google’s AI blog, the new translation system provides “faster inference speed, naturally avoiding compounding errors between recognition and translation, making it straightforward to retain the voice of the original speaker after translation, and better handling of words that do not need to be translated.”
Translatotron uses a sequence-to-sequence network model that takes source spectrograms as input and generates spectrograms of the translated content in the target language.
The system also uses two separately trained components: a neural vocoder, which converts the output spectrograms to time-domain waveforms, and an optional speaker encoder, which can be used to maintain the character of the source speaker’s voice in the translated speech, making it sound “more natural and less jarring.”
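As a rough illustration of how those three pieces fit together, the data flow can be sketched in a few lines of Python. This is not Google's code: the function names (`translate_speech`, `toy_seq2seq`, `toy_vocoder`, `toy_speaker_encoder`) are hypothetical, and the toy stand-ins only do simple arithmetic so the sketch can run without any trained model.

```python
from typing import Callable, List, Optional

Spectrogram = List[List[float]]  # frames x frequency bins
Waveform = List[float]           # time-domain samples
Embedding = List[float]          # speaker characteristics


def translate_speech(
    source: Spectrogram,
    seq2seq: Callable[[Spectrogram, Optional[Embedding]], Spectrogram],
    vocoder: Callable[[Spectrogram], Waveform],
    speaker_encoder: Optional[Callable[[Spectrogram], Embedding]] = None,
) -> Waveform:
    """Hypothetical pipeline: spectrogram in, translated waveform out."""
    # The speaker encoder is optional; without it no voice conditioning is used.
    speaker = speaker_encoder(source) if speaker_encoder is not None else None
    # The sequence-to-sequence model maps source spectrograms to target ones.
    target = seq2seq(source, speaker)
    # The vocoder turns the output spectrograms into a time-domain waveform.
    return vocoder(target)


# Toy stand-ins (assumptions for illustration only, not real models).
def toy_seq2seq(spec: Spectrogram, speaker: Optional[Embedding]) -> Spectrogram:
    # "Translation" here is just reversing the frame order; the speaker
    # embedding, if present, is added to every bin as a simple conditioning.
    bias = speaker[0] if speaker else 0.0
    return [[value + bias for value in frame] for frame in reversed(spec)]


def toy_vocoder(spec: Spectrogram) -> Waveform:
    # One output sample per frame: the mean of that frame's bins.
    return [sum(frame) / len(frame) for frame in spec]


def toy_speaker_encoder(spec: Spectrogram) -> Embedding:
    # A one-number "embedding": the mean over all bins of all frames.
    values = [value for frame in spec for value in frame]
    return [sum(values) / len(values)]


source = [[1.0, 3.0], [2.0, 4.0], [0.0, 2.0]]
plain = translate_speech(source, toy_seq2seq, toy_vocoder)
voiced = translate_speech(source, toy_seq2seq, toy_vocoder, toy_speaker_encoder)
```

The point of the sketch is the shape of the pipeline: there is no intermediate text step, and leaving out the speaker encoder changes only the conditioning passed to the translation model, not the rest of the flow.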
Google then concluded: “To the best of our knowledge, Translatotron is the first end-to-end model that can directly translate speech from one language into speech in another language. It is also able to retain the source speaker’s voice in the translated speech.”
“We hope that this work can serve as a starting point for future research on end-to-end speech-to-speech translation systems,” it added.
(Photo source: medium.com / ai.googleblog.com)