TransFace: Unit-Based Audio-Visual Speech Synthesizer for Talking Head Translation
Direct speech-to-speech translation achieves high-quality results through the introduction of discrete units obtained from self-supervised learning. This approach circumvents the delays and cascading errors associated with model cascading. However, talking head translation, which converts audio-visual speech (i.e., talking head video) from one language into another, still faces several challenges compared to audio speech translation: (1) Existing methods invariably rely on cascading, synthesizing via both audio and text, which results in delays and cascading errors. (2) Talking head translation has a limited set of reference frames; if the generated translation exceeds the length of the original speech, the video sequence must be supplemented by repeating frames, leading to jarring video transitions. In this work, we propose \textbf{TransFace}, a model for talking head translation that directly translates audio-visual speech into audio-visual speech in other languages. It consists of a speech-to-unit translation model that converts audio speech into discrete units and a unit-based audio-visual speech synthesizer, Unit2Lip, that re-synthesizes synchronized audio-visual speech from the discrete units in parallel. Furthermore, we introduce a Bounded Duration Predictor that ensures isometric talking head translation and prevents duplicate reference frames. Experiments demonstrate that our proposed Unit2Lip significantly improves synchronization (1.601 and 0.982 on LSE-C for the original and generated audio speech, respectively) and achieves a 4.35$\times$ speedup in inference on LRS2. In addition, TransFace achieves BLEU scores of 61.93 and 47.55 for Es-En and Fr-En on LRS3-T, with 100\% isochronous translations.
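To make the bounded-duration idea concrete, the sketch below shows one way a duration predictor could cap the total predicted duration at the source clip length so that no reference frames need to be repeated. It is a minimal illustration under stated assumptions: the module name \texttt{BoundedDurationPredictor}, the layer sizes, and the rescaling rule are placeholders for exposition, not the paper's implementation.

\begin{verbatim}
import torch
import torch.nn as nn

class BoundedDurationPredictor(nn.Module):
    # Illustrative sketch only: names, layer sizes, and the rescaling rule
    # are assumptions for exposition, not the paper's implementation.
    def __init__(self, unit_dim: int, hidden: int = 256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(unit_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, unit_emb: torch.Tensor, src_frames: int) -> torch.Tensor:
        # unit_emb: (num_units, unit_dim) embeddings of the translated
        # discrete units; predict at least one video frame per unit.
        raw = torch.relu(self.proj(unit_emb)).squeeze(-1) + 1.0
        total = raw.sum()
        # Rescale so the summed duration stays (approximately) within the
        # source clip length, avoiding repeated reference frames.
        if total > src_frames:
            raw = raw * (src_frames / total)
        return torch.clamp(raw.round().long(), min=1)

# Hypothetical usage: 20 translated units, a 75-frame source clip.
predictor = BoundedDurationPredictor(unit_dim=512)
durations = predictor(torch.randn(20, 512), src_frames=75)
\end{verbatim}

The key design point illustrated here is that the bound is enforced against the source video length rather than left unconstrained, which is what yields isometric translations without duplicated frames.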