Hello everyone 👋, my name is Srikar Vardhan, and I'm a final-year student at NIT Silchar.
This project is my attempt at building an end-to-end speech-to-speech translation pipeline with lip-syncing.
The system translates English speech into Telugu (or other languages), preserves the speaker's voice, and synchronizes lip movements for a natural dubbed video.
I'll walk you through the idea, implementation, challenges, and how to run it yourself.
If you like this project, please ⭐ star the repo and give credit 🙏.
You can also read my Medium blog post about this project.
The pipeline is a multi-stage process where the output of one model becomes the input for the next. This modular approach allows for flexibility and high-quality results at each step.
| Stage | Model/Tool | Input | Output |
|---|---|---|---|
| 1. Transcription | 🗣️ Whisper ASR | Video with English Speech | English Text |
| 2. Translation | 🌐 NLLB NMT | English Text | Telugu Text |
| 3. Speech Synthesis | 🔊 MMS-TTS | Telugu Text | Telugu Audio (Generic Voice) |
| 4. Voice Conversion | 🧬 RVC | Generic Audio + Speaker's Voice Sample | Telugu Audio (Original Voice) |
| 5. Lip Syncing | 👄 Wav2Lip | Original Video + Final Audio | Final Video (Synced) |
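To make the cascade concrete, here is a minimal sketch of the first three stages using Hugging Face `transformers` pipelines. The Hub model IDs and the audio file name are assumptions for illustration; the actual pipeline loads local weights from `models/` (see the repository structure below), and the RVC and Wav2Lip stages are covered later.

```python
# Minimal sketch of stages 1-3 with Hugging Face transformers.
# Hub IDs are assumptions; main.py actually uses local weights
# from the models/ folder shown in the repository structure.
import torch
from transformers import pipeline

device = 0 if torch.cuda.is_available() else -1

# Stage 1: English speech -> English text (Whisper ASR)
asr = pipeline("automatic-speech-recognition",
               model="openai/whisper-base.en", device=device)
english_text = asr("extracted_audio.wav")["text"]  # illustrative file name

# Stage 2: English text -> Telugu text (NLLB NMT)
translator = pipeline("translation", model="facebook/nllb-200-distilled-600M",
                      src_lang="eng_Latn", tgt_lang="tel_Telu", device=device)
telugu_text = translator(english_text)[0]["translation_text"]

# Stage 3: Telugu text -> Telugu speech in a generic voice (MMS-TTS)
tts = pipeline("text-to-speech", model="facebook/mms-tts-tel", device=device)
speech = tts(telugu_text)  # dict with an "audio" array and "sampling_rate"
```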
Click on the images below to watch the full videos hosted on Google Drive.
| Original Video (English) | Translated & Dubbed Video (Telugu) |
|---|---|
| ![]() | ![]() |
| Watch Original Video | Watch Dubbed Video |
```
📁 demo
├── 1.py                          # Translation pipeline (video → translated audio)
├── 1.txt
├── 3.py
├── Advanced-RVC-Inference        # Voice conversion (RVC)
│   ├── config.py
│   ├── configs/
│   │   ├── 32k.json
│   │   ├── 40k.json
│   │   └── 48k.json
│   ├── hubert_base.pt
│   ├── lib/
│   │   ├── audio.py
│   │   ├── commons.py
│   │   ├── data_utils.py
│   │   └── ...
│   ├── my_convert.py
│   ├── requirements.txt
│   ├── vc_infer_pipeline.py
│   └── weights/
│       ├── modi.pth
│       ├── model.index
│       └── ...
├── dubbed_test.mp4
├── lip.py                        # Lip-sync module (Wav2Lip)
├── main.py                       # Orchestration script (runs everything end-to-end)
├── models/
│   ├── translation/
│   │   └── nllb-en-te/
│   │       ├── config.json
│   │       ├── model.safetensors
│   │       └── ...
│   ├── tts/
│   │   └── mms-tel/
│   │       ├── config.json
│   │       ├── model.safetensors
│   │       └── ...
│   └── whisper/
│       └── base.en.pt
├── output.mp4
├── requirment.txt
├── temp/
│   └── result.avi
├── test.mp4
├── test_converted.wav
├── test_translated.wav
└── wav2Lip
    ├── checkpoints/
    │   ├── wav2lip.pth
    │   └── wav2lip_gan.pth
    ├── face_detection/
    │   ├── api.py
    │   ├── core.py
    │   └── sfd/
    │       ├── net_s3fd.py
    │       └── s3fd.pth
    ├── inference.py
    ├── models/
    │   ├── conv.py
    │   ├── syncnet.py
    │   └── wav2lip.py
    ├── preprocess.py
    └── requirements.txt
```

Clone the repository:

```
git clone https://github.com/M-SRIKAR-VARDHAN/speech-to-speech-with-lipsync.git
cd speech-to-speech-with-lipsync
```

It's recommended to use Python 3.10+ in a virtual environment:

```
conda create -n lipsync python=3.10 -y
conda activate lipsync
pip install -r requirements.txt
```

The large model files (ASR, NMT, TTS, RVC, Wav2Lip) are not included in the repo. You must download them and extract them into the correct folders as shown in the repository structure above.
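If the NMT/TTS weights live on the Hugging Face Hub, one way to pull them into the expected folders is `huggingface_hub`. The repo IDs below are assumptions; substitute whichever checkpoints you actually use.

```python
# Sketch: fetch translation/TTS weights into the folders the pipeline expects.
# Repo IDs are assumptions, not pinned requirements of this project.
from huggingface_hub import snapshot_download

snapshot_download("facebook/nllb-200-distilled-600M",
                  local_dir="models/translation/nllb-en-te")
snapshot_download("facebook/mms-tts-tel",
                  local_dir="models/tts/mms-tel")
```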
To run the entire process, use `main.py`:

```
python main.py --video_file "test.mp4" --output_file "output.mp4"
```

This command will:
- Extract audio and transcribe with Whisper.
- Translate English → Telugu using NLLB.
- Generate Telugu speech with MMS-TTS.
- Convert to the original speakerโs voice with RVC.
- Sync lips with Wav2Lip and save the final video (a minimal sketch of this call follows below).
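Under the hood, the lip-sync stage comes down to calling the upstream Wav2Lip `inference.py` script. This is a sketch under that assumption, with paths taken from the repository layout; the exact wiring lives in `lip.py`.

```python
# Sketch of the Wav2Lip call; lip.py wraps something like this.
import subprocess

subprocess.run([
    "python", "wav2Lip/inference.py",
    "--checkpoint_path", "wav2Lip/checkpoints/wav2lip_gan.pth",  # GAN checkpoint: sharper faces
    "--face", "test.mp4",             # original video with the speaker's face
    "--audio", "test_converted.wav",  # RVC-converted Telugu audio
    "--outfile", "output.mp4",        # final lip-synced video
], check=True)
```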
If you already have a dubbed video and the target audio, you can run the lip-sync module alone:

```
python lip.py --input_video dubbed_test.mp4 --audio translated_audio.wav --output_video final_output.mp4
```

Before settling on this pipeline, I tried a few approaches:

- Approach: Use a model like Google's Translatotron for direct audio-to-audio translation.
- Result: Failed in practice. The model was too complex and unreliable, especially with limited compute.
- Approach: A standard pipeline, but using a voice cloning model at the end.
- Result: This worked for English but failed when cloning for Telugu. Training a custom voice cloning model was not feasible.
- Approach: Instead of full voice cloning, I used RVC (Retrieval-based Voice Conversion).
- Result: Success! I trained a model on ~15 minutes of speech and it worked remarkably well. This approach is practical, efficient, and language-independent.
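For reference, the conversion step in this repo goes through `Advanced-RVC-Inference/my_convert.py`. The flags below are hypothetical placeholders, not that script's verified interface; check `my_convert.py` for the real arguments.

```python
# Hypothetical RVC invocation; flag names are placeholders, so
# consult Advanced-RVC-Inference/my_convert.py for the actual CLI.
import subprocess

subprocess.run([
    "python", "Advanced-RVC-Inference/my_convert.py",
    "--input", "test_translated.wav",   # generic MMS-TTS voice
    "--model", "weights/modi.pth",      # RVC model trained on ~15 min of the speaker
    "--index", "weights/model.index",   # retrieval index for timbre matching
    "--output", "test_converted.wav",   # same speech, original speaker's voice
], check=True)
```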
- The default resolution for lip-sync is 480p, which is CPU-friendly. You can increase it to 1080p if you have a GPU.
- The `wav2lip_gan.pth` checkpoint gives sharper facial results.
- You can swap models to support any target language and retrain RVC for any speaker's voice; a sketch of the swap follows below.
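As an example of that swap, retargeting from Telugu to Hindi is mostly two identifier changes. The values below are assumptions based on public NLLB/MMS naming, not settings taken from this repo.

```python
# Hypothetical language swap: Telugu -> Hindi.
TGT_LANG = "hin_Deva"               # NLLB-200 code for Hindi (was "tel_Telu")
TTS_MODEL = "facebook/mms-tts-hin"  # MMS-TTS Hindi checkpoint (was mms-tts-tel)
# The RVC speaker model is language-independent and can stay unchanged.
```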
Thank you to Revelli Eshwar for their support on this project.
- 📧 Email
- 💻 GitHub
- 🔗 LinkedIn
- 🎓 Google Scholar
- 📄 Resume
