Hello everyone 👋, my name is Srikar Vardhan, and I'm a final-year student at NIT Silchar.
This project is my attempt at building an end-to-end speech-to-speech translation pipeline with lip-syncing.
The system translates English speech into Telugu (or other languages), preserves the speaker's voice, and synchronizes lip movements for a natural dubbed video.
I'll walk you through the idea, implementation, challenges, and how to run it yourself.
If you like this project, please ⭐ star the repo and give credit 🙏.
You can also read my Medium blog post about this project.
The pipeline is a multi-stage process where the output of one model becomes the input for the next. This modular approach allows for flexibility and high-quality results at each step.
| Stage | Model/Tool | Input | Output |
|---|---|---|---|
| 1. Transcription | 🗣️ Whisper ASR | Video with English Speech | English Text |
| 2. Translation | 🌐 NLLB NMT | English Text | Telugu Text |
| 3. Speech Synthesis | 🔊 MMS-TTS | Telugu Text | Telugu Audio (Generic Voice) |
| 4. Voice Conversion | 🧬 RVC | Generic Audio + Speaker's Voice Sample | Telugu Audio (Original Voice) |
| 5. Lip Syncing | 👄 Wav2Lip | Original Video + Final Audio | Final Video (Synced) |
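To make the cascade concrete, here is a minimal sketch of the first three stages using Hugging Face `transformers` pipelines. The Hub model IDs and the audio file name are assumptions for illustration; the actual pipeline loads local weights from `models/` (see the repository structure below), and the RVC and Wav2Lip stages are covered later.

```python
# Minimal sketch of stages 1-3 with Hugging Face transformers.
# Hub IDs are assumptions; main.py actually uses local weights
# from the models/ folder shown in the repository structure.
import torch
from transformers import pipeline

device = 0 if torch.cuda.is_available() else -1

# Stage 1: English speech -> English text (Whisper ASR)
asr = pipeline("automatic-speech-recognition",
               model="openai/whisper-base.en", device=device)
english_text = asr("extracted_audio.wav")["text"]  # illustrative file name

# Stage 2: English text -> Telugu text (NLLB NMT)
translator = pipeline("translation", model="facebook/nllb-200-distilled-600M",
                      src_lang="eng_Latn", tgt_lang="tel_Telu", device=device)
telugu_text = translator(english_text)[0]["translation_text"]

# Stage 3: Telugu text -> Telugu speech in a generic voice (MMS-TTS)
tts = pipeline("text-to-speech", model="facebook/mms-tts-tel", device=device)
speech = tts(telugu_text)  # dict with an "audio" array and "sampling_rate"
```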
Click on the images below to watch the full videos hosted on Google Drive.
| Original Video (English) | Translated & Dubbed Video (Telugu) |
|---|---|
| ![]() | ![]() |
| Watch Original Video | Watch Dubbed Video |
```
📁 demo
├── 1.py                          # Translation pipeline (video → translated audio)
├── 1.txt
├── 3.py
├── Advanced-RVC-Inference        # Voice conversion (RVC)
│   ├── config.py
│   ├── configs/
│   │   ├── 32k.json
│   │   ├── 40k.json
│   │   └── 48k.json
│   ├── hubert_base.pt
│   ├── lib/
│   │   ├── audio.py
│   │   ├── commons.py
│   │   ├── data_utils.py
│   │   └── ...
│   ├── my_convert.py
│   ├── requirements.txt
│   ├── vc_infer_pipeline.py
│   └── weights/
│       ├── modi.pth
│       ├── model.index
│       └── ...
├── dubbed_test.mp4
├── lip.py                        # Lip-sync module (Wav2Lip)
├── main.py                       # Orchestration script (runs everything end-to-end)
├── models/
│   ├── translation/
│   │   └── nllb-en-te/
│   │       ├── config.json
│   │       ├── model.safetensors
│   │       └── ...
│   ├── tts/
│   │   └── mms-tel/
│   │       ├── config.json
│   │       ├── model.safetensors
│   │       └── ...
│   └── whisper/
│       └── base.en.pt
├── output.mp4
├── requirment.txt
├── temp/
│   └── result.avi
├── test.mp4
├── test_converted.wav
├── test_translated.wav
└── wav2Lip
    ├── checkpoints/
    │   ├── wav2lip.pth
    │   └── wav2lip_gan.pth
    ├── face_detection/
    │   ├── api.py
    │   ├── core.py
    │   └── sfd/
    │       ├── net_s3fd.py
    │       └── s3fd.pth
    ├── inference.py
    ├── models/
    │   ├── conv.py
    │   ├── syncnet.py
    │   └── wav2lip.py
    ├── preprocess.py
    └── requirements.txt
```

Clone the repository:

```
git clone https://github.com/M-SRIKAR-VARDHAN/speech-to-speech-with-lipsync.git
cd speech-to-speech-with-lipsync
```

It's recommended to use Python 3.10+ in a virtual environment:

```
conda create -n lipsync python=3.10 -y
conda activate lipsync
pip install -r requirements.txt
```

The large model files (ASR, NMT, TTS, RVC, Wav2Lip) are not included in the repo. You must download them and extract them into the correct folders as shown in the repository structure above.
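If the NMT/TTS weights live on the Hugging Face Hub, one way to pull them into the expected folders is `huggingface_hub`. The repo IDs below are assumptions; substitute whichever checkpoints you actually use.

```python
# Sketch: fetch translation/TTS weights into the folders the pipeline expects.
# Repo IDs are assumptions, not pinned requirements of this project.
from huggingface_hub import snapshot_download

snapshot_download("facebook/nllb-200-distilled-600M",
                  local_dir="models/translation/nllb-en-te")
snapshot_download("facebook/mms-tts-tel",
                  local_dir="models/tts/mms-tel")
```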
To run the entire process, use `main.py`:

```
python main.py --video_file "test.mp4" --output_file "output.mp4"
```

This command will:
- Extract audio and transcribe with Whisper.
- Translate English → Telugu using NLLB.
- Generate Telugu speech with MMS-TTS.
- Convert to the original speakerโs voice with RVC.
- Sync lips with Wav2Lip and save the final video (a minimal sketch of this call follows below).
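Under the hood, the lip-sync stage comes down to calling the upstream Wav2Lip `inference.py` script. This is a sketch under that assumption, with paths taken from the repository layout; the exact wiring lives in `lip.py`.

```python
# Sketch of the Wav2Lip call; lip.py wraps something like this.
import subprocess

subprocess.run([
    "python", "wav2Lip/inference.py",
    "--checkpoint_path", "wav2Lip/checkpoints/wav2lip_gan.pth",  # GAN checkpoint: sharper faces
    "--face", "test.mp4",             # original video with the speaker's face
    "--audio", "test_converted.wav",  # RVC-converted Telugu audio
    "--outfile", "output.mp4",        # final lip-synced video
], check=True)
```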
If you already have a dubbed video and the target audio, you can run the lip-sync module alone:

```
python lip.py --input_video dubbed_test.mp4 --audio translated_audio.wav --output_video final_output.mp4
```

Before settling on this pipeline, I tried a few approaches:

- Approach: Use a model like Google's Translatotron for direct audio-to-audio translation.
- Result: Failed in practice. The model was too complex and unreliable, especially with limited compute.
- Approach: A standard pipeline, but using a voice cloning model at the end.
- Result: This worked for English but failed when cloning for Telugu. Training a custom voice cloning model was not feasible.
- Approach: Instead of full voice cloning, I used RVC (Retrieval-based Voice Conversion).
- Result: Success! I trained a model on ~15 minutes of speech and it worked remarkably well. This approach is practical, efficient, and language-independent.
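For reference, the conversion step in this repo goes through `Advanced-RVC-Inference/my_convert.py`. The flags below are hypothetical placeholders, not that script's verified interface; check `my_convert.py` for the real arguments.

```python
# Hypothetical RVC invocation; flag names are placeholders, so
# consult Advanced-RVC-Inference/my_convert.py for the actual CLI.
import subprocess

subprocess.run([
    "python", "Advanced-RVC-Inference/my_convert.py",
    "--input", "test_translated.wav",   # generic MMS-TTS voice
    "--model", "weights/modi.pth",      # RVC model trained on ~15 min of the speaker
    "--index", "weights/model.index",   # retrieval index for timbre matching
    "--output", "test_converted.wav",   # same speech, original speaker's voice
], check=True)
```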
- The default resolution for lip-sync is 480p, which is CPU-friendly. You can increase it to 1080p if you have a GPU.
- The `wav2lip_gan.pth` checkpoint gives sharper facial results.
- You can swap models to support any target language and retrain RVC for any speaker's voice; a sketch of the swap follows below.
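As an example of that swap, retargeting from Telugu to Hindi is mostly two identifier changes. The values below are assumptions based on public NLLB/MMS naming, not settings taken from this repo.

```python
# Hypothetical language swap: Telugu -> Hindi.
TGT_LANG = "hin_Deva"               # NLLB-200 code for Hindi (was "tel_Telu")
TTS_MODEL = "facebook/mms-tts-hin"  # MMS-TTS Hindi checkpoint (was mms-tts-tel)
# The RVC speaker model is language-independent and can stay unchanged.
```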
Thank you to Revelli Eshwar for their support on this project.
- 📧 Email
- 💻 GitHub
- 🔗 LinkedIn
- 🎓 Google Scholar
- 📄 Resume
