
🎤 Speech-to-Speech Translation with Lip Sync


Hello everyone 👋, my name is Srikar Vardhan, and I'm a final-year student at NIT Silchar.
This project is my attempt at building an end-to-end speech-to-speech translation pipeline with lip-syncing.
The system translates English speech into Telugu (or other languages), preserves the speaker's voice, and synchronizes lip movements for a natural dubbed video.

I'll walk you through the idea, implementation, challenges, and how to run it yourself.
If you like this project, please ⭐ star the repo and give credit 🙏.

You can read my Medium blog about this project.


📊 Pipeline Overview

The pipeline is a multi-stage process where the output of one model becomes the input for the next. This modular approach allows for flexibility and high-quality results at each step.

| Stage | Model/Tool | Input | Output |
|-------|------------|-------|--------|
| 1. Transcription 🗣️ | Whisper ASR | Video with English speech | English text |
| 2. Translation 🌍 | NLLB NMT | English text | Telugu text |
| 3. Speech Synthesis 🔊 | MMS-TTS | Telugu text | Telugu audio (generic voice) |
| 4. Voice Conversion 🧬 | RVC | Generic audio + speaker's voice sample | Telugu audio (original voice) |
| 5. Lip Syncing 👄 | Wav2Lip | Original video + final audio | Final video (synced) |
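The staged flow above, where each model's output feeds the next, can be sketched as a simple chain of stage functions. This is an illustrative skeleton only, not the actual `main.py` API; the lambda bodies are placeholders standing in for the real model calls:

```python
from typing import Callable, List

# A stage transforms the previous stage's output into the next stage's input.
Stage = Callable[[object], object]

def run_pipeline(stages: List[Stage], initial_input: object) -> object:
    """Feed each stage's output into the next, mirroring the table above."""
    data = initial_input
    for stage in stages:
        data = stage(data)
    return data

# Placeholders standing in for Whisper, NLLB, MMS-TTS, RVC, and Wav2Lip.
stages = [
    lambda video: f"transcript({video})",    # 1. Whisper ASR
    lambda text:  f"telugu_text({text})",    # 2. NLLB NMT
    lambda text:  f"telugu_audio({text})",   # 3. MMS-TTS
    lambda audio: f"cloned_audio({audio})",  # 4. RVC
    lambda audio: f"synced_video({audio})",  # 5. Wav2Lip
]

result = run_pipeline(stages, "test.mp4")
```

Because every stage shares the same call shape, any one model can be swapped out (e.g. a different NMT model) without touching the rest of the chain.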

✨ Demo & Results

Click on the images below to watch the full videos hosted on Google Drive.

| Original Video (English) | Translated & Dubbed Video (Telugu) |
|--------------------------|------------------------------------|
| Watch Original Video | Watch Dubbed Video |

📂 Repository Structure

📂 demo
├── 1.py                        # Translation pipeline (video → translated audio)
├── 1.txt
├── 3.py
├── Advanced-RVC-Inference      # Voice conversion (RVC)
│   ├── config.py
│   ├── configs/
│   │   ├── 32k.json
│   │   ├── 40k.json
│   │   └── 48k.json
│   ├── hubert_base.pt
│   ├── lib/
│   │   ├── audio.py
│   │   ├── commons.py
│   │   ├── data_utils.py
│   │   └── ...
│   ├── my_convert.py
│   ├── requirements.txt
│   ├── vc_infer_pipeline.py
│   └── weights/
│       ├── modi.pth
│       ├── model.index
│       └── ...
├── dubbed_test.mp4
├── lip.py                      # Lip-sync module (Wav2Lip)
├── main.py                     # Orchestration script (runs everything end-to-end)
├── models/
│   ├── translation/
│   │   └── nllb-en-te/
│   │       ├── config.json
│   │       ├── model.safetensors
│   │       └── ...
│   ├── tts/
│   │   └── mms-tel/
│   │       ├── config.json
│   │       ├── model.safetensors
│   │       └── ...
│   └── whisper/
│       └── base.en.pt
├── output.mp4
├── requirment.txt
├── temp/
│   └── result.avi
├── test.mp4
├── test_converted.wav
├── test_translated.wav
└── wav2Lip
    ├── checkpoints/
    │   ├── wav2lip.pth
    │   └── wav2lip_gan.pth
    ├── face_detection/
    │   ├── api.py
    │   ├── core.py
    │   └── sfd/
    │       ├── net_s3fd.py
    │       └── s3fd.pth
    ├── inference.py
    ├── models/
    │   ├── conv.py
    │   ├── syncnet.py
    │   └── wav2lip.py
    ├── preprocess.py
    └── requirements.txt

โš™๏ธ Setup Instructions

1. Clone the Repository

git clone https://github.com/M-SRIKAR-VARDHAN/speech-to-speech-with-lipsync.git
cd speech-to-speech-with-lipsync

2. Create a Python Environment

It's recommended to use Python 3.10+ in a virtual environment.

conda create -n lipsync python=3.10 -y
conda activate lipsync

3. Install Dependencies

pip install -r requirements.txt

4. Download Pre-trained Models

The large model files (ASR, NMT, TTS, RVC, Wav2Lip) are not included in the repo. You must download them and extract them into the correct folders as shown in the repository structure.
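Before running the pipeline, it helps to verify that the downloaded checkpoints landed in the right places. A minimal sketch of such a check, using file paths taken from the repository structure above (`missing_models` is a hypothetical helper, not part of the repo):

```python
from pathlib import Path

# Checkpoint locations taken from the repository structure above.
REQUIRED = [
    "models/whisper/base.en.pt",
    "Advanced-RVC-Inference/hubert_base.pt",
    "wav2Lip/checkpoints/wav2lip.pth",
    "wav2Lip/face_detection/sfd/s3fd.pth",
]

def missing_models(root: str) -> list:
    """Return the required checkpoint files that are not yet present under root."""
    base = Path(root)
    return [p for p in REQUIRED if not (base / p).exists()]

# Example: print whatever is still missing from the current directory.
for path in missing_models("."):
    print(f"missing: {path}")
```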


🚀 How to Run

Full End-to-End Pipeline

To run the entire process, use main.py.

python main.py --video_file "test.mp4" --output_file "output.mp4"

This command will:

  1. Extract audio and transcribe with Whisper.
  2. Translate English → Telugu using NLLB.
  3. Generate Telugu speech with MMS-TTS.
  4. Convert to the original speaker's voice with RVC.
  5. Sync lips with Wav2Lip and save the final video.

Lip-Sync Only

If you already have a dubbed video and the target audio, you can run the lip-sync module alone.

python lip.py --input_video dubbed_test.mp4 --audio translated_audio.wav --output_video final_output.mp4
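Under the hood, the lip-sync step drives Wav2Lip's standard `inference.py`, which accepts `--checkpoint_path`, `--face`, `--audio`, and `--outfile` (the flags of the upstream Wav2Lip repo; the exact arguments `lip.py` forwards may differ). A sketch of building that command, with `wav2lip_cmd` as a hypothetical helper:

```python
import subprocess

def wav2lip_cmd(face_video: str, audio: str, outfile: str,
                checkpoint: str = "wav2Lip/checkpoints/wav2lip_gan.pth") -> list:
    """Build the argument list for Wav2Lip's standard inference script."""
    return [
        "python", "wav2Lip/inference.py",
        "--checkpoint_path", checkpoint,  # wav2lip_gan.pth gives sharper faces
        "--face", face_video,             # video whose lips will be re-synced
        "--audio", audio,                 # translated/converted speech track
        "--outfile", outfile,
    ]

cmd = wav2lip_cmd("dubbed_test.mp4", "test_converted.wav", "output.mp4")
# subprocess.run(cmd, check=True)  # uncomment to actually invoke Wav2Lip
```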

🧠 My Development Journey

Idea 1: Direct Speech-to-Speech

  • Approach: Use a model like Google's Translatotron for direct audio-to-audio translation.
  • Result: Failed in practice. The model was too complex and unreliable, especially with limited compute.

Idea 2: ASR → NMT → TTS → Voice Cloning

  • Approach: A standard pipeline, but using a voice cloning model at the end.
  • Result: This worked for English but failed when cloning for Telugu. Training a custom voice cloning model was not feasible.

Idea 3: Voice Conversion with RVC (Breakthrough 💡)

  • Approach: Instead of full voice cloning, I used RVC (Retrieval-based Voice Conversion).
  • Result: Success! I trained a model on ~15 minutes of speech and it worked remarkably well. This approach is practical, efficient, and language-independent.

📌 Additional Notes

  • The default resolution for lip-sync is 480p, which is CPU-friendly. You can increase it to 1080p if you have a GPU.
  • The wav2lip_gan.pth checkpoint gives sharper facial results.
  • You can swap models to support any target language and retrain RVC for any speaker's voice.
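Swapping the target language means picking the matching NLLB language code (FLORES-200 codes such as `tel_Telu`) and the corresponding MMS-TTS checkpoint. A small illustrative lookup table, assuming the `facebook/mms-tts-*` Hugging Face naming convention; verify that a checkpoint actually exists for your language before relying on it:

```python
# FLORES-200 codes (used by NLLB) paired with the Hugging Face
# MMS-TTS checkpoint name for a few example target languages.
TARGET_LANGS = {
    "telugu": {"nllb_code": "tel_Telu", "mms_tts": "facebook/mms-tts-tel"},
    "hindi":  {"nllb_code": "hin_Deva", "mms_tts": "facebook/mms-tts-hin"},
    "tamil":  {"nllb_code": "tam_Taml", "mms_tts": "facebook/mms-tts-tam"},
}

def lang_config(name: str) -> dict:
    """Look up the model identifiers for a target language by name."""
    return TARGET_LANGS[name.lower()]
```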

Special Thanks

Thank you to Revelli Eshwar for supporting this project.


📬 Let's Connect

