SesameAILabs-csm

A conversational speech model (CSM) implementation by Sesame AI Labs that enables text-to-speech generation with context awareness and consistent audio quality.

Description

SesameAILabs-csm is a text-to-speech model that generates natural-sounding speech conditioned on the preceding conversation. It supports multiple speakers and is fine-tuned to keep audio quality consistent, even in long conversations.

Features

Text-to-speech generation with context awareness
Multi-speaker support
Natural-sounding speech output
Contextual conversation handling
Consistent audio quality across conversations
Support for custom audio input
GPU acceleration support

Installation

Clone the repository:

git clone https://github.com/SesameAILabs/csm.git
cd csm

Install the required packages:

pip install -r requirements.txt

Log in to Hugging Face from Python (required for model download):

from huggingface_hub import login
login()
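
For scripts or notebooks that run unattended, the interactive prompt can be skipped by passing a token to the same login function. The following is a minimal sketch, assuming the token is stored in an environment variable named HF_TOKEN; the helper name login_from_env is hypothetical and not part of this repository.

```python
import os


def login_from_env(var_name: str = "HF_TOKEN") -> bool:
    """Log in to Hugging Face with a token read from the environment.

    Returns False (and does nothing) when the variable is unset or empty.
    """
    token = os.environ.get(var_name, "").strip()
    if not token:
        return False
    # Imported lazily so the helper can be defined before the package is installed.
    from huggingface_hub import login

    login(token=token)  # non-interactive form of the login() call shown above
    return True
```

If the function returns False, fall back to the interactive login() call.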

Requirements

Python 3.8+
CUDA-capable GPU (recommended)
PyTorch 2.4.0
torchaudio 2.4.0
transformers 4.49.0
huggingface_hub 0.28.1
Other dependencies listed in requirements.txt
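
To confirm that an environment matches the pinned versions above, something like the following can be run after installation. This is a sketch using only the standard library; the function name check_pins is hypothetical, and stripping local build tags such as "+cu121" before comparing is an assumption about how PyTorch wheels may report their version.

```python
from importlib import metadata

# Pins copied from the Requirements list above.
PINNED = {
    "torch": "2.4.0",
    "torchaudio": "2.4.0",
    "transformers": "4.49.0",
    "huggingface_hub": "0.28.1",
}


def check_pins(pins):
    """Return {package: (expected, found)} for every package that does not match."""
    problems = {}
    for name, expected in pins.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed at all
        # Ignore a local build tag (the part after "+") when comparing.
        if found is None or found.split("+")[0] != expected:
            problems[name] = (expected, found)
    return problems


if __name__ == "__main__":
    for name, (expected, found) in check_pins(PINNED).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

An empty result means every pinned package is installed at the expected version.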

Usage

Basic Usage

from generator import load_csm_1b
import torchaudio

# Initialize the generator
generator = load_csm_1b(device="cuda")

# Generate speech
audio = generator.generate(
    text="Hello from Sesame.",
    speaker=0,
    context=[],
    max_audio_length_ms=10_000,
)

# Save the generated audio
torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
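
The example above hard-codes device="cuda", while the requirements list a GPU only as recommended. A possible way to choose the device at runtime is sketched below. The helper pick_device is hypothetical, and whether the model runs at a usable speed on CPU is not stated in this README, so treat the CPU fallback as an assumption.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is missing; report "cpu" so the check itself does not crash.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Using device: {device}")
# The result would then replace the hard-coded string:
# generator = load_csm_1b(device=device)
```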

Contextual Conversation

from generator import load_csm_1b, Segment
import torchaudio

# Initialize the generator
generator = load_csm_1b(device="cuda")

# Define speakers, transcripts, and audio paths
speakers = [0]
transcripts = ["Hey how are you doing."]
audio_paths = ["conversational_b.wav"]

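# Load an audio file and resample it to the generator's sample rate
# (squeeze(0) drops the channel dimension, so this assumes a mono file)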
def load_audio(audio_path):
    audio_tensor, sample_rate = torchaudio.load(audio_path)
    audio_tensor = torchaudio.functional.resample(
        audio_tensor.squeeze(0), orig_freq=sample_rate, new_freq=generator.sample_rate
    )
    return audio_tensor

# Create segments
segments = [
    Segment(text=transcript, speaker=speaker, audio=load_audio(audio_path))
    for transcript, speaker, audio_path in zip(transcripts, speakers, audio_paths)
]

# Generate audio with context
audio = generator.generate(
    text="Your response text here",
    speaker=1,
    context=segments,
    max_audio_length_ms=50_000,
)

# Save the generated audio
torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
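
The segments above are built by zipping three parallel lists, and zip() silently drops trailing items when the lists differ in length, which would quietly shorten the context. A small guard can catch that before any audio is loaded. This is a sketch only: pair_turns is a hypothetical helper, and the second transcript and the file name conversational_a.wav are illustrative values, not files shipped with this repository.

```python
def pair_turns(transcripts, speakers, audio_paths):
    """Return a list of (transcript, speaker, audio_path) tuples.

    Raises ValueError if the three lists differ in length, since zip()
    would otherwise drop the unmatched trailing items without warning.
    """
    lengths = {
        "transcripts": len(transcripts),
        "speakers": len(speakers),
        "audio_paths": len(audio_paths),
    }
    if len(set(lengths.values())) != 1:
        raise ValueError(f"context lists differ in length: {lengths}")
    return list(zip(transcripts, speakers, audio_paths))


# Two prior turns from two different speakers (illustrative values).
turns = pair_turns(
    ["Hey how are you doing.", "Pretty good, pretty good."],
    [0, 1],
    ["conversational_a.wav", "conversational_b.wav"],
)
```

Each tuple in turns can then be passed to Segment(...) in the same way as in the list comprehension above.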

Model Details

The model is automatically downloaded from the Hugging Face Hub when first used. It includes:

Encoder model
Decoder model
Multiple speaker embeddings
Configuration files

Author

Nidhi Yashwanth (github.com/nidhiyashwanth)

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

Sesame AI Labs for developing and maintaining the model
Hugging Face for hosting the model and providing the transformers library
The PyTorch team for the deep learning framework
