A Home Assistant custom integration that connects to the Cartesia Sonic text-to-speech API, giving HA access to Cartesia's library of 600+ voices across 42 languages with fine-grained control over speed, volume, and emotion.
Disclaimer: This is an unofficial, community-developed integration. It is not affiliated with, endorsed by, or supported by Cartesia AI. For Cartesia API support, visit cartesia.ai or their Discord. For integration issues, please open a GitHub issue on this repository.
- Full `tts.speak` support in Home Assistant.
- Three Cartesia models: `sonic-3.5` (recommended), `sonic-3`, and `sonic-turbo`.
- 600+ voices filterable by language in the config UI.
- 59 emotion presets for expressive speech.
- Speed (0.6 to 1.5) and volume (0.5 to 2.0) controls.
- All generation parameters overridable per `tts.speak` call.
- SSML passthrough: embed Cartesia tags directly in message text.
- Config flow setup and reconfiguration entirely through the HA UI.
- API key can be changed at any time via Reconfigure or the automatic reauth prompt.
- Home Assistant 2025.1 or later.
- A Cartesia API key. Sign up and create a free key at play.cartesia.ai/keys.
Note: Only one instance of the integration is supported. If you attempt to add it again, HA will show a message directing you to the Configure button on the existing entry.
- Open HACS in your Home Assistant sidebar.
- Click the three-dot menu (top right) and choose Custom repositories.
- Paste `https://github.com/sfox38/cartesia_tts` and select Integration as the category.
- Click Add, then find Cartesia Sonic TTS in the HACS Integration list and click Download.
- Restart Home Assistant.
- Download the latest release zip from this repository and unpack it.
- Copy the `cartesia_tts` folder into your `config/custom_components/` directory. The result should be `config/custom_components/cartesia_tts/`.
- Restart Home Assistant.
The initial setup wizard has four steps.
Step 1: Enter your Cartesia API key. The integration validates it against the Cartesia API before continuing.

Step 2: Choose your default Cartesia model.

| Model | Latency | Languages | Emotion support |
|---|---|---|---|
| Sonic 3.5 (recommended) | ~100ms | 42 | Full (59 emotions) |
| Sonic 3 | 90ms | 42 | Full (59 emotions) |
| Sonic Turbo | 40ms | 15 | Limited |
Step 3: Choose your default language, speed, volume, and emotion. The language list is filtered to show only the languages supported by the model chosen in step 2.
Use the option at the bottom to go back to model selection if needed.
Step 4: Choose your default voice. The dropdown shows only voices for the selected language, sorted alphabetically. Voice names include accent information where relevant (e.g. "Matilda - Australian Female").
Use the option at the bottom to go back to settings.
Go to Settings -> Devices and Services -> Cartesia Sonic TTS -> Configure.
The Configure dialog repeats the last three steps of the setup wizard (model, settings, voice) with your current values pre-filled; the API key step is skipped. The voice list is always refreshed from the Cartesia API when you reach the voice step, so any voices Cartesia has added since your last session appear immediately.
Automatically: If your API key is revoked or expires, HA detects this the next time speech is synthesised and displays a repair notification prompting you to re-enter your key. Click the notification and enter the new key.
Manually: Go to Settings -> Devices and Services -> Cartesia Sonic TTS, click the three-dot menu, and choose Reconfigure. Enter a new API key. All other settings (model, voice, language, etc.) are preserved.
```yaml
action: tts.speak
target:
  entity_id: tts.cartesia_sonic_tts
data:
  media_player_entity_id: media_player.living_room
  message: "Hello from Cartesia."
```

All generation parameters can be overridden for a single call via the `options` dict. Overrides take precedence over the defaults set in the Configure dialog.

```yaml
# Override emotion, speed, and volume for one call
action: tts.speak
target:
  entity_id: tts.cartesia_sonic_tts
data:
  media_player_entity_id: media_player.living_room
  message: "This is urgent!"
  options:
    emotion: alarmed
    speed: 1.3
    volume: 1.5
```

```yaml
# Speak French with a specific model and voice
action: tts.speak
target:
  entity_id: tts.cartesia_sonic_tts
data:
  media_player_entity_id: media_player.kitchen
  message: "Bonjour le monde."
  language: fr
  options:
    model: sonic-3.5
    voice_id: "ab636c8b-9960-4fb3-bb0c-b7b655fb9745"
```

Cartesia SSML tags can be embedded directly in the message text. They are passed to the API as-is.

```yaml
# Each line is an alternative value for the message field
message: "<emotion value='angry'/> How dare you speak to me like that!"
message: "<speed ratio='1.5'/> I like to talk fast."
message: "<volume ratio='1.5'/> This part is louder."
```

See the Cartesia SSML documentation for the full tag reference. Note that speed, volume, and emotion SSML tags are currently in beta.
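
As a rough sketch of how a tag fits into a complete call, the example below places an emotion tag partway through the message. It assumes a tag affects the text that follows it (check the Cartesia SSML documentation to confirm) and reuses the entity IDs from the examples above, which you should adjust to match your setup.

```yaml
# Sketch: an SSML tag embedded in a complete tts.speak call
action: tts.speak
target:
  entity_id: tts.cartesia_sonic_tts
data:
  media_player_entity_id: media_player.living_room
  # Single quotes inside the tag avoid escaping within the double-quoted YAML string
  message: "The washing machine has finished. <emotion value='excited'/> And the dryer is free too!"
```

Because the tags are in beta, the result may differ between voices and models.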
The following keys are accepted in the `options` dict of `tts.speak`:

| Key | Type | Description |
|---|---|---|
| `model` | string | `sonic-3.5`, `sonic-3`, or `sonic-turbo` |
| `voice_id` | string | Cartesia voice UUID |
| `language` | string | ISO 639-1 language code (e.g. `en`, `fr`, `ja`) |
| `speed` | float | Speed multiplier. 0.6 slowest, 1.0 normal, 1.5 fastest |
| `volume` | float | Volume multiplier. 0.5 quietest, 1.0 normal, 2.0 loudest |
| `emotion` | string | Emotion name (see list below) |

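
For illustration, a single call can combine several of these keys. The sketch below is one assumed use case rather than a recommended configuration: it switches to `sonic-turbo` for a short status message and speeds it up slightly. The message text is made up, and the media player entity is reused from the earlier examples.

```yaml
# Sketch: several option keys combined in one call
action: tts.speak
target:
  entity_id: tts.cartesia_sonic_tts
data:
  media_player_entity_id: media_player.kitchen
  message: "The front door is unlocked."
  options:
    model: sonic-turbo   # lowest listed latency; 15 languages, limited emotion support
    language: en         # ISO 639-1 code; should be one the chosen model supports
    speed: 1.2           # within the 0.6 to 1.5 range
    volume: 1.0          # within the 0.5 to 2.0 range
```

Any key left out of `options` falls back to the default set in the Configure dialog.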
Emotions are guidance to the model, not strict transformations. Results vary by voice and transcript. For best results use one of Cartesia's recommended emotive voices (tagged "Emotive" in the Cartesia voice library).
The primary emotions with the most training data are: angry, content, excited, neutral, sad, scared.
Full list (pass to the API or options dict):
affectionate, agitated, alarmed, amazed, angry, anticipation, anxious, apologetic, bored, calm, confident, confused, contemplative, contempt, content, curious, dejected, determined, disappointed, disgusted, distant, elated, enthusiastic, envious, euphoric, excited, flirtatious, frustrated, grateful, guilty, happy, hesitant, hurt, insecure, ironic, joking/comedic, mad, melancholic, mysterious, neutral, nostalgic, outraged, panicked, peaceful, proud, rejected, resigned, sad, sarcastic, scared, serene, skeptical, surprised, sympathetic, threatened, tired, triumphant, trust, wistful
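
If several automations need different emotions, one option is to wrap the call in a reusable script. The following is a minimal sketch, not part of the integration: the script name `cartesia_announce` and its two fields are hypothetical, and the media player entity is taken from the earlier examples.

```yaml
# Sketch for configuration.yaml: a hypothetical helper script
script:
  cartesia_announce:
    alias: "Cartesia announce"
    fields:
      message:
        description: "Text to speak"
        example: "Dinner is ready."
      emotion:
        description: "One of the emotion names listed above"
        example: "excited"
    sequence:
      - action: tts.speak
        target:
          entity_id: tts.cartesia_sonic_tts
        data:
          media_player_entity_id: media_player.living_room
          message: "{{ message }}"
          options:
            # Falls back to neutral when the caller passes no emotion
            emotion: "{{ emotion | default('neutral') }}"
```

An automation could then call it like this:

```yaml
action: script.cartesia_announce
data:
  message: "The package has arrived!"
  emotion: excited
```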
Sonic 3.5 and Sonic 3 (42 languages): Arabic, Bengali, Bulgarian, Chinese, Croatian, Czech, Danish, Dutch, English, Finnish, French, Georgian, German, Greek, Gujarati, Hebrew, Hindi, Hungarian, Indonesian, Italian, Japanese, Kannada, Korean, Malay, Malayalam, Marathi, Norwegian, Polish, Portuguese, Punjabi, Romanian, Russian, Slovak, Spanish, Swedish, Tagalog, Tamil, Telugu, Thai, Turkish, Ukrainian, Vietnamese

Sonic Turbo (15 languages): Chinese, Dutch, English, French, German, Hindi, Italian, Japanese, Korean, Polish, Portuguese, Russian, Spanish, Swedish, Turkish

The Cartesia API does not expose dialect codes (e.g. en-AU) in the synthesis request. Accent is a property of the voice itself. Many voices in the Cartesia library include accent information in their name (e.g. "Matilda - Australian Female"). Voice selection is effectively dialect selection.
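
In practice this means an accent is requested by pairing a plain language code with a specific voice. The sketch below shows the shape of such a call; the `voice_id` is a placeholder, not a real voice, and must be replaced with the UUID of the voice you have chosen.

```yaml
# Sketch: choosing an accent by choosing a voice
action: tts.speak
target:
  entity_id: tts.cartesia_sonic_tts
data:
  media_player_entity_id: media_player.living_room
  message: "G'day, the kettle has boiled."
  language: en   # plain language code; there is no en-AU style dialect code
  options:
    # Placeholder UUID: substitute the ID of an Australian-accented voice
    voice_id: "00000000-0000-0000-0000-000000000000"
```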
"No voice configured": Open Configure and complete the voice selection step.
Voice browser shows no voices in HA: The voice list is fetched once when HA starts. If this fails (e.g. a transient network error at boot), open Configure and proceed to the voice step - this always triggers a fresh fetch.
Emotion has no effect: Not all voices respond well to emotion guidance. Try one of the recommended emotive voices from the Cartesia voice library (filter by "Emotive" tag). Emotion is not reliably supported on Sonic Turbo.
Wrong accent: The language code alone does not control accent. Select a voice whose name or description matches your desired accent.
SSML not working: The message string must contain valid Cartesia SSML. The speed, volume, and emotion SSML tags are currently in beta. Invalid or malformed tags are silently ignored by the Cartesia API.
No audio output or other unexpected behaviour: Check Settings -> System -> Logs in the HA UI, or open /config/home-assistant.log. Error and warning messages from this integration are always logged at standard level with no configuration needed. If you need more detail (such as the exact request being sent to Cartesia), add the following to your configuration.yaml and restart HA:
```yaml
logger:
  logs:
    custom_components.cartesia_tts: debug
```

Note: Debug logging includes the first 50 characters of the message being synthesised. Avoid enabling it long-term if your announcements contain sensitive information.
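
If a restart is inconvenient, Home Assistant's built-in `logger.set_level` action can change the level at runtime instead. This is a sketch that assumes the `logger` integration is already loaded (it is part of `default_config`). A level set this way is not saved and reverts at the next restart.

```yaml
# Sketch: enable debug logging for this integration without restarting
action: logger.set_level
data:
  custom_components.cartesia_tts: debug
```

Running the same action with `info` in place of `debug` turns the extra detail off again once you have captured what you need.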