On-device voice transcription, grammar correction, and text-to-speech for macOS. Private, fast, runs on MLX.
Double-tap, speak, tap to stop. Text is ready. Multiple engines, pluggable grammar, all MLX-native on Apple Silicon. Nothing leaves your Mac. Optional text-to-speech reads any selection aloud with ⌥T. Multiple voices, streaming playback, same deal.
Apple Silicon required. Microphone and Accessibility permissions needed.
git clone https://github.com/gabrimatic/local-whisper.git
cd local-whisper
./setup.shOne command. Installs deps, downloads core local models, builds the UI, sets up auto-start, creates the wh alias.
| Action | Key |
|---|---|
| Start recording | Double-tap Right Option |
| Hold to record | Hold Right Option past double-tap threshold |
| Stop and transcribe | Tap Right Option or Space |
| Cancel | Esc |
| Read selected text aloud | ⌥T |
| Stop speech | ⌥T again or Esc |
- On-device transcription via MLX. Multiple engines, up to 20 minutes per recording.
- Grammar correction with pluggable backends: Apple Intelligence, Ollama, LM Studio. Or disable it.
- Text-to-speech reads any selected text aloud. Works in any app, multiple voices, streaming playback, fully offline via Kokoro MLX.
- Text replacements for custom spoken-to-correct mappings.
- Audio processing: VAD, silence trimming, noise reduction, normalization.
- Keyboard shortcuts for proofreading, rewriting, prompt engineering on selected text.
- CLI:
wh whisper,wh listen,wh transcribefor scripting and automation. - Native macOS UI: menu bar, Liquid Glass overlay, settings window.
- Auto-backup of every recording and transcription.
| Shortcut | Action |
|---|---|
| ⌥T | Read selected text aloud (again or Esc to stop) |
| Ctrl+Shift+G | Proofread selected text |
| Ctrl+Shift+R | Rewrite selected text |
| Ctrl+Shift+P | Optimize selected text as an LLM prompt |
Results go to clipboard. TTS plays through speakers.
- Sounds: Pop on start, Glass on success, Basso on failure
- Menu bar: animated waveform (recording), speaker icon (speech)
- Overlay:
0.0recording ····processing ·Copieddone ·Failederror ·Speaking...
Switch via Settings, wh engine <name>, or config.
In-process via parakeet-mlx. No server, no network. Multilingual: English plus 24 European languages. Tops the HuggingFace Open ASR Leaderboard and beats much larger Whisper and Parakeet variants. Long audio handled via overlapping chunks.
| Setting | Default | Notes |
|---|---|---|
model |
mlx-community/parakeet-tdt-0.6b-v3 |
Downloaded by setup.sh |
timeout |
0 |
No limit |
chunk_duration |
120.0 |
Seconds per chunk for long audio. 0 disables chunking (pair with local attention). |
overlap_duration |
15.0 |
Overlap between consecutive chunks. |
decoding |
"greedy" |
Or "beam" for a small quality bump at higher cost. |
beam_size / length_penalty / patience / duration_reward |
5 / 0.013 / 3.5 / 0.67 |
Beam-only tuning knobs. |
local_attention |
false |
Reduces peak RAM for unchunked long audio. |
In-process via qwen3-asr-mlx. No server, no network. Long audio native (up to 20 minutes in a single pass). English-only. Switch with wh engine qwen3_asr.
| Setting | Default | Notes |
|---|---|---|
model |
mlx-community/Qwen3-ASR-1.7B-bf16 |
Downloaded by setup.sh |
timeout |
0 |
No limit |
repetition_penalty |
1.2 |
Higher suppresses repetition loops |
repetition_context_size |
100 |
Tokens considered for repetition penalty |
chunk_duration |
1200 |
Seconds per internal chunk for very long audio |
max_tokens |
0 |
0 = auto-scale from duration. Cap for faster short-clip decode. |
Whisper on Apple Neural Engine via Argmax. Install with brew install whisperkit-cli, switch with wh engine whisperkit.
| Model | Notes |
|---|---|
tiny / tiny.en |
Fastest, lowest accuracy |
base / base.en |
|
small / small.en |
|
whisper-large-v3-v20240930 |
Best accuracy (default) |
Kokoro-82M via kokoro-mlx. Runs in-process, no server, no network. Streaming playback starts before full synthesis completes.
Toggle from the menu bar or Settings -> Voice. Activating the feature downloads the Kokoro voice model (~170 MB) and uses the spaCy en_core_web_sm dictionary plus system espeak-ng. Running ./setup.sh while the toggle is on pre-fetches everything so the first speak has no wait.
Usage:
- ⌥T on selected text in any app. Press ⌥T again, Esc, or start a recording to stop.
- CLI:
wh whisper "text",wh whisper --voice af_bella "text", or pipe stdin withecho "hello" | wh whisper.
The overlay shows "Generating speech..." during synthesis, then "Speaking..." during playback. The shortcut is configurable via tts.speak_shortcut in config.
Multiple presets available. Default is Sky (af_sky).
| Voice | ID | Type |
|---|---|---|
| Heart | af_heart |
American female |
| Bella | af_bella |
American female |
| Nova | af_nova |
American female |
| Sky | af_sky (default) |
American female |
| Sarah | af_sarah |
American female |
| Nicole | af_nicole |
American female |
| Alice | bf_alice |
British female |
| Emma | bf_emma |
British female |
| Adam | am_adam |
American male |
| Echo | am_echo |
American male |
| Eric | am_eric |
American male |
| Liam | am_liam |
American male |
| Daniel | bm_daniel |
British male |
| George | bm_george |
British male |
Optional. Pick a grammar backend or disable it:
| Backend | Requirements | Notes |
|---|---|---|
| Apple Intelligence | macOS 15+, Apple Silicon, Apple Intelligence enabled | Fastest, best quality |
| Ollama | Ollama installed and running | Works on any Mac |
| LM Studio | LM Studio with a model loaded and the local server started | Works on any Mac |
| Disabled | None | Transcription only |
Switch from menu bar (instant), wh backend <name> (restarts), or Settings.
Ollama setup
- Download from ollama.com
- Pull a model and start the server:
ollama pull gemma3:4b-it-qat
ollama serveLM Studio setup
- Download from lmstudio.ai
- Download and load a model (e.g.,
google/gemma-3-4b) - Start the local server: Developer tab > Start Server
Loading a model does not start the server. Start it from Developer tab.
wh controls everything:
wh # Status and help
wh status # Service status, PID, grammar backend
wh start # Launch the service
wh stop # Stop the service
wh restart # Restart (rebuilds Swift UI if sources changed)
wh build # Rebuild Swift UI app
wh engine # Show current engine and list available
wh engine whisperkit # Switch transcription engine
wh backend # Show current grammar backend and list available
wh backend ollama # Switch grammar backend
wh replace # Show text replacement rules
wh replace add "gonna" "going to"
wh replace remove "gonna"
wh replace on|off # Enable or disable replacements
wh replace import rules.csv # Bulk-import rules (CSV, TSV, "a"="b", or a -> b)
wh whisper "text" # Speak text aloud via Kokoro TTS
wh whisper --voice af_bella "text"
echo "hello" | wh whisper
wh listen # Record until silence, output transcription
wh listen 30 # Record up to 30 seconds
wh listen --raw # Raw transcription, no grammar
wh transcribe recording.wav
wh transcribe --raw audio.wav
wh stats # Show usage statistics computed from history
wh export # Export history to ~/Desktop/local-whisper-history.md
wh export --format json --out ~/Downloads/history.json
wh export --format txt --limit 50
wh config # Interactive config editor (static summary when piped)
wh config edit # Open config.toml in $EDITOR
wh config path # Print config file path
wh doctor # Check system health
wh doctor --fix # Auto-repair issues
wh doctor --report # Write a shareable diagnostic report
wh log # Tail service log
wh update # Pull, upgrade deps, warm up models, rebuild, restart
wh version # Show version
wh uninstall # Completely remove Local WhisperSpeak these phrases anywhere in a dictation and Local Whisper replaces them with the literal punctuation or whitespace:
| Spoken | Inserted |
|---|---|
| "new line" | newline |
| "new paragraph" | blank line |
| "period" | . |
| "comma" | , |
| "question mark" | ? |
| "exclamation mark" | ! |
| "colon" | : |
| "semicolon" | ; |
| "dash" | - (space, hyphen, space) |
| "open paren" / "close paren" | ( / ) |
| "scratch that" | deletes the current sentence fragment |
Custom commands go under [dictation.commands] in ~/.whisper/config.toml. The pass runs before grammar correction, so grammar sees well-punctuated sentences.
| Item | What it does |
|---|---|
| Status | Current state with active engine and backend subtitle |
| Engine | Switch transcription engine in-place |
| Grammar | Switch grammar backend in-place |
| Replacements | Toggle, shows rule count |
| Retry Last / Copy Last | Re-transcribe or re-copy |
| Transcriptions | Recent entries, click to copy |
| Recordings | Audio files, click to reveal in Finder |
| Settings… | Full sidebar settings window |
| Service | Restart, Check for Updates, Open Service Log |
| Quit Local Whisper | Exit |
Sidebar layout with focused panels:
| Panel | Covers |
|---|---|
| Recording | Trigger key, double-tap window, audio cleanup, duration limits |
| Transcription | Engine picker plus per-engine sampling and decoding parameters |
| Grammar | Master toggle, backend picker, per-backend connection and limits |
| Voice | Text-to-speech voice and shortcut, dictation command help |
| Vocabulary | Searchable replacement editor with import / export |
| Output | Overlay, sounds, notifications, paste-at-cursor, history limit |
| Shortcuts | Proofread / rewrite / prompt-engineer keybindings, full cheatsheet |
| Activity | Sessions, words, 30-day chart, top words, top replacement triggers |
| Advanced | Storage paths, service log, doctor, restart, update |
| About | Version, credits, replay tutorial |
Saves to ~/.whisper/config.toml. Restart-required fields warn and offer immediate restart.
~/.whisper/config.toml. Edit via Settings, wh config, or directly.
Full config reference
[hotkey]
key = "alt_r" # alt_r, alt_l, ctrl_r, ctrl_l, cmd_r, cmd_l,
# shift_r, shift_l, caps_lock, f1-f12
double_tap_threshold = 0.4 # seconds
[transcription]
engine = "parakeet_v3" # "parakeet_v3" (default), "qwen3_asr", or "whisperkit"
[parakeet_v3]
model = "mlx-community/parakeet-tdt-0.6b-v3"
timeout = 0 # 0 = no limit
chunk_duration = 120.0 # 0 disables chunking
overlap_duration = 15.0
decoding = "greedy" # "greedy" or "beam"
beam_size = 5
length_penalty = 0.013
patience = 3.5
duration_reward = 0.67
local_attention = false
local_attention_context_size = 256
[qwen3_asr]
model = "mlx-community/Qwen3-ASR-1.7B-bf16"
timeout = 0 # 0 = no limit
temperature = 0.0
top_p = 1.0
top_k = 0
repetition_context_size = 100
repetition_penalty = 1.2
chunk_duration = 1200.0 # max chunk length in seconds
max_tokens = 0 # 0 = auto from duration
[whisper]
model = "whisper-large-v3-v20240930"
language = "auto"
url = "http://localhost:50060/v1/audio/transcriptions"
check_url = "http://localhost:50060/"
timeout = 0
temperature = 0.0
compression_ratio_threshold = 2.4
no_speech_threshold = 0.6
logprob_threshold = -1.0
temperature_fallback_count = 5
prompt_preset = "none" # "none", "technical", "dictation", or "custom"
prompt = "" # used only when prompt_preset = "custom"
[grammar]
backend = "apple_intelligence" # "apple_intelligence", "ollama", or "lm_studio"
enabled = false
[ollama]
url = "http://localhost:11434/api/generate"
check_url = "http://localhost:11434/"
model = "gemma3:4b-it-qat"
keep_alive = "60m"
timeout = 0
max_chars = 0
max_predict = 0
num_ctx = 0
unload_on_exit = false
[apple_intelligence]
max_chars = 0
timeout = 0
[lm_studio]
url = "http://localhost:1234/v1/chat/completions"
check_url = "http://localhost:1234/"
model = "google/gemma-3-4b"
max_chars = 0
max_tokens = 0
timeout = 0
[replacements]
enabled = false
[replacements.rules]
# "gonna" = "going to"
# "wanna" = "want to"
[audio]
sample_rate = 16000
min_duration = 0
max_duration = 0 # 0 = no limit
min_rms = 0.005 # silence threshold (0.0-1.0)
vad_enabled = true
noise_reduction = true
normalize_audio = true
pre_buffer = 0.0 # seconds before hotkey (0.0 = disabled)
[backup]
directory = "~/.whisper"
history_limit = 100 # max entries for text and audio history (1-1000)
[ui]
show_overlay = true
overlay_opacity = 0.92
sounds_enabled = true
notifications_enabled = false
auto_paste = false # paste at cursor, preserving clipboard
[shortcuts]
enabled = true
proofread = "ctrl+shift+g"
rewrite = "ctrl+shift+r"
prompt_engineer = "ctrl+shift+p"
[tts]
enabled = false # toggle from Settings -> Voice or the menu bar
provider = "kokoro"
speak_shortcut = "alt+t"
[kokoro_tts]
model = "mlx-community/Kokoro-82M-bf16"
voice = "af_sky" # See voice table in README for all available presetsZero network calls. Every component runs on-device or localhost.
| Component | Runs at |
|---|---|
| Parakeet-TDT v3 | In-process MLX |
| Qwen3-ASR | In-process MLX |
| Kokoro TTS | In-process MLX |
| WhisperKit | localhost:50060 |
| Apple Intelligence | On-device |
| Ollama | localhost:11434 |
| LM Studio | localhost:1234 |
Models cached at ~/.whisper/models/. Config and backups at ~/.whisper/.
Python headless service (LaunchAgent). Swift owns all UI.
Python (LaunchAgent, headless)
├── Recording, transcription, grammar, replacements, clipboard, hotkeys
├── Text-to-Speech (Kokoro-82M, in-process)
├── IPC server at ~/.whisper/ipc.sock (Swift UI communication)
├── Command server at ~/.whisper/cmd.sock (CLI commands)
├── Pipeline watchdog (per-stage timeouts, skip-don't-freeze)
└── Crash recovery (~/.whisper/processing.marker + current_session.jsonl)
Swift (subprocess, all UI)
├── Menu bar with grammar submenus and transcription history
├── Floating overlay pill (recording, processing, speaking states)
├── NSWorkspace sleep/wake observer → resync_audio IPC
└── Settings window (auto-fetches Ollama / LM Studio models on open)
The LaunchAgent uses KeepAlive={SuccessfulExit=false} with
ThrottleInterval=10 so real crashes auto-restart while clean stops
(wh stop, wh restart, user-error exits) don't relaunch. Recordings
longer than five minutes take the chunked pipeline: each VAD segment is
transcribed, grammar-corrected, and persisted to
~/.whisper/current_session.jsonl before the next chunk runs, so a
crash mid-lecture recovers the completed chunks on next boot instead of
losing everything.
Data flow
┌───────────────────────────────────────────────────────────┐
│ Microphone → pre-buffer (ring) + live capture │
└──────────────────────────┬────────────────────────────────┘
▼
┌───────────────────────────────────────────────────────────┐
│ Audio Processing │
│ VAD → silence trim → noise reduction → normalize │
└──────────────────────────┬────────────────────────────────┘
▼
┌───────────────────────────────────────────────────────────┐
│ Transcription Engine │
│ │
│ Parakeet-TDT v3 (default) │ Qwen3-ASR │ WhisperKit │
│ Multilingual, MLX │ English, MLX│ localhost:50060│
│ 120s chunked │ Long audio │ Split at 28s │
└──────────────────────────┬────────────────────────────────┘
▼
┌───────────────────────────────────────────────────────────┐
│ Grammar Correction │
│ │
│ Apple Intelligence │ Ollama │ LM Studio │
│ On-device │ localhost LLM │ OpenAI-compatible │
└──────────────────────────┬────────────────────────────────┘
▼
┌───────────────────────────────────────────────────────────┐
│ Text Replacements │
│ Case-insensitive, word-boundary-aware regex │
└──────────────────────────┬────────────────────────────────┘
▼
┌───────────────────────────────────────────────────────────┐
│ Clipboard · Saved to ~/.whisper/ │
│ (auto_paste: pasted at cursor, clipboard preserved) │
└───────────────────────────────────────────────────────────┘
"This process is not trusted"
Grant Accessibility to the wh process, not your terminal app. System Settings opens automatically on first run.
If it didn't:
open x-apple.systempreferences:com.apple.preference.security?Privacy_AccessibilityEnable wh, then wh restart.
Double-tap not working
Tap twice within 0.4s (default). Adjust double_tap_threshold in config.
Apple Intelligence not working
Verify:
- macOS 15 (Sequoia) or later
- Apple Silicon (M1/M2/M3/M4)
- Apple Intelligence enabled in System Settings > Apple Intelligence & Siri
Ollama not working
Verify:
- Ollama installed: ollama.com
- Model pulled:
ollama pull gemma3:4b-it-qat - Server running:
ollama serve
LM Studio not working
Verify:
- LM Studio installed: lmstudio.ai
- A model is downloaded and loaded
- Local server is running (most common issue): Developer tab > Start Server
- Confirm with:
curl http://localhost:1234/v1/models
Loading a model does not start the server.
Slow first transcription
setup.sh pre-downloads and warms the built-in local models used by transcription and TTS. It does not pull Ollama or LM Studio models for you. Skip setup and the first transcription loads them on demand. After that, loaded from disk.
Empty transcription
- Speak clearly, close to the microphone
- Check microphone permissions in System Settings
- Confirm the correct input device is selected
Overlay not showing
Check show_overlay = true in ~/.whisper/config.toml.
python3 -m venv .venv && source .venv/bin/activate
pip install -e .
wh build # Build Swift UI (one-time)
wh # Run the service
pytest tests/ # Run the full test suiteEngines: implement TranscriptionEngine in engines/, register in ENGINE_REGISTRY.
Grammar backends: implement GrammarBackend in backends/, register in BACKEND_REGISTRY.
Menu, CLI, and Settings auto-generate from the registries.
Project structure
local-whisper/
├── pyproject.toml
├── setup.sh
├── tests/
│ ├── test_flow.py
│ └── fixtures/
├── LocalWhisperUI/ # Swift UI app
│ ├── Package.swift
│ └── Sources/LocalWhisperUI/
│ ├── AppMain.swift # @main entry point
│ ├── AppState.swift # Observable state, IPC handler, ConnectionState
│ ├── IPCClient.swift # Unix socket client + connection-state publishing
│ ├── IPCMessages.swift # Codable message types
│ ├── Theme.swift # Typography, spacing, radii, tones, accents
│ ├── MenuBarView.swift # Menu bar dropdown (status + connection state)
│ ├── OverlayWindowController.swift
│ ├── OverlayView.swift # Floating pill: waveform + state
│ ├── OnboardingView.swift # First-launch + replay tutorial
│ ├── SettingsView.swift # Sidebar root + section enum
│ ├── RecordingPanel.swift # Trigger key + audio cleanup
│ ├── TranscriptionPanel.swift # Engine picker + per-engine params
│ ├── GrammarPanel.swift # Backend picker + Ollama / LM Studio / AI
│ ├── VoicePanel.swift # TTS + dictation commands
│ ├── VocabularyPanel.swift # Replacements editor (search, import / export)
│ ├── OutputPanel.swift # Overlay, sounds, notifications, paste, history
│ ├── ShortcutsPanel.swift # Text-transform keybindings + cheatsheet
│ ├── ActivityPanel.swift # Usage stats with 30-day chart
│ ├── AdvancedPanel.swift # Live status, storage, diagnostics, service control
│ ├── AboutView.swift # Hero, credits, links, replay tutorial
│ ├── SharedViews.swift # DeferredText fields, StatusPill, InlineNotice, headers
│ └── Constants.swift
└── src/whisper_voice/
├── app.py # App class + service_main (imports mixins)
├── app_ipc.py # IPCMixin: IPC send/receive
├── app_recording.py # RecordingMixin: keyboard + recording lifecycle
├── app_pipeline.py # PipelineMixin: transcription pipeline
├── app_commands.py # CommandsMixin: CLI command handlers
├── app_switching.py # SwitchingMixin: engine/backend switching
├── cli/ # CLI package (wh)
│ ├── constants.py # Colors, path constants
│ ├── lifecycle.py # start/stop/status
│ ├── build.py # Swift UI build, restart
│ ├── settings.py # engine/backend/replace commands
│ ├── editor.py # Interactive config TUI
│ ├── client.py # whisper/listen/transcribe socket client
│ ├── doctor.py # wh doctor + wh update
│ └── main.py # help, version, cli_main dispatcher
├── config/ # Config package
│ ├── schema.py # Dataclasses + DEFAULT_CONFIG
│ ├── loader.py # load_config, get_config, singleton
│ ├── toml_helpers.py # _find/_replace_in_section, _serialize_toml_value
│ └── mutations.py # add/remove_replacement, update_config_field
├── ipc_server.py # IPC server (Swift UI)
├── cmd_server.py # Command server (CLI)
├── audio.py # Recording and pre-buffer
├── audio_processor.py # VAD, noise reduction, normalization
├── backup.py # History persistence
├── grammar.py # Grammar backend factory
├── transcriber.py # Engine routing
├── utils.py # Helpers
├── shortcuts.py # Text transformation shortcuts
├── key_interceptor.py # CGEvent tap
├── tts_processor.py # TTS shortcut handler
├── tts/
│ ├── base.py # TTSProvider base
│ └── kokoro_tts.py # Kokoro provider (MLX)
├── engines/
│ ├── base.py # TranscriptionEngine base
│ ├── parakeet.py # Parakeet-TDT v3 (MLX, default)
│ ├── qwen3_asr.py # Qwen3-ASR (MLX)
│ ├── whisperkit.py # WhisperKit (localhost)
│ ├── status.py # Cache status + on-disk size reporting
│ └── download_progress.py # HF preflight + inline progress bar IPC
└── backends/
├── base.py # Backend base
├── modes.py # Transformation modes
├── ollama/
├── lm_studio/
└── apple_intelligence/
Data stored in ~/.whisper/:
~/.whisper/
├── config.toml # Settings
├── ipc.sock # Python/Swift IPC
├── cmd.sock # CLI commands
├── LocalWhisperUI.app # Swift UI (built by setup.sh)
├── last_recording.wav
├── last_raw.txt # Before grammar
├── last_transcription.txt # Final text
├── audio_history/
├── history/ # Last 100 transcriptions
└── models/ # Parakeet-TDT, Qwen3-ASR, Kokoro TTS
parakeet-mlx (MLX port of NVIDIA Parakeet) · Parakeet-TDT by NVIDIA NeMo · qwen3-asr-mlx (MLX port of Qwen3-ASR) · kokoro-mlx (MLX port of Kokoro-82M) · Qwen3-ASR by Qwen Team · Kokoro-82M · WhisperKit by Argmax · Apple Intelligence · Apple FM SDK · Ollama · LM Studio · SwiftUI
Legal notices
"Whisper" is a trademark of OpenAI. "Apple Intelligence" is a trademark of Apple Inc. "WhisperKit" is a trademark of Argmax, Inc. "Qwen" is a trademark of Alibaba Cloud. "Parakeet" and "NeMo" are trademarks of NVIDIA Corporation. "Ollama" and "LM Studio" are trademarks of their respective owners.
This project is not affiliated with, endorsed by, or sponsored by OpenAI, Apple, Argmax, Alibaba Cloud, NVIDIA, or any other trademark holder. All trademark names are used solely to describe compatibility with their respective technologies.
This project depends on pynput, licensed under LGPL-3.0. When installed via pip (the default), pynput is dynamically linked and fully compatible with this project's MIT license.
All other dependencies use MIT, BSD, or Apache 2.0 licenses. See each package for details.
MIT License. See LICENSE for details.
Created by Soroush Yousefpour




