The blazing-fast, pure-Swift LLM/VLM server, optimized for Apple Silicon.
No Python. No cloud. No limits.
Run 50+ model families — Llama, Qwen, Gemma, DeepSeek, Mistral — natively on your Mac.
100% Swift. Zero Python dependencies. OpenAI & Anthropic compatible. Native menu bar app.
Option 1: Homebrew (recommended)
brew tap cnshsliu/novamlx
brew install novamlx
brew services start novamlx

Option 2: Download DMG
Go to Releases and download the latest NovaMLX-X.X.X-arm64.dmg:
- Open the .dmg file
- Drag NovaMLX to your Applications folder
- Launch NovaMLX; the menu bar icon appears and the server starts on localhost:8080
Option 3: Build from source
git clone https://github.com/cnshsliu/novamlx.git
cd novamlx
./build.sh -c release

Requires macOS 15 (Sequoia), Apple Silicon, and Xcode 16+.
Launch NovaMLX from your Applications folder (or Spotlight).
A menu bar icon appears. The server runs on localhost:8080.
The nova CLI is bundled inside the app. Symlink it for easy access:
sudo ln -s /Applications/NovaMLX.app/Contents/MacOS/nova /usr/local/bin/nova

# Download
nova download mlx-community/Meta-Llama-3.1-8B-Instruct-4bit

# Load into memory
nova load mlx-community/Meta-Llama-3.1-8B-Instruct-4bit

# Interactive chat
nova chat
# Or via API
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"mlx-community/Meta-Llama-3.1-8B-Instruct-4bit",
     "messages":[{"role":"user","content":"Write a haiku about coding."}]}'

That's it. You're running an LLM locally.
NovaMLX is fully OpenAI API compatible. Point any tool at http://localhost:8080/v1.
ANTHROPIC_BASE_URL=http://localhost:8080/v1 \
ANTHROPIC_API_KEY=unused \
claude

Or in your shell profile:
export ANTHROPIC_BASE_URL=http://localhost:8080/v1
export ANTHROPIC_API_KEY=unused

Add to your opencode config (~/.config/opencode/config.json):

{
  "provider": {
    "name": "openai",
    "baseURL": "http://localhost:8080/v1",
    "apiKey": "unused",
    "models": {
      "default": "mlx-community/Meta-Llama-3.1-8B-Instruct-4bit"
    }
  }
}

When using local models with AI coding agents (Claude Code, OpenCode, OpenClaw, Hermes), the model's context window is often smaller than Anthropic's 200K. NovaMLX auto-detects agent tools from their HTTP headers and scales reported token counts so that auto-compact triggers at the right time, before your local model runs out of context. Normal chat clients (curl, Python SDK, web UI) always get real token counts.
No setup needed. Detection is automatic via Anthropic-Version header (Claude Code) or User-Agent substring matching (OpenCode, OpenClaw, Hermes).
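The detection rule described above can be sketched in a few lines. This is an illustration only, not NovaMLX's Swift implementation: the function name and the lowercase User-Agent fragments are assumptions, and only the header names and tool names come from the text.

```python
# Hypothetical User-Agent fragments; the exact substrings NovaMLX matches
# are not documented here, so these are assumptions for illustration.
AGENT_UA_FRAGMENTS = ("opencode", "openclaw", "hermes")


def looks_like_agent(headers: dict[str, str]) -> bool:
    """Return True if the request headers resemble an AI coding agent."""
    # HTTP header names are case-insensitive, so normalize them first.
    lowered = {name.lower(): value for name, value in headers.items()}

    # Claude Code: identified by the presence of the Anthropic-Version header.
    if "anthropic-version" in lowered:
        return True

    # Other agents: substring match on the User-Agent value.
    user_agent = lowered.get("user-agent", "").lower()
    return any(fragment in user_agent for fragment in AGENT_UA_FRAGMENTS)


print(looks_like_agent({"Anthropic-Version": "2023-06-01"}))
print(looks_like_agent({"User-Agent": "curl/8.7.1"}))
```

Under these assumptions, a plain curl request is not classified as an agent and would keep receiving real token counts.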
Scaling itself is opt-in. Set contextScalingTarget in ~/.nova/config.json to enable it:
{
  "server": {
    "host": "127.0.0.1",
    "port": 8080,
    "adminPort": 8081,
    "apiKeys": [],
    "contextScalingTarget": 200000
  }
}

If your model has a 128K context window and contextScalingTarget is 200000, token counts are scaled by 200000 / 128000 ≈ 1.56×, but only for detected agent tools. If contextScalingTarget is omitted, no scaling occurs (default).
How it works: Claude Code and other agents auto-compact conversation history at ~80% of what they believe the context window to be. By scaling the usage numbers, NovaMLX ensures that 80% of the virtual window maps to the actual limit of your local model, preventing context overflow errors.
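The arithmetic behind this can be checked with a short sketch. The function names are hypothetical and the rounding is an assumption; the window sizes and the roughly 80% compaction threshold are the ones quoted above.

```python
def scale_factor(target_window: int, model_window: int) -> float:
    """Ratio by which reported token counts are inflated."""
    return target_window / model_window


def reported_tokens(real_tokens: int, target_window: int, model_window: int) -> int:
    """Token count an agent would see for a given real token count."""
    return round(real_tokens * scale_factor(target_window, model_window))


target, model = 200_000, 128_000
factor = scale_factor(target, model)

# The agent compacts once the reported count reaches ~80% of the virtual window.
compact_threshold = 0.8 * target
# Real usage at that moment, expressed as a share of the model's true window.
real_at_compact = compact_threshold / factor

print(f"factor = {factor}")
print(f"real tokens at compaction = {real_at_compact:.0f} "
      f"({real_at_compact / model:.0%} of the real window)")
```

With these numbers, compaction at 160,000 reported tokens corresponds to 102,400 real tokens, which is 80% of the 128,000-token window.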
Settings → Models → OpenAI API Compatible:
| Field | Value |
|---|---|
| Base URL | http://localhost:8080/v1 |
| API Key | unused |
| Model ID | mlx-community/Meta-Llama-3.1-8B-Instruct-4bit |
Add to ~/.continue/config.json:
{
  "models": [
    {
      "title": "NovaMLX Local",
      "provider": "openai",
      "apiBase": "http://localhost:8080/v1",
      "apiKey": "unused",
      "model": "mlx-community/Meta-Llama-3.1-8B-Instruct-4bit"
    }
  ]
}

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")
response = client.chat.completions.create(
    model="mlx-community/Meta-Llama-3.1-8B-Instruct-4bit",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True
)
for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

import anthropic
client = anthropic.Anthropic(base_url="http://localhost:8080", api_key="unused")
response = client.messages.create(
    model="mlx-community/Meta-Llama-3.1-8B-Instruct-4bit",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.content[0].text)

# Chat
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"my-model","messages":[{"role":"user","content":"Hi"}]}'
# Streaming
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"my-model","messages":[{"role":"user","content":"Hi"}],"stream":true}'
# Embeddings
curl http://localhost:8080/v1/embeddings \
-H "Content-Type: application/json" \
-d '{"model":"my-embed-model","input":"Hello world"}'

The nova CLI lets you manage everything from the terminal:
# Find models
nova search "llama 3.1 4bit"
# Download
nova download mlx-community/Meta-Llama-3.1-8B-Instruct-4bit
# Load into memory
nova load mlx-community/Meta-Llama-3.1-8B-Instruct-4bit
# List loaded models
nova models
# Unload (free memory)
nova unload mlx-community/Meta-Llama-3.1-8B-Instruct-4bit
# Delete downloaded files
nova delete mlx-community/Meta-Llama-3.1-8B-Instruct-4bit
# Chat interactively
nova chat
# Server status & GPU memory
nova status

Compress the KV cache to serve longer contexts:
# Enable 4-bit KV quantization (recommended)
nova turboquant my-model 4
# Enable 2-bit (maximum compression)
nova turboquant my-model 2
# Disable
nova turboquant my-model off
# Check status
nova turboquant

# Sessions
nova sessions # List active sessions
nova sessions delete ID # Delete a session
# Cache
nova cache my-model # Show cache stats
nova cache my-model clear # Clear cache
# LoRA adapters
nova adapters # List loaded adapters
nova adapters load /path/to/adapter
nova adapters unload my-adapter
# Benchmark
nova bench start my-model # Run performance benchmark
nova bench status # Check benchmark progress

When you start NovaMLX, a menu bar icon appears showing:
- Server status (running/stopped)
- Loaded models
- GPU memory usage
- Active requests
- Tokens per second
Click the icon to open the Dashboard window for detailed monitoring.
Works with any SafeTensors model from HuggingFace — Llama 3, Qwen 2/2.5/3, Gemma 2/3, Phi 3.5/4, Mistral, Mixtral, DeepSeek, StarCoder2, and many more.
Send images with your messages — supports Qwen2-VL, Gemma3, LLaVA, Phi-3-Vision, Pixtral, Molmo, and others:
response = client.chat.completions.create(
    model="mlx-community/Qwen2.5-VL-7B-Instruct-4bit",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}}
        ]
    }]
)

Force the model to output valid JSON matching your schema:
response = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Who won the 2022 World Cup?"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "answer",
            "schema": {
                "type": "object",
                "properties": {
                    "winner": {"type": "string"},
                    "score": {"type": "string"}
                },
                "required": ["winner", "score"]
            }
        }
    }
)
# Returns: {"winner": "Argentina", "score": "3-3 (4-2 pens)"}

Also supports: JSON mode, Regex patterns, and GBNF grammars.
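On the client side, the structured reply arrives as a JSON string in the message content, so it still has to be parsed before use. The helper below is a minimal sketch with a hypothetical name. It checks only that the required keys from the schema above are present, which can catch a reply that was cut short by a token limit.

```python
import json


def parse_structured_reply(raw: str, required: list[str]) -> dict:
    """Parse a JSON reply and confirm that the required keys are present."""
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [key for key in required if key not in data]
    if missing:
        raise ValueError(f"missing required keys: {missing}")
    return data


# The example reply shown above, as it would appear in the message content.
reply = '{"winner": "Argentina", "score": "3-3 (4-2 pens)"}'
answer = parse_structured_reply(reply, ["winner", "score"])
print(answer["winner"], answer["score"])
```

With the OpenAI Python client, the string to pass in would be response.choices[0].message.content.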
Automatic tool call detection across 7 format families — works with any model without fine-tuning.
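As a sketch of what a tool-calling round trip could look like from the client side, the snippet below builds a request body in the OpenAI tools format and reads tool calls out of a response. It assumes NovaMLX accepts the standard `tools` field and returns `tool_calls` in the OpenAI chat completion shape; the `get_weather` tool and both helper names are hypothetical.

```python
import json


def build_tool_request(model: str, question: str) -> dict:
    """Build a chat completion body that offers one tool to the model."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",  # hypothetical tool for illustration
                "description": "Look up the current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }],
    }


def extract_tool_calls(response: dict) -> list[tuple[str, dict]]:
    """Return (tool name, parsed arguments) pairs from a chat completion."""
    message = response["choices"][0]["message"]
    calls = message.get("tool_calls") or []
    # In the OpenAI format, arguments arrive as a JSON-encoded string.
    return [
        (call["function"]["name"], json.loads(call["function"]["arguments"]))
        for call in calls
    ]


body = build_tool_request("my-model", "What's the weather in Paris?")
print(json.dumps(body, indent=2))
```

The body can be sent to POST /v1/chat/completions with any HTTP client, or the same `tools` list can be passed to the OpenAI Python client.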
# Embeddings for RAG/semantic search
curl http://localhost:8080/v1/embeddings \
-H "Content-Type: application/json" \
-d '{"model":"my-embed-model","input":"Hello world"}'
# Rerank documents
curl http://localhost:8080/v1/rerank \
-H "Content-Type: application/json" \
-d '{"model":"my-reranker","query":"What is MLX?","documents":["doc1","doc2"]}'

Uses Apple's built-in on-device speech recognition and synthesis:
# Speech-to-text
curl http://localhost:8080/v1/audio/transcriptions -F "file=@recording.wav"
# Text-to-speech
curl http://localhost:8080/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{"model":"tts","input":"Hello!","voice":"Samantha"}'

Same server, both APIs:
| API | Endpoint |
|---|---|
| OpenAI Chat | POST /v1/chat/completions |
| OpenAI Completions | POST /v1/completions |
| OpenAI Responses | POST /v1/responses |
| OpenAI Embeddings | POST /v1/embeddings |
| Anthropic Messages | POST /v1/messages |
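Because the two chat APIs return differently shaped JSON, client code that talks to both endpoints may want a single accessor. The helper below is a sketch with a hypothetical name. It covers only the OpenAI Chat and Anthropic Messages shapes used in the Python examples in this README, not the Completions or Responses endpoints.

```python
def extract_text(response: dict) -> str:
    """Pull the assistant text out of either chat response shape."""
    if "choices" in response:
        # OpenAI Chat Completions: choices[0].message.content
        return response["choices"][0]["message"].get("content") or ""
    if "content" in response:
        # Anthropic Messages: a list of content blocks; keep the text ones.
        return "".join(
            block["text"]
            for block in response["content"]
            if block.get("type") == "text"
        )
    raise ValueError("unrecognized response shape")


openai_style = {"choices": [{"message": {"role": "assistant", "content": "Hello!"}}]}
anthropic_style = {"content": [{"type": "text", "text": "Hello!"}]}

print(extract_text(openai_style))
print(extract_text(anthropic_style))
```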
Automatically detects AI coding agents (Claude Code, OpenCode, OpenClaw, Hermes) from request headers and scales reported token counts so auto-compact triggers at the right time for local model context windows. Normal chat clients get real token counts. Detection needs no configuration; scaling is enabled by setting contextScalingTarget, as described in the context scaling details earlier in this document.
Any SafeTensors model from HuggingFace in 4-bit, 8-bit, or FP16. Popular choices:
| Model | Size | Download Command |
|---|---|---|
| Llama 3.1 8B | ~4.5 GB | nova download mlx-community/Meta-Llama-3.1-8B-Instruct-4bit |
| Qwen 2.5 7B | ~4.5 GB | nova download mlx-community/Qwen2.5-7B-Instruct-4bit |
| Gemma 2 9B | ~5.5 GB | nova download mlx-community/gemma-2-9b-it-4bit |
| Phi 3.5 Mini | ~2 GB | nova download mlx-community/Phi-3.5-mini-instruct-4bit |
| Mistral 7B | ~4 GB | nova download mlx-community/Mistral-7B-Instruct-v0.3-4bit |
| Qwen 2.5 VL 7B | ~4.5 GB | nova download mlx-community/Qwen2.5-VL-7B-Instruct-4bit |
Search for more: nova search "your model name"
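The sizes in the table can be roughly reproduced from the parameter count. The sketch below is a back-of-the-envelope estimate, not something NovaMLX computes: the function name is hypothetical, the parameter count is approximate, and the default of 4.5 bits per weight is an assumption (4-bit values plus per-group scale and bias overhead at a group size of 64).

```python
def estimate_weights_gb(params_billion: float, bits_per_weight: float = 4.5) -> float:
    """Rough size of quantized weights in GB (10^9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Approximate parameter count for Llama 3.1 8B (an assumption, in billions).
llama_params = 8.0

print(f"4-bit: ~{estimate_weights_gb(llama_params):.1f} GB")
# Same overhead assumption applied to 8-bit weights.
print(f"8-bit: ~{estimate_weights_gb(llama_params, bits_per_weight=8.5):.1f} GB")
```

Under these assumptions an 8B model at 4-bit comes out near the ~4.5 GB listed above. The KV cache and runtime buffers need memory on top of the weights.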
# API authentication (optional — no auth when empty)
export NOVAMLX_API_KEYS='["sk-your-key"]'

# Via API
curl -X PUT http://localhost:8081/admin/models/my-model/settings \
-H "Content-Type: application/json" \
-d '{"temperature": 0.7, "max_context_window": 8192, "kv_bits": 4}'

~/.nova/config.json:
{
  "host": "127.0.0.1",
  "port": 8080,
  "adminPort": 8081,
  "apiKeys": []
}

NovaMLX supports routing requests to cloud API providers (OpenAI, Anthropic, Groq, etc.) for models you don't have locally. Access via the TokenHub tab in the menu bar GUI.
NovaMLX integrates with tknet.ai for automatic nova model discovery and provider provisioning:
- Configure API Key:
  - Open Settings → tknet.ai section
  - Enter your tknet.ai API Key (format: sk-xxxxx)
  - Click "Verify & Fetch Models"
- Auto-Provisioning:
  - Nova providers (tagged with ⭐) are automatically created for each nova-tagged model
  - Providers inherit API Key from Settings (not stored per-provider)
  - On app launch, nova providers sync with tknet.ai model catalog
- Provider Features:
  - AWS-style masking: API keys display as sk-a...456 (first 4 + ... + last 3)
  - Visibility toggle: Eye icon to show/hide API keys in edit form
  - Managed protection: Nova providers cannot be deleted or have their endpoints modified
  - Unlimited slots: Valid tknet.ai API Key unlocks unlimited third-party provider slots (vs 3 for free users)
- Usage:

  # Use tknet provider via CLI (prefix: tknet:)
  nova chat --model tknet:deepseek-v4-flash
  # Or via API
  curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"tknet:deepseek-v4-flash","messages":[{"role":"user","content":"Hello!"}]}'
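The masking rule listed under Provider Features (first 4 characters, then "...", then the last 3) is simple to reproduce. This is a sketch with a hypothetical function name; how NovaMLX treats keys too short to mask is not stated, so hiding them entirely is an assumption.

```python
def mask_api_key(key: str) -> str:
    """Mask an API key as first 4 characters + '...' + last 3 characters."""
    if len(key) <= 7:
        # Assumption: keys too short to mask safely are hidden entirely.
        return "*" * len(key)
    return f"{key[:4]}...{key[-3:]}"


print(mask_api_key("sk-abc123456"))  # prints sk-a...456
```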
Add custom providers directly in TokenHub:
| Field | Example |
|---|---|
| Name | my-openai |
| Endpoint | https://api.openai.com/v1 |
| API Key | sk-... |
| Model | gpt-4o |
Use provider catalog presets for popular services (OpenAI, Anthropic, Groq, Together, Fireworks, Mistral, DeepSeek, OpenRouter, Gemini, xAI, DashScope, GLM).
- macOS 15.0 (Sequoia) or later
- Apple Silicon Mac (M1, M2, M3, M4)
- 16 GB RAM recommended (8 GB works for smaller models)
See DEVELOPMENT.md for:
- Architecture overview (11-module design)
- Building from source
- Running tests
- Creating releases
- API reference (all 40+ endpoints)
