NovaMLX

The blazing-fast, pure-Swift LLM/VLM server built and optimized for Apple Silicon.
No Python. No cloud. No limits.

Run 50+ model families — Llama, Qwen, Gemma, DeepSeek, Mistral — natively on your Mac.
100% Swift. Zero Python dependencies. OpenAI & Anthropic compatible. Native menu bar app.

Install

Option 1: Homebrew (recommended)

brew tap cnshsliu/novamlx
brew install novamlx
brew services start novamlx

Option 2: Download DMG

Go to Releases and download the latest NovaMLX-X.X.X-arm64.dmg, then:

1. Open the .dmg file
2. Drag NovaMLX to your Applications folder
3. Launch NovaMLX. The menu bar icon appears and the server starts on localhost:8080

Option 3: Build from source

git clone https://github.com/cnshsliu/novamlx.git
cd novamlx
./build.sh -c release

Requires macOS 15 (Sequoia), Apple Silicon, and Xcode 16+.

Quick Start

1. Start the server

Launch NovaMLX from your Applications folder (or Spotlight).

A menu bar icon appears. The server runs on localhost:8080.

2. (Optional) Add `nova` CLI to your PATH

The nova CLI is bundled inside the app. Symlink it for easy access:

sudo ln -s /Applications/NovaMLX.app/Contents/MacOS/nova /usr/local/bin/nova

3. Download a model

nova download mlx-community/Meta-Llama-3.1-8B-Instruct-4bit

4. Load it

nova load mlx-community/Meta-Llama-3.1-8B-Instruct-4bit

5. Use it

# Interactive chat
nova chat

# Or via API
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"mlx-community/Meta-Llama-3.1-8B-Instruct-4bit",
       "messages":[{"role":"user","content":"Write a haiku about coding."}]}'

That's it. You're running an LLM locally.

Use with Your Tools

NovaMLX is fully OpenAI API compatible. Point any tool at http://localhost:8080/v1.

Claude Code

ANTHROPIC_BASE_URL=http://localhost:8080 \
ANTHROPIC_API_KEY=unused \
claude

Or in your shell profile:

export ANTHROPIC_BASE_URL=http://localhost:8080
export ANTHROPIC_API_KEY=unused

OpenCode

Add to your opencode config (~/.config/opencode/config.json):

{
  "provider": {
    "name": "openai",
    "baseURL": "http://localhost:8080/v1",
    "apiKey": "unused",
    "models": {
      "default": "mlx-community/Meta-Llama-3.1-8B-Instruct-4bit"
    }
  }
}

Agent Context Scaling (automatic)

When using local models with AI coding agents (Claude Code, OpenCode, OpenClaw, Hermes), the model's context window is often smaller than Anthropic's 200K. NovaMLX auto-detects agent tools from their HTTP headers and scales reported token counts so that auto-compact triggers at the right time — before your local model runs out of context. Normal chat clients (curl, Python SDK, web UI) always get real token counts.

Detection needs no setup. NovaMLX identifies Claude Code by its Anthropic-Version header, and OpenCode, OpenClaw, and Hermes by User-Agent substring matching.
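
The detection rule described here can be sketched in a few lines of Python. This is an illustration only: NovaMLX is written in Swift, the function name and the exact User-Agent substrings are assumptions, and the real server may apply additional checks to tell agents apart from ordinary SDK clients.

```python
# Hypothetical sketch of header-based agent detection (not NovaMLX's actual code).
from typing import Dict

# Assumed substrings; the source only says "User-Agent substring matching".
AGENT_USER_AGENT_MARKERS = ("opencode", "openclaw", "hermes")


def is_agent_request(headers: Dict[str, str]) -> bool:
    """Return True if the request headers look like an AI coding agent."""
    # HTTP header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value for name, value in headers.items()}

    # Claude Code: identified by the presence of an Anthropic-Version header.
    if "anthropic-version" in normalized:
        return True

    # OpenCode, OpenClaw, Hermes: identified by a User-Agent substring.
    user_agent = normalized.get("user-agent", "").lower()
    return any(marker in user_agent for marker in AGENT_USER_AGENT_MARKERS)
```

A plain `curl` request carries neither signal, so under this sketch it would be treated as a normal client and receive real token counts.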

Scaling itself is off by default. To enable it, set contextScalingTarget in ~/.nova/config.json:

{
  "server": {
    "host": "127.0.0.1",
    "port": 8080,
    "adminPort": 8081,
    "apiKeys": [],
    "contextScalingTarget": 200000
  }
}

If your model has a 128K context window and contextScalingTarget is 200000, token counts are scaled by 200000 / 128000 = 1.56× — but only for detected agent tools. If contextScalingTarget is omitted, no scaling occurs (default).

How it works: Claude Code and other agents auto-compact conversation history at ~80% of what they believe the context window to be. Scaling the usage numbers makes the agent reach that threshold when the local model is at about 80% of its real context window, which prevents context overflow errors. In the 128K example above, compaction triggers at roughly 102K real tokens instead of 160K.
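
The arithmetic can be checked with a short Python sketch. The function names, the 0.8 compaction ratio, and the 200K agent window are taken from the description above or assumed for illustration; this is not NovaMLX's implementation.

```python
# Hypothetical sketch of the token-scaling arithmetic (illustrative names).
from typing import Optional

COMPACT_RATIO = 0.8        # agents compact at ~80% of their assumed window
AGENT_WINDOW = 200_000     # window size the agent believes it has (assumed)


def scale_factor(model_window: int, scaling_target: Optional[int]) -> float:
    """Factor applied to token counts; 1.0 when contextScalingTarget is unset."""
    if scaling_target is None:
        return 1.0
    return scaling_target / model_window


def reported_tokens(real_tokens: int, model_window: int,
                    scaling_target: Optional[int], is_agent: bool) -> int:
    """Token count reported to the client; only detected agents are scaled."""
    if not is_agent:
        return real_tokens
    return round(real_tokens * scale_factor(model_window, scaling_target))


def real_tokens_at_compaction(model_window: int,
                              scaling_target: Optional[int]) -> int:
    """Real token count at which the agent's compaction threshold is reached."""
    threshold = COMPACT_RATIO * AGENT_WINDOW
    return round(threshold / scale_factor(model_window, scaling_target))


print(scale_factor(128_000, 200_000))               # 1.5625
print(reported_tokens(64_000, 128_000, 200_000, True))   # 100000
print(real_tokens_at_compaction(128_000, 200_000))  # 102400
print(real_tokens_at_compaction(128_000, None))     # 160000
```

Without scaling, the agent would wait until 160,000 real tokens before compacting, which is more than a 128K model can hold. With scaling, it compacts at 102,400.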

Cursor

Settings → Models → OpenAI API Compatible:

Field	Value
Base URL	`http://localhost:8080/v1`
API Key	`unused`
Model ID	`mlx-community/Meta-Llama-3.1-8B-Instruct-4bit`

Continue.dev

Add to ~/.continue/config.json:

{
  "models": [
    {
      "title": "NovaMLX Local",
      "provider": "openai",
      "apiBase": "http://localhost:8080/v1",
      "apiKey": "unused",
      "model": "mlx-community/Meta-Llama-3.1-8B-Instruct-4bit"
    }
  ]
}

Python (OpenAI SDK)

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

response = client.chat.completions.create(
    model="mlx-community/Meta-Llama-3.1-8B-Instruct-4bit",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

Python (Anthropic SDK)

import anthropic

client = anthropic.Anthropic(base_url="http://localhost:8080", api_key="unused")
response = client.messages.create(
    model="mlx-community/Meta-Llama-3.1-8B-Instruct-4bit",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.content[0].text)

cURL / Any HTTP Client

# Chat
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"my-model","messages":[{"role":"user","content":"Hi"}]}'

# Streaming
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"my-model","messages":[{"role":"user","content":"Hi"}],"stream":true}'

# Embeddings
curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model":"my-embed-model","input":"Hello world"}'
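
The streaming request above returns the reply incrementally. Assuming NovaMLX follows the OpenAI server-sent-events format (`data: {json}` lines ending with `data: [DONE]`), which the source implies but does not show, a client without an SDK could extract the text like this. The helper name and the sample lines are illustrative.

```python
# Sketch: extract text from OpenAI-style streaming lines (format is assumed).
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text fragments contained in a sequence of SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not data
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hand-written sample standing in for a real response body.
sample = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    "",
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_text(sample)))  # prints "Hello"
```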

Managing Models with `nova`

The nova CLI lets you manage everything from the terminal:

# Find models
nova search "llama 3.1 4bit"

# Download
nova download mlx-community/Meta-Llama-3.1-8B-Instruct-4bit

# Load into memory
nova load mlx-community/Meta-Llama-3.1-8B-Instruct-4bit

# List loaded models
nova models

# Unload (free memory)
nova unload mlx-community/Meta-Llama-3.1-8B-Instruct-4bit

# Delete downloaded files
nova delete mlx-community/Meta-Llama-3.1-8B-Instruct-4bit

# Chat interactively
nova chat

# Server status & GPU memory
nova status

KV Quantization (TurboQuant)

Compress the KV cache to serve longer contexts:

# Enable 4-bit KV quantization (recommended)
nova turboquant my-model 4

# Enable 2-bit (maximum compression)
nova turboquant my-model 2

# Disable
nova turboquant my-model off

# Check status
nova turboquant
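
To see why KV quantization allows longer contexts, a rough size estimate helps. The sketch below uses the published Llama 3.1 8B shape (32 layers, 8 KV heads, head dimension 128), assumes a 16-bit unquantized baseline, and ignores the per-group scale overhead that real quantization schemes add. It does not describe TurboQuant's actual storage format.

```python
# Back-of-the-envelope KV cache size (ignores quantization metadata overhead).

def kv_cache_bytes_per_token(layers: int, kv_heads: int,
                             head_dim: int, bits: int) -> int:
    """Bytes needed to cache one token: a key and a value vector per layer."""
    elements = 2 * layers * kv_heads * head_dim  # 2 = key + value
    return elements * bits // 8


# Llama 3.1 8B shape (from the model's published config, not from NovaMLX).
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
CONTEXT_TOKENS = 32_768

for bits in (16, 4, 2):
    per_token = kv_cache_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, bits)
    total_gib = per_token * CONTEXT_TOKENS / 2**30
    print(f"{bits}-bit: {per_token // 1024} KiB per token, "
          f"{total_gib:.2f} GiB at {CONTEXT_TOKENS} tokens")
```

Under these assumptions a 32K-token context needs about 4 GiB at 16-bit, 1 GiB at 4-bit, and 0.5 GiB at 2-bit.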

Other Commands

# Sessions
nova sessions              # List active sessions
nova sessions delete ID    # Delete a session

# Cache
nova cache my-model        # Show cache stats
nova cache my-model clear  # Clear cache

# LoRA adapters
nova adapters              # List loaded adapters
nova adapters load /path/to/adapter
nova adapters unload my-adapter

# Benchmark
nova bench start my-model  # Run performance benchmark
nova bench status          # Check benchmark progress

Managing via GUI

macOS Menu Bar App

When you start NovaMLX, a menu bar icon appears showing:

Server status (running/stopped)
Loaded models
GPU memory usage
Active requests
Tokens per second

Click the icon to open the Dashboard window for detailed monitoring.

What Can NovaMLX Do?

50+ Model Architectures

Works with any SafeTensors model from HuggingFace — Llama 3, Qwen 2/2.5/3, Gemma 2/3, Phi 3.5/4, Mistral, Mixtral, DeepSeek, StarCoder2, and many more.

Vision (VLM)

Send images with your messages — supports Qwen2-VL, Gemma3, LLaVA, Phi-3-Vision, Pixtral, Molmo, and others:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

response = client.chat.completions.create(
    model="mlx-community/Qwen2.5-VL-7B-Instruct-4bit",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}}
        ]
    }]
)
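
For an image on disk rather than at a URL, the OpenAI API convention is to pass a base64 `data:` URL in the same `image_url` field. Whether NovaMLX accepts data URLs is an assumption here, and the helper names are illustrative.

```python
# Sketch: build an image_url content part from a local file (assumes the
# server accepts base64 data URLs, as the OpenAI API does).
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path: str) -> str:
    """Encode a local image file as a data URL."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognized image type: {path}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def image_part(path: str) -> dict:
    """Build the content entry that replaces the https:// URL variant."""
    return {"type": "image_url", "image_url": {"url": image_to_data_url(path)}}
```

`image_part("photo.jpg")` can then take the place of the `image_url` entry in the `content` list above.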

Structured Output

Force the model to output valid JSON matching your schema:

# Reuses the `client` created in the Python (OpenAI SDK) example.
response = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Who won the 2022 World Cup?"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "answer",
            "schema": {
                "type": "object",
                "properties": {
                    "winner": {"type": "string"},
                    "score": {"type": "string"}
                },
                "required": ["winner", "score"]
            }
        }
    }
)
# Example message content: {"winner": "Argentina", "score": "3-3 (4-2 pens)"}
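
The structured reply arrives as a JSON string in the message content, so the client still has to parse it. A minimal sketch, with an illustrative helper name and a hard-coded string standing in for `response.choices[0].message.content`:

```python
# Sketch: parse a structured-output reply and check the required keys.
import json
from typing import List


def parse_structured(content: str, required: List[str]) -> dict:
    """Parse JSON content and verify that every required key is present."""
    data = json.loads(content)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [key for key in required if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data


# Stand-in for response.choices[0].message.content
content = '{"winner": "Argentina", "score": "3-3 (4-2 pens)"}'
answer = parse_structured(content, ["winner", "score"])
print(answer["winner"])  # prints "Argentina"
```

The key check is a cheap guard on the client side. It does not replace the schema enforcement the server performs.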

Also supports: JSON mode, Regex patterns, and GBNF grammars.

Tool Calling

Automatic tool call detection across 7 format families — works with any model without fine-tuning.
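
On the client side, detected tool calls still have to be executed. The sketch below assumes NovaMLX returns them in the OpenAI `tool_calls` shape, which the source does not show. `get_weather`, `TOOLS`, and `run_tool_calls` are hypothetical names.

```python
# Sketch: dispatch OpenAI-style tool calls to local Python functions.
# The tool_calls shape is assumed; get_weather is a made-up example tool.
import json
from typing import Any, Callable, Dict, List


def get_weather(city: str) -> str:
    """Hypothetical tool; a real one would query a weather service."""
    return f"Sunny in {city}"


TOOLS: Dict[str, Callable[..., Any]] = {"get_weather": get_weather}


def run_tool_calls(tool_calls: List[dict]) -> List[dict]:
    """Run each requested tool and build the `tool` messages to send back."""
    results = []
    for call in tool_calls:
        function = call["function"]
        name = function["name"]
        if name not in TOOLS:
            raise KeyError(f"model requested unknown tool: {name}")
        arguments = json.loads(function["arguments"])  # arguments arrive as a JSON string
        output = TOOLS[name](**arguments)
        results.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": str(output),
        })
    return results


sample_calls = [{
    "id": "call_1",
    "type": "function",
    "function": {"name": "get_weather", "arguments": '{"city": "Paris"}'},
}]
print(run_tool_calls(sample_calls)[0]["content"])  # prints "Sunny in Paris"
```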

Embeddings & Reranking

# Embeddings for RAG/semantic search
curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model":"my-embed-model","input":"Hello world"}'

# Rerank documents
curl http://localhost:8080/v1/rerank \
  -H "Content-Type: application/json" \
  -d '{"model":"my-reranker","query":"What is MLX?","documents":["doc1","doc2"]}'

Audio (STT/TTS)

Uses Apple's built-in on-device speech recognition and synthesis:

# Speech-to-text
curl http://localhost:8080/v1/audio/transcriptions -F "file=@recording.wav"

# Text-to-speech
curl http://localhost:8080/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model":"tts","input":"Hello!","voice":"Samantha"}'

Both OpenAI and Anthropic APIs

Same server, both APIs:

API	Endpoint
OpenAI Chat	`POST /v1/chat/completions`
OpenAI Completions	`POST /v1/completions`
OpenAI Responses	`POST /v1/responses`
OpenAI Embeddings	`POST /v1/embeddings`
Anthropic Messages	`POST /v1/messages`

Agent-Aware Token Scaling

Automatically detects AI coding agents (Claude Code, OpenCode, OpenClaw, Hermes) from request headers and scales reported token counts so auto-compact triggers at the right time for local model context windows. Normal chat clients get real token counts. Detection needs no configuration; scaling takes effect once contextScalingTarget is set. See Agent Context Scaling above for details.

Supported Models

Any SafeTensors model from HuggingFace in 4-bit, 8-bit, or FP16. Popular choices:

Model	Size	Download Command
Llama 3.1 8B	~4.5 GB	`nova download mlx-community/Meta-Llama-3.1-8B-Instruct-4bit`
Qwen 2.5 7B	~4.5 GB	`nova download mlx-community/Qwen2.5-7B-Instruct-4bit`
Gemma 2 9B	~5.5 GB	`nova download mlx-community/gemma-2-9b-it-4bit`
Phi 3.5 Mini	~2 GB	`nova download mlx-community/Phi-3.5-mini-instruct-4bit`
Mistral 7B	~4 GB	`nova download mlx-community/Mistral-7B-Instruct-v0.3-4bit`
Qwen 2.5 VL 7B	~4.5 GB	`nova download mlx-community/Qwen2.5-VL-7B-Instruct-4bit`

Search for more: nova search "your model name"

Configuration

Environment Variables

# API authentication (optional — no auth when empty)
export NOVAMLX_API_KEYS='["sk-your-key"]'
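
Once keys are configured, clients need to send one with each request. The sketch below assumes NovaMLX expects the OpenAI convention of an `Authorization: Bearer <key>` header, which the source does not state explicitly. The helper name is illustrative.

```python
# Sketch: build an authenticated chat request with the standard library.
# Assumes OpenAI-style Bearer authentication.
import json
import urllib.request


def build_chat_request(api_key: str, model: str, prompt: str,
                       base_url: str = "http://localhost:8080/v1") -> urllib.request.Request:
    """Create (but do not send) a POST request to the chat completions endpoint."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


request = build_chat_request("sk-your-key", "my-model", "Hi")
# To send it against a running server:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

With the OpenAI SDK, the equivalent is passing the key as `api_key` instead of `"unused"`.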

Per-Model Settings

# Via API
curl -X PUT http://localhost:8081/admin/models/my-model/settings \
  -H "Content-Type: application/json" \
  -d '{"temperature": 0.7, "max_context_window": 8192, "kv_bits": 4}'

Config File

~/.nova/config.json:

{
  "host": "127.0.0.1",
  "port": 8080,
  "adminPort": 8081,
  "apiKeys": []
}

Cloud Providers (TokenHub)

NovaMLX supports routing requests to cloud API providers (OpenAI, Anthropic, Groq, etc.) for models you don't have locally. Access via the TokenHub tab in the menu bar GUI.

tknet.ai Integration

NovaMLX integrates with tknet.ai for automatic nova model discovery and provider provisioning:

Configure API Key:
- Open Settings → tknet.ai section
- Enter your tknet.ai API Key (format: sk-xxxxx)
- Click "Verify & Fetch Models"
Auto-Provisioning:
- Nova providers (tagged with ⭐) are automatically created for each nova-tagged model
- Providers inherit API Key from Settings (not stored per-provider)
- On app launch, nova providers sync with tknet.ai model catalog
Provider Features:
- AWS-style masking: API keys display as sk-a...456 (first 4 + ... + last 3)
- Visibility toggle: Eye icon to show/hide API keys in edit form
- Managed protection: Nova providers cannot be deleted or have their endpoints modified
- Unlimited slots: Valid tknet.ai API Key unlocks unlimited third-party provider slots (vs 3 for free users)
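
The masking rule listed above (first 4 characters, `...`, last 3) is simple enough to sketch. The function name and the handling of very short keys are assumptions, not NovaMLX's actual behavior.

```python
# Sketch of the "first 4 + ... + last 3" key masking rule.

def mask_api_key(key: str, head: int = 4, tail: int = 3) -> str:
    """Return a display-safe version of an API key."""
    if len(key) <= head + tail:
        # Assumption: keys too short to mask meaningfully are hidden entirely.
        return "*" * len(key)
    return f"{key[:head]}...{key[-tail:]}"


print(mask_api_key("sk-abcdef123456"))  # prints "sk-a...456"
```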

Usage:

# Use tknet provider via CLI (prefix: tknet:)
nova chat --model tknet:deepseek-v4-flash

# Or via API
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"tknet:deepseek-v4-flash","messages":[{"role":"user","content":"Hello!"}]}'

Manual Provider Configuration

Add custom providers directly in TokenHub:

Field	Example
Name	`my-openai`
Endpoint	`https://api.openai.com/v1`
API Key	`sk-...`
Model	`gpt-4o`

Use provider catalog presets for popular services (OpenAI, Anthropic, Groq, Together, Fireworks, Mistral, DeepSeek, OpenRouter, Gemini, xAI, DashScope, GLM).

Requirements

macOS 15.0 (Sequoia) or later
Apple Silicon Mac (M1, M2, M3, M4)
16 GB RAM recommended (8 GB works for smaller models)

For Developers

See DEVELOPMENT.md for:

Architecture overview (11-module design)
Building from source
Running tests
Creating releases
API reference (all 40+ endpoints)

License

MIT
