Skip to content

Latest commit

 

History

History
920 lines (670 loc) · 23 KB

File metadata and controls

920 lines (670 loc) · 23 KB

Local LLM Tools - Complete Guide

A comprehensive guide to tools for running large language models locally on your own hardware.

Table of Contents


Why Run LLMs Locally?

Privacy

  • No data leaves your machine - Sensitive information stays local
  • No logging by third parties - Your queries are never stored externally
  • Compliance friendly - Meet GDPR, HIPAA, SOC2 requirements
  • Trade secrets protected - Proprietary code and data remain secure

Cost

  • No API fees - Eliminate per-token costs
  • Unlimited usage - Run as many queries as you want
  • Predictable costs - One-time hardware investment
  • ROI over time - Heavy users save significantly

Performance

  • Lower latency - No network round-trips
  • Consistent speed - No rate limiting or queuing
  • Offline capability - Work without internet
  • Full control - Adjust parameters, quantization, and more

Development

  • Rapid iteration - No API rate limits
  • Debugging - Full visibility into model behavior
  • Experimentation - Try different models freely
  • Custom fine-tuning - Train on your own data

Tool Comparison

Quick Reference

Tool Interface Best For Ease of Use Model Library API Server
Ollama CLI Developers, automation ★★★★★ 100+ OpenAI-compatible
LM Studio GUI Beginners, experimentation ★★★★★ Extensive OpenAI-compatible
GPT4All GUI Privacy-focused users ★★★★☆ Curated Basic
Jan GUI ChatGPT replacement ★★★★☆ Good OpenAI-compatible
LocalAI API OpenAI drop-in ★★★☆☆ Very flexible OpenAI-compatible
text-gen-webui Web Advanced customization ★★★☆☆ Any GGUF/GPTQ OpenAI-compatible
Haplo AI App iOS/macOS users ★★★★☆ Limited Native
Msty GUI Desktop users ★★★★☆ Good Basic

Feature Matrix

Feature Ollama LM Studio GPT4All Jan LocalAI
macOS
Windows
Linux
GPU Support
Apple Silicon ✅ Native ✅ Native ✅ Native ✅ Native
Docker
RAG Built-in
Model Fine-tuning
Open Source

Detailed Tool Reviews

Ollama

The Developer's Choice

Overview

Ollama is the most user-friendly command-line tool for running LLMs locally. It handles model downloads, GPU acceleration, and API serving with simple commands.

Key Features

  • One-line commands - ollama run llama3.1 to start chatting
  • 100+ models - Extensive model library
  • OpenAI-compatible API - Drop-in replacement
  • Automatic GPU detection - Optimal hardware utilization
  • Model management - Easy pull, list, remove
  • Modelfile - Custom model configurations
  • Excellent performance - Highly optimized inference

Strengths

  • Extremely easy to install and use
  • Best CLI experience
  • Native Apple Silicon support
  • Large community and ecosystem
  • Integrates with most frameworks
  • Actively maintained

Limitations

  • No GUI (CLI only)
  • Basic chat interface
  • No built-in RAG
  • No fine-tuning support

Installation

# macOS/Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows
# Download from https://ollama.com/download

# Docker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama

Basic Usage

# Pull a model
ollama pull llama3.1:8b

# Chat with model
ollama run llama3.1:8b

# List models
ollama list

# Run with parameters
ollama run llama3.1:8b --verbose

# API request
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Why is the sky blue?"
}'

Popular Models

Model Size RAM Best For
llama3.1:8b 4.7GB 8GB General use
llama3.1:70b 40GB 48GB+ Complex tasks
qwen2.5:14b 9GB 16GB Reasoning
mistral:7b 4.1GB 8GB Fast inference
codellama:13b 7.4GB 16GB Code generation
deepseek-coder-v2 8.9GB 16GB Advanced coding

Integration Example

from langchain_ollama import OllamaLLM

llm = OllamaLLM(model="llama3.1:8b")
response = llm.invoke("Explain quantum computing")
print(response)

LM Studio

The Power User's GUI

  • Website: lmstudio.ai
  • License: Proprietary (Free to use)
  • Platforms: macOS, Windows, Linux

Overview

LM Studio provides a beautiful desktop application for discovering, downloading, and running LLMs. It's ideal for users who prefer graphical interfaces and want extensive customization options.

Key Features

  • Model discovery - Browse and search Hugging Face models
  • One-click download - Easy model acquisition
  • Chat interface - Clean, intuitive UI
  • Parameter control - Extensive inference settings
  • Local API server - OpenAI-compatible endpoint
  • Multi-model - Run different models simultaneously
  • Conversation history - Save and export chats

Strengths

  • Best GUI experience
  • Extensive model discovery
  • Detailed parameter controls
  • Good for experimentation
  • Nice conversation management
  • Supports fine-tuning

Limitations

  • Not open source
  • No headless/server mode
  • Heavier resource usage
  • Limited automation capabilities

Installation

  1. Visit lmstudio.ai
  2. Download for your platform
  3. Install and launch
  4. Browse models and download
  5. Start chatting!

Usage

  1. Discover Models: Use the search to find models
  2. Download: Click download and wait
  3. Load: Select model in chat interface
  4. Configure: Adjust temperature, top-p, etc.
  5. Chat: Start interacting
  6. API Server: Enable in settings for external access

API Server Setup

  1. Go to Settings > Local Server
  2. Enable server
  3. Default: http://localhost:1234/v1
  4. Use with any OpenAI-compatible client

Integration Example

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)

GPT4All

Privacy-First Desktop AI

Overview

GPT4All is built on principles of privacy, security, and offline capability. It provides a polished desktop application with built-in RAG for document chat.

Key Features

  • 100% offline - No internet required after download
  • LocalDocs - Chat with your documents
  • Curated models - Quality-tested selection
  • Simple UI - Beginner-friendly interface
  • Plugin system - Extend capabilities
  • Cross-platform - Works everywhere
  • Python bindings - Use in your code

Strengths

  • Very privacy-focused
  • Built-in document chat
  • Simple, clean interface
  • Well-curated models
  • Low barrier to entry
  • Good for beginners

Limitations

  • Smaller model selection
  • Less customization
  • Basic API
  • Some commercial restrictions

Installation

  1. Download from gpt4all.io
  2. Install the application
  3. Launch and download a model
  4. Start chatting

LocalDocs Feature

  1. Go to Settings > LocalDocs
  2. Add folder with your documents
  3. Wait for indexing
  4. Enable LocalDocs in chat
  5. Ask questions about your documents

Python Usage

from gpt4all import GPT4All

model = GPT4All("Meta-Llama-3-8B-Instruct.Q4_0.gguf")

with model.chat_session():
    response = model.generate("What is machine learning?")
    print(response)

Jan

Open-Source ChatGPT Alternative

  • Website: jan.ai
  • GitHub: janhq/jan
  • License: AGPL-3.0
  • Platforms: macOS, Windows, Linux

Overview

Jan aims to be a complete, open-source replacement for ChatGPT. It runs 100% offline and provides a familiar chat interface with additional features like extensions and API server.

Key Features

  • ChatGPT-like UI - Familiar interface
  • 100% offline - Complete privacy
  • Extension system - Add capabilities
  • API server - OpenAI-compatible
  • Model hub - Easy model downloads
  • Thread management - Organize conversations
  • Open source - Full transparency

Strengths

  • Fully open source
  • Familiar interface
  • Good model selection
  • Active development
  • Privacy-focused
  • Extension ecosystem

Limitations

  • Some stability issues
  • Fewer models than Ollama
  • Resource usage can be high
  • Younger project

Installation

  1. Download from jan.ai
  2. Install for your platform
  3. Launch application
  4. Download a model from the hub
  5. Start chatting

API Server

  1. Go to Settings
  2. Enable API Server
  3. Default: http://localhost:1337/v1
  4. Use with external tools

Configuration

Jan stores configuration in ~/jan:

  • models/ - Downloaded models
  • threads/ - Conversation history
  • extensions/ - Installed extensions

LocalAI

OpenAI API Drop-in Replacement

Overview

LocalAI is a local OpenAI-compatible API that supports various model formats. It's designed as a drop-in replacement for OpenAI's API.

Key Features

  • OpenAI API compatible - Full API support
  • Multiple formats - GGUF, GPTQ, AWQ, etc.
  • Multi-modal - Text, audio, images
  • Embeddings - Vector generation
  • Speech - TTS and STT
  • Docker-first - Easy containerized deployment
  • GPU support - CUDA and ROCm

Strengths

  • Best OpenAI compatibility
  • Supports many model formats
  • Multi-modal capabilities
  • Good for production
  • Container-friendly

Limitations

  • More complex setup
  • Docker recommended
  • Higher learning curve
  • Less user-friendly

Installation

# Docker (recommended)
docker run -p 8080:8080 \
  -v $PWD/models:/models \
  localai/localai:latest

# Or with GPU
docker run --gpus all -p 8080:8080 \
  -v $PWD/models:/models \
  localai/localai:latest-cublas-cuda12

Usage

# Download a model
curl http://localhost:8080/models/apply -H "Content-Type: application/json" \
  -d '{"url": "github:go-skynet/model-gallery/mistral.yaml"}'

# Chat completion
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" \
  -d '{
    "model": "mistral",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Text Generation WebUI

The Swiss Army Knife

Overview

Text Generation WebUI (also known as oobabooga) is a highly flexible web interface for running LLMs. It supports numerous model formats and has an extensive extension ecosystem.

Key Features

  • Multiple backends - llama.cpp, ExLlamav2, Transformers
  • Many formats - GGUF, GPTQ, AWQ, EXL2, etc.
  • Extension system - Tons of add-ons
  • Training - LoRA fine-tuning
  • Character chat - Role-play modes
  • API - OpenAI-compatible endpoint
  • Notebooks - Interactive mode

Strengths

  • Most flexible tool
  • Huge extension library
  • Supports fine-tuning
  • Many loading options
  • Active community
  • Good for advanced users

Limitations

  • Complex setup
  • Can be overwhelming
  • Performance varies by backend
  • Steeper learning curve

Installation

# Clone repository
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui

# Run installer (creates conda environment)
# Linux/macOS:
./start_linux.sh
# Windows:
start_windows.bat

Model Download

  1. Go to "Model" tab
  2. Enter model name (e.g., TheBloke/Llama-2-7B-Chat-GGUF)
  3. Click "Download"
  4. Select and load model

Haplo AI

Native iOS/macOS Experience

  • Platform: App Store (iOS/macOS)
  • License: Proprietary
  • Platforms: iOS, macOS

Overview

Haplo AI provides a native Apple experience for running LLMs locally on iPhone, iPad, and Mac.

Key Features

  • Native app - True Apple integration
  • On-device - Models run locally
  • Privacy - Data stays on device
  • Siri integration - Voice commands
  • Continuity - Sync across devices
  • Apple Silicon optimized - Best performance

Strengths

  • Best iOS experience
  • Native Apple integration
  • Very user-friendly
  • Privacy-focused
  • Good performance on Apple Silicon

Limitations

  • Apple ecosystem only
  • Limited model selection
  • Less customization
  • Paid app

Msty

Modern Desktop LLM Client

  • Website: msty.app
  • Platforms: macOS, Windows, Linux

Overview

Msty is a modern desktop application for running local LLMs with a focus on user experience and productivity features.

Key Features

  • Clean UI - Modern interface
  • Multi-model - Use different models
  • Workspaces - Organize conversations
  • RAG - Document chat
  • Snippets - Save and reuse prompts
  • Branching - Explore conversation paths

Strengths

  • Beautiful interface
  • Good organization features
  • RAG support
  • Active development

Limitations

  • Newer tool
  • Smaller community
  • Still maturing

Model Recommendations

For General Conversation

Model Parameters Quantization RAM Quality
Llama 3.1 8B Instruct 8B Q4_K_M 8GB Excellent
Qwen 2.5 7B Instruct 7B Q4_K_M 8GB Excellent
Mistral 7B Instruct 7B Q4_K_M 8GB Very Good

For Coding

Model Parameters Quantization RAM Quality
DeepSeek Coder V2 16B Q4_K_M 12GB Excellent
CodeLlama 13B Instruct 13B Q4_K_M 10GB Very Good
Qwen 2.5 Coder 7B 7B Q4_K_M 8GB Very Good

For Reasoning

Model Parameters Quantization RAM Quality
Qwen 2.5 14B Instruct 14B Q4_K_M 12GB Excellent
Llama 3.1 70B Instruct 70B Q4_K_M 48GB Outstanding
Mistral Large 123B Q2_K 48GB+ Outstanding

For Speed

Model Parameters Quantization RAM Tokens/sec
Phi-3 Mini 3.8B Q4_K_M 4GB Very Fast
Mistral 7B 7B Q4_K_S 6GB Fast
Llama 3.2 3B 3B Q4_K_M 4GB Very Fast

Hardware Requirements

Minimum Requirements

  • CPU: Modern multi-core (Intel i5/AMD Ryzen 5 or better)
  • RAM: 8GB (for 7B models)
  • Storage: 10GB+ free space
  • GPU: Optional but recommended

Recommended Configurations

Entry Level (7-8B models)

  • CPU: Intel i5-12400 / AMD Ryzen 5 5600
  • RAM: 16GB DDR4
  • GPU: RTX 3060 12GB / RX 6700 XT
  • Storage: NVMe SSD

Mid Range (13-14B models)

  • CPU: Intel i7-13700 / AMD Ryzen 7 7700
  • RAM: 32GB DDR5
  • GPU: RTX 4070 12GB / RX 7800 XT
  • Storage: NVMe SSD

High End (30-70B models)

  • CPU: Intel i9-14900K / AMD Ryzen 9 7950X
  • RAM: 64GB+ DDR5
  • GPU: RTX 4090 24GB / 2x RTX 4080
  • Storage: Fast NVMe SSD

Apple Silicon

Apple Silicon Macs are excellent for local LLMs due to unified memory:

Chip Unified Memory Max Model Size
M1 8-16GB 7-13B
M1 Pro/Max 16-64GB 30-70B
M2 8-24GB 7-14B
M2 Pro/Max 16-96GB 30-70B+
M3 8-24GB 7-14B
M3 Pro/Max 18-128GB 30-70B+
M4 16-32GB 14-30B
M4 Pro/Max 24-128GB 30-70B+

Installation Guides

Complete Ollama Setup

# 1. Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# 2. Verify installation
ollama --version

# 3. Start service (usually automatic)
ollama serve

# 4. Pull models
ollama pull llama3.1:8b
ollama pull nomic-embed-text
ollama pull codellama:13b

# 5. Test
ollama run llama3.1:8b "Hello, how are you?"

# 6. Configure for external access (optional)
# Edit /etc/systemd/system/ollama.service
# Add: Environment="OLLAMA_HOST=0.0.0.0"
# Then: systemctl daemon-reload && systemctl restart ollama

Complete LM Studio Setup

  1. Download: Visit lmstudio.ai
  2. Install: Run installer for your platform
  3. Launch: Open LM Studio
  4. Search Models: Go to Discover tab
  5. Download Model: Search "llama 3.1 8b instruct", download Q4_K_M
  6. Load Model: Go to Chat, select model
  7. Configure: Adjust context length, temperature
  8. Enable Server: Settings > Local Server > Enable
  9. Test API:
curl http://localhost:1234/v1/models

Complete GPT4All Setup

  1. Download: Get from gpt4all.io
  2. Install: Run installer
  3. Launch: Open GPT4All
  4. Download Model: Click "Download models", choose one
  5. Wait: Model downloads automatically
  6. Chat: Start conversing
  7. Setup LocalDocs:
    • Settings > LocalDocs
    • Add your document folders
    • Wait for indexing
    • Enable "Use LocalDocs" in chat

Complete Jan Setup

  1. Download: Get from jan.ai
  2. Install: Run installer
  3. Launch: Open Jan
  4. Model Hub: Browse available models
  5. Download: Click download on chosen model
  6. Select: Choose model in chat interface
  7. Enable API:
    • Settings > Advanced
    • Enable "API Server"
    • Note port (default 1337)

Complete LocalAI Setup

# 1. Create directory
mkdir localai && cd localai

# 2. Run with Docker
docker run -p 8080:8080 \
  --name local-ai \
  -v $PWD/models:/models \
  -e DEBUG=true \
  localai/localai:latest

# 3. Download a model gallery entry
curl http://localhost:8080/models/apply -H "Content-Type: application/json" \
  -d '{"url": "github:go-skynet/model-gallery/llama3-8b-instruct.yaml"}'

# 4. Wait for download, then test
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3-8b-instruct",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

# 5. With GPU support
docker run --gpus all -p 8080:8080 \
  --name local-ai-gpu \
  -v $PWD/models:/models \
  localai/localai:latest-cublas-cuda12

Troubleshooting

Common Issues

Model Won't Load

  • Check available RAM
  • Try smaller quantization (Q4_K_S instead of Q4_K_M)
  • Close other applications
  • Check GPU VRAM if using GPU

Slow Performance

  • Enable GPU acceleration
  • Use smaller model or quantization
  • Increase context window only as needed
  • Check thermal throttling

API Connection Failed

  • Verify service is running
  • Check port is not in use
  • Confirm firewall settings
  • Check correct URL and port

Out of Memory

  • Use smaller model
  • Use lower quantization
  • Reduce context length
  • Enable GPU offloading


November 2025 Model Releases

Latest Flagship Models

Model Provider Release Key Features Parameters
Llama 4 Scout/Maverick Meta Nov 2025 Natively multimodal, advanced reasoning 8B-405B
Qwen3-Next/Omni Alibaba Nov 2025 Next-gen architecture, omni-modal Various
Qwen3-Coder-480B Alibaba Nov 2025 Agentic coding, massive scale 480B
DeepSeek V3.2-Exp DeepSeek Nov 2025 Latest experimental improvements Various
DeepSeek R1 DeepSeek Nov 2025 Advanced reasoning, CoT Various
GPT-OSS OpenAI Nov 2025 First open-source from OpenAI Various
Gemini 3 Google Nov 2025 Next-gen multimodal Various
Grok 3/4 xAI Nov 2025 Latest Grok iterations Various
Claude 4 Anthropic Nov 2025 Advanced reasoning and safety Various
Phi 4 Microsoft Nov 2025 Small but powerful 3-14B

Running Latest Models Locally

Llama 4 with Ollama

# Pull Llama 4 Scout (efficient variant)
ollama pull llama4:scout

# Pull Llama 4 Maverick (advanced variant)
ollama pull llama4:maverick

# Run with multimodal support
ollama run llama4:scout --verbose

Qwen3 Series

# Qwen3 base models
ollama pull qwen3:7b
ollama pull qwen3:14b
ollama pull qwen3:32b

# Qwen3-Coder for agentic coding
ollama pull qwen3-coder:14b

DeepSeek Latest

# DeepSeek V3.2 experimental
ollama pull deepseek-v3.2:latest

# DeepSeek R1 for reasoning
ollama pull deepseek-r1:latest

November 2025 Tool Updates

Ollama 2.x Improvements

The latest Ollama releases bring significant enhancements:

  • Improved vision support - Better multimodal handling
  • Faster downloads - Optimized model pulling
  • Enhanced API - New endpoints for advanced features
  • Memory optimization - Better RAM management
  • Multi-GPU - Improved distribution across GPUs

LM Studio 2.0 Features

  • Model comparison - Side-by-side model testing
  • Batch inference - Process multiple prompts
  • Advanced presets - Save complex configurations
  • Integration hub - Connect to external tools
  • Performance profiling - Detailed benchmarks

Jan Improvements

  • Plugin ecosystem - Growing extension library
  • Improved stability - Better memory management
  • Enhanced API - Full OpenAI compatibility
  • Model recommendations - Smart model suggestions
  • Team features - Shared workspaces (beta)

LocalAI Enhancements

  • Faster startup - Reduced initialization time
  • New backends - Additional model format support
  • GPU detection - Automatic optimal configuration
  • Metrics dashboard - Built-in monitoring
  • Cluster mode - Distributed inference

Conclusion

For most users, we recommend:

  1. Ollama - Best for developers and automation
  2. LM Studio - Best for GUI users and experimentation
  3. GPT4All - Best for privacy-focused users
  4. Jan - Best as ChatGPT replacement

Start with Ollama if you're comfortable with command line, or LM Studio if you prefer a visual interface. Both provide excellent performance and are well-maintained.

Remember to choose models based on your hardware - a well-optimized 7B model often outperforms a poorly-running 70B model!


Last Updated: November 2025