This document outlines how to add embedded LLM capabilities to refactor-tool.
# Download a recommended model
refactor-tool --download-model codellama-7b
# Use embedded model for analysis
refactor-tool --analyze --ai --ai-provider embedded --ai-model codellama-7b
# List available models
refactor-tool --list-models
# Check system compatibility
refactor-tool --check-system

Update main.rs with new CLI options:
#[derive(Parser, Debug)]
struct Args {
    // ... existing fields ...

    /// Download a model for embedded use
    #[arg(long = "download-model")]
    download_model: Option<String>,

    /// List available models for download
    #[arg(long = "list-models")]
    list_models: bool,

    /// Check system requirements for running models
    #[arg(long = "check-system")]
    check_system: bool,

    /// Use embedded LLM (no external server needed)
    #[arg(long = "embedded")]
    use_embedded: bool,
}

[dependencies]
# Existing dependencies...
# For embedded LLM support
llama-cpp-2 = { version = "0.1", optional = true }
# OR
candle-core = { version = "0.3", optional = true }
candle-transformers = { version = "0.3", optional = true }
hf-hub = { version = "0.3", optional = true }
tokenizers = { version = "0.15", optional = true }
[features]
default = []
embedded-llm = ["llama-cpp-2"] # or candle dependencies

~/.refactor-tool/
├── models/
│ ├── codellama-7b.gguf
│ ├── qwen2.5-coder-7b.gguf
│ └── llama3.1-8b.gguf
├── config.toml
└── cache/
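
The layout above can be turned into paths with a few small helpers. This is a minimal, std-only sketch: the function names (`tool_root`, `model_path`, `expand_tilde`) are hypothetical and not part of refactor-tool, and the home directory is passed in explicitly rather than looked up. Rust's standard path APIs do not expand a leading `~/`, so a path written that way has to be resolved by hand.

```rust
use std::path::{Path, PathBuf};

/// Root of the tool's data directory: `<home>/.refactor-tool`.
/// (Hypothetical helper; the caller supplies the home directory.)
fn tool_root(home: &Path) -> PathBuf {
    home.join(".refactor-tool")
}

/// Expected location of a model file:
/// `<home>/.refactor-tool/models/<model_name>.gguf`.
fn model_path(home: &Path, model_name: &str) -> PathBuf {
    tool_root(home)
        .join("models")
        .join(format!("{model_name}.gguf"))
}

/// Replace a leading `~/` with `home`; any other path is returned unchanged.
fn expand_tilde(path: &str, home: &Path) -> PathBuf {
    match path.strip_prefix("~/") {
        Some(rest) => home.join(rest),
        None => PathBuf::from(path),
    }
}
```

For example, `model_path(Path::new("/home/dev"), "codellama-7b")` yields `/home/dev/.refactor-tool/models/codellama-7b.gguf`. Passing `home` as a parameter keeps the helpers easy to test and leaves the choice of how to discover the home directory to the caller.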
Create ~/.refactor-tool/config.toml:
[embedded]
default_model = "codellama-7b"
models_dir = "~/.refactor-tool/models"
gpu_layers = 0 # Number of layers to run on GPU
context_size = 4096
threads = 4
[models.codellama-7b]
path = "codellama-7b.gguf"
description = "Code Llama 7B for code analysis"
size_gb = 3.8
min_ram_gb = 8
# The key is quoted because a bare dot would split it into nested tables
[models."qwen2.5-coder-7b"]
path = "qwen2.5-coder-7b.gguf"
description = "Qwen 2.5 Coder 7B - excellent for Rust"
size_gb = 4.1
min_ram_gb = 8

- Complete Privacy: Code never leaves your machine
- No API Costs: Free to run after initial download
- Offline Operation: Works without internet
- No Rate Limits: Analyze as much code as you want
- Predictable Performance: No network latency
- Self-Contained: No external server or API required
- Storage: Models are 3-20GB each
- RAM Usage: Requires 8-32GB RAM depending on model
- Setup Time: Initial download and configuration
- Quality: Slightly lower than latest cloud models
- Hardware Requirements: Better GPU = faster inference
| Model | Size | RAM | Quality | Best For |
|---|---|---|---|---|
| CodeLlama-7B | 3.8GB | 8GB | Good | General code analysis |
| Qwen2.5-Coder-7B | 4.1GB | 8GB | Very Good | Rust/systems code |
| CodeLlama-13B | 7.3GB | 16GB | Excellent | Complex analysis |
| Qwen2.5-Coder-14B | 8.2GB | 16GB | Excellent | Best quality |
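
The table above can also drive a simple recommendation, for instance as part of a system check. The sketch below is illustrative only: `ModelInfo`, `runnable_models`, and `largest_runnable` are hypothetical names, the lowercase identifiers for the 13B and 14B models are assumed to follow the naming of the 7B ones, and treating "largest file that fits" as "best choice" is a simplifying assumption rather than a measured result. The RAM and free-disk figures are passed in by the caller instead of being detected.

```rust
/// One row of the model table (sizes in GB, as listed in the document).
#[derive(Debug, Clone, Copy, PartialEq)]
struct ModelInfo {
    name: &'static str,
    size_gb: f32,
    min_ram_gb: u32,
}

/// Figures copied from the table; the lowercase names are assumed identifiers.
static MODELS: [ModelInfo; 4] = [
    ModelInfo { name: "codellama-7b", size_gb: 3.8, min_ram_gb: 8 },
    ModelInfo { name: "qwen2.5-coder-7b", size_gb: 4.1, min_ram_gb: 8 },
    ModelInfo { name: "codellama-13b", size_gb: 7.3, min_ram_gb: 16 },
    ModelInfo { name: "qwen2.5-coder-14b", size_gb: 8.2, min_ram_gb: 16 },
];

/// Models whose RAM requirement fits in `ram_gb` and whose file fits in
/// `free_disk_gb`.
fn runnable_models(ram_gb: u32, free_disk_gb: f32) -> Vec<&'static ModelInfo> {
    MODELS
        .iter()
        .filter(|m| m.min_ram_gb <= ram_gb && m.size_gb <= free_disk_gb)
        .collect()
}

/// The largest runnable model by file size, or `None` if nothing fits.
/// Using size as a stand-in for quality is an assumption of this sketch.
fn largest_runnable(ram_gb: u32, free_disk_gb: f32) -> Option<&'static ModelInfo> {
    runnable_models(ram_gb, free_disk_gb)
        .into_iter()
        .max_by(|a, b| a.size_gb.total_cmp(&b.size_gb))
}
```

With 8 GB of RAM and ample disk space, only the two 7B models pass the filter; with less than 8 GB, `largest_runnable` returns `None`, which a system check could report as "no supported model fits this machine".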
- High Priority: llama.cpp bindings (most mature)
- Medium Priority: Candle integration (pure Rust)
- Low Priority: ONNX runtime (broader compatibility)
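
Because several backends are under consideration, it may help to hide them behind one small interface so that the choice can change later without touching the analysis code. The following is a sketch under that assumption: the `EmbeddedBackend` trait, `StubBackend`, and `analyze_function` are all hypothetical, the prompt text is invented for illustration, and no real llama.cpp, Candle, or ONNX calls are shown.

```rust
/// Hypothetical interface that each inference backend would implement.
trait EmbeddedBackend {
    /// Short identifier for logs and error messages.
    fn name(&self) -> &'static str;
    /// Run one completion; errors are plain strings to keep the sketch small.
    fn complete(&self, prompt: &str) -> Result<String, String>;
}

/// Stand-in backend that performs no inference; useful for wiring and tests.
struct StubBackend;

impl EmbeddedBackend for StubBackend {
    fn name(&self) -> &'static str {
        "stub"
    }

    fn complete(&self, prompt: &str) -> Result<String, String> {
        if prompt.trim().is_empty() {
            return Err("empty prompt".to_string());
        }
        Ok(format!("[stub] received {} bytes", prompt.len()))
    }
}

/// Analysis code depends only on the trait, not on a concrete backend.
fn analyze_function(backend: &dyn EmbeddedBackend, source: &str) -> Result<String, String> {
    // The prompt wording here is illustrative, not the tool's actual prompt.
    let prompt = format!("Suggest refactorings for this function:\n{source}");
    backend.complete(&prompt)
}
```

A real backend would implement the same two methods, so swapping llama.cpp bindings for Candle would be limited to constructing a different value that implements `EmbeddedBackend`.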
# Download models once per team
refactor-tool --download-model codellama-13b
# Use for private code analysis
refactor-tool --analyze --ai --ai-provider embedded --ai-model codellama-13b

# Use smaller model for quick analysis
refactor-tool --analyze --ai --ai-provider embedded --ai-model codellama-7b
# Unlimited analysis without API costs
refactor-tool --analyze --ai --ai-provider embedded --ai-max-functions 100

# Automated code quality checks
refactor-tool --analyze --ai --ai-provider embedded --quiet > analysis.json

- Choose implementation approach (llama.cpp vs Candle)
- Add CLI commands for model management
- Implement embedded LLM client
- Add configuration system
- Create model download/management system
- Test with real models
- Document setup and usage
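
For the CLI step in the list above, one possible way to route the new flags is to map them onto a single action before doing any work. This is a sketch, not the tool's implementation: `ModelArgs` is a plain-struct mirror of the new `Args` fields with the clap attributes left out so the example needs only the standard library, and the `Action` enum, the `choose_action` function, and the rule that model-management flags take precedence over analysis are all assumptions.

```rust
/// Plain mirror of the new CLI fields (clap attributes omitted on purpose).
#[derive(Debug, Default)]
struct ModelArgs {
    download_model: Option<String>,
    list_models: bool,
    check_system: bool,
    use_embedded: bool,
}

/// What the tool should do for one invocation (hypothetical).
#[derive(Debug, PartialEq)]
enum Action {
    Download(String),
    ListModels,
    CheckSystem,
    Analyze { embedded: bool },
}

/// Management flags win over analysis; this ordering is an assumption.
fn choose_action(args: &ModelArgs) -> Action {
    if let Some(name) = &args.download_model {
        Action::Download(name.clone())
    } else if args.list_models {
        Action::ListModels
    } else if args.check_system {
        Action::CheckSystem
    } else {
        Action::Analyze { embedded: args.use_embedded }
    }
}
```

Resolving the flags into one `Action` keeps `main` to a single `match` and makes the precedence between flags explicit and testable.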
This would make refactor-tool self-contained while retaining AI analysis capabilities, at a quality slightly below that of the latest cloud models.