| Layer | What It Handles | Relevant to 70B Model |
|---|---|---|
| Ollama | Model loading, quantization, GPU memory, inference speed | ✅ Primary |
| ConvertIt | Prompt efficiency, chunking, routing | ✅ Secondary |
Ollama handles the heavy lifting:
- Quantization: Use Q4 or Q5 quantized versions to fit in VRAM
- GPU Offloading: Ollama automatically offloads as many model layers to the GPU as will fit in VRAM
- Context Window: Ollama manages the KV cache
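To confirm what actually landed on the GPU once a model is loaded, two commands are usually enough (assumes an NVIDIA card):

```bash
# Show loaded models and the CPU/GPU split (e.g. "100% GPU" or "48%/52% CPU/GPU")
ollama ps

# Check VRAM usage directly
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
```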
ConvertIt optimizations still help:
- Smaller chunks (6K chars) = fewer tokens per call = faster inference
- Prompt compression = less input to process
- Hybrid routing won't matter much if you're running everything locally
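Rough math behind the chunk-size point (illustration only; ConvertIt does its own chunking internally, and the file names here are placeholders):

```bash
# At roughly 4 characters per token, a 6K-char chunk is ~1.5K tokens per call.
# GNU split can produce ~6K-character pieces without breaking lines:
split -C 6000 document.txt chunk_

# Each piece should land near the 6K-character budget
wc -c chunk_*
```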
Pull a quantized build of the model:

```bash
# Q4 quantized - good balance of quality and VRAM usage
ollama pull llama3:70b-q4_K_M

# Or Q5 for slightly better quality (needs more VRAM)
ollama pull llama3:70b-q5_K_M
```

Update your `.env` or use the Settings modal:
```env
LLM_PROVIDER=local
OLLAMA_BASE_URL=http://localhost:11434
```

In the Settings modal, set Ollama Model to:

```
llama3:70b-q4_K_M
```
Start the Ollama server:

```bash
ollama serve
```
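Once the server is up, a quick call to Ollama's `/api/generate` endpoint confirms the model loads and responds (swap in whichever tag you pulled):

```bash
curl http://localhost:11434/api/generate -d '{
  "model": "llama3:70b-q4_K_M",
  "prompt": "Reply with one short sentence.",
  "stream": false
}'
```

Expect the first request to be slow while the model is loaded into VRAM.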
Estimated VRAM by quantization:

| Model | Quantization | Est. VRAM | 5090 (32GB) |
|---|---|---|---|
| 70B | Q4_K_M | ~40GB | ❌ Exceeds 32GB (CPU offload) |
| 70B | Q4_K_S | ~35GB | ❌ Exceeds 32GB (CPU offload) |
| 70B | Q3_K_M | ~30GB | ✅ Should fit |
| 34B | Q4_K_M | ~20GB | ✅ Comfortable |
Note: If 70B doesn't fully fit, Ollama will automatically offload some layers to CPU (slower but works).
- Close other GPU apps before running to maximize available VRAM
- Use Q3 quantization if Q4 causes OOM errors
- Consider 34B models (like `codellama:34b`) for faster inference with good quality
- Lower the context length if needed by editing the Modelfile (see the sketch after this list): `PARAMETER num_ctx 4096`
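A minimal sketch of that Modelfile tweak (the `llama3-70b-4k` tag is just an example name):

```bash
# Build a variant with a smaller context window to shrink the KV cache
cat > Modelfile <<'EOF'
FROM llama3:70b-q4_K_M
PARAMETER num_ctx 4096
EOF

ollama create llama3-70b-4k -f Modelfile
# then point the Settings modal at llama3-70b-4k
```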
If 70B is too slow for your workflow:
| Model | Size | Quality | Speed |
|---|---|---|---|
| `llama3.1:8b` | 8B | Good | ⚡ Fast |
| `mistral:7b` | 7B | Good | ⚡ Fast |
| `llama3:70b-q4` | 70B | Excellent | 🐢 Slower |
| `qwen2.5:32b` | 32B | Very Good | ⚡ Good balance |
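Switching is the same pull-and-point flow as before (`qwen2.5:32b` shown as an example):

```bash
ollama pull qwen2.5:32b
# then set "Ollama Model" in the Settings modal to qwen2.5:32b
```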
TL;DR: Ollama does the GPU optimization. ConvertIt's optimizations reduce token count/prompt size, which indirectly helps but isn't GPU-specific.