BumblebeeQuantized

4-bit quantized LLM inference with LoRA adapters for Apple Silicon.

Run 8B-parameter models in ~5 GB of RAM with full fine-tuning support.

Features

  • 4-bit Quantized Inference - Run quantized models using MLX's fused Metal kernels
  • Runtime LoRA Adapters - Load and apply fine-tuned adapters at inference time
  • Training Integration - Train your own LoRA adapters via mlx_lm
  • Apple Silicon Optimized - Uses unified memory for zero-copy GPU access

Requirements

  • Apple Silicon Mac (M1/M2/M3/M4)
  • Elixir 1.15+
  • Python 3.10+ with mlx_lm (for training only)

Installation

def deps do
  [
    {:bumblebee_quantized, "~> 0.1.0"},
    # REQUIRED: EMLX with quantization ops (not on Hex yet)
    {:emlx, github: "notactuallytreyanastasio/emlx", branch: "feat/quantization-ops"}
  ]
end

Note: The EMLX quantization ops are pending upstream merge (PR #95). Once merged, you'll only need {:bumblebee_quantized, "~> 0.1.0"}.
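With the fork in place, you will likely want EMLX as the default Nx backend so tensors live in unified memory (a sketch; whether this library sets the backend for you is an assumption, though `EMLX.Backend` is EMLX's standard backend module):

```elixir
# config/config.exs
import Config

# Route all Nx tensor operations through EMLX's Metal-backed backend.
config :nx, default_backend: EMLX.Backend
```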

Quick Start

# Load a quantized model
{:ok, model} = BumblebeeQuantized.load_model(
  "/path/to/Qwen3-8B-MLX-4bit"
)

# Load a LoRA adapter (optional)
{:ok, adapter} = BumblebeeQuantized.load_adapter("/path/to/adapter")

# Load tokenizer via Bumblebee
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "Qwen/Qwen3-8B"})

# Create a serving and generate text
serving = BumblebeeQuantized.Serving.new(model, tokenizer,
  adapter: adapter,
  max_new_tokens: 100,
  temperature: 0.8
)

Nx.Serving.run(serving, "Write a post about Elixir")
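The serving can also run as a supervised process, which is how you would share one loaded model across an application. This uses standard Nx.Serving supervision (the process name is a placeholder, and batching behavior with this particular serving is an assumption):

```elixir
# In your application's supervision tree:
children = [
  {Nx.Serving, serving: serving, name: BumblebeeQuantized.Demo}
]

# Later, from any process in the app:
Nx.Serving.batched_run(BumblebeeQuantized.Demo, "Write a post about Elixir")
```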

Full Training Workflow

# 1. Prepare training data
posts = ["First post...", "Second post..."] # plus the rest of your corpus

BumblebeeQuantized.Training.prepare_data(posts, "/path/to/data",
  prompt: "Write a post in my style",
  min_length: 160
)

# 2. Train adapter (calls Python mlx_lm)
{:ok, adapter_path} = BumblebeeQuantized.Training.train(
  base_model: "lmstudio-community/Qwen3-8B-MLX-4bit",
  training_data: "/path/to/data",
  output_path: "/path/to/adapter",
  iterations: 25_000,
  rank: 8,
  scale: 20.0
)

# 3. Load and use
{:ok, model} = BumblebeeQuantized.load_model("/path/to/Qwen3-8B-4bit")
{:ok, adapter} = BumblebeeQuantized.load_adapter(adapter_path)
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "Qwen/Qwen3-8B"})

serving = BumblebeeQuantized.Serving.new(model, tokenizer, adapter: adapter)
Nx.Serving.run(serving, "Write a post")
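Step 2 shells out to Python's mlx_lm, whose LoRA trainer consumes JSONL files (one JSON object per line, e.g. train.jsonl/valid.jsonl). Presumably prepare_data writes records of roughly this shape; the exact keys and prompt formatting this library emits are assumptions:

```jsonl
{"text": "Write a post in my style\n\nFirst post..."}
{"text": "Write a post in my style\n\nSecond post..."}
```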

Performance

Tested on Apple Silicon:

| Metric | Value |
| --- | --- |
| Model | Qwen3-8B-4bit |
| Memory usage | ~5 GB |
| Model load time | 4-6 seconds |
| Single-token latency | ~7 ms (135 tok/s) |
| Generation throughput | ~21 tok/s |

Modules

| Module | Description |
| --- | --- |
| BumblebeeQuantized.Loader | Load quantized models from safetensors |
| BumblebeeQuantized.Adapters | Load, apply, and train LoRA adapters |
| BumblebeeQuantized.Serving | Nx.Serving for text generation |
| BumblebeeQuantized.Training | LoRA training workflow |
| BumblebeeQuantized.Models.Qwen3 | Qwen3 quantized model definition |

Supported Models

Currently supported:

  • Qwen3 (8B, other sizes should work)

Planned:

  • LLaMA 2/3
  • Mistral

How It Works

  1. Quantized Weights: Models are stored in MLX 4-bit format with weight triplets (packed uint32, scales, biases)

  2. EMLX Backend: Uses our EMLX fork, which adds a quantized_matmul NIF

  3. Runtime LoRA: Adapters are applied at inference time: output = base_output + scale * (x @ A @ B)

  4. Bumblebee Tokenizer: Uses Bumblebee's tokenizer for text encoding/decoding
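The weight-triplet arithmetic in step 1 can be sketched in plain Elixir. This is an illustration, not this library's code: the nibble order (least-significant first) and the `scale * q + bias` dequantization formula are assumptions about MLX's 4-bit packing, and the module name is made up:

```elixir
defmodule Dequant4Bit do
  import Bitwise

  # Unpack eight 4-bit quantized values from one packed uint32,
  # least-significant nibble first (assumed layout).
  def unpack(packed) when is_integer(packed) do
    for i <- 0..7, do: packed >>> (4 * i) &&& 0xF
  end

  # Dequantize a packed word using its group's scale and bias:
  # value = scale * q + bias (assumed affine dequantization).
  def dequantize(packed, scale, bias) do
    packed |> unpack() |> Enum.map(fn q -> scale * q + bias end)
  end
end

# 0x21 packs the quantized values [1, 2, 0, 0, 0, 0, 0, 0]
IO.inspect(Dequant4Bit.unpack(0x21))
# With scale 0.5 and bias -1.0 the first two become -0.5 and 0.0
IO.inspect(Dequant4Bit.dequantize(0x21, 0.5, -1.0))
```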

License

MIT
