BumblebeeQuantized

4-bit quantized LLM inference with LoRA adapters for Apple Silicon.

Run 8B-parameter models in ~5 GB of RAM with full fine-tuning support.

Features

  • 4-bit Quantized Inference - Run quantized models using MLX's fused Metal kernels
  • Runtime LoRA Adapters - Load and apply fine-tuned adapters at inference time
  • Training Integration - Train your own LoRA adapters via mlx_lm
  • Apple Silicon Optimized - Uses unified memory for zero-copy GPU access

Requirements

  • Apple Silicon Mac (M1/M2/M3/M4)
  • Elixir 1.15+
  • Python 3.10+ with mlx_lm (for training only)

Installation

def deps do
  [
    {:bumblebee_quantized, "~> 0.1.0"},
    # REQUIRED: EMLX with quantization ops (not on Hex yet)
    {:emlx, github: "notactuallytreyanastasio/emlx", branch: "feat/quantization-ops"}
  ]
end

Note: The EMLX quantization ops are pending upstream merge (PR #95). Once merged, you'll only need {:bumblebee_quantized, "~> 0.1.0"}.
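With the fork in place, you will likely want EMLX as the default Nx backend so tensors live in unified memory (a sketch; whether this library sets the backend for you is an assumption, though `EMLX.Backend` is EMLX's standard backend module):

```elixir
# config/config.exs
import Config

# Route all Nx tensor operations through EMLX's Metal-backed backend.
config :nx, default_backend: EMLX.Backend
```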

Quick Start

# Load a quantized model
{:ok, model} = BumblebeeQuantized.load_model(
  "/path/to/Qwen3-8B-MLX-4bit"
)

# Load a LoRA adapter (optional)
{:ok, adapter} = BumblebeeQuantized.load_adapter("/path/to/adapter")

# Load tokenizer via Bumblebee
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "Qwen/Qwen3-8B"})

# Create a serving and generate text
serving = BumblebeeQuantized.Serving.new(model, tokenizer,
  adapter: adapter,
  max_new_tokens: 100,
  temperature: 0.8
)

Nx.Serving.run(serving, "Write a post about Elixir")
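The serving can also run as a supervised process, which is how you would share one loaded model across an application. This uses standard Nx.Serving supervision (the process name is a placeholder, and batching behavior with this particular serving is an assumption):

```elixir
# In your application's supervision tree:
children = [
  {Nx.Serving, serving: serving, name: BumblebeeQuantized.Demo}
]

# Later, from any process in the app:
Nx.Serving.batched_run(BumblebeeQuantized.Demo, "Write a post about Elixir")
```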

Full Training Workflow

# 1. Prepare training data
posts = ["First post...", "Second post..."] # plus the rest of your corpus

BumblebeeQuantized.Training.prepare_data(posts, "/path/to/data",
  prompt: "Write a post in my style",
  min_length: 160
)

# 2. Train adapter (calls Python mlx_lm)
{:ok, adapter_path} = BumblebeeQuantized.Training.train(
  base_model: "lmstudio-community/Qwen3-8B-MLX-4bit",
  training_data: "/path/to/data",
  output_path: "/path/to/adapter",
  iterations: 25_000,
  rank: 8,
  scale: 20.0
)

# 3. Load and use
{:ok, model} = BumblebeeQuantized.load_model("/path/to/Qwen3-8B-4bit")
{:ok, adapter} = BumblebeeQuantized.load_adapter(adapter_path)
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "Qwen/Qwen3-8B"})

serving = BumblebeeQuantized.Serving.new(model, tokenizer, adapter: adapter)
Nx.Serving.run(serving, "Write a post")
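Step 2 shells out to Python's mlx_lm, whose LoRA trainer consumes JSONL files (one JSON object per line, e.g. train.jsonl/valid.jsonl). Presumably prepare_data writes records of roughly this shape; the exact keys and prompt formatting this library emits are assumptions:

```jsonl
{"text": "Write a post in my style\n\nFirst post..."}
{"text": "Write a post in my style\n\nSecond post..."}
```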

Performance

Tested on Apple Silicon:

| Metric | Value |
| --- | --- |
| Model | Qwen3-8B-4bit |
| Memory usage | ~5 GB |
| Model load time | 4-6 seconds |
| Single-token latency | ~7 ms (135 tok/s) |
| Generation throughput | ~21 tok/s |

Modules

| Module | Description |
| --- | --- |
| BumblebeeQuantized.Loader | Load quantized models from safetensors |
| BumblebeeQuantized.Adapters | Load, apply, and train LoRA adapters |
| BumblebeeQuantized.Serving | Nx.Serving for text generation |
| BumblebeeQuantized.Training | LoRA training workflow |
| BumblebeeQuantized.Models.Qwen3 | Qwen3 quantized model definition |

Supported Models

Currently supported:

  • Qwen3 (8B, other sizes should work)

Planned:

  • LLaMA 2/3
  • Mistral

How It Works

  1. Quantized Weights: Models are stored in MLX 4-bit format with weight triplets (packed uint32, scales, biases)

  2. EMLX Backend: Uses our EMLX fork, which adds a quantized_matmul NIF

  3. Runtime LoRA: Adapters are applied at inference time: output = base_output + scale * (x @ A @ B)

  4. Bumblebee Tokenizer: Uses Bumblebee's tokenizer for text encoding/decoding
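The weight-triplet arithmetic in step 1 can be sketched in plain Elixir. This is an illustration, not this library's code: the nibble order (least-significant first) and the `scale * q + bias` dequantization formula are assumptions about MLX's 4-bit packing, and the module name is made up:

```elixir
defmodule Dequant4Bit do
  import Bitwise

  # Unpack eight 4-bit quantized values from one packed uint32,
  # least-significant nibble first (assumed layout).
  def unpack(packed) when is_integer(packed) do
    for i <- 0..7, do: packed >>> (4 * i) &&& 0xF
  end

  # Dequantize a packed word using its group's scale and bias:
  # value = scale * q + bias (assumed affine dequantization).
  def dequantize(packed, scale, bias) do
    packed |> unpack() |> Enum.map(fn q -> scale * q + bias end)
  end
end

# 0x21 packs the quantized values [1, 2, 0, 0, 0, 0, 0, 0]
IO.inspect(Dequant4Bit.unpack(0x21))
# With scale 0.5 and bias -1.0 the first two become -0.5 and 0.0
IO.inspect(Dequant4Bit.dequantize(0x21, 0.5, -1.0))
```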

License

MIT
