
Support for Zyphra/ZAYA1-base #1261

Open

kyr0 wants to merge 2 commits into ml-explore:main from kyr0:feat/zaya-support

Conversation


kyr0 commented May 9, 2026

I found this novel architecture quite interesting, so I clean-room implemented support for it after reading and debugging the vLLM patch pushed two days ago.

This implementation works well at full precision, 8-bit quantization, and 4-bit quantization on a MacBook Air M4 (24 GB).

Stats:

  • 8-bit quant: ~20 t/s generation @ 9.1 GB VRAM (weights) + KV cache
  • 4-bit quant: ~30 t/s generation @ 4.99 GB VRAM (weights) + KV cache
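As a rough sanity check, those footprints line up with an ~8B-parameter model once per-group quantization overhead is counted. This sketch assumes one fp16 scale and one fp16 bias per group of 64 (a common affine-quantization layout; the exact storage in MLX may differ slightly):

```python
def bytes_per_weight(bits, group_size=64):
    """Effective bytes stored per weight, including an assumed
    fp16 scale + fp16 bias (4 bytes) per group."""
    return bits / 8 + 4 / group_size

def implied_params(gb, bits, group_size=64):
    """Parameter count implied by a reported weight footprint in GB."""
    return gb * 1e9 / bytes_per_weight(bits, group_size)

# Both reported footprints land near the model's ~8B parameter count.
for gb, bits in [(9.1, 8), (4.99, 4)]:
    print(f"{bits}-bit, {gb} GB -> ~{implied_params(gb, bits) / 1e9:.1f}B params")
```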

I also tested manually with mlx_lm.benchmark, mlx_lm.convert, and mlx_lm.server.

Novel features added:

  • Compressed Convolutional Attention (CCA)
  • Residual Scaling
  • Odd/Even Layers and Router1 technique
  • Quantized expert layer switching (SwiGLU)
    ...as described in the technical report.
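For context on the last item: SwiGLU is the standard gated feed-forward used in the expert MLPs. A minimal plain-Python sketch of the activation (toy matvec for illustration; not the MLX implementation, and the gate/up/down names follow common convention rather than this PR's code):

```python
import math

def silu(v):
    # SiLU / swish: v * sigmoid(v)
    return v / (1.0 + math.exp(-v))

def swiglu_mlp(x, w_gate, w_up, w_down):
    """Standard SwiGLU feed-forward: down( silu(gate(x)) * up(x) )."""
    def matvec(w, v):
        return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)
```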

The converted/quantized models are available on my HF account: https://huggingface.co/kyr0
e.g. https://huggingface.co/kyr0/zaya1-base-8b-MLX

https://huggingface.co/kyr0/zaya1-base-8b-8bit-MLX
https://huggingface.co/kyr0/zaya1-base-8b-4bit-MLX

Enjoy!

Transparency: I'm not affiliated with Zyphra.


Repro for convert/quants:

mlx_lm.convert \
  --hf-path Zyphra/ZAYA1-8B \
  --mlx-path "./zaya1-base-8b-MLX" \
  --dtype bfloat16

Quantization

Tested with 8 and 4 bits at group size 64. Lower bit-widths produce garbage output.
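For intuition on why lower bit-widths fall apart, here is a toy sketch of group-wise affine quantization in the spirit of --q-bits / --q-group-size (an illustration of the idea, not MLX's actual kernel). Each halving of the bit count roughly doubles the worst-case round-trip error:

```python
import math

def quantize_group(vals, bits):
    """Affine-quantize one group to `bits` and dequantize back.
    Each group gets its own scale and zero-point (sketch only)."""
    lo, hi = min(vals), max(vals)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = [round((v - lo) / scale) for v in vals]
    return [lo + qi * scale for qi in q]

def max_error(vals, bits, group_size=64):
    """Worst absolute round-trip error over all groups."""
    err = 0.0
    for i in range(0, len(vals), group_size):
        group = vals[i:i + group_size]
        deq = quantize_group(group, bits)
        err = max(err, max(abs(a - b) for a, b in zip(group, deq)))
    return err

weights = [math.sin(0.1 * i) for i in range(256)]  # toy weight tensor
```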

A quick test with AWQ quantization support led to OOM in all cases, and dynamic quant did too. I'm too GPU-poor on my Mac, guys... Does anyone have a Mac Pro?

mlx_lm.convert \
    --hf-path Zyphra/ZAYA1-8B \
    --mlx-path "./zaya1-base-8b-8bit-MLX" \
    --dtype bfloat16 \
    -q --q-bits 8 --q-group-size 64

Server + Test

mlx_lm.server \
  --model "./zaya1-base-8b-8bit-MLX" \
  --host 127.0.0.1 \
  --port 8080 \
  --temp 0.0 \
  --top-p 1.0 \
  --max-tokens 8192 \
  --prefill-step-size 512 \
  --prompt-cache-size 0

curl http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "./zaya1-base-8b-8bit-MLX",
    "messages": [{"role": "user", "content": "Solve x+2=7. Answer only."}],
    "temperature": 0,
    "max_tokens": 1024
  }'
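The server speaks an OpenAI-compatible chat completions schema, so the curl response above can be parsed generically. A small sketch, assuming the standard choices/message shape (the sample payload here is canned for illustration, not real server output):

```python
import json

def extract_answer(response_text):
    """Pull the assistant message out of an OpenAI-style chat completion."""
    payload = json.loads(response_text)
    return payload["choices"][0]["message"]["content"]

# Canned example of the expected response shape (fields abbreviated):
sample = json.dumps({
    "choices": [{"message": {"role": "assistant", "content": "x = 5"}}]
})
print(extract_answer(sample))  # -> x = 5
```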

kyr0 added 2 commits May 9, 2026 03:38
…ttention (CCA) and Residual Scaling as well as quantized expert layer switching in SwiGLU scenarios, and general convert/quantization support for ZAYA.
…cing an empty self.cache list; so we correctly return True in this case generically