
Support for Zyphra/ZAYA1-base #1261

Open

kyr0 wants to merge 2 commits into ml-explore:main from kyr0:feat/zaya-support

Conversation


kyr0 commented May 9, 2026

I found this novel architecture quite interesting, so I clean-room implemented support for it after reading and debugging the vLLM patch pushed two days ago.

This implementation works well at full precision, 8-bit quantization, and 4-bit quantization on a MacBook Air M4 (24 GB).

Stats:

  • 8-bit quant: ~20 t/s generation @ 9.1 GB VRAM (weights) + KV cache
  • 4-bit quant: ~30 t/s generation @ 4.99 GB VRAM (weights) + KV cache
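As a rough sanity check, those footprints line up with an ~8B-parameter model once per-group quantization overhead is counted. This sketch assumes one fp16 scale and one fp16 bias per group of 64 (a common affine-quantization layout; the exact storage in MLX may differ slightly):

```python
def bytes_per_weight(bits, group_size=64):
    """Effective bytes stored per weight, including an assumed
    fp16 scale + fp16 bias (4 bytes) per group."""
    return bits / 8 + 4 / group_size

def implied_params(gb, bits, group_size=64):
    """Parameter count implied by a reported weight footprint in GB."""
    return gb * 1e9 / bytes_per_weight(bits, group_size)

# Both reported footprints land near the model's ~8B parameter count.
for gb, bits in [(9.1, 8), (4.99, 4)]:
    print(f"{bits}-bit, {gb} GB -> ~{implied_params(gb, bits) / 1e9:.1f}B params")
```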

I also tested manually with mlx_lm.benchmark, mlx_lm.convert, and mlx_lm.server.

Novel features added:

  • Compressed Convolutional Attention (CCA)
  • Residual Scaling
  • Odd/Even Layers and Router1 technique
  • Quantized expert layer switching (SwiGLU)
    ...as described in the technical report.
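For context on the last item: SwiGLU is the standard gated feed-forward used in the expert MLPs. A minimal plain-Python sketch of the activation (toy matvec for illustration; not the MLX implementation, and the gate/up/down names follow common convention rather than this PR's code):

```python
import math

def silu(v):
    # SiLU / swish: v * sigmoid(v)
    return v / (1.0 + math.exp(-v))

def swiglu_mlp(x, w_gate, w_up, w_down):
    """Standard SwiGLU feed-forward: down( silu(gate(x)) * up(x) )."""
    def matvec(w, v):
        return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)
```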

The converted/quantized models are available on my HF account: https://huggingface.co/kyr0
e.g. https://huggingface.co/kyr0/zaya1-base-8b-MLX

https://huggingface.co/kyr0/zaya1-base-8b-8bit-MLX
https://huggingface.co/kyr0/zaya1-base-8b-4bit-MLX

Enjoy!

Transparency: I'm not affiliated with Zyphra.


Repro for convert/quants:

mlx_lm.convert \
  --hf-path Zyphra/ZAYA1-8B \
  --mlx-path "./zaya1-base-8b-MLX" \
  --dtype bfloat16

Quantization

Tested with 8 and 4 bits at group size 64. Lower bit-widths produce garbage output.
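For intuition on why lower bit-widths fall apart, here is a toy sketch of group-wise affine quantization in the spirit of --q-bits / --q-group-size (an illustration of the idea, not MLX's actual kernel). Each halving of the bit count roughly doubles the worst-case round-trip error:

```python
import math

def quantize_group(vals, bits):
    """Affine-quantize one group to `bits` and dequantize back.
    Each group gets its own scale and zero-point (sketch only)."""
    lo, hi = min(vals), max(vals)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = [round((v - lo) / scale) for v in vals]
    return [lo + qi * scale for qi in q]

def max_error(vals, bits, group_size=64):
    """Worst absolute round-trip error over all groups."""
    err = 0.0
    for i in range(0, len(vals), group_size):
        group = vals[i:i + group_size]
        deq = quantize_group(group, bits)
        err = max(err, max(abs(a - b) for a, b in zip(group, deq)))
    return err

weights = [math.sin(0.1 * i) for i in range(256)]  # toy weight tensor
```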

A quick test with AWQ quantization support led to OOM in all cases, and dynamic quant did too. I'm too GPU-poor on my Mac, guys... Does anyone have a Mac Pro?

mlx_lm.convert \
    --hf-path Zyphra/ZAYA1-8B \
    --mlx-path "./zaya1-base-8b-8bit-MLX" \
    --dtype bfloat16 \
    -q --q-bits 8 --q-group-size 64

Server + Test

mlx_lm.server \
  --model "./zaya1-base-8b-8bit-MLX" \
  --host 127.0.0.1 \
  --port 8080 \
  --temp 0.0 \
  --top-p 1.0 \
  --max-tokens 8192 \
  --prefill-step-size 512 \
  --prompt-cache-size 0

curl http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "./zaya1-base-8b-8bit-MLX",
    "messages": [{"role": "user", "content": "Solve x+2=7. Answer only."}],
    "temperature": 0,
    "max_tokens": 1024
  }'
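The server speaks an OpenAI-compatible chat completions schema, so the curl response above can be parsed generically. A small sketch, assuming the standard choices/message shape (the sample payload here is canned for illustration, not real server output):

```python
import json

def extract_answer(response_text):
    """Pull the assistant message out of an OpenAI-style chat completion."""
    payload = json.loads(response_text)
    return payload["choices"][0]["message"]["content"]

# Canned example of the expected response shape (fields abbreviated):
sample = json.dumps({
    "choices": [{"message": {"role": "assistant", "content": "x = 5"}}]
})
print(extract_answer(sample))  # -> x = 5
```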

kyr0 added 2 commits May 9, 2026 03:38
…ttention (CCA) and Residual Scaling as well as quantized expert layer switching in SwiGLU scenarios, and general convert/quantization support for ZAYA.
…cing an empty self.cache list; so we correctly return True in this case generically