
Add PLaMo 3 model support #1234

Open
mitmul wants to merge 12 commits into ml-explore:main from mitmul:mitmul/plamo3-support

Conversation


@mitmul mitmul commented Apr 30, 2026

Thank you to the mlx-lm maintainers for building and maintaining this excellent library. It is a pleasure to contribute support for another open model family to the project.

Summary

  • Add native plamo3 model support for conversion and generation.
  • Implement the PLaMo 3 decoder architecture: interleaved full attention and sliding-window attention, q/k RMSNorm, offset RMSNorm residual scaling, a packed qkv projection, a packed gate/up MLP projection, tied embeddings, and matching cache behavior (see the attention sketch after this list).
  • Keep tokenizer loading on the standard Hugging Face AutoTokenizer path. The official PLaMo 3 repositories include tokenization_plamo.py and auto_map metadata for Plamo3Tokenizer, so this PR does not vendor a tokenizer implementation into mlx-lm.
  • Add focused model tests.
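
For reviewers skimming the diff, here is a minimal sketch of the attention block described above, written against mlx.core/mlx.nn. The dimensions, the per-head placement of q/k RMSNorm, and the every-fourth-layer interleaving period are illustrative assumptions, not the exact PLaMo 3 configuration; the real implementation is in mlx_lm/models/plamo3.py in this PR.

```python
# Sketch of the packed-qkv attention block with per-head q/k RMSNorm.
# hidden_size / num_heads / the interleaving period are illustrative,
# not the PLaMo 3 config.
import mlx.core as mx
import mlx.nn as nn


class PackedAttention(nn.Module):
    def __init__(self, hidden_size: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        # Single packed projection: one matmul yields q, k, and v.
        self.qkv_proj = nn.Linear(hidden_size, 3 * hidden_size, bias=False)
        self.o_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        # RMSNorm applied per head to queries and keys before attention.
        self.q_norm = nn.RMSNorm(self.head_dim)
        self.k_norm = nn.RMSNorm(self.head_dim)

    def __call__(self, x: mx.array, mask=None) -> mx.array:
        B, L, _ = x.shape
        q, k, v = mx.split(self.qkv_proj(x), 3, axis=-1)
        q = q.reshape(B, L, self.num_heads, self.head_dim).transpose(0, 2, 1, 3)
        k = k.reshape(B, L, self.num_heads, self.head_dim).transpose(0, 2, 1, 3)
        v = v.reshape(B, L, self.num_heads, self.head_dim).transpose(0, 2, 1, 3)
        q, k = self.q_norm(q), self.k_norm(k)
        out = mx.fast.scaled_dot_product_attention(
            q, k, v, scale=self.head_dim**-0.5, mask=mask
        )
        return self.o_proj(out.transpose(0, 2, 1, 3).reshape(B, L, -1))


# Illustrative interleaving rule: every `period`-th layer uses full attention,
# the rest use sliding-window attention (the real period comes from the
# checkpoint's config).
def uses_full_attention(layer_idx: int, period: int = 4) -> bool:
    return (layer_idx + 1) % period == 0
```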

Tokenizer note

The PLaMo 3 tokenizer requires Hugging Face remote code, so users should pass --trust-remote-code on the CLI or set tokenizer_config={"trust_remote_code": True} when loading programmatically (a usage sketch follows). The upstream tokenizer/modeling code also has additional runtime dependencies (torch and numba) that this PR does not add to mlx-lm's core dependencies. This matches the existing PLaMo 2 behavior, where using the upstream tokenizer can require model-specific remote-code dependencies.
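
A minimal loading sketch using the mlx_lm Python API; the repository id below is a placeholder, not a published checkpoint name:

```python
# The repo id is a placeholder; substitute an actual PLaMo 3 checkpoint.
# tokenizer_config is forwarded to AutoTokenizer, which needs
# trust_remote_code=True to load the upstream Plamo3Tokenizer.
from mlx_lm import load, generate

model, tokenizer = load(
    "org/plamo-3-checkpoint",  # placeholder repo id
    tokenizer_config={"trust_remote_code": True},
)
print(generate(model, tokenizer, prompt="こんにちは", max_tokens=32))
```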

About PLaMo 3

PLaMo 3 is a next-generation LLM series developed by Preferred Networks in collaboration with NICT. The official PFN blog describes it as part of an effort to build safe, high-performance Japanese domestic LLMs using large, high-quality datasets with attention to Japanese culture and society.

The blog explains that PLaMo 3 moves away from the Samba-based architecture used in PLaMo 2 and instead combines full attention with sliding-window attention, similar in spirit to Gemma 3. This is intended to reduce inference time and KV-cache memory usage while still allowing full-attention layers to capture relationships between distant tokens. PFN reports pretraining experiments for 2B, 8B, and 31B base models, with data mixed across English, Japanese, code, and multilingual corpora, and has published PLaMo 3 NICT 2B/8B/31B Base checkpoints on Hugging Face.
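
The KV-cache saving is easy to see with back-of-the-envelope arithmetic; the dimensions below are illustrative assumptions, not the published PLaMo 3 sizes:

```python
# Per-layer KV cache: keys + values, each [tokens, num_kv_heads, head_dim].
# All dimensions here are illustrative assumptions.
bytes_per_elem = 2                  # bf16
num_kv_heads, head_dim = 8, 128
seq_len, window = 32_768, 4_096

def kv_bytes(tokens: int) -> int:
    return 2 * tokens * num_kv_heads * head_dim * bytes_per_elem

full = kv_bytes(seq_len)                   # full-attention layer: all tokens
sliding = kv_bytes(min(seq_len, window))   # sliding layer: window only
print(f"full: {full / 2**20:.0f} MiB, sliding: {sliding / 2**20:.0f} MiB")
# -> full: 128 MiB, sliding: 16 MiB (at these illustrative sizes)
```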

Reference: https://tech.preferred.jp/ja/blog/plamo_3_8b_31b/
Hugging Face:

Validation

  • python -m pytest tests/test_models.py -k plamo3 -q

@mitmul mitmul marked this pull request as ready for review May 1, 2026 00:32
@mitmul mitmul marked this pull request as draft May 1, 2026 02:42
@mitmul mitmul marked this pull request as ready for review May 1, 2026 04:12
Author

mitmul commented May 7, 2026

Hi @angeloskath, sorry for the direct ping.

This PR has been ready for review for a few days and currently has no reviewer assigned. The diff is intentionally scoped to native PLaMo 3 model support plus focused model tests:

  • adds mlx_lm/models/plamo3.py
  • adds PLaMo 3 coverage in tests/test_models.py
  • keeps tokenizer loading on the standard Hugging Face AutoTokenizer / remote-code path, so this does not vendor a tokenizer or add core dependencies

I also re-ran the focused test locally:

python -m pytest tests/test_models.py -k plamo3 -q

Result: 2 passed, 76 deselected.

Would you or another maintainer be able to take a look when you have bandwidth?

