Skip to content

fix(security): replace unsafe pickle.loads with SafeUnpickler for CVE-2026-3989#20904

Merged
Fridge003 merged 11 commits into
sgl-project:mainfrom
zwang86:fix/cve-2026-3059-3060-3989
Mar 27, 2026
Merged

fix(security): replace unsafe pickle.loads with SafeUnpickler for CVE-2026-3989#20904
Fridge003 merged 11 commits into
sgl-project:mainfrom
zwang86:fix/cve-2026-3059-3060-3989

Conversation

@zwang86

@zwang86 zwang86 commented Mar 19, 2026

Copy link
Copy Markdown
Contributor

Motivation

Three critical CVEs (CVSS 9.8) affect SGLang due to unsafe pickle.loads() / recv_pyobj() deserialization over unauthenticated ZMQ sockets, enabling Remote Code Execution (RCE):

CVE Component
CVE-2026-3059 Multimodal generation ZMQ broker
CVE-2026-3060 Encoder Parallel Disaggregation
CVE-2026-3989 replay_request_dump.py script

References: CERT/CC VU#665416, ShadowMQ (Oligo Security), #5569

This PR uses the existing SafeUnpickler (allowlist/denylist approach) as a drop-in replacement for pickle.loads() on all CVE-affected paths, blocking known RCE gadget chains (os.system, subprocess.Popen, eval, exec, etc.).

Modifications

  • python/sglang/srt/utils/common.py: Extend SafeUnpickler.ALLOWED_MODULE_PREFIXES with sglang.srt.disaggregation. and sglang.multimodal_gen. prefixes; add safe_pickle_loads() and safe_pickle_load() helper functions.
  • python/sglang/multimodal_gen/runtime/scheduler_client.py (CVE-2026-3059): Replace 4x unsafe pickle.loads / recv_pyobj with safe_pickle_loads.
  • python/sglang/multimodal_gen/runtime/managers/scheduler.py (CVE-2026-3059): Replace 1x unsafe pickle.loads with safe_pickle_loads.
  • python/sglang/srt/disaggregation/encode_receiver.py (CVE-2026-3060): Replace 2x unsafe pickle.loads with safe_pickle_loads.
  • scripts/playground/replay_request_dump.py (CVE-2026-3989): Replace 1x unsafe pickle.load with safe_pickle_load.
  • docs/developer_guide/contribution_guide.md: Add security note in code style guidance warning against unsafe pickle deserialization.

Note: pickle.dumps() / send_pyobj() calls are not changed — serialization is safe; only deserialization is the attack vector.

Accuracy Tests

N/A — This change only affects the deserialization path (restricting which classes can be loaded). No model output or computation logic is modified.

Benchmarking and Profiling

N/A — SafeUnpickler adds a negligible find_class check per deserialized object, which is not on the hot path.

Checklist

…-2026-3059/3060/3989

Replace all unauthenticated pickle deserialization on network-exposed
ZMQ paths with SafeUnpickler, which enforces an allowlist/denylist to
block RCE gadget chains (os.system, subprocess, eval, exec, etc.).

- CVE-2026-3059: multimodal ZMQ broker (scheduler_client.py, scheduler.py)
- CVE-2026-3060: encoder parallel disaggregation (encode_receiver.py)
- CVE-2026-3989: replay_request_dump.py unsafe pickle.load
@gemini-code-assist

Copy link
Copy Markdown
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@github-actions github-actions Bot added the diffusion SGLang Diffusion label Mar 19, 2026
@ZailiWang

Copy link
Copy Markdown
Contributor

Hi, thanks for fixing this which keeps SGL safe and trusted.
A small suggestion is that: shall we add a note in the contribution guide doc that pickle should be avoided in the project as it is not secure, as mentioned in the official python document.

Warn contributors to avoid unsafe pickle deserialization and point
to safe_pickle_loads/safe_pickle_load as alternatives, per Python
official documentation guidance.
@github-actions github-actions Bot added the documentation Improvements or additions to documentation label Mar 20, 2026
@zwang86

zwang86 commented Mar 20, 2026

Copy link
Copy Markdown
Contributor Author

@ZailiWang Thank you, that's a good suggestion. I added a note in contribution guide warning against unsafe pickle deserialization and instruct to use safe_pickle instead.

@ShangmingCai ShangmingCai left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx for the PR. Security is important for open-source projects. Please use pre-commit run --all-files to fix lint.

To avoid performance degradation, could you provide some numbers? It will help us to know how much this negligible impact would be.

@zwang86

zwang86 commented Mar 23, 2026

Copy link
Copy Markdown
Contributor Author

Hi @ShangmingCai , thank you for the feedback, the lint issue is fixed.
For performance, I ran a simple benchmark with pickle.loads() vs safe_pickle_loads()(100K iterations each):

Payload Size pickle.loads safe_pickle_loads Overhead
small dict ({"method": "ping"}) 31 B 189 ns 512 ns +323 ns (+170%)
medium dict (batch params + 128 tokens) 393 B 1,746 ns 2,139 ns +393 ns (+22%)
large list (1000 ints) 2,760 B 11,122 ns 11,580 ns +458 ns (+4.1%)
nested structure (50 requests) 1,987 B 9,578 ns 10,076 ns +497 ns (+5.2%)
  • The absolute overhead is a fixed ~300-500 ns per deserialization call, regardless of payload size.
  • For realistic payloads (medium/large), the relative overhead is only 4-22%.
  • The small dict shows +170% relative overhead because the base cost is already tiny (189 ns).
  • Compared to ZMQ network round-trip (~ms) and GPU inference (~ms to seconds), this overhead is completely negligible.
  • These deserialization paths are not on the inference hot path (model forward); they handle control-plane messages between scheduler/broker components.

@zwang86 zwang86 requested a review from ShangmingCai March 23, 2026 04:46
Comment thread scripts/playground/replay_request_dump.py Outdated

@ShangmingCai ShangmingCai left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Although I think the inference service is typically deployed in a data center with a private/guarded network (the security issues are not as severe as depicted in the CVE), we can merge this PR once the nit has been addressed. I agree that nano latency is negligible. If future performance regression has been reported, we can discuss together to find a better solution. Thank you so much for the fix.

@ShangmingCai

Copy link
Copy Markdown
Collaborator

/tag-and-rerun-ci

@ping1jing2

Copy link
Copy Markdown
Collaborator

/rerun-failed-ci

@kpham-sgl kpham-sgl left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zwang86 Unfortunately I think we have to use msgpack or similar unpack method to fully resolve these CVEs. Do you want to work on it together?

# 1. Receive a request from an offline client
payload = await socket.recv()
request_batch = pickle.loads(payload)
request_batch = safe_pickle_loads(payload)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SafeUnpickler is unfortunately still bypassable no matter how tight we control the allow/deny list list. It will become a game of cat and mouse :'( if we try to expand the allow/deny list.

Here is one example of a bypass from Claude getattr(__import__("os"), "system")("malicious command")

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi @kpham-sgl I'd love to help with that. The reason I use SafeUnpickler is to try the minimal effort fix, but you're right that SafeUnpickler is bypassable.
To use msgpack, we may need different approach for each compoent, some contains fields like torch.Tensor that not supported by msgpack. I can break it down to multiple PR/stages.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe for the other CVE-2026-3059/3060, we can use other unpack method to fully solve the issues. But for CVE-2026-3989, replay_request_dump.py is reading existing .pkl files, so I feel using SafeUnpickler would be the only feasible option unless we change the dump format. It's also a local playground script (not a server component), so the attack vector is very narrow — an attacker would need to place a malicious .pkl file on a developer's machine and wait for them to manually run the script.
If this makes sense, I can keep the SafeUnpickler fix for replay_request_dump.py in this PR, revert the others, and work on a msgpack-based approach for CVE-2026-3059/3060 in follow-up PRs.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this makes sense, I can keep the SafeUnpickler fix for replay_request_dump.py in this PR, revert the others, and work on a msgpack-based approach for https://github.com/advisories/GHSA-rgq9-fqf5-fv58/3060 in follow-up PRs.

I think this makes sense. Can you also add a warning to user to tell them to load from only trusted files?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also can you update the PR title and description to state that we are only resolving GHSA-hvwj-8w5g-28rg?

@kpham-sgl

Copy link
Copy Markdown
Collaborator

@zwang86 Unfortunately I think we have to use msgpack or similar unpack method to fully resolve these CVEs. Do you want to work on it together?

Another alternative is to sign with HMAC encryption (similar to TensorRT LLM). This is probably the cleanest approach

@zwang86

zwang86 commented Mar 26, 2026

Copy link
Copy Markdown
Contributor Author

@kpham-sgl yes I actually considered HMAC, it's feasible but we will need to handle key generation/distribution/rotation and update all sender/receiver, so it won't be simpler than use msgpack. The advantange is we don't need to rewrite serialization logics. I'd prefer use msgpack since it prohibits RCE from the root.
Anyway either approach is not for CVE-2026-3989(replay_request_dump.py), I feel we can rewrite the serialization with msgpack in followup PR, only update replay_request_dump.py and docs in this one.
Let me know if you think this make sense. Thank you!

@kpham-sgl

Copy link
Copy Markdown
Collaborator

@kpham-sgl yes I actually considered HMAC, it's feasible but we will need to handle key generation/distribution/rotation and update all sender/receiver, so it won't be simpler than use msgpack

Afaict TensorRT is generating the key once during server start, put it in an env var, and distribute that key (the env var) to participating nodes. I think we can do something similar for SGLang. This should also solves the RCE at root (the attackers have to somehow be able to read the HMAC key stored in the env var)

I prefer this approach over msgpack because we can later easily fix zmq method like send_pyobj and recv_pyobj (not directly relevant to the CVEs but still a security risk) like https://github.com/NVIDIA/TensorRT-LLM/blob/ec909660ff02923a7856e38543d009e5c9021613/tensorrt_llm/executor/ipc.py#L4.

Let's discuss this in slack @zwang86

@kpham-sgl kpham-sgl self-assigned this Mar 26, 2026
Revert SafeUnpickler changes for scheduler_client.py, scheduler.py,
and encode_receiver.py — these will be addressed in a follow-up PR
using msgpack-based serialization.

Keep SafeUnpickler fix for replay_request_dump.py (CVE-2026-3989).
Remove safe_pickle_loads() and unused allowlist entries from common.py.
Update contribution guide to recommend msgpack/JSON instead of SafeUnpickler.
@zwang86 zwang86 changed the title fix(security): replace unsafe pickle.loads with SafeUnpickler for CVE-2026-3059/3060/3989 fix(security): replace unsafe pickle.loads with SafeUnpickler for CVE-2026-3989 Mar 26, 2026
@kpham-sgl

Copy link
Copy Markdown
Collaborator

@zwang86 sorry whats your slack account? Can you respond to me under this slack thread 😅
https://sgl-fru7574.slack.com/archives/C08FWFM82T1/p1774501081914049

@zwang86

zwang86 commented Mar 26, 2026

Copy link
Copy Markdown
Contributor Author

Hi @kpham-sgl, sorry I think I talked to the wrong person/ account (I sent DM to Khoa Pham with radixark email). Just send you a DM.

@kpham-sgl kpham-sgl left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks @zwang86

@Fridge003 Fridge003 merged commit 5fc5c18 into sgl-project:main Mar 27, 2026
512 of 566 checks passed
@avioligo

Copy link
Copy Markdown

chtruong814 pushed a commit to chtruong814/sglang that referenced this pull request Apr 10, 2026
@ccullen-cert

Copy link
Copy Markdown
Contributor

Hello,
I would like to thank everyone here for their efforts. I am the original coordinator of this case and worked with Oligo security on the CVEs in question for this pull request. Is there a way to get someone to join our coordination platform for future remediation efforts like this prior to CVE release? We attempted contact but was not able to get a response back for a representative. Please feel free to contact us at cert@cert.org. Our platform login page is here: https://kb.cert.org/vince/.
Thank you all again for your efforts.

fzyzcjy added a commit that referenced this pull request Apr 30, 2026
The CVE-2026-3989 security fix (PR #20904) added safe_pickle_load as a
drop-in pickle.load() replacement. The "rebase main with 18074e2"
merge (bfa0923) silently dropped 137 lines from common.py during
conflict resolution, including this helper. Both upstream/main and the
local main still have it.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

diffusion SGLang Diffusion documentation Improvements or additions to documentation high priority run-ci

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants