fix(security): replace unsafe pickle.loads with SafeUnpickler for CVE-2026-3989 #20904
…-2026-3059/3060/3989
Replace all unauthenticated pickle deserialization on network-exposed ZMQ paths with SafeUnpickler, which enforces an allowlist/denylist to block RCE gadget chains (os.system, subprocess, eval, exec, etc.).
- CVE-2026-3059: multimodal ZMQ broker (scheduler_client.py, scheduler.py)
- CVE-2026-3060: encoder parallel disaggregation (encode_receiver.py)
- CVE-2026-3989: replay_request_dump.py unsafe pickle.load
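The allowlist idea described in this commit message can be sketched as follows. This is a minimal illustration in the style of the "Restricting Globals" pattern from the Python `pickle` documentation, not SGLang's actual `SafeUnpickler`: the names `ALLOWED_GLOBALS`, `AllowlistUnpickler`, and `allowlist_loads` are hypothetical, and the allowlist contents are assumptions chosen for the example.

```python
import io
import pickle

# Hypothetical allowlist for illustration only; not SGLang's real configuration.
ALLOWED_GLOBALS = {
    ("builtins", "dict"),
    ("builtins", "list"),
    ("builtins", "set"),
    ("collections", "OrderedDict"),
}


class AllowlistUnpickler(pickle.Unpickler):
    """Unpickler that refuses every global not explicitly allowlisted."""

    def find_class(self, module, name):
        # pickle calls find_class whenever the stream references a global
        # (a class or function), which is where gadget chains start.
        if (module, name) in ALLOWED_GLOBALS:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"blocked global: {module}.{name}")


def allowlist_loads(data: bytes):
    """Stand-in for pickle.loads() that routes through the allowlist."""
    return AllowlistUnpickler(io.BytesIO(data)).load()
```

Plain containers such as `{"ids": [1, 2, 3]}` load normally, while a stream that references a global outside the allowlist (for example `builtins.eval`) raises `pickle.UnpicklingError`. The protection is only as strong as the allowlist itself: any allowed callable that can reach other modules reopens the hole.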
|
Hi, thanks for fixing this which keeps SGL safe and trusted. |
Warn contributors to avoid unsafe pickle deserialization and point to safe_pickle_loads/safe_pickle_load as alternatives, per Python official documentation guidance.
|
@ZailiWang Thank you, that's a good suggestion. I added a note to the contribution guide that warns against unsafe pickle deserialization and instructs contributors to use safe_pickle instead. |
ShangmingCai
left a comment
Thx for the PR. Security is important for open-source projects. Please use pre-commit run --all-files to fix lint.
To rule out performance degradation, could you provide some numbers? That would help us see how large this "negligible" impact actually is.
|
Hi @ShangmingCai, thank you for the feedback. The lint issue is fixed.
|
ShangmingCai
left a comment
LGTM. Although I think the inference service is typically deployed in a data center with a private/guarded network (so the security issues are not as severe as depicted in the CVE), we can merge this PR once the nit has been addressed. I agree that nanosecond-level latency is negligible. If a performance regression is reported in the future, we can discuss it together to find a better solution. Thank you so much for the fix.
|
/tag-and-rerun-ci |
|
/rerun-failed-ci |
    # 1. Receive a request from an offline client
    payload = await socket.recv()
    - request_batch = pickle.loads(payload)
    + request_batch = safe_pickle_loads(payload)
SafeUnpickler is unfortunately still bypassable no matter how tightly we control the allow/deny list. It will become a game of cat and mouse :'( if we keep trying to expand the allow/deny list.
Here is one example of a bypass, from Claude: getattr(__import__("os"), "system")("malicious command")
Hi @kpham-sgl, I'd love to help with that. I used SafeUnpickler to try a minimal-effort fix, but you're right that SafeUnpickler is bypassable.
To use msgpack, we may need a different approach for each component, since some contain fields like torch.Tensor that are not supported by msgpack. I can break it down into multiple PRs/stages.
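The per-component work mentioned above usually comes down to converting unsupported fields into plain data before handing them to a data-only serializer. The sketch below illustrates that step under loud assumptions: it uses the standard library's `json` as a stand-in for msgpack, a flat list of floats as a stand-in for `torch.Tensor`, and hypothetical helper names `encode_tensor_like` and `decode_tensor_like`. It is not the approach chosen in the follow-up PRs.

```python
import base64
import json
from array import array
from math import prod


def encode_tensor_like(values, shape):
    """Turn a flat float sequence plus shape into a JSON-safe dict."""
    if len(values) != prod(shape):
        raise ValueError("number of values does not match shape")
    # array("f") stores C floats in native byte order; a real wire
    # format would pin the byte order explicitly.
    buf = array("f", values)
    return {
        "dtype": "float32",
        "shape": list(shape),
        "data": base64.b64encode(buf.tobytes()).decode("ascii"),
    }


def decode_tensor_like(obj):
    """Validate and rebuild (values, shape) from the dict form."""
    if obj.get("dtype") != "float32":
        raise ValueError("unsupported dtype")
    shape = tuple(int(dim) for dim in obj["shape"])
    buf = array("f")
    buf.frombytes(base64.b64decode(obj["data"]))
    if len(buf) != prod(shape):
        raise ValueError("payload size does not match shape")
    return list(buf), shape


# Round trip through a data-only format: no code can be executed on load.
wire = json.dumps({"rid": 7, "embedding": encode_tensor_like([1.0, 2.5, -3.0, 0.0], (2, 2))})
request = json.loads(wire)
values, shape = decode_tensor_like(request["embedding"])
```

Because the receiver only ever parses numbers, strings, lists, and dicts, a malicious sender can at worst supply bad data, which the explicit dtype and shape checks reject.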
I believe that for the other two, CVE-2026-3059/3060, we can use another unpacking method to fully solve the issues. But for CVE-2026-3989, replay_request_dump.py reads existing .pkl files, so I feel SafeUnpickler is the only feasible option unless we change the dump format. It's also a local playground script (not a server component), so the attack vector is very narrow: an attacker would need to place a malicious .pkl file on a developer's machine and wait for them to run the script manually.
If this makes sense, I can keep the SafeUnpickler fix for replay_request_dump.py in this PR, revert the others, and work on a msgpack-based approach for CVE-2026-3059/3060 in follow-up PRs.
> If this makes sense, I can keep the SafeUnpickler fix for replay_request_dump.py in this PR, revert the others, and work on a msgpack-based approach for CVE-2026-3059/3060 in follow-up PRs.
I think this makes sense. Can you also add a warning telling users to load only from trusted files?
Also can you update the PR title and description to state that we are only resolving GHSA-hvwj-8w5g-28rg?
Another alternative is to sign payloads with an HMAC (similar to TensorRT LLM). This is probably the cleanest approach |
|
@kpham-sgl yes I actually considered HMAC, it's feasible but we will need to handle key generation/distribution/rotation and update all sender/receiver, so it won't be simpler than use |
AFAICT, TensorRT generates the key once during server start, puts it in an env var, and distributes that key (the env var) to participating nodes. I think we can do something similar for SGLang. This should also solve the RCE at the root (the attacker would somehow have to read the HMAC key stored in the env var). I prefer this approach over Let's discuss this in slack @zwang86 |
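The HMAC idea discussed here can be sketched as follows. This is a minimal illustration, not how TensorRT LLM or SGLang implement it: the function names, the tag-then-payload framing, and the `EXAMPLE_HMAC_KEY` environment variable are all assumptions made for the example, and key distribution and rotation are out of scope.

```python
import hashlib
import hmac
import os
import pickle
import secrets

DIGEST_SIZE = hashlib.sha256().digest_size  # 32 bytes for SHA-256


def generate_key() -> bytes:
    """Create a random key once, e.g. at server start."""
    return secrets.token_bytes(32)


def key_from_env(name: str = "EXAMPLE_HMAC_KEY") -> bytes:
    """Read a hex-encoded key from a (hypothetical) environment variable."""
    return bytes.fromhex(os.environ[name])


def sign_and_dumps(obj, key: bytes) -> bytes:
    """Pickle obj and prepend an HMAC-SHA256 tag over the pickled bytes."""
    payload = pickle.dumps(obj)
    tag = hmac.new(key, payload, hashlib.sha256).digest()
    return tag + payload


def verify_and_loads(message: bytes, key: bytes):
    """Check the tag first; unpickle only if it matches."""
    if len(message) < DIGEST_SIZE:
        raise ValueError("message too short to contain an HMAC tag")
    tag, payload = message[:DIGEST_SIZE], message[DIGEST_SIZE:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    # compare_digest avoids leaking the match position through timing.
    if not hmac.compare_digest(tag, expected):
        raise ValueError("HMAC verification failed; refusing to unpickle")
    return pickle.loads(payload)
```

With this framing, `pickle.loads` is only reached for bytes produced by a holder of the key, so an unauthenticated peer on the ZMQ socket cannot get a payload deserialized. It authenticates the sender but does not encrypt the payload, and anyone who can read the key can still forge messages.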
Revert SafeUnpickler changes for scheduler_client.py, scheduler.py, and encode_receiver.py — these will be addressed in a follow-up PR using msgpack-based serialization. Keep SafeUnpickler fix for replay_request_dump.py (CVE-2026-3989). Remove safe_pickle_loads() and unused allowlist entries from common.py. Update contribution guide to recommend msgpack/JSON instead of SafeUnpickler.
|
@zwang86 sorry, what's your Slack account? Can you respond to me under this Slack thread 😅 |
|
Hi @kpham-sgl, sorry, I think I talked to the wrong person/account (I sent a DM to Khoa Pham with a radixark email). I just sent you a DM. |
|
@ShangmingCai @kpham-sgl @ByronHsu @hnyls2002 @mickqian @yhyang201 @ping1jing2 Please see the following NEW security advisories ASAP
This is urgent. Thank you. |
…-2026-3989 (sgl-project#20904) Signed-off-by: Charlie Truong <chtruong@nvidia.com>
|
Hello, |
The CVE-2026-3989 security fix (PR #20904) added safe_pickle_load as a drop-in pickle.load() replacement. The "rebase main with 18074e2" merge (bfa0923) silently dropped 137 lines from common.py during conflict resolution, including this helper. Both upstream/main and the local main still have it.
Motivation
Three critical CVEs (CVSS 9.8) affect SGLang due to unsafe `pickle.loads()`/`recv_pyobj()` deserialization over unauthenticated ZMQ sockets, enabling Remote Code Execution (RCE):
- CVE-2026-3059: Multimodal generation ZMQ broker
- CVE-2026-3060: Encoder Parallel Disaggregation
- CVE-2026-3989: `replay_request_dump.py` script
References: CERT/CC VU#665416, ShadowMQ (Oligo Security), #5569
This PR uses the existing `SafeUnpickler` (allowlist/denylist approach) as a drop-in replacement for `pickle.loads()` on all CVE-affected paths, blocking known RCE gadget chains (`os.system`, `subprocess.Popen`, `eval`, `exec`, etc.).
Modifications
- `python/sglang/srt/utils/common.py`: Extend `SafeUnpickler.ALLOWED_MODULE_PREFIXES` with `sglang.srt.disaggregation.` and `sglang.multimodal_gen.` prefixes; add `safe_pickle_loads()` and `safe_pickle_load()` helper functions.
- `python/sglang/multimodal_gen/runtime/scheduler_client.py` (CVE-2026-3059): Replace 4x unsafe `pickle.loads`/`recv_pyobj` with `safe_pickle_loads`.
- `python/sglang/multimodal_gen/runtime/managers/scheduler.py` (CVE-2026-3059): Replace 1x unsafe `pickle.loads` with `safe_pickle_loads`.
- `python/sglang/srt/disaggregation/encode_receiver.py` (CVE-2026-3060): Replace 2x unsafe `pickle.loads` with `safe_pickle_loads`.
- `scripts/playground/replay_request_dump.py` (CVE-2026-3989): Replace 1x unsafe `pickle.load` with `safe_pickle_load`.
- `docs/developer_guide/contribution_guide.md`: Add security note in code style guidance warning against unsafe pickle deserialization.
Note: `pickle.dumps()`/`send_pyobj()` calls are not changed. Serialization is safe; only deserialization is the attack vector.
Accuracy Tests
N/A — This change only affects the deserialization path (restricting which classes can be loaded). No model output or computation logic is modified.
Benchmarking and Profiling
N/A: `SafeUnpickler` adds a negligible `find_class` check per deserialized object, which is not on the hot path.
Checklist