feat(analyzer): detect anti-refusal statements (jailbreak preamble) #65
Open
ankushchadha wants to merge 1 commit into
Add a static pattern analyzer that flags anti-refusal statements in
skills: instructions that suppress the agent's ability to refuse, hedge,
or apply safety constraints. This is a jailbreak technique distinct from
generic instruction-override (P1) -- rather than injecting a new task it
neutralizes the model's refusal behavior so later harmful requests
succeed.
Three rule families:
AR1 Refusal Suppression -- "never refuse", "always comply"
AR2 Disclaimer Suppression -- "no warnings", "do not moralize"
AR3 Safety Policy Nullification -- "you have no restrictions",
"ignore your guidelines", "do anything now"
Findings are HIGH severity, deduped per (file, line, rule), with a
code/doc-example confidence penalty to reduce false positives.
- New analyzer nodes/analyzers/static_patterns_anti_refusal.py
- New PatternCategory ANTI_REFUSAL + AR1-AR3 explanations, names,
remediations and category mapping in pattern_defaults.py
- Registered node in analyzers/__init__.py (static analyzers 12 -> 13)
- Unit tests + node test in test_static_patterns_anti_refusal.py
- Documented known limitations as xfail tests (2 false positives,
4 regex-evasion gaps) tracked for a future revision; the optional
Stage-2 LLM meta-analyzer is the backstop for residual false positives
- README: new Anti-Refusal pattern table; counts 64->67 / 16->17
- Updated registry test expectations and DEVELOPMENT.md counts
make test: 610 passed, 11 skipped, 6 xfailed. ruff check clean.
Signed-off-by: Ankush Chadha <ankushchadha@gmail.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Closes #64.
What this adds
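A new static pattern analyzer, static_patterns_anti_refusal, that detects anti-refusal statements in skills: instructions that suppress the agent's ability to refuse, hedge, or apply safety constraints. This is a jailbreak technique distinct from generic instruction-override (P1): rather than injecting a new task, it neutralizes the model's refusal behavior so subsequent harmful requests succeed.
Three rule families, all HIGH severity:
AR1 Refusal Suppression -- "never refuse", "always comply"
AR2 Disclaimer Suppression -- "no warnings", "do not moralize"
AR3 Safety Policy Nullification -- "you have no restrictions", "ignore your guidelines", "do anything now"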
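A minimal sketch of how the three rule families could be expressed as line-level regex checks. The patterns below are illustrative assumptions built from the example phrases quoted in the commit message; they are not the regexes shipped in static_patterns_anti_refusal.py, and the Finding class and scan() function are hypothetical names used only for this example.

```python
import re
from dataclasses import dataclass

# Illustrative patterns only (assumption): one regex per rule family,
# seeded with the example phrases from the commit message.
RULES = {
    "AR1": re.compile(r"\b(never refuse|always comply)\b", re.IGNORECASE),
    "AR2": re.compile(r"\b(no warnings|do not moralize)\b", re.IGNORECASE),
    "AR3": re.compile(
        r"\b(you have no restrictions|ignore your guidelines|do anything now)\b",
        re.IGNORECASE,
    ),
}


@dataclass(frozen=True)
class Finding:
    """Hypothetical finding record; the real analyzer's type may differ."""
    file: str
    line: int
    rule: str
    severity: str = "HIGH"  # all three families are HIGH severity


def scan(file: str, text: str) -> list[Finding]:
    """Return one finding per (line, rule) whose pattern matches that line."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for rule, pattern in RULES.items():
            if pattern.search(line):
                findings.append(Finding(file, lineno, rule))
    return findings
```

Literal phrase lists like these are exactly what the regex-evasion xfail cases exercise: spacing tricks, synonyms, or character substitutions would slip past them.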
How it fits the pipeline
Follows the existing static_patterns_* convention exactly: analyze() + node() in one module, registered in nodes/analyzers/__init__.py. No graph.py change is needed since edges are built by looping over ANALYZER_NODE_IDS. Findings are deduped per (file, line, rule) and get a confidence penalty when matched inside a code/doc example (reusing the existing is_code_example helper) to reduce false positives.
Relationship to existing rules
Additive, not a replacement. AR1 (refusal suppression) and AR2 (disclaimer suppression) are not covered by any existing rule. AR3 overlaps P1 on some safety-bypass phrasings (e.g. "bypass safety restrictions", "unrestricted mode") — for those, both P1 and AR3 fire and both are retained (dedup is per-rule, not across rules). AR3 also catches framings P1 misses, such as "ignore your guidelines" (P1 requires no possessive) and the declarative "you have no restrictions".
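The per-rule dedup behaviour described above can be sketched as follows. This is an assumption about the shape of the logic, not the analyzer's actual code: Finding and dedupe() are hypothetical names, and the only property taken from the PR is that the dedup key is (file, line, rule), so overlapping P1 and AR3 hits on the same line both survive.

```python
from typing import NamedTuple


class Finding(NamedTuple):
    """Hypothetical finding record used only for this sketch."""
    file: str
    line: int
    rule: str
    matched: str


def dedupe(findings: list[Finding]) -> list[Finding]:
    """Keep the first finding for each (file, line, rule) key, preserving order."""
    seen = set()
    kept = []
    for f in findings:
        key = (f.file, f.line, f.rule)  # rule is part of the key: no cross-rule merging
        if key not in seen:
            seen.add(key)
            kept.append(f)
    return kept


raw = [
    Finding("SKILL.md", 4, "P1", "bypass safety restrictions"),
    Finding("SKILL.md", 4, "AR3", "bypass safety restrictions"),
    Finding("SKILL.md", 4, "AR3", "unrestricted mode"),  # same key as the previous AR3 hit
]
kept = dedupe(raw)
```

Here the second AR3 match on line 4 is dropped, while the P1 and AR3 findings for the same line are both retained.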
Known limitations (documented as xfail tests)
Two classes of limitation are known: false positives (2 cases) and regex-evasion gaps (4 cases). Both classes are pinned as strict xfail tests so they're visible to reviewers and tracked for a future revision (normalization / semantic variant) rather than silently unknown.
Changes
- New analyzer nodes/analyzers/static_patterns_anti_refusal.py
- New PatternCategory.ANTI_REFUSAL + AR1–AR3 explanations, names, remediations, and category mapping in pattern_defaults.py
- Registered node in analyzers/__init__.py (static analyzers 12 → 13)
- Unit tests, node test, and xfail tests in tests/nodes/analyzers/test_static_patterns_anti_refusal.py
- Updated registry test expectations and DEVELOPMENT.md counts
Testing
make test: 610 passed, 11 skipped, 6 xfailed
make lint clean; ruff format --check reports all files already formatted
Motivation / reference
Anti-refusal instructions are an empirically demonstrated boundary-defeat mechanism: in controlled experiments across multiple models, an anti-refusal instruction in the system prompt caused an agent to abandon deployer-configured operational boundaries, and removing it eliminated the effect — A. Chadha, When LLMs Jailbreak Themselves: Reflexive Identity Bypass in Agentic Systems, Zenodo preprint (v3, 2026), https://doi.org/10.5281/zenodo.20404651 (see Corrigendum No. 2). Disclosure: I am the author of that preprint.