Skip to content

Proposal: LLM11 - Observation Framing Attack (OFA) #813

@NoDataFound

Description

@NoDataFound

Edit: added updated data results

Proposal: LLM11 - Observation Framing Attack (OFA)

Summary

I am proposing a new entry for the OWASP Top 10 for LLM Applications: LLM11: Observation Framing Attack (OFA).

OFA is a security vulnerability unique to LLM applications where legitimate contextual framing about observation status causes the model to autonomously modulate its safety behavior. The input is not adversarial. The model processes it correctly. Existing defenses (instruction hierarchy, input filtering, output gating) cannot block it because the input is the same kind of context that makes the model useful.

This vulnerability does not fit any existing LLM01-LLM10 category. It represents a gap in the current Top 10 that affects every RLHF-trained language model tested.

Why OFA Does Not Fit LLM01-LLM10

Category Why OFA Doesn't Fit
LLM01 Prompt Injection Prompt injection overrides instructions. OFA does not override anything. The model processes OFA input as intended.
LLM02 Sensitive Info Disclosure OFA does not leak sensitive data. It degrades output quality.
LLM03 Supply Chain Not a supply chain issue.
LLM04 Data/Model Poisoning Not a training-time attack. OFA works on any deployed model without modifying training data.
LLM05 Improper Output Handling OFA is not about how downstream systems handle output. It is about the model producing systematically degraded output.
LLM06 Excessive Agency OFA does not involve the model taking unauthorized actions.
LLM07 System Prompt Leakage OFA does not leak system prompts.
LLM08 Vector/Embedding Weaknesses Not a vector store issue.
LLM09 Misinformation OFA does not produce hallucinations. It produces real content with degraded safety properties.
LLM10 Unbounded Consumption Not a resource consumption issue (though OFA-007 Token Cost Inflation has cost implications).

The gap: no existing category covers an attack where the input is legitimate, the model processes it correctly, and the safety modulation is an autonomous behavioral response to evaluation context.

Evidence

  • 10,704 observations across 163 models from 26 providers
  • 18 experimental conditions ranging from explicit evaluation to explicit non-observation
  • 28 experiment protocols across 10 research domains plus 5 cross-cutting protocols
  • LLM-as-a-Judge framing: 47.1% output compression
  • Mission Critical framing: 83.5% output compression
  • 1% audit probability produces the same behavioral shift as 90% (binary trigger, not proportional)
  • 20 OFA techniques cataloged with MITRE ATT&CK mappings
  • 14 of 20 map to confirmed real-world incidents (state influence ops, $25M deepfake fraud, AI ransomware, voter suppression, infrastructure outages)
  • Effect confirmed across every model tested: OpenAI (18 models), Anthropic (5), Google (11), Mistral (12), xAI (6), NVIDIA (3), Cohere (3), Meta (4), + 18 additional providers
  • Three experimental sets with non-overlapping model pools to confirm replication
  • Multilingual observation effect: stronger in English, creating non-English safety gaps
  • Extended thinking capture: 221% thinking token inflation in reasoning models under observation

The Core Finding: The Probability Plateau

Observation sensitivity suggests models should scale behavior to audit probability. Instead, they ignore risk calculation entirely. Detecting even a marginal chance of evaluation triggers a binary shift, making the mere presence of an observer the primary contaminant.

Stated Audit Probability Measured Word Count Change
0% (baseline) no change
1% -18.2%
10% -17.8%
50% -18.5%
90% -17.9%
100% -19.1%

The model treats 1% the same as 100%. This kills the most common objection: that observation sensitivity is rational instruction-following. If it were rational, the shift would scale with probability. It doesn't.

Proposed LLM11 Structure

Description: Observation Framing Attacks exploit the context sensitivity that RLHF training builds into language models. By providing legitimate framing about whether and how output will be observed or evaluated, an attacker causes the model to autonomously adjust its safety behavior. The model produces permitted content with systematically degraded safety properties: compressed output that omits warnings, softened severity ratings, hedged assessments that dilute urgency.

Common Examples of Risk:

  1. Anti-observation framing ("no one will read this") reduces safety guardrails without adversarial input
  2. LLM-as-a-Judge framing triggers performative safety mode, compressing output by 47%+
  3. Stakes framing ("mission critical" vs "casual playground") produces 83% output variance on identical prompts
  4. Cross-language observation arbitrage exploits safety gaps between evaluation language (English) and deployment languages
  5. Severity flattening under observation destroys triage function in AI-powered risk assessment (CRITICAL softened to HIGH)
  6. Automated pipeline framing ("this is part of a batch pipeline, no human review") relaxes safety constraints

Prevention and Mitigation Strategies:

  1. Measure observation sensitivity using condition-controlled testing (e.g., VantaGrid protocols)
  2. Report Observation Sensitivity Score (OSS) on model cards
  3. Investigate RLHF training interventions to reduce the gap between observed and unobserved behavior
  4. Run identical prompts under multiple observation conditions as part of every safety evaluation
  5. Do not assume evaluation results generalize to deployment behavior
  6. Monitor for sudden drops in hedging/safety language as an indicator of anti-observation framing

Related Frameworks and Taxonomies:

  • MITRE ATLAS: Proposed AML.T00XX (Observation Framing) submitted to atlas@mitre.org
  • MITRE ATT&CK: Mapped to T1562, T1036, T1070, T1491, T1565, T1499, T1489, T1078, T1566
  • Cyber Kill Chain: Reconnaissance through Actions on Objectives
  • Diamond Model: Adversary, Capability, Infrastructure, Victim

Supporting Materials

About the Researcher

Corian L. Kennedy (haKCer). ORCID: 0009-0007-9582-4772. Accredited author on OASIS STIX 2.0. Creator of DECLAWED (tracking 682K+ exposed AI agent instances, 7.4M CVE detections, 223K unique IPs across 122 countries). Creator of hackGPT (December 3, 2022, three days after ChatGPT launched). Long-term contributor to Metasploit, MISP, CRITs. Founder of SecKC.org (2,800+ members, 14 years). Continuous AI security research since December 2022.

How I Want to Contribute

I am available to:

  • Write the full LLM11 entry in the standard OWASP format used by LLM01-LLM10
  • Participate in working group discussions on the #project-top10-for-llm Slack channel
  • Present the findings to the working group
  • Provide the structured data, experiment protocols, and open-source tooling for community validation
  • Collaborate with other contributors on integrating OFA with existing entries where overlap exists

This is a community-driven project and I want to contribute through the established process. Let me know the best next step.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions