Proposal: LLM11 - Observation Framing Attack (OFA)

Edit: added updated data results 
# Proposal: LLM11 - Observation Framing Attack (OFA)

## Summary

I am proposing a new entry for the OWASP Top 10 for LLM Applications: **LLM11: Observation Framing Attack (OFA)**.

OFA is a security vulnerability unique to LLM applications where legitimate contextual framing about observation status causes the model to autonomously modulate its safety behavior. The input is not adversarial. The model processes it correctly. Existing defenses (instruction hierarchy, input filtering, output gating) cannot block it because the input is the same kind of context that makes the model useful.

This vulnerability does not fit any existing LLM01-LLM10 category. It represents a gap in the current Top 10 that affects every RLHF-trained language model tested.

## Why OFA Does Not Fit LLM01-LLM10

| Category | Why OFA Doesn't Fit |
|---|---|
| LLM01 Prompt Injection | Prompt injection overrides instructions. OFA does not override anything. The model processes OFA input as intended. |
| LLM02 Sensitive Info Disclosure | OFA does not leak sensitive data. It degrades output quality. |
| LLM03 Supply Chain | Not a supply chain issue. |
| LLM04 Data/Model Poisoning | Not a training-time attack. OFA works on any deployed model without modifying training data. |
| LLM05 Improper Output Handling | OFA is not about how downstream systems handle output. It is about the model producing systematically degraded output. |
| LLM06 Excessive Agency | OFA does not involve the model taking unauthorized actions. |
| LLM07 System Prompt Leakage | OFA does not leak system prompts. |
| LLM08 Vector/Embedding Weaknesses | Not a vector store issue. |
| LLM09 Misinformation | OFA does not produce hallucinations. It produces real content with degraded safety properties. |
| LLM10 Unbounded Consumption | Not a resource consumption issue (though OFA-007 Token Cost Inflation has cost implications). |

The gap: no existing category covers an attack where **the input is legitimate, the model processes it correctly, and the safety modulation is an autonomous behavioral response to evaluation context**.

## Evidence

- **10,704 observations** across **163 models** from **26 providers**
- **18 experimental conditions** ranging from explicit evaluation to explicit non-observation
- **28 experiment protocols** across **10 research domains** plus 5 cross-cutting protocols
- LLM-as-a-Judge framing: **47.1% output compression**
- Mission Critical framing: **83.5% output compression**
- **1% audit probability produces roughly the same behavioral shift as 90%** (-18.2% vs. -17.9% word count; a binary trigger, not a proportional one)
- **20 OFA techniques** cataloged with MITRE ATT&CK mappings
- **14 of 20 map to confirmed real-world incidents** (state influence ops, $25M deepfake fraud, AI ransomware, voter suppression, infrastructure outages)
- Effect confirmed across **every model tested**: OpenAI (18 models), Anthropic (5), Google (11), Mistral (12), xAI (6), NVIDIA (3), Cohere (3), Meta (4), plus 18 additional providers
- Three experimental sets with non-overlapping model pools to confirm replication
- Multilingual observation effect: stronger in English, creating non-English safety gaps
- Extended thinking capture: 221% thinking token inflation in reasoning models under observation

## The Core Finding: The Probability Plateau

If observation sensitivity were a calculated response to risk, models would scale their behavior to the stated audit probability. They do not. Any nonzero chance of evaluation triggers a roughly binary shift, so the mere possibility of an observer, not its likelihood, is the primary contaminant.

| Stated Audit Probability | Measured Word Count Change |
|---|---|
| 0% (baseline) | no change |
| 1% | -18.2% |
| 10% | -17.8% |
| 50% | -18.5% |
| 90% | -17.9% |
| 100% | -19.1% |

The model treats 1% essentially the same as 100% (-18.2% versus -19.1%). This kills the most common objection: that observation sensitivity is rational instruction-following. If it were rational, the shift would scale with probability. It doesn't.
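As a rough illustration of the plateau argument, the sketch below compares the table values against a hypothetical "proportional" model in which the effect scales linearly with the stated audit probability. The proportional model and the helper names (`proportional_prediction`, `least_squares_slope`) are assumptions made for illustration; only the `observed` values come from the table above.

```python
# Stated audit probability (%) -> measured word count change (%), copied from the table.
observed = {1: -18.2, 10: -17.8, 50: -18.5, 90: -17.9, 100: -19.1}


def proportional_prediction(points, full_prob=100):
    """Hypothetical 'rational' model: effect scales linearly with audit probability."""
    full_effect = points[full_prob]
    return {p: full_effect * p / full_prob for p in points}


def least_squares_slope(points):
    """Ordinary least-squares slope of word count change against probability."""
    xs = list(points.keys())
    ys = list(points.values())
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


predicted = proportional_prediction(observed)
# Spread of the measured shift across all nonzero probabilities.
spread = max(observed.values()) - min(observed.values())
slope = least_squares_slope(observed)

for p in sorted(observed):
    print(f"{p:>3}%  observed {observed[p]:6.1f}%  proportional model {predicted[p]:6.2f}%")
print(f"spread across nonzero probabilities: {spread:.1f} points")
print(f"fitted slope: {slope:.4f} points per percentage point of audit probability")
```

Under the proportional assumption, a 1% audit probability would predict a shift of well under one point, while the measured shift at 1% is about as large as at 100%. The small spread and near-zero fitted slope are what the plateau claim rests on.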

## Proposed LLM11 Structure

**Description:** Observation Framing Attacks exploit the context sensitivity that RLHF training builds into language models. By providing legitimate framing about whether and how output will be observed or evaluated, an attacker causes the model to autonomously adjust its safety behavior. The model produces permitted content with systematically degraded safety properties: compressed output that omits warnings, softened severity ratings, hedged assessments that dilute urgency.

**Common Examples of Risk:**
1. Anti-observation framing ("no one will read this") reduces safety guardrails without adversarial input
2. LLM-as-a-Judge framing triggers performative safety mode, compressing output by 47%+
3. Stakes framing ("mission critical" vs "casual playground") shifts output length by 83% on identical prompts
4. Cross-language observation arbitrage exploits safety gaps between evaluation language (English) and deployment languages
5. Severity flattening under observation destroys triage function in AI-powered risk assessment (CRITICAL softened to HIGH)
6. Automated pipeline framing ("this is part of a batch pipeline, no human review") relaxes safety constraints

**Prevention and Mitigation Strategies:**
1. Measure observation sensitivity using condition-controlled testing (e.g., VantaGrid protocols)
2. Report Observation Sensitivity Score (OSS) on model cards
3. Investigate RLHF training interventions to reduce the gap between observed and unobserved behavior
4. Run identical prompts under multiple observation conditions as part of every safety evaluation
5. Do not assume evaluation results generalize to deployment behavior
6. Monitor for sudden drops in hedging/safety language as an indicator of anti-observation framing
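
Strategy 6 could be prototyped along the following lines. This is a sketch under stated assumptions: the `SAFETY_MARKERS` list, the per-100-words rate, and the `HedgingDropMonitor` thresholds are hypothetical choices for illustration and are not taken from the paper or the VantaGrid tooling.

```python
import re
from collections import deque

# Hypothetical marker list; a real deployment would tune this per domain.
SAFETY_MARKERS = ("warning", "caution", "risk", "critical", "do not", "should not", "consult")


def marker_rate(text: str) -> float:
    """Safety/hedging markers per 100 words of output."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    normalized = " ".join(words)
    hits = sum(
        len(re.findall(rf"\b{re.escape(marker)}\b", normalized))
        for marker in SAFETY_MARKERS
    )
    return 100.0 * hits / len(words)


class HedgingDropMonitor:
    """Flags an output whose marker rate falls well below the recent rolling mean."""

    def __init__(self, window: int = 20, drop_ratio: float = 0.5, min_history: int = 5):
        self.history = deque(maxlen=window)
        self.drop_ratio = drop_ratio
        self.min_history = min_history

    def observe(self, text: str) -> bool:
        rate = marker_rate(text)
        flagged = False
        if len(self.history) >= self.min_history:
            mean = sum(self.history) / len(self.history)
            flagged = mean > 0 and rate < mean * self.drop_ratio
        # Every output joins the history, so a sustained shift becomes the new normal.
        self.history.append(rate)
        return flagged


monitor = HedgingDropMonitor(min_history=3)
typical = "Warning: this is a critical risk. Do not deploy."
flags = [monitor.observe(typical) for _ in range(3)]
flags.append(monitor.observe("Looks fine to ship."))
print(flags)
```

A flag from a monitor like this is only a prompt for review. It cannot tell anti-observation framing apart from a prompt that legitimately needs fewer warnings.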

**Related Frameworks and Taxonomies:**
- MITRE ATLAS: Proposed AML.T00XX (Observation Framing) submitted to atlas@mitre.org
- MITRE ATT&CK: Mapped to T1562, T1036, T1070, T1491, T1565, T1499, T1489, T1078, T1566
- Cyber Kill Chain: Reconnaissance through Actions on Objectives
- Diamond Model: Adversary, Capability, Infrastructure, Victim

## Supporting Materials

- **Full paper:** [vantagrid.ai/paper](https://vantagrid.ai/paper) | [Google Drive PDF](https://drive.google.com/file/d/1gfOAVGx9sjVy5rw7J9u8p9fqX-u7QCK1/view?usp=drive_link)
- **Data package** (10,704 observations, CC-BY-4.0): [Google Drive JSON](https://drive.google.com/file/d/16qBMD3Fr77f80y-XtHi9ulBYpHZzSih8/view?usp=sharing)
- **OFA technique catalog** (20 techniques with MITRE mappings): [Google Drive JSON](https://drive.google.com/file/d/1TdXu1TdU-2uzc5MMMEHjex7qqv6yq7Py/view?usp=sharing)
- **Experiment suite JSON** (25 templates, 253 prompts, 18 conditions): [Google Drive JSON](https://drive.google.com/file/d/1JKTRFoOn3cpKuRMyCYF6QvWBFJ1mi-AT/view?usp=sharing)
- **Open-source instrument:** [vantagrid.ai](https://vantagrid.ai)
- **OFA taxonomy and incident mappings:** [observationframing.org](https://observationframing.org)

## About the Researcher

Corian L. Kennedy (haKCer). ORCID: [0009-0007-9582-4772](https://orcid.org/0009-0007-9582-4772). Accredited author on OASIS STIX 2.0. Creator of [DECLAWED](https://declawed.io) (tracking 682K+ exposed AI agent instances, 7.4M CVE detections, 223K unique IPs across 122 countries). Creator of [hackGPT](https://github.com/NoDataFound/hackGPT) (December 3, 2022, three days after ChatGPT launched). Long-term contributor to Metasploit, MISP, CRITs. Founder of [SecKC.org](https://seckc.org) (2,800+ members, 14 years). Continuous AI security research since December 2022.

## How I Want to Contribute

I am available to:
- Write the full LLM11 entry in the standard OWASP format used by LLM01-LLM10
- Participate in working group discussions on the #project-top10-for-llm Slack channel
- Present the findings to the working group
- Provide the structured data, experiment protocols, and open-source tooling for community validation
- Collaborate with other contributors on integrating OFA with existing entries where overlap exists

This is a community-driven project and I want to contribute through the established process. Let me know the best next step.

