fix(pii_filter): add comprehensive Rust implementation hardening and regression tests#3840
fix(pii_filter): add comprehensive Rust implementation hardening and regression tests#3840brian-hussey merged 17 commits intomainfrom
Conversation
3af63ae to
b4b958b
Compare
|
I approve and put some details here for a record, findings are of low priority. Branch Review FindingsBranch: Executive Summary
Verdict: APPROVE with minor fixes Branch OverviewThis branch focuses on three main areas:
Files Changed
Issues by Severity🔴 CRITICAL: NoneNo show-stopping bugs or security vulnerabilities found. 🟡 MEDIUM: 1 Issue1. Custom Pattern ReDoS Validation IncompleteFile: Problem: The validation counts total quantifiers but doesn't detect nested quantifiers, which are the primary cause of Regular Expression Denial of Service (ReDoS) attacks. Current Validation: let quantifiers = pattern.chars()
.filter(|ch| matches!(ch, '*' | '+' | '?'))
.count()
+ pattern.matches('{').count();
if quantifiers > MAX_QUANTIFIERS { // 24
return Err("too many quantifiers");
}Dangerous Patterns That Pass Current Validation:
Impact: A malicious or erroneous custom pattern can cause catastrophic backtracking, consuming CPU for seconds or minutes on small inputs. Example Attack: # This pattern passes current validation
pattern = r"(a+)+"
# Input: 30 'a' characters + '!'
text = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa!"
# Processing time: ~1.2 seconds (exponential growth)
# Each additional character doubles the timeFix Option A: Add Nested Quantifier Validation (Recommended): /// Вычисляет максимальную глубину вложенности квантификаторов
fn calculate_quantifier_nesting_depth(pattern: &str) -> usize {
let mut group_depth = 0;
let mut max_quantifier_depth = 0;
let mut in_char_class = false;
let mut escaped = false;
for ch in pattern.chars() {
if escaped {
escaped = false;
continue;
}
if ch == '\\' {
escaped = true;
continue;
}
if ch == '[' && !in_char_class {
in_char_class = true;
continue;
}
if ch == ']' && in_char_class {
in_char_class = false;
continue;
}
if in_char_class {
continue;
}
match ch {
'(' => group_depth += 1,
')' => group_depth = group_depth.saturating_sub(1),
'*' | '+' | '?' => {
if group_depth > 0 {
max_quantifier_depth = max_quantifier_depth.max(group_depth);
}
}
'{' => {
if group_depth > 0 {
max_quantifier_depth = max_quantifier_depth.max(group_depth);
}
}
_ => {}
}
}
max_quantifier_depth
}
// Add to validate_custom_pattern:
const MAX_NESTING_DEPTH: usize = 2;
let nesting = calculate_quantifier_nesting_depth(pattern);
if nesting > MAX_NESTING_DEPTH {
return Err(format!(
"Pattern has nested quantifiers (depth: {}, max: {}) - potential ReDoS",
nesting, MAX_NESTING_DEPTH
));
}Fix Option B: Document Trusted-Input Assumption (Alternative): If custom patterns are only added by trusted admins (not end users), document this assumption: // SECURITY NOTE: Custom patterns are trusted input (added by admins only).
// This validation provides basic safeguards against typos and obvious errors.
// For untrusted pattern sources, use regex-automata DFA engine instead.Tests to Add: #[test]
fn test_rejects_nested_quantifiers() {
assert!(validate_custom_pattern("(a+)+").is_err());
assert!(validate_custom_pattern("((a+)+)+").is_err());
assert!(validate_custom_pattern(r"(\w+\s?)+").is_err());
}
#[test]
fn test_accepts_flat_quantifiers() {
assert!(validate_custom_pattern("a+b+c+").is_ok());
assert!(validate_custom_pattern("EMP-[0-9]{6}").is_ok());
}Effort: ~60 lines of code 🟢 LOW: 4 Issues1. UTF-8 Boundary Error Message Lacks Diagnostic InfoFile: Current Code: if !text.is_char_boundary(detection.start) || !text.is_char_boundary(detection.end) {
return Err("Invalid detection range: offsets must align to UTF-8 boundaries".to_string());
}Problem: Error message doesn't include which detection failed or what the actual byte offsets were, making debugging difficult. Fix: if !text.is_char_boundary(detection.start) || !text.is_char_boundary(detection.end) {
return Err(format!(
"Invalid detection range: offsets {}..{} must align to UTF-8 boundaries (text len: {})",
detection.start, detection.end, text.len()
));
}Effort: ~5 lines 2. Hash Output Length Change UndocumentedFile: Change: // Before: format!("[HASH:{}]", &format!("{:x}", result)[..8])
// After: format!("[HASH:{}]", &format!("{:x}", result)[..16])Impact:
Downstream systems parsing masked output may break if they expect fixed-width fields. Fix: Add to ## Changelog
### [Unreleased]
#### Breaking Changes
- **Hash mask output length increased**: Hash strategy now produces 16-character
hex output instead of 8 characters for improved collision resistance.
- Before: `[HASH:abcd1234]`
- After: `[HASH:abcd1234efgh5678]`
- Migration: Update any regex parsers or fixed-width field extractorsEffort: ~10 lines documentation 3. Performance Test Generates Invalid SSNsFile: Current Code: for i in range(10000):
area = (i % 799) + 100
if area == 666: # Only skips exactly 666
area = 667
lines.append(f"User {i}: SSN {area:03d}-45-6789, Email user{i}@example.com")Problem: Doesn't skip all invalid SSN area codes per SSA rules:
Fix: def is_valid_ssn_area(area: int) -> bool:
"""Check if SSN area code is structurally valid per SSA rules."""
return area != 0 and area != 666 and area < 900
lines = []
i = 0
while len(lines) < 10000:
area = (i % 800) + 100 # Range 100-899
if is_valid_ssn_area(area):
lines.append(f"User {len(lines)}: SSN {area:03d}-45-6789, Email user{len(lines)}@example.com")
i += 1Effort: ~15 lines 4. Cumulative Text Size Not Tracked in Nested Structures (Optional)File: Current Behavior: Each string is validated individually against Attack Scenario: # Each string is 1KB (passes individual check)
data = {"field_" + str(i): "x" * 1024 for i in range(2000)}
# Total: 2MB (may exceed intended memory budget)Fix (Optional, More Invasive): fn process_nested_internal(
&self,
py: Python,
data: &Bound<'_, PyAny>,
path: &str,
depth: usize,
cumulative_size: &mut usize, // NEW parameter
) -> PyResult<(bool, Py<PyAny>, Py<PyAny>)> {
// ...
if let Ok(text) = data.extract::<String>() {
*cumulative_size += text.len();
if *cumulative_size > self.config.max_text_bytes {
return Err(PyErr::new::<pyo3::exceptions::PyValueError, _>(
"Cumulative text size exceeds maximum limit"
));
}
// ...
}
// Recursive calls pass cumulative_size
let (val_modified, new_value, val_detections) =
self.process_nested_internal(py, &value, &new_path, depth + 1, cumulative_size)?;
}Why Optional: This is defense-in-depth. The existing per-string check catches most attacks. Only needed if memory exhaustion via many small strings is a documented threat model. Effort: ~50 lines Positive Findings✅ 1. Excellent SSN ValidationFile: The Rust detector correctly implements SSA Publication No. 05-10033 rules: fn is_valid_ssn(value: &str) -> bool {
let digits: String = value.chars().filter(|c| c.is_ascii_digit()).collect();
if digits.len() != 9 {
return false;
}
let area = &digits[0..3];
let group = &digits[3..5];
let serial = &digits[5..9];
area != "000" && area != "666" && area < "900" && group != "00" && serial != "0000"
}Validates:
Impact: Reduces false positives on random 9-digit numbers. ✅ 2. Strong Credit Card ValidationFile: Implements proper Luhn algorithm + card prefix validation: fn passes_luhn(value: &str) -> bool {
// ✓ Luhn checksum validation
// ✓ Length check (13-19 digits)
// ✓ Card prefix validation (Visa, MC, Amex, etc.)
}Validates:
Impact: Prevents false positives on random 16-digit numbers. ✅ 3. Contextual PII DetectionFile: Built-in patterns require explicit context labels for ambiguous identifiers:
Impact: Significantly reduces false positives on generic identifiers. ✅ 4. Comprehensive Loopback Header FilteringFile: Blocks all HTTP/1.1 hop-by-hop and routing headers: _LOOPBACK_SKIP_HEADERS: frozenset[str] = frozenset({
"authorization",
"connection",
"content-type",
"content-length",
"host",
"keep-alive",
"mcp-session-id",
"proxy-connection",
"te",
"trailer",
"transfer-encoding",
"upgrade",
"x-mcp-session-id",
"x-forwarded-internally",
})Blocks:
Impact: Prevents header injection attacks in loopback scenarios. ✅ 5. Resource Limit EnforcementFile: Validates and enforces safe resource limits: pub max_text_bytes: usize, // Default: 10MB, Max: 100MB
pub max_nested_depth: usize, // Default: 32, Max: 1000
pub max_collection_items: usize // Default: 4096, Max: 1,000,000Validates:
Impact: Prevents DoS via oversized inputs or deeply nested structures. ✅ 6. Detection Range ValidationFile: Comprehensive validation of detection ranges before masking: fn validate_detection_ranges(text: &str, detections: &[Detection]) -> Result<(), String> {
for detection in detections {
// ✓ start <= end
// ✓ end <= text length
// ✓ UTF-8 character boundaries
// ✓ Overlapping range detection
}
}Impact: Prevents panics and memory safety issues during masking. Recommended PriorityRequired Before Merge
Recommended Before Merge
Nice to Have
Testing SummaryTest Coverage Added
Key Test Scenarios Covered
Security AssessmentAttack Vectors Addressed
Remaining Concerns
ConclusionThis branch demonstrates strong security engineering with:
Overall Verdict: APPROVE with minor fixes Required Action: Address the nested quantifier validation gap (Issue #1) before merging, or explicitly document that custom patterns are trusted-input only. |
|
Addressed the review follow-ups on the branch. What changed:
Verification:
I did not add the optional cumulative nested-text budget tracking in this PR. That one is more invasive, and the existing per-string size/depth/collection limits remain unchanged. |
|
I gave page content to cline with IBM sonnet 4.5 model and this is the result of analysis: CRITICAL CORRECTION: ReDoS Analysis for Rust Regex CrateSummary: Traditional ReDoS Does NOT Apply to Rust's regex CrateAfter reviewing the official Rust regex documentation, the ReDoS concern about nested quantifiers (Issue #2 in my review) is based on a Key Facts from Rust regex Documentation:
What This Means:❌ BUSTED: Traditional ReDoS ConcernsPatterns like (a+)+ or (a*)* that cause catastrophic backtracking in PCRE, JavaScript, Python, etc. DO NOT cause ReDoS in Rust's regex crate.* These patterns will:
✅ VALID: Size Limit ConcernsThe RegexBuilder::size_limit and custom pattern validation in validate_custom_pattern() serve a DIFFERENT purpose:
|
dima-zakharov
left a comment
There was a problem hiding this comment.
Fixes Applied
- Documented custom pattern trust boundary in plugins_rust/pii_filter/src/patterns.rs and READMEs
- Improved UTF-8 boundary error message in plugins_rust/pii_filter/src/masking.rs
- Documented hash mask output length change in READMEs
- Fixed SSN generation in performance tests to skip invalid SSA area codes
- Correction: Rust regex engine does not suffer from traditional ReDoS attacks
Recommendation
Ready to merge
3ced268 to
6905c12
Compare
6905c12 to
a7b9c7e
Compare
Given the plugin is currently in-process (and that’s not changing for now), adding parallelism would directly compete with |
a7b9c7e to
46786b2
Compare
…t in Rust implementation Signed-off-by: lucarlig <luca.carlig@ibm.com>
…ction to Rust implementation Signed-off-by: lucarlig <luca.carlig@ibm.com>
…ing logic Signed-off-by: lucarlig <luca.carlig@ibm.com>
…tector Signed-off-by: lucarlig <luca.carlig@ibm.com>
…dge cases Signed-off-by: lucarlig <luca.carlig@ibm.com>
…imit upper bounds Signed-off-by: lucarlig <luca.carlig@ibm.com> Signed-off-by: lucarlig <luca.carlig@ibm.com>
Signed-off-by: lucarlig <luca.carlig@ibm.com>
Signed-off-by: lucarlig <luca.carlig@ibm.com>
Signed-off-by: lucarlig <luca.carlig@ibm.com>
Signed-off-by: lucarlig <luca.carlig@ibm.com>
Signed-off-by: lucarlig <luca.carlig@ibm.com>
Signed-off-by: lucarlig <luca.carlig@ibm.com>
Signed-off-by: lucarlig <luca.carlig@ibm.com>
Signed-off-by: lucarlig <luca.carlig@ibm.com>
Signed-off-by: lucarlig <luca.carlig@ibm.com>
Signed-off-by: lucarlig <luca.carlig@ibm.com>
46786b2 to
cd6ed0f
Compare
|
The Python Part looks good to me, need to make sure we have some E2E test so that other plugins wont fail. |
|
Using admin override to merge. PR approved by @dima-zakharov. |
📌 Summary
This PR hardens the Rust PII filter implementation with comprehensive validation, error handling, detection improvements, and regression test coverage to ensure robust security behavior. It also tightens loopback passthrough header filtering so internal loopback requests do not forward hop-by-hop, routing, or MCP session headers from the inbound client request.
🔁 Reproduction Steps
Issues were identified through internal security review and testing that revealed gaps in validation, detection patterns, error handling, and loopback header filtering.
🐞 Root Cause
The Rust PII filter implementation needed hardening in several areas:
💡 Fix Description
Implemented comprehensive improvements across the branch:
Mask strategy preservation and nested key support (
16a9930c0)Comprehensive detection patterns and protection (
b6860120d)Input validation and error handling (
56e85f82e)Resource limits and config validation (
e886de2a6)Comprehensive test coverage (
415ba377f)Documentation (
e77106295)Loopback passthrough header hardening
Connection,Transfer-Encoding,TE,Trailer,Upgrade,Host, andContent-Length🧪 Verification
make lintmake testmake coveragecargo test📐 MCP Compliance (if relevant)
✅ Checklist
make black isort pre-commit)