ANSI escape sequences in logs interfere with regex filtering #873

@sarub0b0

Description

When container logs contain ANSI escape sequences (control characters for colors, cursor movements, etc.), regex filtering doesn't work as expected because the invisible control characters interfere with pattern matching.

This issue was discovered while investigating log filtering with netshoot containers, which often include shell output with ANSI escape sequences.

Current Behavior

For example, if a log line contains:

\x1b[31mERROR\x1b[0m: Connection failed
  • Visually displayed as: ERROR: Connection failed (with red color)
  • Actual string contains: \x1b[31mERROR\x1b[0m: Connection failed
  • Regex ^ERROR match: ❌ Fails (line actually starts with \x1b)
  • User expectation: ✅ Should match (user sees "ERROR" at the start)

Expected Behavior

Users should be able to filter logs based on what they visually see, not based on invisible control characters.

Examples of Affected Patterns

  1. Line-start matching: ^ERROR doesn't match lines that appear to start with "ERROR"
  2. Line-end matching: failed$ doesn't match lines that appear to end with "failed"
  3. Empty line filtering: ^$ doesn't match visually empty lines that contain only control characters
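The three cases above can be reproduced with the standard library alone. This is a minimal sketch, not code from this project: `starts_with`, `ends_with`, and `is_empty` stand in for the anchored patterns `^ERROR`, `failed$`, and `^$` so that no regex dependency is needed, and the sample lines other than the first are hypothetical.

```rust
/// Stand-in for the `^ERROR` pattern.
pub fn matches_line_start(line: &str) -> bool {
    line.starts_with("ERROR")
}

/// Stand-in for the `failed$` pattern.
pub fn matches_line_end(line: &str) -> bool {
    line.ends_with("failed")
}

/// Stand-in for the `^$` pattern.
pub fn matches_empty(line: &str) -> bool {
    line.is_empty()
}

/// The example line from this issue: red "ERROR", then a reset.
pub const COLORED_HEAD: &str = "\x1b[31mERROR\x1b[0m: Connection failed";
/// Hypothetical line whose last visible word is colored, so it ends with a reset.
pub const COLORED_TAIL: &str = "Connection \x1b[31mfailed\x1b[0m";
/// Hypothetical line that renders as empty but contains a reset sequence.
pub const RESET_ONLY: &str = "\x1b[0m";
```

All three checks return `false` on these raw lines, while the same checks succeed on the text a user actually sees (for example `matches_line_start("ERROR: Connection failed")`).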

Impact

  • Affected containers: Debug containers (netshoot), interactive shells, applications with colored output
  • User experience: Users need to know about invisible control characters to write effective filters
  • Workaround difficulty: Currently no workaround available

Technical Background

This project already has an ANSI escape sequence parser (src/ansi/parser.rs) that handles:

  • CSI sequences (colors, cursor movements, etc.)
  • Select Graphic Rendition (SGR) attributes
  • Cursor control

However, the parser doesn't cover all possible control sequences (OSC, DCS, terminal-specific sequences, etc.).

Proposed Solution

Maintain two versions of log content:

  1. Raw content: For display (preserves colors and formatting)
  2. Plain content: For filtering (all ANSI escape sequences removed)

Example implementation approach:

pub struct FilterableLogContent {
    pub raw: String,    // For display
    pub plain: String,  // For filtering
}
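One way the struct might be populated is sketched below. The constructor `new` and the helper `strip_csi` are hypothetical names, not existing code in this project, and the helper deliberately handles only CSI sequences (ESC `[`, parameter bytes, then a final byte in the range 0x40 to 0x7E). It does not reuse src/ansi/parser.rs.

```rust
pub struct FilterableLogContent {
    pub raw: String,   // For display
    pub plain: String, // For filtering
}

impl FilterableLogContent {
    /// Hypothetical constructor: derives `plain` once, when the line arrives.
    pub fn new(raw: String) -> Self {
        let plain = strip_csi(&raw);
        Self { raw, plain }
    }
}

/// Hypothetical helper: removes CSI sequences only and leaves everything
/// else (including OSC sequences and lone control characters) untouched.
pub fn strip_csi(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut chars = input.chars().peekable();
    while let Some(c) = chars.next() {
        if c == '\x1b' && chars.peek() == Some(&'[') {
            chars.next(); // consume '['
            // Skip parameter and intermediate bytes, up to and including
            // the final byte that terminates the sequence.
            for c in chars.by_ref() {
                if ('\x40'..='\x7e').contains(&c) {
                    break;
                }
            }
        } else {
            out.push(c);
        }
    }
    out
}
```

With this shape, the filter would run against `plain` (where `\x1b[31mERROR\x1b[0m: Connection failed` becomes `ERROR: Connection failed`), while the renderer keeps using `raw`.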

Implementation Considerations

  1. Scope of stripping:

    • CSI sequences: \x1b[... (already parsed)
    • OSC sequences: \x1b]... (not yet covered)
    • Other control characters: C0/C1 control codes
  2. Stripping method:

    • Option A: Regex-based (simple, covers 95-99% of cases)
    • Option B: Parser-based (accurate but requires extending existing parser)
    • Option C: Hybrid (recommended)
  3. Performance impact: ANSI codes must be stripped from every log line as it is received

  4. Memory impact: Storing both raw and plain versions roughly doubles string storage for lines that contain escape sequences

  5. Compatibility: Ensure existing log display functionality isn't affected
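To make the scope, performance, and memory points concrete, here is a rough sketch of a hand-written stripper in the spirit of Option B. Everything in it is an assumption for illustration: the names `strip_ansi` and `is_stripped_control` are hypothetical, tabs are kept while all other C0/C1 control characters are dropped, and any escape other than CSI or OSC is treated as a two-character sequence (a simplification that does not handle DCS or multi-byte charset escapes). Returning `Cow` lets lines without control characters skip the allocation entirely.

```rust
use std::borrow::Cow;

/// Assumption: every control character except tab is removed.
fn is_stripped_control(c: char) -> bool {
    c.is_control() && c != '\t'
}

/// Hypothetical stripper covering CSI, OSC, and C0/C1 control characters.
pub fn strip_ansi(input: &str) -> Cow<'_, str> {
    // Fast path: nothing to strip, so borrow the input and allocate nothing.
    if !input.chars().any(is_stripped_control) {
        return Cow::Borrowed(input);
    }

    let mut out = String::with_capacity(input.len());
    let mut chars = input.chars().peekable();
    while let Some(c) = chars.next() {
        match c {
            '\x1b' => match chars.next() {
                // CSI: ESC [ ... final byte in 0x40..=0x7E
                Some('[') => {
                    for c in chars.by_ref() {
                        if ('\x40'..='\x7e').contains(&c) {
                            break;
                        }
                    }
                }
                // OSC: ESC ] ... terminated by BEL or by ST (ESC \)
                Some(']') => {
                    while let Some(c) = chars.next() {
                        if c == '\x07' {
                            break;
                        }
                        if c == '\x1b' && chars.peek() == Some(&'\\') {
                            chars.next();
                            break;
                        }
                    }
                }
                // Simplification: drop ESC plus the one character after it.
                Some(_) | None => {}
            },
            c if is_stripped_control(c) => {}
            c => out.push(c),
        }
    }
    Cow::Owned(out)
}
```

A `Cow::Borrowed` result means the line had nothing to strip, so a caller could avoid storing a second copy for such lines. Whether that saving matters in practice depends on how much of the log stream is colored, which would need measuring.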

Additional Notes

This is a separate issue from #871 (leading space after timestamp). While both affect regex filtering, they have different root causes and solutions.

This issue is marked for future consideration and is not blocking current functionality.
