| 1 | +# Redact |
| 2 | + |
| 3 | +Path-shape-preserving redactor used by the crash reporter and (Phase 4+) the error reporter. |
| 4 | + |
| 5 | +The hot path is `redact_line`, called once per log line. One composed regex with named capture |
| 6 | +groups drives a single pass; the dispatch closure inspects which group matched and calls the |
| 7 | +matching rewriter. `Cow::Borrowed` is returned for lines with no matches so the no-PII case |
| 8 | +costs zero allocations. |
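
The borrow-on-no-match behavior can be illustrated with a minimal std-only sketch. This is not the real `redact_line`: it assumes a single hypothetical literal prefix (`/Users/`) found with `str::find` in place of the composed regex, and it does not apply the path-shape rules. It only shows how the no-match case returns `Cow::Borrowed` without allocating.

```rust
use std::borrow::Cow;

/// Hypothetical stand-in for the composed regex: one literal prefix.
const PREFIX: &str = "/Users/";

/// Illustrative only. Replaces `/Users/<name>` with `$HOME` and leaves
/// the rest of the line untouched.
fn redact_line(line: &str) -> Cow<'_, str> {
    // Fast path: nothing to redact, so hand the input back without allocating.
    if !line.contains(PREFIX) {
        return Cow::Borrowed(line);
    }

    let mut out = String::with_capacity(line.len());
    let mut rest = line;
    while let Some(start) = rest.find(PREFIX) {
        // Copy everything before the match, then emit the fixed token.
        out.push_str(&rest[..start]);
        out.push_str("$HOME");
        let after = &rest[start + PREFIX.len()..];
        // Skip the user name: everything up to the next '/' or whitespace.
        let end = after
            .find(|c: char| c == '/' || c.is_whitespace())
            .unwrap_or(after.len());
        rest = &after[end..];
    }
    out.push_str(rest);
    Cow::Owned(out)
}
```

Callers that only need a `&str` can use the result directly through `Deref`; the `String` is built only when at least one match exists.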
| 9 | + |
| 10 | +## Pattern table |
| 11 | + |
| 12 | +| Group | Matches | Rewrites to | |
| 13 | +| --------------- | ---------------------------------------------------- | ------------------------------------------------------- | |
| 14 | +| `unix_home` | `/Users/<user>/...`, `/home/<user>/...` | `$HOME/<allowlisted-parent-or-dir>/<file>.<ext>` | |
| 15 | +| `win_home` | `C:\Users\<user>\...` | `$HOME\<allowlisted-parent-or-dir>\<file>.<ext>` | |
| 16 | +| `unix_system` | `/tmp/...`, `/var/...`, `/private/...`, `/opt/...` | Prefix kept; tail walked with same shape rules | |
| 17 | +| `volumes` | `/Volumes/<label>/...` (label may contain spaces) | `/Volumes/<volume>/<allowlisted-or-dir>/<file>.<ext>` | |
| 18 | +| `media` | `/media/<label>/...` (label may contain spaces) | `/media/<volume>/<allowlisted-or-dir>/<file>.<ext>` | |
| 19 | +| `smb_uri` | `smb://host/share/...` | `smb://<host>/<share>/<redacted tail>` | |
| 20 | +| `unc` | `\\host\share\...` | `\\<host>\<share>\<redacted tail>` | |
| 21 | +| `url_userinfo` | `scheme://user[:pass]@host/...` | `scheme://<userinfo>@host/...` (host preserved) | |
| 22 | +| `email` | `local@domain.tld` (loosely follows RFC 5321) | `<email>` |
| 23 | +| `mdns` | `<label>.local` bare hostnames | `<host>.local` | |
| 24 | +| `ipv4` | dotted-quad with valid octet ranges | `<ipv4>` | |
| 25 | +| `ipv6` | full + common compact forms (`::1`, `fe80::1`, ...) | `<ipv6>` | |
| 26 | + |
| 27 | +### Path-shape preservation |
| 28 | + |
| 29 | +For paths, we keep: |
| 30 | + |
| 31 | +- The **mount/home prefix** as a fixed token (`$HOME`, `/Volumes/<volume>`, `/media/<volume>`, |
| 32 | + `/tmp/`, etc.). |
| 33 | +- The **immediate parent directory name** if it's in the allowlist |
| 34 | + (`Documents`, `Downloads`, `Desktop`, `Library`, `src`, `Pictures`, `Movies`, `Music`, |
| 35 | + `Public`, `AppData`, `Application Support`). |
| 36 | +- The **file extension** if it is at most 8 ASCII alphanumeric characters. |
| 37 | + |
| 38 | +Everything else collapses to `<dir>` or `<file>`. So |
| 39 | +`/Users/john/Documents/budget.pdf` → `$HOME/Documents/<file>.pdf`, but |
| 40 | +`/Users/john/SecretProject/budget.pdf` → `$HOME/<dir>/<file>.pdf`. |
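
The three rules above can be sketched in a few lines of std-only Rust. This is an illustration under stated assumptions, not the real rewriter: `redact_home_tail` and `redact_file_name` are hypothetical names, the input is assumed to be the part of the path after `/Users/<user>/`, and only the immediate parent directory is emitted (how the real code treats deeper nesting is not specified here).

```rust
/// Allowlist copied from the list above.
const ALLOWLIST: [&str; 11] = [
    "Documents", "Downloads", "Desktop", "Library", "src", "Pictures",
    "Movies", "Music", "Public", "AppData", "Application Support",
];

/// Keeps the extension only if it is 1..=8 ASCII alphanumeric characters.
fn redact_file_name(file: &str) -> String {
    match file.rsplit_once('.') {
        Some((_, ext))
            if !ext.is_empty()
                && ext.len() <= 8
                && ext.chars().all(|c| c.is_ascii_alphanumeric()) =>
        {
            format!("<file>.{ext}")
        }
        _ => "<file>".to_string(),
    }
}

/// Rewrites the tail of a home path into its redacted shape.
fn redact_home_tail(tail: &str) -> String {
    let parts: Vec<&str> = tail.split('/').filter(|p| !p.is_empty()).collect();
    let Some((file, dirs)) = parts.split_last() else {
        return "$HOME".to_string();
    };

    let mut out = String::from("$HOME");
    // Assumption: only the immediate parent is represented in the output.
    if let Some(parent) = dirs.last() {
        let shown = if ALLOWLIST.contains(parent) { *parent } else { "<dir>" };
        out.push('/');
        out.push_str(shown);
    }
    out.push('/');
    out.push_str(&redact_file_name(file));
    out
}
```

With these definitions, `redact_home_tail("Documents/budget.pdf")` reproduces the first example above and `redact_home_tail("SecretProject/budget.pdf")` the second.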
| 41 | + |
| 42 | +### Decision: why path-shape preservation + allowlist |
| 43 | + |
| 44 | +This is a tradeoff between debuggability ("I can see this is a Documents path") and PII |
| 45 | +safety ("but I don't want to leak project codenames"). The allowlist covers directories that |
| 46 | +are near-universal across users; any custom directory name collapses to `<dir>`. As a result, |
| 47 | +triagers can usually infer the failure context without seeing the user's secrets. |
| 48 | + |
| 49 | +### Decision: MTP device names not handled in Phase 1 |
| 50 | + |
| 51 | +The plan listed "MTP device names (from log target prefix)", but the cross-cutting reminder |
| 52 | +clarifies that the redactor operates on the message body, not the target. Bare device names like |
| 53 | +`Pixel 9 Pro` in the message body are too generic to detect without context. If we end up |
| 54 | +needing this, we'll add a per-call `RedactionContext` rather than baking it into the global |
| 55 | +regex. |
| 56 | + |
| 57 | +## How to add a new pattern |
| 58 | + |
| 59 | +Three steps: |
| 60 | + |
| 61 | +1. Add a new alternative inside `redactor_regex()` with a unique `(?P<group_name>...)` and |
| 62 | + write a corresponding rewriter (or extend `dispatch`) to map matches to redacted output. |
| 63 | +2. Add a dedicated test in `tests.rs` with at least 6 input→expected tuples covering edge |
| 64 | + cases (start of line, middle of line, embedded in punctuation, multiple per line). |
| 65 | +3. Append two or three lines to `fixtures/log-corpus.txt` exercising the new pattern. |
| 66 | + Update `fixtures/log-corpus.redacted.txt` to match. The `replacement_count_histogram` |
| 67 | + test will tell you if the corpus is missing your pattern. |
| 68 | + |
| 69 | +## Files |
| 70 | + |
| 71 | +| File | Purpose | |
| 72 | +| ----------------------------------- | ------------------------------------------------------------- | |
| 73 | +| `mod.rs` | Public API + composed regex + path rewriters | |
| 74 | +| `tests.rs` | Per-pattern tests, idempotency, golden corpus, histogram | |
| 75 | +| `fixtures/log-corpus.txt` | Synthesized log lines covering every pattern class | |
| 76 | +| `fixtures/log-corpus.redacted.txt` | Expected redaction of the corpus (golden snapshot) | |
| 77 | + |
| 78 | +## Gotchas |
| 79 | + |
| 80 | +- The dispatch order in `dispatch()` mirrors the alternation order in the regex. SMB URIs |
| 81 | + with userinfo (`smb://user@host/...`) match `smb_uri` first (it's listed earlier) — they |
| 82 | + do **not** fall through to `url_userinfo`. The userinfo is dropped along with the host. |
| 83 | +- `redact_text` splits on `\n` and redacts each line independently. This keeps regex `\b` |
| 84 | + anchors predictable and lets us return `Cow::Borrowed` per line. |
| 85 | +- Verbose regex mode (`(?x)`) ignores whitespace in the pattern. The Rust `regex` crate documents that |
| 86 | + this includes whitespace inside `[...]`, unlike PCRE; write `\x20` or `\ ` for a literal space. |
| 87 | +- Paths with embedded spaces like `/Volumes/My Backup Drive/...` are matched by allowing |
| 88 | + single spaces between path components. Multi-space gaps stop the match. |
| 89 | +- The `url_userinfo` pattern preserves the host on purpose — the assumption is that the host |
| 90 | + is part of a well-known service URL the developer needs to see. If we ever store private |
| 91 | + hosts in URLs, revisit this. |
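
The per-line split described in the `redact_text` gotcha above can be sketched as follows. Both functions here are hypothetical stand-ins: this `redact_line` just masks the literal word `secret`, and the `String` return type of `redact_text` is an assumption, since the real signature is not shown in this document.

```rust
use std::borrow::Cow;

/// Hypothetical stand-in for the real `redact_line`.
fn redact_line(line: &str) -> Cow<'_, str> {
    if line.contains("secret") {
        Cow::Owned(line.replace("secret", "<redacted>"))
    } else {
        // Untouched lines are borrowed, not copied, at this stage.
        Cow::Borrowed(line)
    }
}

/// Splits on '\n', redacts each line independently, and rejoins.
/// A trailing newline survives because `split` yields a final empty piece.
fn redact_text(text: &str) -> String {
    text.split('\n')
        .map(redact_line)
        .collect::<Vec<_>>()
        .join("\n")
}
```

Because each line is handled on its own, a match can never span a line break, which is what keeps the `\b` anchors predictable.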