Skip to content

Commit 1d719f3

Browse files
committed
Feature: Shared PII redactor for logs + crashes
- New `redact` module: one composed regex with named groups dispatches per pattern (Unix/Windows home, /Volumes, /media, SMB UNC + smb://, *.local, IPv4/IPv6, email, URL userinfo, /tmp + /var + /private + /opt). - Path-shape preservation: keeps file extension and allowlisted parent dir name (`Documents`, `Library`, `Application Support`, ...); collapses unknown dirs to `<dir>` so project codenames don't leak. - 19 redact tests + 21 crash reporter tests pass: per-pattern, idempotency, golden corpus snapshot (~140 synthesized lines, regenerate via `REGENERATE_REDACT_CORPUS=1`), histogram coverage check, and negatives (`cmdr_lib::network::smb_client`, `0.1.2-alpha`, etc. unchanged). - `crash_reporter::sanitize_panic_message` now delegates to `redact::redact_panic_message` (kept as a thin wrapper for one cycle).
1 parent 1d5f661 commit 1d719f3

9 files changed

Lines changed: 1323 additions & 26 deletions

File tree

apps/desktop/src-tauri/src/crash_reporter/CLAUDE.md

Lines changed: 5 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -27,8 +27,11 @@ Both paths write to `crash-report.json` in the app data dir (same dir as `settin
2727
## Key design decisions
2828

2929
- **Opt-in only** (`updates.crashReports` defaults to `false`). Consistent with the "no telemetry" stance.
30-
- **No PII, ever.** Panic messages are sanitized to strip file paths before writing. No file names, usernames, device
31-
IDs, or license keys are included.
30+
- **No PII, ever.** Panic messages are sanitized via the shared
31+
[`crate::redact`](../redact/CLAUDE.md) module to strip file paths, hostnames, IPs, emails,
32+
and URL userinfo before writing. The `sanitize_panic_message` function in `mod.rs` is a
33+
thin wrapper around `redact::redact_panic_message`. No file names, usernames, device IDs,
34+
or license keys are included.
3235
- **Dev mode: capture only, never send.** Crash files are written (useful for testing), but the send path is skipped to
3336
avoid polluting production data.
3437
- **Crash loop protection**: If the crash file's timestamp is less than five seconds before the current launch, skip

apps/desktop/src-tauri/src/crash_reporter/mod.rs

Lines changed: 5 additions & 17 deletions
Original file line numberDiff line numberDiff line change
@@ -10,8 +10,8 @@ mod symbolicate;
1010
mod tests;
1111

1212
use crate::config;
13+
use crate::redact;
1314
use crate::settings;
14-
use regex::Regex;
1515
use serde::{Deserialize, Serialize};
1616
use std::path::{Path, PathBuf};
1717
use std::sync::OnceLock;
@@ -154,23 +154,11 @@ fn extract_panic_message(info: &std::panic::PanicHookInfo<'_>) -> Option<String>
154154
Some(info.to_string())
155155
}
156156

157-
/// Strip file paths from panic messages to prevent PII leaks.
158-
/// Matches Unix paths (/Users/..., /home/..., /tmp/...) and Windows paths (C:\...).
157+
/// Strip PII from panic messages. Thin wrapper around the shared redactor; kept here so
158+
/// callers don't need to know about `crate::redact` and so existing tests have a stable
159+
/// entry point during the migration.
159160
fn sanitize_panic_message(message: &str) -> String {
160-
static RE: OnceLock<Regex> = OnceLock::new();
161-
let re = RE.get_or_init(|| {
162-
// Match Unix absolute paths and Windows drive paths.
163-
// Captures paths like /Users/foo/bar.rs:42:5 or C:\Users\foo\bar.rs
164-
Regex::new(
165-
r#"(?x)
166-
(?:
167-
/(?:Users|home|tmp|var|private|opt|usr|nix)[/][^\s"':;,)}\]]+
168-
| [A-Z]:\\[^\s"':;,)}\]]+
169-
)"#,
170-
)
171-
.expect("valid regex")
172-
});
173-
re.replace_all(message, "<path>").into_owned()
161+
redact::redact_panic_message(message)
174162
}
175163

176164
fn parse_backtrace_frames(backtrace_str: &str) -> Vec<String> {

apps/desktop/src-tauri/src/crash_reporter/tests.rs

Lines changed: 13 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -111,13 +111,17 @@ fn nonexistent_crash_file_returns_none() {
111111
assert!(result.is_none());
112112
}
113113

114+
// Sanitization is delegated to `crate::redact` (see `redact/tests.rs` for full pattern
115+
// coverage). These tests verify the wrapper still strips the same PII the old sanitizer
116+
// did, just with the new path-shape-preserving output.
117+
114118
#[test]
115119
fn sanitize_unix_home_path() {
116120
let msg = r#"No such file or directory (os error 2): /Users/john/Documents/secret-project/file.txt"#;
117121
let sanitized = sanitize_panic_message(msg);
118122
assert!(!sanitized.contains("/Users/john"));
119123
assert!(!sanitized.contains("secret-project"));
120-
assert!(sanitized.contains("<path>"));
124+
assert!(sanitized.contains("$HOME"));
121125
}
122126

123127
#[test]
@@ -126,23 +130,25 @@ fn sanitize_linux_home_path() {
126130
let sanitized = sanitize_panic_message(msg);
127131
assert!(!sanitized.contains("/home/alice"));
128132
assert!(!sanitized.contains("id_rsa"));
129-
assert!(sanitized.contains("<path>"));
133+
assert!(sanitized.contains("$HOME"));
130134
}
131135

132136
#[test]
133137
fn sanitize_windows_path() {
134138
let msg = r"couldn't read C:\Users\Bob\Desktop\passwords.txt";
135139
let sanitized = sanitize_panic_message(msg);
136140
assert!(!sanitized.contains(r"C:\Users\Bob"));
137-
assert!(sanitized.contains("<path>"));
141+
assert!(sanitized.contains("$HOME"));
138142
}
139143

140144
#[test]
141145
fn sanitize_tmp_path() {
142146
let msg = "error at /tmp/build-abc123/src/main.rs:42:5";
143147
let sanitized = sanitize_panic_message(msg);
144-
assert!(!sanitized.contains("/tmp/build"));
145-
assert!(sanitized.contains("<path>"));
148+
assert!(!sanitized.contains("/tmp/build-abc123"));
149+
// Path-shape preservation keeps the `/tmp/` prefix and the file extension.
150+
assert!(sanitized.contains("/tmp/"));
151+
assert!(sanitized.contains("<file>.rs"));
146152
}
147153

148154
#[test]
@@ -158,8 +164,8 @@ fn sanitize_multiple_paths() {
158164
let sanitized = sanitize_panic_message(msg);
159165
assert!(!sanitized.contains("/Users/a"));
160166
assert!(!sanitized.contains("/Users/b"));
161-
// Should have two <path> replacements
162-
assert_eq!(sanitized.matches("<path>").count(), 2);
167+
// Two paths → two $HOME replacements.
168+
assert_eq!(sanitized.matches("$HOME").count(), 2);
163169
}
164170

165171
#[test]

apps/desktop/src-tauri/src/lib.rs

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -98,6 +98,7 @@ mod network;
9898
mod permissions;
9999
#[cfg(target_os = "linux")]
100100
mod permissions_linux;
101+
mod redact;
101102
pub mod search;
102103
mod secrets;
103104
mod settings;
Lines changed: 91 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,91 @@
1+
# Redact
2+
3+
Path-shape-preserving redactor used by the crash reporter and (Phase 4+) the error reporter.
4+
5+
The hot path is `redact_line`, called once per log line. One composed regex with named capture
6+
groups drives a single pass; the dispatch closure inspects which group matched and calls the
7+
matching rewriter. `Cow::Borrowed` is returned for lines with no matches so the no-PII case
8+
costs zero allocations.
9+
10+
## Pattern table
11+
12+
| Group | Matches | Rewrites to |
13+
| --------------- | ---------------------------------------------------- | ------------------------------------------------------- |
14+
| `unix_home` | `/Users/<user>/...`, `/home/<user>/...` | `$HOME/<allowlisted-parent-or-dir>/<file>.<ext>` |
15+
| `win_home` | `C:\Users\<user>\...` | `$HOME\<allowlisted-parent-or-dir>\<file>.<ext>` |
16+
| `unix_system` | `/tmp/...`, `/var/...`, `/private/...`, `/opt/...` | Prefix kept; tail walked with same shape rules |
17+
| `volumes` | `/Volumes/<label>/...` (label may contain spaces) | `/Volumes/<volume>/<allowlisted-or-dir>/<file>.<ext>` |
18+
| `media` | `/media/<label>/...` (label may contain spaces) | `/media/<volume>/<allowlisted-or-dir>/<file>.<ext>` |
19+
| `smb_uri` | `smb://host/share/...` | `smb://<host>/<share>/<redacted tail>` |
20+
| `unc` | `\\host\share\...` | `\\<host>\<share>\<redacted tail>` |
21+
| `url_userinfo` | `scheme://user[:pass]@host/...` | `scheme://<userinfo>@host/...` (host preserved) |
22+
| `email` | `local@domain.tld` (loose RFC 5321 ish) | `<email>` |
23+
| `mdns` | `<label>.local` bare hostnames | `<host>.local` |
24+
| `ipv4` | dotted-quad with valid octet ranges | `<ipv4>` |
25+
| `ipv6` | full + common compact forms (`::1`, `fe80::1`, ...) | `<ipv6>` |
26+
27+
### Path-shape preservation
28+
29+
For paths, we keep:
30+
31+
- The **mount/home prefix** as a fixed token (`$HOME`, `/Volumes/<volume>`, `/media/<volume>`,
32+
`/tmp/`, etc.).
33+
- The **immediate parent directory name** if it's in the allowlist
34+
(`Documents`, `Downloads`, `Desktop`, `Library`, `src`, `Pictures`, `Movies`, `Music`,
35+
`Public`, `AppData`, `Application Support`).
36+
- The **file extension** if it's <= 8 ASCII alphanumeric chars.
37+
38+
Everything else collapses to `<dir>` or `<file>`. So
39+
`/Users/john/Documents/budget.pdf``$HOME/Documents/<file>.pdf`, but
40+
`/Users/john/SecretProject/budget.pdf``$HOME/<dir>/<file>.pdf`.
41+
42+
### Decision: why path-shape preservation + allowlist
43+
44+
Tradeoff between debuggability ("I can see this is a Documents path") and PII safety
45+
("but I don't want to leak project codenames"). The allowlist captures the dirs that are
46+
near-universal across users — anything custom collapses. Net result: triagers can usually
47+
guess the failure context without seeing the user's secrets.
48+
49+
### Decision: MTP device names not handled in Phase 1
50+
51+
The plan listed "MTP device names (from log target prefix)" but the cross-cutting reminder
52+
clarifies: redactor operates on the message body, not the target. Bare device names like
53+
`Pixel 9 Pro` in the message body are too generic to detect without context. If we end up
54+
needing this, we'll add a per-call `RedactionContext` rather than baking it into the global
55+
regex.
56+
57+
## How to add a new pattern
58+
59+
Three steps:
60+
61+
1. Add a new alternative inside `redactor_regex()` with a unique `(?P<group_name>...)` and
62+
write a corresponding rewriter (or extend `dispatch`) to map matches to redacted output.
63+
2. Add a dedicated test in `tests.rs` with at least 6 input→expected tuples covering edge
64+
cases (start of line, middle of line, embedded in punctuation, multiple per line).
65+
3. Append two or three lines to `fixtures/log-corpus.txt` exercising the new pattern.
66+
Update `fixtures/log-corpus.redacted.txt` to match. The `replacement_count_histogram`
67+
test will tell you if the corpus is missing your pattern.
68+
69+
## Files
70+
71+
| File | Purpose |
72+
| ----------------------------------- | ------------------------------------------------------------- |
73+
| `mod.rs` | Public API + composed regex + path rewriters |
74+
| `tests.rs` | Per-pattern tests, idempotency, golden corpus, histogram |
75+
| `fixtures/log-corpus.txt` | Synthesized log lines covering every pattern class |
76+
| `fixtures/log-corpus.redacted.txt` | Expected redaction of the corpus (golden snapshot) |
77+
78+
## Gotchas
79+
80+
- The dispatch order in `dispatch()` mirrors the alternation order in the regex. SMB URIs
81+
with userinfo (`smb://user@host/...`) match `smb_uri` first (it's listed earlier) — they
82+
do **not** fall through to `url_userinfo`. The userinfo is dropped along with the host.
83+
- `redact_text` splits on `\n` and redacts each line independently. This keeps regex `\b`
84+
anchors predictable and lets us return `Cow::Borrowed` per line.
85+
- Verbose regex mode (`(?x)`) ignores whitespace **outside** character classes. Inside
86+
`[...]` whitespace is literal, so `[A-Za-z]` is fine but `[ A-Za-z ]` would match a space.
87+
- Paths with embedded spaces like `/Volumes/My Backup Drive/...` are matched by allowing
88+
single spaces between path components. Multi-space gaps stop the match.
89+
- The `url_userinfo` pattern preserves the host on purpose — the assumption is that the host
90+
is part of a well-known service URL the developer needs to see. If we ever store private
91+
hosts in URLs, revisit this.

0 commit comments

Comments
 (0)