CFP Resilience Baseline

This document defines the no-data-loss strategy for CFP submissions in Homedir.

Goals

Primary store: ${homedir.data.dir}/cfp-submissions.json.
Automatic local snapshots: ${homedir.data.dir}/backups/cfp/.
Snapshot retention: configurable maximum number of snapshot files kept (cfp.persistence.backups.max-files, default 120).
Snapshot frequency guard: configurable minimum interval between snapshots (cfp.persistence.backups.min-interval-ms, default 300000, i.e. 5 minutes).
Recovery flow:
1. Try the primary file.
2. If the primary is missing or corrupted, recover from the newest valid snapshot.
3. Quarantine the corrupted primary as cfp-submissions.corrupt-<timestamp>.json.
4. Rebuild the primary from the recovered snapshot.
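
The recovery flow above can be sketched roughly as follows. This is a minimal illustration, not the Homedir implementation: the function names, the timestamp format, the assumption that snapshot file names sort chronologically, and the choice to return an empty map when nothing can be recovered are all hypothetical. "Valid" here only means "parses as a JSON object".

```python
import json
import time
from pathlib import Path


def load_json_map(path: Path):
    """Return the parsed JSON object, or None if the file is missing or unreadable."""
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return None
    return data if isinstance(data, dict) else None


def recover_submissions(data_dir: Path) -> dict:
    primary = data_dir / "cfp-submissions.json"
    backups = data_dir / "backups" / "cfp"

    # Step 1: try the primary file.
    data = load_json_map(primary)
    if data is not None:
        return data

    # Step 2: fall back to the newest snapshot that parses.
    # Assumption: snapshot file names sort chronologically.
    recovered = None
    if backups.is_dir():
        for snapshot in sorted(backups.glob("*.json"), reverse=True):
            if ".corrupt-" in snapshot.name:
                continue  # skip snapshots that were already quarantined
            recovered = load_json_map(snapshot)
            if recovered is not None:
                break

    # Step 3: quarantine a corrupted primary so it can be inspected later.
    if primary.exists():
        stamp = time.strftime("%Y%m%dT%H%M%S")  # hypothetical timestamp format
        primary.rename(data_dir / f"cfp-submissions.corrupt-{stamp}.json")

    # Step 4: rebuild the primary from the recovered snapshot.
    if recovered is None:
        return {}  # hypothetical choice: nothing recoverable, start empty
    primary.write_text(json.dumps(recovered), encoding="utf-8")
    return recovered
```

Quarantining by rename rather than deleting keeps the damaged file available for later inspection, which matches the naming scheme described in step 3.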

Current persisted schema uses an envelope:
- schema_version
- kind
- updated_at
- checksum_sha256 (integrity guard)
- submissions (map by id)

Backward compatibility is preserved:
- Legacy map-only payloads are still readable from primary, backups, and WAL frames.
- Legacy payloads are auto-migrated to the versioned envelope on successful load.
- Envelope payloads that lack a checksum are auto-hydrated (the checksum is computed and stored) on successful load.
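
One way to picture the envelope, the legacy migration, and checksum hydration is the sketch below. The field names come from the list above; everything else is an assumption made for illustration: the schema_version value, the kind label, the ISO-8601 updated_at format, and especially the canonical form that is hashed (sorted keys, compact separators, submissions map only). The real checksum input may differ.

```python
import hashlib
import json
from datetime import datetime, timezone

# Keys used here to tell an envelope apart from a legacy map-only payload.
ENVELOPE_KEYS = {"schema_version", "kind", "submissions"}


def checksum(submissions: dict) -> str:
    # Assumption: SHA-256 over canonical JSON of the submissions map.
    canonical = json.dumps(submissions, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def wrap(submissions: dict) -> dict:
    """Build a versioned envelope around a submissions map."""
    return {
        "schema_version": 1,        # hypothetical version number
        "kind": "cfp-submissions",  # hypothetical kind label
        "updated_at": datetime.now(timezone.utc).isoformat(),
        "checksum_sha256": checksum(submissions),
        "submissions": submissions,
    }


def load_payload(payload: dict, checksum_required: bool = False) -> dict:
    """Normalise any supported payload into a checksummed envelope."""
    if not ENVELOPE_KEYS.issubset(payload):
        # Legacy map-only payload: migrate it to the envelope.
        return wrap(payload)
    stored = payload.get("checksum_sha256")
    if stored is None:
        if checksum_required:
            raise ValueError("checksum missing but required")
        # Envelope without checksum: hydrate it.
        return {**payload, "checksum_sha256": checksum(payload["submissions"])}
    if stored != checksum(payload["submissions"]):
        raise ValueError("checksum mismatch")
    return payload
```

A caller that hits the ValueError path would presumably fall back to snapshot recovery, as in the recovery flow described earlier.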

GET /api/events/{eventId}/cfp/submissions/storage (admin-only) returns a storage report that includes:
- primary/backups paths and sizes
- primary validation flags (primary_valid, primary_missing_checksum, primary_validation_error)
- backup validation counters (backup_valid_count, backup_invalid_count, backup_missing_checksum_count, latest_backup_valid)
- WAL status and counters (wal_enabled, wal_size_bytes, wal_appends, wal_compactions, wal_recoveries)
- checksum status and counters (checksum_enabled, checksum_required, checksum_mismatches, checksum_hydrations)
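
As an illustration of how an operator script might read this report, the sketch below turns a few of the fields listed above into human-readable warnings. The field names are taken from the list; the flat JSON shape, the function name, and the choice of which conditions count as warnings are assumptions.

```python
def storage_warnings(report: dict) -> list:
    """Return a list of warning strings for a storage report (empty if healthy)."""
    warnings = []
    if not report.get("primary_valid", False):
        detail = report.get("primary_validation_error") or "no detail"
        warnings.append(f"primary invalid: {detail}")
    if report.get("primary_missing_checksum", False):
        warnings.append("primary has no checksum yet")
    if report.get("backup_invalid_count", 0) > 0:
        warnings.append(f"{report['backup_invalid_count']} invalid backup(s)")
    if not report.get("latest_backup_valid", False):
        warnings.append("latest backup is not valid")
    if report.get("checksum_mismatches", 0) > 0:
        warnings.append(f"{report['checksum_mismatches']} checksum mismatch(es)")
    return warnings
```

For example, a report with primary_valid set to false and backup_invalid_count set to 2 would yield two warnings, one for the primary and one for the backups.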

POST /api/events/{eventId}/cfp/submissions/storage/repair?dry_run=true|false (admin-only) runs a storage repair that:
- scans primary + backup snapshots
- repairs checksum-missing snapshots
- quarantines corrupted backups (*.corrupt-<timestamp>.json)
- supports dry-run mode for safe audits before mutating files

Baseline configuration:
cfp.persistence.backups.enabled=true
cfp.persistence.backups.max-files=120
cfp.persistence.backups.min-interval-ms=300000
cfp.persistence.checksum.enabled=true
cfp.persistence.checksum.required=false
Note: cfp.persistence.checksum.required can be switched to true after the legacy fleet has been migrated.
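
To make the two backup settings concrete, the sketch below shows one plausible reading of them: min-interval-ms acts as a guard that skips a snapshot if the previous one is too recent, and max-files caps how many snapshots are kept by deleting the oldest. The function names and the assumption that snapshot file names sort chronologically are hypothetical.

```python
from pathlib import Path


def should_snapshot(last_snapshot_ms, now_ms, min_interval_ms=300_000):
    """Frequency guard: allow a snapshot only if enough time has passed."""
    return last_snapshot_ms is None or now_ms - last_snapshot_ms >= min_interval_ms


def prune_snapshots(backup_dir: Path, max_files: int = 120) -> list:
    """Delete the oldest snapshots so that at most max_files remain."""
    # Assumption: snapshot file names sort chronologically.
    snapshots = sorted(backup_dir.glob("*.json"))
    excess = snapshots[:-max_files] if max_files > 0 else snapshots
    for path in excess:
        path.unlink()
    return [p.name for p in excess]
```

Under this reading, the defaults retain at most 120 snapshots taken at least 5 minutes apart, so a continuously written store would cover roughly the last 10 hours of changes.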

Delivery sequence:
PR1: durable CFP local snapshots + automatic recovery from corruption. (implemented)
PR2: portable CFP export/import bundle and recursive admin backup/restore checks. (implemented)
PR3: CFP storage observability endpoint for admin verification + restore drill checklist. (implemented)
PR4: CFP persistence schema versioning + automatic legacy migration across primary/WAL/backups. (implemented)
PR5: checksum-based CFP integrity guard + auto-hydration and recovery fallback. (implemented)
PR6: admin storage telemetry expanded with WAL/checksum counters for live operational verification. (implemented)
PR7: storage telemetry now validates primary/backups and reports backup integrity counters. (implemented)
PR8: admin storage repair endpoint with dry-run/execute modes for checksum hydration + corrupted backup quarantine. (implemented)

Restore drill checklist:
1. Stop app traffic.
2. Restore the homedir.data.dir backup.
3. Start the app.
4. Verify that /api/events/{eventId}/cfp/submissions/mine returns the existing submissions.
5. Verify that the admin CFP moderation queue shows historical entries.