This document defines the no-data-loss strategy for CFP submissions in Homedir.
- CFP data must survive container restarts, deploys, and node replacement.
- CFP data must be recoverable if a JSON file is corrupted.
- CFP data must be portable to a new environment.
- Primary store: `${homedir.data.dir}/cfp-submissions.json`.
- Automatic local snapshots: `${homedir.data.dir}/backups/cfp/`.
- Snapshot retention: configurable (`cfp.persistence.backups.max-files`, default `120`).
- Snapshot frequency guard: configurable (`cfp.persistence.backups.min-interval-ms`, default `300000`).
- Recovery flow:
  - Try the primary file.
  - If the primary is missing or corrupted, recover from the newest valid snapshot.
  - Quarantine the corrupted primary as `cfp-submissions.corrupt-<timestamp>.json`.
  - Rebuild the primary from the recovered snapshot.
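
The recovery flow above can be sketched roughly as follows. This is a minimal illustration, not Homedir's actual implementation: the helper names (`load_json`, `recover_submissions`), the use of file modification time to pick the newest snapshot, the timestamp format, and treating "valid" as "parses as JSON" are all assumptions.

```python
import json
import time
from pathlib import Path


def load_json(path: Path):
    """Return parsed JSON, or None if the file is missing or unparseable."""
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return None


def recover_submissions(data_dir: Path):
    """Load the primary store, falling back to the newest valid snapshot."""
    primary = data_dir / "cfp-submissions.json"
    data = load_json(primary)
    if data is not None:
        return data  # Primary is readable; nothing to recover.

    backups = data_dir / "backups" / "cfp"
    if not backups.is_dir():
        return None
    # Newest first (assumption: mtime reflects snapshot age).
    snapshots = sorted(
        backups.glob("*.json"), key=lambda p: p.stat().st_mtime, reverse=True
    )
    for snapshot in snapshots:
        data = load_json(snapshot)
        if data is None:
            continue  # Skip invalid snapshots and try the next one.
        if primary.exists():
            # Quarantine the corrupted primary instead of deleting it.
            stamp = time.strftime("%Y%m%dT%H%M%S")  # hypothetical format
            primary.rename(data_dir / f"cfp-submissions.corrupt-{stamp}.json")
        # Rebuild the primary from the recovered snapshot.
        primary.write_text(json.dumps(data), encoding="utf-8")
        return data
    return None
```

Quarantining rather than deleting keeps the damaged file available for later inspection, which matches the naming scheme described above.
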
- Current persisted schema uses an envelope:
  - `schema_version`
  - `kind`
  - `updated_at`
  - `checksum_sha256` (integrity guard)
  - `submissions` (map by id)
- Backward compatibility is preserved:
- Legacy map-only payloads are still readable from primary, backups, and WAL frames.
- Legacy payloads are auto-migrated to the versioned envelope on successful load.
- Envelope payloads missing checksum are auto-hydrated on successful load.
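
To make the envelope and the compatibility rules concrete, here is a small sketch of how a loader might normalize the three payload shapes (legacy map-only, envelope without checksum, full envelope). The field names come from the list above; everything else is an assumption: the values used for `schema_version` and `kind`, the canonical JSON form that is hashed, the legacy-detection rule, and the function names.

```python
import hashlib
import json
from datetime import datetime, timezone


def checksum(submissions: dict) -> str:
    """SHA-256 over a canonical JSON encoding (assumed canonical form)."""
    canonical = json.dumps(submissions, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def to_envelope(submissions: dict) -> dict:
    """Wrap a submissions map in the versioned envelope."""
    return {
        "schema_version": 1,  # hypothetical value
        "kind": "cfp_submissions",  # hypothetical value
        "updated_at": datetime.now(timezone.utc).isoformat(),
        "checksum_sha256": checksum(submissions),
        "submissions": submissions,
    }


def load_payload(payload: dict) -> dict:
    """Normalize legacy and envelope payloads to a checksummed envelope."""
    is_envelope = "schema_version" in payload and "submissions" in payload
    if not is_envelope:
        # Legacy map-only payload: the whole object is the submissions map.
        return to_envelope(payload)
    if not payload.get("checksum_sha256"):
        # Envelope without a checksum: hydrate it on load.
        return {**payload, "checksum_sha256": checksum(payload["submissions"])}
    if payload["checksum_sha256"] != checksum(payload["submissions"]):
        # Integrity guard: the caller would fall back to a snapshot here.
        raise ValueError("checksum mismatch")
    return payload
```

For example, `load_payload({"sub-1": {"title": "Talk"}})` returns an envelope whose `submissions` field is the original map, while a tampered envelope raises `ValueError`.
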

`GET /api/events/{eventId}/cfp/submissions/storage` (admin-only) includes:
- primary/backups paths and sizes
- primary validation flags (`primary_valid`, `primary_missing_checksum`, `primary_validation_error`)
- backup validation counters (`backup_valid_count`, `backup_invalid_count`, `backup_missing_checksum_count`, `latest_backup_valid`)
- WAL status and counters (`wal_enabled`, `wal_size_bytes`, `wal_appends`, `wal_compactions`, `wal_recoveries`)
- checksum status and counters (`checksum_enabled`, `checksum_required`, `checksum_mismatches`, `checksum_hydrations`)
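
As an illustration of how an operator script might read this telemetry, the sketch below turns a storage status object into a list of warnings. It assumes the response is a flat JSON object with boolean flags and integer counters under the field names listed above; the actual response shape is not specified here, and `storage_warnings` is a hypothetical helper.

```python
def storage_warnings(status: dict) -> list[str]:
    """Return human-readable warnings for a storage status object."""
    warnings = []
    if not status.get("primary_valid", False):
        error = status.get("primary_validation_error")
        warnings.append(f"primary invalid: {error}")
    if status.get("primary_missing_checksum"):
        warnings.append("primary has no checksum yet")
    invalid_backups = status.get("backup_invalid_count", 0)
    if invalid_backups > 0:
        warnings.append(f"{invalid_backups} invalid backup(s)")
    if not status.get("latest_backup_valid", False):
        warnings.append("latest backup is not valid")
    mismatches = status.get("checksum_mismatches", 0)
    if mismatches > 0:
        warnings.append(f"{mismatches} checksum mismatch(es)")
    return warnings
```

An empty list suggests the primary and its snapshots look healthy; any entry is a cue to run the repair endpoint in dry-run mode first.
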

`POST /api/events/{eventId}/cfp/submissions/storage/repair?dry_run=true|false` (admin-only):
- scans primary + backup snapshots
- repairs checksum-missing snapshots
- quarantines corrupted backups (`*.corrupt-<timestamp>.json`)
- supports dry-run mode for safe audits before mutating files
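
A sketch of building the repair request with the Python standard library is shown below. The path and the `dry_run` query parameter come from the description above; the base URL, the `repair_request` helper, and the omission of admin authentication (which a real call would need) are assumptions. The function only constructs the request and does not send it.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request


def repair_request(base_url: str, event_id: str, dry_run: bool = True) -> Request:
    """Build (but do not send) a POST request for the storage repair endpoint."""
    event = quote(event_id, safe="")
    path = f"/api/events/{event}/cfp/submissions/storage/repair"
    query = urlencode({"dry_run": "true" if dry_run else "false"})
    # Admin credentials are intentionally left out of this sketch.
    return Request(f"{base_url.rstrip('/')}{path}?{query}", method="POST")
```

Defaulting `dry_run` to `True` means a caller has to opt in explicitly before any file is mutated, which fits the audit-first use described above.
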

Configuration properties:
- `cfp.persistence.backups.enabled=true`
- `cfp.persistence.backups.max-files=120`
- `cfp.persistence.backups.min-interval-ms=300000`
- `cfp.persistence.checksum.enabled=true`
- `cfp.persistence.checksum.required=false` (can be switched to `true` after legacy fleet migration)
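
To show how the retention and frequency-guard settings might interact, here is a minimal sketch. The constants mirror the `max-files` and `min-interval-ms` values above; the function names, the `*.json` snapshot naming, and the use of modification time for ordering are assumptions rather than Homedir's actual logic.

```python
from pathlib import Path

MAX_FILES = 120  # cfp.persistence.backups.max-files
MIN_INTERVAL_MS = 300_000  # cfp.persistence.backups.min-interval-ms


def should_snapshot(last_snapshot_ms, now_ms, min_interval_ms=MIN_INTERVAL_MS):
    """Frequency guard: allow a snapshot only after the minimum interval."""
    if last_snapshot_ms is None:
        return True  # No snapshot taken yet.
    return now_ms - last_snapshot_ms >= min_interval_ms


def prune_snapshots(backup_dir: Path, max_files=MAX_FILES):
    """Retention: delete the oldest snapshots beyond max_files."""
    snapshots = sorted(
        backup_dir.glob("*.json"), key=lambda p: p.stat().st_mtime, reverse=True
    )
    stale = snapshots[max_files:]  # Everything after the newest max_files.
    for path in stale:
        path.unlink()
    return stale
```

With these values, at most one snapshot is taken every five minutes, so 120 retained files cover at least ten hours of write activity.
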
- `homedir.data.dir` must point to a persistent volume (PVC/host disk).
- Do not run with ephemeral-only storage in production.
- Include `${homedir.data.dir}` in platform backup policy.
- PR1: durable CFP local snapshots + automatic recovery from corruption. (implemented)
- PR2: portable CFP export/import bundle and recursive admin backup/restore checks. (implemented)
- PR3: CFP storage observability endpoint for admin verification + restore drill checklist. (implemented)
- PR4: CFP persistence schema versioning + automatic legacy migration across primary/WAL/backups. (implemented)
- PR5: checksum-based CFP integrity guard + auto-hydration and recovery fallback. (implemented)
- PR6: admin storage telemetry expanded with WAL/checksum counters for live operational verification. (implemented)
- PR7: storage telemetry now validates primary/backups and reports backup integrity counters. (implemented)
- PR8: admin storage repair endpoint with dry-run/execute modes for checksum hydration + corrupted backup quarantine. (implemented)

Restore drill checklist:
- Stop app traffic.
- Restore the `homedir.data.dir` backup.
- Start the app.
- Verify `/api/events/{eventId}/cfp/submissions/mine` returns existing submissions.
- Verify the admin CFP moderation queue reads historical entries.