Read-atomic/crash-atomic NodeFSStorageAdapter #600
Pull request overview
Updates the NodeFS-backed storage adapter to make filesystem writes read-atomic and (on POSIX) crash-atomic using a temp-file + fsync + rename strategy, and adds targeted tests/documentation for the new durability guarantees.
Changes:
- Implement atomic write path for `save()` using `<baseDirectory>/.tmp/`, `fsync`, and `rename`, plus POSIX directory `fsync` for rename durability.
- Add cache rollback behavior on write failures to keep in-memory state consistent with on-disk state.
- Extend NodeFS adapter tests with atomicity/durability and cache-rollback scenarios; expand package docs with durability/atomicity details.
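
The cache rollback item above can be illustrated with a minimal sketch. The `CachedStore` class, its `save`/`get` methods, and the injected `writeToDisk` function are all hypothetical names chosen for illustration; they are not taken from the adapter's source, and the real implementation may handle concurrent saves differently.

```typescript
// Hypothetical write function type: resolves when the bytes are on disk.
type WriteFn = (key: string, data: Uint8Array) => Promise<void>

// Illustrative cache wrapper, not the adapter's actual class.
class CachedStore {
  private cache = new Map<string, Uint8Array>()
  private writeToDisk: WriteFn

  constructor(writeToDisk: WriteFn) {
    this.writeToDisk = writeToDisk
  }

  async save(key: string, data: Uint8Array): Promise<void> {
    // Remember what the cache held before the optimistic update.
    const hadPrevious = this.cache.has(key)
    const previous = this.cache.get(key)
    this.cache.set(key, data)
    try {
      await this.writeToDisk(key, data)
    } catch (err) {
      // Roll back only if a newer save has not replaced our entry meanwhile.
      if (this.cache.get(key) === data) {
        if (hadPrevious && previous !== undefined) {
          this.cache.set(key, previous)
        } else {
          this.cache.delete(key)
        }
      }
      throw err // the caller still sees the failure
    }
  }

  get(key: string): Uint8Array | undefined {
    return this.cache.get(key)
  }
}
```

The rollback re-throws so that a failed write is never silently swallowed; the cache simply stops claiming that data exists on disk when it does not.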
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| packages/automerge-repo-storage-nodefs/src/index.ts | Implements atomic write + directory fsync durability, tmp directory handling, and cache rollback behavior. |
| packages/automerge-repo-storage-nodefs/test/NodeFSStorageAdapter.test.ts | Adds new tests intended to validate atomic write behavior and cache rollback on failures. |
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
@alexjg @msakrejda @expede, #600 lines up with a cluster of open issues and PRs that all look like one failure class, and it seemed worth connecting them in one place.

The class: async work from storage or sync (a rejected save/load, a throw while decoding a peer message) that is neither awaited nor caught. It surfaces as an unhandled rejection or uncaught exception, so in Node the process exits.

You can reproduce the mechanism in isolation: #673 includes a self-contained script (eventemitter3 + a real socket …)

Existing issues that are this class:

The PRs, by layer:

Bottom line: one failure class at different layers (uncaught async from storage/sync, leading to unhandled rejection / process exit / torn or lost writes). #600 makes a write survive a crash; the rejection-handling PRs stop the crash and stop dropping the error. They are complementary. The deeper "do it properly" direction (make storage/network I/O abort-aware and properly awaited rather than fire-and-forget) is sketched in a WIP branch.

Happy to fold the cross-references into #389 as an umbrella, or open a short tracking issue, if that is easier to follow.
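
For readers unfamiliar with the failure class described in this comment, here is a minimal sketch of the difference between a dropped promise and a handled one inside an event handler. It is not the script from #673: it uses Node's built-in `EventEmitter` rather than eventemitter3, and `save`, the `"message"` event name, and the `errors` array are hypothetical stand-ins.

```typescript
import { EventEmitter } from "node:events"

// Hypothetical stand-in for a storage call that rejects.
async function save(_data: string): Promise<void> {
  throw new Error("disk write failed")
}

const emitter = new EventEmitter()
const errors: Error[] = []

// Fire-and-forget variant (left commented out on purpose): the promise
// returned by save() is dropped, so its rejection is never observed and
// Node reports an unhandled rejection, which terminates the process by
// default since Node 15.
//
// emitter.on("message", (data: string) => { void save(data) })

// Handled variant: the rejection is caught and routed somewhere visible.
emitter.on("message", (data: string) => {
  save(data).catch((err: Error) => {
    errors.push(err)
  })
})

emitter.emit("message", "payload")
```

The handled variant does not make the write succeed; it only keeps the process alive and the error observable, which is why it complements rather than replaces crash-atomic writes.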
The NodeFS storage backend can fail and leave torn writes. This PR improves the NodeFS write safety guarantees for POSIX and Windows targets. Possibly controversial: we explicitly `fsync` on POSIX, which adds on the order of 100 µs to 1 ms per write (depending on the specific SSD hardware) in exchange for durability guarantees. This doesn't get us to fully transactional writes (with CAS before/after gates, WAL, etc.), but it SIGNIFICANTLY improves atomic write reliability with read-/crash-atomicity.

Anecdotally: we were getting lots of torn writes in pushwork (due to an early exit bug). We switched to this before fixing pushwork and haven't seen any torn writes since.

We've rebased `subductionjs` over this PR. That branch adds a `saveBatch` to the interface for performance (e.g. hitting IDB many times but then needing to update all instances); this PR can be applied to those semantics, too.
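
As a rough illustration of the temp-file + `fsync` + `rename` strategy described above, here is a minimal sketch. The `atomicWrite` helper and its parameters are hypothetical and not copied from the PR; the sketch assumes the `.tmp` directory lives under the same base directory (and therefore the same filesystem) as the destination, and it skips the directory `fsync` on Windows.

```typescript
import { promises as fs } from "node:fs"
import * as path from "node:path"
import { randomBytes } from "node:crypto"

// Hypothetical helper for illustration; not the adapter's actual code.
async function atomicWrite(
  baseDirectory: string,
  relativePath: string,
  data: Uint8Array
): Promise<void> {
  const finalPath = path.join(baseDirectory, relativePath)
  const tmpDir = path.join(baseDirectory, ".tmp")
  await fs.mkdir(tmpDir, { recursive: true })
  await fs.mkdir(path.dirname(finalPath), { recursive: true })

  // A random name avoids collisions between concurrent writers.
  const tmpPath = path.join(tmpDir, randomBytes(8).toString("hex"))
  try {
    const handle = await fs.open(tmpPath, "w")
    try {
      await handle.writeFile(data)
      await handle.sync() // flush file contents before the rename
    } finally {
      await handle.close()
    }
    // Readers see either the old file or the complete new one.
    await fs.rename(tmpPath, finalPath)
  } catch (err) {
    await fs.rm(tmpPath, { force: true }) // best-effort cleanup
    throw err
  }

  // On POSIX, fsync the parent directory so the rename itself is durable.
  if (process.platform !== "win32") {
    const dir = await fs.open(path.dirname(finalPath), "r")
    try {
      await dir.sync()
    } finally {
      await dir.close()
    }
  }
}
```

The two `sync()` calls are where the per-write latency mentioned above comes from: one flushes the file contents, the other flushes the directory entry created by the rename.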