Commit b6b2f24

docs: document storage error handling and the storage adapter contract
Capture what StorageSource guarantees on a failed write (the process stays up, and the change is retained in memory so a later save or reload re-persists it), and the responsibilities that sit with storage adapters: recoverability, retry and backoff as one option among others, and escalation. Note the reliability strategies available, with network redundancy via peers as the primary lever, and that configuring observability and alerting is the embedding application's job via the pluggable logger (setLoggerFactory, console/pino/winston). No first-class storage-error signal is added.
1 parent 451bd98 commit b6b2f24

1 file changed

Lines changed: 114 additions & 0 deletions

@@ -0,0 +1,114 @@
# Storage error handling

Guidance on what happens when storage I/O fails in `automerge-repo`, and where
the responsibility for reliability lies. The behavior described here is
implemented in [`src/StorageSource.ts`](../src/StorageSource.ts); the adapter
contract applies to implementations of
[`src/storage/StorageAdapterInterface.ts`](../src/storage/StorageAdapterInterface.ts).

## What `StorageSource` guarantees

When a throttled save rejects (a disk/IO error, quota exceeded, an aborted
IndexedDB transaction, a failed remote write, and so on), `StorageSource`
catches the error and logs it rather than letting it escape. Two properties
follow:

- **The process stays up.** The save runs fire-and-forget from a
  `"heads-changed"` listener, so an unhandled rejection would, in Node.js 15
  and later, exit the process by default. Catching it keeps the repo alive.
- **The change is retained in memory.** Nothing is dropped. A later save or a
  reload re-persists the current document: `lastSavedHeads` advances only on a
  successful write, so the next successful save re-includes whatever a failed
  one did not persist.

This is deliberately the only thing `StorageSource` does about a failed write.
It converts a fatal crash into a recoverable condition, which buys time for the
recovery strategies below. It does not retry, back off, or escalate. That is not
its job (see the contract below).

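The two guarantees above can be illustrated with a minimal sketch. This is a simplified model, not the code in `src/StorageSource.ts`: `SaveTracker`, `trySave`, `write`, and `logError` are hypothetical names, and heads are modeled as plain string arrays.

```ts
// Hypothetical, simplified model of the behavior described above.
// The names here are illustrative and are not taken from src/StorageSource.ts.
type Heads = string[]

class SaveTracker {
  lastSavedHeads: Heads = []
  private write: (heads: Heads) => Promise<void>
  private logError: (err: unknown) => void

  constructor(
    write: (heads: Heads) => Promise<void>,
    logError: (err: unknown) => void = err => console.error("save failed", err)
  ) {
    this.write = write
    this.logError = logError
  }

  // Invoked fire-and-forget from an event listener, so it must never reject.
  async trySave(currentHeads: Heads): Promise<void> {
    try {
      await this.write(currentHeads)
      // Advance only after the write has succeeded.
      this.lastSavedHeads = currentHeads
    } catch (err) {
      // Log instead of rethrowing. lastSavedHeads is untouched, so the next
      // successful save still covers everything this one failed to persist.
      this.logError(err)
    }
  }
}
```

Because the failure path leaves `lastSavedHeads` alone, a later call to `trySave` with newer heads persists both the earlier and the later changes in one write.
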
## The storage adapter contract

A `StorageAdapter` should be designed to be robust. Most failure handling
belongs in the adapter rather than in `StorageSource`, because only the adapter
knows its backend.

- **Recoverability is the adapter's call.** Many storage failures are rare and
  transient: a momentary lock, a quota blip, a 503 from a remote store. The
  adapter is the layer that can distinguish a transient failure from a permanent
  one.
- **Retry and backoff are a consideration, not a mandate.** Exponential backoff
  is one reasonable strategy for some backends, but it is not always the right
  answer and it is not the only one. Whether and how to retry depends on the
  backend's semantics, which is precisely why the policy lives in the adapter
  and not in a generic layer that cannot know those semantics.
- **Escalation is the adapter's responsibility** when a failure is genuinely
  unrecoverable. How to escalate (surface to the host, fail a health check, and
  so on) is backend- and deployment-specific.

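As one illustration of the contract above, an adapter could wrap its backend calls in a small retry helper. The sketch below is an assumption-laden example, not part of `automerge-repo`: `withRetry`, `RetryOptions`, and `isTransient` are hypothetical names, and exponential backoff is only one of the possible policies.

```ts
// Hypothetical helper; not part of automerge-repo.
interface RetryOptions {
  attempts: number // total tries, including the first
  baseDelayMs: number // delay before the second try; doubles on each retry
  isTransient: (err: unknown) => boolean // backend-specific classification
}

async function withRetry<T>(
  op: () => Promise<T>,
  opts: RetryOptions
): Promise<T> {
  let lastError: unknown
  for (let attempt = 0; attempt < opts.attempts; attempt++) {
    try {
      return await op()
    } catch (err) {
      lastError = err
      // Permanent failures are rethrown at once so the adapter can escalate.
      if (!opts.isTransient(err)) throw err
      // Sleep only if another attempt will follow.
      if (attempt < opts.attempts - 1) {
        const delayMs = opts.baseDelayMs * 2 ** attempt
        await new Promise<void>(resolve => setTimeout(resolve, delayMs))
      }
    }
  }
  // Every attempt failed with a transient error.
  throw lastError
}
```

An adapter's `save` could call something like `withRetry(() => backend.put(key, data), options)`, where `backend.put` stands in for whatever write call the backend offers. The `isTransient` predicate is where the backend-specific knowledge lives, which is why the helper takes it as a parameter instead of guessing.
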
## Reliability strategies to consider

If durability is a concern, there is more than one lever, and adapter-level
retry is rarely the most important one:

- **Network redundancy via peers.** This is usually the strongest lever. The
  repo instances you connect to each have their own storage adapter, so a
  document synced to peers is already durable in more than one place. A local
  storage failure does not lose data that a connected peer holds; once storage
  recovers, normal sync re-persists it. Designing for connectivity to a
  well-provisioned peer (for example a sync server backed by reliable storage)
  buys more real durability than hardening any single adapter.
- **Adapter-level retry and backoff.** Useful for transient backend failures,
  with the caveats above. Evaluate it per backend; do not assume it is
  sufficient on its own.
- **Other strategies.** Depending on requirements, writing through to more than
  one backend, putting a durable queue in front of a flaky store, or periodic
  reconciliation may fit better than retry alone. Treat the options above as a
  starting point, not an exhaustive list.

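To make the "writing through to more than one backend" option concrete, here is a minimal sketch. It assumes a simplified save signature (a string-array key and a byte array) and a policy of "succeed if at least one backend persisted the data"; `SaveFn` and `saveToAll` are hypothetical names, and a real deployment might well want a stricter policy.

```ts
// Hypothetical sketch; the save signature is a simplification for illustration.
type SaveFn = (key: string[], data: Uint8Array) => Promise<void>

// Writes to every backend in turn and returns how many writes succeeded.
// Throws only when no backend persisted the data.
async function saveToAll(
  backends: SaveFn[],
  key: string[],
  data: Uint8Array
): Promise<number> {
  let succeeded = 0
  const failures: unknown[] = []
  for (const save of backends) {
    try {
      await save(key, data)
      succeeded++
    } catch (err) {
      // Keep going: one failing backend should not block the others.
      failures.push(err)
    }
  }
  if (succeeded === 0) {
    throw new Error(`all ${backends.length} backends failed (${failures.length} errors)`)
  }
  return succeeded
}
```

The writes run sequentially to keep the example short; running them concurrently is an equally valid choice when the backends are independent.
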
## Observability and alerting

Persistent failures need to be visible; otherwise a server can look healthy
while silently failing to persist. `automerge-repo` surfaces these through its
logger: a failed save is reported via `logger.error(...)` under the relevant
subsystem namespace (for example `automerge-repo:storage-source`).

The logger is pluggable. By default `debug` output is routed through the
[`debug`](https://www.npmjs.com/package/debug) package (filter with
`DEBUG=automerge-repo:*`), and `info` / `warn` / `error` go to `console`. The
`Logger` interface is shaped to match `console`, [pino], and [winston], and
[`setLoggerFactory`](../src/Logger.ts) routes all automerge-repo output through
your own logger when called once at startup:

```ts
import { setLoggerFactory } from "@automerge/automerge-repo"
import winston from "winston"

const logger = winston.createLogger({ /* ... */ })

setLoggerFactory(namespace => ({
  debug: (msg, ...args) => logger.debug(msg, { namespace, args }),
  info: (msg, ...args) => logger.info(msg, { namespace, args }),
  warn: (msg, ...args) => logger.warn(msg, { namespace, args }),
  error: (msg, ...args) => logger.error(msg, { namespace, args }),
}))
```

A reasonable production setup ships these logs to a backend that supports
alerting (for example by exporting them through OpenTelemetry) and alerts on
persistent storage errors. Configuring the logger and wiring an
observability and alerting layer is the responsibility of the application
embedding `automerge-repo`. The library's job is to emit the events at a
sensible level and namespace; routing and alerting are deployment concerns.

## Why there is no first-class storage error event

We intentionally do not expose a typed `storage-error` event or signal. The
logging path already exists and is configurable as above, escalation policy
belongs in the adapter, and redundancy comes from the network. A separate
in-process error signal would duplicate the logger and would invite
backend-specific recovery policy into a layer that should stay
backend-agnostic. A consumer that wants programmatic handling can supply a
custom `LoggerFactory` that inspects the namespace and level.

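A factory that inspects the namespace and level might look like the sketch below. The `Logger` and `LoggerFactory` types are declared locally and only mirror the shape used in the winston example; treating `msg` as a string is an assumption, and `withStorageErrorHook` and `onStorageError` are hypothetical names.

```ts
// Local stand-ins that mirror the logger shape shown earlier in this document.
// Assumption: msg is a string. Check src/Logger.ts for the real types.
type LogFn = (msg: string, ...args: unknown[]) => void
interface Logger {
  debug: LogFn
  info: LogFn
  warn: LogFn
  error: LogFn
}
type LoggerFactory = (namespace: string) => Logger

// Wraps an existing factory so that error-level output from the storage
// namespace also triggers a callback. All other output passes through as is.
function withStorageErrorHook(
  inner: LoggerFactory,
  onStorageError: (msg: string, args: unknown[]) => void
): LoggerFactory {
  return namespace => {
    const logger = inner(namespace)
    if (namespace !== "automerge-repo:storage-source") return logger
    return {
      debug: (msg, ...args) => logger.debug(msg, ...args),
      info: (msg, ...args) => logger.info(msg, ...args),
      warn: (msg, ...args) => logger.warn(msg, ...args),
      error: (msg, ...args) => {
        logger.error(msg, ...args) // keep the normal log line
        onStorageError(msg, args) // then run the programmatic handler
      },
    }
  }
}
```

The wrapped factory would be passed to `setLoggerFactory` at startup in place of the plain one. The callback is a natural place to increment a failure counter or flip a health-check flag, which keeps that policy in the embedding application.
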
[pino]: https://github.com/pinojs/pino
[winston]: https://github.com/winstonjs/winston