| 1 | +# Storage error handling |
| 2 | + |
| 3 | +Guidance on what happens when storage I/O fails in `automerge-repo`, and where |
| 4 | +the responsibility for reliability lies. The behavior described here is |
| 5 | +implemented in [`src/StorageSource.ts`](../src/StorageSource.ts); the adapter |
| 6 | +contract applies to implementations of |
| 7 | +[`src/storage/StorageAdapterInterface.ts`](../src/storage/StorageAdapterInterface.ts). |
| 8 | + |
| 9 | +## What `StorageSource` guarantees |
| 10 | + |
| 11 | +When a throttled save rejects (a disk/IO error, quota exceeded, an aborted |
| 12 | +IndexedDB transaction, a failed remote write, and so on), `StorageSource` |
| 13 | +catches the error and logs it rather than letting it escape. Two properties |
| 14 | +follow: |
| 15 | + |
| 16 | +- **The process stays up.** The save runs fire-and-forget from a |
| 17 | + `"heads-changed"` listener, so an unhandled rejection would terminate the |
| 18 | + process under Node's default settings (Node 15 and later). Catching it keeps the repo alive. |
| 19 | +- **The change is retained in memory.** Nothing is dropped. A later save or a |
| 20 | + reload re-persists the current document: `lastSavedHeads` advances only |
| 21 | + on a successful write, so the next successful save re-includes |
| 22 | + whatever a failed one did not persist. |
| 23 | + |
| 24 | +This is deliberately the only thing `StorageSource` does about a failed write. |
| 25 | +It converts a fatal crash into a recoverable condition, which buys time for the |
| 26 | +recovery strategies below. It does not retry, back off, or escalate. That is not |
| 27 | +its job (see the contract below). |
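
The catch-and-log behavior and the "advance only on success" rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `StorageSource.ts`: `createSaver`, `write`, and `logError` are hypothetical names, and heads are modeled as plain string arrays.

```ts
// Hypothetical stand-in for the save path; names and shapes are assumptions.
type Heads = readonly string[]

function createSaver(
  write: (heads: Heads) => Promise<void>,
  logError: (msg: string, err: unknown) => void
) {
  let lastSavedHeads: Heads = []

  // Intended to be called fire-and-forget from a change listener,
  // so it must never reject.
  async function save(currentHeads: Heads): Promise<void> {
    try {
      await write(currentHeads)
      // Advance only after the write has succeeded.
      lastSavedHeads = currentHeads
    } catch (err) {
      // Log and swallow. The document is still in memory, and the next
      // successful save covers everything since lastSavedHeads.
      logError("save failed", err)
    }
  }

  return { save, lastSavedHeads: () => lastSavedHeads }
}
```

Because `lastSavedHeads` is untouched on failure, no separate bookkeeping is needed to remember what a failed save missed.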
| 28 | + |
| 29 | +## The storage adapter contract |
| 30 | + |
| 31 | +A `StorageAdapter` should be designed to tolerate failures of its own |
| 32 | +backend. Most failure handling belongs in the adapter rather than in |
| 33 | +`StorageSource`, because only the adapter knows its backend. |
| 34 | + |
| 35 | +- **Recoverability is the adapter's call.** Many storage failures are rare and |
| 36 | + transient: a momentary lock, a quota blip, a 503 from a remote store. The |
| 37 | + adapter is the layer that can distinguish a transient failure from a permanent |
| 38 | + one. |
| 39 | +- **Retry and backoff are a consideration, not a mandate.** Exponential backoff |
| 40 | + is one reasonable strategy for some backends, but it is not always the right |
| 41 | + answer and it is not the only one. Whether and how to retry depends on the |
| 42 | + backend's semantics, which is precisely why the policy lives in the adapter |
| 43 | + and not in a generic layer that cannot know those semantics. |
| 44 | +- **Escalation is the adapter's responsibility** when a failure is genuinely |
| 45 | + unrecoverable. How to escalate (surface to the host, fail a health check, and |
| 46 | + so on) is backend- and deployment-specific. |
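
As one illustration of keeping this policy inside the adapter, the sketch below wraps a save method with a transient-failure check and exponential backoff. It is a sketch under assumptions, not a recommended default: `SaveOnlyAdapter` is a locally declared, hypothetical subset of the adapter surface, and `withRetry`, `RetryOptions`, and `isTransient` are illustrative names. The classification of what counts as transient is left to the caller because it is backend-specific.

```ts
// Hypothetical subset of an adapter, declared locally so the sketch is
// self-contained. The real interface has more methods.
interface SaveOnlyAdapter {
  save(key: string[], data: Uint8Array): Promise<void>
}

interface RetryOptions {
  maxAttempts: number
  baseDelayMs: number
  // Backend-specific: decides whether an error is worth retrying.
  isTransient: (err: unknown) => boolean
  // Injectable for tests; defaults to a timer-based sleep.
  sleep?: (ms: number) => Promise<void>
}

const defaultSleep = (ms: number) =>
  new Promise<void>(resolve => setTimeout(resolve, ms))

function withRetry(inner: SaveOnlyAdapter, opts: RetryOptions): SaveOnlyAdapter {
  const sleep = opts.sleep ?? defaultSleep
  return {
    async save(key, data) {
      for (let attempt = 1; ; attempt++) {
        try {
          await inner.save(key, data)
          return
        } catch (err) {
          // Permanent failure or attempts exhausted: rethrow so the
          // failure stays visible to the caller.
          if (!opts.isTransient(err) || attempt >= opts.maxAttempts) throw err
          // Exponential backoff: base, 2 * base, 4 * base, ...
          await sleep(opts.baseDelayMs * 2 ** (attempt - 1))
        }
      }
    },
  }
}
```

A backend with different semantics might prefer a fixed delay, no retry at all, or escalation on the first failure; the wrapper shape stays the same while the policy changes.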
| 47 | + |
| 48 | +## Reliability strategies to consider |
| 49 | + |
| 50 | +If durability is a concern, there is more than one lever, and adapter-level |
| 51 | +retry is rarely the most important one: |
| 52 | + |
| 53 | +- **Network redundancy via peers.** This is usually the strongest lever. The |
| 54 | + repo instances you connect to each have their own storage adapter, so a |
| 55 | + document synced to peers is already durable in more than one place. A local |
| 56 | + storage failure does not lose data that a connected peer holds; once storage |
| 57 | + recovers, normal sync re-persists it. Designing for connectivity to a |
| 58 | + well-provisioned peer (for example a sync server backed by reliable storage) |
| 59 | + buys more real durability than hardening any single adapter. |
| 60 | +- **Adapter-level retry and backoff.** Useful for transient backend failures, |
| 61 | + with the caveats above. Evaluate it per backend; do not assume it is |
| 62 | + sufficient on its own. |
| 63 | +- **Other strategies.** Depending on requirements, writing through to more than |
| 64 | + one backend, putting a durable queue in front of a flaky store, or periodic |
| 65 | + reconciliation may fit better than retry alone. Treat the options above as a |
| 66 | + starting point, not an exhaustive list. |
| 67 | + |
| 68 | +## Observability and alerting |
| 69 | + |
| 70 | +Persistent failures need to be visible; otherwise a server can look healthy |
| 71 | +while silently failing to persist. `automerge-repo` surfaces these through its |
| 72 | +logger: a failed save is reported via `logger.error(...)` under the relevant |
| 73 | +subsystem namespace (for example `automerge-repo:storage-source`). |
| 74 | + |
| 75 | +The logger is pluggable. By default `debug` output is routed through the |
| 76 | +[`debug`](https://www.npmjs.com/package/debug) package (filter with |
| 77 | +`DEBUG=automerge-repo:*`), and `info` / `warn` / `error` go to `console`. The |
| 78 | +`Logger` interface is shaped to match `console`, [pino], and [winston], and |
| 79 | +[`setLoggerFactory`](../src/Logger.ts) routes all automerge-repo output through |
| 80 | +your own logger when called once at startup: |
| 81 | + |
| 82 | +```ts |
| 83 | +import { setLoggerFactory } from "@automerge/automerge-repo" |
| 84 | +import winston from "winston" |
| 85 | + |
| 86 | +const logger = winston.createLogger({ /* ... */ }) |
| 87 | + |
| 88 | +setLoggerFactory(namespace => ({ |
| 89 | + debug: (msg, ...args) => logger.debug(msg, { namespace, args }), |
| 90 | + info: (msg, ...args) => logger.info(msg, { namespace, args }), |
| 91 | + warn: (msg, ...args) => logger.warn(msg, { namespace, args }), |
| 92 | + error: (msg, ...args) => logger.error(msg, { namespace, args }), |
| 93 | +})) |
| 94 | +``` |
| 95 | + |
| 96 | +A reasonable production setup ships these logs to a backend that supports |
| 97 | +alerting (for example by exporting them through OpenTelemetry) and alerts on |
| 98 | +persistent storage errors. Configuring the logger and wiring an |
| 99 | +observability and alerting layer is the responsibility of the application |
| 100 | +embedding `automerge-repo`. The library's job is to emit the events at a |
| 101 | +sensible level and namespace; routing and alerting are deployment concerns. |
| 102 | + |
| 103 | +## Why there is no first-class storage error event |
| 104 | + |
| 105 | +We intentionally do not expose a typed `storage-error` event or signal. The |
| 106 | +logging path already exists and is configurable as above, escalation policy |
| 107 | +belongs in the adapter, and redundancy comes from the network. A separate |
| 108 | +in-process error signal would duplicate the logger and would invite |
| 109 | +backend-specific recovery policy into a layer that should stay |
| 110 | +backend-agnostic. A consumer that wants programmatic handling can supply a |
| 111 | +custom `LoggerFactory` that inspects the namespace and level. |
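
A minimal sketch of such a factory follows. The `Logger` and `LoggerFactory` types are declared locally and are assumptions modeled on the `(msg, ...args)` shape used in the example earlier in this document; `withStorageErrorHook` and the `automerge-repo:storage` namespace prefix are hypothetical choices for illustration.

```ts
// Locally declared, assumed shapes; check them against the library's
// exported types before relying on this.
type LogFn = (msg: string, ...args: unknown[]) => void
interface Logger {
  debug: LogFn
  info: LogFn
  warn: LogFn
  error: LogFn
}
type LoggerFactory = (namespace: string) => Logger

// Wraps an existing factory and calls a hook for error-level messages
// from storage namespaces. All other output passes through unchanged.
function withStorageErrorHook(
  base: LoggerFactory,
  onStorageError: (namespace: string, msg: string, args: unknown[]) => void
): LoggerFactory {
  return namespace => {
    const logger = base(namespace)
    // Assumed prefix; adjust to the namespaces actually emitted.
    if (!namespace.startsWith("automerge-repo:storage")) return logger
    return {
      debug: (msg, ...args) => logger.debug(msg, ...args),
      info: (msg, ...args) => logger.info(msg, ...args),
      warn: (msg, ...args) => logger.warn(msg, ...args),
      error: (msg, ...args) => {
        logger.error(msg, ...args)
        onStorageError(namespace, msg, args)
      },
    }
  }
}
```

The resulting factory would be passed to `setLoggerFactory` once at startup; the hook could increment a counter, flip a health check, or notify the host application.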
| 112 | + |
| 113 | +[pino]: https://github.com/pinojs/pino |
| 114 | +[winston]: https://github.com/winstonjs/winston |