Background
In distributed systems, producers (clients) may retry writes due to network failures, timeouts, or leader changes. Without idempotent guarantees, these retries can result in duplicate log entries, breaking correctness for downstream consumers and higher-level systems.
Mainstream log systems such as Kafka and Pulsar provide idempotent write semantics to ensure that retries do not introduce duplicates.
Currently, Woodpecker does not provide built-in idempotent write support, which means duplicate data may be persisted when clients retry append operations.
Problem Statement
Under the current design:
- Clients may retry append requests when facing transient failures.
- The storage layer cannot distinguish between:
  - a new write, and
  - a replayed write caused by a retry.
- As a result, the same logical record may be appended multiple times to a segment.
This behavior makes it difficult to:
- Build exactly-once or effectively-once semantics on top of Woodpecker.
- Use Woodpecker as a reliable WAL for message queues, stream processing engines, or stateful systems.
Proposed Feature
Introduce idempotent write support in Woodpecker, similar to Kafka / Pulsar.
At a high level, the system should guarantee:
Multiple retries of the same logical write will result in at most one successful append within a bounded idempotency window.
Design Considerations (High-Level)
Some possible directions (non-binding):
Idempotency Identifier
Instead of relying on a dense, monotonically increasing sequence number, idempotent writes are identified by a flexible idempotency identifier (idempotencyId).
The identifier can be:
- Explicitly provided by the client, representing business-level idempotency semantics.
- Implicitly derived by the system, defaulting to a hash (e.g. MD5) of the message payload when not provided.
This design allows clients to choose the most appropriate idempotency strategy for their workload, while still providing a safe default.
Default Behavior
- If the client does not provide an idempotency identifier:
  - The storage layer computes a hash of the message data (e.g. MD5) and uses it as the idempotencyId.
  - In this mode, idempotency assumes that retries submit identical message content, which is expected in most cases.
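The identifier resolution described above could be sketched as follows. This is only an illustration: `ResolveIdempotencyID` is a hypothetical helper name, not an existing Woodpecker API, and hex-encoded MD5 is just the example hash mentioned in this proposal.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
)

// ResolveIdempotencyID is a hypothetical helper (not part of Woodpecker).
// It returns the client-supplied identifier when one is given, and otherwise
// falls back to the hex-encoded MD5 digest of the payload.
func ResolveIdempotencyID(clientID string, payload []byte) string {
	if clientID != "" {
		return clientID
	}
	sum := md5.Sum(payload)
	return hex.EncodeToString(sum[:])
}

func main() {
	// Explicit, business-level identifier chosen by the client.
	fmt.Println(ResolveIdempotencyID("order-42", []byte("hello")))
	// Implicit identifier derived from the payload.
	fmt.Println(ResolveIdempotencyID("", []byte("hello")))
}
```

One consequence of the hash-based default is that two intentionally distinct writes with byte-identical payloads map to the same idempotencyId inside the window, so workloads that legitimately repeat payloads would likely want to supply explicit identifiers.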
Deduplication Semantics
- The storage layer maintains a bounded idempotency state per (logId, idempotencyId) within a configured window.
- On append:
  - If the idempotencyId has already been successfully committed within the window:
    - Treat the request as a duplicate and acknowledge success without re-appending data.
  - If the idempotencyId is not present:
    - Accept and persist the data, then record the idempotencyId.
- Idempotency is guaranteed only within the configured window, after which older identifiers may be evicted.
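A minimal in-memory sketch of these append semantics is shown below. It assumes a window bounded by entry count with oldest-first eviction; a time-based window would work just as well. `DedupWindow`, `Append`, and the `persist` callback are hypothetical names for illustration, and the sketch is not safe for concurrent use.

```go
package main

import (
	"container/list"
	"fmt"
)

// dedupKey scopes an idempotencyId to a single log.
type dedupKey struct {
	logID         int64
	idempotencyID string
}

// DedupWindow is a hypothetical bounded idempotency state.
// Assumption: the window is bounded by entry count and evicts oldest-first.
type DedupWindow struct {
	capacity int
	order    *list.List         // keys in commit order, oldest at the front
	index    map[dedupKey]int64 // key -> entry ID of the committed append
}

func NewDedupWindow(capacity int) *DedupWindow {
	return &DedupWindow{
		capacity: capacity,
		order:    list.New(),
		index:    make(map[dedupKey]int64),
	}
}

// Append persists the data via persist() unless the key was already committed
// inside the window. It returns the entry ID and whether the call was a duplicate.
func (w *DedupWindow) Append(logID int64, idempotencyID string, persist func() (int64, error)) (int64, bool, error) {
	key := dedupKey{logID: logID, idempotencyID: idempotencyID}
	if id, ok := w.index[key]; ok {
		// Duplicate: acknowledge success without re-appending.
		return id, true, nil
	}
	id, err := persist()
	if err != nil {
		// Nothing is recorded on failure, so a later retry can still succeed.
		return 0, false, err
	}
	w.index[key] = id
	w.order.PushBack(key)
	if w.order.Len() > w.capacity {
		// Evict the oldest identifier once the window is full.
		oldest := w.order.Remove(w.order.Front()).(dedupKey)
		delete(w.index, oldest)
	}
	return id, false, nil
}

func main() {
	var segment []string // stand-in for the real append-only storage
	appendTo := func(data string) func() (int64, error) {
		return func() (int64, error) {
			segment = append(segment, data)
			return int64(len(segment) - 1), nil
		}
	}

	w := NewDedupWindow(2)
	id1, dup1, _ := w.Append(1, "req-a", appendTo("payload"))
	id2, dup2, _ := w.Append(1, "req-a", appendTo("payload")) // client retry
	fmt.Println(id1, dup1, id2, dup2, len(segment))
}
```

Recording the identifier only after `persist` succeeds keeps a failed append retryable. Returning the original entry ID for a duplicate is one possible choice that lets the client see the same result as the first successful attempt.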
Crash & Recovery
Idempotency state must be:
- Persisted or derivable during recovery.
- Correctly reconstructed during fence & finalize and other recovery paths.
The design should also integrate naturally with existing segment / blk metadata and recovery logic.
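One way the "derivable during recovery" option could look is sketched below. It assumes each committed entry stores its idempotencyId alongside its metadata, so the state can be rebuilt by scanning the tail of the log. `EntryMeta` and `RebuildWindow` are hypothetical names, and the real metadata layout is left open by this proposal.

```go
package main

import "fmt"

// EntryMeta is a hypothetical per-entry metadata record.
// Assumption: the idempotencyId is persisted next to each committed entry.
type EntryMeta struct {
	EntryID       int64
	IdempotencyID string // empty for non-idempotent writes
}

// RebuildWindow reconstructs the idempotency state from the last `window`
// committed entries, which are assumed to be ordered oldest to newest.
func RebuildWindow(committed []EntryMeta, window int) map[string]int64 {
	if window < 0 {
		window = 0
	}
	start := len(committed) - window
	if start < 0 {
		start = 0
	}
	state := make(map[string]int64, len(committed)-start)
	for _, e := range committed[start:] {
		if e.IdempotencyID == "" {
			continue // written by a non-idempotent client
		}
		// Keep the first commit seen for an identifier inside the window.
		if _, seen := state[e.IdempotencyID]; !seen {
			state[e.IdempotencyID] = e.EntryID
		}
	}
	return state
}

func main() {
	committed := []EntryMeta{
		{EntryID: 0, IdempotencyID: "a"},
		{EntryID: 1, IdempotencyID: ""},
		{EntryID: 2, IdempotencyID: "b"},
		{EntryID: 3, IdempotencyID: "c"},
	}
	// Only the last two entries fall inside the window.
	fmt.Println(RebuildWindow(committed, 2))
}
```

Deriving the state from entry metadata avoids a separate persistence path for the window, at the cost of a bounded scan during recovery.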
Compatibility
- Idempotent writes are optional and opt-in.
- Non-idempotent clients continue to work exactly as before, without additional metadata or performance overhead.
- Existing write paths remain unchanged when idempotency is disabled.
Design Rationale
- Avoids forcing clients into dense, strictly ordered sequence numbers.
- Enables business-level idempotency semantics when needed.
- Provides a safe and intuitive default behavior.
- Aligns well with append-only storage and WAL-style systems.