Description
Currently, Woodpecker (in Service mode) relies on a simple "Ping-Pong" network check for liveness probes. However, this only confirms that the process is running and the network is reachable; it does not guarantee that the entire data path (log read/write) is functional.
We need a more robust Health Check API that validates whether logs can actually be written to and read from the underlying storage/buffer, covering all deployment modes.
Proposed Logic
The new health check should be based on real-time log activity rather than injecting dummy "heartbeat" topics.
- Log-Level Health
  - Active Logs: if a log has incoming write attempts:
    - A success within the last 10 minutes = Healthy.
    - Stalled/blocked for > 10 minutes = Unhealthy.
  - Idle Logs: if no data has been written to a specific log for a while, it should persist its last known health state (stay Healthy if it was Healthy).
- Global Service Health
  - Partial Success: if at least one log is successfully reading/writing, the global state is Healthy.
  - Global Stall: if multiple logs have pending writes but all have been stalled for > 10 minutes, the global state is Unhealthy.
  - Idle System (Cold Start/No Traffic): if there is no log activity across the entire system:
    - Fall back to a storage backend check (e.g., HeadBucket or an equivalent metadata check for Object Storage).
    - If the storage backend is reachable/writable, the global state is Healthy.