
enhance: refactor StagedFileWriter sync to eliminate per-writer goroutine #124

Open
tinswzy wants to merge 7 commits into dev from enhance_sync_policy

Conversation

@tinswzy
Collaborator

@tinswzy tinswzy commented Mar 27, 2026

issue: #103

Replace per-writer run() goroutine + ticker with time.AfterFunc + shared
conc.Pool, reducing goroutine count from O(N_logs) to O(NumCPU) for
service mode (multi-tenant). Only StagedFileWriter is changed; local and
minio backends are unaffected.

Core changes:

  • StagedFileWriter: remove run(), flushTaskChan, awaitAllFlushTasks;
    merge Sync() into full roll-buffer + processFlushTask cycle;
    WriteDataAsync uses CAS-guarded AfterFunc for periodic sync
  • LogStore: create shared syncPool (conc.Pool), pass through
    SegmentProcessor to each StagedFileWriter
  • gRPC server: add keepalive params, MaxConcurrentStreams, connection
    limit (LimitedListener)
  • Client: add idle connection cleanup (5min), adaptive tail-read
    backoff (200ms→5s), auditor skip for compacted segments

Benchmark results (service mode, 21K logs):

  • Goroutines: 20,005 → 17 (99.9% reduction)
  • Idle alloc: 734 MB/sec → 0 MB/sec (100% elimination)
  • Stack memory: 123 MB → 3.7 MB
  • 40K logs: TIMEOUT → PASS

@tinswzy tinswzy added this to the v0.1.26 milestone Mar 27, 2026
@tinswzy tinswzy added the enhancement (New feature or request) label Mar 27, 2026
@github-actions

Unit Test failed. Comment /rerun-unit-test to rerun, or /rerun to rerun all failed checks.

@github-actions

Lint failed. Comment /rerun-lint to rerun, or /rerun to rerun all failed checks.

@github-actions

E2E Service failed. Comment /rerun-e2e-service to rerun, or /rerun to rerun all failed checks.

  The exponential backoff for tail reading (Phase 3 optimization) did not
  reset currentPollInterval when a reader transitioned between segments on
  ErrFileReaderEndOfFile. Accumulated backoff from waiting for entries in
  the previous segment carried over, causing the reader to poll at 2-5s
  intervals when searching for the next segment. With a 10s per-read
  timeout, this left only 2-3 retry attempts, leading to context deadline
  exceeded in TestConcurrentWriteAndReadWithSegmentRollingFrequently.

  Also fix stale test assertions in TestNewConfiguration for the Phase 5
  cleanup config defaults (CleanupInterval 60s→30s, MaxIdleTime 300s→60s).

Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>
@github-actions

E2E Service failed. Comment /rerun-e2e-service to rerun, or /rerun to rerun all failed checks.

@codecov

codecov bot commented Mar 27, 2026

Codecov Report

❌ Patch coverage is 54.86726% with 102 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.32%. Comparing base (161656c) to head (cd77f78).

Files with missing lines                           Patch %   Lines
server/limited_listener.go                           0.00%   29 Missing ⚠️
server/storage/stagedstorage/writer_impl.go         64.10%   25 Missing and 3 partials ⚠️
woodpecker/client/logstore_client_pool_remote.go    48.57%   18 Missing ⚠️
woodpecker/log/internal_log_writer.go               33.33%   5 Missing and 1 partial ⚠️
woodpecker/log/log_writer.go                        33.33%   5 Missing and 1 partial ⚠️
server/service.go                                   73.68%   3 Missing and 2 partials ⚠️
server/logstore.go                                  75.00%   2 Missing and 1 partial ⚠️
woodpecker/log/log_reader.go                        66.66%   2 Missing and 1 partial ⚠️
common/config/configuration.go                      60.00%   2 Missing ⚠️
server/storage/stagedstorage/reader_impl.go         75.00%   2 Missing ⚠️

❌ Your patch check has failed because the patch coverage (54.86%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##              dev     #124      +/-   ##
==========================================
- Coverage   88.62%   88.32%   -0.30%     
==========================================
  Files          90       91       +1     
  Lines       18102    18139      +37     
==========================================
- Hits        16043    16022      -21     
- Misses       1557     1612      +55     
- Partials      502      505       +3     
Components   Coverage Δ
Server       86.56% <55.33%> (-0.36%) ⬇️
Client       92.87% <52.17%> (-0.58%) ⬇️
Meta         90.29% <ø> (ø)
Common       90.86% <71.42%> (-0.09%) ⬇️

Files with missing lines                       Coverage Δ
common/metrics/service_metrics.go              94.44% <100.00%> (+0.32%) ⬆️
server/processor/segment_processor.go          95.04% <100.00%> (+0.02%) ⬆️
woodpecker/segment/segment_handle.go           94.32% <100.00%> (+0.03%) ⬆️
woodpecker/woodpecker_client.go               100.00% <ø> (ø)
common/config/configuration.go                 95.94% <60.00%> (-0.56%) ⬇️
server/storage/stagedstorage/reader_impl.go    81.54% <75.00%> (-0.08%) ⬇️
server/logstore.go                             87.52% <75.00%> (-0.36%) ⬇️
woodpecker/log/log_reader.go                   89.19% <66.66%> (-0.41%) ⬇️
server/service.go                              82.77% <73.68%> (-0.37%) ⬇️
woodpecker/log/internal_log_writer.go          88.91% <33.33%> (-1.39%) ⬇️
... and 4 more

... and 1 file with indirect coverage changes


…al backoff). This is semantically correct — we're waiting for a one-time segment creation event, not polling for incremental data.

  Exponential backoff only applies to ErrEntryNotFound (waiting for new entries in an existing active segment), which is the idle-tail-read scenario.

Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>
@github-actions

E2E Service failed. Comment /rerun-e2e-service to rerun, or /rerun to rerun all failed checks.

… concurrent read/write tests consistent.

Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>
@github-actions

E2E Service failed. Comment /rerun-e2e-service to rerun, or /rerun to rerun all failed checks.

… to 30s (matching the existing readTimeoutCtx pattern at line 773 and the TestFinalVerification reader at line 2950)

On CI, the writer takes 3-20+ seconds to write 20 messages (each with 10ms sleep + gRPC overhead + sync latency). The reader
exhausts its per-read timeout waiting for the writer to produce the last entry.

Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>
@github-actions

E2E Service failed. Comment /rerun-e2e-service to rerun, or /rerun to rerun all failed checks.

The AfterFunc + shared pool refactor introduced a data race between
Finalize() (called from CompleteSegment gRPC handler) and
processFlushTask() (called from pool worker via AfterFunc-triggered
Sync). Both access blockIndexes, writtenBytes, and file — but Finalize
held mu while processFlushTask held flushMu, providing no mutual
exclusion.

Fix: Finalize() now acquires both mu and flushMu after Sync() completes.
The lock order (mu → flushMu) matches Sync()'s implicit order, avoiding
deadlock.

Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>
@github-actions

E2E Service failed. Comment /rerun-e2e-service to rerun, or /rerun to rerun all failed checks.

@github-actions

Chaos Test failed. Comment /rerun-chaos-test to rerun, or /rerun to rerun all failed checks.

…to prevent reader EOF misdetection (#103)

When a segment is finalized (footer written) while the server-side reader's
lastAddConfirmed is stale due to LAC coalescing (50ms window), the reader
incorrectly returns ErrFileReaderEndOfFile instead of retrying. This causes
the client reader to skip to the next segment, permanently losing the last
few entries of the completed segment.

Root cause: isFooterExistsUnsafe() detected the footer but did not extract
its LAC, so readDataBlocksUnsafe() treated the stale-LAC-filtered empty
result as EOF rather than a transient condition.

Fix:
- isFooterExistsUnsafe: update lastAddConfirmed from footer LAC when a
  footer is discovered, so subsequent reads see the correct boundary
- readDataBlocksUnsafe: return ErrEntryNotFound (retry) instead of EOF
  when the newly discovered footer LAC covers the requested entry
- Reduce DefaultNoDataReadMaxIntervalMs from 5s to 2s for faster tail
  read retry under high concurrency
- Increase concurrent read/write test per-read timeout from 30s to 60s

Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>
@tinswzy
Collaborator Author

tinswzy commented Mar 30, 2026

/rerun-all

1 similar comment
@tinswzy
Collaborator Author

tinswzy commented Apr 1, 2026

/rerun-all


Labels

ci-passed, enhancement (New feature or request)
