
enhance: refactor StagedFileWriter sync to eliminate per-writer goroutine #124

Open
tinswzy wants to merge 7 commits into dev from enhance_sync_policy

Conversation

@tinswzy
Collaborator

@tinswzy tinswzy commented Mar 27, 2026

issue: #103

Replace per-writer run() goroutine + ticker with time.AfterFunc + shared
conc.Pool, reducing goroutine count from O(N_logs) to O(NumCPU) for
service mode (multi-tenant). Only StagedFileWriter is changed; local and
minio backends are unaffected.

Core changes:

  • StagedFileWriter: remove run(), flushTaskChan, awaitAllFlushTasks;
    merge Sync() into full roll-buffer + processFlushTask cycle;
    WriteDataAsync uses CAS-guarded AfterFunc for periodic sync
  • LogStore: create shared syncPool (conc.Pool), pass through
    SegmentProcessor to each StagedFileWriter
  • gRPC server: add keepalive params, MaxConcurrentStreams, connection
    limit (LimitedListener)
  • Client: add idle connection cleanup (5min), adaptive tail-read
    backoff (200ms→5s), auditor skip for compacted segments

Benchmark results (service mode, 21K logs):

  • Goroutines: 20,005 → 17 (99.9% reduction)
  • Idle alloc: 734 MB/sec → 0 MB/sec (100% elimination)
  • Stack memory: 123 MB → 3.7 MB
  • 40K logs: TIMEOUT → PASS

@tinswzy tinswzy added this to the v0.1.26 milestone Mar 27, 2026
@tinswzy tinswzy added the enhancement (New feature or request) label Mar 27, 2026
@github-actions

Unit Test failed. Comment /rerun-unit-test to rerun, or /rerun to rerun all failed checks.

@github-actions

Lint failed. Comment /rerun-lint to rerun, or /rerun to rerun all failed checks.

@github-actions

E2E Service failed. Comment /rerun-e2e-service to rerun, or /rerun to rerun all failed checks.

  The exponential backoff for tail reading (Phase 3 optimization) did not
  reset currentPollInterval when a reader transitioned between segments on
  ErrFileReaderEndOfFile. Accumulated backoff from waiting for entries in
  the previous segment carried over, causing the reader to poll at 2-5s
  intervals when searching for the next segment. With a 10s per-read
  timeout, this left only 2-3 retry attempts, leading to context deadline
  exceeded in TestConcurrentWriteAndReadWithSegmentRollingFrequently.

  Also fix stale test assertions in TestNewConfiguration for the Phase 5
  cleanup config defaults (CleanupInterval 60s→30s, MaxIdleTime 300s→60s).

Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>
@github-actions

E2E Service failed. Comment /rerun-e2e-service to rerun, or /rerun to rerun all failed checks.

@codecov

codecov bot commented Mar 27, 2026

Codecov Report

❌ Patch coverage is 54.86726% with 102 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.32%. Comparing base (161656c) to head (cd77f78).

Files with missing lines                           Patch %   Lines
server/limited_listener.go                           0.00%   29 Missing ⚠️
server/storage/stagedstorage/writer_impl.go         64.10%   25 Missing and 3 partials ⚠️
woodpecker/client/logstore_client_pool_remote.go    48.57%   18 Missing ⚠️
woodpecker/log/internal_log_writer.go               33.33%   5 Missing and 1 partial ⚠️
woodpecker/log/log_writer.go                        33.33%   5 Missing and 1 partial ⚠️
server/service.go                                   73.68%   3 Missing and 2 partials ⚠️
server/logstore.go                                  75.00%   2 Missing and 1 partial ⚠️
woodpecker/log/log_reader.go                        66.66%   2 Missing and 1 partial ⚠️
common/config/configuration.go                      60.00%   2 Missing ⚠️
server/storage/stagedstorage/reader_impl.go         75.00%   2 Missing ⚠️

❌ Your patch check has failed because the patch coverage (54.86%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##              dev     #124      +/-   ##
==========================================
- Coverage   88.62%   88.32%   -0.30%     
==========================================
  Files          90       91       +1     
  Lines       18102    18139      +37     
==========================================
- Hits        16043    16022      -21     
- Misses       1557     1612      +55     
- Partials      502      505       +3     
Components   Coverage Δ
Server       86.56% <55.33%> (-0.36%) ⬇️
Client       92.87% <52.17%> (-0.58%) ⬇️
Meta         90.29% <ø> (ø)
Common       90.86% <71.42%> (-0.09%) ⬇️

Files with missing lines                       Coverage Δ
common/metrics/service_metrics.go              94.44% <100.00%> (+0.32%) ⬆️
server/processor/segment_processor.go          95.04% <100.00%> (+0.02%) ⬆️
woodpecker/segment/segment_handle.go           94.32% <100.00%> (+0.03%) ⬆️
woodpecker/woodpecker_client.go               100.00% <ø> (ø)
common/config/configuration.go                 95.94% <60.00%> (-0.56%) ⬇️
server/storage/stagedstorage/reader_impl.go    81.54% <75.00%> (-0.08%) ⬇️
server/logstore.go                             87.52% <75.00%> (-0.36%) ⬇️
woodpecker/log/log_reader.go                   89.19% <66.66%> (-0.41%) ⬇️
server/service.go                              82.77% <73.68%> (-0.37%) ⬇️
woodpecker/log/internal_log_writer.go          88.91% <33.33%> (-1.39%) ⬇️
... and 4 more

... and 1 file with indirect coverage changes


…al backoff). This is semantically correct — we're waiting for a one-time segment creation event, not polling for incremental data.

  Exponential backoff only applies to ErrEntryNotFound (waiting for new entries in an existing active segment), which is the idle-tail-read scenario.

Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>
@github-actions

E2E Service failed. Comment /rerun-e2e-service to rerun, or /rerun to rerun all failed checks.

… concurrent read/write tests consistent.

Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>
@github-actions

E2E Service failed. Comment /rerun-e2e-service to rerun, or /rerun to rerun all failed checks.

… to 30s (matching the existing readTimeoutCtx pattern at line 773 and the TestFinalVerification reader at line 2950)

On CI, the writer takes 3-20+ seconds to write 20 messages (each with 10ms sleep + gRPC overhead + sync latency). The reader
exhausts its per-read timeout waiting for the writer to produce the last entry.

Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>
@github-actions

E2E Service failed. Comment /rerun-e2e-service to rerun, or /rerun to rerun all failed checks.

The AfterFunc + shared pool refactor introduced a data race between
Finalize() (called from CompleteSegment gRPC handler) and
processFlushTask() (called from pool worker via AfterFunc-triggered
Sync). Both access blockIndexes, writtenBytes, and file — but Finalize
held mu while processFlushTask held flushMu, providing no mutual
exclusion.

Fix: Finalize() now acquires both mu and flushMu after Sync() completes.
The lock order (mu → flushMu) matches Sync()'s implicit order, avoiding
deadlock.

Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>
@github-actions

E2E Service failed. Comment /rerun-e2e-service to rerun, or /rerun to rerun all failed checks.

@github-actions

Chaos Test failed. Comment /rerun-chaos-test to rerun, or /rerun to rerun all failed checks.

…to prevent reader EOF misdetection (#103)

When a segment is finalized (footer written) while the server-side reader's
lastAddConfirmed is stale due to LAC coalescing (50ms window), the reader
incorrectly returns ErrFileReaderEndOfFile instead of retrying. This causes
the client reader to skip to the next segment, permanently losing the last
few entries of the completed segment.

Root cause: isFooterExistsUnsafe() detected the footer but did not extract
its LAC, so readDataBlocksUnsafe() treated the stale-LAC-filtered empty
result as EOF rather than a transient condition.

Fix:
- isFooterExistsUnsafe: update lastAddConfirmed from footer LAC when a
  footer is discovered, so subsequent reads see the correct boundary
- readDataBlocksUnsafe: return ErrEntryNotFound (retry) instead of EOF
  when the newly discovered footer LAC covers the requested entry
- Reduce DefaultNoDataReadMaxIntervalMs from 5s to 2s for faster tail
  read retry under high concurrency
- Increase concurrent read/write test per-read timeout from 30s to 60s

Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>
@tinswzy
Collaborator Author

tinswzy commented Mar 30, 2026

/rerun-all

1 similar comment
@tinswzy
Collaborator Author

tinswzy commented Apr 1, 2026

/rerun-all


Labels

ci-passed, enhancement (New feature or request)
