enhance: refactor StagedFileWriter sync to eliminate per-writer goroutine #124
enhance: refactor StagedFileWriter sync to eliminate per-writer goroutine (#103)

Replace the per-writer run() goroutine + ticker with time.AfterFunc + a shared conc.Pool, reducing goroutine count from O(N_logs) to O(NumCPU) in service mode (multi-tenant). Only StagedFileWriter is changed; the local and minio backends are unaffected.

Core changes:
- StagedFileWriter: remove run(), flushTaskChan, and awaitAllFlushTasks; merge Sync() into a full roll-buffer + processFlushTask cycle; WriteDataAsync uses a CAS-guarded AfterFunc for periodic sync
- LogStore: create a shared syncPool (conc.Pool), passed through SegmentProcessor to each StagedFileWriter
- gRPC server: add keepalive params, MaxConcurrentStreams, and a connection limit (LimitedListener)
- Client: add idle connection cleanup (5 min), adaptive tail-read backoff (200ms→5s), and an auditor skip for compacted segments

Benchmark results (service mode, 21K logs):
- Goroutines: 20,005 → 17 (99.9% reduction)
- Idle alloc: 734 MB/sec → 0 MB/sec (100% elimination)
- Stack memory: 123 MB → 3.7 MB
- 40K logs: TIMEOUT → PASS

Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>
❌ Unit Test failed.
❌ Lint failed.
❌ E2E Service failed.
The exponential backoff for tail reading (Phase 3 optimization) did not reset currentPollInterval when a reader transitioned between segments on ErrFileReaderEndOfFile. Accumulated backoff from waiting for entries in the previous segment carried over, causing the reader to poll at 2-5s intervals while searching for the next segment. With a 10s per-read timeout, this left only 2-3 retry attempts, leading to context deadline exceeded in TestConcurrentWriteAndReadWithSegmentRollingFrequently. Also fix stale test assertions in TestNewConfiguration for the Phase 5 cleanup config defaults (CleanupInterval 60s→30s, MaxIdleTime 300s→60s). Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>
❌ E2E Service failed.
Codecov Report

❌ Your patch check has failed because the patch coverage (54.86%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

@@ Coverage Diff @@
##              dev     #124      +/-   ##
==========================================
- Coverage   88.62%   88.32%   -0.30%
==========================================
  Files          90       91       +1
  Lines       18102    18139      +37
==========================================
- Hits        16043    16022      -21
- Misses       1557     1612      +55
- Partials      502      505       +3
…al backoff). This is semantically correct: we are waiting for a one-time segment-creation event, not polling for incremental data. Exponential backoff applies only to ErrEntryNotFound (waiting for new entries in an existing active segment), which is the idle-tail-read scenario. Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>
❌ E2E Service failed.
… concurrent read/write tests consistent. Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>
❌ E2E Service failed.
… to 30s (matching the existing readTimeoutCtx pattern at line 773 and the TestFinalVerification reader at line 2950). On CI, the writer takes 3-20+ seconds to write 20 messages (each with a 10ms sleep plus gRPC overhead and sync latency), so the reader exhausts its per-read timeout waiting for the writer to produce the last entry. Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>
❌ E2E Service failed.
The AfterFunc + shared pool refactor introduced a data race between Finalize() (called from the CompleteSegment gRPC handler) and processFlushTask() (called from a pool worker via AfterFunc-triggered Sync). Both access blockIndexes, writtenBytes, and file, but Finalize held mu while processFlushTask held flushMu, providing no mutual exclusion. Fix: Finalize() now acquires both mu and flushMu after Sync() completes. The lock order (mu → flushMu) matches Sync()'s implicit order to avoid deadlock. Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>
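The lock-ordering discipline can be sketched with two mutexes: every path that needs both takes mu first, then flushMu. The type and field names below are illustrative stand-ins for the real StagedFileWriter state:

```go
package main

import (
	"fmt"
	"sync"
)

// stagedWriter sketches the two-lock layout: mu guards writer metadata
// (blockIndexes, writtenBytes), flushMu serializes flush work.
type stagedWriter struct {
	mu      sync.Mutex // always acquired first
	flushMu sync.Mutex // always acquired second

	blockIndexes []int64
	writtenBytes int64
}

// processFlushTask runs on a pool worker via the AfterFunc-triggered sync.
// It takes the locks in the same mu -> flushMu order as Finalize below;
// a mismatched order between the two would be a classic ABBA deadlock.
func (w *stagedWriter) processFlushTask(n int64) {
	w.mu.Lock()
	w.flushMu.Lock()
	defer w.flushMu.Unlock()
	defer w.mu.Unlock()

	w.blockIndexes = append(w.blockIndexes, w.writtenBytes)
	w.writtenBytes += n
}

// Finalize must observe a quiesced flush state, so it holds both locks
// while reading the shared fields, excluding any in-flight flush.
func (w *stagedWriter) Finalize() (int64, int) {
	w.mu.Lock()
	w.flushMu.Lock()
	defer w.flushMu.Unlock()
	defer w.mu.Unlock()

	return w.writtenBytes, len(w.blockIndexes)
}

func main() {
	w := &stagedWriter{}
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ { // concurrent flushes racing with Finalize
		wg.Add(1)
		go func() { defer wg.Done(); w.processFlushTask(10) }()
	}
	wg.Wait()
	bytes, blocks := w.Finalize()
	fmt.Println(bytes, blocks) // 1000 100
}
```

Holding only one of the two locks in Finalize, as before the fix, would let a pool worker mutate blockIndexes and writtenBytes mid-footer-write, which is the race the commit describes.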
❌ E2E Service failed.
❌ Chaos Test failed.
…to prevent reader EOF misdetection (#103)

When a segment is finalized (footer written) while the server-side reader's lastAddConfirmed is stale due to LAC coalescing (50ms window), the reader incorrectly returns ErrFileReaderEndOfFile instead of retrying. This causes the client reader to skip to the next segment, permanently losing the last few entries of the completed segment.

Root cause: isFooterExistsUnsafe() detected the footer but did not extract its LAC, so readDataBlocksUnsafe() treated the stale-LAC-filtered empty result as EOF rather than a transient condition.

Fix:
- isFooterExistsUnsafe: update lastAddConfirmed from the footer LAC when a footer is discovered, so subsequent reads see the correct boundary
- readDataBlocksUnsafe: return ErrEntryNotFound (retry) instead of EOF when the newly discovered footer LAC covers the requested entry
- Reduce DefaultNoDataReadMaxIntervalMs from 5s to 2s for faster tail-read retry under high concurrency
- Increase the concurrent read/write test per-read timeout from 30s to 60s

Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>
/rerun-all