enhance: Refactor Log Sync Mechanism to Support 100k+ Tenants via Event-Driven Scheduling #103

@tinswzy

Description

1. Background / Problem Description
Currently, woodpecker assigns a dedicated Goroutine to each log stream on the server side. These Goroutines run in a loop to monitor and sync buffered data.
The Scalability Bottleneck:

  • Resource Exhaustion: At Go's minimum stack size of 2KB-4KB per Goroutine, idle stacks alone cost 200MB-400MB at 100k logs and 400MB-800MB at 200k, even before accounting for actual log buffers.
  • CPU Scheduler Pressure: The Go runtime scheduler must track 100k+ Goroutines. Even when most of them are parked, the wake-up and context-switching overhead of such a massive population severely degrades CPU efficiency.
  • Inefficiency: In many multi-tenant scenarios (e.g., IoT or Microservices), only a small fraction of logs are "active" at any given millisecond. The current "One-Goroutine-Per-Log" model wastes resources on inactive tenants.

2. Proposed Solution: Event-Driven & Lazy Activation
We propose refactoring the sync logic from a Static Polling model to an Event-Driven Task Pool model. This decouples the "Log Entity" from the "Execution Thread".
Key Components:

  • Lazy Activation: A log stream will NOT have an associated Goroutine by default. It remains a passive data structure until the first byte of data is ingested.
  • Global Scheduler (Timing Wheel):
    • Instead of 100k timers, we use a Single Timing Wheel to manage all expiration events.
    • When a log becomes "Active" (first data arrives), it registers a one-time timeout task (e.g., 5s) in the timing wheel.
  • Shared Worker Pool: A fixed-size pool of Worker Goroutines (e.g., $N = \text{NumCPU} \times 2$) handles the actual Sync() I/O operations.
  • Task Queue: When a log is ready to sync (either via a MaxDelay timeout or a BufferFull event), its LogID is pushed to a central SyncQueue.

The "Silent-to-Active" Workflow:
  1. Ingest: Data arrives $\rightarrow$ Update Buffer.
  2. Trigger: If isActive == false: set isActive = true and register with the Global Scheduler.
  3. Dispatch: Scheduler or Buffer-Threshold-Monitor pushes LogID to Worker Pool.
  4. Sync & Hibernate: Worker performs I/O $\rightarrow$ If buffer is empty, set isActive = false (the log goes back to sleep).

3. Implementation Plan

  • State Management: Implement an atomic state flag for each log to prevent duplicate queuing.
  • Timing Wheel Integration: Use a high-performance timing wheel (e.g., RussellLuo/timingwheel) for $O(1)$ timer management.
  • Worker Pool: Implement a bounded worker pool to prevent I/O bursts from overwhelming the underlying storage/file system.
  • Batching Optimization: Allow workers to drain multiple ready-to-sync logs in a single batch to improve I/O throughput.

4. Expected Results (Single-Node Success Metrics)

  • Goroutine Scalability: Reduce Goroutine count from $O(N_{logs})$ to $O(N_{workers})$. For 100k logs, the system should maintain < 1,000 Goroutines total.
  • Memory Efficiency: Memory usage should scale with Actual Data Volume rather than the Number of Tenants.
  • Idle Performance: A system with 100k inactive logs should consume near-zero CPU (only the cost of the timing wheel's tick).
  • Density: Enable a single woodpecker instance to comfortably handle 100k - 200k concurrent log streams on standard cloud VMs (e.g., 4C8G).

Labels: enhancement (New feature or request)