enhance: Refactor Log Sync Mechanism to Support 100k+ Tenants via Event-Driven Scheduling #103

@tinswzy

Description

1. Background / Problem Description
Currently, woodpecker assigns a dedicated Goroutine to each log stream on the server side. These Goroutines run in a loop to monitor and sync buffered data.
The Scalability Bottleneck:

  • Resource Exhaustion: At Go's minimum stack size of 2KB-4KB per Goroutine, idle stacks alone cost 200MB-400MB at 100k logs and 400MB-800MB at 200k, even before accounting for actual log buffers.
  • CPU Scheduler Pressure: The Go runtime scheduler must track 100k+ Goroutines. Even when most of them are parked, the wake-up and context-switching overhead of such a massive population severely degrades CPU efficiency.
  • Inefficiency: In many multi-tenant scenarios (e.g., IoT or Microservices), only a small fraction of logs are "active" at any given millisecond. The current "One-Goroutine-Per-Log" model wastes resources on inactive tenants.

2. Proposed Solution: Event-Driven & Lazy Activation
We propose refactoring the sync logic from a Static Polling model to an Event-Driven Task Pool model. This decouples the "Log Entity" from the "Execution Thread".
Key Components:

  • Lazy Activation: A log stream will NOT have an associated Goroutine by default. It remains a passive data structure until the first byte of data is ingested.
  • Global Scheduler (Timing Wheel):
    • Instead of 100k timers, we use a Single Timing Wheel to manage all expiration events.
    • When a log becomes "Active" (first data arrives), it registers a one-time timeout task (e.g., 5s) in the timing wheel.
  • Shared Worker Pool: A fixed-size pool of Worker Goroutines (e.g., $N = \text{NumCPU} \times 2$) handles the actual Sync() I/O operations.
  • Task Queue: When a log is ready to sync (either via a MaxDelay timeout or a BufferFull event), its LogID is pushed to a central SyncQueue.

The "Silent-to-Active" Workflow:
  1. Ingest: Data arrives $\rightarrow$ Update Buffer.
  2. Trigger: If isActive == false: set isActive = true and register with the Global Scheduler.
  3. Dispatch: Scheduler or Buffer-Threshold-Monitor pushes LogID to Worker Pool.
  4. Sync & Hibernate: Worker performs I/O $\rightarrow$ If buffer is empty, set isActive = false (the log goes back to sleep).

3. Implementation Plan

  • State Management: Implement an atomic state flag for each log to prevent duplicate queuing.
  • Timing Wheel Integration: Use a high-performance timing wheel (e.g., RussellLuo/timingwheel) for $O(1)$ timer management.
  • Worker Pool: Implement a bounded worker pool to prevent I/O bursts from overwhelming the underlying storage/file system.
  • Batching Optimization: Allow workers to drain multiple ready-to-sync logs in a single batch to improve I/O throughput.

4. Expected Results (Single-Node Success Metrics)

  • Goroutine Scalability: Reduce Goroutine count from $O(N_{logs})$ to $O(N_{workers})$. For 100k logs, the system should maintain < 1,000 Goroutines total.
  • Memory Efficiency: Memory usage should scale with Actual Data Volume rather than the Number of Tenants.
  • Idle Performance: A system with 100k inactive logs should consume near-zero CPU (only the cost of the timing wheel's tick).
  • Density: Enable a single woodpecker instance to comfortably handle 100k - 200k concurrent log streams on standard cloud VMs (e.g., 4C8G).

Labels: enhancement (New feature or request)