Skip to content

[Refactor] [history server] Enable stateless flushing by decoupling identity state from EventCollector #4688

@JiangJiaWei1103

Description

@JiangJiaWei1103

Goal

Make EventCollector identity-stateless, which means it no longer needs to track identity state internally (specifically the session name and node ID). Instead, it reads these values directly from each incoming event.

Both session_name and node_id are base fields present in every Ray event.

Motivation

  1. Single source of truth: session_name and node_id come directly from the raw event, which is always the most accurate source.
  2. Data-driven routing: Storage path is determined by the event itself, not by whatever the collector happens to have cached.
  3. Simpler, predictable flushing: Flush logic is based on time, buffer size, or shutdown; not on identity transitions that may be hard to anticipate or test.

Problem

Today, EventCollector stores currentSessionName and currentNodeID as internal state. Every time an event arrives, it checks whether the event's identity matches the stored state. If it does not, the collector flushes the old buffer and updates its internal state before writing.

This creates an unwanted coupling between two separate concerns:

  • WHERE to write: The storage path, derived from the event's identity
  • WHEN to write: The flush trigger, currently fired by identity changes (e.g., session or node ID change)

In other words, a storage routing decision is driving flush timing, and they should not be tied together.

# Current Design (identity-stateful)

Event ──▶ EventCollector ──▶ check currentSessionName / currentNodeID
              │                       │
              │  state mismatch?  ────▶ flush old buffer + update state
              │                       │
              └───────────────────────▶ write to path derived from internal state

Proposed Solution

  • Read session_name and node_id directly from each event to determine where to write (no internal state needed)
  • Flushing timing: Fire on a timer (periodic flushing), when the buffer is full (not supported yet), or on shutdown
# Proposed Design (identity-stateless)

Event ──▶ EventCollector ──▶ read event.session_name + event.node_id
              │                       │
              │                       ▼
              └───────────────────────▶ write to path derived from event fields

Flush triggers: periodic | buffer size | shutdown (decoupled from identity)

Compared to Spark

The Spark counterpart is EventLoggingListener and one EventLoggingListener instance is bound to an application attempt, which has a fixed log file path. That means the identity has been fixated since constructed, so there is no need to detect runtime identity transitions. If identity (application attempt) changes, a new instance is created, following the classic immutable identity pattern.

In KubeRay, identities can change within the lifecycle of one collector sidecar (i.e., one EventCollector processes events from multiple sessions and nodes), so we need to handle identity states differently.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions