[Refactor] [history server] Enable stateless flushing by decoupling identity state from EventCollector

## Goal

Make EventCollector **identity-stateless**, which means it no longer needs to track identity state internally (specifically the [session name](https://github.com/ray-project/ray/blob/dc6e3dc8deea0e3eadc393f75dcc9775f66f9cc5/src/ray/protobuf/public/events_base_event.proto#L102) and [node ID](https://github.com/ray-project/ray/blob/dc6e3dc8deea0e3eadc393f75dcc9775f66f9cc5/src/ray/protobuf/public/events_base_event.proto#L118)). Instead, it reads these values directly from each incoming event.

> Both `session_name` and `node_id` are base fields present in every Ray event.

## Motivation

1. **Single source of truth:** `session_name` and `node_id` come directly from the raw event, which is always the most accurate source.
2. **Data-driven routing:** Storage path is determined by the event itself, not by whatever the collector happens to have cached.
3. **Simpler, predictable flushing:** Flush logic is based on time, buffer size, or shutdown; not on identity transitions that may be hard to anticipate or test.

## Problem

Today, `EventCollector` stores `currentSessionName` and `currentNodeID` as internal state. Every time an event arrives, it checks whether the event's identity matches the stored state. If it does not, the collector flushes the old buffer and updates its internal state before writing.

This creates an unwanted coupling between two separate concerns:

- **WHERE to write:** The storage path, derived from the event's identity
- **WHEN to write:** The flush trigger, currently fired by identity changes (e.g., session or node ID change)

In other words, a storage routing decision is driving flush timing, and they should not be tied together.

```text
# Current Design (identity-stateful)

Event ──▶ EventCollector ──▶ check currentSessionName / currentNodeID
              │                       │
              │  state mismatch?  ────▶ flush old buffer + update state
              │                       │
              └───────────────────────▶ write to path derived from internal state
```

## Proposed Solution

- Read `session_name` and `node_id` directly from each event to determine where to write (no internal state needed)
- Flushing timing: Fire on a timer (periodic flushing), when the buffer is full (not supported yet), or on shutdown

```text
# Proposed Design (identity-stateless)

Event ──▶ EventCollector ──▶ read event.session_name + event.node_id
              │                       │
              │                       ▼
              └───────────────────────▶ write to path derived from event fields

Flush triggers: periodic | buffer size | shutdown (decoupled from identity)
```

## Compared to Spark

The Spark counterpart is `EventLoggingListener` and one `EventLoggingListener` instance is bound to an application attempt, which has a **fixed** log file path. That means the identity has been fixated since constructed, so there is no need to detect runtime identity transitions. If identity (application attempt) changes, a new instance is created, following the classic **immutable identity pattern**.

In KubeRay, identities can change within the lifecycle of one collector sidecar (i.e., one `EventCollector` processes events from multiple sessions and nodes), so we need to handle identity states differently.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Refactor] [history server] Enable stateless flushing by decoupling identity state from EventCollector #4688

Goal

Motivation

Problem

Proposed Solution

Compared to Spark

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[Refactor] [history server] Enable stateless flushing by decoupling identity state from EventCollector #4688

Description

Goal

Motivation

Problem

Proposed Solution

Compared to Spark

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions