[iris] adopted task attempts permanently drop container logs (log_pusher=None) #5261

@ravwojdyla-agent

Describe the bug

After restart-worker on a worker hosting a running task, the new worker
process adopts the still-running container but never wires up its log
forwarding. iris job logs and iris rpc controller get-task-logs return
zero entries for that task for the rest of its lifetime, while the task
itself runs normally and produces logs (visible inside the container and via
wandb's output.log).

The worker's own /system/worker/... logs flow into finelog because
RemoteLogHandler is wired up at registration time (after LogPusher
construction). Newly-submitted tasks work because Worker.submit_task wires
the pusher then. Only adopted attempts are affected.

To Reproduce

  1. Submit any task; wait for TASK_STATE_RUNNING.
  2. Restart the worker hosting it: iris rpc controller restart-worker --worker-id <WORKER_ID>.
  3. The new worker adopts the container.
  4. Run iris job logs <TASK_ID>: it returns zero entries.
  5. finelog Parquet (gs://marin-us-central2/iris/<cluster>/state/logs/) has nothing for that task either.

Expected behavior

Adopted task attempts should stream container logs through LogPusher and
finelog like normally-submitted attempts do.

Additional context

Root cause in lib/iris/src/iris/cluster/worker/worker.py:

  • Worker.start() calls adopt_running_containers() at line 231 BEFORE constructing self._log_pusher at line 266.
  • The adopt path passes self._log_pusher (still None, initialised at line 201) into TaskAttempt.adopt(...) at lines 320–326.
  • The adopted TaskAttempt keeps that None permanently.
  • TaskAttempt._push_logs silently returns when self._log_pusher is falsy (lib/iris/src/iris/cluster/worker/task_attempt.py:908).
  • DockerLogReader (lib/iris/src/iris/cluster/runtime/docker.py:254-279) keeps reading docker logs --timestamps --since ... and every batch gets dropped on the floor.

Worker.submit_task wires the pusher correctly at lib/iris/src/iris/cluster/worker/worker.py:710, so tasks submitted after start() finishes work normally.
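
The ordering problem described above can be reproduced in miniature. The following is a simplified model, not the real iris code: the class and method names mirror those in this issue, but the bodies, the `push` method, and the `adopted` list are hypothetical stand-ins chosen only to show how an attempt that captures None at adoption time never recovers.

```python
# Simplified model of the ordering bug; NOT the real iris implementation.
from typing import Optional


class LogPusher:
    """Hypothetical stand-in that records batches instead of shipping them."""

    def __init__(self) -> None:
        self.pushed: list[str] = []

    def push(self, batch: list[str]) -> None:
        self.pushed.extend(batch)


class TaskAttempt:
    def __init__(self, log_pusher: Optional[LogPusher]) -> None:
        # The reference is captured once, at construction time.
        self._log_pusher = log_pusher

    def _push_logs(self, batch: list[str]) -> None:
        if not self._log_pusher:
            return  # silent drop, as described in the root cause
        self._log_pusher.push(batch)


class Worker:
    def __init__(self) -> None:
        self._log_pusher: Optional[LogPusher] = None
        self.adopted: list[TaskAttempt] = []

    def start(self) -> None:
        # Adoption runs first, so the adopted attempt captures None ...
        self.adopted.append(TaskAttempt(self._log_pusher))
        # ... and the pusher is only constructed afterwards.
        self._log_pusher = LogPusher()

    def submit_task(self) -> TaskAttempt:
        # By now the pusher exists, so new attempts are wired correctly.
        return TaskAttempt(self._log_pusher)


worker = Worker()
worker.start()
worker.adopted[0]._push_logs(["line from adopted container"])
worker.submit_task()._push_logs(["line from new task"])
print(worker._log_pusher.pushed)  # ['line from new task']
```

Only the batch from the task submitted after start() reaches the pusher; the adopted attempt's batch is dropped without any error.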

Suggested fix: build self._log_pusher before calling adopt_running_containers(), or pass a lazily-resolved holder/callable so adopted attempts pick up the pusher when it materialises.
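
As a rough sketch of the second option (a lazily-resolved callable), assuming the same simplified stand-in classes rather than the real iris API: the attempt stores a zero-argument getter instead of the pusher itself and resolves it on every push. The `get_log_pusher` parameter name and the `push` method are hypothetical.

```python
# Sketch of the lazy-resolution fix; names and bodies are hypothetical.
from typing import Callable, Optional


class LogPusher:
    """Hypothetical stand-in that records batches instead of shipping them."""

    def __init__(self) -> None:
        self.pushed: list[str] = []

    def push(self, batch: list[str]) -> None:
        self.pushed.extend(batch)


class TaskAttempt:
    def __init__(self, get_log_pusher: Callable[[], Optional[LogPusher]]) -> None:
        # Store a getter, not the pusher, so a later assignment is visible.
        self._get_log_pusher = get_log_pusher

    def _push_logs(self, batch: list[str]) -> None:
        pusher = self._get_log_pusher()  # resolved at push time
        if pusher is None:
            return  # still dropped if the pusher does not exist yet
        pusher.push(batch)


class Worker:
    def __init__(self) -> None:
        self._log_pusher: Optional[LogPusher] = None
        self.adopted: list[TaskAttempt] = []

    def start(self) -> None:
        # Adoption still runs first, but captures a getter bound to self.
        self.adopted.append(TaskAttempt(lambda: self._log_pusher))
        self._log_pusher = LogPusher()


worker = Worker()
worker.start()
worker.adopted[0]._push_logs(["line from adopted container"])
print(worker._log_pusher.pushed)  # ['line from adopted container']
```

In this sketch, batches read between adoption and pusher construction would still be lost. Constructing the pusher before adoption (the first option) avoids that window entirely and is the smaller change if nothing in pusher construction depends on adoption having run.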

This is independent of in-process stdout redirection (e.g. wandb console="auto"): iris captures logs via the docker daemon, not the Python process's stdout fd.

Metadata

Labels

agent-generated, bug
