Describe the bug
After restart-worker on a worker hosting a running task, the new worker
process adopts the still-running container but never wires up its log
forwarding. iris job logs and iris rpc controller get-task-logs return
zero entries for that task for the rest of its lifetime, while the task
itself runs normally and produces logs (visible inside the container and via
wandb's output.log).
The worker's own /system/worker/... logs flow into finelog because
RemoteLogHandler is wired up at registration time (after LogPusher
construction). Newly-submitted tasks work because Worker.submit_task wires
the pusher then. Only adopted attempts are affected.
To Reproduce
- Submit any task; wait for TASK_STATE_RUNNING.
- Restart the worker hosting it: iris rpc controller restart-worker --worker-id <WORKER_ID>.
- The new worker adopts the container.
- iris job logs <TASK_ID> returns zero entries.
- finelog Parquet (gs://marin-us-central2/iris/<cluster>/state/logs/) has nothing for that task either.
Expected behavior
Adopted task attempts should stream container logs through LogPusher and
finelog like normally-submitted attempts do.
Additional context
Root cause in lib/iris/src/iris/cluster/worker/worker.py:
- Worker.start() calls adopt_running_containers() at line 231 BEFORE constructing self._log_pusher at line 266.
- The adopt path passes self._log_pusher (still None, initialised at line 201) into TaskAttempt.adopt(...) at lines 320–326.
- The adopted TaskAttempt keeps that None permanently.
- TaskAttempt._push_logs silently returns when self._log_pusher is falsy (lib/iris/src/iris/cluster/worker/task_attempt.py:908).
- DockerLogReader (lib/iris/src/iris/cluster/runtime/docker.py:254-279) keeps reading docker logs --timestamps --since ... and every batch gets dropped on the floor.
- Worker.submit_task wires the pusher correctly at lib/iris/src/iris/cluster/worker/worker.py:710, so tasks submitted after start() finishes work normally.
Suggested fix: build self._log_pusher before calling adopt_running_containers(), or pass a lazily-resolved holder/callable so adopted attempts pick up the pusher when it materialises.
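The ordering problem and the lazily-resolved variant of the fix can be sketched in isolation. This is a minimal illustration, not the iris code: the classes below (LogPusher, EagerAttempt, LazyAttempt, Worker) are simplified, hypothetical stand-ins, and their method names and signatures are assumptions made for the example.

```python
from __future__ import annotations

from typing import Callable, Optional


class LogPusher:
    """Hypothetical stand-in for the real LogPusher: keeps batches in memory."""

    def __init__(self) -> None:
        self.batches: list[list[str]] = []

    def push(self, lines: list[str]) -> None:
        self.batches.append(list(lines))


class EagerAttempt:
    """Captures the pusher value at adoption time (mirrors the reported bug)."""

    def __init__(self, log_pusher: Optional[LogPusher]) -> None:
        self._log_pusher = log_pusher

    def push_logs(self, lines: list[str]) -> bool:
        if not self._log_pusher:
            return False  # batch is silently dropped
        self._log_pusher.push(lines)
        return True


class LazyAttempt:
    """Resolves the pusher on every push through a callable (one possible fix)."""

    def __init__(self, get_log_pusher: Callable[[], Optional[LogPusher]]) -> None:
        self._get_log_pusher = get_log_pusher

    def push_logs(self, lines: list[str]) -> bool:
        pusher = self._get_log_pusher()
        if pusher is None:
            return False  # pusher not built yet
        pusher.push(lines)
        return True


class Worker:
    """Hypothetical worker that adopts attempts before building its pusher."""

    def __init__(self) -> None:
        self._log_pusher: Optional[LogPusher] = None
        self.eager: Optional[EagerAttempt] = None
        self.lazy: Optional[LazyAttempt] = None

    def start(self) -> None:
        # Adoption runs first, so the eager attempt receives None for good.
        self.eager = EagerAttempt(self._log_pusher)
        # The lazy attempt receives a callable that reads the attribute later.
        self.lazy = LazyAttempt(lambda: self._log_pusher)
        # The pusher is only constructed after adoption, as in the report.
        self._log_pusher = LogPusher()


worker = Worker()
worker.start()
eager_ok = worker.eager.push_logs(["step 1"])
lazy_ok = worker.lazy.push_logs(["step 1"])
print(eager_ok, lazy_ok)
```

The simpler alternative, constructing the pusher before adoption, would make the eager form work unchanged. The callable form only matters if the construction order cannot be changed.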
This is independent of in-process stdout redirection (e.g. wandb console="auto"): iris captures logs via the docker daemon, not the Python process's stdout fd.