Configure Celery for at-least-once delivery to prevent task loss on worker death #1497
Enable CELERY_TASK_ACKS_LATE and CELERY_TASK_REJECT_ON_WORKER_LOST globally so a worker dying mid-task — OOM, SIGKILL, host loss, deploy eviction — causes the broker to redeliver the message instead of silently dropping it. Previously long-running ingest/parse/embed tasks that died just before their final DB write left documents stuck with backend_lock=True and no parsed content. Document the at-least-once / idempotency contract in docs/architecture/asynchronous-processing.md (including the per-task opt-out for tasks that genuinely cannot be made idempotent), and pin the new settings + their Django→Celery namespace propagation with a regression test so a future settings refactor cannot silently revert the resilience guarantee.
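The Django→Celery namespace propagation mentioned above can be pictured with a small stand-alone model. This is only a sketch: it assumes the app loads settings with the usual `CELERY` namespace, where a `CELERY_`-prefixed Django setting surfaces as the lowercase Celery key. The `django_settings` dict, `to_celery_conf`, and `check_at_least_once` are hypothetical names used for illustration and are not Celery's loader or the PR's test code.

```python
# Simplified model of how CELERY_-prefixed Django settings become Celery
# config keys. Illustration only; this is not Celery's actual loader.

NAMESPACE = "CELERY"

# Hypothetical stand-in for the relevant part of config/settings/base.py.
django_settings = {
    "DEBUG": False,
    "CELERY_TASK_ACKS_LATE": True,
    "CELERY_TASK_REJECT_ON_WORKER_LOST": True,
}


def to_celery_conf(settings: dict, namespace: str = NAMESPACE) -> dict:
    """Keep only namespaced keys, strip the prefix, and lowercase the rest."""
    prefix = namespace + "_"
    return {
        key[len(prefix):].lower(): value
        for key, value in settings.items()
        if key.startswith(prefix)
    }


def check_at_least_once(conf: dict) -> None:
    """The kind of check a regression test could make against the app config."""
    assert conf.get("task_acks_late") is True
    assert conf.get("task_reject_on_worker_lost") is True


conf = to_celery_conf(django_settings)
check_at_least_once(conf)
```

A renamed or dropped setting (for example `CELERY_ACKS_LATE`) would no longer produce a `task_acks_late` key, which is the kind of silent revert the regression test is meant to catch.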
Code Review
Overview
This PR correctly addresses a real reliability gap: Celery's default at-most-once delivery means OOM-killed or SIGKILL'd workers silently lose tasks, leaving documents stuck with The documentation addition and regression tests are both good practice. Below are concerns that need addressing before this can be merged safely.
Critical: Redis visibility timeout is a hidden time bomb
The broker is Redis ( Several of the long-running ingest/parse/embed tasks this PR is trying to protect are exactly the ones most likely to exceed 1 hour on large documents. The PR should either:
Significant: Existing tasks are not idempotent, but now run at-least-once
The PR documentation states "all tasks MUST be idempotent" but makes no changes to existing tasks. A quick audit shows several tasks that will produce duplicate data on retry:
These should either be fixed (switching to
Minor: Test accessor pattern is unusual
# test imports the *module* then accesses .app on it
from config import celery_app
celery_app.app.conf.task_acks_late
This works but
from config.celery_app import app as celery_app_instance
celery_app_instance.conf.task_acks_late
Minor: Celery conf tests have a subtle scope caveat worth documenting
Positive callouts
Codecov Report
✅ All modified and coverable lines are covered by tests.
Test fix (pytest CI failure):
- test_celery_worker_death_resilience: switch from celery_app.app.conf to celery_app.conf. config/__init__.py exposes celery_app as the Celery app instance (not the module), so celery_app is a Celery object with no .app attribute, which caused all 4 tests to fail with AttributeError.

Address PR #1497 review feedback:
- Set CELERY_BROKER_TRANSPORT_OPTIONS = {visibility_timeout: 12h} so long-running document tasks aren't double-delivered by the Redis broker's default 1h timeout combined with task_acks_late=True.
- Add visibility-timeout regression test.
- Document Redis visibility-timeout caveat and list known non-idempotent tasks (corpus_tasks, agent_tasks, import_tasks) as follow-up hardening work in asynchronous-processing.md.
- Add docstring noting that celery_app.conf is import-time-cached and unaffected by @override_settings.
- Add inline comment on worker_uploads acks_late=True clarifying it is now redundant with the global default but kept as intent.
- Update CHANGELOG to mention visibility timeout and known-non-idempotent task tracking.
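The reasoning behind the 12h visibility timeout can be shown with a little arithmetic. With late acknowledgement a message stays unacknowledged for the whole run of the task, and the Redis transport redelivers any message that has been unacknowledged for longer than the visibility timeout. The sketch below is a simplified model of that rule; the 90-minute runtime and the helper name `redelivered_while_running` are assumptions made for illustration.

```python
# Simplified model of Redis visibility-timeout redelivery under acks_late.
# The runtime below is hypothetical; the timeouts come from the PR discussion.

REDIS_DEFAULT_VISIBILITY_TIMEOUT_S = 60 * 60     # Redis transport default: 1 hour
CONFIGURED_VISIBILITY_TIMEOUT_S = 12 * 60 * 60   # value set by this PR: 12 hours


def redelivered_while_running(task_runtime_s: int, visibility_timeout_s: int) -> bool:
    """True if a still-running, unacknowledged task would be handed to another worker."""
    return task_runtime_s > visibility_timeout_s


# Hypothetical 90-minute parse of a large document.
long_parse_s = 90 * 60

with_default = redelivered_while_running(long_parse_s, REDIS_DEFAULT_VISIBILITY_TIMEOUT_S)
with_configured = redelivered_while_running(long_parse_s, CONFIGURED_VISIBILITY_TIMEOUT_S)
```

Under the 1h default the 90-minute task would be delivered a second time while the first run is still in progress; under the 12h setting it would not.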
Code Review: PR #1497 — Celery at-least-once delivery
Overview
This is a well-scoped infrastructure PR: three Django settings, documentation of the idempotency contract, and regression tests that pin the configuration. No application logic is touched. The change addresses a real production hazard — long-running ingest/parse/embed tasks being silently lost on worker death — using Celery's built-in mechanisms rather than custom code. Overall this is good work; the comments below are mostly minor.
Issues
1. Magic number violates project conventions (CLAUDE.md)
CELERY_BROKER_TRANSPORT_OPTIONS = {"visibility_timeout": 12 * 60 * 60}
CLAUDE.md explicitly says "No magic numbers — use constants files in
# Maximum expected duration for any document-processing task (seconds).
# Used to set the Redis visibility timeout above the Celery default of 3600s.
CELERY_REDIS_VISIBILITY_TIMEOUT_SECONDS = 12 * 60 * 60
Then
from opencontractserver.constants.celery import CELERY_REDIS_VISIBILITY_TIMEOUT_SECONDS
CELERY_BROKER_TRANSPORT_OPTIONS = {"visibility_timeout": CELERY_REDIS_VISIBILITY_TIMEOUT_SECONDS}
2. Regression test has an asymmetry: visibility timeout isn't verified against
Summary
This PR configures Celery to use at-least-once delivery semantics instead of the default at-most-once, ensuring that tasks are not silently lost when workers die unexpectedly (OOM, SIGKILL, host failure, deploy eviction). Previously, long-running ingest/parse/embed tasks could fail mid-flight, leaving documents stuck with backend_lock=True and no parsed content.
Key Changes
- Enable at-least-once delivery in Celery configuration (config/settings/base.py):
  - CELERY_TASK_ACKS_LATE = True to defer message acknowledgment until task completion
  - CELERY_TASK_REJECT_ON_WORKER_LOST = True to requeue tasks on hard worker kills instead of treating them as successful
- Document task idempotency requirements (docs/architecture/asynchronous-processing.md):
- Add regression tests (opencontractserver/tests/test_celery_worker_death_resilience.py):
- Update changelog (CHANGELOG.md):
Implementation Details
The change leverages Celery's built-in configuration options rather than custom logic. The broker now only removes messages after successful task completion, and hard-killed workers trigger redelivery instead of silent task loss.
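For orientation, the settings discussed in this PR would look roughly like the fragment below once collected in one place. It is a sketch assembled from the values quoted in this thread, not a copy of config/settings/base.py; the surrounding file contents and ordering are assumed.

```python
# Sketch of the at-least-once settings as discussed in this PR.

# Acknowledge the message only after the task finishes, not when it starts.
CELERY_TASK_ACKS_LATE = True

# If the worker process is killed mid-task, requeue the message instead of
# acknowledging it.
CELERY_TASK_REJECT_ON_WORKER_LOST = True

# Keep the Redis visibility timeout above the longest expected task so a
# still-running task is not redelivered (12 hours, in seconds).
CELERY_BROKER_TRANSPORT_OPTIONS = {"visibility_timeout": 12 * 60 * 60}
```

A task that opts out of the global default would pass `acks_late=False` in its own task decorator, which is a standard Celery task option.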
Trade-off: At-least-once delivery means tasks may execute multiple times. All Celery tasks in the project must be idempotent — the documentation provides clear patterns for achieving this (deterministic database operations, idempotency keys, state re-checks, etc.). For the rare case where a task cannot be made idempotent, a per-task opt-out is available with explicit documentation of why.
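Two of the patterns named above, a state re-check and an idempotency key, can be sketched with the standard library alone. The example uses an in-memory SQLite database in place of the project's Django models; the table layout, the `parse_document` function, and the key format are hypothetical, and only the `backend_lock` flag comes from this PR.

```python
import sqlite3

# In-memory stand-in for the real database; schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE documents (
        id INTEGER PRIMARY KEY,
        backend_lock INTEGER NOT NULL,
        parsed_text TEXT
    );
    CREATE TABLE annotations (
        document_id INTEGER NOT NULL,
        idempotency_key TEXT NOT NULL UNIQUE,
        label TEXT NOT NULL
    );
    INSERT INTO documents (id, backend_lock, parsed_text) VALUES (1, 1, NULL);
    """
)


def parse_document(conn: sqlite3.Connection, document_id: int) -> bool:
    """Return True if work was done, False if an earlier delivery already finished."""
    # State re-check: skip if the document is missing or already unlocked.
    row = conn.execute(
        "SELECT backend_lock FROM documents WHERE id = ?", (document_id,)
    ).fetchone()
    if row is None or row[0] == 0:
        return False

    with conn:  # single transaction: commits on success, rolls back on error
        # Idempotency key: a repeated insert with the same key is ignored.
        conn.execute(
            "INSERT OR IGNORE INTO annotations (document_id, idempotency_key, label) "
            "VALUES (?, ?, ?)",
            (document_id, f"doc-{document_id}-title", "title"),
        )
        conn.execute(
            "UPDATE documents SET parsed_text = ?, backend_lock = 0 WHERE id = ?",
            ("parsed content", document_id),
        )
    return True


# Simulate the broker delivering the same message twice.
first = parse_document(conn, 1)
second = parse_document(conn, 1)
annotation_count = conn.execute("SELECT COUNT(*) FROM annotations").fetchone()[0]
```

The second delivery returns early because the lock is already released, and even if it had reached the insert, the unique key would prevent a duplicate row.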
The regression tests ensure the configuration cannot be accidentally reverted through future refactoring.
https://claude.ai/code/session_017LtMNFSpJukVcW9NPQaxxw