Back grounding get_or_create with DB-level uniqueness #1418
Closes #1407. Follow-up to #1397.
- Adds two partial UniqueConstraints on Annotation that exactly match the application-level get_or_create lookup keys in the extraction grounding pipeline so concurrent Celery retries can't bypass idempotency. Migration 0069 deduplicates any existing rows before adding the constraints.
- Moves creator_id into the get_or_create lookup so the new constraints back the lookup; without that the race-loser's IntegrityError wouldn't fire.
- Documents the PostgreSQL-specific json equality semantics on the span path and the rationale for the end-of-function dedup loop.
- Replaces the redundant assertIsNotNone + assert pair with a single assert in test_extraction_grounding.py (only the latter narrows for mypy / pyright).
Code Review — PR #1418: Back grounding
The original constraint condition (structural=False, annotation_type=...)
applied to ALL non-structural annotations, blocking legitimate flows
that produce duplicates under the key tuple — e.g.
add_annotations_from_exact_strings finding multiple occurrences of the
same word on a page, or hierarchical annotation trees with duplicate
raw_text at different parents (4 backend tests failing in CI:
test_annotation_tree::{test_mid_node_subtree_and_descendants,
test_stress_test} and test_llm_annotation_tools::
{test_add_annotations_pdf, test_async_pdf_and_text}).
Add a boolean Annotation.is_grounding_source (default False), set True
only inside _create_pdf_annotation / _create_span_annotation, and scope
both partial UniqueConstraints to is_grounding_source=True. The
migration backfills True for any pre-existing OC_EXTRACT_SOURCE rows so
historical grounding annotations are also covered, then dedupes and
adds the constraints.
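The scoping described above can be sketched outside Django. The following is a minimal illustration, not the project's migration: it uses SQLite's partial unique index through the standard-library sqlite3 module rather than a PostgreSQL UniqueConstraint, and the table layout, the index name demo_unique_grounding_key, and the reduced key columns are assumptions made for the example.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory table with a uniqueness rule scoped by a flag."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical, simplified stand-in for the Annotation table.
    conn.execute(
        """
        CREATE TABLE annotation (
            id INTEGER PRIMARY KEY,
            document_id INTEGER NOT NULL,
            page INTEGER NOT NULL,
            raw_text TEXT NOT NULL,
            creator_id INTEGER NOT NULL,
            is_grounding_source INTEGER NOT NULL DEFAULT 0
        )
        """
    )
    # Partial unique index: only rows flagged as grounding sources are checked.
    conn.execute(
        """
        CREATE UNIQUE INDEX demo_unique_grounding_key
        ON annotation (document_id, page, raw_text, creator_id)
        WHERE is_grounding_source = 1
        """
    )
    return conn


def insert_annotation(conn: sqlite3.Connection, *, grounding: bool) -> bool:
    """Insert the same key tuple every time; return False if the index rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO annotation "
                "(document_id, page, raw_text, creator_id, is_grounding_source) "
                "VALUES (1, 0, 'Acme', 7, ?)",
                (int(grounding),),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

In this sketch, repeated non-grounding inserts with the same key tuple succeed (the exact-string and annotation-tree flows), while a second grounding insert with that key is rejected.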
…ent trim
- Add test_db_constraint_blocks_concurrent_token_label_grounding_duplicates in TestGroundingPipelinePDFIntegration: directly exercises the IntegrityError + get_or_create fallback round-trip the constraint enables. Closes the test-coverage gap flagged in review.
- Extend the migration's dedup loop to repoint every realistic cross-reference (UserFeedback FK; ChatMessage / AssignmentTask / Relationship M2Ms) to the keeper before deleting redundant rows, so no cross-domain references are silently dropped. Document why Embedding (CASCADE) and Annotation.parent (CASCADE on leaf rows) are intentionally not repointed.
- Trim the assertIsNotNone-vs-assert explanation to a single trailing comment to match project convention.
- Pin migration dependencies to the latest migration of every app whose models the dedup helper touches (conversations, feedback, users) in addition to the existing extracts pin.
Code Review — PR #1418: Back-grounding
| Severity | Issue |
|---|---|
| Critical | _repoint_cross_references bulk-UPDATEs M2M through tables that may already contain keeper_id, causing IntegrityError during the exact race scenarios the migration targets |
| Medium | Migration imports live application constant (should freeze the string value) |
| Medium | No SPAN_LABEL constraint test to cover the json-equality path |
| Minor | Existing tests don't assert is_grounding_source=True |
| Minor | structural=False asymmetry between constraint condition and lookup key |
| Nit | Verbose CHANGELOG entry |
The critical M2M issue should be fixed before merge — on a production database that experienced the Celery race (even once), the migration will abort.
Codecov Report: all modified and coverable lines are covered by tests.
- Migration 0069: rewrite _repoint_cross_references to delete colliding
through-table rows before updating annotation_id. The realistic race
the migration is built to clean up — both keeper and redundant in the
same datacell.sources — would have hit Django M2M's
UNIQUE(owner_id, annotation_id) and aborted the dedup transaction.
Also fix a latent typo (apps.get_model("users", "AssignmentTask"))
that would have raised LookupError; the model is users.Assignment.
- Migration 0069: inline OC_EXTRACT_SOURCE_LABEL string instead of
importing the live constant, so replay survives future renames.
- extraction_grounding.py: move structural=False from defaults into
get_or_create lookup so the constraint condition and lookup match
exactly.
- tests: add SPAN_LABEL constraint-path test (parallel to the
TOKEN_LABEL one), assert is_grounding_source=True on the existing
ground_text/ground_pdf tests, and add a direct test for migration
0069's M2M dedup helper covering the keeper+redundant collision case.
- CHANGELOG: trim the verbose grounding follow-up entry.
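The through-table fix in the first item can be illustrated with a small standalone sketch. It is an approximation under stated assumptions, not the migration's code: the table name datacell_sources, its columns, and the helper names are hypothetical, and SQLite stands in for PostgreSQL. The point is the ordering: rows that would collide with an existing keeper link are deleted first, and only then are the remaining rows repointed.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Hypothetical M2M through table with the usual (owner, annotation) uniqueness."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE datacell_sources ("
        " datacell_id INTEGER NOT NULL,"
        " annotation_id INTEGER NOT NULL,"
        " UNIQUE (datacell_id, annotation_id))"
    )
    # Datacell 1 links both keeper (10) and redundant (11); datacell 2 links only 11.
    conn.executemany(
        "INSERT INTO datacell_sources VALUES (?, ?)",
        [(1, 10), (1, 11), (2, 11)],
    )
    conn.commit()
    return conn


def repoint(conn: sqlite3.Connection, keeper_id: int, redundant_id: int) -> None:
    """Move links from redundant_id to keeper_id without violating uniqueness."""
    with conn:
        # Step 1: drop redundant links whose owner already links the keeper.
        conn.execute(
            "DELETE FROM datacell_sources "
            "WHERE annotation_id = ? AND datacell_id IN ("
            " SELECT datacell_id FROM datacell_sources WHERE annotation_id = ?)",
            (redundant_id, keeper_id),
        )
        # Step 2: the remaining redundant links can now be repointed safely.
        conn.execute(
            "UPDATE datacell_sources SET annotation_id = ? WHERE annotation_id = ?",
            (keeper_id, redundant_id),
        )


def links(conn: sqlite3.Connection) -> list:
    """Return all (datacell_id, annotation_id) pairs in a stable order."""
    return conn.execute(
        "SELECT datacell_id, annotation_id FROM datacell_sources "
        "ORDER BY datacell_id, annotation_id"
    ).fetchall()
```

Running only the UPDATE from step 2 against the same seed data raises sqlite3.IntegrityError, which mirrors the abort the review flagged as critical.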
Signed-off-by: JSIV <5049984+JSv4@users.noreply.github.com>
Code Review — PR #1418: Back-grounding
| Test | What it exercises |
|---|---|
| test_db_constraint_blocks_concurrent_span_label_grounding_duplicates | JSONB structural equality in the constraint; duplicate INSERT raises IntegrityError; re-run is idempotent |
| test_db_constraint_blocks_concurrent_token_label_grounding_duplicates | Same for TOKEN_LABEL; get_or_create falls back to SELECT after IntegrityError |
| TestMigration0069M2MDedup.test_repoint_cross_references_handles_existing_keeper_in_through_table | The M2M collision edge case (keeper + redundant both already linked to same datacell) |
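The fallback round-trip named in the table (INSERT fails on the unique constraint, then the row is fetched with a SELECT) can be sketched as follows. This is a hand-rolled approximation of the pattern using sqlite3, not Django's get_or_create and not the project's tests; the table layout and the simulate_lost_race flag are assumptions introduced to mimic a worker whose initial SELECT missed.

```python
import sqlite3

SELECT_SQL = (
    "SELECT id FROM annotation "
    "WHERE document_id = ? AND raw_text = ? AND creator_id = ?"
)
INSERT_SQL = (
    "INSERT INTO annotation (document_id, raw_text, creator_id) VALUES (?, ?, ?)"
)


def build_demo_db() -> sqlite3.Connection:
    """Hypothetical table whose unique key matches the lookup key exactly."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE annotation ("
        " id INTEGER PRIMARY KEY,"
        " document_id INTEGER NOT NULL,"
        " raw_text TEXT NOT NULL,"
        " creator_id INTEGER NOT NULL,"
        " UNIQUE (document_id, raw_text, creator_id))"
    )
    return conn


def get_or_create(conn, key, *, simulate_lost_race=False):
    """Return (row_id, created) for the given key tuple.

    simulate_lost_race skips the first SELECT, standing in for a worker
    that looked before the competing worker's row was visible.
    """
    row = None if simulate_lost_race else conn.execute(SELECT_SQL, key).fetchone()
    if row is not None:
        return row[0], False
    try:
        with conn:  # rolls back the failed INSERT on IntegrityError
            cursor = conn.execute(INSERT_SQL, key)
        return cursor.lastrowid, True
    except sqlite3.IntegrityError:
        # The constraint fired, so another writer won; fetch its row instead.
        row = conn.execute(SELECT_SQL, key).fetchone()
        if row is None:
            raise
        return row[0], False
```

The fallback only works when the unique key and the lookup key are the same columns; if a lookup column lived only in the defaults, the losing INSERT would not collide and a duplicate row would be created.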
The updated assertion self.assertTrue(annot.is_grounding_source) replacing the redundant assertIsNotNone pair is a clean improvement.
One gap: there is no test exercising the backfill_is_grounding_source RunPython step in isolation (i.e. verifying that a pre-migration OC_EXTRACT_SOURCE row gets is_grounding_source=True after the migration runs). This is lower priority given the backfill logic is straightforward, but would increase confidence during the production rollout.
Summary
The core approach is correct and the implementation is careful. The main items worth addressing before merge:
- Documentation (low effort): Add a deployment-window note to the migration docstring advising that Celery grounding tasks should be quiesced before migrating.
- atomic = False comment: One line explaining the reason.
- Optional: Extract migration helpers out of the migration file to make them independently testable without importlib path tricks.
Everything else is minor. The PR is in good shape.
Closes #1407. Follow-up to #1397.
Summary
Addresses the four review concerns raised in #1407 against the extraction-grounding hardening in #1397.
1. Race condition: DB-level uniqueness backing get_or_create (medium)
opencontractserver/utils/extraction_grounding.py already used Annotation.objects.get_or_create() to make Celery retries idempotent, but without a backing UniqueConstraint two workers racing on the same datacell could both miss on the SELECT and both succeed on the CREATE.
Added two partial UniqueConstraints on Annotation whose fields exactly match the application-level lookup keys:
- TOKEN_LABEL (PDF): annotation_unique_token_label_grounding_key on (document, corpus, annotation_label, page, raw_text, creator)
- SPAN_LABEL (text/DOCX): annotation_unique_span_label_grounding_key on (document, corpus, annotation_label, raw_text, json, creator)
Both scoped via condition=Q(structural=False). creator is in the key so two distinct users manually creating identical annotations are not blocked; only the realistic Celery-race target (the grounding pipeline always uses the datacell owner's ID) is.
To make the constraints actually back the lookup, creator_id is moved out of defaults={...} and into the lookup keys for both _create_pdf_annotation and _create_span_annotation.
opencontractserver/annotations/migrations/0069_grounding_annotation_unique_constraints.py first deduplicates any pre-existing rows (keeping the lowest-pk row in each group and re-pointing datacell.sources M2M FKs to it) before adding the constraints, the same pattern used in 0068_enforce_embedder_path_not_null.py.
2. PostgreSQL-specific json lookup (low)
The _create_span_annotation docstring now explicitly notes that the json={"start", "end"} lookup relies on JSONField mapping to PostgreSQL jsonb (structural equality, key-order independent) and that on SQLite the comparison would be lexical. Both runtime and test databases are PostgreSQL, so this is documented rather than guarded.
3. End-of-function dedup rationale (minor)
The comment now spells out why dedup is needed rather than treating it as a symptom: distinct extraction values can legitimately resolve to the same source span (e.g. ["Acme Holdings", "Acme Holdings Inc."] both anchoring on the same token range, or an idempotent re-run returning the existing row twice). The dedup is intentional, not masking a bug in align_text_to_document.
4. Test style nit
Replaced the redundant assertIsNotNone(...) + assert ... is not None pair with a single plain assert; only the latter narrows for mypy / pyright. Inline comment explains the choice.
Test plan
- docker compose -f test.yml run django pytest opencontractserver/tests/test_extraction_grounding.py -n 4 --dist loadscope --create-db (existing idempotency tests now also exercise the DB-level constraint path)
- docker compose -f test.yml run django python manage.py migrate annotations 0069_grounding_annotation_unique_constraints against a database seeded with deliberate duplicates to exercise the dedup RunPython step
- creator in the key keeps the blast radius narrow)
Generated by Claude Code
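The key-order point from item 2 of the summary can be illustrated in plain Python. This is only an analogy and touches no database: dict equality stands in for jsonb's structural comparison, and comparing serialized strings stands in for a lexical comparison. The span values and helper names are made up for the example.

```python
import json

# Two spans with identical content but different key order (illustrative values).
span_a = {"start": 10, "end": 24}
span_b = {"end": 24, "start": 10}


def structural_equal(left: dict, right: dict) -> bool:
    """Compare parsed values, ignoring key order (analogous to jsonb equality)."""
    return left == right


def lexical_equal(left: dict, right: dict) -> bool:
    """Compare serialized text, which is sensitive to key order."""
    return json.dumps(left) == json.dumps(right)


def canonical(value: dict) -> str:
    """Serialize with sorted keys so equal content yields equal text."""
    return json.dumps(value, sort_keys=True, separators=(",", ":"))
```

Under a text-based comparison the two spans above would count as different rows, which is why the lookup's reliance on jsonb is worth documenting; canonical serialization is one possible guard if a non-PostgreSQL backend were ever used.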