
Back grounding get_or_create with DB-level uniqueness #1418

Open
JSv4 wants to merge 7 commits into main from claude/resolve-issue-1407-EbKyf

Conversation


@JSv4 JSv4 commented Apr 30, 2026

Closes #1407. Follow-up to #1397.

Summary

Addresses the four review concerns raised in #1407 against the extraction-grounding hardening in #1397.

1. Race condition: DB-level uniqueness backing get_or_create (medium)

opencontractserver/utils/extraction_grounding.py already used Annotation.objects.get_or_create() to make Celery retries idempotent, but without a backing UniqueConstraint two workers racing on the same datacell could both miss on the SELECT and both succeed on the CREATE.

Added two partial UniqueConstraints on Annotation whose fields exactly match the application-level lookup keys:

| Annotation type | Constraint name | Fields |
| --- | --- | --- |
| TOKEN_LABEL (PDF) | annotation_unique_token_label_grounding_key | (document, corpus, annotation_label, page, raw_text, creator) |
| SPAN_LABEL (text/DOCX) | annotation_unique_span_label_grounding_key | (document, corpus, annotation_label, raw_text, json, creator) |

Both are scoped via condition=Q(structural=False). creator is part of the key so that two distinct users manually creating identical annotations are not blocked; only the realistic Celery-race target is, since the grounding pipeline always creates with the datacell owner's ID.

To make the constraints actually back the lookup, creator_id is moved out of defaults={...} and into the lookup keys for both _create_pdf_annotation and _create_span_annotation.
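The interplay between the lookup key, the unique index, and the IntegrityError fallback can be illustrated outside Django with a plain sqlite3 analogue. This is a sketch only: the schema and names are invented, and the real code uses Annotation.objects.get_or_create backed by the PostgreSQL constraints above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE annotation ("
    "id INTEGER PRIMARY KEY, document INTEGER, creator INTEGER, raw_text TEXT)"
)
# Toy analogue of the UniqueConstraint whose columns match the lookup key.
conn.execute(
    "CREATE UNIQUE INDEX grounding_key ON annotation(document, creator, raw_text)"
)

def get_or_create(document, creator, raw_text):
    """Mimics Django's get_or_create: SELECT, INSERT on miss, and on
    IntegrityError (the race loser) re-SELECT the winner's row."""
    params = (document, creator, raw_text)
    query = "SELECT id FROM annotation WHERE document=? AND creator=? AND raw_text=?"
    row = conn.execute(query, params).fetchone()
    if row:
        return row[0], False
    try:
        cur = conn.execute(
            "INSERT INTO annotation(document, creator, raw_text) VALUES (?,?,?)",
            params,
        )
        return cur.lastrowid, True
    except sqlite3.IntegrityError:
        # The constraint fired because another writer won the race; the
        # retry SELECT only finds that row if creator is in the lookup.
        return conn.execute(query, params).fetchone()[0], False

pk1, created1 = get_or_create(1, 7, "Acme Holdings")
pk2, created2 = get_or_create(1, 7, "Acme Holdings")
assert created1 and not created2 and pk1 == pk2
```

If creator were only in defaults, the unique index could include columns the retry SELECT does not filter on, and the loser's re-SELECT would miss — exactly the asymmetry the PR removes.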

opencontractserver/annotations/migrations/0069_grounding_annotation_unique_constraints.py first deduplicates any pre-existing rows (keeping the lowest-pk row in each group and re-pointing datacell.sources M2M FKs to it) before adding the constraints — same pattern used in 0068_enforce_embedder_path_not_null.py.
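The shape of that dedup step can be sketched in plain Python. The rows and the pk-to-referencing-datacells map below are fabricated stand-ins; the real migration does this against Annotation rows and the datacell.sources through table via the ORM.

```python
# Group duplicate rows by their grounding key, keep the lowest-pk row in
# each group, and repoint references from redundant rows to the keeper
# before those rows are deleted.
rows = [
    {"pk": 3, "key": ("doc1", "Acme")},
    {"pk": 5, "key": ("doc1", "Acme")},  # duplicate of pk 3
    {"pk": 9, "key": ("doc2", "Beta")},
]
# annotation pk -> set of datacells referencing it (toy stand-in for M2M rows)
sources = {3: {"cell-A"}, 5: {"cell-B"}, 9: {"cell-C"}}

groups = {}
for r in rows:
    groups.setdefault(r["key"], []).append(r["pk"])

for key, pks in groups.items():
    pks.sort()
    keeper, redundant = pks[0], pks[1:]
    for pk in redundant:
        # Repoint before deleting so no reference is silently dropped.
        sources[keeper] |= sources.pop(pk)

assert sources == {3: {"cell-A", "cell-B"}, 9: {"cell-C"}}
```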

2. PostgreSQL-specific json lookup (low)

The _create_span_annotation docstring now explicitly notes that the json={"start", "end"} lookup relies on JSONField mapping to PostgreSQL jsonb (structural equality, key-order independent) and that on SQLite the comparison would be lexical. Both runtime and test databases are PostgreSQL, so this is documented rather than guarded.
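The distinction can be shown in plain Python: dict comparison mirrors jsonb's structural, key-order-independent equality, while comparing the serialized text mirrors the lexical behavior SQLite would exhibit.

```python
import json

a = {"start": 0, "end": 10}
b = {"end": 10, "start": 0}

# Structural equality (what PostgreSQL jsonb gives the lookup):
assert a == b

# Lexical comparison of the stored text (the SQLite hazard):
assert json.dumps(a) != json.dumps(b)
```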

3. End-of-function dedup rationale (minor)

The comment now spells out why dedup is needed rather than treating it as a symptom: distinct extraction values can legitimately resolve to the same source span (e.g. ["Acme Holdings", "Acme Holdings Inc."] both anchoring on the same token range, or an idempotent re-run returning the existing row twice). The dedup is intentional, not masking a bug in align_text_to_document.

4. Test style nit

Replaced the redundant assertIsNotNone(...) + assert ... is not None pair with a single plain assert — only the latter narrows for mypy / pyright. Inline comment explains the choice.

Test plan

  • docker compose -f test.yml run django pytest opencontractserver/tests/test_extraction_grounding.py -n 4 --dist loadscope --create-db (existing idempotency tests now also exercise the DB-level constraint path)
  • docker compose -f test.yml run django python manage.py migrate annotations 0069_grounding_annotation_unique_constraints against a database seeded with deliberate duplicates to exercise the dedup RunPython step
  • CI green on the full backend test suite (verifies the new constraints don't regress unrelated annotation flows; creator in the key keeps the blast radius narrow)

Generated by Claude Code

Closes #1407. Follow-up to #1397.

* Adds two partial UniqueConstraints on Annotation that exactly match
  the application-level get_or_create lookup keys in the extraction
  grounding pipeline so concurrent Celery retries can't bypass
  idempotency. Migration 0069 deduplicates any existing rows before
  adding the constraints.
* Moves creator_id into the get_or_create lookup so the new constraints
  back the lookup; without that the race-loser's IntegrityError
  wouldn't fire.
* Documents the PostgreSQL-specific json equality semantics on the
  span path and the rationale for the end-of-function dedup loop.
* Replaces the redundant assertIsNotNone + assert pair with a single
  assert in test_extraction_grounding.py (only the latter narrows for
  mypy / pyright).

claude Bot commented Apr 30, 2026

Code Review — PR #1418: Back grounding get_or_create with DB-level uniqueness

Overall this is a solid correctness fix. The core change is small and well-targeted; the migration follows the established pattern from 0068; and the PR description + docstrings thoroughly explain the reasoning. A few items worth discussing:


✅ What's done well

  • Core fix is correct. Moving creator_id from defaults into the lookup key is the necessary step to make the constraint actually back get_or_create. Without it, the partial UniqueConstraint would fire on the loser of the race, but Django's retry SELECT would use a different predicate and miss, re-raising the IntegrityError instead of returning the existing row.
  • Migration safety. atomic = False is appropriate here — dedup + constraint-add can't share an outer transaction on a large table, and partial failure leaves data in a cleaner state (deduped without the constraint) that allows a safe re-run. The per-group transaction.atomic() inside the dedup loop is correct.
  • Blast radius is narrow. The condition=Q(structural=False, ...) partial constraint means structural / layout annotations (which can legitimately repeat across parser runs) are completely unaffected.
  • Dedup handles M2M re-pointing before deleting duplicates — same pattern as 0068, and the only real FK concern for annotations created exclusively by the grounding pipeline.

Issues

Medium — test coverage gap for the IntegrityError retry path

The existing idempotency tests call the grounding functions sequentially (same process, same transaction). They verify that a second call returns the same annotation PK, but they don't exercise the DB-level constraint path: two concurrent INSERTs where the loser gets an IntegrityError and Django's get_or_create retries with a SELECT.

The constraint is the new correctness invariant, but there's no test that fires it. A minimal test could insert the annotation directly, then call _create_pdf_annotation / _create_span_annotation again and assert that created=False is returned and the annotation count hasn't increased — which would prove that the constraint + retry round-trip works end-to-end.

def test_db_constraint_backs_get_or_create_token_label(self):
    # Pre-insert the row that the pipeline would create
    existing = Annotation.objects.create(
        document=self.document,
        corpus=self.corpus,
        annotation_label=self.label,
        page=0,
        annotation_type=TOKEN_LABEL,
        raw_text="some text",
        creator=self.user,
        json={...},
        structural=False,
    )
    count_before = Annotation.objects.count()
    annot, created = _create_pdf_annotation(...)
    self.assertFalse(created)
    self.assertEqual(annot.pk, existing.pk)
    self.assertEqual(Annotation.objects.count(), count_before)

Minor — dedup migration doesn't account for other Annotation FK references

The dedup loop re-points Datacell.sources (the M2M through table) before deleting duplicate annotations, which is correct for the grounding pipeline's usage. However, if any other part of the codebase holds FKs to Annotation (e.g. via annotationset, relationship models, or comment threads), those references to the deleted duplicates would either cascade-delete or raise a constraint error depending on on_delete settings.

This is low risk because the duplicates being targeted were created exclusively by the grounding pipeline (and thus only referenced by Datacell.sources), but it would be worth a quick grep to confirm no other FK column points at Annotation without being handled:

grep -r "ForeignKey.*Annotation\|OneToOneField.*Annotation" opencontractserver/ --include="*.py" | grep -v migrations

If there are other FK references, they should either be added to the dedup loop or confirmed to have on_delete=CASCADE (so they'd self-clean on delete).

Minor — test comment verbosity exceeds project conventions

CLAUDE.md says to default to no comments and only add one when the WHY is non-obvious. The 5-line block comment explaining assert vs assertIsNotNone is more explanation than the change warrants — the original paired assertion was already being replaced, so the comment is defending against a pattern that no longer exists in the file.

A single trailing comment (# plain assert narrows for mypy) or no comment at all would fit the project convention better. The existing code already has many examples of assert x is not None without explanation.


Nits

  • The migration module-level docstring is long but appropriate given the migration's complexity — this is an exception to the "no multi-line comments" rule because migration context genuinely needs to survive outside of git blame.
  • In models.py, the constraints use django.db.models.UniqueConstraint with the fully-qualified prefix rather than a bare UniqueConstraint — minor inconsistency with the style used elsewhere in the same Meta class, but not worth a separate PR.

Summary

The main correctness fix (moving creator_id into the lookup key) and the migration (dedup + partial UniqueConstraint) are both correct. The missing piece is a direct test that fires the IntegrityError path and confirms the Django retry round-trip works — without it, a regression in how the constraint interacts with get_or_create wouldn't be caught by the existing test suite. I'd recommend adding that test before merging.

JSv4 added 4 commits April 30, 2026 22:40
The original constraint condition (structural=False, annotation_type=...)
applied to ALL non-structural annotations, blocking legitimate flows
that produce duplicates under the key tuple — e.g.
add_annotations_from_exact_strings finding multiple occurrences of the
same word on a page, or hierarchical annotation trees with duplicate
raw_text at different parents (4 backend tests failing in CI:
test_annotation_tree::{test_mid_node_subtree_and_descendants,
test_stress_test} and test_llm_annotation_tools::
{test_add_annotations_pdf, test_async_pdf_and_text}).

Add a boolean Annotation.is_grounding_source (default False), set True
only inside _create_pdf_annotation / _create_span_annotation, and scope
both partial UniqueConstraints to is_grounding_source=True. The
migration backfills True for any pre-existing OC_EXTRACT_SOURCE rows so
historical grounding annotations are also covered, then dedupes and
adds the constraints.
…ent trim

- Add test_db_constraint_blocks_concurrent_token_label_grounding_duplicates
  in TestGroundingPipelinePDFIntegration: directly exercises the
  IntegrityError + get_or_create fallback round-trip the constraint
  enables. Closes the test-coverage gap flagged in review.
- Extend the migration's dedup loop to repoint every realistic
  cross-reference (UserFeedback FK; ChatMessage / AssignmentTask /
  Relationship M2Ms) to the keeper before deleting redundant rows, so
  no cross-domain references are silently dropped. Document why
  Embedding (CASCADE) and Annotation.parent (CASCADE on leaf rows) are
  intentionally not repointed.
- Trim the assertIsNotNone-vs-assert explanation to a single trailing
  comment to match project convention.
- Pin migration dependencies to the latest migration of every app
  whose models the dedup helper touches (conversations, feedback,
  users) in addition to the existing extracts pin.

claude Bot commented May 1, 2026

Code Review — PR #1418: Back grounding get_or_create with DB-level uniqueness

Overall this is a well-scoped, well-documented hardening PR. The core change (moving creator_id + is_grounding_source from defaults into the lookup, and adding the two partial UniqueConstraints) is correct and addresses the race condition cleanly. The migration structure mirrors the established pattern from 0068. Most concerns below are in the migration's dedup helper.


Critical

_repoint_cross_references can produce M2M uniqueness violations during dedup

migrations/0069, _repoint_cross_references:

Datacell.sources.through.objects.filter(annotation_id__in=redundant_ids).update(
    annotation_id=keeper_id
)

Django's default M2M through table has a UNIQUE(datacell_id, annotation_id) constraint. In exactly the race scenario this migration is designed to clean up — one datacell whose grounding pipeline ran twice — both the keeper and a redundant annotation will already be in that datacell's sources. Blindly updating annotation_id=keeper_id for the redundant row then produces two through-table rows with the same (datacell_id, keeper_id), raising an IntegrityError and aborting the dedup transaction.

The same hazard applies to all other M2M through tables repointed by the function (ChatMessage.source_annotations, ChatMessage.created_annotations, Relationship.source_annotations, Relationship.target_annotations, AssignmentTask.resulting_annotations).

Suggested fix per relationship — delete conflicting rows before updating:

# For each M2M, delete redundant rows that would collide with an
# already-present keeper entry, then update the rest.
existing = set(
    Datacell.sources.through.objects
        .filter(annotation_id=keeper_id)
        .values_list("datacell_id", flat=True)
)
Datacell.sources.through.objects.filter(
    annotation_id__in=redundant_ids, datacell_id__in=existing
).delete()
Datacell.sources.through.objects.filter(
    annotation_id__in=redundant_ids
).update(annotation_id=keeper_id)

Apply the same pattern to each of the seven relationships in _repoint_cross_references. Without this, the migration will fail for any database that experienced the race it was written to fix.
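The hazard, and the delete-then-update fix, can be reproduced with a toy through table in sqlite3. The table and IDs are fabricated; the real migration performs the equivalent operations through the ORM.

```python
import sqlite3

# Autocommit so the failed UPDATE doesn't poison an open transaction.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    "CREATE TABLE datacell_sources ("
    "datacell_id INTEGER, annotation_id INTEGER, "
    "UNIQUE(datacell_id, annotation_id))"
)
# Race aftermath: datacell 10 references both the keeper (1) and a duplicate (2).
conn.executemany(
    "INSERT INTO datacell_sources VALUES (?, ?)", [(10, 1), (10, 2), (11, 2)]
)

keeper_id, redundant_id = 1, 2

# A blind UPDATE collides with the existing (10, 1) row:
try:
    conn.execute(
        "UPDATE datacell_sources SET annotation_id=? WHERE annotation_id=?",
        (keeper_id, redundant_id),
    )
    raise AssertionError("expected IntegrityError")
except sqlite3.IntegrityError:
    pass  # the failed statement is rolled back; data is unchanged

# Fix: delete rows that would collide with an existing keeper entry, then update.
conn.execute(
    "DELETE FROM datacell_sources WHERE annotation_id=? AND datacell_id IN "
    "(SELECT datacell_id FROM datacell_sources WHERE annotation_id=?)",
    (redundant_id, keeper_id),
)
conn.execute(
    "UPDATE datacell_sources SET annotation_id=? WHERE annotation_id=?",
    (keeper_id, redundant_id),
)
rows = sorted(conn.execute("SELECT * FROM datacell_sources").fetchall())
assert rows == [(10, 1), (11, 1)]
```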


Medium

Migration imports application-level constant — fragile on replay

migrations/0069:

from opencontractserver.constants.annotations import OC_EXTRACT_SOURCE_LABEL

Django migrations are meant to be reproducible snapshots. If OC_EXTRACT_SOURCE_LABEL is ever renamed or the constants module is restructured, replaying this migration against a clean database from history breaks. The established convention is to inline the string value directly:

OC_EXTRACT_SOURCE_LABEL = "OC Extract Source"  # frozen at migration write time

No test for the SPAN_LABEL constraint

test_db_constraint_blocks_concurrent_token_label_grounding_duplicates only covers the TOKEN_LABEL path. Given that the SPAN_LABEL constraint uses json as part of its identity key (which has the PostgreSQL-specific semantics explicitly called out in the docstring), a parallel test_db_constraint_blocks_concurrent_span_label_grounding_duplicates would be worthwhile — especially to confirm the json-equality lookup works at the DB level.


Minor

Existing tests don't assert is_grounding_source=True

test_ground_text_document and test_ground_pdf_creates_token_label_annotations now loop over the returned annotations but don't verify the new field:

self.assertTrue(annot.is_grounding_source)

Adding this assertion would close the loop between the model/migration change and the pipeline change, and catch any future regression where the flag gets silently dropped from the get_or_create lookup.

structural is in constraint condition but not in the get_or_create lookup

extraction_grounding.py, _create_pdf_annotation and _create_span_annotation:

structural=False is in the UniqueConstraint condition but only in defaults for get_or_create. This is correct — a new row will always get structural=False — but it creates an asymmetry: if a pre-existing row somehow had structural=True with the same key tuple, get_or_create would find it, yet the constraint wouldn't protect it. This is practically impossible for grounding rows, but adding structural=False to the lookup key would make the intent self-documenting and ensure the constraint's condition is always satisfied on both paths.


Nit

CHANGELOG entry length

The new CHANGELOG bullet is unusually verbose compared to the surrounding entries. Per project convention, each bullet should be concise (a sentence or two with file/line references). The detailed mechanistic explanation is better placed in the PR description (where it already lives).


Summary

| Severity | Issue |
| --- | --- |
| Critical | _repoint_cross_references bulk-UPDATEs into M2M through tables that may already contain keeper_id, causing IntegrityError during the exact race scenarios the migration targets |
| Medium | Migration imports live application constant (should freeze the string value) |
| Medium | No SPAN_LABEL constraint test to cover the json-equality path |
| Minor | Existing tests don't assert is_grounding_source=True |
| Minor | structural=False asymmetry between constraint condition and lookup key |
| Nit | Verbose CHANGELOG entry |

The critical M2M issue should be fixed before merge — on a production database that experienced the Celery race (even once), the migration will abort.


codecov Bot commented May 1, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


JSv4 and others added 2 commits May 2, 2026 11:22
- Migration 0069: rewrite _repoint_cross_references to delete colliding
  through-table rows before updating annotation_id. The realistic race
  the migration is built to clean up — both keeper and redundant in the
  same datacell.sources — would have hit Django M2M's
  UNIQUE(owner_id, annotation_id) and aborted the dedup transaction.
  Also fix a latent typo (apps.get_model("users", "AssignmentTask"))
  that would have raised LookupError; the model is users.Assignment.
- Migration 0069: inline OC_EXTRACT_SOURCE_LABEL string instead of
  importing the live constant, so replay survives future renames.
- extraction_grounding.py: move structural=False from defaults into
  get_or_create lookup so the constraint condition and lookup match
  exactly.
- tests: add SPAN_LABEL constraint-path test (parallel to the
  TOKEN_LABEL one), assert is_grounding_source=True on the existing
  ground_text/ground_pdf tests, and add a direct test for migration
  0069's M2M dedup helper covering the keeper+redundant collision case.
- CHANGELOG: trim the verbose grounding follow-up entry.
Signed-off-by: JSIV <5049984+JSv4@users.noreply.github.com>

claude Bot commented May 3, 2026

Code Review — PR #1418: Back grounding get_or_create with DB-level uniqueness

Overview

This PR hardens the extraction grounding pipeline against a real (if low-probability) Celery race condition. Two concurrent workers retrying the same datacell could each miss the get_or_create SELECT and both succeed on the INSERT, producing duplicate OC_EXTRACT_SOURCE annotations. The fix introduces:

  1. A new is_grounding_source boolean field scoping the new constraints to grounding-pipeline rows only.
  2. Two partial UniqueConstraints on Annotation matching the exact lookup keys used by get_or_create.
  3. Migration 0069 that backfills the flag, deduplicates pre-existing rows (repointing all FK/M2M cross-references), then adds the constraints.
  4. creator_id, is_grounding_source=True, and structural=False moved out of defaults={} into the lookup keys so the DB constraint and the application-level call are symmetric.

The approach is sound. Observations below, in rough priority order.


Correctness — one item to verify

Deployment window: in-flight Celery tasks during migration

The migration is atomic = False and does destructive work (deletes duplicates, adds constraints). If Celery workers are creating grounding annotations while the migration runs, those new rows will have is_grounding_source=False (the field exists but the code hasn't been deployed yet), so they'll slip through the constraint. More critically, if a worker inserts a row between the backfill step and the constraint step, that row won't be covered.

This is a deployment-window concern rather than a code bug, but it deserves a comment in the migration docstring noting that Celery queues should be drained (or the grounding task paused) before the migration is applied in production. The existing docstring is detailed; a one-sentence note would round it out.


Code quality observations

Migration's atomic = False — comment on the why

The atomic = False is correct: you can't add a constraint inside the same transaction that dedupes rows on PostgreSQL without advisory locks. A short inline comment would help the next reader know this was a deliberate choice, not an oversight:

class Migration(migrations.Migration):
    # Must be non-atomic: the dedup RunPython steps must commit before
    # the UniqueConstraints can be added safely on PostgreSQL.
    atomic = False

_dedupe_token_label / _dedupe_span_label — N+1 query pattern

Each duplicate group issues a second per-group SELECT to fetch IDs in pk order:

for group in duplicates:          # QuerySet evaluated lazily → 1 query per group
    ids = list(Annotation.objects.filter(...).order_by("id").values_list(...))

For a production database with many race-produced duplicates this could be slow. A minor improvement is to fetch all IDs in a single query and partition in Python:

from itertools import groupby

all_ids = (
    Annotation.objects.filter(...)
    .values("document_id", "corpus_id", "annotation_label_id", "page",
            "raw_text", "creator_id", "id")
    .order_by("document_id", ..., "id")
)
for key, group_rows in groupby(all_ids, key=lambda r: (...)):
    ids = [r["id"] for r in group_rows]
    ...

Not a blocker (duplicates are rare, migration is one-time), but worth noting for very large tables.
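As a self-contained sketch of the partition-in-Python shape (the row dicts are fabricated stand-ins for the output of one ordered .values() query):

```python
from itertools import groupby

# Pretend this is the result of ONE query, ordered by the key columns then id.
all_rows = [
    {"document_id": 1, "raw_text": "Acme", "id": 3},
    {"document_id": 1, "raw_text": "Acme", "id": 5},
    {"document_id": 2, "raw_text": "Beta", "id": 9},
]

def grouping_key(row):
    return (row["document_id"], row["raw_text"])

# groupby requires input sorted by the key, which the ORDER BY guarantees.
dedup_plan = {}
for key, rows in groupby(all_rows, key=grouping_key):
    ids = [r["id"] for r in rows]
    dedup_plan[key] = {"keeper": ids[0], "redundant": ids[1:]}

assert dedup_plan[(1, "Acme")] == {"keeper": 3, "redundant": [5]}
assert dedup_plan[(2, "Beta")] == {"keeper": 9, "redundant": []}
```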

_repoint_m2m_through — field-name convention

The helper builds the Django lookup dynamically:

through_model.objects.filter(annotation_id__in=redundant_ids,
    **{f"{owner_field}_id__in": existing_owners})

Django M2M through-table fields are named <related_model_name>_id by default, but owner_field is passed as "datacell", "chatmessage", etc. — all lowercase model names. This works for the models referenced here, but could silently produce a wrong filter if the through-table field name ever differs (e.g. a custom through model). A defensive getattr(through_model, f"{owner_field}_id", None) is not None assertion before the filter would catch mismatches early.

importlib.util.spec_from_file_location in TestMigration0069M2MDedup

Directly loading the migration file with importlib.util is creative but fragile — it breaks if the file is moved or renamed. A cleaner alternative is to extract _repoint_cross_references (and _repoint_m2m_through) into a private utility module (e.g. opencontractserver/utils/migration_helpers.py) and import normally. The migration would call the same function, and the test imports it directly without path magic.

This is a style preference rather than a bug, but the fragility of the path string is worth flagging.


Minor / nits

  • CHANGELOG extra space: The blob-retention entry has a leading double space (- **Blob retention...) instead of a single dash+space. Cosmetic only.
  • reverse_dedupe docstring: The current """No-op: dropping the constraint cannot resurrect deleted duplicates.""" is accurate and appreciated — no change needed.
  • Test comment: # narrows for mypy (with an s) is an improvement over # narrow for mypy — consistent across new and updated lines. ✓

Security / performance

No new attack surface introduced. The constraint is additive and narrowly scoped via the is_grounding_source condition. The dedup logic correctly uses parameterized ORM queries throughout.


Test coverage

The three new test scenarios cover the right cases:

| Test | What it exercises |
| --- | --- |
| test_db_constraint_blocks_concurrent_span_label_grounding_duplicates | JSONB structural equality in the constraint; duplicate INSERT raises IntegrityError; re-run is idempotent |
| test_db_constraint_blocks_concurrent_token_label_grounding_duplicates | Same for TOKEN_LABEL; get_or_create falls back to SELECT after IntegrityError |
| TestMigration0069M2MDedup.test_repoint_cross_references_handles_existing_keeper_in_through_table | The M2M collision edge case (keeper and redundant both already linked to the same datacell) |

The updated assertion self.assertTrue(annot.is_grounding_source) replacing the redundant assertIsNotNone pair is a clean improvement.

One gap: there is no test exercising the backfill_is_grounding_source RunPython step in isolation (i.e. verifying that a pre-migration OC_EXTRACT_SOURCE row gets is_grounding_source=True after the migration runs). This is lower priority given the backfill logic is straightforward, but would increase confidence during the production rollout.


Summary

The core approach is correct and the implementation is careful. The main items worth addressing before merge:

  1. Documentation (low effort): Add a deployment-window note to the migration docstring advising that Celery grounding tasks should be quiesced before migrating.
  2. atomic = False comment: One line explaining the reason.
  3. Optional: Extract migration helpers out of the migration file to make them independently testable without importlib path tricks.

Everything else is minor. The PR is in good shape.

