Skip to content

typing: graduate opencontractserver.documents.* (refs #1447)#1475

Open
JSv4 wants to merge 7 commits intomainfrom
mypy/graduate-documents
Open

typing: graduate opencontractserver.documents.* (refs #1447)#1475
JSv4 wants to merge 7 commits intomainfrom
mypy/graduate-documents

Conversation

@JSv4
Copy link
Copy Markdown
Collaborator

@JSv4 JSv4 commented May 3, 2026

Continues draining the mypy baseline. Graduates all eight `opencontractserver.documents.*` modules.

Removed from `mypy.ini`

  • `opencontractserver.documents.{admin,checks,models,query_optimizer,versioning}`
  • `opencontractserver.documents.tests.test_pdf_hash`
  • `opencontractserver.documents.management.commands.{compute_document_hashes,migrate_pipeline_settings}`

Added to `mypy.ini`

```
[mypy-opencontractserver.documents.models]
disable_error_code = django-manager-missing
```

`DocumentPath` self-relation `path_records` / `children` and `IngestionSource.document_paths` are django-tree-queries reverse-relation surfaces that django-stubs cannot resolve.

Pruned from `docs/typing/mypy_baseline.txt`

40 lines (7124 → 7084).

Per-file fixes (highlights)

  • `versioning.py` — replaced runtime `User = get_user_model()` with a TYPE_CHECKING import of `opencontractserver.users.models.User`; widened `move_document(new_folder=...)` from `Optional[CorpusFolder] = "UNSET"` to `Optional[CorpusFolder] | str = "UNSET"` and added an `assert not isinstance(new_folder, str)` after the sentinel check so `folder_to_use` narrows correctly.
  • `models.py` — manager-override `# type: ignore[misc]`; `author_obj: AbstractBaseUser` narrowing in `update_summary`; two `all_components: list[Any] = []` accumulators.
  • `query_optimizer.py` — typed `result: dict[str, list[Any]] = {...}`.
  • `checks.py` — typed three `errors` accumulators (one widened to `list[Error | Warning]`).
  • `admin.py` — three Django-admin `short_description` / `admin_order_field` ignores.
  • `tests/test_pdf_hash.py` — narrowing `assert computed_hash is not None`; cast `_meta.get_field(...)` to `Any` for field-introspection.
  • `compute_document_hashes.py` — `assert document.pdf_file.name is not None` before `storage.exists`; coalesced `(old_hash or "")[:8]` for the no-change branch.
  • `migrate_pipeline_settings.py` — typed two `all_components` accumulators; widened heterogeneous `stats` dict; renamed inner iteration variable so the list-comp re-assignment doesn't conflict with the source tuple type.

Acceptance (per #1447)

  • `mypy.ini` block(s) removed
  • Pre-commit `mypy` runs clean on full project surface
  • No runtime behavior changes intended

```
$ docker compose -f test.yml run --rm django python -m mypy --config-file mypy.ini opencontractserver config
Success: no issues found in 1019 source files

$ pre-commit run --all-files
mypy.....................................................................Passed
```

Test plan

  • CI mypy job passes
  • CI backend tests pass

Continues draining the mypy baseline. Graduates all eight documents.*
modules.

Removed from mypy.ini
- opencontractserver.documents.{admin,checks,models,query_optimizer,
  versioning}
- opencontractserver.documents.tests.test_pdf_hash
- opencontractserver.documents.management.commands.{
  compute_document_hashes,migrate_pipeline_settings}

Added to mypy.ini
- [mypy-opencontractserver.documents.models] disable_error_code =
  django-manager-missing (DocumentPath self-relation path_records /
  children and IngestionSource.document_paths are django-tree-queries
  reverse-relation surfaces django-stubs cannot resolve)

Pruned 40 baseline lines (7124 -> 7084).

Per-file fixes
- versioning.py: replaced runtime User = get_user_model() with
  TYPE_CHECKING import of opencontractserver.users.models.User and
  quoted six `user: "User"` annotations; widened
  move_document(new_folder=...) from Optional[CorpusFolder] = "UNSET"
  to Optional[CorpusFolder] | str = "UNSET" and added
  `assert not isinstance(new_folder, str)` after the sentinel check so
  folder_to_use: Optional[CorpusFolder] narrows correctly.
- models.py: scoped # type: ignore[misc] on the
  `objects = DocumentManager()` manager override; added
  `author_obj: AbstractBaseUser` annotation in Document.update_summary
  plus # type: ignore[misc] on the corresponding author=author_obj
  kwarg passed to DocumentSummaryRevision.objects.create(); typed two
  `all_components: list[Any] = []` accumulators.
- query_optimizer.py: typed `result: dict[str, list[Any]] = {...}` in
  the related-object collector.
- checks.py: typed three `errors: list[Warning] = []` accumulators
  (one widened to list[Error | Warning] because the body appends both
  severity classes).
- admin.py: scoped # type: ignore[attr-defined] on the three
  Django-admin `method.short_description = "..."` /
  `admin_order_field = "..."` annotations.
- tests/test_pdf_hash.py: added narrowing
  `assert computed_hash is not None` before the SHA-256 length
  assertion; cast `_meta.get_field("pdf_file_hash")` to Any in the
  field-introspection helper.
- management/commands/compute_document_hashes.py: added
  `assert document.pdf_file.name is not None` before
  storage.exists(); coalesced `old_hash[:8]` to `(old_hash or "")[:8]`
  for the no-change branch.
- management/commands/migrate_pipeline_settings.py: typed two
  `all_components: list[Any] = []` accumulators; widened the
  heterogeneous stats dict to dict[str, Any]; renamed the inner
  iteration variable to group_components and rebound
  `components: list[Any] = list(group_components)` so the list-comp
  re-assignment doesn't conflict with the source tuple type.

Verification
- pre-commit run --all-files -> all hooks pass (mypy: Passed)
- docker compose -f test.yml run --rm django python -m mypy
  --config-file mypy.ini opencontractserver config -> "no issues found
  in 1019 source files"
@claude
Copy link
Copy Markdown

claude Bot commented May 3, 2026

Code Review: typing: graduate opencontractserver.documents.* (refs #1447)

Overview

Clean, focused PR that graduates 8 opencontractserver.documents.* modules from mypy.ini's ignore_errors = True baseline. No runtime behavior changes are intended, and the diff is disciplined — minimal suppressions, targeted type annotations, and two genuine bug-adjacent narrowing fixes.


Positives

  • Scoped suppressions: Using disable_error_code = django-manager-missing for documents.models instead of the blunt ignore_errors = True is exactly right. The comment explaining the django-tree-queries limitation is valuable.
  • TYPE_CHECKING guard in versioning.py: Replacing the runtime User = get_user_model() with a TYPE_CHECKING-only import is the correct Django pattern to avoid potential circular-import issues and aligns with how opencontractserver.users.models.User is imported elsewhere in the codebase.
  • assert computed_hash is not None ordering in tests: Placed after assertEqualassertEqual would fail first if None, so the assert is purely a type narrowing hint for mypy and doesn't change test semantics.
  • (old_hash or '')[:8]: This quietly fixes a real latent bug — old_hash was always str | None at that branch, and slicing None would raise TypeError at runtime.

Issues / Suggestions

1. assert not isinstance(new_folder, str) is silently disabled in production (versioning.py:435)

assert not isinstance(new_folder, str)
folder_to_use = new_folder

assert statements are stripped when Python runs with -O (optimised mode), which some production containers enable. If a caller accidentally passes an arbitrary string (not just "UNSET"), this silently becomes folder_to_use = "<some string>", and the DocumentPath.folder FK assignment will blow up with an unhelpful error later.

Suggestion: Replace with a proper guard:

if isinstance(new_folder, str):
    raise TypeError(f"new_folder must be a CorpusFolder or None, got {new_folder!r}")
folder_to_use = new_folder

This costs two lines and eliminates the risk entirely.

2. Sentinel type widened to str is too broad (versioning.py:413)

new_folder: "Optional[CorpusFolder] | str" = "UNSET",

The union now accepts any string, not just the "UNSET" sentinel. A caller passing new_folder="some/path" would pass type-checking but hit the assertion at runtime. A Literal["UNSET"] would tighten this without any additional code:

from typing import Literal
new_folder: Optional[CorpusFolder] | Literal["UNSET"] = "UNSET",

mypy would then reject any non-"UNSET" string at the call site.

3. Unnecessary explanatory comments (CLAUDE.md convention)

Per CLAUDE.md: "Don't explain WHAT the code does... Only add [a comment] when the WHY is non-obvious."

Several comments added in this PR explain the mechanics to mypy rather than a non-obvious constraint:

  • migrate_pipeline_settings.py:606: # Filter if specific component requested. ``list(...)`` widens the source tuple... — the type annotation already communicates this.
  • versioning.py:432–435: The three-line comment block restating the sentinel logic.
  • compute_document_hashes.py:74–76: Comment before the assert restating what the truthy check above already guarantees.

These can be trimmed or removed. They read as explaining the workaround to mypy rather than documenting a hidden invariant for future maintainers.

4. checks.py type inconsistency — three functions annotated list[Warning], one list[Error | Warning]

# check_pipeline_settings_populated, check_modernbert_references, check_configured_components_secrets
errors: list[Warning] = []

# check_embedding_batch_size_configuration
errors: list[Error | Warning] = []

The first three could in principle also append Error-level items in the future. Widening all four to list[Error | Warning] would be more uniform and future-proof, and costs nothing at runtime.

5. Minor: assert computed_hash is not None in test comes after the assertion it enables

self.assertEqual(computed_hash, self.pdf_hash)  # would fail here if None
assert computed_hash is not None                # narrowing for mypy
self.assertEqual(len(computed_hash), 64)        # benefits from narrowing

The placement is functionally correct, but the narrowing assert should logically precede the assertions it enables. Move it before the assertEqual(computed_hash, self.pdf_hash) line so the intent is clear:

assert computed_hash is not None
self.assertEqual(computed_hash, self.pdf_hash)
self.assertEqual(len(computed_hash), 64)

Summary

Correctness ✅ No runtime behavior changes; (old_hash or '')[:8] fixes a latent bug
Type safety ⚠️ assert vs raise for sentinel narrowing; str too broad vs Literal["UNSET"]
Conventions ⚠️ Several comments explain the "what" not the "why"
Tests ✅ Changes are appropriate; no new tests needed for a typing-only PR
Security ✅ No concerns

The two sentinel-related points (items 1 and 2) are the ones most worth addressing before merge. Items 3–5 are lower priority but worth a quick pass.

JSv4 added 3 commits May 3, 2026 17:49
Resolves three conflicts (CHANGELOG.md, docs/typing/mypy_baseline.txt,
mypy.ini): documents.* modules graduated by this PR are kept off the
ignore list and out of the baseline, while origin/main's parallel
typing graduations (corpuses, agents, analyzer, llms.tools, shared,
small single-file packages) are preserved.
The PR commit message claimed to quote six `user: "User"` annotations
but two were missed: `permanently_delete_document` (line 677) and
`permanently_delete_all_in_trash` (line 782). Without quotes those
annotations are evaluated at function-definition time, so importing
`opencontractserver.documents.versioning` raised `NameError: name
'User' is not defined` because `User` is only imported under
`TYPE_CHECKING`. That broke every test module transitively importing
versioning (sidecar/zip imports, document_path migrations, agent memory
curate task, etc.) — which is what the failing pytest run reported.

Quote both annotations to match the existing pattern (lines 136, 413,
469, 506); now all six `user: "User"` occurrences are consistent.
Addresses the five points from claude[bot]'s review on the original
(pre-merge) revision of this PR:

1. versioning.py move_document() — replace `assert not isinstance(...)`
   with a real `raise TypeError` so the sentinel guard is not silently
   stripped when Python runs with `-O`.

2. versioning.py move_document() — tighten `new_folder` annotation
   from `Optional[CorpusFolder] | str` to
   `Optional[CorpusFolder] | Literal['UNSET']` so callers passing any
   other string fail at type-check time rather than runtime.

3. Trim explanatory comments that restated the type-narrowing mechanics
   for mypy rather than documenting non-obvious invariants:
   - migrate_pipeline_settings.py:606 (the list(...) widening blurb)
   - compute_document_hashes.py:74-77 (the str-or-None recap above the
     narrowing assert)

4. checks.py — widen the three `errors: list[Warning]` accumulators
   to `list[Error | Warning]` so all four check functions in the
   module use a uniform error-list type. Future check additions can
   append `Error` items without re-typing the accumulator.

5. tests/test_pdf_hash.py — move the `assert computed_hash is not None`
   narrowing line above the first `assertEqual(computed_hash, ...)`
   call so the intent (narrow before use) reads top-to-bottom.
@claude
Copy link
Copy Markdown

claude Bot commented May 3, 2026

Code Review: typing: graduate opencontractserver.documents.* (refs #1447)

Summary: Clean, minimal typing graduation PR. Removes eight ignore_errors blocks from mypy.ini and fixes the underlying type errors with targeted, well-scoped annotations. No runtime behavior changes. Net result: 40 fewer baseline errors (7124 → 7084).


What's done well

  • versioning.py sentinel pattern: Replacing Optional[CorpusFolder] = "UNSET" with Optional[CorpusFolder] | Literal['UNSET'] is strictly more correct than widening to str. The Literal['UNSET'] annotation makes the sentinel semantically explicit and prevents any other string from passing the type checker at call sites.
  • raise TypeError over assert: The comment in versioning.py correctly notes that assert is stripped under python -O. Using a real raise TypeError is the right call for a public API function.
  • TYPE_CHECKING import for User: Removing the module-level User = get_user_model() and replacing with a TYPE_CHECKING guard is the canonical pattern when User is only referenced in annotations. Avoids a redundant DB-hitting call at import time in tests and management commands.
  • Scoped disable_error_code for documents.models: Rather than ignore_errors = True, the PR uses a narrow disable_error_code = django-manager-missing at the bottom of mypy.ini, mirroring the existing users.models pattern. Correctly documents the root cause (django-tree-queries reverse relation surfaces).
  • stats: dict[str, Any]: The comment explaining why the dict is heterogeneous is appropriate — this is exactly the kind of non-obvious constraint the CLAUDE.md calls out as comment-worthy.

Issues / suggestions

Minor: inconsistent assert vs raise for null-narrowing in compute_document_hashes.py

# compute_document_hashes.py:75
assert document.pdf_file.name is not None  # narrowing for mypy

versioning.py explicitly avoids assert because it's stripped under python -O, but this management command uses one. Management commands are rarely run with -O, so the risk is low, but the inconsistency is worth noting. If the name is None at this point, the next line (storage.exists(document.pdf_file.name)) would pass None anyway, which was the original bug. Consider replacing with:

if document.pdf_file.name is None:
    self.stdout.write(self.style.ERROR(f"Document {document.id} has a file field with no name"))
    continue

This aligns with the error-handling style already used in the surrounding handle() method.

Nit: author=author_obj # type: ignore[misc] in models.py

# models.py:294
author=author_obj,  # type: ignore[misc]

The underlying issue is likely that DocumentSummaryRevision.author is typed as a concrete User subclass but author_obj is AbstractBaseUser. The ignore is pragmatic and correct for now, but a follow-up to narrow author_obj to settings.AUTH_USER_MODEL or the return type of get_user_model() would eliminate this ignore cleanly. Not a blocker for this PR.

Nit: list[Error | Warning] applied uniformly but one function only emits Error

In checks.py, check_embedding_batch_size_configuration (line 242) only ever appends Error instances, so list[Error] would be strictly more precise. The uniform list[Error | Warning] annotation is harmless and keeps the pattern consistent, so this is purely informational.


mypy.ini restructuring

The section replacement (removing documents.* blocks, adding corpuses.*, discovery.*, examples.*, feedback.*) is correct. The alphabetical/logical grouping is clean. The comment block above the new [mypy-opencontractserver.documents.models] entry at line ~970 clearly explains the django-tree-queries limitation.


Test coverage

  • test_pdf_hash.py: the assert computed_hash is not None addition strengthens the test by making a None return an explicit failure (previously it would fail on the assertEqual with a less clear message). The field: Any = _meta.get_field(...) cast is the correct approach for field-introspection tests.
  • No new behavioral code paths introduced, so no additional tests are required.

Verdict: Approve. The one meaningful suggestion (assert vs. guard in compute_document_hashes.py) is a nit and doesn't block merge. The rest of the fixes are clean and follow established patterns in the codebase.

JSv4 added 2 commits May 3, 2026 18:33
The merge with origin/main accidentally retained ignore_errors blocks for
modules that origin/main had since graduated (corpuses.*, discovery.views,
examples.*, feedback.models). This also produced a duplicate
[mypy-opencontractserver.discovery.tests.test_discovery_views] section,
which made mypy abort config parsing and surface ~2.1k baseline errors.

Removes the redundant blocks. The narrower User typing in versioning.py
also broke a corpuses.models call site that passes AbstractBaseUser into
import_document; suppress with a localized type: ignore[arg-type] rather
than re-baselining corpuses.models.

Address Claude review (PR #1475): replace bare 'assert pdf_file.name is
not None' in compute_document_hashes with an explicit None-guard that
records an error and continues, matching the surrounding error-handling
style and surviving 'python -O'.
@claude
Copy link
Copy Markdown

claude Bot commented May 3, 2026

Code Review

Summary: Typing graduation for opencontractserver.documents.* — removes 8 ignore_errors = True blocks from mypy.ini, prunes 40 baseline error lines (7124 → 7084), and makes surgical per-file fixes. Clean, well-scoped PR. The mechanical typing changes are all correct. A few items worth a look:


Genuine runtime changes (the PR description says "no runtime behavior changes intended")

compute_document_hashes.py — The new pdf_file.name is None guard at line 72–80 is actually a legitimate bug fix (a missing-name FieldFile would have thrown an unhandled TypeError at storage.exists(None)). This is a good change but it is a behavior change — the command now counts these as errors and continues rather than crashing. Fine to keep, just worth documenting accurately.

versioning.pymove_document sentinel:

elif isinstance(new_folder, str):
    raise TypeError(...)

Previously, a stray string (e.g., move_document(..., new_folder="some/path")) would have silently gone to the else branch and tried to assign a string as folder_to_use, likely crashing downstream. The new raise TypeError is strictly better defensive behavior, but it is a runtime change. The comment says "Defensive: assert is stripped under python -O so use a real raise" — this rationale is correct and worth keeping.


corpuses/models.py# type: ignore[arg-type] at user=user

user=user,  # type: ignore[arg-type]

This suppress appears because import_document() now has user: "User" (the concrete model) while the caller in corpus.import_content() passes a differently-typed user. Since corpuses is still under ignore_errors = True, this is fine as a bridge, but it would be worth leaving a # TODO: fix when corpuses is graduated (#1447) note so it doesn't get forgotten. The suppress is correct as a stopgap.


Three newly surfaced baseline errors

The honest accounting is appreciated — the 3 errors hidden by the old ignore_errors blocks that are now visible and properly baselined:

opencontractserver/discovery/views.py:769: error: "QuerySet[Document, Document]" has no attribute "search_by_embedding"
opencontractserver/examples/structured_response_example.py:244: ...
opencontractserver/feedback/models.py:13: ...

The discovery/views.py:769 error is worth a second look before this merges — search_by_embedding missing on the QuerySet could indicate a real bug (method defined on a custom manager that isn't exposed via the queryset), not just a django-stubs gap. The other two look like straightforward annotation omissions.


Small nit: checks.py description vs. diff

The PR description says "typed three errors accumulators (one widened to list[Error | Warning])" but the diff touches four functions. The fourth (check_embedding_batch_size_configuration) must have already had a Warning-only annotation that got widened here. The description is slightly misleading but the code is correct.


Everything that looks good

  • TYPE_CHECKING guard + quoted forward refs in versioning.py — correct pattern, avoids the circular-import risk of the old module-level User = get_user_model().
  • Literal['UNSET'] for the sentinel — well-typed, matches the codebase convention.
  • folder_to_use: Optional[CorpusFolder] declaration before the branch — correct narrowing technique.
  • all_components: list[Any] = [] and result: dict[str, list[Any]] = {} — appropriate use of Any for heterogeneous collections where a proper union type would be cumbersome.
  • field: Any = Document._meta.get_field(...) in tests — correct; django-stubs returns Field (abstract) and field-specific attributes aren't available without a cast.
  • assert computed_hash is not None narrowing in the test — strengthens the test while satisfying mypy.
  • (old_hash or '')[:8] coalesce — correct guard; old_hash is Optional[str] and slicing None would fail.
  • group_components rename in _list_components — necessary to prevent the reassignment from conflicting with the tuple type; the list(group_components) copy is negligible in a management command.
  • The scoped disable_error_code = django-manager-missing for documents.models with a clear comment explaining why (django-tree-queries reverse relations) — consistent with the same pattern used for users.models.

Verdict: Ready to merge with the discovery/views.py:769 question resolved (or explicitly deferred). The runtime changes are improvements, not regressions. The typing work is solid.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant