Skip to content

Migrate codebase to v2 PAWLs as canonical format end-to-end#1488

Open
JSv4 wants to merge 20 commits intomainfrom
claude/analyze-pawls-versions-ke9Pp
Open

Migrate codebase to v2 PAWLs as canonical format end-to-end#1488
JSv4 wants to merge 20 commits intomainfrom
claude/analyze-pawls-versions-ke9Pp

Conversation

@JSv4
Copy link
Copy Markdown
Collaborator

@JSv4 JSv4 commented May 3, 2026

Summary

This PR completes the migration to PAWLs v2 as the canonical runtime format across the entire codebase (backend and frontend). v1 is now accepted only at well-defined I/O boundaries and is never exposed in active runtime code paths.

Key Changes

Backend

  • New pawls_io module (opencontractserver/utils/pawls_io.py): Single import boundary for all PAWLs data entering the system at runtime

    • to_canonical_v2(): Normalizes v1 (list) or v2 (dict) wire input to canonical v2 dict
    • load_canonical_v2(): Reads from Django FieldFile, file-like, Path, or pre-decoded JSON and returns v2
    • TokenView / PageView / iter_pages(): Zero-copy read-views over v2 token rows with named attribute access (no index magic)
    • to_v1_pages(): Converts v2 → v1 only at export boundaries (document export, plasmapdf hand-off)
  • Type definitions (opencontractserver/types/dicts.py): Added CompactImageMetaType, CompactTokenType, CompactPageType to describe canonical v2 shape; marked v1 types (PawlsTokenPythonType, PawlsPagePythonType) as import-boundary-only

  • Consumer migration: Updated all active code paths to use load_canonical_v2() and read-views instead of expand_pawls_pages():

    • multimodal_embeddings.py: Uses TokenView for image token extraction
    • image_tools.py: Uses iter_pages() and PageView for image listing
    • annotations/utils.py: Uses load_canonical_v2() for content modality detection
    • etl.py, doc_tasks.py, data_extract_tasks.py, etc.: All switched to v2 load boundary
  • Tests (opencontractserver/tests/test_pawls_io.py): Comprehensive coverage of format normalization, I/O, read-views, and round-trip equivalence

Frontend

  • Refactored compactPawls.ts:

    • Renamed expandPawlsPages()decodeV2Pawls() to clarify it decodes wire format to canonical v2-shape objects
    • Renamed isCompactPawlsFormat()isV2WirePawls() for clarity
    • Accepts both v1 (array) and v2 ({v:2, p:[…]}) wire input; always returns CompactPage[] (v2-canonical)
    • v1 normalization is internal — no v1 shape leaks past the decoder
  • Type definitions (frontend/src/components/types.ts):

    • Renamed TokenCompactToken, PageTokensCompactPage
    • CompactToken always has isImage (camelCase, not is_image); image metadata lives on imageMeta using v2 short keys
    • CompactPage has flat fields (index, width, height, tokens) — no nested page object
  • Consumer updates: All components updated to use CompactPage / CompactToken and decodeV2Pawls():

    • transform.tsx, pdf.ts, rest.ts, cachedRest.ts, utils.ts
    • Tests updated to match new shape
  • Cache invalidation (documentCacheManager.ts): Bumped DB version to v3 to clear stale v1-shaped entries

I/O Boundaries

v1 is accepted only at these four points:

  1. Backend import: pawls_io.load_canonical_v2() — accepts v1 or v2, returns v2
  2. Backend export: `pawls_io.to_

https://claude.ai/code/session_01PNdtRXsw1NZEaRKfipqXyn

claude added 4 commits May 3, 2026 02:51
Phase 1 of v1 -> v2 internal migration:

- New `opencontractserver/utils/pawls_io.py` is the single load
  boundary. `to_canonical_v2()` accepts v1 (list) or v2 (dict),
  always returns a v2 dict, raises on garbage. `load_canonical_v2()`
  reads from FieldFile, file-likes, or paths and normalizes.
  `TokenView`/`PageView` give zero-copy attribute access over v2 rows.
- Boundary-strict semantics: pathological v1 input that would silently
  fall back to v1 in `compact_pawls_pages` now raises, preventing
  v1 from leaking past the load boundary into active code paths.
- Five import/persist sites swapped from `compact_pawls_pages` to
  `to_canonical_v2`: parser persist, legacy import, v2 import, async
  import task, worker upload task. Same on-disk result for normal
  inputs.
- New `CompactPawlsV2Type`, `CompactPawlsPageType`, `CompactPawlsTokenType`,
  `CompactImageMetaType` aliases. Existing `PawlsPagePythonType` and
  `PawlsTokenPythonType` annotated as v1-wire-only.
- 17 new tests in `test_pawls_io.py`; existing `test_compact_pawls`
  unchanged and passing.

Internal consumers of `expand_pawls_pages` are untouched in this
phase. Frontend untouched. Parsers untouched (they emit v1 internally;
the persist boundary normalizes).
Phase 2 of v1 -> v2 internal migration. All 18 active call sites
that previously consumed v1 PAWLs via `expand_pawls_pages` now load
canonical v2 via `load_canonical_v2` and walk it through `iter_pages`
+ `TokenView`/`PageView` (or directly on the v2 dict).

Migrated consumers:
- `shared/decorators.py` (sync + async doc analyzer task injectors)
- `tasks/doc_tasks.py` (FUNSD export pipeline)
- `tasks/data_extract_tasks.py` (LLM extraction grounding)
- `annotations/{utils,models}.py` and the `populate_content_modalities`
  management command
- `llms/tools/image_tools.py` and `llms/tools/core_tools/{search,
  document_indexing,annotations}.py`
- `utils/{pdf_token_extraction,multimodal_embeddings,
  extraction_grounding,importing}.py`
- `pipeline/base/parser.py` (passes the v2 result of `to_canonical_v2`
  into `import_annotations`)

`expand_pawls_pages` retains 7 deliberate callers, all at external
boundaries:
- 5 plasmapdf `build_translation_layer` hand-offs (decorators ×2,
  three core_tools modules, extraction_grounding) — third-party API
  still consumes v1 `PawlsPagePythonType` lists.
- 2 export wire-format payload assemblers (`utils/etl.py`,
  `utils/export_v2.py`) — the `OpenContractDocExport.pawls_file_content`
  contract is documented v1 and unchanged here.

Each remaining call site has an inline comment marking it as a
deliberate v2->v1 hand-off so Phase 4's reaper can decide between
deleting them (after plasmapdf/export wire migrations) or moving them
behind a permanent `pawls_io.to_v1_pages` adaptor.

Test fixtures for `test_annotated_document_import`,
`test_task_decorators`, and `test_async_task_decorators` updated to
match the v2 in-memory contract injected by the migrated decorators
and lazy loaders.

Behavior unchanged for end users: same on-disk output, same export
wire format, same plasmapdf hand-off shape. Internal in-memory
representation is now v2 everywhere except at the named boundaries.
Phase 3 of v1 -> v2 migration. The frontend now operates on
v2-canonical typed objects in memory. The decoder is the only
place v1 wire input is tolerated (some on-disk files are still
v1 because the migration deliberately does not backfill).

Decoder rewrite (`frontend/src/utils/compactPawls.ts`):
- New API: `decodeV2Pawls(json: unknown): CompactPage[]` accepts
  v1 (top-level array) or v2 (`{v: 2, p: [...]}`) wire input and
  always returns v2-shape `CompactPage[]`. Throws on garbage.
- `isV2WirePawls(value)` type guard for the wire shape.
- `expandPawlsPages` and `isCompactPawlsFormat` removed entirely.

Type system (`frontend/src/components/types.ts`):
- v1 `Token`, `PageTokens`, `Page` removed.
- Replaced with `CompactPage`, `CompactToken`, `CompactImageMeta`.
- Page is flat (`page.width`, not `page.page.width`).
- Image fields collapsed into `CompactImageMeta` keyed with v2
  short keys (`p`, `b64`, `f`, `ch`, `ow`, `oh`, `it`).
- `is_image` -> `isImage` (always set).

Hot-path migrations:
- `annotator/types/pdf.ts` (PDFPageInfo): `tokens: CompactToken[]`,
  `t.is_image` -> `t.isImage`.
- `utils/transform.tsx`: `normalizeTokensToPdfViewport` and
  `resolvePageTokens` retyped; flattened page-shape access.
- `annotator/api/rest.ts`, `annotator/api/cachedRest.ts`: return
  type now `Promise<CompactPage[]>`.

The other recon-listed files (PDFPage, SelectionLayer,
textBlockEncoding, useTextSearch, SelectionTokens,
SelectionTokenGroup, DocumentAtom) needed no changes -- they
access tokens via fields whose names are unchanged on
`CompactToken` (`x`, `y`, `width`, `height`, `text`).

Tests:
- `compactPawls.test.ts`, `transform.normalizeTokens.test.ts`,
  `pdf.test.ts` updated to v2 wire fixtures and the new
  in-memory shape. Decoder kept one v1-wire test case to verify
  v1 tolerance.
- `yarn tsc --noEmit` clean. 61 / 61 unit tests pass.

IndexedDB cache invalidation (`services/documentCacheManager.ts`):
- DB_VERSION bumped 2 -> 3. The pawls in-memory shape changed,
  so any cached entries from before this commit would deserialize
  into v1-shape objects and break consumers that now expect v2.
- `onupgradeneeded` clears both stores on the 2 -> 3 transition.
  PDFs and text files are cheap to re-fetch.
Phase 4 of v1 -> v2 migration. Cleans up the seven remaining
`expand_pawls_pages` callers (all at deliberate external
boundaries: plasmapdf hand-off + export wire format) by routing
them through a permanent named adaptor.

`pawls_io.to_v1_pages(canonical_v2)`:
- Boundary-only v2 -> v1 adaptor. Docstring restricts use to two
  documented boundaries: plasmapdf `build_translation_layer` and
  export wire-format payload assembly
  (`OpenContractDocExport.pawls_file_content`,
  `StructuralAnnotationSetExport.pawls_file_content`).
- Active runtime code MUST NOT use it. Operate on the canonical
  v2 dict (or `PageView`/`TokenView`) instead.

Routed seven boundary call sites from
`compact_pawls.expand_pawls_pages` to `pawls_io.to_v1_pages`:
- `shared/decorators.py` (sync + async doc analyzer hand-off, ×2)
- `llms/tools/core_tools/{search,document_indexing,annotations}.py`
- `utils/extraction_grounding.py`
- `utils/{etl,export_v2}.py` (export wire-format assemblers)

After this commit, no active code in `opencontractserver/` (outside
`pawls_io.py`'s own internal use, the legacy
`compact_pawls.expand_pawls_pages` implementation, and the v1<->v2
round-trip tests) imports `expand_pawls_pages` directly. Each
boundary site has the inline comment naming the boundary intent.

Doc rewrite (`docs/architecture/pawls-format.md`):
- "Accessor Layer" section -> "Boundary Layer". Documents the four
  v1 boundaries (backend import / backend export / plasmapdf /
  frontend wire), the canonical v2 contract for active code, the
  strict 100k-tokens-per-page failure mode, and the new
  `pawls_io.py` API. Read/write path tables updated. Implementation
  table now lists `pawls_io.py` as the boundary module and
  `compact_pawls.py` as low-level codec.

CHANGELOG entry under [Unreleased] / Changed summarizing the
end-to-end migration.

Verified locally: `python3 -m py_compile` clean on all touched
backend files; `yarn tsc --noEmit` clean; `yarn test:unit`
68 / 68 pass on `compactPawls`, `transform.normalizeTokens`,
`useTextSearch`, `pdf`, `documentCacheManager` modules. Backend
test suite must be run in Docker (`docker compose -f test.yml run
django pytest opencontractserver/tests/test_pawls_io.py
opencontractserver/tests/test_compact_pawls.py …`); Docker daemon
was unavailable in this sandbox.
Comment thread opencontractserver/utils/pawls_io.py Fixed
@claude
Copy link
Copy Markdown

claude Bot commented May 3, 2026

Code Review — PR #1488: Migrate codebase to v2 PAWLs as canonical format

Overview

This is a substantial, well-scoped architectural migration. The core design — a single pawls_io module as the sole v1 acceptance boundary, zero-copy TokenView/PageView read-views, and explicit to_v1_pages() at external boundaries — is clean and the rationale is well documented. The changelog, architecture docs, and inline comments all reflect high care. Below are issues worth resolving before merge.


Issues

1. Double import in decorators.py — will fail isort

shared/decorators.py has two separate from opencontractserver.utils.pawls_io import lines:

from opencontractserver.utils.pawls_io import to_v1_pages
from opencontractserver.utils.etl import is_dict_instance_of_typed_dict
from opencontractserver.utils.pawls_io import load_canonical_v2

These should be a single import. As-is this will fail the isort pre-commit hook.


2. dict shortcut bypasses validation in two places

Both _resolve_v2_pawls (multimodal_embeddings.py) and the inline normalization in annotations/utils.py skip to_canonical_v2() for dict inputs:

# multimodal_embeddings.py
if isinstance(pawls_data, dict):
    return pawls_data  # ← no v2 validation

# annotations/utils.py
elif isinstance(pawls_data, dict):
    canonical = pawls_data  # ← same

A v1-era dict, a completely unrelated dict, or a malformed v2 dict would all pass through unchecked. The fix is straightforward — call to_canonical_v2(pawls_data) instead, which is already idempotent for valid v2 inputs:

if isinstance(pawls_data, dict):
    return to_canonical_v2(pawls_data)  # validates and is idempotent

3. CompactPawlsTokenType = list is dangerously permissive typing

In types/dicts.py:

CompactPawlsTokenType = list

This accepts any list — including empty lists, lists of strings, etc. The actual contract ([x, y, w, h, text] or [x, y, w, h, text, CompactImageMetaType]) is only in comments. A tighter alias would catch bugs at type-check time:

# closer approximation; or use a TypeAlias + Union if desired
CompactPawlsTokenType = list[Union[float, str, "CompactImageMetaType"]]

Even if perfect precision isn't achievable, the current alias provides no safety at all.


4. Behavior change in worker_uploads/tasks.py is a silent breaking point for large documents

The old compact_pawls_pages(pawls_content) silently fell back to v1 when any page exceeded COMPACT_PAWLS_MAX_TOKENS_PER_PAGE (100,000 tokens). The new to_canonical_v2(pawls_content) raises ValueError in that case — which bubbles up into a failed upload task. The CHANGELOG and architecture doc document this intent, but the call site in worker_uploads/tasks.py has no try/except and no fallback:

pawls_file = ContentFile(
    json.dumps(to_canonical_v2(pawls_content)).encode("utf-8"),  # ← raises on >100k tokens
    name="pawls_tokens.json",
)

This is likely fine in practice (100k-token pages are pathological), but given that v1 files may still be uploaded by external tools, it may be worth adding a narrow except ValueError with a specific error message and task failure rather than an uncaught exception that produces a generic Celery traceback.


5. CompactPawlsV2Wire exported from compactPawls.ts — leaks internal wire type

The module comment explicitly says internal wire-shape types "never leak past the decoder", but CompactPawlsV2Wire is exported:

export interface CompactPawlsV2Wire {   // ← public export
  v: 2;
  p: CompactPageV2Wire[];
}

If this is intentional (e.g. for the isV2WirePawls type guard return type) it should be noted in the JSDoc. Otherwise it should be interface only (no export) to match the stated design intent.


6. _v2_image_token_to_v1_dict in multimodal_embeddings.py — track cleanup

This backward-compat adapter converts v2 TokenView → v1 long-key dict so downstream embedders continue to receive the shape they were written for. The comment says "Phase 2 will migrate them off v1 entirely". This function and its callers are the one place where v1-shape survives inside an internal module — worth tracking in a follow-up issue so it doesn't become forgotten dead code.


7. Minor: opener("rb") if was_closed else None — use if for side effects

In pawls_io._read_text_from_source:

opener("rb") if was_closed else None

Using a conditional expression for a side effect is unusual and None as the else branch is a code smell. Standard Python idiom is:

if was_closed:
    opener("rb")

Positive Notes

  • Single-entry-point design and the four-boundary rule are clear and well-enforced across almost all call sites.
  • TokenView/PageView using __slots__ is a good performance choice for hot paths.
  • to_canonical_v2 being idempotent makes defensive calls safe everywhere.
  • The IndexedDB upgrade hook correctly clears both stores to force re-decode — this is easy to get wrong.
  • The test round-trip (expand_pawls_pages(to_canonical_v2(v1)) == expand_pawls_pages(v1)) is exactly the right invariant to pin.
  • Inline boundary-intent comments at each to_v1_pages() call site are valuable for future readers.

Summary

Items 1 (isort failure), 2 (dict bypass), and 3 (weak typing) should be fixed before merge. Items 4–7 are lower severity but worth addressing or explicitly tracking.

…rrow worker_uploads error, depublish CompactPawlsV2Wire, link cleanup TODO to #1490
@claude
Copy link
Copy Markdown

claude Bot commented May 3, 2026

Code Review: Migrate codebase to v2 PAWLs as canonical format end-to-end

This is a well-structured architectural migration. The boundary-layer approach — v2 everywhere internally, v1 only at four named I/O seams — is the right call, and the PR executes it consistently across a large surface area. Below are findings ranging from bugs to nits.


Potential Bugs

1. list_document_images doesn't normalize v1 input, but get_document_image does

get_document_image has:

elif isinstance(pawls_data, list):
    pawls_data = to_canonical_v2(pawls_data)

list_document_images passes the caller-supplied pawls_data straight to iter_pages(), which requires canonical v2 and raises if given a v1 list. If any existing caller passes a v1 list to list_document_images, it silently worked before and now throws. The API contract between the two sibling functions is inconsistent.

2. Frontend: decodeV2Pawls is now throw-on-bad-input instead of return-[]

normalizeV1Page throws when a page entry isn't an object:

if (typeof page !== "object" || page === null) {
  throw new Error("Invalid v1 PAWLs page: ...");
}

The callers in rest.ts / cachedRest.ts don't catch:

return axios.get(url).then((r) => decodeV2Pawls(r.data));

If the server returns malformed JSON (or an unexpected shape during a rolling deploy), the promise rejects instead of resolving to []. The previous expandPawlsPages returned [] for unrecognized formats. Whether the new error-propagating behavior is desired is a deliberate choice, but it's worth explicitly noting as a breaking-behavior change and confirming higher-level error boundaries exist.


DRY / Architecture

3. _IMAGE_KEY_REVERSE imported from a private symbol in compact_pawls.py

from opencontractserver.utils.compact_pawls import (
    _IMAGE_KEY_REVERSE,   # <-- leading underscore = internal
    compact_pawls_pages,
)

pawls_io.py uses it in TokenView.image_meta_v1. Either promote _IMAGE_KEY_REVERSE to a public name in compact_pawls.py (or move it to constants/pawls.py), or inline the small reverse-mapping dict in pawls_io.py to cut the dependency on a private symbol.

4. Duplicate v2→v1 image-token adaptor (violates DRY)

image_tools.py defines _v2_token_to_v1_image_dict(token: TokenView) and multimodal_embeddings.py defines the nearly identical _v2_image_token_to_v1_dict(token: TokenView). Both are bridge functions until get_image_as_base64/get_image_data_url are migrated. Since TokenView lives in pawls_io, the shared adaptor can live there (e.g., token_view_to_v1_dict(view: TokenView) -> dict) and be imported by both callers — CLAUDE.md is explicit about DRY.


Inconsistent v2 Access Patterns

5. populate_content_modalities.py and annotations/utils.py bypass the PageView/iter_pages API

Both files were migrated to call load_canonical_v2() but then do manual pages = canonical.get("p") or [] / rows = page_dict.get("t") or [] dict-walking instead of using the PageView/iter_pages read-views the PR introduces. This undermines the point of the ergonomic layer and creates two divergent access styles in the same PR. Since these files already import TokenView, they should use iter_pages + page.tokens for consistency.


Minor Issues

6. Redundant assert after ValueError guard in to_canonical_v2

if not is_compact_pawls_format(result):
    raise ValueError(...)
assert isinstance(result, dict)   # redundant — assert is disabled with -O
return result

The assert is dead code under python -O. Remove it or replace with a plain if … raise.

7. to_v1_pages docstring says "Accepts pre-decoded v2 dicts" but the implementation calls expand_pawls_pages which transparently passes v1 lists through

This could mislead callers into thinking passing a v1 list is an error. Either tighten the implementation to validate the input is v2, or update the docstring to acknowledge the pass-through behavior.

8. Sentinel value change in annotations/models.py is subtle

# Before:
pawls_data = []   # falsy, prevents retry
# After:
pawls_data = {}   # falsy, prevents retry

Technically correct (both are falsy), but the comment says "normalize to v2 and short-circuit on empty input." An empty dict is not a valid v2 PAWLs dict ({"v":2,"p":[]} would be), so _resolve_v2_pawls({}) correctly returns None. Still, using a typed sentinel (e.g., pawls_data = {"v": 2, "p": []}) would be more explicit about intent.


What's Good

  • The four documented v1 boundaries in the architecture doc are clear and make auditing easy.
  • TokenView.__slots__ is a nice touch for a hot-path accessor.
  • image_meta_v1 property on TokenView is a clean bridge pattern.
  • test_pawls_io.py is thorough — the round-trip equivalence test in particular is a good regression guard.
  • IndexedDB version bump with the upgrade-hook that clears stores is exactly right.
  • The inline comments at every to_v1_pages call site explaining the boundary intent are valuable.
  • Worker upload now raises loudly on unprocessable PAWLs instead of silently writing garbage — correct behavior.

Summary

The two items worth addressing before merge are: the inconsistent v1-normalization gap between list_document_images and get_document_image (#1), and the DRY violation for the image-token adaptor (#4). The private-import of _IMAGE_KEY_REVERSE (#3) and the inconsistent access-pattern style (#5) are worth fixing but not blocking.

JSv4 added 2 commits May 3, 2026 09:27
In _compact_token, the 6th-element image_meta dict was only appended when
non-empty, which silently dropped the is_image semantic for image tokens
that carried no metadata fields. The v2 contract is 'dict at index 5 ==
is_image', so we now append the dict unconditionally for image tokens.

Also align test_annotation_window_pdf fixture with v2 positional page
indexing — the mock pawls page is now at index 0 to match the
annotation's pageIndex 0 reference.
- Move duplicate _v2_*_to_v1_dict helpers from image_tools.py and
  multimodal_embeddings.py into a single token_view_to_v1_image_dict in
  pawls_io.py (DRY).
- Promote compact_pawls._IMAGE_KEY_REVERSE to a public alias
  IMAGE_META_V2_TO_V1 so pawls_io stops reaching into a private symbol.
- Drop the now-redundant assert in to_canonical_v2 (already proven by the
  is_compact_pawls_format guard); use cast() for type narrowing instead.
- Document the v1-list pass-through behavior of to_v1_pages so callers
  don't expect strict v2-only validation.
@claude
Copy link
Copy Markdown

claude Bot commented May 3, 2026

Code Review — PR #1488: Migrate codebase to v2 PAWLs as canonical format end-to-end

Overview

This is a well-scoped, well-executed architectural migration. The core idea — define four explicit I/O boundaries where v1 is tolerated, and treat v1 appearing anywhere else as a bug — is a significant improvement over the previous "accessor layer" approach that silently converted in every consumer. The pawls_io module is a clean single-entry-point, the read-views (TokenView/PageView) are ergonomic and zero-copy, and the documentation/changelog updates are exemplary.


Positive highlights

  • pawls_io.py is excellent: Single responsibility, well-documented, correctly separates the I/O boundary from the format normalization. The __slots__ on TokenView/PageView is a thoughtful micro-optimization.
  • _compact_token fix is a genuine semantic bug fix: Previously, an image token with no metadata fields produced a 5-element array, which was indistinguishable from a text token on the v2 read path. Appending an empty {} unconditionally for image tokens corrects the round-trip.
  • Cache invalidation is correct: Bumping DB_VERSION to 3 with an onupgradeneeded store-clear is the right approach for a breaking in-memory shape change.
  • Frontend v1 normalization is complete: decodeV2Pawls faithfully remaps is_imageisImage, image_pathimageMeta.p, etc. Tests cover all field mappings.
  • token_view_to_v1_image_dict is honest: Clearly labeled a Phase-2 bridge with a tracking issue (Retire _v2_image_token_to_v1_dict adaptor: migrate embedders + image helpers to v2 directly #1490). This is the right pattern — don't pretend the migration is 100% done when two callers still need v1 shapes.

Issues and suggestions

1. data_extract_tasks.py uses .path — breaks cloud storage

# data_extract_tasks.py ~line 795
pawls_canonical = load_canonical_v2(doc.pawls_parse_file.path)

Every other migration site uses doc.pawls_parse_file (the FieldFile itself), which goes through _read_text_from_source's FieldFile-aware path and works with S3/GCS backends. .path forces a local filesystem read and will raise NotImplementedError on remote storage backends. This should be load_canonical_v2(doc.pawls_parse_file).

2. populate_content_modalities.py bypasses iter_pages() inconsistently

# populate_content_modalities.py ~line 112
pages = pawls_data.get("p") or []
# ...then walks page_dict["t"] manually

The rest of the codebase uses iter_pages() or TokenView for access. This inline dict-walking with pages[page_idx].get("t") re-implements what iter_pages/PageView.tokens already provide. Since load_canonical_v2 was called just above, using iter_pages(pawls_data) would be cleaner and consistent — though functionally equivalent.

3. annotations/utils.py type weakening

# Before
def compute_content_modalities(
    ...
    pawls_data: Optional[list[dict[str, Any]]] = None,
# After
    pawls_data: Optional[Any] = None,

The weakened type is pragmatic (it now accepts either v1 list or v2 dict), but Optional[Union[list, dict[str, Any]]] would be more informative and still accurate. The docstring update helps, but the type itself could be tighter.

4. iter_pages has redundant version check

def iter_pages(canonical_v2):
    if not is_compact_pawls_format(canonical_v2):  # already checks v == 2
        raise ValueError(...)
    if canonical_v2.get("v") != COMPACT_PAWLS_VERSION:  # redundant
        raise ValueError(...)

is_compact_pawls_format already validates v == COMPACT_PAWLS_VERSION (which is 2). The second guard is unreachable — if the first passes, v is already verified. Minor, but confusing.

5. PageView.tokens is a generator — can only iterate once

@property
def tokens(self) -> Iterator[TokenView]:
    for row in self._page.get("t", []):
        if isinstance(row, list):
            yield TokenView(row)

Callers that do list(page.tokens) are fine, but if anyone tries to iterate page.tokens twice (e.g., checking count then iterating), the second pass yields nothing. The image_tools.py usage enumerate(page.tokens) works correctly, but PageView.__repr__ correctly avoids calling tokens (uses raw len). Consider returning list[TokenView] instead of a generator, or keeping the generator but typing it as Iterable[TokenView] to make the single-pass contract explicit. At minimum, add a note in the docstring.

6. to_canonical_v2 comment on the fallback path is slightly misleading

# ``is_compact_pawls_format`` guard above already proves the result is a dict.
return cast(dict[str, Any], result)

The comment says "the is_compact_pawls_format guard above already proves..." but the guard is checked on the result of compact_pawls_pages(raw), not before it. The comment is accurate in spirit — the if not is_compact_pawls_format(result): raise just above means by the time we reach cast(...), the format is proven. The comment placement is just slightly confusing.

7. Frontend: isImage is now a required field

Making isImage: boolean non-optional on CompactToken is the right call, but any test helpers or fixture builders outside this PR that construct CompactToken objects without isImage will break at compile time. The PR updates all known callers; just confirm CI TypeScript build passes fully.


Test coverage

  • test_pawls_io.py is thorough: covers format normalization, I/O from BytesIO/StringIO/pre-decoded, view ergonomics, and round-trip equivalence. 17 tests is appropriate.
  • Gap: There are no tests for to_v1_pages directly. It delegates to _expand_pawls_pages, which has existing coverage, so this is low-risk — but a single smoke-test would confirm the delegation works correctly.
  • Gap: token_view_to_v1_image_dict has no test. Given it's a temporary bridge, a unit test would prevent regressions while the migration is incomplete.
  • Frontend tests: Good coverage of v1↔v2 decode symmetry, including the new strict-throw behavior on garbage input.

Minor nits

  • The alt_text: None comment in image_tools.py ("alt_text not currently persisted in v2 image meta") is worth a follow-up issue if alt text was previously preserved.
  • _read_text_from_source has a documentation note that raw JSON strings are not accepted as str inputs (only file paths). Consider adding a runtime guard: if source.strip().startswith(('[', '{'))raise TypeError("Pass pre-decoded JSON as list/dict, not a raw JSON string"). Currently a raw JSON string would raise a confusing FileNotFoundError.
  • The load_canonical_v2 docstring says it accepts "Pre-decoded JSON (list or dict)" — this is handled by the isinstance(file_or_path, (list, dict)) shortcut at the top. Clean.

Summary

Approve with the cloud-storage bug fixed (item #1). The .path usage in data_extract_tasks.py is the only correctness issue — everything else is polish. The architecture is sound, the boundary enforcement is strict where it matters, and the documentation tells a complete story. Once item #1 is addressed and CI is green, this is ready to merge.

JSv4 added 6 commits May 3, 2026 15:06
CHANGELOG conflict resolved: kept both PAWLs v2 migration entry under
[Unreleased] -> Changed and the new entries from main. Removed duplicate
EmbedderSelector entry that was under both my section and main's section.
…pages usage, tighten types

Latest review (Claude bot, 2026-05-03):

- data_extract_tasks.py:796 — pass FieldFile directly to load_canonical_v2
  instead of .path. .path raises NotImplementedError on S3/GCS backends; the
  load helper handles FieldFile, file-like, and Path uniformly.
- populate_content_modalities.py + annotations/utils.py — replaced manual
  v2 dict-walking (pages.get('p'), page.get('t')) with iter_pages() +
  PageView.tokens for consistency with the rest of the migrated code.
- annotations/utils.py — tightened pawls_data signature from Optional[Any]
  to Optional[Union[list[dict[str, Any]], dict[str, Any]]] for both
  compute_content_modalities and update_annotation_modalities.
- pawls_io.iter_pages — removed redundant version check (is_compact_pawls_format
  already enforces v == COMPACT_PAWLS_VERSION); dropped now-unused import.
- pawls_io.PageView.tokens — docstring now explicitly notes the iterator
  is single-pass; new test pins the contract.
- pawls_io.to_canonical_v2 — corrected misleading 'guard above' comment;
  the is_compact_pawls_format check is below, not above, the compact call.
- pawls_io._read_text_from_source — added typed-error guard rejecting raw
  JSON-string inputs (caller hands in a list/dict instead of a path).
- New tests: ToV1PagesTests, TokenViewToV1ImageDictTests,
  test_load_raw_json_string_raises_typeerror, test_tokens_property_is_single_pass.
Frontend (compactPawls.test.ts): 12 new branch-coverage tests for the v1/v2
decoder — non-object page entries, missing/non-numeric width/height,
non-array tokens, non-string text, image tokens with no metadata, and
v2 wire edge cases.

Backend (test_multimodal_embeddings_utils.py): 18 new tests across three
classes:
- TestResolveV2Pawls: exhaustive coverage of _resolve_v2_pawls — None /
  empty list / empty dict / empty string / unsupported type / invalid v2
  dict / malformed v2 pages, plus v1->v2 normalization and v2 idempotence.
- TestGetAnnotationImageTokensV2Paths: v2 canonical dict input, garbage
  pawls_data fallback, structural_set v2 load path.
- TestExtractAndStoreAnnotationImages: full coverage of the previously
  untested extract_and_store_annotation_images — None / empty / garbage
  / no-image-tokens / OOB page / OOB token / non-dict page / v1 auto-
  normalize / valid v2 dict produces a non-empty image_content_file.

These plug the largest patch-coverage gaps reported by codecov on PR #1488
(multimodal_embeddings.py was at 43%; compactPawls.ts was at 73%).
…> 93%)

The codecov report on PR #1488 flagged the populate_content_modalities
management command at 0% patch coverage with 15 lines missing — the
command had no test file at all. New test file at
opencontractserver/tests/test_populate_content_modalities_command.py adds
11 end-to-end tests via call_command():

- --dry-run does not modify rows
- --force reprocesses pre-set annotations
- Default mode skips already-populated rows
- PAWLs path: image-only, text-only, mixed annotations
- No-document fallback to label-text hint (image keyword vs neutral)
- PAWLs load failure falls back to label hint
- Empty token refs fall back to label hint
- Processing exceptions increment error_count and report 'Errors: 1'

Final coverage: 93% (77/83 statements). The 6 remaining lines are minor
defensive branches (out-of-bounds index skips, empty-modality default)
already exercised at the helper level in test_annotation_utils.py.
mypy.ini gains an ignore_errors entry for the new
test_populate_content_modalities_command test module — same baseline
treatment used for all other Django TestCase modules in the project
that rely on the setUpTestData class-attribute pattern (mypy can't
follow Django's deferred classmethod attribute assignment without
per-test refactoring).

Other diffs are black/isort line-collapse fixups.
@claude
Copy link
Copy Markdown

claude Bot commented May 3, 2026

Code Review — PR #1488: Migrate codebase to v2 PAWLs as canonical format end-to-end

Overview

This is a well-planned, large-scale refactoring that establishes v2 PAWLs as the canonical runtime format end-to-end. The boundary-module approach in pawls_io.py is the right architectural call — a single load point that normalizes v1/v2 at the edges keeps the invariant enforceable and auditable. The breadth of coverage (18 backend consumers + full frontend decode pipeline) is impressive, and the accompanying tests are thorough.


What's Done Well

  • Boundary module designpawls_io.py as the single normalization entry point is clean. The v1-accepted-only-here invariant is enforced by raising ValueError when the compaction fallback triggers, rather than leaking v1 silently. That is the right trade-off.
  • TokenView / PageView — zero-copy attribute wrappers over the positional arrays eliminate index-magic across the codebase without allocating new dicts.
  • Frontend decoderdecodeV2Pawls accepting both v1 and v2 wire input and always returning CompactPage[] is a clean API. The internal normalizeV1Page to normalizeV2Page split is easy to follow.
  • IndexedDB version bump — bumping to v3 with a clear() upgrade hook is exactly right; stale v1-shape cached entries would have broken consumers silently otherwise.
  • Test coveragetest_pawls_io.py (17 tests) covers boundary semantics, round-trip equivalence, I/O variants, and read-view ergonomics. The frontend compactPawls.test.ts additions cover v1 legacy tolerance thoroughly.
  • Documentation — the "Boundary Layer" section in pawls-format.md clearly states the four allowed v1 boundaries and the strict-failure contract. Very useful for future contributors.

Issues and Suggestions

1. to_v1_pages does not enforce v2-only input at the call site

to_v1_pages is documented as a "boundary-only v2 to v1 adaptor" but it delegates directly to expand_pawls_pages, which short-circuits on a list (v1 pass-through). Callers can accidentally pass v1 data and get it back unchanged, which violates the boundary contract.

Suggestion: add a runtime guard in pawls_io.py:

def to_v1_pages(canonical_v2):
    if not isinstance(canonical_v2, dict) or not is_compact_pawls_format(canonical_v2):
        raise TypeError(
            "to_v1_pages expects a canonical v2 dict; call to_canonical_v2() first."
        )
    return _expand_pawls_pages(canonical_v2)

The existing ToV1PagesTests.test_v1_list_input_passes_through would need updating to expect the error, but that test is currently documenting the wrong behavior.

2. JSON-string heuristic in _read_text_from_source could misfire

The guard that rejects raw JSON strings uses a startswith heuristic:

if isinstance(source, str) and source.lstrip().startswith(("[", "{")):
    raise TypeError(...)

A GCS/S3-style filesystem path beginning with {bucket}/path/file.json would trigger this guard and raise TypeError instead of opening the file. The failure mode would be confusing. Consider accepting str inputs unconditionally as filesystem paths (as the docstring documents them) and letting open() give the natural FileNotFoundError for non-existent paths.

3. pawls_data = {} sentinel is implicitly falsy

In annotations/models.py, the error sentinel changed from [] to {}:

except Exception as e:
    logger.error(...)
    pawls_data = {}  # Sentinel to prevent repeated attempts

This works because if pawls_data: is False for an empty dict. However, {} is also a valid (empty) v2 canonical dict that to_canonical_v2 would reject with ValueError. If the truthiness check ever changes, the sentinel could be misinterpreted. A named sentinel constant would be clearer:

_PAWLS_LOAD_FAILED = object()
# ...
pawls_data = _PAWLS_LOAD_FAILED
# ...
if pawls_data is not _PAWLS_LOAD_FAILED and pawls_data:

4. _resolve_v2_pawls belongs in pawls_io.py, not multimodal_embeddings.py

_resolve_v2_pawls is a private normalization wrapper over to_canonical_v2 with error-swallowing semantics. Its home in multimodal_embeddings.py is surprising and makes it non-reusable. Consider exposing a try_canonical_v2 helper from pawls_io:

# pawls_io.py
def try_canonical_v2(raw):
    """Like to_canonical_v2 but returns None on failure instead of raising."""
    if not raw or not isinstance(raw, (list, dict)):
        return None
    try:
        return to_canonical_v2(raw)
    except (ValueError, TypeError) as e:
        logger.debug("PAWLs normalization failed: %s", e)
        return None

5. list(pages[page_idx].tokens) called per token reference

In annotations/utils.py and populate_content_modalities.py, the same page_idx may appear multiple times across token references (common for multi-token annotations on one page). Each reference calls list(pages[page_idx].tokens), re-materializing the generator. Consider caching the per-page token lists:

token_cache = {}
# ...
if page_idx not in token_cache:
    token_cache[page_idx] = list(pages[page_idx].tokens)
tokens = token_cache[page_idx]

Low priority but worth noting for densely-annotated large documents.

6. Minor: CompactToken.isImage is now required (was optional)

The old Token.is_image?: boolean was optional. The new CompactToken.isImage: boolean is required. This is the right design choice, but it is a breaking change for any external consumers building CompactToken objects directly. The test helpers were correctly updated with isImage: false defaults.


Nits

  • test_tokens_property_is_single_pass — the test name is slightly misleading since the test ultimately proves a fresh property access returns a new iterator. Rename to test_tokens_property_returns_fresh_iterator_each_call.
  • load_canonical_v2 raises ValueError("Empty PAWLs payload") for zero-byte sources, but the docstring does not mention this case.
  • mypy.ini adds ignore_errors = True for test_populate_content_modalities_command, a new file. Consider fixing the mypy errors rather than suppressing them.

Summary

Approve with minor suggestions. The architecture is sound, the boundary discipline is enforced in the right places, and the test coverage is comprehensive. The main actionable items are:

  1. Add a runtime type guard to to_v1_pages to enforce its v2-only contract (and update the corresponding test).
  2. Reconsider the JSON-string heuristic in _read_text_from_source for brace-prefixed paths.
  3. Address the pawls_data = {} sentinel with a clearer mechanism.

Items 4–6 are optional improvements. This is a high-quality migration PR.

annotation_window's PDF and text branches both used os.path.exists(.path)
and open(.path) — pre-existing patterns that raise NotImplementedError
on remote storage backends (S3/GCS) since FieldFile.path is only defined
for local filesystem storage. Both backends (local + cloud) now work:

- default_storage.exists(file.name) replaces os.path.exists(file.path)
  for the existence guards (lines 703, 789).
- doc.txt_extract_file.open('rb').read().decode('utf-8') replaces
  open(doc.txt_extract_file.path, encoding='utf-8') for the text read.
- doc.pawls_parse_file is already passed by FieldFile to load_canonical_v2
  (fixed earlier in this PR).
- Removed now-unused 'import os'; added 'from django.core.files.storage
  import default_storage'.

test_agent_search_tools.py mocks updated:
- @patch('...os.path.exists') -> @patch('...default_storage.exists')
- patch('builtins.open') replaced with patches at the right boundary:
  load_canonical_v2 for PDF tests (returns canonical v2 directly) and
  the FieldFile.open() classmethod for the text test. The size-limit
  test now builds a 2000-token v1 PAWLs payload (instead of mocking
  open with non-JSON bytes) and asserts a soft clamp around 500
  tokens-per-side + the annotation's own 3 tokens.

pawls_io.iter_pages: corrected return type from Iterable[PageView] to
Iterator[PageView] to match the generator implementation, with a
single-pass note in the docstring.
@claude
Copy link
Copy Markdown

claude Bot commented May 3, 2026

Code Review: PAWLs v2 canonical format migration (#1488)

This is a well-designed, comprehensive migration. The single-boundary pattern (pawls_io.py) is the right architectural call, and the overall execution is solid. Below are my findings organized by severity.


Correctness issues

1. list_document_images type annotation mismatch (image_tools.py)

The function signature still declares pawls_data: Optional[list] = None but the body now calls iter_pages(pawls_data) which requires a v2 dict. If a caller passes a v1 list, the iter_pages call will raise a ValueError. The get_document_image counterpart got Optional[dict[str, Any]]list_document_images should too, with the same to_canonical_v2 normalization in the guard clause.

2. _compact_token now appends an empty {} for image tokens with no metadata

In compact_pawls.py:

# Before
if img_meta:
    arr.append(img_meta)
# After
arr.append(img_meta)  # always append for image tokens

This is intentional (preserves is_image flag through round-trips). It's correct, but it's a subtle wire-format change: v2 image token arrays will now always have 6 elements even when the metadata dict is empty. Any downstream code that uses len(row) >= 6 as an is_image check (including TokenView.is_image) will correctly handle this; just worth confirming every len(row) == 5 assumption (if any exist outside the boundary module) was also updated.

3. alt_text silently dropped in list_document_images

alt_text=None,  # alt_text not currently persisted in v2 image meta

The old code was alt_text=token.get("alt_text"). The CompactImageMeta type (both backend CompactImageMetaType and frontend CompactImageMeta) defines no key for alt_text, so there's nowhere to put it in v2. If the field was ever populated (e.g. by a parser or manually), it will be silently lost after a re-parse and then lost from ImageReference on read. This should either get a v2 short key (at) or be explicitly documented as a known unsupported field with a tracking issue.


Minor issues / code quality

4. _get_v2_token bypasses PageView/iter_pages abstraction

image_tools.py:_get_v2_token accesses the raw v2 dict structure directly (pawls_data.get("p"), page_dict.get("t"), etc.) rather than using iter_pages + PageView. This works, but it couples the function to the internal layout the abstraction was designed to hide. Consider using iter_pages + list indexing, or just a PageView lookup helper:

pages = list(iter_pages(pawls_data))
if not (0 <= page_index < len(pages)):
    return None
tokens = list(pages[page_index].tokens)
...

5. Inconsistent v2-dict raw access in data_extract_tasks.py and doc_tasks.py

tokens_as_words (data_extract_tasks) and the FUNSD token walk (doc_tasks) both reach into pawls_canonical.get("p") and page_dict.get("t") directly rather than using iter_pages/PageView. These are the same pattern as _get_v2_token above. It works, but it creates three separate places that know about the wire layout — exactly what pawls_io's abstraction layer was supposed to prevent.

6. pages = list(iter_pages(canonical)) then list(pages[i].tokens) in compute_content_modalities and populate_content_modalities

Materializing all page token-iterators eagerly on every annotation evaluated means O(tokens_per_page) work for each unique page touched. For a management command that runs over thousands of annotations on large docs, consider caching {page_idx: list_of_tokens} once per document instead of re-calling list(pages[page_idx].tokens) per annotation.

7. Pre-existing bug in pawls_token_to_funsd_token not fixed (doc_tasks.py)

The variable names are confusing but the logic was faithfully copied from before:

pawls_ytop = pawls_xleft + pawls_token.width  # actually computes x-right
pawls_xright = pawls_ybottom + pawls_token.height  # actually computes y-bottom

The FUNSD box is (xleft, ytop_computed_as_xright, xright_computed_as_ybottom, ybottom) — the variable names are wrong but the values plugged into the tuple are consistent with the original. Not a regression from this PR, but worth noting as a follow-up.

8. importing.py parameter type annotation weakened

- pawls_data: list[dict] = None,
+ pawls_data=None,

This drops the annotation entirely. Optional[Union[list[dict], dict[str, Any]]] = None would be more informative.


Tests

The test coverage is genuinely excellent:

  • test_pawls_io.py: 17 unit tests covering normalization, I/O, round-trip semantics, and documented single-pass behavior of page.tokens
  • test_populate_content_modalities_command.py: end-to-end coverage of all command paths including --dry-run, --force, label fallbacks, and mixed modalities
  • Frontend: compactPawls.test.ts was extensively expanded for v1 tolerance, v2 edge cases, and error behavior; transform.normalizeTokens.test.ts and pdf.test.ts updated throughout

One gap: the getAnnotationImages path that loads PAWLs from annotation.structural_set.pawls_parse_file has no direct test in the new suite. The management command tests cover populate_content_modalities well, but the LLM agent image-retrieval path for structural annotations remains test-only via integration tests (if those exist).


Architecture / design praise

  • The strict-fail semantics on MAX_TOKENS_PER_PAGE overflow is the right call: loud failure beats a silent v1 leak through the boundary.
  • The S3/GCS compatibility fix (default_storage.exists(.name) vs os.path.exists(.path)) is a welcome bonus.
  • TokenView.__slots__ and zero-copy property access are good choices for a hot path.
  • PageView.tokens returning a fresh generator per access with clear "single-pass" documentation (plus a test that pins the contract) is a good pattern.
  • The IndexedDB version bump with a migration hook that clears stale entries is exactly correct.

Summary

The core migration is sound. Issues 1 and 3 are the ones worth addressing before merge: the list_document_images type/normalization gap could cause a runtime error with v1 list callers, and the alt_text loss should be an explicit decision with a tracking issue. Issues 4-8 are clean-up suggestions that can be done as follow-ups.

JSv4 added 3 commits May 3, 2026 15:38
The v2 PAWLs boundary migration added significant new logic that was
under-tested in the patch — codecov flagged it. This commit adds focused
unit tests for the cold paths most affected:

- test_pawls_io: cover load_canonical_v2 FieldFile-with-str-read branch,
  empty payload, fallback-to-v1 ValueError, image_meta_v1 None for text
  tokens, token_view_to_v1_image_dict, iter_pages defensive skips, and
  PageView.tokens skipping non-list rows.
- test_annotation_utils: cover compute_content_modalities normalization
  failure / non-list-non-dict pawls fallback paths.
- test_multimodal_embeddings_utils: cover _resolve_v2_pawls (falsy /
  v1-list / garbage / unsupported), extract_and_store_annotation_images
  (success + skip branches), batch_extract_annotation_images (modality
  filter).
- test_populate_content_modalities (new): full coverage of the
  populate_content_modalities mgmt command — _determine_modalities
  branches plus handle() for default / dry-run / force / per-annotation
  error paths.
- mypy.ini: register the new test files in the test-suite ignore_errors
  list (project convention).

Local coverage post-change for the touched modules: pawls_io 98%,
multimodal_embeddings 90%, annotations/utils 97%,
populate_content_modalities 93%.
…_text loss

- importing.py: ``import_annotations(pawls_data=...)`` lost its type hint
  during the v2 migration (was ``list[dict] = None``, became bare
  ``=None``). Restored as
  ``list[dict[str, Any]] | dict[str, Any] | None`` to mirror the docstring
  contract — accepts a v1 list or a canonical v2 dict.
- image_tools.py: expanded the ``alt_text=None`` comment in
  ``list_document_images`` to document that the pre-v2 read site
  (``token.get("alt_text")``) was a no-op end-to-end (no parser ever
  populated the field), so the v2 short-key omission isn't a regression
  — but if a parser ever starts emitting alt text, add an "at" key to
  CompactImageMeta and read it here.

Dismissed: review claim that ``list_document_images(pawls_data=...)`` is
mistyped — the function does not accept a ``pawls_data`` parameter at all,
the suggestion appeared to reference an outdated signature.
…com/Open-Source-Legal/OpenContracts into claude/analyze-pawls-versions-ke9Pp

# Conflicts:
#	mypy.ini
#	opencontractserver/tests/test_multimodal_embeddings_utils.py
#	opencontractserver/tests/test_pawls_io.py
@claude
Copy link
Copy Markdown

claude Bot commented May 3, 2026

test comment - will be replaced

@claude
Copy link
Copy Markdown

claude Bot commented May 3, 2026

test2

@claude
Copy link
Copy Markdown

claude Bot commented May 3, 2026

deleting test comment

@claude
Copy link
Copy Markdown

claude Bot commented May 3, 2026

test4 no newlines

@claude
Copy link
Copy Markdown

claude Bot commented May 3, 2026

line1 line2 line3

@claude
Copy link
Copy Markdown

claude Bot commented May 3, 2026

CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

OpenContracts is an AGPL-3.0 enterprise document analytics platform for PDFs and text-based formats. It features a Django/GraphQL backend with PostgreSQL + pgvector, a React/TypeScript frontend with Jotai state management, and pluggable document processing pipelines powered by machine learning models.

Baseline Commit Rules

  1. Always ensure all affected (or new) tests pass - backend tests suite should only be run in its entirety for good reason as it takes 30+ minutes.
  2. Always make sure typescript compiles and pre-commits pass before committing new code.
  3. Never credit Claude or Claude Code in commit messages, PR messages, comments, or any other artifacts. This includes Co-Authored-By lines, "Generated by Claude", etc.

Essential Commands

Backend (Django)

# Run backend tests (sequential, use --keepdb to speed up subsequent runs)
docker compose -f test.yml run django python manage.py test --keepdb

# Run backend tests in PARALLEL (recommended - ~4x faster)
# Uses pytest-xdist with 4 workers, --dist loadscope keeps class tests together
docker compose -f test.yml run django pytest -n 4 --dist loadscope

# Run parallel tests with auto-detected worker count (uses all CPU cores)
docker compose -f test.yml run django pytest -n auto --dist loadscope

# Run parallel tests with fresh databases (first run or after schema changes)
docker compose -f test.yml run django pytest -n 4 --dist loadscope --create-db

# Run specific test file
docker compose -f test.yml run django python manage.py test opencontractserver.tests.test_notifications --keepdb

# Run specific test file in parallel
docker compose -f test.yml run django pytest opencontractserver/tests/test_notifications.py -n 4 --dist loadscope

# Run specific test class/method
docker compose -f test.yml run django python manage.py test opencontractserver.tests.test_notifications.TestNotificationModel.test_create_notification --keepdb

# Apply database migrations
docker compose -f local.yml run django python manage.py migrate

# Create new migration
docker compose -f local.yml run django python manage.py makemigrations

# Django shell
docker compose -f local.yml run django python manage.py shell

# Code quality (runs automatically via pre-commit hooks)
pre-commit run --all-files

Frontend (React/TypeScript)

cd frontend

# Start development server (proxies to Django on :8000)
yarn start

# Run unit tests (Vitest) - watches by default
yarn test:unit

# Run component tests (Playwright) - CRITICAL: Use --reporter=list to prevent hanging
yarn test:ct --reporter=list

# Run component tests with grep filter
yarn test:ct --reporter=list -g "test name pattern"

# Run E2E tests
yarn test:e2e

# Coverage reports (unit tests via Vitest, component tests via Playwright + Istanbul)
yarn test:coverage:unit
yarn test:coverage:ct

# Linting and formatting
yarn lint
yarn fix-styles

# Build for production
yarn build

# Preview production build locally
yarn serve

Production Deployment

# CRITICAL: Always run migrations FIRST in production
docker compose -f production.yml --profile migrate up migrate

# Then start main services
docker compose -f production.yml up

High-Level Architecture

Backend Architecture

Stack: Django 4.x + GraphQL (Graphene) + PostgreSQL + pgvector + Celery

Key Patterns:

  1. GraphQL Schema Organization:

    • config/graphql/graphene_types.py - All GraphQL type definitions
    • config/graphql/queries.py - Query resolvers
    • config/graphql/*_mutations.py - Mutation files (organized by feature)
    • config/graphql/schema.py - Schema composition
  2. Permission System (CRITICAL - see docs/permissioning/consolidated_permissioning_guide.md):

    • Annotations & Relationships: NO individual permissions - inherited from document + corpus
    • Documents & Corpuses: Direct object-level permissions via django-guardian
    • Analyses & Extracts: Hybrid model (own permissions + corpus permissions + document filtering)
    • Formula: Effective Permission = MIN(document_permission, corpus_permission)
    • Structural items are ALWAYS read-only except for superusers
    • Use Model.objects.visible_to_user(user) pattern (NOT resolve_oc_model_queryset - DEPRECATED)
  3. AnnotatePermissionsForReadMixin:

    • Most GraphQL types inherit this mixin (see config/graphql/permissioning/permission_annotator/mixins.py)
    • Adds my_permissions, is_published, object_shared_with fields
    • Requires model to have guardian permission tables ({model}userobjectpermission_set)
    • Notifications use simple ownership model and DON'T use this mixin
  4. Django Signal Handlers:

    • Automatic notification creation on model changes (see opencontractserver/notifications/signals.py)
    • Must be imported in app's apps.py ready() method
    • Use _skip_signals attribute to prevent duplicate notifications in tests
  5. Pluggable Parser Pipeline:

    • Base classes in opencontractserver/pipeline/base/
    • Parsers, embedders, thumbnailers auto-discovered and registered
    • Multiple backends: Docling (ML-based), LlamaParse, Text
    • All convert to unified PAWLs format for frontend
  6. Agent Tool Architecture (see docs/architecture/llms/README.md):

    • CoreTool (framework-agnostic) → PydanticAIToolWrapper (pydantic-ai specific)
    • All production tools MUST be async (a-prefixed in core_tools.py). The wrapper supports sync functions for lightweight helpers/tests but does NOT wrap them in a thread pool — sync ORM calls will raise SynchronousOnlyOperation.
    • Tool fault tolerance (issue Agent Tools Not Fault Tolerant #820): operational exceptions are caught and returned as error strings to the LLM; security exceptions (PermissionError, ToolConfirmationRequired) propagate.
    • Pre-execution checks run on every call (not cached): permission validation, resource ID validation, approval gates.

Frontend Architecture

Stack: React 18 + TypeScript + Apollo Client + Jotai (atoms) + PDF.js + Vite

Key Patterns:

  1. State Management - Jotai Atoms:

    • Global state via atoms in frontend/src/atoms/ (NOT Redux/Context)
    • Key atoms: selectedFolderIdAtom, folderCorpusIdAtom, selectedMessageIdAtom
    • Derived atoms automatically update when dependencies change
    • Apollo reactive vars in frontend/src/graphql/cache.ts for UI state
    • AuthGate pattern ensures auth completes before rendering
  2. Central Routing System (see docs/frontend/routing_system.md):

    • Single source of truth: frontend/src/routing/CentralRouteManager.tsx
    • URL paths → Entity resolution via GraphQL slug queries
    • URL params ↔ Reactive vars (bidirectional sync)
    • Components consume state via reactive vars, never touch URLs directly
    • Deep linking and canonical redirects handled automatically
  3. PDF Annotation System (see .cursor/rules/pdf-viewer-and-annotator-architecture.mdc):

    • Virtualized rendering: Only visible pages (+overscan) rendered for performance
    • Binary search to find visible page range (O(log n))
    • Height caching per zoom level
    • Two-phase scroll-to-annotation system
    • Dual-layer architecture: Document layer (annotations) + Knowledge layer (summaries)
  4. Unified Filtering Architecture:

    • useVisibleAnnotations hook provides single-source-of-truth filtering for both annotations and relationship-connected annotations
    • Reads from Jotai atoms via useAnnotationDisplay, useAnnotationControls, and useAnnotationSelection (from UISettingsAtom)
    • Ensures consistency across all components
    • Forced visibility for selected items and their relationship connections
  5. Component Testing (see .cursor/rules/test-document-knowledge-base.mdc):

    • ALWAYS mount components through test wrappers (e.g., DocumentKnowledgeBaseTestWrapper)
    • Wrapper provides: MockedProvider + InMemoryCache + Jotai Provider + asset mocking
    • Use --reporter=list flag to prevent hanging
    • Increase timeouts (20s+) for PDF rendering in Chromium
    • GraphQL mocks must match variables EXACTLY (null vs undefined matters)
    • Mock same query multiple times for refetches
    • Use page.mouse for PDF canvas interactions (NOT locator.dragTo)
    • Add settle time after drag operations (500ms UI, 1000ms Apollo cache)
  6. Development Server Configuration:

    • Vite dev server on :3000 proxies to Django on :8000
    • WebSocket proxy for /wsws://localhost:8000
    • GraphQL proxy for /graphqlhttp://localhost:8000
    • REST API proxy for /apihttp://localhost:8000
    • Auth0 optional via REACT_APP_USE_AUTH0 environment variable

Data Flow Architecture

Document Processing:

  1. Upload → Parser Selection (Docling/LlamaParse/Text)
  2. Parser generates PAWLs JSON (tokens with bounding boxes)
  3. Text layer extracted from PAWLs
  4. Annotations created for structure (headers, sections, etc.)
  5. Relationships detected between elements
  6. Vector embeddings generated for search

GraphQL Permission Flow:

  1. Query resolver filters objects with .visible_to_user(user)
  2. GraphQL types resolve my_permissions via AnnotatePermissionsForReadMixin
  3. Frontend uses permissions to enable/disable UI features
  4. Mutations check permissions and return consistent errors to prevent IDOR

Critical Security Patterns

  1. IDOR Prevention:

    • Query by both ID AND user-owned field: Model.objects.get(pk=pk, recipient=user)
    • Return same error message whether object doesn't exist or belongs to another user
    • Prevents enumeration via timing or different error messages
  2. Permission Checks:

    • NEVER trust frontend - always check server-side
    • Use visible_to_user() manager method for querysets
    • Check user_has_permission_for_obj() for individual objects (in opencontractserver.utils.permissioning)
  3. XSS Prevention:

    • User-generated content in JSON fields must be escaped on frontend
    • GraphQL's GenericScalar handles JSON serialization safely
    • Document this requirement in resolver comments

Critical Concepts

  1. No dead code - when deprecating or replacing code, always try to fully replace older code and, once it's no longer in use, delete it and related texts.
  2. DRY - please always architect code for maximal dryness and always see if you can consolidate related code and remove duplicative code.
  3. Single Responsibility Principle - Generally, ensure that each module / script has a single purpose or related purpose.
  4. No magic numbers - we have constants files in opencontractserver/constants/ (backend) and frontend/src/assets/configurations/constants.ts (frontend). Use them for any hardcoded values.
  5. Don't touch old tests without permission - if pre-existing tests fail after changes, try to identify why and present user with root cause analysis. If the test logic is correct but expectations need updating due to intentional behavior changes, document the change clearly.
  6. Utility functions belong in utility files:
    • Before writing new utilities: Check existing utility files first (frontend/src/utils/, opencontractserver/utils/)
    • When reviewing code: If you find utility functions defined inline in components/views, check if they already exist in utility files. If not, consider whether they should be extracted there for reuse.
    • Frontend utilities: frontend/src/utils/formatters.ts (formatting), frontend/src/utils/files.ts (file operations), etc.
    • Backend utilities: opencontractserver/utils/ contains permissioning, PDF processing, and other shared utilities

Testing Patterns

Manual Test Scripts

Location: docs/test_scripts/

When performing manual testing (e.g., testing migrations, verifying database state, testing API endpoints interactively), always document the test steps in a markdown file under docs/test_scripts/. These scripts will later be used to build automated integration tests.

Format:

# Test: [Brief description]

## Purpose
What this test verifies.

## Prerequisites
- Required state (e.g., "migration at 0058")
- Required data (e.g., "at least one document exists")

## Steps
1. Step one with exact command
   ```bash
   docker compose -f local.yml run --rm django python manage.py shell -c "..."
  1. Step two...

Expected Results

  • What should happen after each step
  • Success criteria

Cleanup

Commands to restore original state if needed.


**When to document**:
- Migration testing (rollback, create test data, migrate forward)
- Database constraint validation
- Race condition verification
- Any manual verification requested during code review

### Backend Tests

**Location**: `opencontractserver/tests/`

**Parallel Testing** (pytest-xdist):
- Run with `-n 4` (or `-n auto`) for parallel execution across workers
- Use `--dist loadscope` to keep tests from the same class on the same worker (respects `setUpClass`)
- Each worker gets its own database (test_db_gw0, test_db_gw1, etc.)
- Use `@pytest.mark.serial` to mark tests that cannot run in parallel
- First run or after DB schema changes: add `--create-db` flag

**Patterns**:
- Use `TransactionTestCase` for tests with signals/asynchronous behavior
- Use `TestCase` for faster tests without transaction isolation
- Clear auto-created notifications when testing moderation: `Notification.objects.filter(recipient=user).delete()`
- Use `_skip_signals` attribute on instances to prevent signal handlers during fixtures

### Frontend Component Tests

**Location**: `frontend/tests/`

**Critical Requirements**:
- Mount through test wrappers that provide all required context
- GraphQL mocks must match query variables exactly
- Include mocks for empty-string variants (unexpected boot calls)
- Wait for visible evidence, not just network-idle
- Use `page.mouse` for PDF canvas interactions (NOT `locator.dragTo`)
- Add settle time after drag operations (500ms UI, 1000ms Apollo cache)

**Test Wrapper Pattern**:
```typescript
// ALWAYS use wrappers, never mount components directly
const component = await mount(
  <DocumentKnowledgeBaseTestWrapper corpusId="corpus-1" documentId="doc-1">
    <DocumentKnowledgeBase />
  </DocumentKnowledgeBaseTestWrapper>
);

Automated Documentation Screenshots

Location: docs/assets/images/screenshots/auto/ (output) | frontend/tests/utils/docScreenshot.ts (utility)

Screenshots for documentation are automatically captured during Playwright component tests and committed back to the PR branch by the screenshots.yml CI workflow.

How it works:

  1. Import docScreenshot from ./utils/docScreenshot in any .ct.tsx test file
  2. Call await docScreenshot(page, "area--component--state") after the component reaches the desired visual state
  3. Reference the image in markdown: ![Alt text](../assets/images/screenshots/auto/area--component--state.png)
  4. CI runs tests on every PR, captures screenshots, and auto-commits any changes

Naming convention (-- separates segments, - within words):

Segment Purpose Examples
area Feature area landing, badges, corpus, versioning, annotations
component Specific view or component hero-section, celebration-modal, list-view
state Visual state captured anonymous, with-data, empty, auto-award

At least 2 segments required, 3 recommended. All lowercase alphanumeric with single hyphens.

Example:

import { docScreenshot } from "./utils/docScreenshot";

// After the component renders and assertions pass:
await docScreenshot(page, "badges--celebration-modal--auto-award");

Rules:

  • Place docScreenshot() calls AFTER assertions that confirm the desired visual state
  • The filename IS the contract between tests and docs — keep names stable
  • Never manually edit files in docs/assets/images/screenshots/auto/ — they are overwritten by CI
  • Manually curated screenshots stay in docs/assets/images/screenshots/ (parent directory)

Release Screenshots (Point-in-Time)

For release notes, use releaseScreenshot to capture screenshots that are locked in amber — they show the UI at a specific release and never change.

Location: docs/assets/images/screenshots/releases/{version}/ (output)

import { releaseScreenshot } from "./utils/docScreenshot";

await releaseScreenshot(page, "v3.0.0.b3", "landing-page", { fullPage: true });

Key differences from docScreenshot:

  • Output: docs/assets/images/screenshots/releases/{version}/{name}.png
  • Write-once: If the file already exists, the function is a no-op (won't overwrite)
  • CI never touches the releases/ directory
  • Name is a simple kebab-case string (no -- segment convention)
  • Version must match v{major}.{minor}.{patch} format (with optional suffix)

When to use which:

  • docScreenshot → README, quickstart, guides (always fresh)
  • releaseScreenshot → Release notes (frozen at release time)

Authenticated Playwright Testing (Live Frontend Debugging)

When you need to interact with the running frontend as an authenticated user (e.g., debugging why a query returns empty results), use Django admin session cookies to authenticate GraphQL requests.

Architecture context: The frontend uses Auth0 for authentication, but the Django backend also accepts session cookie auth. Apollo Client sends GraphQL requests directly to http://localhost:8000/graphql/ (cross-origin from the Vite dev server at localhost:5173). Since fetch defaults to credentials: 'same-origin', browser cookies aren't sent cross-origin. The workaround is to inject the session cookie into request headers via Playwright's route interception.

Step 1: Set a password for the superuser (one-time setup):

docker compose -f local.yml exec django python manage.py shell -c "
from django.contrib.auth import get_user_model
User = get_user_model()
u = User.objects.filter(is_superuser=True).first()
u.set_password('testpass123')
u.save()
print(f'Password set for {u.username}')
"

Step 2: Playwright script pattern:

const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch({ headless: true });
  const context = await browser.newContext();
  const page = await context.newPage();

  // Collect console messages for debugging
  const consoleMsgs = [];
  page.on('console', msg => consoleMsgs.push('[' + msg.type() + '] ' + msg.text()));

  // 1. Login to Django admin to get session cookie
  await page.goto('http://localhost:8000/admin/login/');
  await page.fill('#id_username', '<superuser-username>');
  await page.fill('#id_password', 'testpass123');
  await page.click('input[type=submit]');
  await page.waitForTimeout(2000);

  // 2. Extract the session cookie
  const cookies = await context.cookies();
  const sessionCookie = cookies.find(c => c.name === 'sessionid');

  // 3. Intercept GraphQL requests to inject the session cookie
  //    (needed because Apollo sends cross-origin requests to :8000)
  await page.route('**/graphql/**', async (route) => {
    const headers = {
      ...route.request().headers(),
      'Cookie': 'sessionid=' + sessionCookie.value,
    };
    await route.continue({ headers });
  });

  // 4. Navigate to the frontend page under test
  await page.goto('http://localhost:5173/extracts');
  await page.waitForTimeout(5000);

  // 5. Inspect results
  const bodyText = await page.textContent('body');
  console.log(bodyText);

  await browser.close();
})();

Run from the frontend directory (where playwright is a dependency):

cd frontend && node /path/to/script.js

Key details:

  • The Django admin login sets a sessionid cookie for localhost
  • page.route('**/graphql/**') intercepts Apollo's requests to localhost:8000/graphql/ and injects the cookie header
  • The AuthGate will still show anonymous state (no Auth0 session), but GraphQL queries execute as the authenticated Django user
  • This is useful for verifying backend query results, permission filtering, and data flow through the real frontend
  • For verifying just the GraphQL response without the full frontend, curl with the session cookie also works:
    curl -s -X POST http://localhost:8000/graphql/ \
      -H "Content-Type: application/json" \
      -H "Cookie: sessionid=<session-key>" \
      -d '{"query":"query { extracts { edges { node { id name } } } }"}' | python3 -m json.tool

Alternative — create a session programmatically (no admin login needed):

docker compose -f local.yml exec django python manage.py shell -c "
from django.contrib.sessions.backends.db import SessionStore
from django.contrib.auth import get_user_model
User = get_user_model()
user = User.objects.filter(is_superuser=True).first()
session = SessionStore()
session['_auth_user_id'] = str(user.pk)
session['_auth_user_backend'] = 'django.contrib.auth.backends.ModelBackend'
session['_auth_user_hash'] = user.get_session_auth_hash()
session.save()
print(f'Session key: {session.session_key}')
"

Then use the printed session key directly in curl or Playwright route interception.

Documentation Locations

  • Permissioning: docs/permissioning/consolidated_permissioning_guide.md
  • Frontend Routing: docs/frontend/routing_system.md
  • Frontend Auth Flow: docs/frontend/auth_flow.md
  • PDF Data Layer: docs/architecture/PDF-data-layer.md
  • PAWLs Format (v1/v2): docs/architecture/pawls-format.md
  • Parser Pipeline: docs/pipelines/pipeline_overview.md
  • LLM Framework: docs/architecture/llms/README.md
  • Collaboration System: docs/commenting_system/README.md
  • Auth Pattern (detailed): frontend/src/docs/AUTHENTICATION_PATTERN.md
  • Documentation Screenshots: docs/development/screenshots.md
  • Query Permission Patterns: docs/architecture/query_permission_patterns.md
  • OS-Legal-Style Migration Guide: docs/frontend/os-legal-style-migration-guide.md

Branch Strategy

This project follows trunk-based development:

  • Work directly on main branch or use short-lived feature branches
  • Feature branches: feature/description-issue-number
  • Merge feature branches quickly (within a day or two)
  • Commit message format: Descriptive with issue references (e.g., "Closes Epic: Notification System #562")

Changelog Maintenance

IMPORTANT: Always update CHANGELOG.md when making significant changes to the codebase.

The changelog follows Keep a Changelog format:

## [Unreleased] - YYYY-MM-DD

### Added
- New features

### Fixed
- Bug fixes with file locations and line numbers

### Changed
- Changes to existing functionality

### Technical Details
- Implementation specifics, architectural notes

When to update:

  • New features or models added
  • Production code bugs fixed (document file location, line numbers, and impact)
  • Breaking changes to APIs or data models
  • Test suite fixes that reveal production issues
  • Database migrations
  • Architecture changes

What to include:

  • File paths and line numbers for code changes
  • Clear description of the issue and fix
  • Impact on system behavior
  • Migration notes if applicable

Pre-commit Hooks

Automatically run on commit:

  • black (Python formatting)
  • isort (import sorting)
  • flake8 (linting)
  • prettier (frontend formatting)
  • pyupgrade (Python syntax modernization)

Run manually: pre-commit run --all-files

Common Pitfalls

  1. Frontend tests hanging: Always use --reporter=list flag
  2. Permission N+1 queries: Use .visible_to_user() NOT individual permission checks. For Conversation list queries in GraphQL resolvers, use ConversationQueryOptimizer from opencontractserver.conversations.query_optimizer - it provides request-level caching to avoid repeated corpus/document visibility subqueries
  3. Missing GraphQL mocks: Check variables match exactly (null vs undefined matters), add duplicates for refetches
  4. Notification duplication in tests: Moderation methods auto-create ModerationAction records
  5. Structural annotation editing: Always read-only except for superusers
  6. Missing signal imports: Import signal handlers in apps.py ready() method
  7. PDF rendering slow in tests: Increase timeouts to 20s+ for Chromium
  8. Cache serialization crashes: Keep InMemoryCache definition inside wrapper, not test file
  9. Backend Tests Waiting > 10 seconds on Postgres to be Ready: Docker network issue. Fix with: docker compose -f test.yml down && docker kill $(docker ps -q) && docker compose -f test.yml down
  10. Empty lists on direct navigation: AuthGate pattern solves this (don't check auth status, it's always ready)
  11. URL desynchronization: Use CentralRouteManager, don't bypass routing system
  12. Jotai state not updating: Ensure atoms are properly imported and used with useAtom hook
  13. Writing sync agent tools: All agent tools must be async. The PydanticAIToolWrapper accepts sync functions but calls them directly (no thread pool) — sync Django ORM calls will raise SynchronousOnlyOperation. Use the a-prefixed async versions in core_tools.py.
  14. PydanticAI system_prompt silently dropped: When creating PydanticAIAgent, use instructions= NOT system_prompt=. The system_prompt parameter is only included when message_history is None, but OpenContracts' chat() persists a HUMAN message before calling pydantic-ai's run(), so history is always non-empty. This causes the system prompt to be silently dropped. See docs/architecture/llms/README.md for full details.
  15. Apollo cache keyArgs must use field argument names, not variable names: In cache.ts, relayStylePagination(["corpus", "name_Contains"]) uses the GraphQL field argument names (e.g., extracts(corpus: $id, name_Contains: $text)), NOT the variable names ($id, $text). Mismatched keyArgs silently fail to isolate cache entries, causing queries with different filters to share stale cached results.
  16. Playwright CT split-import rule: In frontend/tests/*.ct.tsx, keep JSX-component imports in their own import { Wrapper } from "./Wrapper" statement, separate from any helper/constant imports from the same file. Playwright CT's babel transform only rewrites component references into importRefs when every specifier in the statement is a JSX component; mixing a component with helpers leaves the component unrewritten and mount() throws. Do not let auto-import organisers merge them.
  17. Corrupted Docker iptables chains (RARE): If you see Chain 'DOCKER-ISOLATION-STAGE-2' does not exist errors, Docker's iptables chains have been corrupted during docker cycling. Run this nuclear fix:
    sudo systemctl stop docker && sudo systemctl stop docker.socket && sudo ip link delete docker0 2>/dev/null || true && sudo iptables -t nat -F && sudo iptables -t nat -X && sudo iptables -t filter -F && sudo iptables -t filter -X 2>/dev/null || true && sudo iptables -t mangle -F && sudo iptables -t mangle -X && sudo iptables -t filter -N INPUT 2>/dev/null || true && sudo iptables -t filter -N FORWARD 2>/dev/null || true && sudo iptables -t filter -N OUTPUT 2>/dev/null || true && sudo iptables -P INPUT ACCEPT && sudo iptables -P FORWARD ACCEPT && sudo iptables -P OUTPUT ACCEPT && sudo systemctl start docker
    This completely resets Docker's networking and iptables state. Docker will recreate all required chains on startup.

@claude
Copy link
Copy Markdown

claude Bot commented May 3, 2026

Code Review: PAWLs v2 Canonical Format Migration

Overview

This is a well-executed large-scale format migration (~2,900 additions / ~633 deletions across 43 files). The core idea — centralizing v1↔v2 conversion behind a single boundary module and eliminating v1 shape in active runtime code — is sound and clearly executed. The documentation, changelog, and architecture doc updates are thorough.


Strengths

  • Boundary discipline is tight. Every internal read-path goes through load_canonical_v2, every plasmapdf/export site uses to_v1_pages, and every import site uses to_canonical_v2. The four documented I/O boundaries match the implementation.
  • TokenView / PageView read-views use slots for memory efficiency and provide zero-copy attribute access — the right pattern for a hot path like document rendering.
  • Test coverage is substantial. 17 backend tests in test_pawls_io.py plus a new 400-line management-command test file. The frontend test suite was rewritten to match the v2 wire fixtures and the legacy-tolerance path is explicitly tested.
  • Cache invalidation is correct. The oldVersion > 0 AND oldVersion < 3 guard in onupgradeneeded clears stale entries without affecting fresh installs.
  • Changelog and architecture doc are kept in sync — the docs/architecture/pawls-format.md rewrite clearly documents all four allowed v1 boundaries and the strict-fallback semantics.

Issues and Suggestions

1. Behavioral change in compact_pawls.py — empty {} appended for all image tokens (medium)

File: opencontractserver/utils/compact_pawls.py, _compact_token

The old guard if img_meta: arr.append(img_meta) only appended the 6th metadata element when at least one image field was present. The new code always appends {} for image tokens even if all metadata fields are absent. This is the right change for TokenView.is_image detection, but it is a silent on-disk format change for image tokens that have no metadata. Newly written files will differ from files written before this PR at those positions.

The round-trip test catches this semantically, but the CHANGELOG entry does not mention it, and there is no backend test specifically for the empty-metadata round-trip case (the frontend test_image_token_with_no_metadata_fields_produces_empty_imageMeta covers the decode side only). Recommend:

  • Adding a note to the CHANGELOG for this behavioral change.
  • Adding a backend test: v1 image token with no metadata fields to to_canonical_v2 where TokenView.is_image should be True.

2. Misleading sentinel comment in annotations/models.py (low)

File: opencontractserver/annotations/models.py, slow-path PAWLs load on error

The comment on pawls_data = {} implies extract_and_store will be called with the {} sentinel, but {} is falsy so if pawls_data: short-circuits before extract_and_store is ever reached. Suggest: # {} is falsy — prevents repeated load attempts; if pawls_data: guard below still skips extraction.

3. PageView.tokens docstring says single-pass but each property access gives a fresh generator (low)

File: opencontractserver/utils/pawls_io.py, PageView.tokens

The property is a generator function, so each attribute access produces a new generator from the beginning. The test test_tokens_property_is_single_pass pins the intended contract correctly, but single-pass is a confusing label: a standard single-pass iterator means one traversal per object, whereas here each access gives a fresh iterator. Consider rephrasing to: each property access returns a new iterator from the start; a single iterator instance is exhausted after one pass and must not be reused.

4. load_canonical_v2 path-detection heuristic is fragile (low)

File: opencontractserver/utils/pawls_io.py, _read_text_from_source

The heuristic source.lstrip().startswith would incorrectly reject a filesystem path beginning with [ or { (valid on Linux, unusual in this domain). The guard catches a common caller mistake; suggest documenting the assumption in the function docstring (str inputs are always treated as filesystem paths; pre-decoded JSON must be passed as list/dict) rather than relying solely on the heuristic.

5. decodeV2Pawls now throws on unrecognized non-null input — callers are unprotected (minor breaking change)

File: frontend/src/utils/compactPawls.ts / frontend/src/components/annotator/api/rest.ts

The old expandPawlsPages silently returned [] for unrecognized inputs. decodeV2Pawls now throws. The two call sites (rest.ts / cachedRest.ts) do not wrap this in try/catch. If the server returns malformed PAWLs data (mid-migration document, proxy error body) the error propagates as an unhandled rejection, crashing the document loading flow rather than gracefully degrading.

Throwing is correct developer ergonomics, but the API-layer callers should decide explicitly whether to propagate or catch. A minimal guard at the getPawlsLayer call site would restore the previous resilience without hiding bugs.

6. mypy.ini — two new test-file suppressions (informational)

Both test_pawls_io and test_populate_content_modalities_command are added with ignore_errors = True. Consistent with project pattern, no action required. Enabling mypy on test_pawls_io.py in particular would be cheap and would catch type-shape regressions.

7. token_view_to_v1_image_dict bridge — confirm issue exists (informational)

The function is documented as a Phase-2 bridge tracked in issue 1490. Confirming that issue exists and is linked from this PR helps future reviewers find the cleanup work without hunting commit history.


Summary

High-quality, well-scoped migration. Boundary architecture is correct, tests are comprehensive, codebase is cleaner post-migration. Two items worth addressing before merge:

  1. Item 1 — document the _compact_token on-disk format change (empty {} always appended for image tokens without metadata) in the CHANGELOG; add a backend round-trip test for the no-metadata case.
  2. Item 5 — add try/catch around decodeV2Pawls in rest.ts / cachedRest.ts to preserve graceful degradation on malformed server responses.

Everything else is low-severity and can be addressed as follow-up.

Review responses:
- Document on-disk format change for image tokens with no v1 metadata
  fields in CHANGELOG; new round-trip test in test_pawls_io.py.
- Clarify sentinel comment in annotations/models.py slow-path PAWLs load
  to accurately describe the falsy-guard behavior.
- Reword PageView.tokens docstring to distinguish per-access fresh
  iterators from single-pass instance semantics.
- Document the str-as-path assumption in load_canonical_v2 docstring.
- Wrap decodeV2Pawls in rest.ts getPawlsLayer with try/catch + log so
  malformed PAWLs payloads from the server degrade gracefully rather
  than throwing into Apollo.

Coverage:
- New tests for batch_extract_annotation_images,
  load_images_from_annotation_file, get_annotation_image_tokens fast
  path / no-source / outer-exception branches, and
  extract_and_store_annotation_images outer exception path.
- New tests for image_tools _extract_image_from_pawls (v1, v2,
  out-of-bounds, non-image, garbage) and get_annotation_images
  structural-set branches (load success, load failure, no pawls file).
@claude
Copy link
Copy Markdown

claude Bot commented May 3, 2026

Code Review

This is a well-executed architectural migration. The single-boundary design for PAWLs v1→v2 normalization is the correct approach, the documentation is thorough, and the test suite is substantial (482-line test_pawls_io.py). A few items worth checking before merge:


Bug: enumerate(page.tokens) may produce incorrect indices for malformed data

opencontractserver/llms/tools/image_tools.py (~line 127):

for token_idx, token in enumerate(page.tokens):
    if not token.is_image:
        continue
    images.append(ImageReference(token_index=token_idx, ...))

PageView.tokens silently skips non-list rows in t:

for row in self._page.get("t", []):
    if isinstance(row, list):
        yield TokenView(row)

If a page's t array contains any non-list entries, enumerate(page.tokens) produces enum-relative indices (skipping holes), but _get_v2_token uses direct rows[token_index] array access. The stored token_index would then resolve to the wrong row when fetched later.

For well-formed v2 data this can't happen. But the same hole exists in the pre-PR code (enumerate(page_tokens) with a continue on non-dict entries), so this is pre-existing rather than a regression. Worth noting since _get_v2_token uses a different lookup strategy and the invariant is implicit.


Concern: to_v1_pages silently passes through v1 lists

opencontractserver/utils/pawls_io.py (to_v1_pages):

The docstring says "a v1 list input is returned as-is for caller convenience." The issue is that callers at the v1 export boundary should only ever hold a canonical v2 dict at that point — if they somehow hold a v1 list, the silent pass-through means a bug goes undetected. Consider at minimum a logger.warning when a v1 list is received, to make the unexpected case visible:

def to_v1_pages(canonical_v2: Any) -> list[dict[str, Any]]:
    if isinstance(canonical_v2, list):
        logger.warning("to_v1_pages received a v1 list — expected canonical v2 dict")
    return _expand_pawls_pages(canonical_v2)

Data loss: alt_text is now hardcoded to None

opencontractserver/llms/tools/image_tools.py (list_document_images, ~line 133):

alt_text=None,  # no v2 short key — no parser ever populated it

The comment is honest, but this is a one-way door: if a parser starts emitting alt_text on image tokens (either via new code or via re-parsing older documents), those values will be silently dropped until someone remembers this field was removed. Given the comment already names the fix (add short key "at" to CompactImageMeta), it would be worth filing a tracking issue or adding a # TODO: reference so the gap doesn't get lost.


Minor: redundant guard in to_canonical_v2 for v2 path

opencontractserver/utils/pawls_io.py (~line 79):

if isinstance(raw, dict):
    if not is_compact_pawls_format(raw):
        raise ValueError(...)
    pages = raw.get("p")
    if not isinstance(pages, list):   # is_compact_pawls_format already verified this
        raise ValueError(...)

is_compact_pawls_format checks Array.isArray(p) / isinstance(pages, list) internally, so the second guard is dead code. Not harmful, but slightly misleading since it implies is_compact_pawls_format might pass without a valid p list.


Minor: PageView.tokens generator-property footgun (documented, but worth surfacing)

The docstring explains the behaviour correctly, but page.tokens returning a new iterator on every property access is non-obvious. The most common hazard is:

token_count = sum(1 for _ in page.tokens)   # ok, fresh iterator
first_tok   = next(iter(page.tokens))        # ok, fresh iterator — but easy to forget
for tok in page.tokens: ...                  # ok

vs:

toks = page.tokens
for tok in toks: ...    # ok
for tok in toks: ...    # SILENT NO-OP — iterator is exhausted

The PR already uses list(page.tokens) in the hot path (doc_tasks.py comment). Given this is in a shared library, it may be worth adding a one-liner example in the docstring showing the list(page.tokens) snapshot pattern next to the "calling twice" warning.


On-disk format change for image tokens without metadata

opencontractserver/utils/compact_pawls.py (_compact_token):

The PR correctly notes that image tokens with no metadata fields now get an empty {} sixth element instead of a 5-element row. This fixes a real detection bug. But it means any document currently on disk that was written with an image token that had is_image=True and no metadata fields will be:

  • Readable (5-element rows are still valid v2; TokenView.is_image returns False for them)
  • Silently not detected as image tokens until the document is re-parsed

This is documented in the PR description, but it's worth confirming that the product data doesn't actually have this edge case (image tokens written with is_image=True but none of the metadata fields set), or explicitly listing a migration/reparse step if it does.


Positive highlights

  • Architecture: the single-boundary design in pawls_io.py is clean and enforceable. Having one place that accepts v1 makes auditing and future migration straightforward.
  • TokenView/PageView with __slots__: efficient, zero-copy attribute access is the right approach for a hot path.
  • Idempotent to_canonical_v2: makes call-site code safe against double-normalization without requiring callers to track format state.
  • IndexedDB version bump + store clearing: correct handling of stale deserialized objects. The oldVersion < 3 guard is good — it won't re-clear on future upgrades.
  • Boundary call sites: every to_v1_pages call has an inline comment explaining why v1 is produced there. This makes the rule auditable.
  • Test coverage: test_pawls_io.py covers format normalization, I/O, read-views, round-trips, and the MAX_TOKENS_PER_PAGE fallback path. The mock-based test for the compaction fallback is a nice edge-case catch.
  • token_view_to_v1_image_dict is explicitly flagged as a Phase-2 bridge with a tracking issue (Retire _v2_image_token_to_v1_dict adaptor: migrate embedders + image helpers to v2 directly #1490). Good practice.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants