Migrate codebase to v2 PAWLs as canonical format end-to-end#1488
Phase 1 of v1 -> v2 internal migration:
- New `opencontractserver/utils/pawls_io.py` is the single load boundary.
`to_canonical_v2()` accepts v1 (list) or v2 (dict), always returns a v2
dict, raises on garbage. `load_canonical_v2()` reads from FieldFile,
file-likes, or paths and normalizes. `TokenView`/`PageView` give
zero-copy attribute access over v2 rows.
- Boundary-strict semantics: pathological v1 input that would silently
fall back to v1 in `compact_pawls_pages` now raises, preventing v1 from
leaking past the load boundary into active code paths.
- Five import/persist sites swapped from `compact_pawls_pages` to
`to_canonical_v2`: parser persist, legacy import, v2 import, async
import task, worker upload task. Same on-disk result for normal inputs.
- New `CompactPawlsV2Type`, `CompactPawlsPageType`,
`CompactPawlsTokenType`, `CompactImageMetaType` aliases. Existing
`PawlsPagePythonType` and `PawlsTokenPythonType` annotated as
v1-wire-only.
- 17 new tests in `test_pawls_io.py`; existing `test_compact_pawls`
unchanged and passing.
Internal consumers of `expand_pawls_pages` are untouched in this phase.
Frontend untouched. Parsers untouched (they emit v1 internally; the
persist boundary normalizes).
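The load-boundary contract described for Phase 1 can be sketched roughly as follows. This is a minimal illustration, not the code in `pawls_io.py`: the function name `to_canonical_v2_sketch`, the page keys `w` and `h`, and the token column order are assumptions made for the example; only the `{v: 2, p: [...]}` envelope and the `t` token key appear in the PR text.

```python
from typing import Any

V2_VERSION = 2  # matches the `{v: 2, p: [...]}` envelope named in the PR


def to_canonical_v2_sketch(data: Any) -> dict[str, Any]:
    """Return a v2 dict for v1 (list) or v2 (dict) input; raise on anything else."""
    if isinstance(data, dict):
        if data.get("v") != V2_VERSION or not isinstance(data.get("p"), list):
            raise ValueError("dict input is not a v2 PAWLs payload")
        return data  # already canonical, returned unchanged
    if not isinstance(data, list):
        raise TypeError(f"expected list (v1) or dict (v2), got {type(data).__name__}")

    pages = []
    for page in data:
        if not isinstance(page, dict) or "page" not in page or "tokens" not in page:
            # Boundary-strict: raise rather than let v1 leak past the boundary.
            raise ValueError("malformed v1 PAWLs page")
        meta = page["page"]
        rows = [
            [t["x"], t["y"], t["width"], t["height"], t["text"]]
            for t in page["tokens"]
        ]
        # "w" and "h" are assumed short keys for illustration only.
        pages.append({"w": meta["width"], "h": meta["height"], "t": rows})
    return {"v": V2_VERSION, "p": pages}
```

Under these assumptions, a caller passes whatever it loaded and always receives a v2 dict or an exception, so no downstream code needs a v1 branch.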
Phase 2 of v1 -> v2 internal migration. All 18 active call sites
that previously consumed v1 PAWLs via `expand_pawls_pages` now load
canonical v2 via `load_canonical_v2` and walk it through `iter_pages`
+ `TokenView`/`PageView` (or directly on the v2 dict).
Migrated consumers:
- `shared/decorators.py` (sync + async doc analyzer task injectors)
- `tasks/doc_tasks.py` (FUNSD export pipeline)
- `tasks/data_extract_tasks.py` (LLM extraction grounding)
- `annotations/{utils,models}.py` and the `populate_content_modalities`
management command
- `llms/tools/image_tools.py` and `llms/tools/core_tools/{search,
document_indexing,annotations}.py`
- `utils/{pdf_token_extraction,multimodal_embeddings,
extraction_grounding,importing}.py`
- `pipeline/base/parser.py` (passes the v2 result of `to_canonical_v2`
into `import_annotations`)
`expand_pawls_pages` retains 7 deliberate callers, all at external
boundaries:
- 5 plasmapdf `build_translation_layer` hand-offs (decorators ×2,
three core_tools modules, extraction_grounding) — third-party API
still consumes v1 `PawlsPagePythonType` lists.
- 2 export wire-format payload assemblers (`utils/etl.py`,
`utils/export_v2.py`) — the `OpenContractDocExport.pawls_file_content`
contract is documented v1 and unchanged here.
Each remaining call site has an inline comment marking it as a
deliberate v2->v1 hand-off so Phase 4's reaper can decide between
deleting them (after plasmapdf/export wire migrations) or moving them
behind a permanent `pawls_io.to_v1_pages` adaptor.
Test fixtures for `test_annotated_document_import`,
`test_task_decorators`, and `test_async_task_decorators` updated to
match the v2 in-memory contract injected by the migrated decorators
and lazy loaders.
Behavior unchanged for end users: same on-disk output, same export
wire format, same plasmapdf hand-off shape. Internal in-memory
representation is now v2 everywhere except at the named boundaries.
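For readers unfamiliar with the view pattern the migrated consumers use, here is a rough sketch of how `iter_pages` with `PageView`/`TokenView` might wrap a v2 dict without copying rows. The class bodies are assumptions; the token column order (`x`, `y`, `width`, `height`, `text`, optional image-meta dict at index 5) is inferred from the commit messages in this PR, not from the real module.

```python
from typing import Any, Iterator


class TokenView:
    """Read-only attribute access over one v2 token row (no copy is made)."""

    __slots__ = ("_row",)

    def __init__(self, row: list[Any]) -> None:
        self._row = row

    @property
    def text(self) -> str:
        return self._row[4]

    @property
    def is_image(self) -> bool:
        # Assumed v2 contract: a dict at index 5 marks an image token.
        return len(self._row) > 5 and isinstance(self._row[5], dict)


class PageView:
    """Thin wrapper over one v2 page dict."""

    __slots__ = ("_page",)

    def __init__(self, page: dict[str, Any]) -> None:
        self._page = page

    @property
    def tokens(self) -> Iterator[TokenView]:
        # Each property access returns a fresh generator over the rows.
        for row in self._page.get("t", []):
            if isinstance(row, list):  # defensively skip non-list rows
                yield TokenView(row)


def iter_pages(pawls_v2: dict[str, Any]) -> Iterator[PageView]:
    """Yield a PageView per page dict, skipping non-dict entries."""
    for page in pawls_v2.get("p", []):
        if isinstance(page, dict):
            yield PageView(page)
```

A consumer would then write something like `" ".join(t.text for p in iter_pages(doc) for t in p.tokens if not t.is_image)` instead of indexing into v1 page dicts.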
Phase 3 of v1 -> v2 migration. The frontend now operates on
v2-canonical typed objects in memory. The decoder is the only
place v1 wire input is tolerated (some on-disk files are still
v1 because the migration deliberately does not backfill).
Decoder rewrite (`frontend/src/utils/compactPawls.ts`):
- New API: `decodeV2Pawls(json: unknown): CompactPage[]` accepts
v1 (top-level array) or v2 (`{v: 2, p: [...]}`) wire input and
always returns v2-shape `CompactPage[]`. Throws on garbage.
- `isV2WirePawls(value)` type guard for the wire shape.
- `expandPawlsPages` and `isCompactPawlsFormat` removed entirely.
Type system (`frontend/src/components/types.ts`):
- v1 `Token`, `PageTokens`, `Page` removed.
- Replaced with `CompactPage`, `CompactToken`, `CompactImageMeta`.
- Page is flat (`page.width`, not `page.page.width`).
- Image fields collapsed into `CompactImageMeta` keyed with v2
short keys (`p`, `b64`, `f`, `ch`, `ow`, `oh`, `it`).
- `is_image` -> `isImage` (always set).
Hot-path migrations:
- `annotator/types/pdf.ts` (PDFPageInfo): `tokens: CompactToken[]`,
`t.is_image` -> `t.isImage`.
- `utils/transform.tsx`: `normalizeTokensToPdfViewport` and
`resolvePageTokens` retyped; flattened page-shape access.
- `annotator/api/rest.ts`, `annotator/api/cachedRest.ts`: return
type now `Promise<CompactPage[]>`.
The other recon-listed files (PDFPage, SelectionLayer,
textBlockEncoding, useTextSearch, SelectionTokens,
SelectionTokenGroup, DocumentAtom) needed no changes -- they
access tokens via fields whose names are unchanged on
`CompactToken` (`x`, `y`, `width`, `height`, `text`).
Tests:
- `compactPawls.test.ts`, `transform.normalizeTokens.test.ts`,
`pdf.test.ts` updated to v2 wire fixtures and the new
in-memory shape. Decoder kept one v1-wire test case to verify
v1 tolerance.
- `yarn tsc --noEmit` clean. 61 / 61 unit tests pass.
IndexedDB cache invalidation (`services/documentCacheManager.ts`):
- DB_VERSION bumped 2 -> 3. The pawls in-memory shape changed,
so any cached entries from before this commit would deserialize
into v1-shape objects and break consumers that now expect v2.
- `onupgradeneeded` clears both stores on the 2 -> 3 transition.
PDFs and text files are cheap to re-fetch.
Phase 4 of v1 -> v2 migration. Cleans up the seven remaining
`expand_pawls_pages` callers (all at deliberate external
boundaries: plasmapdf hand-off + export wire format) by routing
them through a permanent named adaptor.
`pawls_io.to_v1_pages(canonical_v2)`:
- Boundary-only v2 -> v1 adaptor. Docstring restricts use to two
documented boundaries: plasmapdf `build_translation_layer` and
export wire-format payload assembly
(`OpenContractDocExport.pawls_file_content`,
`StructuralAnnotationSetExport.pawls_file_content`).
- Active runtime code MUST NOT use it. Operate on the canonical
v2 dict (or `PageView`/`TokenView`) instead.
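A boundary-only v2 -> v1 adaptor of the kind described above might look like the sketch below. It is illustrative only: `to_v1_pages_sketch` is a hypothetical name, the `w`/`h` page keys are assumed, and image metadata expansion is omitted. The v1-list pass-through mirrors behavior that a later commit in this PR documents for `to_v1_pages`.

```python
from typing import Any, Union


def to_v1_pages_sketch(
    pawls: Union[dict[str, Any], list[dict[str, Any]]],
) -> list[dict[str, Any]]:
    """Expand a canonical v2 dict into v1 page dicts for an external boundary."""
    if isinstance(pawls, list):
        return pawls  # v1 list passes through unchanged

    pages_v1 = []
    # v2 pages are positional, so the v1 page index comes from enumerate().
    for index, page in enumerate(pawls["p"]):
        tokens = []
        for row in page["t"]:
            x, y, width, height, text = row[:5]
            token = {"x": x, "y": y, "width": width, "height": height, "text": text}
            if len(row) > 5 and isinstance(row[5], dict):
                token["is_image"] = True  # image metadata expansion omitted here
            tokens.append(token)
        pages_v1.append(
            {
                # "w" and "h" are assumed v2 short keys for this example.
                "page": {"width": page["w"], "height": page["h"], "index": index},
                "tokens": tokens,
            }
        )
    return pages_v1
```

Keeping this conversion in one named function makes every v2 -> v1 hand-off easy to find, which is the stated purpose of routing the boundary call sites through `pawls_io`.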
Routed seven boundary call sites from
`compact_pawls.expand_pawls_pages` to `pawls_io.to_v1_pages`:
- `shared/decorators.py` (sync + async doc analyzer hand-off, ×2)
- `llms/tools/core_tools/{search,document_indexing,annotations}.py`
- `utils/extraction_grounding.py`
- `utils/{etl,export_v2}.py` (export wire-format assemblers)
After this commit, no active code in `opencontractserver/` (outside
`pawls_io.py`'s own internal use, the legacy
`compact_pawls.expand_pawls_pages` implementation, and the v1<->v2
round-trip tests) imports `expand_pawls_pages` directly. Each
boundary site has the inline comment naming the boundary intent.
Doc rewrite (`docs/architecture/pawls-format.md`):
- "Accessor Layer" section -> "Boundary Layer". Documents the four
v1 boundaries (backend import / backend export / plasmapdf /
frontend wire), the canonical v2 contract for active code, the
strict 100k-tokens-per-page failure mode, and the new
`pawls_io.py` API. Read/write path tables updated. Implementation
table now lists `pawls_io.py` as the boundary module and
`compact_pawls.py` as low-level codec.
CHANGELOG entry under [Unreleased] / Changed summarizing the
end-to-end migration.
Verified locally: `python3 -m py_compile` clean on all touched
backend files; `yarn tsc --noEmit` clean; `yarn test:unit`
68 / 68 pass on `compactPawls`, `transform.normalizeTokens`,
`useTextSearch`, `pdf`, `documentCacheManager` modules. Backend
test suite must be run in Docker (`docker compose -f test.yml run
django pytest opencontractserver/tests/test_pawls_io.py
opencontractserver/tests/test_compact_pawls.py …`); Docker daemon
was unavailable in this sandbox.
Code Review - PR #1488: Migrate codebase to v2 PAWLs as canonical format
Overview
This is a substantial, well-scoped architectural migration. The core design - a single
Issues
1. Double import in
…versions-ke9Pp # Conflicts: # CHANGELOG.md
…rrow worker_uploads error, depublish CompactPawlsV2Wire, link cleanup TODO to #1490
Code Review: Migrate codebase to v2 PAWLs as canonical format end-to-end
This is a well-structured architectural migration. The boundary-layer approach (v2 everywhere internally, v1 only at four named I/O seams) is the right call, and the PR executes it consistently across a large surface area. Below are findings ranging from bugs to nits.
Potential Bugs
1.
elif isinstance(pawls_data, list):
pawls_data = to_canonical_v2(pawls_data)
2. Frontend:
if (typeof page !== "object" || page === null) {
throw new Error("Invalid v1 PAWLs page: ...");
}
The callers in
return axios.get(url).then((r) => decodeV2Pawls(r.data));
If the server returns malformed JSON (or an unexpected shape during a rolling deploy), the promise rejects instead of resolving to
DRY / Architecture
3.
from opencontractserver.utils.compact_pawls import (
_IMAGE_KEY_REVERSE, # <-- leading underscore = internal
compact_pawls_pages,
)
4. Duplicate v2→v1 image-token adaptor (violates DRY)
Inconsistent v2 Access Patterns
5. Both files were migrated to call
Minor Issues
6. Redundant
if not is_compact_pawls_format(result):
raise ValueError(...)
assert isinstance(result, dict) # redundant — assert is disabled with -O
return result
The
7. This could mislead callers into thinking passing a v1 list is an error. Either tighten the implementation to validate the input is v2, or update the docstring to acknowledge the pass-through behavior.
8. Sentinel value change in
# Before:
pawls_data = [] # falsy, prevents retry
# After:
pawls_data = {} # falsy, prevents retry
Technically correct (both are falsy), but the comment says "normalize to v2 and short-circuit on empty input." An empty dict is not a valid v2 PAWLs dict (
What's Good
Summary
The two items worth addressing before merge are: the inconsistent v1-normalization gap between
In _compact_token, the 6th-element image_meta dict was only appended when non-empty, which silently dropped the is_image semantic for image tokens that carried no metadata fields. The v2 contract is 'dict at index 5 == is_image', so we now append the dict unconditionally for image tokens. Also align test_annotation_window_pdf fixture with v2 positional page indexing — the mock pawls page is now at index 0 to match the annotation's pageIndex 0 reference.
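The fix described above can be illustrated with a small sketch. It is not the actual `_compact_token`: the function name and the v1 field names in `IMAGE_FIELDS` are hypothetical (only the v2 short keys such as `p` and `f` appear in this PR's text). What it shows is the rule that an image token always gets a dict at index 5, even an empty one.

```python
from typing import Any

# Hypothetical v1 field name -> v2 short key mapping, for illustration only.
IMAGE_FIELDS = {"image_path": "p", "format": "f"}


def compact_token_sketch(token: dict[str, Any]) -> list[Any]:
    """Compact one v1 token dict into a v2 positional row."""
    row = [token["x"], token["y"], token["width"], token["height"], token.get("text", "")]
    if token.get("is_image"):
        meta = {short: token[long] for long, short in IMAGE_FIELDS.items() if long in token}
        # Append unconditionally: under the v2 contract the presence of a dict
        # at index 5 is what marks the token as an image, so an empty {} matters.
        row.append(meta)
    return row
```

With the old `if meta:` guard, an image token with no metadata fields would compact to a five-element row and lose its image flag on the next read.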
- Move duplicate _v2_*_to_v1_dict helpers from image_tools.py and multimodal_embeddings.py into a single token_view_to_v1_image_dict in pawls_io.py (DRY).
- Promote compact_pawls._IMAGE_KEY_REVERSE to a public alias IMAGE_META_V2_TO_V1 so pawls_io stops reaching into a private symbol.
- Drop the now-redundant assert in to_canonical_v2 (already proven by the is_compact_pawls_format guard); use cast() for type narrowing instead.
- Document the v1-list pass-through behavior of to_v1_pages so callers don't expect strict v2-only validation.
Code Review - PR #1488: Migrate codebase to v2 PAWLs as canonical format end-to-end
Overview
This is a well-scoped, well-executed architectural migration. The core idea (define four explicit I/O boundaries where v1 is tolerated, and treat v1 appearing anywhere else as a bug) is a significant improvement over the previous "accessor layer" approach that silently converted in every consumer. The
Positive highlights
Issues and suggestions
1.
CHANGELOG conflict resolved: kept both PAWLs v2 migration entry under [Unreleased] -> Changed and the new entries from main. Removed duplicate EmbedderSelector entry that was under both my section and main's section.
…pages usage, tighten types
Latest review (Claude bot, 2026-05-03):
- data_extract_tasks.py:796 — pass FieldFile directly to load_canonical_v2
instead of .path. .path raises NotImplementedError on S3/GCS backends; the
load helper handles FieldFile, file-like, and Path uniformly.
- populate_content_modalities.py + annotations/utils.py — replaced manual
v2 dict-walking (pages.get('p'), page.get('t')) with iter_pages() +
PageView.tokens for consistency with the rest of the migrated code.
- annotations/utils.py — tightened pawls_data signature from Optional[Any]
to Optional[Union[list[dict[str, Any]], dict[str, Any]]] for both
compute_content_modalities and update_annotation_modalities.
- pawls_io.iter_pages — removed redundant version check (is_compact_pawls_format
already enforces v == COMPACT_PAWLS_VERSION); dropped now-unused import.
- pawls_io.PageView.tokens — docstring now explicitly notes the iterator
is single-pass; new test pins the contract.
- pawls_io.to_canonical_v2 — corrected misleading 'guard above' comment;
the is_compact_pawls_format check is below, not above, the compact call.
- pawls_io._read_text_from_source — added typed-error guard rejecting raw
JSON-string inputs (caller hands in a list/dict instead of a path).
- New tests: ToV1PagesTests, TokenViewToV1ImageDictTests,
test_load_raw_json_string_raises_typeerror, test_tokens_property_is_single_pass.
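The raw-JSON-string guard mentioned above could be sketched as follows. The function name `read_text_from_source_sketch` and its exact branching are assumptions for illustration; the only details taken from this thread are that `str` inputs are treated as filesystem paths and that a string which looks like JSON is rejected with a `TypeError`.

```python
from pathlib import Path
from typing import Any


def read_text_from_source_sketch(source: Any) -> str:
    """Return the text of a PAWLs source given a path, Path, or file-like object."""
    if isinstance(source, str):
        # A str is assumed to be a filesystem path. If it looks like JSON, the
        # caller most likely passed file contents instead of a path.
        if source.lstrip().startswith(("[", "{")):
            raise TypeError(
                "got a raw JSON string; pass a path, a file-like object, "
                "or the already-decoded list/dict instead"
            )
        source = Path(source)
    if isinstance(source, Path):
        return source.read_text(encoding="utf-8")
    # File-like objects: read() may return bytes or str depending on open mode.
    data = source.read()
    return data.decode("utf-8") if isinstance(data, bytes) else data
```

The trade-off is that a real path beginning with `[` or `{` would be rejected, which is why documenting the str-as-path assumption matters more than the heuristic itself.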
Frontend (compactPawls.test.ts): 12 new branch-coverage tests for the v1/v2 decoder: non-object page entries, missing/non-numeric width/height, non-array tokens, non-string text, image tokens with no metadata, and v2 wire edge cases.
Backend (test_multimodal_embeddings_utils.py): 18 new tests across three classes:
- TestResolveV2Pawls: exhaustive coverage of _resolve_v2_pawls: None / empty list / empty dict / empty string / unsupported type / invalid v2 dict / malformed v2 pages, plus v1->v2 normalization and v2 idempotence.
- TestGetAnnotationImageTokensV2Paths: v2 canonical dict input, garbage pawls_data fallback, structural_set v2 load path.
- TestExtractAndStoreAnnotationImages: full coverage of the previously untested extract_and_store_annotation_images: None / empty / garbage / no-image-tokens / OOB page / OOB token / non-dict page / v1 auto-normalize / valid v2 dict produces a non-empty image_content_file.
These plug the largest patch-coverage gaps reported by codecov on PR #1488 (multimodal_embeddings.py was at 43%; compactPawls.ts was at 73%).
…> 93%)
The codecov report on PR #1488 flagged the populate_content_modalities management command at 0% patch coverage with 15 lines missing; the command had no test file at all.
New test file at opencontractserver/tests/test_populate_content_modalities_command.py adds 11 end-to-end tests via call_command():
- --dry-run does not modify rows
- --force reprocesses pre-set annotations
- Default mode skips already-populated rows
- PAWLs path: image-only, text-only, mixed annotations
- No-document fallback to label-text hint (image keyword vs neutral)
- PAWLs load failure falls back to label hint
- Empty token refs fall back to label hint
- Processing exceptions increment error_count and report 'Errors: 1'
Final coverage: 93% (77/83 statements). The 6 remaining lines are minor defensive branches (out-of-bounds index skips, empty-modality default) already exercised at the helper level in test_annotation_utils.py.
…versions-ke9Pp # Conflicts: # CHANGELOG.md
mypy.ini gains an ignore_errors entry for the new test_populate_content_modalities_command test module — same baseline treatment used for all other Django TestCase modules in the project that rely on the setUpTestData class-attribute pattern (mypy can't follow Django's deferred classmethod attribute assignment without per-test refactoring). Other diffs are black/isort line-collapse fixups.
Code Review - PR #1488: Migrate codebase to v2 PAWLs as canonical format end-to-end
Overview
This is a well-planned, large-scale refactoring that establishes v2 PAWLs as the canonical runtime format end-to-end. The boundary-module approach in
What's Done Well
Issues and Suggestions
1.
annotation_window's PDF and text branches both used os.path.exists(.path)
and open(.path) — pre-existing patterns that raise NotImplementedError
on remote storage backends (S3/GCS) since FieldFile.path is only defined
for local filesystem storage. Both backends (local + cloud) now work:
- default_storage.exists(file.name) replaces os.path.exists(file.path)
for the existence guards (lines 703, 789).
- doc.txt_extract_file.open('rb').read().decode('utf-8') replaces
open(doc.txt_extract_file.path, encoding='utf-8') for the text read.
- doc.pawls_parse_file is already passed by FieldFile to load_canonical_v2
(fixed earlier in this PR).
- Removed now-unused 'import os'; added 'from django.core.files.storage
import default_storage'.
test_agent_search_tools.py mocks updated:
- @patch('...os.path.exists') -> @patch('...default_storage.exists')
- patch('builtins.open') replaced with patches at the right boundary:
load_canonical_v2 for PDF tests (returns canonical v2 directly) and
the FieldFile.open() classmethod for the text test. The size-limit
test now builds a 2000-token v1 PAWLs payload (instead of mocking
open with non-JSON bytes) and asserts a soft clamp around 500
tokens-per-side + the annotation's own 3 tokens.
pawls_io.iter_pages: corrected return type from Iterable[PageView] to
Iterator[PageView] to match the generator implementation, with a
single-pass note in the docstring.
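The reason `Iterator` is the more accurate annotation for a generator function can be shown with a standalone example. The helper `iter_rows` below is hypothetical and unrelated to the real `iter_pages`; it only demonstrates that a generator object is exhausted after one traversal, which `Iterable` would not signal to callers.

```python
from typing import Iterator


def iter_rows(rows: list[str]) -> Iterator[str]:
    """Generator function: each call returns a new single-pass iterator."""
    for row in rows:
        yield row


it = iter_rows(["a", "b"])
first_pass = list(it)   # consumes the generator
second_pass = list(it)  # same object again: nothing left to yield

# Calling the function again produces a fresh iterator from the start.
fresh_pass = list(iter_rows(["a", "b"]))
```

Callers that need to walk the pages more than once should either call the function again or materialize the result with `list(...)` first.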
Code Review: PAWLs v2 canonical format migration (#1488)
This is a well-designed, comprehensive migration. The single-boundary pattern (
Correctness issues
1. The function signature still declares
2. In
# Before
if img_meta:
arr.append(img_meta)
# After
arr.append(img_meta) # always append for image tokens
This is intentional (preserves
3.
alt_text=None, # alt_text not currently persisted in v2 image meta
The old code was
Minor issues / code quality
4.
pages = list(iter_pages(pawls_data))
if not (0 <= page_index < len(pages)):
return None
tokens = list(pages[page_index].tokens)
...
5. Inconsistent v2-dict raw access in
6. Materializing all page token-iterators eagerly on every annotation evaluated means O(tokens_per_page) work for each unique page touched. For a management command that runs over thousands of annotations on large docs, consider caching
7. Pre-existing bug in
The variable names are confusing but the logic was faithfully copied from before:
pawls_ytop = pawls_xleft + pawls_token.width # actually computes x-right
pawls_xright = pawls_ybottom + pawls_token.height # actually computes y-bottom
The FUNSD box is
8.
- pawls_data: list[dict] = None,
+ pawls_data=None,
This drops the annotation entirely.
Tests
The test coverage is genuinely excellent:
One gap: the
Architecture / design praise
Summary
The core migration is sound. Issues 1 and 3 are the ones worth addressing before merge: the
The v2 PAWLs boundary migration added significant new logic that was under-tested in the patch; codecov flagged it. This commit adds focused unit tests for the cold paths most affected:
- test_pawls_io: cover load_canonical_v2 FieldFile-with-str-read branch, empty payload, fallback-to-v1 ValueError, image_meta_v1 None for text tokens, token_view_to_v1_image_dict, iter_pages defensive skips, and PageView.tokens skipping non-list rows.
- test_annotation_utils: cover compute_content_modalities normalization failure / non-list-non-dict pawls fallback paths.
- test_multimodal_embeddings_utils: cover _resolve_v2_pawls (falsy / v1-list / garbage / unsupported), extract_and_store_annotation_images (success + skip branches), batch_extract_annotation_images (modality filter).
- test_populate_content_modalities (new): full coverage of the populate_content_modalities mgmt command: _determine_modalities branches plus handle() for default / dry-run / force / per-annotation error paths.
- mypy.ini: register the new test files in the test-suite ignore_errors list (project convention).
Local coverage post-change for the touched modules: pawls_io 98%, multimodal_embeddings 90%, annotations/utils 97%, populate_content_modalities 93%.
…_text loss
- importing.py: ``import_annotations(pawls_data=...)`` lost its type hint
during the v2 migration (was ``list[dict] = None``, became bare
``=None``). Restored as
``list[dict[str, Any]] | dict[str, Any] | None`` to mirror the docstring
contract — accepts a v1 list or a canonical v2 dict.
- image_tools.py: expanded the ``alt_text=None`` comment in
``list_document_images`` to document that the pre-v2 read site
(``token.get("alt_text")``) was a no-op end-to-end (no parser ever
populated the field), so the v2 short-key omission isn't a regression
— but if a parser ever starts emitting alt text, add an "at" key to
CompactImageMeta and read it here.
Dismissed: review claim that ``list_document_images(pawls_data=...)`` is
mistyped — the function does not accept a ``pawls_data`` parameter at all,
the suggestion appeared to reference an outdated signature.
…com/Open-Source-Legal/OpenContracts into claude/analyze-pawls-versions-ke9Pp # Conflicts: # mypy.ini # opencontractserver/tests/test_multimodal_embeddings_utils.py # opencontractserver/tests/test_pawls_io.py
CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Project Overview
OpenContracts is an AGPL-3.0 enterprise document analytics platform for PDFs and text-based formats. It features a Django/GraphQL backend with PostgreSQL + pgvector, a React/TypeScript frontend with Jotai state management, and pluggable document processing pipelines powered by machine learning models.
Baseline Commit Rules
Essential Commands
Backend (Django)
# Run backend tests (sequential, use --keepdb to speed up subsequent runs)
docker compose -f test.yml run django python manage.py test --keepdb
# Run backend tests in PARALLEL (recommended - ~4x faster)
# Uses pytest-xdist with 4 workers, --dist loadscope keeps class tests together
docker compose -f test.yml run django pytest -n 4 --dist loadscope
# Run parallel tests with auto-detected worker count (uses all CPU cores)
docker compose -f test.yml run django pytest -n auto --dist loadscope
# Run parallel tests with fresh databases (first run or after schema changes)
docker compose -f test.yml run django pytest -n 4 --dist loadscope --create-db
# Run specific test file
docker compose -f test.yml run django python manage.py test opencontractserver.tests.test_notifications --keepdb
# Run specific test file in parallel
docker compose -f test.yml run django pytest opencontractserver/tests/test_notifications.py -n 4 --dist loadscope
# Run specific test class/method
docker compose -f test.yml run django python manage.py test opencontractserver.tests.test_notifications.TestNotificationModel.test_create_notification --keepdb
# Apply database migrations
docker compose -f local.yml run django python manage.py migrate
# Create new migration
docker compose -f local.yml run django python manage.py makemigrations
# Django shell
docker compose -f local.yml run django python manage.py shell
# Code quality (runs automatically via pre-commit hooks)
pre-commit run --all-files
Frontend (React/TypeScript)
cd frontend
# Start development server (proxies to Django on :8000)
yarn start
# Run unit tests (Vitest) - watches by default
yarn test:unit
# Run component tests (Playwright) - CRITICAL: Use --reporter=list to prevent hanging
yarn test:ct --reporter=list
# Run component tests with grep filter
yarn test:ct --reporter=list -g "test name pattern"
# Run E2E tests
yarn test:e2e
# Coverage reports (unit tests via Vitest, component tests via Playwright + Istanbul)
yarn test:coverage:unit
yarn test:coverage:ct
# Linting and formatting
yarn lint
yarn fix-styles
# Build for production
yarn build
# Preview production build locally
yarn serve
Production Deployment
# CRITICAL: Always run migrations FIRST in production
docker compose -f production.yml --profile migrate up migrate
# Then start main services
docker compose -f production.yml up
High-Level Architecture
Backend Architecture
Stack: Django 4.x + GraphQL (Graphene) + PostgreSQL + pgvector + Celery
Key Patterns:
Frontend ArchitectureStack: React 18 + TypeScript + Apollo Client + Jotai (atoms) + PDF.js + Vite Key Patterns:
Data Flow ArchitectureDocument Processing:
GraphQL Permission Flow:
Critical Security Patterns
Critical Concepts
Testing PatternsManual Test ScriptsLocation: When performing manual testing (e.g., testing migrations, verifying database state, testing API endpoints interactively), always document the test steps in a markdown file under Format: # Test: [Brief description]
## Purpose
What this test verifies.
## Prerequisites
- Required state (e.g., "migration at 0058")
- Required data (e.g., "at least one document exists")
## Steps
1. Step one with exact command
```bash
docker compose -f local.yml run --rm django python manage.py shell -c "..."
```
## Expected Results
## Cleanup
Commands to restore original state if needed.
Automated Documentation Screenshots
Location: Screenshots for documentation are automatically captured during Playwright component tests and committed back to the PR branch by the
How it works:
Naming convention (
At least 2 segments required, 3 recommended. All lowercase alphanumeric with single hyphens. Example: import { docScreenshot } from "./utils/docScreenshot";
// After the component renders and assertions pass:
await docScreenshot(page, "badges--celebration-modal--auto-award");Rules:
Release Screenshots (Point-in-Time)For release notes, use Location: import { releaseScreenshot } from "./utils/docScreenshot";
await releaseScreenshot(page, "v3.0.0.b3", "landing-page", { fullPage: true });Key differences from
When to use which:
Authenticated Playwright Testing (Live Frontend Debugging)When you need to interact with the running frontend as an authenticated user (e.g., debugging why a query returns empty results), use Django admin session cookies to authenticate GraphQL requests. Architecture context: The frontend uses Auth0 for authentication, but the Django backend also accepts session cookie auth. Apollo Client sends GraphQL requests directly to Step 1: Set a password for the superuser (one-time setup): docker compose -f local.yml exec django python manage.py shell -c "
from django.contrib.auth import get_user_model
User = get_user_model()
u = User.objects.filter(is_superuser=True).first()
u.set_password('testpass123')
u.save()
print(f'Password set for {u.username}')
"Step 2: Playwright script pattern: const { chromium } = require('playwright');
(async () => {
const browser = await chromium.launch({ headless: true });
const context = await browser.newContext();
const page = await context.newPage();
// Collect console messages for debugging
const consoleMsgs = [];
page.on('console', msg => consoleMsgs.push('[' + msg.type() + '] ' + msg.text()));
// 1. Login to Django admin to get session cookie
await page.goto('http://localhost:8000/admin/login/');
await page.fill('#id_username', '<superuser-username>');
await page.fill('#id_password', 'testpass123');
await page.click('input[type=submit]');
await page.waitForTimeout(2000);
// 2. Extract the session cookie
const cookies = await context.cookies();
const sessionCookie = cookies.find(c => c.name === 'sessionid');
// 3. Intercept GraphQL requests to inject the session cookie
// (needed because Apollo sends cross-origin requests to :8000)
await page.route('**/graphql/**', async (route) => {
const headers = {
...route.request().headers(),
'Cookie': 'sessionid=' + sessionCookie.value,
};
await route.continue({ headers });
});
// 4. Navigate to the frontend page under test
await page.goto('http://localhost:5173/extracts');
await page.waitForTimeout(5000);
// 5. Inspect results
const bodyText = await page.textContent('body');
console.log(bodyText);
await browser.close();
})();Run from the frontend directory (where cd frontend && node /path/to/script.jsKey details:
Alternative — create a session programmatically (no admin login needed): docker compose -f local.yml exec django python manage.py shell -c "
from django.contrib.sessions.backends.db import SessionStore
from django.contrib.auth import get_user_model
User = get_user_model()
user = User.objects.filter(is_superuser=True).first()
session = SessionStore()
session['_auth_user_id'] = str(user.pk)
session['_auth_user_backend'] = 'django.contrib.auth.backends.ModelBackend'
session['_auth_user_hash'] = user.get_session_auth_hash()
session.save()
print(f'Session key: {session.session_key}')
"Then use the printed session key directly in curl or Playwright route interception. Documentation Locations
Branch StrategyThis project follows trunk-based development:
Changelog MaintenanceIMPORTANT: Always update The changelog follows Keep a Changelog format: ## [Unreleased] - YYYY-MM-DD
### Added
- New features
### Fixed
- Bug fixes with file locations and line numbers
### Changed
- Changes to existing functionality
### Technical Details
- Implementation specifics, architectural notesWhen to update:
What to include:
Pre-commit HooksAutomatically run on commit:
Run manually: Common Pitfalls
|
Code Review: PAWLs v2 Canonical Format MigrationOverviewThis is a well-executed large-scale format migration (~2,900 additions / ~633 deletions across 43 files). The core idea — centralizing v1↔v2 conversion behind a single boundary module and eliminating v1 shape in active runtime code — is sound and clearly executed. The documentation, changelog, and architecture doc updates are thorough. Strengths
Issues and Suggestions1. Behavioral change in compact_pawls.py — empty {} appended for all image tokens (medium)File: opencontractserver/utils/compact_pawls.py, _compact_token The old guard if img_meta: arr.append(img_meta) only appended the 6th metadata element when at least one image field was present. The new code always appends {} for image tokens even if all metadata fields are absent. This is the right change for TokenView.is_image detection, but it is a silent on-disk format change for image tokens that have no metadata. Newly written files will differ from files written before this PR at those positions. The round-trip test catches this semantically, but the CHANGELOG entry does not mention it, and there is no backend test specifically for the empty-metadata round-trip case (the frontend test_image_token_with_no_metadata_fields_produces_empty_imageMeta covers the decode side only). Recommend:

2. Misleading sentinel comment in `annotations/models.py` (low)
File: `opencontractserver/annotations/models.py`, slow-path PAWLs load on error
The comment on `pawls_data = {}` implies `extract_and_store` will be called with the `{}` sentinel, but `{}` is falsy, so `if pawls_data:` short-circuits before `extract_and_store` is ever reached. Suggest: `# {} is falsy: prevents repeated load attempts; the if pawls_data: guard below still skips extraction.`

3. `PageView.tokens` docstring says single-pass but each property access gives a fresh generator (low)
File: `opencontractserver/utils/pawls_io.py`, `PageView.tokens`
The property is a generator function, so each attribute access produces a new generator from the beginning. The test `test_tokens_property_is_single_pass` pins the intended contract correctly, but "single-pass" is a confusing label: a standard single-pass iterator means one traversal per object, whereas here each access gives a fresh iterator. Consider rephrasing to: each property access returns a new iterator from the start; a single iterator instance is exhausted after one pass and must not be reused.

4. `load_canonical_v2` path-detection heuristic is fragile (low)
File: `opencontractserver/utils/pawls_io.py`, `_read_text_from_source`
The heuristic `source.lstrip().startswith` would incorrectly reject a filesystem path beginning with `[` or `{` (valid on Linux, unusual in this domain). The guard catches a common caller mistake; suggest documenting the assumption in the function docstring (str inputs are always treated as filesystem paths; pre-decoded JSON must be passed as list/dict) rather than relying solely on the heuristic.

5. `decodeV2Pawls` now throws on unrecognized non-null input, and callers are unprotected (minor breaking change)
File: `frontend/src/utils/compactPawls.ts` / `frontend/src/components/annotator/api/rest.ts`
The old `expandPawlsPages` silently returned `[]` for unrecognized inputs. `decodeV2Pawls` now throws. The two call sites (`rest.ts` / `cachedRest.ts`) do not wrap this in try/catch. If the server returns malformed PAWLs data (mid-migration document, proxy error body), the error propagates as an unhandled rejection, crashing the document loading flow rather than gracefully degrading. Throwing is correct developer ergonomics, but the API-layer callers should decide explicitly whether to propagate or catch. A minimal guard at the `getPawlsLayer` call site would restore the previous resilience without hiding bugs.

6. `mypy.ini`: two new test-file suppressions (informational)
Both `test_pawls_io` and `test_populate_content_modalities_command` are added with `ignore_errors = True`. Consistent with project pattern, no action required. Enabling mypy on `test_pawls_io.py` in particular would be cheap and would catch type-shape regressions.

7. `token_view_to_v1_image_dict` bridge: confirm issue exists (informational)
The function is documented as a Phase-2 bridge tracked in issue 1490. Confirming that issue exists and is linked from this PR helps future reviewers find the cleanup work without hunting commit history.

Summary
High-quality, well-scoped migration. Boundary architecture is correct, tests are comprehensive, codebase is cleaner post-migration. Two items worth addressing before merge:
Everything else is low-severity and can be addressed as follow-up.
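
To make the distinction in item 3 concrete, here is a minimal sketch of a property whose body is a generator. `PageViewSketch` is a hypothetical stand-in, not the project's `PageView`, and the row layout `[x, y, width, height, text]` is an assumption made only for illustration.

```python
from typing import Any, Iterator


class PageViewSketch:
    """Hypothetical stand-in for PageView; not the project's implementation."""

    def __init__(self, rows: list[list[Any]]) -> None:
        self._rows = rows

    @property
    def tokens(self) -> Iterator[list[Any]]:
        # A generator body inside a property: every attribute access
        # builds a new generator that starts at the first row.
        for row in self._rows:
            yield row


# Assumed row layout for the sketch: [x, y, width, height, text]
page = PageViewSketch([[0, 0, 10, 5, "Hello"], [12, 0, 10, 5, "world"]])

first_pass = [row[4] for row in page.tokens]
second_pass = [row[4] for row in page.tokens]  # fresh generator, full result again

held = page.tokens       # one iterator instance, kept in a variable
consumed = list(held)    # walks every row once
leftover = list(held)    # the same instance is now exhausted
```

Both readings of "single-pass" show up here: `page.tokens` can be accessed repeatedly, while a stored iterator such as `held` can only be walked once.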
Review responses:
- Document on-disk format change for image tokens with no v1 metadata fields in CHANGELOG; new round-trip test in `test_pawls_io.py`.
- Clarify sentinel comment in `annotations/models.py` slow-path PAWLs load to accurately describe the falsy-guard behavior.
- Reword `PageView.tokens` docstring to distinguish per-access fresh iterators from single-pass instance semantics.
- Document the str-as-path assumption in `load_canonical_v2` docstring.
- Wrap `decodeV2Pawls` in `rest.ts` `getPawlsLayer` with try/catch + log so malformed PAWLs payloads from the server degrade gracefully rather than throwing into Apollo.

Coverage:
- New tests for `batch_extract_annotation_images`, `load_images_from_annotation_file`, `get_annotation_image_tokens` fast path / no-source / outer-exception branches, and `extract_and_store_annotation_images` outer exception path.
- New tests for `image_tools` `_extract_image_from_pawls` (v1, v2, out-of-bounds, non-image, garbage) and `get_annotation_images` structural-set branches (load success, load failure, no pawls file).
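
The on-disk change for image tokens without metadata, mentioned in the first review response, can be illustrated with a small sketch. This is not the code of `_compact_token`: the metadata key names in `IMAGE_FIELDS`, the `always_append_meta` switch, and the rule that a 6th element marks an image row are all assumptions made for the example.

```python
from typing import Any

# Hypothetical mapping of v1 image-metadata keys to short keys (assumed names).
IMAGE_FIELDS = {"image_path": "p", "format": "f"}


def compact_token_sketch(token: dict[str, Any], always_append_meta: bool) -> list[Any]:
    """Compact one v1 token dict into a row; the flag switches old/new behavior."""
    row = [token["x"], token["y"], token["width"], token["height"], token["text"]]
    if token.get("is_image"):
        meta = {short: token[long] for long, short in IMAGE_FIELDS.items() if long in token}
        # Old guard: append only when meta is non-empty.
        # New behavior: append even when meta is {}.
        if meta or always_append_meta:
            row.append(meta)
    return row


def row_is_image(row: list[Any]) -> bool:
    # Assumed detection rule: a 6th element marks an image token.
    return len(row) > 5


bare_image = {"x": 0, "y": 0, "width": 50, "height": 40, "text": "", "is_image": True}

old_row = compact_token_sketch(bare_image, always_append_meta=False)
new_row = compact_token_sketch(bare_image, always_append_meta=True)
```

Under these assumptions `old_row` has five elements and loses the image flag, while `new_row` ends with `{}` and keeps it, which is why files written after the change differ at exactly these positions.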
Code Review
This is a well-executed architectural migration. The single-boundary design for PAWLs v1→v2 normalization is the correct approach, the documentation is thorough, and the test suite is substantial (482-line Bug:
Summary
This PR completes the migration to PAWLs v2 as the canonical runtime format across the entire codebase (backend and frontend). v1 is now accepted only at well-defined I/O boundaries and is never exposed in active runtime code paths.
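
As a rough illustration of the boundary idea in this summary, the sketch below normalizes either wire shape into one v2 dict and raises on anything else. It is not the real `to_canonical_v2()`: the function names are hypothetical, the v1 page layout follows the usual PAWLs shape (nested `page` header plus token dicts), and the short page keys `i`, `w`, `h`, `t` are invented for the example.

```python
from typing import Any


def compact_page_sketch(page: dict[str, Any]) -> dict[str, Any]:
    """Flatten one v1 page into a compact page with row-style tokens.

    The short keys "i", "w", "h", "t" are hypothetical, chosen for the sketch.
    """
    header = page["page"]
    rows = [[t["x"], t["y"], t["width"], t["height"], t["text"]] for t in page["tokens"]]
    return {"i": header["index"], "w": header["width"], "h": header["height"], "t": rows}


def to_canonical_v2_sketch(data: Any) -> dict[str, Any]:
    """Accept v1 (list) or v2 (dict); always return a v2 dict or raise."""
    if isinstance(data, dict):
        if data.get("v") == 2 and isinstance(data.get("p"), list):
            return data  # already canonical: returned unchanged
        raise ValueError("dict input is not a v2 PAWLs payload")
    if isinstance(data, list):
        try:
            return {"v": 2, "p": [compact_page_sketch(page) for page in data]}
        except (KeyError, TypeError) as exc:
            # Boundary-strict: malformed v1 raises instead of leaking through.
            raise ValueError("list input is not valid v1 PAWLs") from exc
    raise ValueError(f"unsupported PAWLs input: {type(data).__name__}")


v1_pages = [
    {
        "page": {"width": 612, "height": 792, "index": 0},
        "tokens": [{"x": 1, "y": 2, "width": 3, "height": 4, "text": "Hi"}],
    }
]
canonical = to_canonical_v2_sketch(v1_pages)
```

Callers past the boundary then only ever see the dict shape, so no downstream code needs a v1 branch.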
Key Changes
Backend
- New `pawls_io` module (`opencontractserver/utils/pawls_io.py`): Single import boundary for all PAWLs data entering the system at runtime
  - `to_canonical_v2()`: Normalizes v1 (list) or v2 (dict) wire input to canonical v2 dict
  - `load_canonical_v2()`: Reads from Django FieldFile, file-like, Path, or pre-decoded JSON and returns v2
  - `TokenView`/`PageView`/`iter_pages()`: Zero-copy read-views over v2 token rows with named attribute access (no index magic)
  - `to_v1_pages()`: Converts v2 → v1 only at export boundaries (document export, plasmapdf hand-off)
- Type definitions (`opencontractserver/types/dicts.py`): Added `CompactImageMetaType`, `CompactTokenType`, `CompactPageType` to describe canonical v2 shape; marked v1 types (`PawlsTokenPythonType`, `PawlsPagePythonType`) as import-boundary-only
- Consumer migration: Updated all active code paths to use `load_canonical_v2()` and read-views instead of `expand_pawls_pages()`:
  - `multimodal_embeddings.py`: Uses `TokenView` for image token extraction
  - `image_tools.py`: Uses `iter_pages()` and `PageView` for image listing
  - `annotations/utils.py`: Uses `load_canonical_v2()` for content modality detection
  - `etl.py`, `doc_tasks.py`, `data_extract_tasks.py`, etc.: All switched to v2 load boundary
- Tests (`opencontractserver/tests/test_pawls_io.py`): Comprehensive coverage of format normalization, I/O, read-views, and round-trip equivalence

Frontend
- Refactored `compactPawls.ts`:
  - `expandPawlsPages()` → `decodeV2Pawls()` to clarify it decodes wire format to canonical v2-shape objects
  - `isCompactPawlsFormat()` → `isV2WirePawls()` for clarity
  - `{v:2, p:[…]}`) wire input; always returns `CompactPage[]` (v2-canonical)
- Type definitions (`frontend/src/components/types.ts`):
  - `Token` → `CompactToken`, `PageTokens` → `CompactPage`
  - `CompactToken` always has `isImage` (camelCase, not `is_image`); image metadata lives on `imageMeta` using v2 short keys
  - `CompactPage` has flat fields (`index`, `width`, `height`, `tokens`), with no nested `page` object
- Consumer updates: All components updated to use `CompactPage`/`CompactToken` and `decodeV2Pawls()`: `transform.tsx`, `pdf.ts`, `rest.ts`, `cachedRest.ts`, `utils.ts`
- Cache invalidation (`documentCacheManager.ts`): Bumped DB version to v3 to clear stale v1-shaped entries

I/O Boundaries
v1 is accepted only at these four points:
- `pawls_io.load_canonical_v2()`: accepts v1 or v2, returns v2

https://claude.ai/code/session_01PNdtRXsw1NZEaRKfipqXyn
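
The "zero-copy read-view" and "export boundary" ideas from the backend changes can be sketched together. `TokenViewSketch` and `to_v1_token_sketch` are hypothetical names, and the row order `[x, y, width, height, text, optional metadata]` is an assumption for the example rather than the documented v2 layout.

```python
from typing import Any


class TokenViewSketch:
    """Named, read-only access over one v2 token row without copying it."""

    __slots__ = ("_row",)

    def __init__(self, row: list[Any]) -> None:
        self._row = row  # keeps a reference; the row itself is not copied

    @property
    def x(self) -> float:
        return self._row[0]

    @property
    def y(self) -> float:
        return self._row[1]

    @property
    def width(self) -> float:
        return self._row[2]

    @property
    def height(self) -> float:
        return self._row[3]

    @property
    def text(self) -> str:
        return self._row[4]

    @property
    def is_image(self) -> bool:
        # Assumed rule: a 6th element (possibly {}) marks an image token.
        return len(self._row) > 5


def to_v1_token_sketch(view: TokenViewSketch) -> dict[str, Any]:
    """Build a v1-style token dict, as an export boundary might."""
    token = {"x": view.x, "y": view.y, "width": view.width,
             "height": view.height, "text": view.text}
    if view.is_image:
        token["is_image"] = True
    return token


row = [10.0, 20.0, 30.0, 8.0, "Clause"]
view = TokenViewSketch(row)
row[4] = "Clause 1"  # the view reflects this because it wraps the same list
exported = to_v1_token_sketch(view)
image_view = TokenViewSketch([0.0, 0.0, 5.0, 5.0, "", {}])
```

Consumers read `view.text` or `view.is_image` instead of `row[4]` or `len(row) > 5`, so index knowledge stays in one place, and the dict form is only built where a v1 consumer actually needs it.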