
fix: guard None offset and empty page-range in TOC processing #217

Open

voidborne-d wants to merge 1 commit into VectifyAI:main from voidborne-d:fix/toc-offset-none-and-empty-page-range

Conversation

@voidborne-d

Summary

This PR fixes two related bugs in the TOC-with-page-numbers processing pipeline that cause silent failures or hard crashes when indexing certain PDFs.


Bug 1 — TypeError crash when page-offset is None

Function: add_page_offset_to_toc_json
Triggered by: calculate_page_offset() returning None (no title/page pairs matched)

calculate_page_offset() explicitly returns None when it cannot find any matching (title, page-number, physical-index) pairs — which happens when:

  • The TOC page numbers do not correspond to any of the first toc_check_page_num pages, or
  • The document has an unconventional layout where TOC page numbers and physical positions are mismatched.

add_page_offset_to_toc_json() then tried:

data[i]['physical_index'] = data[i]['page'] + offset  # offset is None → TypeError

Fix: Add an early-return guard — when offset is None, return data unchanged and let process_none_page_numbers handle the unresolved items:

def add_page_offset_to_toc_json(data, offset):
    if offset is None:
        return data
    ...
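A minimal, self-contained sketch of the guarded function (the loop body is an assumption based on the description above, not the exact upstream code):

```python
def add_page_offset_to_toc_json(data, offset):
    # Guard: calculate_page_offset() can legitimately return None when no
    # (title, page) pairs match; leave the data untouched so that
    # process_none_page_numbers can resolve the items later.
    if offset is None:
        return data
    for item in data:
        # Items without a 'page' key are left untouched.
        if 'page' in item and isinstance(item['page'], int):
            item['physical_index'] = item['page'] + offset
    return data

toc = [{'title': 'Intro', 'page': 3}, {'title': 'Appendix'}]
print(add_page_offset_to_toc_json(toc, None))  # returned unchanged, no TypeError
print(add_page_offset_to_toc_json(toc, 2))     # Intro gets physical_index 5
```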

Bug 2 — Empty LLM context for last (and first) TOC item

Function: process_none_page_numbers
Triggered by: The last TOC entry lacking a physical_index

The function computes a page search window, range(prev_physical_index, next_physical_index + 1):

  • next_physical_index defaulted to -1 ("no next item"), so range(prev, 0) is always empty for any prev > 0.
  • prev_physical_index defaulted to 0, producing list_index = 0 - start_index = -1, which the bounds check silently skips (page 1 excluded).

With an empty window, page_contents = [] → the LLM receives an empty string → it cannot locate the section → the last TOC entry is permanently unresolved.

Fix:

  • next_physical_index default → end_index = len(page_list) + start_index - 1 (last page of document)
  • prev_physical_index default → start_index (first valid page index, not 0)
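The effect of the old vs. new defaults can be sketched with a small helper (search_window is a hypothetical function illustrating the window computation described above; start_index=1 as in the PR text):

```python
def search_window(prev_physical_index, next_physical_index, page_count,
                  start_index=1, fixed=True):
    # end_index is the physical index of the last page of the document.
    end_index = page_count + start_index - 1
    if next_physical_index is None:  # no next TOC item with a physical_index
        next_physical_index = end_index if fixed else -1   # old default: -1
    if prev_physical_index is None:  # no previous item either
        prev_physical_index = start_index if fixed else 0  # old default: 0
    return list(range(prev_physical_index, next_physical_index + 1))

# Old defaults: last item gets an empty window, so the LLM sees no content.
print(search_window(5, None, 10, fixed=False))  # []
# New defaults: the window covers all remaining pages, 5..10.
print(search_window(5, None, 10, fixed=True))   # [5, 6, 7, 8, 9, 10]
```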

Tests added (17, all passing)

TestAddPageOffsetToTocJson (11 tests)

  • None offset returns data unchanged, no crash
  • Integer / zero / negative offsets work correctly
  • Items without page key are untouched
  • Pipeline simulation: empty pairs → None offset → no crash

TestProcessNonePageNumbers (6 tests)

  • Last item gets non-empty [prev..end_index] range
  • First item list_index default no longer negative
  • Single-page document handled correctly
  • Middle items still use actual neighbor indices
  • Regression documents old vs new default behaviour
  • No-op when all items already have physical_index
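Illustrative versions of two of the Bug 1 cases, written against a local copy of the fixed function (the class name mirrors the one above; the exact upstream tests may differ):

```python
import unittest

def add_page_offset_to_toc_json(data, offset):
    # Local copy of the fixed function so the example is self-contained.
    if offset is None:
        return data
    for item in data:
        if 'page' in item:
            item['physical_index'] = item['page'] + offset
    return data

class TestAddPageOffsetToTocJson(unittest.TestCase):
    def test_none_offset_returns_data_unchanged(self):
        data = [{'title': 'Ch 1', 'page': 3}]
        self.assertEqual(add_page_offset_to_toc_json(data, None), data)

    def test_negative_offset(self):
        out = add_page_offset_to_toc_json([{'title': 'Ch 1', 'page': 3}], -2)
        self.assertEqual(out[0]['physical_index'], 1)
```

Run with `python -m unittest` from the directory containing the test file.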


@claude bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

sicko7947 added a commit to sicko7947/PageIndex that referenced this pull request Apr 6, 2026
Combines fixes from PRs VectifyAI#217, VectifyAI#210, and additional guards for all
remaining crash sites in toc_transformer and related functions.

Fixes:
- TypeError: int + NoneType when calculate_page_offset returns None (VectifyAI#153)
- KeyError on dict access when extract_json returns {} (VectifyAI#163)
- AttributeError: NoneType has no startswith (VectifyAI#199)
- KeyError: 'table_of_contents' when LLM output is malformed
- AttributeError: 'dict' has no extend / 'str' has no get

Changes:
- add_page_offset_to_toc_json: guard None offset, return data unchanged
- process_none_page_numbers: fix prev/next defaults (start_index/end_index)
- toc_detector_single_page: .get() with 'no' default
- check_if_toc_extraction_is_complete: .get() with 'no' default
- check_if_toc_transformation_is_complete: .get() with 'no' default
- detect_page_index: .get() with 'no' default
- toc_transformer: isinstance + .get() for table_of_contents access
- toc_transformer: None/isinstance guard on new_complete.startswith
- single_toc_item_index_fixer: .get() for physical_index
- meta_processor: isinstance(item, dict) filter
- process_no_toc: isinstance guard for generate_toc_init/continue results
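The recurring ".get() with 'no' default" pattern from that commit can be sketched as follows (the function name detect_page_index comes from the commit message, but its body here is an assumption):

```python
def detect_page_index(llm_response: dict) -> str:
    # Defensive access: extract_json() may return {} on malformed LLM output,
    # so a bare llm_response['page_index'] would raise KeyError.
    # Defaulting to 'no' treats malformed output as a negative answer.
    return llm_response.get('page_index', 'no')

print(detect_page_index({'page_index': 'yes'}))  # 'yes'
print(detect_page_index({}))                     # 'no' instead of KeyError
```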