fix: comprehensive crash guards for malformed LLM output#218
sicko7947 wants to merge 1 commit into `VectifyAI:main`
Conversation
Combines fixes from PRs VectifyAI#217, VectifyAI#210, and additional guards for all remaining crash sites in `toc_transformer` and related functions.

Fixes:
- `TypeError: int + NoneType` when `calculate_page_offset` returns `None` (VectifyAI#153)
- `KeyError` on dict access when `extract_json` returns `{}` (VectifyAI#163)
- `AttributeError: NoneType has no startswith` (VectifyAI#199)
- `KeyError: 'table_of_contents'` when LLM output is malformed
- `AttributeError: 'dict' has no extend` / `'str' has no get`

Changes:
- `add_page_offset_to_toc_json`: guard `None` offset, return data unchanged
- `process_none_page_numbers`: fix prev/next defaults (`start_index`/`end_index`)
- `toc_detector_single_page`: `.get()` with `'no'` default
- `check_if_toc_extraction_is_complete`: `.get()` with `'no'` default
- `check_if_toc_transformation_is_complete`: `.get()` with `'no'` default
- `detect_page_index`: `.get()` with `'no'` default
- `toc_transformer`: `isinstance` + `.get()` for `table_of_contents` access
- `toc_transformer`: `None`/`isinstance` guard on `new_complete.startswith`
- `single_toc_item_index_fixer`: `.get()` for `physical_index`
- `meta_processor`: `isinstance(item, dict)` filter
- `process_no_toc`: `isinstance` guard for `generate_toc_init`/`generate_toc_continue` results
Pull request overview
This PR hardens the PageIndex TOC processing pipeline against malformed/unexpected LLM outputs to prevent document-level crashes during indexing.
Changes:
- Added safe-default accessors (`.get(...)`) for several LLM-returned JSON fields to avoid `KeyError`s.
- Introduced additional type/`None` guards around TOC transformation and TOC generation flows to prevent `AttributeError`/`TypeError`s on unexpected return types.
- Fixed default page-range bounds in `process_none_page_numbers` so first/last TOC items receive a valid, non-empty search window.
Comments suppressed due to low confidence (1)
pageindex/page_index.py:429
`add_page_offset_to_toc_json()` assumes every `data[i]` is a dict (`data[i].get(...)`). If `toc_transformer()` returns a list containing non-dicts (possible with malformed-but-parseable LLM output), this will still crash with `AttributeError`. Consider skipping non-dict items (or normalizing upstream) before accessing `.get`.
```python
if offset is None:
    return data
for i in range(len(data)):
    if data[i].get('page') is not None and isinstance(data[i]['page'], int):
        data[i]['physical_index'] = data[i]['page'] + offset
        del data[i]['page']
```
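Following the reviewer's suggestion, the whole function could skip non-dict items before touching `.get`. A minimal sketch (function name and signature taken from the PR; the skip logic is illustrative, not the merged code):

```python
def add_page_offset_to_toc_json(data, offset):
    """Apply a page offset to each TOC entry, tolerating malformed input."""
    if offset is None or not isinstance(data, list):
        return data  # nothing safe to do; return unchanged
    for item in data:
        if not isinstance(item, dict):
            continue  # malformed LLM output: leave non-dict entries alone
        page = item.get('page')
        if isinstance(page, int):
            item['physical_index'] = page + offset
            del item['page']
    return data
```

This keeps the PR's `None`-offset guard while also making non-dict list entries a no-op instead of a crash.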
```diff
  response = llm_completion(model=model, prompt=prompt)
  # print('response', response)
  json_content = extract_json(response)
- return json_content['toc_detected']
+ return json_content.get('toc_detected', 'no')
```
extract_json() can return a non-dict (e.g., a JSON list when the LLM outputs [...]). In that case, calling .get(...) will raise AttributeError and reintroduce a crash path. Consider guarding with isinstance(json_content, dict) (and defaulting to 'no') before accessing toc_detected.
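Since the same `isinstance` guard is suggested for all four accessors (`toc_detected`, `completed` ×2, `page_index_given_in_toc`), one option is a tiny shared helper. The name `safe_llm_field` is hypothetical and not part of the PR; this is a sketch of the suggested pattern:

```python
def safe_llm_field(json_content, key, default='no'):
    """Return json_content[key] only when the parsed JSON is a dict.

    extract_json() can legitimately return a list (the model emitted
    [...]) or None; in those cases fall back to the safe default.
    """
    if isinstance(json_content, dict):
        return json_content.get(key, default)
    return default
```

Each call site would then become `return safe_llm_field(json_content, 'toc_detected')`, keeping the guard in one place.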
```diff
  prompt = prompt + '\n Document:\n' + content + '\n Table of contents:\n' + toc
  response = llm_completion(model=model, prompt=prompt)
  json_content = extract_json(response)
- return json_content['completed']
+ return json_content.get('completed', 'no')
```
extract_json() may return a list if the model outputs a JSON array; in that case json_content.get('completed', ...) will raise. Add an isinstance(json_content, dict) guard (and default to 'no') so malformed-but-parseable outputs don’t crash this completeness check.
```diff
  prompt = prompt + '\n Raw Table of contents:\n' + content + '\n Cleaned Table of contents:\n' + toc
  response = llm_completion(model=model, prompt=prompt)
  json_content = extract_json(response)
- return json_content['completed']
+ return json_content.get('completed', 'no')
```
Same issue as above: extract_json() can return a list (valid JSON) and json_content.get('completed', ...) would crash. Guard with isinstance(json_content, dict) (or handle list outputs explicitly) before accessing completed.
```diff
  response = llm_completion(model=model, prompt=prompt)
  json_content = extract_json(response)
- return json_content['page_index_given_in_toc']
+ return json_content.get('page_index_given_in_toc', 'no')
```
extract_json() can successfully parse a JSON list if the model responds with [...]. In that case, json_content.get('page_index_given_in_toc', ...) will raise AttributeError. Add an isinstance(json_content, dict) guard (default 'no') to avoid crashing on parseable-but-unexpected outputs.
Suggested change:

```diff
- return json_content.get('page_index_given_in_toc', 'no')
+ if isinstance(json_content, dict):
+     return json_content.get('page_index_given_in_toc', 'no')
+ return 'no'
```
```diff
  if_complete = check_if_toc_transformation_is_complete(toc_content, last_complete, model)
  if if_complete == "yes" and finish_reason == "finished":
      last_complete = extract_json(last_complete)
-     cleaned_response=convert_page_to_int(last_complete['table_of_contents'])
+     cleaned_response=convert_page_to_int(last_complete.get('table_of_contents', []) if isinstance(last_complete, dict) else [])
      return cleaned_response
```
convert_page_to_int() assumes each element is a dict and will misbehave or crash if table_of_contents contains non-dicts (e.g., list of strings from malformed LLM output). Since toc_transformer() returns this value and downstream code assumes dict items, consider validating that table_of_contents is a list of dicts (filtering or coercing) before calling convert_page_to_int() / returning.
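One way to do the validation the reviewer asks for is to normalize the value to a list of dicts before handing it to `convert_page_to_int()`. The helper name `coerce_toc_list` is hypothetical (not in the PR); this is a sketch of the filtering approach:

```python
def coerce_toc_list(value):
    """Normalize a parsed 'table_of_contents' value to a list of dicts.

    Drops non-dict entries (e.g. stray strings from malformed LLM
    output) and returns [] for anything that is not a list at all.
    """
    if not isinstance(value, list):
        return []
    return [item for item in value if isinstance(item, dict)]
```

The call site would then be `convert_page_to_int(coerce_toc_list(last_complete.get('table_of_contents')))`, so downstream code only ever sees dict items.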
```python
def process_none_page_numbers(toc_items, page_list, start_index=1, model=None):
    end_index = len(page_list) + start_index - 1
    for i, item in enumerate(toc_items):
        if "physical_index" not in item:
            # logger.info(f"fix item: {item}")
```
process_none_page_numbers() still assumes each TOC item is a dict and that a 'page' key exists (later del item_copy['page'] / del item['page']). With malformed LLM output (missing keys or non-dict items), this can raise (e.g., TypeError/KeyError) before the later meta_processor dict-filter runs. Consider adding isinstance(item, dict) checks and using pop('page', None) instead of unconditional del.
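The two suggestions here (an `isinstance(item, dict)` check and `pop('page', None)` instead of unconditional `del`) can be combined into a small pre-pass. The helper name `strip_page_key` is hypothetical, not code from the PR:

```python
def strip_page_key(toc_items):
    """Drop the 'page' key from dict TOC items; skip non-dict entries."""
    cleaned = []
    for item in toc_items:
        if not isinstance(item, dict):
            continue  # skip non-dict entries from malformed LLM output
        item_copy = dict(item)       # work on a copy; caller's list untouched
        item_copy.pop('page', None)  # pop with a default never raises KeyError
        cleaned.append(item_copy)
    return cleaned
```

Unlike `del item['page']`, `pop('page', None)` is a no-op when the key is missing, so an item that never had a `'page'` field cannot crash the loop.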
```diff
  prompt = toc_extractor_prompt + '\nSection Title:\n' + str(section_title) + '\nDocument pages:\n' + content
  response = await llm_acompletion(model=model, prompt=prompt)
  json_content = extract_json(response)
- return convert_physical_index_to_int(json_content['physical_index'])
+ return convert_physical_index_to_int(json_content.get('physical_index'))
```
extract_json() can return a non-dict (e.g., a JSON list), in which case json_content.get('physical_index') will raise. Add an isinstance(json_content, dict) guard (and default None) before accessing physical_index to keep this fixer resilient to malformed LLM output.
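Under the reviewer's suggestion, the access would default to `None` when the parsed JSON is not a dict (this fixer's contract differs from the yes/no accessors, which default to `'no'`). A sketch, assuming `convert_physical_index_to_int` tolerates `None`:

```python
def extract_physical_index(json_content):
    """Return 'physical_index' only when the parsed JSON is a dict."""
    if isinstance(json_content, dict):
        return json_content.get('physical_index')
    return None  # lists / None from extract_json() fall through safely
```

The return line would then read `return convert_physical_index_to_int(extract_physical_index(json_content))`.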
Summary
Comprehensive fix for multiple crash sites triggered when LLMs return malformed or unexpected output during ToC processing. Combines and extends fixes from PRs #217 and #210, plus guards for additional unpatched crash sites.
Problems Fixed
| Error | Cause | Fix |
|---|---|---|
| `TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'` (#153) | `calculate_page_offset()` returns `None` | guard in `add_page_offset_to_toc_json` |
| `KeyError: 'toc_detected'` / `'completed'` / `'table_of_contents'` (#163) | `extract_json()` returns `{}` on malformed LLM output | `.get()` with safe defaults |
| `AttributeError: 'NoneType' object has no attribute 'startswith'` (#199) | `llm_completion()` returns `None` | `isinstance` + `None` guard |
| `AttributeError: 'str' object has no attribute 'get'` | | `isinstance(item, dict)` filter |
| `AttributeError: 'dict' object has no attribute 'extend'` | `generate_toc_init()` returns non-list | `isinstance(..., list)` guard |
| `next_physical_index = -1` makes `range(prev, 0)` empty | | default to `end_index` |
| `prev_physical_index = 0` for first item | | default to `start_index` |

Changes
All changes are in `pageindex/page_index.py`:

- `add_page_offset_to_toc_json`: Return data unchanged when offset is `None`
- `process_none_page_numbers`: Fix prev/next defaults (`start_index`/`end_index`)
- `toc_detector_single_page`: `.get('toc_detected', 'no')`
- `check_if_toc_extraction_is_complete`: `.get('completed', 'no')`
- `check_if_toc_transformation_is_complete`: `.get('completed', 'no')`
- `detect_page_index`: `.get('page_index_given_in_toc', 'no')`
- `toc_transformer`: `isinstance` + `.get()` for `table_of_contents` access (×2 sites)
- `toc_transformer`: `None` + `isinstance` guard on `new_complete.startswith()`
- `single_toc_item_index_fixer`: `.get('physical_index')`
- `meta_processor`: `isinstance(item, dict)` filter
- `process_no_toc`: `isinstance` guards for `generate_toc_init`/`generate_toc_continue` results

Context
We run PageIndex at scale (~28K tender documents) on Cloud Run with gpt-oss-20b as the primary model. These crashes affect ~5-15% of documents, depending on structural complexity. Because of the unguarded dict/attribute access, a single malformed LLM response crashes the entire indexing pipeline for that document, even though the data could be gracefully recovered.
Relates to: #153, #163, #199, #203