
fix: comprehensive crash guards for malformed LLM output#218

Open
sicko7947 wants to merge 1 commit into VectifyAI:main from sicko7947:fix/comprehensive-crash-guards

Conversation

@sicko7947

Summary

Comprehensive fix for multiple crash sites triggered when LLMs return malformed or unexpected output during ToC processing. Combines and extends fixes from PRs #217 and #210, plus guards for additional unpatched crash sites.

Problems Fixed

| Error | Root Cause | Fix |
| --- | --- | --- |
| `TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'` (#153) | `calculate_page_offset()` returns `None` | Guard in `add_page_offset_to_toc_json` |
| `KeyError: 'toc_detected'` / `'completed'` / `'table_of_contents'` (#163) | `extract_json()` returns `{}` on malformed LLM output | `.get()` with safe defaults |
| `AttributeError: 'NoneType' object has no attribute 'startswith'` (#199) | `llm_completion()` returns `None` | `isinstance` + `None` guard |
| `AttributeError: 'str' object has no attribute 'get'` | LLM returns a string instead of a dict | `isinstance(item, dict)` filter |
| `AttributeError: 'dict' object has no attribute 'extend'` | `generate_toc_init()` returns a non-list | `isinstance(..., list)` guard |
| Empty page ranges (last ToC item gets no content) | `next_physical_index = -1`, so `range(prev, 0)` is empty | Default to `end_index` |
| Off-by-one (page 1 skipped) | `prev_physical_index = 0` for the first item | Default to `start_index` |
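To make the last two rows concrete: with `range()`, a sentinel of `-1` for the next index or `0` for the previous index silently yields an empty or shifted search window. The sketch below is illustrative only; `pages_to_search` and its parameters are hypothetical names, not functions from `pageindex/page_index.py`.

```python
def pages_to_search(prev_physical_index, next_physical_index, start_index, end_index):
    """Hypothetical illustration of the range-default fix described above."""
    # Old behavior: prev_physical_index defaulted to 0 (skipping page 1) and
    # next_physical_index to -1, making range(prev, 0) empty for the last item.
    if prev_physical_index is None or prev_physical_index <= 0:
        prev_physical_index = start_index  # default to the document's first page
    if next_physical_index is None or next_physical_index < 0:
        next_physical_index = end_index    # default to the document's last page
    return list(range(prev_physical_index, next_physical_index + 1))
```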

Changes

All changes are in pageindex/page_index.py:

  • add_page_offset_to_toc_json: Return data unchanged when offset is None
  • process_none_page_numbers: Fix prev/next defaults (start_index/end_index)
  • toc_detector_single_page: .get('toc_detected', 'no')
  • check_if_toc_extraction_is_complete: .get('completed', 'no')
  • check_if_toc_transformation_is_complete: .get('completed', 'no')
  • detect_page_index: .get('page_index_given_in_toc', 'no')
  • toc_transformer: isinstance + .get() for table_of_contents access (×2 sites)
  • toc_transformer: None + isinstance guard on new_complete.startswith()
  • single_toc_item_index_fixer: .get('physical_index')
  • meta_processor: isinstance(item, dict) filter
  • process_no_toc: isinstance guards for generate_toc_init/generate_toc_continue results
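The pattern behind most of these changes is the same: treat parsed LLM output as untrusted and degrade to a safe default instead of raising. A minimal sketch, assuming an `extract_json` that returns `{}` on unparseable input (the stand-in below is hypothetical; the real helper lives in `pageindex/page_index.py`):

```python
import json

def extract_json(response):
    # Hypothetical stand-in for the project's extract_json helper:
    # returns {} when the response is not parseable JSON.
    try:
        return json.loads(response)
    except (TypeError, ValueError):
        return {}

def toc_detected(response):
    # Guarded access: a missing key, a None response, or a JSON array
    # all degrade to 'no' instead of raising KeyError/AttributeError.
    json_content = extract_json(response)
    if not isinstance(json_content, dict):
        return 'no'
    return json_content.get('toc_detected', 'no')
```

The same shape applies to the `completed`, `page_index_given_in_toc`, and `physical_index` accesses listed above.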

Context

We run PageIndex at scale (~28K tender documents) on Cloud Run with gpt-oss-20b as the primary model. These crashes affect ~5-15% of documents depending on structure complexity. The unguarded dict/attribute access means one malformed LLM response crashes the entire indexing pipeline for that document, even though the data could be gracefully recovered.

Relates to: #153, #163, #199, #203

Copilot AI review requested due to automatic review settings April 6, 2026 09:06

@claude bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.


Copilot AI left a comment


Pull request overview

This PR hardens the PageIndex TOC processing pipeline against malformed/unexpected LLM outputs to prevent document-level crashes during indexing.

Changes:

  • Added safe-default accessors (.get(...)) for several LLM-returned JSON fields to avoid KeyErrors.
  • Introduced additional type/None guards around TOC transformation and TOC generation flows to prevent AttributeError/TypeErrors on unexpected return types.
  • Fixed default page-range bounds in process_none_page_numbers so first/last TOC items receive a valid non-empty search window.
Comments suppressed due to low confidence (1)

pageindex/page_index.py:429

  • add_page_offset_to_toc_json() assumes every data[i] is a dict (data[i].get(...)). If toc_transformer() returns a list containing non-dicts (possible with malformed-but-parseable LLM output), this will still crash with AttributeError. Consider skipping non-dict items (or normalizing upstream) before accessing .get.

```python
if offset is None:
    return data
for i in range(len(data)):
    if data[i].get('page') is not None and isinstance(data[i]['page'], int):
        data[i]['physical_index'] = data[i]['page'] + offset
        del data[i]['page']
```
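A minimal sketch of the skip-non-dicts hardening this comment suggests (illustrative only, not the PR's code; the function name mirrors the one under review):

```python
def add_page_offset_to_toc_json(data, offset):
    # Sketch of the suggested hardening: skip non-dict items so a
    # malformed-but-parseable LLM list cannot crash the offset pass.
    if offset is None:
        return data
    for item in data:
        if not isinstance(item, dict):
            continue
        page = item.get('page')
        if isinstance(page, int):
            item['physical_index'] = page + offset
            del item['page']
    return data
```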


Comment on lines 119 to +122

```diff
 response = llm_completion(model=model, prompt=prompt)
 # print('response', response)
 json_content = extract_json(response)
-return json_content['toc_detected']
+return json_content.get('toc_detected', 'no')
```

Copilot AI Apr 6, 2026


extract_json() can return a non-dict (e.g., a JSON list when the LLM outputs [...]). In that case, calling .get(...) will raise AttributeError and reintroduce a crash path. Consider guarding with isinstance(json_content, dict) (and defaulting to 'no') before accessing toc_detected.

Comment on lines 137 to +140

```diff
 prompt = prompt + '\n Document:\n' + content + '\n Table of contents:\n' + toc
 response = llm_completion(model=model, prompt=prompt)
 json_content = extract_json(response)
-return json_content['completed']
+return json_content.get('completed', 'no')
```

Copilot AI Apr 6, 2026


extract_json() may return a list if the model outputs a JSON array; in that case json_content.get('completed', ...) will raise. Add an isinstance(json_content, dict) guard (and default to 'no') so malformed-but-parseable outputs don’t crash this completeness check.

Comment on lines 155 to +158

```diff
 prompt = prompt + '\n Raw Table of contents:\n' + content + '\n Cleaned Table of contents:\n' + toc
 response = llm_completion(model=model, prompt=prompt)
 json_content = extract_json(response)
-return json_content['completed']
+return json_content.get('completed', 'no')
```

Copilot AI Apr 6, 2026


Same issue as above: extract_json() can return a list (valid JSON) and json_content.get('completed', ...) would crash. Guard with isinstance(json_content, dict) (or handle list outputs explicitly) before accessing completed.

```diff
 response = llm_completion(model=model, prompt=prompt)
 json_content = extract_json(response)
-return json_content['page_index_given_in_toc']
+return json_content.get('page_index_given_in_toc', 'no')
```

Copilot AI Apr 6, 2026


extract_json() can successfully parse a JSON list if the model responds with [...]. In that case, json_content.get('page_index_given_in_toc', ...) will raise AttributeError. Add an isinstance(json_content, dict) guard (default 'no') to avoid crashing on parseable-but-unexpected outputs.

Suggested change

```diff
-return json_content.get('page_index_given_in_toc', 'no')
+if isinstance(json_content, dict):
+    return json_content.get('page_index_given_in_toc', 'no')
+return 'no'
```

Comment on lines 296 to 300

```diff
 if_complete = check_if_toc_transformation_is_complete(toc_content, last_complete, model)
 if if_complete == "yes" and finish_reason == "finished":
     last_complete = extract_json(last_complete)
-    cleaned_response=convert_page_to_int(last_complete['table_of_contents'])
+    cleaned_response=convert_page_to_int(last_complete.get('table_of_contents', []) if isinstance(last_complete, dict) else [])
     return cleaned_response
```

Copilot AI Apr 6, 2026


convert_page_to_int() assumes each element is a dict and will misbehave or crash if table_of_contents contains non-dicts (e.g., list of strings from malformed LLM output). Since toc_transformer() returns this value and downstream code assumes dict items, consider validating that table_of_contents is a list of dicts (filtering or coercing) before calling convert_page_to_int() / returning.
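One way to apply this suggestion is a small coercion helper that keeps only dict items before the list reaches `convert_page_to_int()`; the helper below is hypothetical, not part of the PR:

```python
def coerce_toc_list(value):
    # Illustrative helper (not in the PR): keep only dict items so that
    # downstream code which assumes dicts never sees strings or None.
    if not isinstance(value, list):
        return []
    return [item for item in value if isinstance(item, dict)]
```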

Comment on lines 668 to 672

```python
def process_none_page_numbers(toc_items, page_list, start_index=1, model=None):
    end_index = len(page_list) + start_index - 1
    for i, item in enumerate(toc_items):
        if "physical_index" not in item:
            # logger.info(f"fix item: {item}")
```

Copilot AI Apr 6, 2026


process_none_page_numbers() still assumes each TOC item is a dict and that a 'page' key exists (later del item_copy['page'] / del item['page']). With malformed LLM output (missing keys or non-dict items), this can raise (e.g., TypeError/KeyError) before the later meta_processor dict-filter runs. Consider adding isinstance(item, dict) checks and using pop('page', None) instead of unconditional del.
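A sketch of the `pop('page', None)` pattern the comment suggests (the helper name is hypothetical; the real code operates inline on ToC items):

```python
def strip_page_key(item):
    # Sketch of the suggested change: pop('page', None) tolerates a missing
    # key where an unconditional `del item['page']` would raise KeyError,
    # and the isinstance check tolerates non-dict items entirely.
    if isinstance(item, dict):
        item.pop('page', None)
    return item
```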

Comment on lines 769 to +772

```diff
 prompt = toc_extractor_prompt + '\nSection Title:\n' + str(section_title) + '\nDocument pages:\n' + content
 response = await llm_acompletion(model=model, prompt=prompt)
 json_content = extract_json(response)
-return convert_physical_index_to_int(json_content['physical_index'])
+return convert_physical_index_to_int(json_content.get('physical_index'))
```

Copilot AI Apr 6, 2026


extract_json() can return a non-dict (e.g., a JSON list), in which case json_content.get('physical_index') will raise. Add an isinstance(json_content, dict) guard (and default None) before accessing physical_index to keep this fixer resilient to malformed LLM output.
