fix: comprehensive crash guards for malformed LLM output#218
sicko7947 wants to merge 1 commit into `VectifyAI:main`
Conversation
Combines fixes from PRs VectifyAI#217, VectifyAI#210, and additional guards for all remaining crash sites in `toc_transformer` and related functions.

Fixes:
- `TypeError: int + NoneType` when `calculate_page_offset` returns `None` (VectifyAI#153)
- `KeyError` on dict access when `extract_json` returns `{}` (VectifyAI#163)
- `AttributeError: NoneType has no startswith` (VectifyAI#199)
- `KeyError: 'table_of_contents'` when LLM output is malformed
- `AttributeError: 'dict' has no extend` / `'str' has no get`

Changes:
- `add_page_offset_to_toc_json`: guard `None` offset, return data unchanged
- `process_none_page_numbers`: fix prev/next defaults (`start_index`/`end_index`)
- `toc_detector_single_page`: `.get()` with `'no'` default
- `check_if_toc_extraction_is_complete`: `.get()` with `'no'` default
- `check_if_toc_transformation_is_complete`: `.get()` with `'no'` default
- `detect_page_index`: `.get()` with `'no'` default
- `toc_transformer`: `isinstance` + `.get()` for `table_of_contents` access
- `toc_transformer`: `None`/`isinstance` guard on `new_complete.startswith`
- `single_toc_item_index_fixer`: `.get()` for `physical_index`
- `meta_processor`: `isinstance(item, dict)` filter
- `process_no_toc`: `isinstance` guard for `generate_toc_init`/`generate_toc_continue` results
Pull request overview
This PR hardens the PageIndex TOC processing pipeline against malformed/unexpected LLM outputs to prevent document-level crashes during indexing.
Changes:
- Added safe-default accessors (`.get(...)`) for several LLM-returned JSON fields to avoid `KeyError`s.
- Introduced additional type/`None` guards around TOC transformation and TOC generation flows to prevent `AttributeError`/`TypeError`s on unexpected return types.
- Fixed default page-range bounds in `process_none_page_numbers` so first/last TOC items receive a valid, non-empty search window.
Comments suppressed due to low confidence (1)
pageindex/page_index.py:429
`add_page_offset_to_toc_json()` assumes every `data[i]` is a dict (`data[i].get(...)`). If `toc_transformer()` returns a list containing non-dicts (possible with malformed-but-parseable LLM output), this will still crash with `AttributeError`. Consider skipping non-dict items (or normalizing upstream) before accessing `.get`.
```python
if offset is None:
    return data
for i in range(len(data)):
    if data[i].get('page') is not None and isinstance(data[i]['page'], int):
        data[i]['physical_index'] = data[i]['page'] + offset
        del data[i]['page']
```
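Following the reviewer's suggestion, the whole function could skip non-dict items before touching `.get`. A minimal sketch (function name and signature taken from the PR; the skip logic is illustrative, not the merged code):

```python
def add_page_offset_to_toc_json(data, offset):
    """Apply a page offset to each TOC entry, tolerating malformed input."""
    if offset is None or not isinstance(data, list):
        return data  # nothing safe to do; return unchanged
    for item in data:
        if not isinstance(item, dict):
            continue  # malformed LLM output: leave non-dict entries alone
        page = item.get('page')
        if isinstance(page, int):
            item['physical_index'] = page + offset
            del item['page']
    return data
```

This keeps the PR's `None`-offset guard while also making non-dict list entries a no-op instead of a crash.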
```diff
  response = llm_completion(model=model, prompt=prompt)
  # print('response', response)
  json_content = extract_json(response)
- return json_content['toc_detected']
+ return json_content.get('toc_detected', 'no')
```
extract_json() can return a non-dict (e.g., a JSON list when the LLM outputs [...]). In that case, calling .get(...) will raise AttributeError and reintroduce a crash path. Consider guarding with isinstance(json_content, dict) (and defaulting to 'no') before accessing toc_detected.
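Since the same `isinstance` guard is suggested for all four accessors (`toc_detected`, `completed` ×2, `page_index_given_in_toc`), one option is a tiny shared helper. The name `safe_llm_field` is hypothetical and not part of the PR; this is a sketch of the suggested pattern:

```python
def safe_llm_field(json_content, key, default='no'):
    """Return json_content[key] only when the parsed JSON is a dict.

    extract_json() can legitimately return a list (the model emitted
    [...]) or None; in those cases fall back to the safe default.
    """
    if isinstance(json_content, dict):
        return json_content.get(key, default)
    return default
```

Each call site would then become `return safe_llm_field(json_content, 'toc_detected')`, keeping the guard in one place.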
```diff
  prompt = prompt + '\n Document:\n' + content + '\n Table of contents:\n' + toc
  response = llm_completion(model=model, prompt=prompt)
  json_content = extract_json(response)
- return json_content['completed']
+ return json_content.get('completed', 'no')
```
extract_json() may return a list if the model outputs a JSON array; in that case json_content.get('completed', ...) will raise. Add an isinstance(json_content, dict) guard (and default to 'no') so malformed-but-parseable outputs don’t crash this completeness check.
```diff
  prompt = prompt + '\n Raw Table of contents:\n' + content + '\n Cleaned Table of contents:\n' + toc
  response = llm_completion(model=model, prompt=prompt)
  json_content = extract_json(response)
- return json_content['completed']
+ return json_content.get('completed', 'no')
```
Same issue as above: extract_json() can return a list (valid JSON) and json_content.get('completed', ...) would crash. Guard with isinstance(json_content, dict) (or handle list outputs explicitly) before accessing completed.
```diff
  response = llm_completion(model=model, prompt=prompt)
  json_content = extract_json(response)
- return json_content['page_index_given_in_toc']
+ return json_content.get('page_index_given_in_toc', 'no')
```
extract_json() can successfully parse a JSON list if the model responds with [...]. In that case, json_content.get('page_index_given_in_toc', ...) will raise AttributeError. Add an isinstance(json_content, dict) guard (default 'no') to avoid crashing on parseable-but-unexpected outputs.
Suggested change:

```diff
- return json_content.get('page_index_given_in_toc', 'no')
+ if isinstance(json_content, dict):
+     return json_content.get('page_index_given_in_toc', 'no')
+ return 'no'
```
```diff
  if_complete = check_if_toc_transformation_is_complete(toc_content, last_complete, model)
  if if_complete == "yes" and finish_reason == "finished":
      last_complete = extract_json(last_complete)
-     cleaned_response=convert_page_to_int(last_complete['table_of_contents'])
+     cleaned_response=convert_page_to_int(last_complete.get('table_of_contents', []) if isinstance(last_complete, dict) else [])
      return cleaned_response
```
convert_page_to_int() assumes each element is a dict and will misbehave or crash if table_of_contents contains non-dicts (e.g., list of strings from malformed LLM output). Since toc_transformer() returns this value and downstream code assumes dict items, consider validating that table_of_contents is a list of dicts (filtering or coercing) before calling convert_page_to_int() / returning.
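One way to do the validation the reviewer asks for is to normalize the value to a list of dicts before handing it to `convert_page_to_int()`. The helper name `coerce_toc_list` is hypothetical (not in the PR); this is a sketch of the filtering approach:

```python
def coerce_toc_list(value):
    """Normalize a parsed 'table_of_contents' value to a list of dicts.

    Drops non-dict entries (e.g. stray strings from malformed LLM
    output) and returns [] for anything that is not a list at all.
    """
    if not isinstance(value, list):
        return []
    return [item for item in value if isinstance(item, dict)]
```

The call site would then be `convert_page_to_int(coerce_toc_list(last_complete.get('table_of_contents')))`, so downstream code only ever sees dict items.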
```python
def process_none_page_numbers(toc_items, page_list, start_index=1, model=None):
    end_index = len(page_list) + start_index - 1
    for i, item in enumerate(toc_items):
        if "physical_index" not in item:
            # logger.info(f"fix item: {item}")
```
process_none_page_numbers() still assumes each TOC item is a dict and that a 'page' key exists (later del item_copy['page'] / del item['page']). With malformed LLM output (missing keys or non-dict items), this can raise (e.g., TypeError/KeyError) before the later meta_processor dict-filter runs. Consider adding isinstance(item, dict) checks and using pop('page', None) instead of unconditional del.
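The two suggestions here (an `isinstance(item, dict)` check and `pop('page', None)` instead of unconditional `del`) can be combined into a small pre-pass. The helper name `strip_page_key` is hypothetical, not code from the PR:

```python
def strip_page_key(toc_items):
    """Drop the 'page' key from dict TOC items; skip non-dict entries."""
    cleaned = []
    for item in toc_items:
        if not isinstance(item, dict):
            continue  # skip non-dict entries from malformed LLM output
        item_copy = dict(item)       # work on a copy; caller's list untouched
        item_copy.pop('page', None)  # pop with a default never raises KeyError
        cleaned.append(item_copy)
    return cleaned
```

Unlike `del item['page']`, `pop('page', None)` is a no-op when the key is missing, so an item that never had a `'page'` field cannot crash the loop.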
```diff
  prompt = toc_extractor_prompt + '\nSection Title:\n' + str(section_title) + '\nDocument pages:\n' + content
  response = await llm_acompletion(model=model, prompt=prompt)
  json_content = extract_json(response)
- return convert_physical_index_to_int(json_content['physical_index'])
+ return convert_physical_index_to_int(json_content.get('physical_index'))
```
extract_json() can return a non-dict (e.g., a JSON list), in which case json_content.get('physical_index') will raise. Add an isinstance(json_content, dict) guard (and default None) before accessing physical_index to keep this fixer resilient to malformed LLM output.
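Under the reviewer's suggestion, the access would default to `None` when the parsed JSON is not a dict (this fixer's contract differs from the yes/no accessors, which default to `'no'`). A sketch, assuming `convert_physical_index_to_int` tolerates `None`:

```python
def extract_physical_index(json_content):
    """Return 'physical_index' only when the parsed JSON is a dict."""
    if isinstance(json_content, dict):
        return json_content.get('physical_index')
    return None  # lists / None from extract_json() fall through safely
```

The return line would then read `return convert_physical_index_to_int(extract_physical_index(json_content))`.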
Summary
Comprehensive fix for multiple crash sites triggered when LLMs return malformed or unexpected output during ToC processing. Combines and extends fixes from PRs #217 and #210, plus guards for additional unpatched crash sites.
Problems Fixed
| Error | Cause | Fix |
|---|---|---|
| `TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'` (#153) | `calculate_page_offset()` returns `None` | guard in `add_page_offset_to_toc_json` |
| `KeyError: 'toc_detected'` / `'completed'` / `'table_of_contents'` (#163) | `extract_json()` returns `{}` on malformed LLM output | `.get()` with safe defaults |
| `AttributeError: 'NoneType' object has no attribute 'startswith'` (#199) | `llm_completion()` returns `None` | `isinstance` + `None` guard |
| `AttributeError: 'str' object has no attribute 'get'` | | `isinstance(item, dict)` filter |
| `AttributeError: 'dict' object has no attribute 'extend'` | `generate_toc_init()` returns non-list | `isinstance(..., list)` guard |
| `next_physical_index = -1` makes `range(prev, 0)` empty | | default to `end_index` |
| `prev_physical_index = 0` for first item | | default to `start_index` |

Changes
All changes are in `pageindex/page_index.py`:

- `add_page_offset_to_toc_json`: Return data unchanged when offset is `None`
- `process_none_page_numbers`: Fix prev/next defaults (`start_index`/`end_index`)
- `toc_detector_single_page`: `.get('toc_detected', 'no')`
- `check_if_toc_extraction_is_complete`: `.get('completed', 'no')`
- `check_if_toc_transformation_is_complete`: `.get('completed', 'no')`
- `detect_page_index`: `.get('page_index_given_in_toc', 'no')`
- `toc_transformer`: `isinstance` + `.get()` for `table_of_contents` access (×2 sites)
- `toc_transformer`: `None` + `isinstance` guard on `new_complete.startswith()`
- `single_toc_item_index_fixer`: `.get('physical_index')`
- `meta_processor`: `isinstance(item, dict)` filter
- `process_no_toc`: `isinstance` guards for `generate_toc_init`/`generate_toc_continue` results

Context
We run PageIndex at scale (~28K tender documents) on Cloud Run with gpt-oss-20b as the primary model. These crashes affect ~5-15% of documents, depending on structural complexity. Because of the unguarded dict/attribute access, a single malformed LLM response crashes the entire indexing pipeline for that document, even though the data could be gracefully recovered.
Relates to: #153, #163, #199, #203