Fix AttributeError crashes when LLM returns malformed JSON #200

Closed
martindecz wants to merge 1 commit into VectifyAI:main from martindecz:fix/malformed-json-crashes

@martindecz

Summary

Fixes two AttributeError crashes that occur when the LLM returns malformed JSON output, causing extract_json() to fail and downstream functions to receive wrong types.

Closes #199

Changes

  1. pageindex/utils.py - extract_json() now returns [] (empty list) instead of {} (empty dict) on parse failure. All callers expect a list of dicts, so the fallback value should match that contract.

  2. pageindex/page_index.py - process_no_toc() now validates that generate_toc_init() and generate_toc_continue() return lists before calling .extend(). If they return a non-list type (e.g. dict from failed JSON parsing), the code logs a warning and either resets to an empty list or skips the result.

  3. pageindex/page_index.py - meta_processor() list comprehension now filters out non-dict items with isinstance(item, dict) before calling .get(), preventing 'str' object has no attribute 'get' errors.
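The guards described in the three changes above can be sketched roughly as follows. This is an illustration, not the PR's actual diff: the helper names `as_list_or_empty`, `extend_if_list`, and `get_from_dicts` are hypothetical, and the standard-library `logging` module stands in for the project's own logger.

```python
import logging

# Assumption: stdlib logging is used here for illustration only;
# the project's own logger may expose a different interface.
logger = logging.getLogger(__name__)


def as_list_or_empty(value):
    """Return value unchanged if it is a list, otherwise reset to an empty list."""
    if isinstance(value, list):
        return value
    logger.warning("expected list, got %s; resetting to []", type(value).__name__)
    return []


def extend_if_list(target, result):
    """Extend target with result only when result is a list; otherwise skip it."""
    if isinstance(result, list):
        target.extend(result)
        return True
    logger.warning("expected list, got %s; skipping", type(result).__name__)
    return False


def get_from_dicts(items, key):
    """Collect item.get(key) from dict items only, ignoring strings and other types."""
    return [item.get(key) for item in items if isinstance(item, dict)]
```

With these helpers, a dict or string coming back from a failed parse is logged and dropped instead of raising AttributeError.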

Reproduction context

These crashes are reliably triggered when using a local vLLM endpoint with smaller models that frequently produce malformed JSON (extra data, empty responses, malformed structure). All 7 test documents (Office docs converted to PDF) failed with these errors.
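For readers without a vLLM setup, the two error shapes can be shown in isolation. This is a minimal sketch that assumes the first crash comes from calling `.extend()` on a dict left over from a failed parse; the function name `reproduce_errors` and the sample data are made up for illustration.

```python
def reproduce_errors():
    """Trigger both AttributeError shapes and return their messages."""
    messages = []

    # Case 1: a caller holds a dict (parse-failure fallback) where it expects a list.
    toc = {}
    try:
        toc.extend([{"title": "Intro"}])
    except AttributeError as exc:
        messages.append(str(exc))

    # Case 2: a list that should contain dicts contains a plain string.
    items = ["not a dict"]
    try:
        [item.get("title") for item in items]
    except AttributeError as exc:
        messages.append(str(exc))

    return messages
```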

Test plan

  • Process PDFs with a model that produces clean JSON output (no regression)
  • Process PDFs with a model that produces malformed JSON output (crashes prevented, graceful degradation)

Fix two crashes caused by malformed LLM JSON output:

1. extract_json() in utils.py now returns [] instead of {} on parse
   failure, matching the list-of-dicts type expected by all callers.

2. process_no_toc() in page_index.py now validates that
   generate_toc_init() and generate_toc_continue() return lists
   before calling .extend().

3. meta_processor() list comprehension now filters out non-dict
   items to prevent 'str' object has no attribute 'get' errors.

Fixes VectifyAI#199

@claude claude bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

martindecz added a commit to martindecz/PageIndex that referenced this pull request Mar 29, 2026
@martindecz
Author

Closing this PR — our type guards were based on incorrect assumptions about the logger interface (JsonLogger doesn't have .warning()) and the fallback return type change from {} to [] broke more call sites than it fixed (6 out of 13 callers expect dict). The root cause is better addressed in extract_json() itself with robust JSON repair, not in individual callers. Apologies for the noise.
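One way to make the parser itself more tolerant, sketched here under assumptions: this is not the json_repair-based fix referenced in this thread, and `parse_first_json` is a hypothetical name. It uses the standard library's `json.JSONDecoder.raw_decode` to pull the first valid JSON value out of a response that has leading prose or trailing extra data, and lets each call site choose its own fallback, since callers differ on whether they expect a dict or a list.

```python
import json


def parse_first_json(text, default=None):
    """Return the first JSON object or array found in text, else default.

    Scans for each '{' or '[' and tries to decode a complete JSON value
    starting there, so surrounding prose and trailing data are ignored.
    """
    decoder = json.JSONDecoder()
    for start, ch in enumerate(text):
        if ch in "[{":
            try:
                value, _end = decoder.raw_decode(text, start)
                return value
            except json.JSONDecodeError:
                continue  # not valid from this position; try the next bracket
    return default
```

A caller that expects a list can pass `default=[]`, and one that expects a dict can pass `default={}`, so the fallback type matches each call site instead of being fixed globally.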

@martindecz martindecz closed this Mar 29, 2026
martindecz pushed a commit to martindecz/PageIndex that referenced this pull request Mar 29, 2026
Reverts the type guards added in f8e5a92, 3034bd4, and 9a5da0c.
These guards were based on incorrect assumptions:
- JsonLogger has no .warning() method, so guards themselves crashed
- Changing extract_json fallback from {} to [] broke 6 of 13 callers
- Only covered 3 of 13 vulnerable call sites

The correct fix is in extract_json/json_repair (kept in 516e791).
See issue VectifyAI#199 for context.
@martindecz martindecz deleted the fix/malformed-json-crashes branch March 29, 2026 17:34
Development

Successfully merging this pull request may close these issues.

AttributeError crashes in meta_processor and process_no_toc when LLM returns malformed JSON
