Currently the LLM extraction prompt receives only the plain text from Tesseract (src.Text in llmextract.go:78). The raw TSV output (src.Data) is stored on TextSource but never included in the prompt.
Tesseract TSV preserves information that plain text loses:
- Spatial layout: per-word bounding boxes (left, top, width, height)
- Confidence scores: per-word OCR confidence
- Structural hierarchy: block, paragraph, and line groupings
This could help the LLM make better extraction decisions, especially for invoices and forms with tabular data where spatial relationships matter.
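To make the fields above concrete, here is a minimal sketch of reading word-level rows out of Tesseract TSV. It assumes the standard 12-column layout (level, page_num, block_num, par_num, line_num, word_num, left, top, width, height, conf, text), where level 5 marks a word. The Word type and parseTSV function are hypothetical names for illustration and are not part of the existing code.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// Word is a hypothetical holder for one word-level TSV row.
type Word struct {
	Block, Par, Line         int     // structural hierarchy
	Left, Top, Width, Height int     // bounding box in pixels
	Conf                     float64 // OCR confidence for this word
	Text                     string
}

// parseTSV keeps only word rows (level 5) from Tesseract TSV output.
// The header row and block/paragraph/line rows are skipped because
// their first column is not "5".
func parseTSV(tsv string) ([]Word, error) {
	var words []Word
	sc := bufio.NewScanner(strings.NewReader(tsv))
	for sc.Scan() {
		f := strings.SplitN(sc.Text(), "\t", 12)
		if len(f) < 12 || f[0] != "5" || strings.TrimSpace(f[11]) == "" {
			continue
		}
		// Columns 2-4 are block/par/line, columns 6-9 are the box.
		var n [7]int
		for i, col := range []int{2, 3, 4, 6, 7, 8, 9} {
			v, err := strconv.Atoi(f[col])
			if err != nil {
				return nil, fmt.Errorf("column %d: %w", col, err)
			}
			n[i] = v
		}
		conf, err := strconv.ParseFloat(f[10], 64)
		if err != nil {
			return nil, fmt.Errorf("conf column: %w", err)
		}
		words = append(words, Word{
			Block: n[0], Par: n[1], Line: n[2],
			Left: n[3], Top: n[4], Width: n[5], Height: n[6],
			Conf: conf, Text: f[11],
		})
	}
	return words, sc.Err()
}
```

Calling parseTSV(src.Data) would work only if src.Data is a string; its actual type is not stated here, so a conversion may be needed.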
Tradeoff: TSV is ~5-10x more verbose than plain text, which increases token consumption. This may strain the token budget of small local models (e.g. qwen3:0.6b). Consider making this configurable, or only sending TSV for specific document types.
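One possible shape for the configurable option is sketched below. Everything here is an assumption for discussion: the OCRInput type, its constants, the ocrPayload function, and the size cap are hypothetical, and TextSource is reduced to the two fields mentioned above with both assumed to be strings.

```go
package main

// OCRInput is a hypothetical setting for which OCR output goes into the prompt.
type OCRInput int

const (
	InputText OCRInput = iota // plain text only (current behavior)
	InputTSV                  // always send TSV when it is available
	InputAuto                 // send TSV only if it fits under a size cap
)

// TextSource is a stand-in with only the two fields discussed here;
// the real type and its field types may differ.
type TextSource struct {
	Text string
	Data string
}

// ocrPayload picks what to embed in the prompt. It falls back to plain
// text whenever TSV is missing or, in auto mode, larger than maxTSVBytes.
func ocrPayload(src TextSource, mode OCRInput, maxTSVBytes int) string {
	switch mode {
	case InputTSV:
		if src.Data != "" {
			return src.Data
		}
	case InputAuto:
		if src.Data != "" && len(src.Data) <= maxTSVBytes {
			return src.Data
		}
	}
	return src.Text
}
```

A byte cap is only a rough proxy for token count. A per-document-type switch could be layered on top by choosing the mode before calling ocrPayload.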