feat(extract): send tesseract TSV output to LLM instead of plain text #699

Currently the LLM extraction prompt receives only plain text from tesseract (`src.Text` in `llmextract.go:78`). The raw TSV output (`src.Data`) is stored on `TextSource` but never included in the prompt.

Tesseract TSV preserves information that plain text loses:
- **Spatial layout**: per-word bounding boxes (left, top, width, height)
- **Confidence scores**: per-word OCR confidence
- **Structural hierarchy**: block, paragraph, and line groupings

Including the TSV in the prompt could help the LLM make better extraction decisions, especially for invoices and forms with tabular data, where spatial relationships between words matter.
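To make the fields listed above concrete, here is a minimal sketch of reading word-level rows out of tesseract's TSV. It assumes the standard 12-column layout (`level`, `page_num`, `block_num`, `par_num`, `line_num`, `word_num`, `left`, `top`, `width`, `height`, `conf`, `text`), where level 5 marks a word. The `Word` type and `parseTSV` function are hypothetical names for illustration, not part of the existing code.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Word is a hypothetical type holding one word-level TSV row.
type Word struct {
	Block, Par, Line         int     // structural hierarchy
	Left, Top, Width, Height int     // bounding box in pixels
	Conf                     float64 // OCR confidence
	Text                     string
}

// parseTSV returns the word-level rows (level 5) of tesseract TSV output.
// The header, non-word levels, and malformed rows are skipped.
func parseTSV(tsv string) []Word {
	var words []Word
	for _, row := range strings.Split(tsv, "\n") {
		f := strings.Split(strings.TrimRight(row, "\r"), "\t")
		if len(f) < 12 || f[0] != "5" {
			continue
		}
		// Columns 0..9 are integers; column 10 (conf) may be fractional.
		n := make([]int, 10)
		ok := true
		for i := range n {
			v, err := strconv.Atoi(f[i])
			if err != nil {
				ok = false
				break
			}
			n[i] = v
		}
		conf, err := strconv.ParseFloat(f[10], 64)
		if !ok || err != nil {
			continue
		}
		words = append(words, Word{
			Block: n[2], Par: n[3], Line: n[4],
			Left: n[6], Top: n[7], Width: n[8], Height: n[9],
			Conf: conf,
			Text: strings.Join(f[11:], "\t"),
		})
	}
	return words
}

func main() {
	// Made-up sample rows: one line-level row and two word-level rows.
	sample := "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\tleft\ttop\twidth\theight\tconf\ttext\n" +
		"4\t1\t1\t1\t1\t0\t10\t20\t200\t12\t-1\t\n" +
		"5\t1\t1\t1\t1\t1\t10\t20\t60\t12\t96.5\tTotal\n" +
		"5\t1\t1\t1\t1\t2\t150\t20\t60\t12\t91\t42.00\n"
	for _, w := range parseTSV(sample) {
		fmt.Printf("%q at (%d,%d) conf=%.1f\n", w.Text, w.Left, w.Top, w.Conf)
	}
}
```

Both sample words share the same `top` value, which is the kind of signal (same row, different column) that plain text discards.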

**Tradeoff**: TSV is ~5-10x more verbose than plain text, which increases token consumption and may exhaust the token budget of small local models (e.g. qwen3:0.6b). Consider making this configurable, or sending TSV only for specific document types.
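One possible shape for the configurable option, as a rough sketch only: choose TSV when it is requested, present, and under a size cap, and fall back to plain text otherwise. The `textSource` struct is a stand-in for the real `TextSource` (its field types are assumed here), and `ocrFormat`, `ocrPayload`, and `maxBytes` are hypothetical names.

```go
package main

import "fmt"

// textSource stands in for the project's TextSource; field types are assumed.
type textSource struct {
	Text string // plain-text OCR output
	Data string // raw tesseract TSV
}

// ocrFormat is a hypothetical config value selecting the prompt payload.
type ocrFormat string

const (
	formatText ocrFormat = "text"
	formatTSV  ocrFormat = "tsv"
)

// ocrPayload returns the OCR content to embed in the prompt and the format
// actually used. TSV is used only when requested, non-empty, and no larger
// than maxBytes (maxBytes <= 0 disables the cap); otherwise plain text is used.
func ocrPayload(src textSource, want ocrFormat, maxBytes int) (string, ocrFormat) {
	fits := maxBytes <= 0 || len(src.Data) <= maxBytes
	if want == formatTSV && src.Data != "" && fits {
		return src.Data, formatTSV
	}
	return src.Text, formatText
}

func main() {
	src := textSource{
		Text: "Total 42.00",
		Data: "5\t1\t1\t1\t1\t1\t10\t20\t60\t12\t96.5\tTotal\n",
	}
	body, used := ocrPayload(src, formatTSV, 4096)
	fmt.Printf("format=%s bytes=%d\n", used, len(body))
}
```

A byte cap is only a crude proxy for token count; a per-document-type switch could be layered on top by choosing `want` before the call.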

