Question: Training HunyuanOCR for Information Extraction with BBox output #96

@Docubytes

Hello,

Thank you for open-sourcing the HunyuanOCR repository. This is a great project.

I am working on invoice document processing and would like to fine-tune HunyuanOCR not only for OCR text recognition, but also as an information extraction model that returns text along with bounding box (BBox) coordinates.

From the repository, I understand that training data is provided in train.jsonl / test.jsonl format with special placeholder tokens (e.g. <hy_place_holder_no_112>, <hy_place_holder_no_110>) to encode text and bounding boxes.

Could you please clarify the following?

1. Can HunyuanOCR be fine-tuned to output OCR results with bounding boxes for downstream information extraction tasks (such as invoices or forms)?

2. If so, could you share a minimal training sample (one image plus the corresponding train.jsonl entry) that demonstrates the recommended format for text + BBox supervision?

This would help us ensure that we are following the correct data format and training approach.

Thank you very much for your time and support.
