Feedback and Issues with HunyuanOCR Model Inference and Recognition Performance #95

@relic-yuexi

First and foremost, thank you to the team for developing the impressive HunyuanOCR model! I’ve been testing it and have some feedback regarding inference compatibility and recognition performance.

  1. Inference Compatibility Issue with vLLM

When running the model with vLLM on Python 3.12, I hit errors that do not occur on Python 3.10. The root cause appears to be a conflict between multiprocessing start methods. A related issue is documented here:
vllm-project/vllm#32611 (comment)
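
When comparing the Python 3.10 and 3.12 environments, it may help to record which multiprocessing start methods each interpreter offers and whether VLLM_WORKER_MULTIPROC_METHOD is already set. The snippet below is a minimal sketch using only the standard library; the helper name `describe_mp_env` is hypothetical and it does not reproduce the error itself.

```python
import multiprocessing
import os
import sys


def describe_mp_env():
    """Collect multiprocessing-related facts worth pasting into a bug report."""
    # get_all_start_methods() lists the supported methods, default first,
    # without fixing the start method as a side effect.
    methods = multiprocessing.get_all_start_methods()
    return {
        "python": f"{sys.version_info.major}.{sys.version_info.minor}",
        "default_start_method": methods[0],
        "available_start_methods": methods,
        # None means the variable is unset in this process.
        "VLLM_WORKER_MULTIPROC_METHOD": os.environ.get("VLLM_WORKER_MULTIPROC_METHOD"),
    }


if __name__ == "__main__":
    for key, value in describe_mp_env().items():
        print(f"{key}: {value}")
```

Running this under both interpreters gives a quick side-by-side view of the two setups before changing anything.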

Solution: The following code works correctly by setting VLLM_WORKER_MULTIPROC_METHOD to 'spawn' and properly cleaning resources:

import os
os.environ['VLLM_WORKER_MULTIPROC_METHOD'] = 'spawn'


from PIL import Image
from transformers import AutoProcessor
import torch.distributed as dist

def clean_repeated_substrings(text):
    """Clean repeated substrings in text"""
    n = len(text)
    if n < 8000:
        return text
    for length in range(2, n // 10 + 1):
        candidate = text[-length:]
        count = 0
        i = n - length
        while i >= 0 and text[i:i + length] == candidate:
            count += 1
            i -= length
        if count >= 10:
            return text[:n - length * (count - 1)]
    return text

def main():
    from vllm import LLM, SamplingParams
    
    model_path = "tencent/HunyuanOCR"
    llm = LLM(model=model_path, trust_remote_code=True, mm_processor_cache_gb=0, enable_prefix_caching=False)
    processor = AutoProcessor.from_pretrained(model_path)
    sampling_params = SamplingParams(temperature=0, max_tokens=16384)

    img_path = "page_000.png"
    img = Image.open(img_path)
    messages = [
        {"role": "system", "content": ""},
        {"role": "user", "content": [
            {"type": "image", "image": img_path},
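            # The prompt is sent to the model in Chinese. English gloss: "Extract all
            # information from the body of the document image as markdown, ignoring
            # headers and footers; express tables in HTML and formulas in LaTeX;
            # organize the output in reading order."
            {"type": "text", "text": (
                "提取文档图片中正文的所有信息用markdown格式表示,其中页眉、页脚部分忽略,表格用html格式表达,文档中公式用latex格式表示,按照阅读顺序组织进行解析。"
            )}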
        ]}
    ]
    prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = {"prompt": prompt, "multi_modal_data": {"image": [img]}}
    output = llm.generate([inputs], sampling_params)[0]
    print(clean_repeated_substrings(output.outputs[0].text))
    
    # Clean up resources properly
    del llm

    # Destroy the distributed process group (if one exists)
    if dist.is_initialized():
        dist.destroy_process_group()

if __name__ == "__main__":
    main()
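
To make the post-processing step easier to check in isolation, here is a small sketch that exercises `clean_repeated_substrings` on synthetic text, with the function copied from the script above. The test strings are made up for illustration. The function only acts on outputs of at least 8000 characters, and it keeps one copy of a unit that repeats ten or more times at the very end.

```python
def clean_repeated_substrings(text):
    """Trim a unit repeated 10+ times at the end of a long output to one copy."""
    n = len(text)
    if n < 8000:
        return text  # short outputs are returned untouched
    for length in range(2, n // 10 + 1):
        candidate = text[-length:]
        count = 0
        i = n - length
        # Walk backwards while the same unit keeps repeating.
        while i >= 0 and text[i:i + length] == candidate:
            count += 1
            i -= length
        if count >= 10:
            # Drop all but one copy of the repeated unit.
            return text[:n - length * (count - 1)]
    return text


# Synthetic, non-repeating prefix that is long enough to pass the threshold.
prefix = " ".join(str(i) for i in range(2000))
looping_output = prefix + "ab" * 50   # simulated degenerate tail
cleaned = clean_repeated_substrings(looping_output)

short_output = "ab" * 100             # below the threshold, left as-is
unchanged = clean_repeated_substrings(short_output)
```

Note that repetition in shorter outputs passes through unchanged, so this is a guard against runaway generation rather than a general deduplicator.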

  2. Recognition Performance Feedback

I tested the model on a 300 DPI version of HunyuanOCR_Technical_Report.pdf. Here are key observations:

Strengths (Excellent Performance)

• ✅ Page 5: Correctly identified bold text.

• ✅ Page 9: Inline formula recognition worked perfectly.

• ✅ Page 11: Formula recognition was outstanding!

• ✅ Page 11: Table 3 was recognized successfully.

• ✅ Page 13: Tables 5 and 6 were accurate (simple tables).

• ✅ References: All content was extracted (though italics were missed).

Areas for Improvement

⚠️ Page 1: Text should appear before image coordinates (order mismatch; see attached screenshot).

⚠️ Page 3: Table recognition needs improvement (see screenshot).

⚠️ Page 10: Table 2 was misrecognized.

⚠️ Page 12: Table 4 has minor formatting errors.

⚠️ Page 13: Bold text within tables was not detected.

⚠️ Page 18: Table 7 has poor formatting.

⚠️ Page 20: Figure 4’s coordinates were not recognized.

Screenshots

I’ve attached screenshots highlighting these cases (not shown here; would be included in the actual issue).

Conclusion

HunyuanOCR shows great potential, especially in formula and structured data extraction. Addressing the issues above, mainly complex tables, bold text inside tables, and figure coordinates, would make it even more robust. Thanks again for your work. I look forward to future updates!

Related Script:

https://gist.github.com/relic-yuexi/f34672b01d3e07d886f9a625f2558ed3
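
For anyone reproducing the 300 DPI test, the page images could be produced along these lines. This is an independent sketch, not the contents of the gist: it assumes the third-party pdf2image package (which requires Poppler) is installed, and the helper names `pixels_at_dpi`, `page_filename`, and `rasterize` are hypothetical. The file naming follows the `page_000.png` pattern used in the script above.

```python
from pathlib import Path

POINTS_PER_INCH = 72  # PDF page sizes are expressed in points


def pixels_at_dpi(width_pt, height_pt, dpi=300):
    """Pixel dimensions of a page given its size in PDF points."""
    scale = dpi / POINTS_PER_INCH
    return round(width_pt * scale), round(height_pt * scale)


def page_filename(index):
    """Zero-padded name matching the page_000.png pattern."""
    return f"page_{index:03d}.png"


def rasterize(pdf_path, out_dir, dpi=300):
    """Render every page of a PDF to PNG and return the written paths."""
    # Imported lazily so the helpers above work without pdf2image installed.
    from pdf2image import convert_from_path

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for index, page in enumerate(convert_from_path(str(pdf_path), dpi=dpi)):
        target = out / page_filename(index)
        page.save(target)  # each page is a PIL image
        written.append(target)
    return written


# Example: a US Letter page (612 x 792 pt) rendered at 300 DPI.
letter_px = pixels_at_dpi(612, 792)
```

A call such as `rasterize("HunyuanOCR_Technical_Report.pdf", "pages")` would then write one PNG per page; the page size of the actual report is not stated in this issue, so the Letter dimensions are only an example.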
