Feedback and Issues with HunyuanOCR Model Inference and Recognition Performance #95

First and foremost, thank you to the team for developing the impressive HunyuanOCR model! I’ve been testing it and have some feedback regarding inference compatibility and recognition performance.  

1. Inference Compatibility Issue with vLLM

When running the model with vLLM on Python 3.12, I encountered errors that do not occur on Python 3.10. The root cause appears to be a conflict between multiprocessing start methods. A related issue is documented here:  
https://github.com/vllm-project/vllm/issues/32611#issuecomment-3776818236  
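
Before changing anything, it can help to see which start methods the interpreter offers and which one is currently selected. The snippet below is only a diagnostic sketch: `prefer_spawn` is a hypothetical helper name, not part of vLLM, and it uses nothing beyond the standard-library `multiprocessing` module and the `VLLM_WORKER_MULTIPROC_METHOD` environment variable mentioned in this issue.

```python
import multiprocessing as mp
import os


def prefer_spawn(environ=os.environ):
    """Hypothetical helper: request 'spawn' for vLLM workers and report context.

    `environ` defaults to the real process environment; a plain dict can be
    passed instead to try the helper without side effects.
    """
    available = mp.get_all_start_methods()
    if "spawn" not in available:
        raise RuntimeError("'spawn' is not available on this platform")

    # The same variable the workaround in this issue sets.
    environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

    return {
        "vllm_worker_method": environ["VLLM_WORKER_MULTIPROC_METHOD"],
        # None means no start method has been fixed yet in this process.
        "python_default": mp.get_start_method(allow_none=True),
        "available": available,
    }


if __name__ == "__main__":
    print(prefer_spawn())
```

Printing this report on both Python 3.10 and 3.12 environments is one way to check whether the two setups actually differ in their start-method configuration.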

Solution: The following code works correctly. It sets VLLM_WORKER_MULTIPROC_METHOD to 'spawn' before vLLM is imported and explicitly cleans up resources at the end:  

```
import os

# Set the worker start method before vLLM is imported (the import happens inside main()).
os.environ['VLLM_WORKER_MULTIPROC_METHOD'] = 'spawn'

from PIL import Image
from transformers import AutoProcessor
import torch.distributed as dist

def clean_repeated_substrings(text):
    """Clean repeated substrings in text"""
    n = len(text)
    if n < 8000:
        return text
    for length in range(2, n // 10 + 1):
        candidate = text[-length:]
        count = 0
        i = n - length
        while i >= 0 and text[i:i + length] == candidate:
            count += 1
            i -= length
        if count >= 10:
            return text[:n - length * (count - 1)]
    return text

def main():
    from vllm import LLM, SamplingParams
    
    model_path = "tencent/HunyuanOCR"
    llm = LLM(model=model_path, trust_remote_code=True, mm_processor_cache_gb=0, enable_prefix_caching=False)
    processor = AutoProcessor.from_pretrained(model_path)
    sampling_params = SamplingParams(temperature=0, max_tokens=16384)

    img_path = "page_000.png"
    img = Image.open(img_path)
    messages = [
        {"role": "system", "content": ""},
        {"role": "user", "content": [
            {"type": "image", "image": img_path},
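            # The prompt is kept in Chinese exactly as sent to the model. In English:
            # "Extract all body-text information from the document image as markdown,
            # ignoring headers and footers; express tables in HTML and formulas in
            # LaTeX; parse in reading order."
            {"type": "text", "text": (
                "提取文档图片中正文的所有信息用markdown格式表示，其中页眉、页脚部分忽略，表格用html格式表达，文档中公式用latex格式表示，按照阅读顺序组织进行解析。"
            )}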
        ]}
    ]
    prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = {"prompt": prompt, "multi_modal_data": {"image": [img]}}
    output = llm.generate([inputs], sampling_params)[0]
    print(clean_repeated_substrings(output.outputs[0].text))
    
    # Clean up resources properly
    del llm
    
    # Destroy the distributed process group, if one exists
    if dist.is_initialized():
        dist.destroy_process_group()

if __name__ == "__main__":
    main()
```
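
To show what the `clean_repeated_substrings` post-processing step does, here is a small standalone check. The function body is copied from the script above; the sample strings are made up for illustration and are not real model output.

```python
def clean_repeated_substrings(text):
    """Trim a long run of one substring repeated at the end of the text."""
    n = len(text)
    if n < 8000:
        # Short outputs are returned untouched.
        return text
    for length in range(2, n // 10 + 1):
        candidate = text[-length:]
        count = 0
        i = n - length
        # Walk backwards while the same chunk keeps repeating.
        while i >= 0 and text[i:i + length] == candidate:
            count += 1
            i -= length
        if count >= 10:
            # Keep a single copy of the repeated chunk.
            return text[:n - length * (count - 1)]
    return text


if __name__ == "__main__":
    # Illustrative input: a short prefix followed by a degenerate repetition loop.
    looping = "header " + "ab" * 5000
    print(repr(clean_repeated_substrings(looping)))  # 'header ab'

    # Below the 8000-character threshold nothing is changed.
    short = "ab" * 20
    print(clean_repeated_substrings(short) == short)  # True
```

Note that the function only looks at repetition at the very end of the output and only when the output is at least 8000 characters long, so repeated content in the middle of a page is left alone.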


2. Recognition Performance Feedback

I tested the model on a 300 DPI version of HunyuanOCR_Technical_Report.pdf. Here are key observations:  

Strengths (Excellent Performance)

• ✅ Page 5: Correctly identified bold text.  

• ✅ Page 9: Inline formula recognition worked perfectly.  

• ✅ Page 11: Formula recognition was outstanding!  

• ✅ Page 11: Table 3 was recognized successfully.  

• ✅ Page 13: Tables 5 and 6 were accurate (simple tables).  

• ✅ References: All content was extracted (though italics were missed).  

Areas for Improvement

• ⚠️ Page 1: Text should appear before image coordinates (order mismatch; see attached screenshot).  

<img width="1154" height="894" alt="Image" src="https://github.com/user-attachments/assets/860dfea1-7671-4650-ae8c-5a9f1ad2e41a" />

• ⚠️ Page 3: Table recognition needs improvement (see screenshot).  

<img width="2095" height="816" alt="Image" src="https://github.com/user-attachments/assets/554ed576-db31-4d75-baf7-58460c989389" />

• ⚠️ Page 10: Table 2 was misrecognized.  

<img width="2088" height="588" alt="Image" src="https://github.com/user-attachments/assets/2f68a235-bae7-4970-96e1-11839930ed9b" />

• ⚠️ Page 12: Table 4 has minor formatting errors.  

<img width="2153" height="843" alt="Image" src="https://github.com/user-attachments/assets/14fb2f40-8082-4226-a90d-6cd3014e56ed" />

• ⚠️ Page 13: Bold text within tables was not detected.  

<img width="2026" height="323" alt="Image" src="https://github.com/user-attachments/assets/46de8004-2c62-4a80-90c7-7c1876fe2437" />

• ⚠️ Page 18: Table 7 has poor formatting.  

<img width="2108" height="936" alt="Image" src="https://github.com/user-attachments/assets/37e3f997-5db7-407a-9319-279749f660b6" />

• ⚠️ Page 20: Figure 4’s coordinates were not recognized.  

<img width="2176" height="338" alt="Image" src="https://github.com/user-attachments/assets/5d87bf7d-9562-4432-ae80-4ff38eae4f4d" />

Screenshots

The screenshots embedded under each item above highlight these cases.  

Conclusion

HunyuanOCR shows great potential, especially in formula and structured data extraction. Addressing the above issues would make it even more robust. Thanks again for your work—I look forward to future updates!  

Related Script:

https://gist.github.com/relic-yuexi/f34672b01d3e07d886f9a625f2558ed3
