Fix GenAI server LLM inference on SOC_ACCELERATOR devices (Hailo-10H) #39

Open
jordanskole wants to merge 4 commits into hailo-ai:master from jordanskole:master

@jordanskole

Summary

  • Fix HAILO_NOT_IMPLEMENTED during model creation on SOC_ACCELERATOR devices by using the MemoryView overload of create_infer_model(), which correctly routes to
    InferModelHrpcClient on VDeviceHrpcClient
  • Increase WAIT_FOR_OPERATION_TIMEOUT from 10s to 120s to accommodate large HEF transfers over PCIe RPC (e.g., ~2.3 GB for Qwen2.5-1.5B on Raspberry Pi 5)
  • Fix use-after-free in TokenEmbedder: the Eigen::Map backing memory (the HEF buffer) was freed once model creation completed, causing SIGSEGV during token
    generation. Calling set_resource_guard(hef_buffer) keeps the buffer alive. This bug affects all device types, not just Hailo-10H.
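
The lifetime problem behind the use-after-free fix can be illustrated with a minimal, standalone sketch. It does not use HailoRT or Eigen: `EmbedderSketch`, `Buffer`, `BufferPtr`, and `create_embedder()` are hypothetical stand-ins, and the `set_resource_guard()` shown here only mimics the idea named in the summary (hold a shared owner of the backing memory next to the non-owning view), not the actual TokenEmbedder implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for the HEF buffer: shared ownership of raw bytes.
using Buffer = std::vector<std::uint8_t>;
using BufferPtr = std::shared_ptr<Buffer>;

// Hypothetical stand-in for TokenEmbedder. It holds a non-owning view
// (pointer + size), similar in spirit to an Eigen::Map over external memory.
class EmbedderSketch {
public:
    EmbedderSketch(const std::uint8_t *data, std::size_t size)
        : m_data(data), m_size(size) {}

    // Store a shared owner of the backing memory so the view stays valid
    // for as long as this object lives.
    void set_resource_guard(BufferPtr guard) { m_guard = std::move(guard); }

    std::uint8_t at(std::size_t i) const { return m_data[i]; }
    std::size_t size() const { return m_size; }

private:
    const std::uint8_t *m_data;
    std::size_t m_size;
    BufferPtr m_guard; // keeps the viewed memory alive
};

// Mimics a creation chain whose local buffer handle goes out of scope
// when the function returns.
EmbedderSketch create_embedder()
{
    auto hef_buffer = std::make_shared<Buffer>(Buffer{10, 20, 30});
    EmbedderSketch embedder(hef_buffer->data(), hef_buffer->size());
    embedder.set_resource_guard(hef_buffer); // without this, the view dangles
    return embedder;
}
```

In this sketch, dropping the `set_resource_guard()` call would leave `m_data` pointing at freed memory as soon as `create_embedder()` returns, which is the same class of failure as the SIGSEGV described above.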

Files changed

  • genai/llm/llm_inference_manager.hpp / .cpp — pass hef_buffer through, use MemoryView overload
  • genai/llm/llm_server.hpp / .cpp — thread hef_buffer through creation chain, call set_resource_guard()
  • genai/vlm/vlm_server.hpp / .cpp — same fixes for VLM path
  • genai/utils.hpp — increase WAIT_FOR_OPERATION_TIMEOUT
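
As a rough sanity check on the timeout change, the sketch below computes the minimum sustained transfer rate needed to move a ~2.3 GB HEF within each timeout. The constant names `OLD_TIMEOUT`, `NEW_TIMEOUT`, `HEF_SIZE_MB`, and the helper `min_throughput_mb_per_s()` are hypothetical; the actual definition of WAIT_FOR_OPERATION_TIMEOUT in genai/utils.hpp may use a different type, and the size is the approximate figure quoted in the summary.

```cpp
#include <chrono>
#include <cstdint>

// Assumed values taken from the PR description, not from the HailoRT source.
constexpr std::chrono::seconds OLD_TIMEOUT{10};
constexpr std::chrono::seconds NEW_TIMEOUT{120};

// Approximate HEF size quoted in the summary (~2.3 GB), in decimal megabytes.
constexpr std::uint64_t HEF_SIZE_MB = 2300;

// Minimum sustained throughput (MB/s, rounded up) needed to move
// `size_mb` megabytes before `timeout` expires.
constexpr std::uint64_t min_throughput_mb_per_s(std::uint64_t size_mb,
                                                std::chrono::seconds timeout)
{
    const auto secs = static_cast<std::uint64_t>(timeout.count());
    return (size_mb + secs - 1) / secs;
}

// With a 10 s limit the whole transfer must sustain 230 MB/s;
// with 120 s it only needs about 20 MB/s.
static_assert(min_throughput_mb_per_s(HEF_SIZE_MB, OLD_TIMEOUT) == 230, "10 s case");
static_assert(min_throughput_mb_per_s(HEF_SIZE_MB, NEW_TIMEOUT) == 20, "120 s case");
```

The numbers only bound the required average rate; they say nothing about what the link actually achieves on a given host.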

Test plan

  • Verified LLM inference works end-to-end on Hailo-10H (RPi 5) with Qwen2.5-1.5B-Instruct
  • Model loads without HAILO_NOT_IMPLEMENTED or HAILO_TIMEOUT
  • 1000+ tokens generated without SIGSEGV
