Fix GenAI server LLM inference on SOC_ACCELERATOR devices (Hailo-10H) by jordanskole · Pull Request #39 · hailo-ai/hailort

jordanskole · 2026-02-04T18:23:53Z

Summary

Fix HAILO_NOT_IMPLEMENTED during model creation on SOC_ACCELERATOR devices by using the MemoryView overload of create_infer_model(), which correctly routes to InferModelHrpcClient on VDeviceHrpcClient
Increase WAIT_FOR_OPERATION_TIMEOUT from 10s to 120s to accommodate large HEF transfers over PCIe RPC (e.g., ~2.3 GB for Qwen2.5-1.5B on Raspberry Pi 5)
Fix use-after-free in TokenEmbedder where the Eigen::Map backing memory (from the HEF buffer) is freed after model creation completes, causing SIGSEGV during token generation. Call set_resource_guard(hef_buffer) to keep the buffer alive. This bug affects all device types, not just Hailo-10H.
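
The lifetime issue behind the third fix can be illustrated with a minimal sketch. This is not the HailoRT code: `EmbedderSketch`, `Buffer`, `make_embedder` and `guard_use_count` are hypothetical names, a raw pointer stands in for the Eigen::Map view, and a `std::vector<float>` stands in for the HEF buffer. Only the idea of a `set_resource_guard()` call that stores a shared owner of the backing memory is taken from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for the HEF buffer: reference-counted, owning storage.
using Buffer = std::vector<float>;
using BufferPtr = std::shared_ptr<Buffer>;

// Hypothetical stand-in for TokenEmbedder. It holds a NON-owning view of the
// embedding table, the way an Eigen::Map views memory it does not own.
class EmbedderSketch {
public:
    EmbedderSketch(const float *table, std::size_t rows, std::size_t cols)
        : m_table(table), m_rows(rows), m_cols(cols) {}

    // Store a shared owner of the backing memory so the view stays valid
    // for as long as this object lives.
    void set_resource_guard(BufferPtr guard) { m_guard = std::move(guard); }

    // Read one element through the non-owning view.
    float at(std::size_t row, std::size_t col) const {
        assert(row < m_rows && col < m_cols);
        return m_table[row * m_cols + col];
    }

    // Number of shared owners of the backing buffer (0 if no guard was set).
    long guard_use_count() const { return m_guard.use_count(); }

private:
    const float *m_table;
    std::size_t m_rows;
    std::size_t m_cols;
    BufferPtr m_guard;
};

// Builds an embedder over a freshly allocated buffer. The local `buffer`
// handle goes out of scope on return, mirroring the HEF buffer being released
// once model creation completes. Without the guard call, the returned object
// would hold a dangling pointer.
EmbedderSketch make_embedder(std::size_t rows, std::size_t cols) {
    auto buffer = std::make_shared<Buffer>(rows * cols);
    for (std::size_t i = 0; i < buffer->size(); ++i) {
        (*buffer)[i] = static_cast<float>(i);
    }
    EmbedderSketch embedder(buffer->data(), rows, cols);
    embedder.set_resource_guard(buffer);  // keep the backing memory alive
    return embedder;
}
```

In this sketch, removing the `set_resource_guard(buffer)` line makes every later `at()` call read freed memory, which is the same class of failure as the SIGSEGV seen during token generation.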

Files changed

genai/llm/llm_inference_manager.hpp / .cpp — pass hef_buffer through, use MemoryView overload
genai/llm/llm_server.hpp / .cpp — thread hef_buffer through creation chain, call set_resource_guard()
genai/vlm/vlm_server.hpp / .cpp — same fixes for VLM path
genai/utils.hpp — increase WAIT_FOR_OPERATION_TIMEOUT
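
For context on the WAIT_FOR_OPERATION_TIMEOUT change, the sketch below works out the sustained throughput a ~2.3 GB transfer would need under each timeout. It assumes the whole HEF transfer has to finish inside a single timeout window, which the description implies but does not state outright. The constant and function names are hypothetical, and no measured throughput for the Raspberry Pi 5 link is implied.

```cpp
#include <chrono>

// Minimum sustained throughput, in decimal MB/s, needed to move `bytes`
// within `timeout`. Assumes one timeout window covers the whole transfer.
constexpr double min_throughput_mb_per_s(double bytes, std::chrono::seconds timeout) {
    return bytes / 1e6 / static_cast<double>(timeout.count());
}

// Figures taken from the PR description; the names are illustrative only.
constexpr double HEF_BYTES = 2.3e9;               // ~2.3 GB (Qwen2.5-1.5B)
constexpr std::chrono::seconds OLD_TIMEOUT{10};   // previous value
constexpr std::chrono::seconds NEW_TIMEOUT{120};  // value after this PR

// Required rate under each timeout:
//   old: 2300 MB / 10 s  = 230 MB/s
//   new: 2300 MB / 120 s = about 19.2 MB/s
constexpr double OLD_REQUIRED_MB_S = min_throughput_mb_per_s(HEF_BYTES, OLD_TIMEOUT);
constexpr double NEW_REQUIRED_MB_S = min_throughput_mb_per_s(HEF_BYTES, NEW_TIMEOUT);
```

Under that assumption, the old limit only works if the link sustains about 230 MB/s for the full transfer, while the new limit tolerates anything above roughly 19.2 MB/s.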

Test plan

Verified LLM inference works end-to-end on Hailo-10H (RPi 5) with Qwen2.5-1.5B-Instruct
Model loads without HAILO_NOT_IMPLEMENTED or HAILO_TIMEOUT
1000+ tokens generated without SIGSEGV

20260204 jls hailo server

jordanskole added 4 commits February 4, 2026 12:47

working locally

8403bb6

no build in github, otherwise working hailo_server

d0b15ad

explaination

b9bab64

Merge pull request #1 from jordanskole/20260204-jls-hailo_server

cf6986d

20260204 jls hailo server
