## Model Information
- **Model Name:** Diagnostic-Reasoning-Q3X1
- **Organization:** Clinical-Reasoning-Hub, UAEU College of Medicine
- **Model Size:** 8B parameters
- **Base Model:** Qwen3-8B fine-tuned with clinical reasoning methodology
- **HuggingFace:** https://huggingface.co/Clinical-Reasoning-Hub/Diagnostic-Reasoning-Q3X1
- **Contact:** adnangha@uaeu.ac.ae
## Results
[pentabrid_v9_full_report.json](https://github.com/user-attachments/files/25393574/pentabrid_v9_full_report.json)
## Evaluation Details
- **Method:** Zero-shot CoT (generation-based)
- **Max tokens:** 8,192
- **Temperature:** 0.0 (greedy)
- **Framework:** vLLM on NVIDIA H100 80GB
- **Inference mode:** BF16 full precision
## Key Highlight
Diagnostic-Reasoning-Q3X1 is an 8B-parameter model that achieves performance competitive with models 8-84x larger on expert-level medical reasoning. To the best of our knowledge, it is the first sub-10B model submitted to MedXpertQA.
---

## Pentabrid v9 — Complete Evaluation Report
**Generation-based scores:**

| Benchmark | Generation | Log-likelihood | Delta (pp) |
|---|---:|---:|---:|
| MedQA (USMLE) | 67.0% | 66.3% | +0.7 |
| MedMCQA | 58.9% | 58.6% | +0.3 |
| PubMedQA | 69.5% | 66.6% | +2.9 |
| MMLU Clinical Knowledge | 85.3% | 86.4% | -1.1 |

**Expert-level benchmark:**

- MedXpertQA Text: 24.9% (3rd globally, 1st among sub-10B models)

**Log-likelihood overall:** 76.4% (average across 7 benchmarks)

**Leaderboard highlights:**

- Beats LLaMA-3.3-70B on MedXpertQA (8B vs. 70B)
- Beats DeepSeek-V3 on MedXpertQA (8B vs. 671B)
- Beats MedReason-8B on MedQA by +5.3pp
- First sub-10B model on the MedXpertQA leaderboard
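The generation-vs-log-likelihood gaps quoted above are plain percentage-point differences between the two scoring modes; as a sanity check, they can be recomputed directly from the reported accuracies:

```python
# Generation-based vs. log-likelihood accuracy (percent), from the report above.
scores = {
    "MedQA (USMLE)":           (67.0, 66.3),
    "MedMCQA":                 (58.9, 58.6),
    "PubMedQA":                (69.5, 66.6),
    "MMLU Clinical Knowledge": (85.3, 86.4),
}

# Percentage-point delta: positive means generation-based scoring is higher.
deltas = {name: round(gen - loglik, 1) for name, (gen, loglik) in scores.items()}
```

Generation-based scoring helps most on PubMedQA (+2.9pp) and slightly hurts on MMLU Clinical Knowledge (-1.1pp), matching the figures in the report.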
## Submission File
[Attached: pentabrid_v9_full_report.json]
**Summary:** 24.9% on MedXpertQA Text from a lightweight 8B-parameter model, outperforming DeepSeek-V3, LLaMA-3.3-70B, and many other much larger models.