
[Leaderboard Submission] Diagnostic-Reasoning-Q3X1 (Qwen3-8B, 8B params) #11

@adnanagha-AI

Score: 24.9% on MedXpertQA Text with a lightweight 8B-parameter model, higher than DeepSeek-V3, LLaMA-3.3-70B, and many other much larger models.

## Model Information
- **Model Name:** Diagnostic-Reasoning-Q3X1
- **Organization:** Clinical-Reasoning-Hub, UAEU College of Medicine
- **Model Size:** 8B parameters
- **Base Model:** Qwen3-8B fine-tuned with clinical reasoning methodology
- **HuggingFace:** https://huggingface.co/Clinical-Reasoning-Hub/Diagnostic-Reasoning-Q3X1
- **Contact:** adnangha@uaeu.ac.ae

## Results

[pentabrid_v9_full_report.json](https://github.com/user-attachments/files/25393574/pentabrid_v9_full_report.json)

## Evaluation Details
- **Method:** Zero-shot CoT (generation-based)
- **Max tokens:** 8,192
- **Temperature:** 0.0 (greedy)
- **Framework:** vLLM on NVIDIA H100 80GB
- **Inference mode:** BF16 (unquantized weights)
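
The settings above could be wired into a vLLM run roughly as follows. This is a minimal sketch, not the submission's actual harness: the prompt template, the `Answer: <letter>` convention, the A-J option range, and the `question` / `options` / `label` field names are all assumptions made for illustration. It also passes raw prompts to `generate` without applying a chat template, which a real run on a Qwen3-based model would probably want.

```python
import re

# Decoding settings copied from the Evaluation Details list.
EVAL_CONFIG = {
    "model": "Clinical-Reasoning-Hub/Diagnostic-Reasoning-Q3X1",
    "dtype": "bfloat16",
    "temperature": 0.0,  # greedy decoding
    "max_tokens": 8192,
}

# Hypothetical zero-shot CoT template; the submission does not state its prompt.
PROMPT_TEMPLATE = (
    "{question}\n\n{options}\n\n"
    "Think step by step, then finish with a line of the form 'Answer: <letter>'."
)


def build_prompt(question, options):
    """Format one multiple-choice item; `options` maps letters to option text."""
    lines = "\n".join(f"{letter}. {text}" for letter, text in options.items())
    return PROMPT_TEMPLATE.format(question=question, options=lines)


def extract_answer(completion):
    """Return the last 'Answer: X' letter in the completion, or None if absent."""
    matches = re.findall(r"Answer:\s*\(?([A-J])\b", completion)
    return matches[-1] if matches else None


def run_eval(items):
    """Generate one completion per item and return accuracy in [0, 1]."""
    # Imported lazily so the helpers above work without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(model=EVAL_CONFIG["model"], dtype=EVAL_CONFIG["dtype"])
    params = SamplingParams(
        temperature=EVAL_CONFIG["temperature"],
        max_tokens=EVAL_CONFIG["max_tokens"],
    )
    prompts = [build_prompt(it["question"], it["options"]) for it in items]
    outputs = llm.generate(prompts, params)
    preds = [extract_answer(out.outputs[0].text) for out in outputs]
    correct = sum(p == it["label"] for p, it in zip(preds, items))
    return correct / len(items)
```

Taking the last `Answer:` match rather than the first is a design choice for chain-of-thought output, where the model may mention candidate answers before committing to one.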

## Key Highlight
Diagnostic-Reasoning-Q3X1 is an 8B-parameter model whose performance on
expert-level medical reasoning is competitive with models 8-84x larger.
To the best of our knowledge, this is the first sub-10B model submitted to MedXpertQA.
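
The "8-84x" range appears to follow from the parameter counts quoted in the leaderboard highlights of this report (70B and 671B against 8B). A small sketch of that arithmetic, assuming those are the two endpoints the range refers to:

```python
# Parameter counts in billions, as quoted in this submission.
SUBMITTED_PARAMS_B = 8
COMPARED_PARAMS_B = {
    "LLaMA-3.3-70B": 70,
    "DeepSeek-V3": 671,
}

# Size ratio of each compared model relative to the 8B submission.
ratios = {
    name: size / SUBMITTED_PARAMS_B for name, size in COMPARED_PARAMS_B.items()
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.3f}x the size of the submitted model")
```

The ratios come out to 8.75 and 83.875, which round to the quoted 8x and 84x. The comparison uses total parameter counts as given in the submission.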

=========================================================
## PENTABRID V9 — COMPLETE EVALUATION REPORT
=========================================================

  GENERATION-BASED SCORES:
    MedQA (USMLE):            67.0%  (log-lik: 66.3%, +0.7pp)
    MedMCQA:                  58.9%  (log-lik: 58.6%, +0.3pp)
    PubMedQA:                 69.5%  (log-lik: 66.6%, +2.9pp)
    MMLU Clinical Knowledge:  85.3%  (log-lik: 86.4%, -1.1pp)

  EXPERT-LEVEL BENCHMARK:
    MedXpertQA Text:          24.9%  (3rd globally, 1st sub-10B)

  LOG-LIKELIHOOD OVERALL:     76.4%  (7-benchmark average)

  LEADERBOARD HIGHLIGHTS:
    - Beats LLaMA-3.3-70B on MedXpertQA (8B vs 70B)
    - Beats DeepSeek-V3 on MedXpertQA (8B vs 671B)
    - Beats MedReason-8B on MedQA by +5.3pp
    - First sub-10B model on MedXpertQA leaderboard
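
The percentage-point deltas in the generation-based scores can be rechecked directly from the reported pairs. The sketch below uses only the figures printed in this report; the dictionary layout and the `delta_pp` helper are illustrative names, not part of the attached JSON.

```python
# (generation-based %, log-likelihood %) as printed in the report.
REPORTED = {
    "MedQA (USMLE)": (67.0, 66.3),
    "MedMCQA": (58.9, 58.6),
    "PubMedQA": (69.5, 66.6),
    "MMLU Clinical Knowledge": (85.3, 86.4),
}


def delta_pp(generation, log_likelihood):
    """Generation score minus log-likelihood score, in percentage points."""
    return round(generation - log_likelihood, 1)


deltas = {name: delta_pp(gen, ll) for name, (gen, ll) in REPORTED.items()}

for name, d in deltas.items():
    print(f"{name}: {d:+.1f}pp")
```

The computed values match the deltas listed in the report: generation-based scoring is higher on three of the four benchmarks and lower on MMLU Clinical Knowledge.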

## Submission File
[Attached: pentabrid_v9_full_report.json]
