AgentFinVQA is a multi-agent evaluation framework for Visual Question Answering on financial charts, supporting both ChartQAPro and FinMME. The framework decomposes chart QA into an explicit Plan → Inspect → Explain loop, producing fully traceable evaluation artifacts for each sample.
| Dataset | HF handle | Notes |
|---|---|---|
| ChartQAPro | ahmed-masry/ChartQAPro | Multi-turn chart QA with factoid/mcq/unanswerable tasks. Images cached under data/chartqapro_images/ by default. |
| FinMME | luojunyu/FinMME | Financial multi-modal evaluation benchmark (only a train split on HF; use slicing like train[:1000]). Images cached under data/finmme_images/ by default. |
Select the dataset at runtime with --dataset {chartqapro|finmme}; all downstream tooling (Langfuse registration, output directories) key off the same slug.
Unlike single-pass VLM approaches, AgentFinVQA coordinates multiple specialized agents:
- PlannerAgent — text-only LLM that generates a structured inspection plan without seeing the image
- OcrReaderTool — focused VLM call that transcribes all visible text from the chart
- VisionAgent — CrewAI-orchestrated agent that executes the plan and produces an answer
- VerifierAgent — second VLM that critiques the draft answer and confirms or revises it
Each run produces a Model Evaluation Packet (MEP) — a portable JSON artifact capturing the full trace: inspection plan, vision reasoning, verifier critique, tool call logs, timestamps, and errors. MEPs enable reproducible evaluation, post-hoc explainability analysis, and model comparison across VLM backends.
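As a rough illustration of what such a packet might contain, the sketch below builds a minimal MEP-like record and serialises it to JSON. The class name MepSketch and every field name are assumptions made for illustration only; the actual schema lives in src/agentfinvqa/mep/ and may differ.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class MepSketch:
    """Hypothetical, simplified stand-in for a Model Evaluation Packet."""

    sample_id: str
    question: str
    plan_steps: list = field(default_factory=list)   # planner output
    vision_answer: str = ""                          # draft answer from vision
    vision_explanation: str = ""                     # vision reasoning
    verifier_verdict: str = ""                       # CONFIRM or REVISE
    tool_calls: list = field(default_factory=list)   # tool call log
    errors: list = field(default_factory=list)       # errors seen during the run
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        # sort_keys keeps the artifact stable across runs for easy diffing
        return json.dumps(asdict(self), indent=2, sort_keys=True)


mep = MepSketch(
    sample_id="demo-0001",
    question="Which series peaks in 2021?",
    plan_steps=["read the legend", "compare the 2021 values"],
    vision_answer="Revenue",
    verifier_verdict="CONFIRM",
)
print(mep.to_json())
```

Because the record is plain JSON, it can be reloaded with json.loads for post-hoc analysis without importing the pipeline.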
Input Sample (question, chart image, expected answer)
│
▼
PlannerAgent (text-only LLM)
• Produces a structured JSON inspection plan
• MCQ-aware: checks/eliminates each choice; multi-select guidance
• Does NOT see the image
│ plan.steps
▼
OcrReaderTool
• Single VLM call focused on text transcription
• Produces structured chart metadata (axes, legend, data labels)
│ ocr_text + chart_type
▼
LegendGrounderTool (conditional)
• Triggered for line/bar/scatter/area/pie/donut charts
• Maps legend labels → color descriptions + RGB + line style
• Compliance check: re-runs if legend entries are missing
│ legend_map
▼
VisionAgent (CrewAI + VisionQATool)
• Executes the plan using OCR text and legend map as ground truth
• Single-select MCQ / multi-select MCQ / open-ended answer paths
• Produces answer + explanation + per-choice confidence analysis
│ draft_answer
▼
Forced-Choice Retry (conditional)
• If vision returns UNANSWERABLE and MCQ choices exist:
re-runs vision with explicit "FORCED CHOICE" instruction
│ draft_answer (revised if retry triggered)
▼
VerifierAgent (CrewAI + VerifierTool)
• Reviews draft answer against chart image
• Adds reluctance hint when vision confidence is high (≥ 0.85)
• Verdict: CONFIRM or REVISE + self-reported confidence
• Confidence gate: downgrades low-confidence revisions (< 0.75)
│
▼
MEP (Model Evaluation Packet)
• JSON artifact stored to disk
• Optionally traced in Langfuse
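The conditional steps in the diagram above (forced-choice retry, reluctance hint, confidence gate) can be sketched as three small predicates. The thresholds 0.85 and 0.75 come from the diagram; the function names, the argument shapes, and the reading of "downgrade" as "fall back to the draft answer" are assumptions, not the pipeline's actual code.

```python
RELUCTANCE_THRESHOLD = 0.85  # diagram: hint when vision confidence >= 0.85
REVISION_GATE = 0.75         # diagram: revisions below 0.75 are downgraded


def needs_forced_choice(draft_answer, choices):
    """Retry with a FORCED CHOICE instruction only for MCQ samples."""
    return draft_answer.strip().upper() == "UNANSWERABLE" and bool(choices)


def wants_reluctance_hint(vision_confidence):
    """Tell the verifier to be reluctant to revise a confident draft."""
    return vision_confidence >= RELUCTANCE_THRESHOLD


def apply_confidence_gate(draft, verdict, revised, verifier_confidence):
    """Return (final_answer, final_verdict) after the confidence gate."""
    if verdict == "REVISE" and verifier_confidence >= REVISION_GATE:
        return revised, "REVISE"
    # Assumption: a low-confidence revision is treated as a confirmation
    # of the original draft answer.
    return draft, "CONFIRM"


print(apply_confidence_gate("A", "REVISE", "B", 0.60))
print(apply_confidence_gate("A", "REVISE", "B", 0.90))
```

Keeping the thresholds as named constants makes an ablation over the gate a one-line change.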
| Run | Accuracy | Δ vs baseline | Key change |
|---|---|---|---|
| no_legend_grounding | 48.0% | — | Baseline |
| fixes_v1 | 50.4% | +2.4 pp | Legend grounding, caption injection, token limits |
| fixes_v2 | 51.6% | +3.6 pp | Disable thinking tokens, MCQ choices to verifier |
| fixes_v3 | 51.6% | +3.6 pp | Thinking budget = 512 |
| fixes_v4_g3flash | 56.0% | +8.0 pp | Gemini 3 Flash, forced-choice retry, MCQ-aware planner |
| fixes_v5_multiselect | 69.4% | +21.4 pp | Full multi-select MCQ support |
| fixes_v7_g3flash_conf_gate | 69.6% | +21.6 pp | Confidence gate fix, fresh g3flash run |
| fixes_v8_g3flash_color_area | 71.2% (n = 1,250) | +23.2 pp | Color-area OpenCV pre-hint; see results.md §8b |
| fixes_v9_g3flash_related_sents | 71.3% (n = 1,250) | +23.3 pp | Verifier + related_sentences + caption cross-check; ~2.4× tighter latency tail vs v8 (p95 87 s vs 209 s) |
| fixes_v10_g3flash_choice_conflict | 71.1% (n = 1,250) | +23.1 pp | v9 + high-confidence choice-conflict flag for verifier |
vs. FinMME paper (Table 3, Gemini Flash 2.0 = 51.85%): our best 250-ID ladder run achieves +17.8 pp (v7 mean answer_accuracy vs paper headline — metric families differ).
Fair same-model baseline (Gemini-3 Flash Preview structured zero-shot vs agent):
- Primary (matched n = 1,250 train IDs): zero-shot mean answer_accuracy 63.56% vs agents: v8 71.24% (+7.68 pp, exact +8.72 pp, McNemar χ² = 68.21, p ≈ 1.1×10⁻¹⁶); v9 71.28% (+7.72 pp, exact +8.16 pp, χ² = 61.45, p ≈ 4.5×10⁻¹⁵); v10 71.08% (+7.52 pp, exact +7.84 pp, χ² = 57.37, p ≈ 3.6×10⁻¹⁴). All three significantly outperform zero-shot; no pairwise difference between agents is significant (v9 vs v8 p = 0.56; v10 vs v8 p = 0.34; v9 vs v10 p = 0.75). v9's distinctive contribution is latency-tail tightening, not extra accuracy. The full zero-shot train file has 11,099 rows, so always join on sample_id before comparing.
- Legacy 250-ID snapshot (strict exact, ablation era): zero-shot 52.8% vs agent v7 62.8% → +10.0 pp. Useful historically; see results.md §8b for context.
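The matched comparison above relies on two steps: joining both result files on sample_id, then running McNemar's test on the paired outcomes. The sketch below shows one way to do that with the standard library only. The per-row boolean field correct is a hypothetical name, and the uncorrected statistic (b − c)² / (b + c) is an assumption; the README does not say whether a continuity correction was applied.

```python
import math


def join_on_sample_id(rows_a, rows_b):
    """Pair rows that share a sample_id; unmatched rows are dropped."""
    by_id = {row["sample_id"]: row for row in rows_b}
    return [
        (row, by_id[row["sample_id"]])
        for row in rows_a
        if row["sample_id"] in by_id
    ]


def chi2_sf_1dof(chi2):
    """Upper-tail probability of a chi-square variable with 1 dof."""
    return math.erfc(math.sqrt(chi2 / 2.0))


def mcnemar(pairs, key="correct"):
    """Uncorrected McNemar chi-square and p-value from discordant pairs."""
    b = sum(1 for x, y in pairs if x[key] and not y[key])  # only first correct
    c = sum(1 for x, y in pairs if not x[key] and y[key])  # only second correct
    if b + c == 0:
        return 0.0, 1.0
    chi2 = (b - c) ** 2 / (b + c)
    return chi2, chi2_sf_1dof(chi2)


# Toy data: the zero-shot file has an extra row (id 6) that the join drops.
agent = [{"sample_id": i, "correct": i <= 4} for i in range(1, 6)]
zero_shot = [{"sample_id": i, "correct": i in (1, 5, 6)} for i in range(1, 7)]

pairs = join_on_sample_id(agent, zero_shot)
chi2, p = mcnemar(pairs)
print(len(pairs), chi2, round(p, 4))
```

Only discordant pairs (one system right, the other wrong) enter the statistic, which is why comparing unmatched row sets would give a meaningless result.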
Note: the initial zero-shot Gemini-3 export had parser-related empty predictions; robust extraction + repair recovered many rows before the full 11k re-run.
Detailed per-run analysis, per-type breakdowns, and paper comparison are in notebooks/results_analysis.ipynb.
For camera-ready citation numbers, see markdown/camera_ready_metrics.md.
The development environment is managed with uv.
Install core dependencies:
uv sync
source .venv/bin/activate
Install the agentic pipeline dependencies (CrewAI, Google GenAI, Streamlit dashboard):
uv sync --group agentic-xai-eval
source .venv/bin/activate
Install all dependencies including docs:
uv sync --all-groups
source .venv/bin/activate
Copy .env.example to .env and fill in your API keys:
cp .env.example .env
Required environment variables:
| Variable | Description |
|---|---|
| OPENAI_API_KEY | OpenAI API key (for planner / verifier backends) |
| GEMINI_API_KEY | Google Gemini API key (for vision backend) |
| LANGFUSE_PUBLIC_KEY | Langfuse public key (optional; enables tracing) |
| LANGFUSE_SECRET_KEY | Langfuse secret key (optional) |
| LANGFUSE_HOST | Langfuse host URL (optional; defaults to cloud) |
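A quick pre-flight check can catch a missing key before a long run starts. The sketch below is not part of the repository; the helper name check_env is hypothetical, and treating both API keys as required follows the table above (a gemini_gemini run may not actually need the OpenAI key).

```python
import os

REQUIRED = ("OPENAI_API_KEY", "GEMINI_API_KEY")
TRACING = ("LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY")


def check_env(env=None):
    """Return (missing_required_names, tracing_enabled)."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    # Assumption: tracing needs both Langfuse keys; LANGFUSE_HOST is optional.
    tracing_enabled = all(env.get(name) for name in TRACING)
    return missing, tracing_enabled


missing, tracing_enabled = check_env()
print("missing:", missing, "| tracing:", tracing_enabled)
```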
Generate MEPs for a subset of ChartQAPro:
uv run --env-file .env -m agentfinvqa.runner.run_generate_meps \
--dataset chartqapro \
--split test \
--n 200 \
--config openai_gemini \
--workers 8 \
--out meps/
To target FinMME, switch --dataset finmme. The loader automatically writes FinMME charts to data/finmme_images/ unless you override --image_dir. Note: Hugging Face only exposes a train split for luojunyu/FinMME. Any request for test is remapped to train internally, so use slicing (e.g. --split train[:200]) to simulate held-out subsets.
- --split: Hugging Face split and optional row slice (e.g. test, test[1000:], train[:500]). This selects which rows of the dataset are loaded.
- --n: Maximum number of perceived samples to process after that slice. Use 0 or a negative value for no cap (process the entire loaded slice). Positive n stops early once enough samples are materialized.
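The interaction between the two flags can be sketched on a plain list: the split slice is applied first, then the cap. This is an illustration only. The helper names are hypothetical, the real loader presumably hands the split string to Hugging Face datasets, and percentage slices such as train[:10%] are not handled here.

```python
import re

_SPLIT_RE = re.compile(r"^(?P<name>\w+)(?:\[(?P<start>\d*):(?P<stop>\d*)\])?$")


def parse_split(spec):
    """Turn 'test[1000:]' into ('test', slice(1000, None))."""
    match = _SPLIT_RE.match(spec.strip())
    if match is None:
        raise ValueError(f"unsupported split spec: {spec!r}")
    start = int(match["start"]) if match["start"] else None
    stop = int(match["stop"]) if match["stop"] else None
    return match["name"], slice(start, stop)


def select_samples(rows, spec, n):
    """Apply the row slice, then cap at n (n <= 0 means no cap)."""
    _, row_slice = parse_split(spec)
    sliced = rows[row_slice]
    return sliced if n <= 0 else sliced[:n]


rows = list(range(10))
print(select_samples(rows, "train[2:8]", 3))
print(select_samples(rows, "train[2:8]", 0))
```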
So “run the whole test split” is typically:
--split test --n 0
A partial slice with no further cap:
--split 'test[1000:]' --n 0
The recommended entrypoints for all datasets are scripts/run_batch.py and its bash wrapper scripts/run_batch.sh. These are dataset-agnostic and run generation + post-evaluation in a single MEP pass (metrics, traces, failure taxonomy, and summary in one go):
scripts/run_batch.sh \
--dataset chartqapro \
--split test \
--n 500 \
--config gemini_gemini \
--workers 8 \
--post_eval \
--use_judge \
--langfuse \
--resume \
--eval_label chartqapro_test_n500
To skip generation and run post-eval on existing MEPs only:
scripts/run_batch.sh \
--dataset chartqapro \
--split test \
--config gemini_gemini \
--eval_only \
--use_judge \
--langfuse \
--eval_label chartqapro_test_n500
Both commands default to loading .env from the repo root. --langfuse pushes all numeric eval scores (accuracy, judge rubric scores) back to the originating Langfuse traces.
Verifier ablation: pass --no_verifier to skip the VerifierAgent (Pass 2.5); the pipeline keeps the planner/vision draft without a revise step. Supported by scripts/run_batch.py, scripts/run_finmme_batch.py, and scripts/submit_pipeline.sh (see below).
Submit a complete run (generation and post-eval) as one SLURM job:
sbatch scripts/slurm_run_batch.slrm
Environment variables (DATASET, SPLIT, N, CONFIG, WORKERS, LANGFUSE, RESUME, NO_VERIFIER, and model overrides) are passed through from the environment or from submit_pipeline.sh via --export. Set NO_VERIFIER=1 before sbatch if you call slurm_run_batch.slrm without the submit helper.
To run post-eval on MEPs that already exist:
scripts/submit_eval.sh \
--dataset chartqapro \
--split test \
--use_judge \
--langfuse \
--out_label chartqapro_test_n500
This submits slurm_eval_only.slrm as a single SLURM job. You can chain it after a generation job:
scripts/submit_eval.sh \
--dataset chartqapro \
--split test \
--use_judge \
--langfuse \
--after <JOB_ID>
For large runs, use the chained pipeline that separates MEP generation from LLM judge evaluation. This uses the Gemini Batch API for judge scoring (50% cost reduction, async):
scripts/submit_pipeline.sh \
--dataset finmme \
--split "train[3000:5000]" \
--n 2000 \
--workers 8 \
--langfuse \
--resume \
--planner_model gemini-2.5-flash \
--vision_model gemini-2.5-flash \
--ocr_model gemini-2.5-flash-lite \
--verifier_model gemini-2.5-flash \
--judge_model gemini-2.5-flash-lite
Defaults (you usually do not need to repeat model flags): submit_pipeline.sh already defaults to --config gemini_gemini, --workers 8, and the same planner/vision/OCR/verifier/judge models as above. Override only what you change. Add --langfuse and/or --resume when you want tracing or skip-existing MEPs. For a verifier-off ablation, add --no_verifier.
Full split without counting rows — use --n 0 (see the Sample selection subsection above):
scripts/submit_pipeline.sh \
--dataset chartqapro \
--split test \
--n 0 \
--no_verifier \
--resume
Not the same as run_batch.sh --post_eval: Job 1 in this chain runs MEP generation only (via slurm_run_batch.slrm → run_batch.sh without --post_eval). Job 2 submits prompts to the Gemini Batch API for async judge scoring. For local / threaded post-eval in one process (metrics, traces, taxonomy, summary written immediately), use scripts/run_batch.sh with --post_eval (and --use_judge if you want the LLM judge path during that step) instead of this two-stage pipeline.
This submits two SLURM jobs chained with --dependency=afterok:
| Job | Script | What it does |
|---|---|---|
| 1 | slurm_run_batch.slrm | MEP generation |
| 2 | slurm_submit_judge_batch.slrm | Uploads all judge prompts to Gemini Batch API and exits immediately |
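The two-job chain in the table can be pictured as two sbatch invocations, where the second depends on the first via --dependency=afterok. The sketch below only builds the argument lists and does not submit anything. The helper name sbatch_cmd, the use of --parsable to capture the job ID, the scripts/ location of the judge script, and the example job ID are assumptions; submit_pipeline.sh may assemble its commands differently.

```python
def sbatch_cmd(script, after_ok=None, export=None):
    """Build an sbatch argv list; nothing is executed here."""
    cmd = ["sbatch", "--parsable"]  # --parsable prints just the job ID
    if after_ok is not None:
        # Run only if the earlier job finished with exit code 0.
        cmd.append(f"--dependency=afterok:{after_ok}")
    if export:
        pairs = ",".join(f"{key}={value}" for key, value in sorted(export.items()))
        cmd.append(f"--export=ALL,{pairs}")
    cmd.append(script)
    return cmd


job1 = sbatch_cmd(
    "scripts/slurm_run_batch.slrm",
    export={"DATASET": "finmme", "NO_VERIFIER": "1"},
)
# "12345" stands in for the job ID that job 1 would print on submission.
job2 = sbatch_cmd("scripts/slurm_submit_judge_batch.slrm", after_ok="12345")
print(" ".join(job1))
print(" ".join(job2))
```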
Where MEPs and batch metrics go
- MEP directory (generation output): meps/<CONFIG>/<dataset>/<split>/ when the verifier is on (default). With --no_verifier, MEPs go under meps/<CONFIG>/<dataset>/no_verifier/<split>/ so verifier-on and verifier-off runs do not overwrite each other. Example: meps/gemini_gemini/chartqapro/test/ vs meps/gemini_gemini/chartqapro/no_verifier/test/.
- Batch judge file (job 2): output/metrics_<out_label>.jsonl plus output/metrics_<out_label>.jsonl.batch_state.json. If you omit --out_label, the script sets <out_label> to {dataset}_{sanitized_split} and appends _no_verifier when --no_verifier is set (e.g. chartqapro_test_no_verifier), so different runs do not overwrite metrics_test.jsonl.
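The naming rules above can be expressed as two small helpers. This is a sketch with hypothetical function names. The README does not spell out the sanitisation rule for the split, so replacing runs of non-alphanumeric characters with an underscore is an assumption, as is how a sliced split would appear in the MEP directory name.

```python
import re
from pathlib import PurePosixPath


def mep_dir(config, dataset, split, no_verifier=False):
    """meps/<CONFIG>/<dataset>/[no_verifier/]<split>"""
    parts = ["meps", config, dataset]
    if no_verifier:
        parts.append("no_verifier")
    parts.append(split)
    return PurePosixPath(*parts)


def default_out_label(dataset, split, no_verifier=False):
    """{dataset}_{sanitized_split}[_no_verifier]"""
    # Assumption: sanitising collapses non-alphanumeric runs into "_".
    sanitized = re.sub(r"[^0-9A-Za-z]+", "_", split).strip("_")
    label = f"{dataset}_{sanitized}"
    return f"{label}_no_verifier" if no_verifier else label


print(mep_dir("gemini_gemini", "chartqapro", "test"))
print(mep_dir("gemini_gemini", "chartqapro", "test", no_verifier=True))
print(default_out_label("chartqapro", "test", no_verifier=True))
```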
Job 2 only runs if job 1 succeeds. When job 2 completes it prints the commands to check status and retrieve results:
# Check if Gemini batch job is done
python3 -m agentfinvqa.eval.eval_outputs_batch status \
--state output/metrics_<label>.jsonl.batch_state.json
# Download results when ready
python3 -m agentfinvqa.eval.eval_outputs_batch retrieve \
--state output/metrics_<label>.jsonl.batch_state.json
The batch job display name in the Gemini console follows the metrics filename (from --out_label or the auto-generated label above).
See notebooks/run_pipeline.ipynb for an interactive walkthrough.
For a single VLM call per sample (no agents), use baselines/run_zeroshot.py. It writes a metrics JSONL in the same schema as the agent pipeline for easy comparison.
- Structured prompt (default): rules plus JSON answer/explanation format.
- Minimal prompt: bare Question: … / Answer: (first line of the model reply is scored); use --prompt_style minimal.
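For the minimal prompt style, scoring depends on pulling the first line out of the model reply. A minimal sketch of that idea follows; the function names are hypothetical, the exact prompt template is not reproduced from the repository, and treating the first non-blank line as "the first line" is an assumption.

```python
def build_minimal_prompt(question):
    """Bare question/answer prompt; the real template may differ."""
    return f"Question: {question}\nAnswer:"


def first_line_answer(reply):
    """Return the first non-blank line of the reply, stripped."""
    for line in reply.splitlines():
        if line.strip():
            return line.strip()
    return ""


reply = "\n 42 \nThe bar for 2021 reaches 42 on the y-axis."
print(build_minimal_prompt("What is the 2021 value?"))
print(first_line_answer(reply))
```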
On SLURM:
baselines/submit_zeroshot.sh --dataset chartqapro --split test --prompt_style minimal
The run_batch.sh scripts above handle the full eval pipeline automatically via --post_eval or --eval_only. All four artifacts are produced in a single MEP pass:
| Artifact | Path |
|---|---|
| Per-sample metrics | output/metrics_<label>.jsonl |
| Trace metrics | output/trace_metrics_<label>.jsonl |
| Failure taxonomy | output/taxonomy_<label>.jsonl |
| Summary CSV | output/summary_<label>.csv |
If --langfuse is set, all numeric scores are pushed back to the originating Langfuse traces.
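The per-sample metrics file is JSON Lines, so a headline number can be recomputed with a few lines of code. The sketch below assumes each row carries an answer_accuracy field, a name taken from the results discussion in this README; the actual schema may contain different or additional keys. It reads from an in-memory example rather than a real output/metrics_<label>.jsonl file.

```python
import io
import json


def mean_accuracy(lines, key="answer_accuracy"):
    """Average a numeric field over JSONL rows, skipping blanks and nulls."""
    values = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        if row.get(key) is not None:
            values.append(float(row[key]))
    return sum(values) / len(values) if values else float("nan")


# Stand-in for an opened metrics file; rows and values are made up.
demo_file = io.StringIO(
    '{"sample_id": "a", "answer_accuracy": 1.0}\n'
    '{"sample_id": "b", "answer_accuracy": 0.5}\n'
    '{"sample_id": "c", "answer_accuracy": null}\n'
    "\n"
)
print(mean_accuracy(demo_file))
```

Any iterable of lines works, so an open file handle can be passed in place of the StringIO object.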
To evaluate an existing MEP directory with the low-level CLI directly:
uv run -m agentfinvqa.eval.eval_outputs \
--mep_dir meps/gemini_gemini/chartqapro/test \
--out output/metrics.jsonl \
--judge_model gemini-2.5-flash-lite
To launch the Streamlit dashboard:
uv run streamlit run src/agentfinvqa/eval/dashboard.py
The dashboard auto-discovers MEP directories under meps/ and metric files under output/. Use the sidebar to select paths and filters.
See notebooks/analysis.ipynb for detailed analysis examples.
src/agentfinvqa/
├── agents/ # PlannerAgent, VisionAgent, VerifierAgent
│ └── prompts/ # System prompt templates
├── datasets/ # ChartQAPro and FinMME dataset loaders
├── eval/ # Evaluation utilities, metrics, Streamlit dashboard
├── langfuse_integration/ # Observability: tracing, client, dataset registration
├── mep/ # Model Evaluation Packet schema and I/O
├── runner/ # End-to-end pipeline runner
├── tools/ # OcrReaderTool, VisionQATool (CrewAI tools)
└── utils/ # Hashing, strict JSON parsing, timing
notebooks/
├── run_pipeline.ipynb # Interactive pipeline walkthrough
└── analysis.ipynb # MEP analysis and visualization
uv sync --dev
source .venv/bin/activate
uv run pre-commit run --all-files
Note for Vector Institute HPC users: The Compute Canada pip configuration (set via PIP_CONFIG_FILE) interferes with pre-commit's environment setup, causing source builds of Rust-based tools (ruff, typos) instead of downloading pre-built wheels. To avoid this, either run with:
PIP_CONFIG_FILE=/dev/null uv run pre-commit run --all-files
Or add export PIP_CONFIG_FILE=/dev/null to your ~/.bashrc.
Run the full test suite:
uv run pytest
Run a specific test file:
uv run pytest tests/agentfinvqa/test_legend_grounding.py
uv run pytest tests/agentfinvqa/test_finmme_loader.py
Run a specific test class or test:
uv run pytest tests/agentfinvqa/test_legend_grounding.py::TestComplianceCheck
uv run pytest tests/agentfinvqa/test_legend_grounding.py::TestLegendGrounderTool::test_api_error_returns_fallback_json
Run with coverage:
uv run pytest --cov=src/agentfinvqa --cov-report=term-missing
Run only integration tests (marked with @pytest.mark.integration_test):
uv run pytest -m integration_test
| Test file | What it covers |
|---|---|
| test_finmme_loader.py | FinMME dataset loader: option parsing, question type mapping, sample construction, multi-letter answers |
| test_legend_grounding.py | Legend grounding pipeline stage: formatting, prompt injection, MEP schema, gate logic, compliance check, LegendGrounderTool with mocked API calls |
| test_placeholder.py | Placeholder (no-op) to keep pytest from exiting with "no tests collected" |
test_legend_grounding.py covers the full legend grounding feature added between the OCR and vision stages:
- TestFormatLegendGroundingBlock: prompt block formatting (empty/None inputs, header, per-entry label/color/style/confidence rendering, missing-field robustness)
- TestBuildVisionTaskDescription: legend_map injected into vision prompt, absent when empty/None, prepend_instruction appears first, OCR block coexists with legend block
- TestMEPLegendGrounding: schema defaults, all fields settable, MEP.to_dict() serialisation, None case
- TestLegendGroundingGate: all 7 gated chart types (line, bar, scatter, area, bar_grouped, bar_stacked, combination) pass; pie, table, dashboard and others are blocked; single-legend and grounder=None are skipped
- TestComplianceCheck: label present/absent in explanation, any-label match, case-insensitive, empty explanation/map, entries without label key
- TestLegendGrounderTool: pop_traces flush-and-clear, unknown backend returns error JSON, Gemini success path with trace appended, API crash fallback, prompt template structure
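The gate and compliance behaviours described by these tests can be sketched as two pure functions. The seven chart types come from the test description above; the function names, the list-of-dicts shape of legend_map, the reading of "single-legend" as one legend entry or fewer, and the return values for empty inputs are assumptions rather than the project's actual implementation.

```python
GATED_CHART_TYPES = frozenset(
    {"line", "bar", "scatter", "area", "bar_grouped", "bar_stacked", "combination"}
)


def should_ground_legend(chart_type, legend_labels, grounder):
    """Decide whether to run legend grounding for a sample."""
    if grounder is None:
        return False  # no grounder configured: skip
    if len(legend_labels) <= 1:
        return False  # assumption: a single legend entry needs no grounding
    return chart_type in GATED_CHART_TYPES


def explanation_mentions_legend(explanation, legend_map):
    """True if any legend label appears in the explanation (case-insensitive)."""
    if not explanation or not legend_map:
        return False  # assumption: empty inputs count as non-compliant
    text = explanation.lower()
    # Entries without a "label" key are ignored rather than raising.
    labels = [entry["label"] for entry in legend_map if entry.get("label")]
    return any(label.lower() in text for label in labels)


legend_map = [{"label": "Revenue", "color": "blue"}, {"color": "red"}]
print(should_ground_legend("line", ["Revenue", "Cost"], grounder=object()))
print(explanation_mentions_legend("The REVENUE line rises after 2020.", legend_map))
```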