Commit 8ce17bb

JihaoXin and claude committed
Improve experiment integrity: provider flexibility, proposal alignment checks
- Experimenter: read config.yaml for available API keys, support multi-provider fallback instead of hardcoding a single provider
- Reviewer: add proposal alignment check — compare paper experiments against idea.md to catch methodology substitution (e.g., text classification instead of end-to-end platform testing)
- Planner: read idea.md to verify experiment alignment when creating action plans
- Pipeline: evaluate_completeness now checks if all systems from project_context.md were actually used, and flags blocked/missing credentials as critical gaps

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent ed7e068 commit 8ce17bb

File tree: 4 files changed, +30 -8 lines changed

ark/pipeline.py

Lines changed: 7 additions & 0 deletions

@@ -1861,6 +1861,13 @@ def _evaluate_completeness(self, research_idea: str, findings_summary: str,
 2. Are baselines properly compared?
 3. Are the results statistically significant?
 4. Are there obvious gaps that need more experiments?
+5. Read `auto_research/state/project_context.md` and check: were ALL external systems
+   listed there actually installed, configured, and used in experiments? If any system
+   was listed but never used (e.g., never started, never called its API, never imported
+   its package), that is a critical gap.
+6. Check `results/environment_setup.json` and `results/credentials_needed.json` — are
+   there any systems marked as "blocked" or credentials still missing? Those represent
+   incomplete experiments.

 Output your evaluation in JSON format:
 ```json
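The new checklist steps can be backed by a mechanical pre-check before the LLM evaluation runs. A minimal sketch covering step 6, assuming a hypothetical `{"systems": {...}}` schema for `environment_setup.json` (the commit only specifies the format of `credentials_needed.json`):

```python
import json
from pathlib import Path

def find_critical_gaps(results_dir="results"):
    """Mechanical companion to new prompt step 6: flag blocked systems and
    still-missing credentials as critical gaps before scoring completeness."""
    gaps = []

    # Systems the Experimenter marked as "blocked" during environment setup.
    # NOTE: the {"systems": {...}} schema here is an assumption, not specified
    # by the commit.
    setup_path = Path(results_dir) / "environment_setup.json"
    if setup_path.exists():
        setup = json.loads(setup_path.read_text())
        for name, info in setup.get("systems", {}).items():
            if info.get("status") == "blocked":
                gaps.append(f"system '{name}' blocked: {info.get('reason', 'unknown')}")

    # Credentials requested by the Experimenter but never supplied
    # (this file's format is specified in the experimenter prompt).
    creds_path = Path(results_dir) / "credentials_needed.json"
    if creds_path.exists():
        for item in json.loads(creds_path.read_text()).get("needed", []):
            gaps.append(f"missing credential {item['key']} "
                        f"(required for {item.get('required_for', [])})")

    return gaps
```

Any non-empty return would be surfaced to the evaluation prompt as known critical gaps, so incomplete setups cannot be scored as complete.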

ark/templates/agents/experimenter.prompt

Lines changed: 15 additions & 8 deletions

@@ -16,27 +16,34 @@ You are responsible for setting up real experimental environments, running genui
    - This file contains pre-researched, web-verified system information
    - Trust this as the primary source for what to install and how
 2. Read `auto_research/state/experiment_plan.yaml` for the experiment plan
-3. Before installing anything, CHECK if it already exists on the system:
+3. Read `config.yaml` to discover available API keys and LLM providers
+   - The project may already have configured keys (e.g., `gemini_api_key`, `openrouter_api_key`, `anthropic_api_key`)
+   - Use whatever provider is available — do NOT hardcode a specific provider
+   - Priority: use the project's configured provider first, then fall back to others
+4. Before installing anything, CHECK if it already exists on the system:
    - `which <tool>` — check if CLI tool is on PATH
    - `<tool> --version` — check version
    - `ps aux | grep <service>` — check if a service/daemon is already running
    - Only install if the tool is truly missing
-4. For each required system that needs installation:
+5. For each required system that needs installation:
    - All installations must be isolated to the project — do NOT install globally or
      pollute the shared system environment
    - Try at least 2-3 methods if the first one fails
    - Run the verification command to confirm it works
-5. Save setup results to `results/environment_setup.json`
-6. If API keys or credentials are missing:
+6. Save setup results to `results/environment_setup.json`
+7. If API keys or credentials are missing:
    - Write exactly what is needed to `results/credentials_needed.json`
    - Format: {"needed": [{"key": "ANTHROPIC_API_KEY", "provider": "Anthropic", "purpose": "LLM inference", "required_for": ["exp2", "exp3"]}]}
    - Do NOT silently skip experiments that need credentials

 ### Phase 2: Write and Run Experiments
-7. Write experiment scripts that **import and use the installed packages** — not re-implementations
-8. Run experiments and collect results
-9. Verify results are genuine (check logs, spot-check data, confirm the process actually ran)
-10. Update `auto_research/state/findings.yaml` with findings
+8. Write experiment scripts that **import and use the installed packages** — not re-implementations
+9. When scripts need LLM API calls, they MUST read keys from `config.yaml` at runtime
+   - Support multiple providers: if the configured provider fails, try other available keys
+   - Never hardcode a single provider — the user may have Gemini but not Anthropic, or vice versa
+10. Run experiments and collect results
+11. Verify results are genuine (check logs, spot-check data, confirm the process actually ran)
+12. Update `auto_research/state/findings.yaml` with findings

 ## Research Integrity (MANDATORY — violation = worthless paper)
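New steps 3 and 9 amount to a small runtime helper in the experiment scripts. A stdlib-only sketch (a real script would parse `config.yaml` with PyYAML; the flat `key: value` parsing and the fallback order below are assumptions, since the commit only says "configured provider first, then fall back"):

```python
from pathlib import Path

# Assumed preference order for fallback; adjust to the project's configured provider.
PROVIDER_KEYS = ["anthropic_api_key", "gemini_api_key", "openrouter_api_key"]

def available_providers(config_path="config.yaml"):
    """Return (key_name, key_value) pairs for every provider configured in
    config.yaml, in fallback order, so a failed call can retry with the next
    provider instead of hardcoding one. Minimal flat 'key: value' parsing;
    a real implementation would use yaml.safe_load()."""
    config = {}
    for line in Path(config_path).read_text().splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if ":" in line:
            key, value = line.split(":", 1)
            config[key.strip()] = value.strip().strip("'\"")
    # Keep only keys that are present AND non-empty.
    return [(k, config[k]) for k in PROVIDER_KEYS if config.get(k)]
```

An experiment script would then loop over `available_providers()` and stop at the first provider whose API call succeeds, rather than failing hard when a single hardcoded key is absent.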

ark/templates/agents/planner.prompt

Lines changed: 1 addition & 0 deletions

@@ -14,6 +14,7 @@ You are responsible for analyzing the Reviewer's review report, classifying issu
 Read the following files:
 1. `auto_research/state/latest_review.md` - Latest review report
 2. `auto_research/state/memory.yaml` - Iteration history
+3. `auto_research/state/idea.md` - Original research proposal (check experiment alignment)

 ## Issue Classification

ark/templates/agents/reviewer.prompt

Lines changed: 7 additions & 0 deletions

@@ -89,6 +89,13 @@ You are reviewing the **paper** (main.tex and its compiled PDF), not the raw dat
 - Do NOT cross-reference claims against raw JSON result files. If a claim seems unsupported, flag it based on what the paper itself presents (missing error bars, no ablation, unclear methodology, etc.).
 - Do NOT flag `[INTEGRITY]` issues based on file inspection. If the paper's numbers look internally inconsistent (e.g., a table contradicts a figure, or the text claims N=100 but the table shows N=4), flag it as a **methodological concern**, not a fabrication accusation.

+## Proposal Alignment Check
+
+Read `auto_research/state/idea.md` to understand the original research proposal. Then check:
+- Does the paper's experimental methodology match what was proposed? (e.g., if the proposal says "evaluate on platform X", does the paper actually run experiments on platform X, or does it substitute with a text classification benchmark?)
+- Are the key systems/platforms mentioned in the proposal reflected in the experiments?
+- If there is a gap between proposed methodology and actual experiments, flag it as a Major Issue.
+
 ## Review Dimensions (score 1-10 per dimension)

 | Dimension | Criteria | Weight |
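The reviewer's alignment check is semantic, but the most blatant substitutions could also be caught lexically. A crude sketch, with a hypothetical helper name and caller-supplied system list (the prompt's comparison against idea.md remains the real check):

```python
from pathlib import Path

def missing_proposed_systems(paper_path, proposed_systems):
    """List proposed systems/platforms the paper never even mentions.
    Purely lexical: only catches blatant methodology substitution, e.g. a
    proposed platform that is entirely absent from main.tex."""
    paper = Path(paper_path).read_text().lower()
    return [s for s in proposed_systems if s.lower() not in paper]
```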

0 commit comments