Commit 8ce17bb

JihaoXin and claude committed
Improve experiment integrity: provider flexibility, proposal alignment checks
- Experimenter: read config.yaml for available API keys, support multi-provider fallback instead of hardcoding a single provider
- Reviewer: add proposal alignment check — compare paper experiments against idea.md to catch methodology substitution (e.g., text classification instead of end-to-end platform testing)
- Planner: read idea.md to verify experiment alignment when creating action plans
- Pipeline: evaluate_completeness now checks if all systems from project_context.md were actually used, and flags blocked/missing credentials as critical gaps

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent ed7e068 commit 8ce17bb

File tree: 4 files changed, +30 -8 lines changed

ark/pipeline.py

Lines changed: 7 additions & 0 deletions

@@ -1861,6 +1861,13 @@ def _evaluate_completeness(self, research_idea: str, findings_summary: str,
 2. Are baselines properly compared?
 3. Are the results statistically significant?
 4. Are there obvious gaps that need more experiments?
+5. Read `auto_research/state/project_context.md` and check: were ALL external systems
+   listed there actually installed, configured, and used in experiments? If any system
+   was listed but never used (e.g., never started, never called its API, never imported
+   its package), that is a critical gap.
+6. Check `results/environment_setup.json` and `results/credentials_needed.json` — are
+   there any systems marked as "blocked" or credentials still missing? Those represent
+   incomplete experiments.

 Output your evaluation in JSON format:
 ```json
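The new checklist steps can be backed by a mechanical pre-check before the LLM evaluation runs. A minimal sketch covering step 6, assuming a hypothetical `{"systems": {...}}` schema for `environment_setup.json` (the commit only specifies the format of `credentials_needed.json`):

```python
import json
from pathlib import Path

def find_critical_gaps(results_dir="results"):
    """Mechanical companion to new prompt step 6: flag blocked systems and
    still-missing credentials as critical gaps before scoring completeness."""
    gaps = []

    # Systems the Experimenter marked as "blocked" during environment setup.
    # NOTE: the {"systems": {...}} schema here is an assumption, not specified
    # by the commit.
    setup_path = Path(results_dir) / "environment_setup.json"
    if setup_path.exists():
        setup = json.loads(setup_path.read_text())
        for name, info in setup.get("systems", {}).items():
            if info.get("status") == "blocked":
                gaps.append(f"system '{name}' blocked: {info.get('reason', 'unknown')}")

    # Credentials requested by the Experimenter but never supplied
    # (this file's format is specified in the experimenter prompt).
    creds_path = Path(results_dir) / "credentials_needed.json"
    if creds_path.exists():
        for item in json.loads(creds_path.read_text()).get("needed", []):
            gaps.append(f"missing credential {item['key']} "
                        f"(required for {item.get('required_for', [])})")

    return gaps
```

Any non-empty return would be surfaced to the evaluation prompt as known critical gaps, so incomplete setups cannot be scored as complete.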

ark/templates/agents/experimenter.prompt

Lines changed: 15 additions & 8 deletions

@@ -16,27 +16,34 @@ You are responsible for setting up real experimental environments, running genui
    - This file contains pre-researched, web-verified system information
    - Trust this as the primary source for what to install and how
 2. Read `auto_research/state/experiment_plan.yaml` for the experiment plan
-3. Before installing anything, CHECK if it already exists on the system:
+3. Read `config.yaml` to discover available API keys and LLM providers
+   - The project may already have configured keys (e.g., `gemini_api_key`, `openrouter_api_key`, `anthropic_api_key`)
+   - Use whatever provider is available — do NOT hardcode a specific provider
+   - Priority: use the project's configured provider first, then fall back to others
+4. Before installing anything, CHECK if it already exists on the system:
    - `which <tool>` — check if CLI tool is on PATH
    - `<tool> --version` — check version
    - `ps aux | grep <service>` — check if a service/daemon is already running
    - Only install if the tool is truly missing
-4. For each required system that needs installation:
+5. For each required system that needs installation:
    - All installations must be isolated to the project — do NOT install globally or
      pollute the shared system environment
    - Try at least 2-3 methods if the first one fails
    - Run the verification command to confirm it works
-5. Save setup results to `results/environment_setup.json`
-6. If API keys or credentials are missing:
+6. Save setup results to `results/environment_setup.json`
+7. If API keys or credentials are missing:
    - Write exactly what is needed to `results/credentials_needed.json`
    - Format: {"needed": [{"key": "ANTHROPIC_API_KEY", "provider": "Anthropic", "purpose": "LLM inference", "required_for": ["exp2", "exp3"]}]}
    - Do NOT silently skip experiments that need credentials

 ### Phase 2: Write and Run Experiments
-7. Write experiment scripts that **import and use the installed packages** — not re-implementations
-8. Run experiments and collect results
-9. Verify results are genuine (check logs, spot-check data, confirm the process actually ran)
-10. Update `auto_research/state/findings.yaml` with findings
+8. Write experiment scripts that **import and use the installed packages** — not re-implementations
+9. When scripts need LLM API calls, they MUST read keys from `config.yaml` at runtime
+   - Support multiple providers: if the configured provider fails, try other available keys
+   - Never hardcode a single provider — the user may have Gemini but not Anthropic, or vice versa
+10. Run experiments and collect results
+11. Verify results are genuine (check logs, spot-check data, confirm the process actually ran)
+12. Update `auto_research/state/findings.yaml` with findings

 ## Research Integrity (MANDATORY — violation = worthless paper)
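New steps 3 and 9 amount to a small runtime helper in the experiment scripts. A stdlib-only sketch (a real script would parse `config.yaml` with PyYAML; the flat `key: value` parsing and the fallback order below are assumptions, since the commit only says "configured provider first, then fall back"):

```python
from pathlib import Path

# Assumed preference order for fallback; adjust to the project's configured provider.
PROVIDER_KEYS = ["anthropic_api_key", "gemini_api_key", "openrouter_api_key"]

def available_providers(config_path="config.yaml"):
    """Return (key_name, key_value) pairs for every provider configured in
    config.yaml, in fallback order, so a failed call can retry with the next
    provider instead of hardcoding one. Minimal flat 'key: value' parsing;
    a real implementation would use yaml.safe_load()."""
    config = {}
    for line in Path(config_path).read_text().splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if ":" in line:
            key, value = line.split(":", 1)
            config[key.strip()] = value.strip().strip("'\"")
    # Keep only keys that are present AND non-empty.
    return [(k, config[k]) for k in PROVIDER_KEYS if config.get(k)]
```

An experiment script would then loop over `available_providers()` and stop at the first provider whose API call succeeds, rather than failing hard when a single hardcoded key is absent.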

ark/templates/agents/planner.prompt

Lines changed: 1 addition & 0 deletions

@@ -14,6 +14,7 @@ You are responsible for analyzing the Reviewer's review report, classifying issu
 Read the following files:
 1. `auto_research/state/latest_review.md` - Latest review report
 2. `auto_research/state/memory.yaml` - Iteration history
+3. `auto_research/state/idea.md` - Original research proposal (check experiment alignment)

 ## Issue Classification

ark/templates/agents/reviewer.prompt

Lines changed: 7 additions & 0 deletions

@@ -89,6 +89,13 @@ You are reviewing the **paper** (main.tex and its compiled PDF), not the raw dat
 - Do NOT cross-reference claims against raw JSON result files. If a claim seems unsupported, flag it based on what the paper itself presents (missing error bars, no ablation, unclear methodology, etc.).
 - Do NOT flag `[INTEGRITY]` issues based on file inspection. If the paper's numbers look internally inconsistent (e.g., a table contradicts a figure, or the text claims N=100 but the table shows N=4), flag it as a **methodological concern**, not a fabrication accusation.

+## Proposal Alignment Check
+
+Read `auto_research/state/idea.md` to understand the original research proposal. Then check:
+- Does the paper's experimental methodology match what was proposed? (e.g., if the proposal says "evaluate on platform X", does the paper actually run experiments on platform X, or does it substitute with a text classification benchmark?)
+- Are the key systems/platforms mentioned in the proposal reflected in the experiments?
+- If there is a gap between proposed methodology and actual experiments, flag it as a Major Issue.
+
 ## Review Dimensions (score 1-10 per dimension)

 | Dimension | Criteria | Weight |
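The reviewer's alignment check is semantic, but the most blatant substitutions could also be caught lexically. A crude sketch, with a hypothetical helper name and caller-supplied system list (the prompt's comparison against idea.md remains the real check):

```python
from pathlib import Path

def missing_proposed_systems(paper_path, proposed_systems):
    """List proposed systems/platforms the paper never even mentions.
    Purely lexical: only catches blatant methodology substitution, e.g. a
    proposed platform that is entirely absent from main.tex."""
    paper = Path(paper_path).read_text().lower()
    return [s for s in proposed_systems if s.lower() not in paper]
```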

0 commit comments