Yifan Gao¹², Haoyue Li¹, Feng Yuan¹, Xin Gao¹*, Weiran Huang²³*, Xiaosong Wang²⁴*
¹ USTC · ² Shanghai Innovation Institute · ³ SJTU · ⁴ Shanghai AI Lab · * Corresponding authors
Read the paper (PDF) · alphaXiv · arXiv
Headline results (28 days, zero human intervention):
- 40 complete manuscripts written end-to-end
- Cost-efficient: only $20–30 per paper in LLM API spend
- Beats the strongest per-dataset baseline (chosen from 14 established architectures including nnU-Net) on 24 of 31 datasets under identical training budgets
- CamylaBench: contamination-free benchmark of 31 datasets, built exclusively from 2025 publications
- Stronger long-horizon orchestration in the experiment stage. During experiment execution, driven by cost-efficient backends (GLM-4.7 + MiniMax-M2.5), Camyla outperforms AI Scientist, autoresearch Claude Code (Opus 4.6), and autoresearch Codex (GPT-5.4-xhigh) on execution success, completion rate, and fidelity to the original proposal
Manuscript quality: double-blind evaluation. Camyla-generated manuscripts were mixed with real 2025 publications and judged without reviewers knowing which were AI-written. Four independent panels (5 senior reviewers, 10 junior reviewers, 5 different AI models, and the Stanford Agentic Reviewer) all place Camyla's output between the T1 and T2 tiers of contemporary medical-imaging journals. The T1 anchor is IEEE TMI / Medical Image Analysis; the T2 band is JCR Q1 medical-imaging journals:
| Tier | Journals | Papers | Representative venues |
|---|---|---|---|
| T1 (top-tier) | 2 | 10 | IEEE Transactions on Medical Imaging; Medical Image Analysis |
| T2 (JCR Q1) | 7 | 35 | IEEE Journal of Biomedical and Health Informatics; Artificial Intelligence in Medicine; et al. |
| T3 | 9 | 45 | International Journal of Computer Assisted Radiology and Surgery; Biomedical Physics & Engineering Express; et al. |
| Total | 18 | 90 | |
- 2026-04-13 – Initial public release of the code, the paper, and CamylaBench (31 pre-formatted datasets, Google Drive). We are still cleaning the full CamylaTrace-232K trajectory dataset (per-run logs + intermediate artifacts); the release will follow shortly.
Camyla is an automated research pipeline that takes a medical image segmentation task, searches the literature for relevant ideas, proposes research hypotheses, runs end-to-end deep-learning experiments (Baseline → Creative Research → Ablation), and writes up the results as a publication-ready paper.
It combines:
- Quality-Weighted Branch Exploration (QWBE) over experiment configurations
- OpenHands-driven code generation and iterative debugging
- Multi-source literature search (ArXiv, OpenAlex, PubMed, Semantic Scholar)
- A paper agent that drafts, compiles, and cites LaTeX papers (Elsevier format supported)
- A flexible LLM routing layer (`llm_endpoints` + `llm_roles`) so you can mix providers (OpenRouter, GLM, MiniMax, …) without touching code
Status. Research prototype, preparing for open-source release. APIs may change.
A 10-paper subset of the manuscripts Camyla produced end-to-end – no hand-editing, LaTeX compiled as-is by the pipeline. Each PDF lives under `assets/paper_pdf/`.
```bash
git clone https://github.com/yifangao112/Camyla.git camyla
cd camyla
python -m venv .venv && source .venv/bin/activate  # or: conda create -n camyla python=3.10
pip install -r requirements.txt

# Sister packages (segmentation framework + raw-data conversion agent)
pip install git+https://github.com/yifangao112/CamylaNet.git
pip install git+https://github.com/yifangao112/nnPrep.git
```

A Python 3.10+ environment is expected. A GPU is required for actually running the segmentation experiments (the baseline uses CamylaNet / nnU-Net v2).
Set CamylaNet's data-path environment variables (same convention as CamylaNet's own README – the baseline pipeline reads and writes under these paths):
```bash
export camylanet_raw="/path/to/camylanet_raw"
export camylanet_preprocessed="/path/to/camylanet_preprocessed"
export camylanet_results="/path/to/camylanet_results"  # baseline artifacts land here
```

Then copy the example Camyla config and edit it:

```bash
cp config_example.yaml config.yaml
```

At minimum, set:
- At least one entry under `llm_endpoints` with a valid `api_key` (or export the matching `api_key_env`, e.g. `OPENROUTER_API_KEY`)
- `default_endpoint` – the endpoint name most roles will use
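Concretely, a minimal `config.yaml` covering those two fields might look like the sketch below (the endpoint name and model are placeholders; copy the real layout from `config_example.yaml`):

```yaml
default_endpoint: my_openrouter

llm_endpoints:
  my_openrouter:
    api_key: ""                      # leave empty to read from the env var below
    api_key_env: OPENROUTER_API_KEY
    base_url: "https://openrouter.ai/api/v1"
    model: "deepseek/deepseek-v3.2"
```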
See Configuration below for the full layout, role-routing semantics, and N-way model competition. The shipped config_example.yaml already has a complete multi-provider setup you can copy from.
Camyla operates on a dataset that already lives in nnU-Net v2 layout under `$camylanet_raw/Dataset{ID}_{Abbr}/`. You have three options:
- Use CamylaBench (recommended). The 31 datasets used in the paper are pre-formatted and ready to drop in – download from Google Drive and extract each `Dataset{ID}_{Abbr}/` folder under `$camylanet_raw/`. The ready-made idea descriptions in `ideas/` match these IDs.
- Bring your own, already in nnU-Net v2 format. Just point `camylanet_raw` at it.
- Bring your own raw data in some arbitrary layout. Use nnPrep (an LLM agent that converts arbitrary medical segmentation datasets into the nnU-Net v2 format), or write your own conversion script.
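If you bring your own data, a quick sanity check along these lines can catch layout mistakes before a long run. This is an illustrative sketch, not part of Camyla; it only verifies the coarse nnU-Net v2 skeleton (`dataset.json`, `imagesTr/`, `labelsTr/`) of one dataset folder:

```python
import os

def looks_like_nnunet_v2(dataset_dir: str) -> bool:
    """Rough check that a Dataset{ID}_{Abbr}/ folder has the nnU-Net v2 skeleton.
    (Does not validate file naming or dataset.json contents.)"""
    return (
        os.path.isfile(os.path.join(dataset_dir, "dataset.json"))
        and os.path.isdir(os.path.join(dataset_dir, "imagesTr"))
        and os.path.isdir(os.path.join(dataset_dir, "labelsTr"))
    )
```

Run it over every `Dataset*` folder under `$camylanet_raw` before launching.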
Pick an idea from ideas/ (31 ready-to-use dataset descriptions) and launch:
```bash
python launch_camyla.py \
    --config config.yaml \
    --load_ideas ideas/900.json \
    --idea_idx 0
```

On first run for a given dataset, Camyla automatically runs the baseline pipeline (trainer screening + full training for everything that passes) under `$camylanet_results`. Subsequent runs reuse the artifacts and skip this step.
The pipeline writes everything to experiments/<date>_<idea>_attempt_<id>/:
```
experiments/2026-04-12_liver_segmentation_attempt_0/
├── idea.json / idea.md          # task spec
├── config.yaml                  # resolved config used for this run
├── logs/0-run/
│   ├── experiment_report.md     # human-readable summary
│   └── experiment_results/      # metrics, checkpoints, plots
├── research_proposals/          # auto-generated proposals
└── paper/                       # LaTeX + compiled PDF (if writeup enabled)
```
Common flags:
| Flag | Purpose |
|---|---|
| `--resume_from_checkpoint PATH` | Resume from a previous `checkpoint.pkl` |
| `--skip_writeup` / `--skip_review` | Run experiments only, skip paper generation |
| `--debug-baseline` | Fake the baseline metrics so Stage 2 runs immediately (dev only) |
| `--verbose` | DEBUG-level logging |
- Phase 1-3 (idea generation). Searches 1-4 literature sources, extracts open research challenges, and generates multiple proposals scored by an assessment LLM.
- Stage 1-3 (experiment). QWBE expands a tree of code variants. Stage 2 can run N-way model competition by listing multiple `experiment.code.candidates`.
- Paper Agent. Takes the final experiment results and produces a cited paper.
Every LLM call Camyla makes resolves through a two-layer system:
- `llm_endpoints` – your named LLM connections (one per provider/backend).
- `llm_roles` – every internal component ("role") picks an endpoint and optionally overrides its model, temperature, or `max_tokens`.
Anything you don't configure at the role level falls back to default_endpoint.
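In pseudocode, the two-layer resolution behaves like the sketch below (illustrative, not Camyla's actual `model_config.py`; the endpoint names and models are placeholders): role fields override endpoint defaults, and a role without an explicit `endpoint` inherits `default_endpoint`.

```python
CONFIG = {
    "default_endpoint": "my_openrouter",
    "llm_endpoints": {
        "my_openrouter": {"model": "deepseek/deepseek-v3.2", "temperature": 0.5},
        "my_dashscope": {"model": "qwen-max", "temperature": 0.7},
    },
    "llm_roles": {
        "feedback": {"temperature": 0.9, "max_tokens": 8192},
        "latex_editor": {"endpoint": "my_dashscope"},
    },
}

def resolve_role(config: dict, role: str) -> dict:
    """Resolve one role to a concrete endpoint + generation settings."""
    role_cfg = config.get("llm_roles", {}).get(role, {})
    endpoint_name = role_cfg.get("endpoint", config["default_endpoint"])
    resolved = dict(config["llm_endpoints"][endpoint_name])  # endpoint defaults
    for key, value in role_cfg.items():                      # role fields win
        if key != "endpoint":
            resolved[key] = value
    resolved["endpoint"] = endpoint_name
    return resolved
```

So `resolve_role(CONFIG, "feedback")` keeps the default endpoint's model but uses the role's `temperature: 0.9`, while an unconfigured role resolves to the endpoint defaults unchanged.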
Each entry is an OpenAI-compatible endpoint. The shape is fixed:
```yaml
llm_endpoints:
  my_openrouter:
    api_key: ""                      # inline key (leave empty → use env)
    api_key_env: OPENROUTER_API_KEY  # env var name to read when api_key is empty
    base_url: "https://openrouter.ai/api/v1"
    model: "deepseek/deepseek-v3.2"  # default model for this endpoint
    temperature: 0.5                 # default temperature
```

Key resolution order: non-empty `api_key` > environment variable named by `api_key_env` > empty (error at first call).
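That resolution order amounts to the following sketch (illustrative, not Camyla's actual implementation):

```python
import os

def resolve_api_key(entry: dict) -> str:
    """Resolve an endpoint's key: inline api_key > env var from api_key_env > error."""
    if entry.get("api_key"):                 # non-empty inline key wins
        return entry["api_key"]
    env_name = entry.get("api_key_env", "")
    if env_name and os.environ.get(env_name):  # fall back to the named env var
        return os.environ[env_name]
    raise RuntimeError("no API key configured for this endpoint")
```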
Naming is free-form: `my_openrouter`, `cheap_backend`, `gpt4` are all valid. Roles reference endpoints by name, so changing models across the whole pipeline is just swapping one string. Common backends we've tested:
| Backend | `base_url` | Notes |
|---|---|---|
| OpenRouter | `https://openrouter.ai/api/v1` | one key, 300+ models |
| DashScope (Qwen) | `https://coding.dashscope.aliyuncs.com/v1` | cheap Qwen/GLM routing |
| MiniMax | `https://api.minimaxi.com/v1` | M-series models |
| OpenAI | `https://api.openai.com/v1` | native |
| Local vLLM / Ollama | `http://localhost:8000/v1` | any OpenAI-compatible server |
The global fallback. Every role without an explicit endpoint uses this one.
```yaml
default_endpoint: my_openrouter
```

A role is one logical LLM use inside the pipeline (feedback, paper writer, idea generator, etc.). You only specify the fields you want to override:
```yaml
llm_roles:
  # Tree-search roles
  feedback: { temperature: 0.9, max_tokens: 8192 }
  log_summary: { temperature: 1.0 }

  # Idea-generation roles – swap to a cheaper model for these
  literature_backbone: { model: google/gemini-3-flash-preview }
  challenge_extraction: { temperature: 0.3 }

  # Paper agent sub-agents: `_default` applies to the whole group unless
  # an individual sub-agent overrides it.
  paper_agent:
    _default: { temperature: 0.6 }
    BibtexAgent: { model: z-ai/glm-4.7, temperature: 0.3 }
    IdeaGenerationAgent: { model: google/gemini-3-flash-preview, temperature: 0.8 }

  # Paper-writing roles can be routed to a different endpoint entirely.
  paper_writing:
    latex_editor: { endpoint: my_dashscope, temperature: 0.7 }
    image_generator: { endpoint: my_openrouter,
                       model: google/gemini-3.1-flash-image-preview,
                       aspect_ratio: "16:9", image_size: "2K" }
```

Override precedence for any given role: role fields > its endpoint's defaults.
You can point a role at a completely different endpoint with endpoint: <name>.
Under experiment.code.candidates, list the endpoint names that should compete
as code authors for Stage 2. Camyla will run one branch per candidate and keep
the strongest.
```yaml
experiment:
  code:
    candidates: [my_dashscope, my_minimax]
    max_tokens: 16384
```

Set it to a single-element list to disable competition.
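Conceptually, the competition reduces to running one branch per candidate endpoint and keeping the highest-scoring one. A minimal sketch (with a stand-in `run_branch` callback; Camyla's real QWBE branch execution is much more involved):

```python
def run_competition(candidates, run_branch):
    """Run one experiment branch per candidate endpoint and keep the strongest.

    `run_branch` maps an endpoint name to a validation score; here it is a
    placeholder for the real branch execution."""
    scores = {name: run_branch(name) for name in candidates}
    winner = max(scores, key=scores.get)
    return winner, scores[winner]
```

With a single-element candidate list the "competition" trivially returns that one branch, which is why a one-entry list disables it.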
Literature search keys live under api_keys (same inline/env fallback pattern):
```yaml
api_keys:
  s2:   { value: "", env: S2_API_KEY }    # Semantic Scholar
  ncbi: { value: "", env: NCBI_API_KEY }  # PubMed
```

Camyla works without these, but rate limits will be tighter.
The rest of the config controls what Camyla does, not which LLMs it uses:
- `idea_generation.*` – how many papers to search, how to score proposals, how many generator personalities to ensemble
- `experiment.stages.*` – per-stage iteration budgets (Stage 1 = baseline replication, Stage 2 = creative research, Stage 3 = ablation)
- `experiment.openhands.*` – OpenHands coder settings (python path, iteration cap, condenser)
- `experiment.search.*` – QWBE hyperparameters (UCB constant, debug probability, draft count)
All of these have sensible defaults in `config_example.yaml`; you usually only touch `stages.*_max_iters` when you want a faster / cheaper run.
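For example, a quicker run might cap the per-stage budgets like this. The exact key names below are assumptions inferred from the `stages.*_max_iters` pattern; verify them against `config_example.yaml` before using:

```yaml
experiment:
  stages:
    stage1_max_iters: 4    # baseline replication (key name assumed)
    stage2_max_iters: 8    # creative research (key name assumed)
    stage3_max_iters: 4    # ablation (key name assumed)
```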
```
camyla/
├── LICENSE
├── README.md                       # this file
├── config_example.yaml             # documented config template
├── requirements.txt
├── launch_camyla.py                # main entry point
├── ideas/                          # 31 ready-made idea JSONs
└── camyla/                         # core package
    ├── model_config.py             # LLM config loader (get_endpoint/get_role)
    ├── baseline/                   # screening + full training before QWBE
    ├── infrastructure/literature/  # arxiv / openalex / pubmed / multi-source
    ├── paper_agent/                # LaTeX writer + plotters + bibtex agent
    ├── tools/                      # OpenAlex / Semantic Scholar tools
    ├── treesearch/                 # QWBE core, parallel agents, OpenHands coder
    └── utils/
```
- CamylaNet – segmentation framework built on nnU-Net v2, shipping a curated set of CNN / Transformer / state-space backbones. Camyla's baseline stage runs CamylaNet trainers.
- nnPrep – LLM agent that converts arbitrary medical segmentation datasets into the nnU-Net v2 format consumed by CamylaNet (and therefore by Camyla).
- The baseline stage currently uses CamylaNet trainers. Swapping in another baseline framework requires a corresponding skill under `skills/frameworks/`.
- OpenHands runs code in your local Python env – make sure you point `experiment.openhands.python_path` at an env with the needed packages.
- Long runs: a full idea (proposals → 3 stages → paper) typically takes several hours to a day on a single A100-class GPU.
If you find this project useful in academic work, please cite:
```bibtex
@misc{gao2026camyla,
  title         = {Camyla: Scaling Autonomous Research in Medical Image Segmentation},
  author        = {Gao, Yifan and Li, Haoyue and Yuan, Feng and Gao, Xin and Huang, Weiran and Wang, Xiaosong},
  year          = {2026},
  eprint        = {2604.10696},
  archivePrefix = {arXiv},
  primaryClass  = {cs.AI}
}
```

Camyla builds on ideas and code from two upstream projects:
- AI Scientist (Sakana AI) – pioneered the autonomous research-agent paradigm that inspired Camyla's overall pipeline design.
- nnU-Net – the self-configuring segmentation framework that Camyla's baseline stage (via CamylaNet) is built on.

Released under the Apache License, Version 2.0 – see LICENSE.


