[ English | 中文 ]
Yifan Gao¹², Haoyue Li¹, Feng Yuan¹, Xin Gao¹*, Weiran Huang²³*, Xiaosong Wang²⁴*
¹ USTC · ² Shanghai Innovation Institute · ³ SJTU · ⁴ Shanghai AI Lab · * Corresponding authors
📑 Read the paper (PDF) · alphaXiv · arXiv
Headline results (28 days, zero human intervention):
- 40 complete manuscripts written end-to-end
- Cost-efficient: only $20–30 per paper in LLM API spend
- Beats the strongest per-dataset baseline (chosen from 14 established architectures including nnU-Net) on 24 of 31 datasets under identical training budgets
- CamylaBench: contamination-free benchmark of 31 datasets, built exclusively from 2025 publications
- Stronger long-horizon orchestration in the experiment stage: driven by cost-efficient backends (GLM-4.7 + MiniMax-M2.5), Camyla outperforms AI Scientist, autoresearch Claude Code (Opus 4.6), and autoresearch Codex (GPT-5.4-xhigh) on execution success, completion rate, and fidelity to the original proposal
Manuscript quality — double-blind evaluation. Camyla-generated manuscripts were mixed with real 2025 publications and judged without reviewers knowing which were AI-written. Four independent panels — 5 senior reviewers, 10 junior reviewers, 5 different AI models, and the Stanford Agentic Reviewer — all place Camyla's output between the T1 and T2 tiers of contemporary medical-imaging journals. The T1 anchor is IEEE TMI / Medical Image Analysis; the T2 band is JCR Q1 medical-imaging journals:
| Tier | Journals | Papers | Representative venues |
|---|---|---|---|
| T1 (top-tier) | 2 | 10 | IEEE Transactions on Medical Imaging; Medical Image Analysis |
| T2 (JCR Q1) | 7 | 35 | IEEE Journal of Biomedical and Health Informatics; Artificial Intelligence in Medicine; et al. |
| T3 | 9 | 45 | International Journal of Computer Assisted Radiology and Surgery; Biomedical Physics & Engineering Express; et al. |
| Total | 18 | 90 | — |
- 2026-04-13 — Initial public release of the code, the paper, and CamylaBench (31 pre-formatted datasets, 📦 Google Drive). We are still cleaning the full CamylaTrace-232K trajectory dataset (per-run logs + intermediate artifacts); release will follow shortly.
Camyla is an automated research pipeline that takes a medical image segmentation task, searches the literature for relevant ideas, proposes research hypotheses, runs end-to-end deep-learning experiments (Baseline → Creative Research → Ablation), and writes up the results as a publication-ready paper.
It combines:
- Quality-Weighted Branch Exploration (QWBE) over experiment configurations
- OpenHands-driven code generation and iterative debugging
- Multi-source literature search (ArXiv, OpenAlex, PubMed, Semantic Scholar)
- A paper agent that drafts, compiles, and cites LaTeX papers (Elsevier format supported)
- A flexible LLM routing layer (`llm_endpoints` + `llm_roles`) so you can mix providers (OpenRouter, GLM, MiniMax, …) without touching code
Status. Research prototype, preparing for open-source release. APIs may change.
A 10-paper subset of the manuscripts Camyla produced end-to-end — no hand-editing, LaTeX compiled as-is by the pipeline. Each PDF lives under `assets/paper_pdf/`.
| # | Title | Modality / task |
|---|---|---|
| 1 | Cross-Directional Feature Lattice for Brain Tumor Segmentation | MRI · brain tumor |
| 2 | Scale-Frequency Adaptive Fusion for Multiple Sclerosis Lesion Segmentation | MRI · MS lesions |
| 3 | Hierarchical Context Gating for Neonatal Brain Lesion Segmentation | MRI · neonatal HIE |
| 4 | Cross-Scale Mutual Refinement for Bronchoalveolar Lavage Fluid Cell Segmentation | Microscopy · BALF cells |
| 5 | Symmetry-Aware Cascaded Attention for Panoramic Tooth Segmentation | Dental X-ray · tooth |
| 6 | Specular-Residual Decoupled Encoding for Surgical Scene Segmentation | Laparoscopy · surgical scene |
| 7 | Adaptive Scale-Aware Feature Integration for Liver Lesion Segmentation | CT · liver lesion |
| 8 | Boundary-Hierarchical Decomposition for Fetal Brain Tissue Segmentation | MRI · fetal brain |
| 9 | Vessel-Guided Boundary Residual Networks for Dermatological Vessel Segmentation | OCTA · dermatological vessel |
| 10 | Hierarchical Resolution-Retentive Feature Encoding for Brain Metastasis Segmentation | MRI · brain metastasis |
```bash
git clone https://github.com/yifangao112/Camyla.git camyla
cd camyla
python -m venv .venv && source .venv/bin/activate  # or: conda create -n camyla python=3.10
pip install -r requirements.txt

# Sister packages (segmentation framework + raw-data conversion agent)
pip install git+https://github.com/yifangao112/CamylaNet.git
pip install git+https://github.com/yifangao112/nnPrep.git
```

A Python 3.10+ environment is expected. A GPU is required for actually running the segmentation experiments (the baseline uses CamylaNet / nnU-Net v2).
Set CamylaNet's data-path environment variables (same convention as CamylaNet's own README — the baseline pipeline reads and writes under these paths):
```bash
export camylanet_raw="/path/to/camylanet_raw"
export camylanet_preprocessed="/path/to/camylanet_preprocessed"
export camylanet_results="/path/to/camylanet_results"   # baseline artifacts land here
```

Then copy the example Camyla config and edit it:
```bash
cp config_example.yaml config.yaml
```

At minimum, set:
- At least one entry under `llm_endpoints` with a valid `api_key` (or export the matching `api_key_env`, e.g. `OPENROUTER_API_KEY`)
- `default_endpoint` — the endpoint name most roles will use
See Configuration below for the full layout, role-routing semantics, and N-way model competition. The shipped config_example.yaml already has a complete multi-provider setup you can copy from.
Camyla operates on a dataset that already lives in nnU-Net v2 layout under `$camylanet_raw/Dataset{ID}_{Abbr}/`. You have three options:
- Use CamylaBench (recommended). The 31 datasets used in the paper are pre-formatted and ready to drop in — download from 📦 Google Drive and extract each `Dataset{ID}_{Abbr}/` folder under `$camylanet_raw/`. The ready-made idea descriptions in `ideas/` match these IDs.
- Bring your own, already in nnU-Net v2 format. Just point `camylanet_raw` at it.
- Bring your own raw data in some arbitrary layout. Use nnPrep (an LLM agent that converts arbitrary medical segmentation datasets into the nnU-Net v2 format) — or write your own conversion script.
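Whichever option you pick, a quick sanity check that each dataset folder matches the expected layout can save a failed run later. The sketch below only checks the standard nnU-Net v2 top-level contents (`imagesTr/`, `labelsTr/`, `dataset.json`) — it is an illustrative helper, not part of Camyla:

```python
from pathlib import Path

def check_nnunetv2_layout(dataset_dir: str) -> list[str]:
    """Return a list of problems found in an nnU-Net v2 dataset folder (empty = looks OK)."""
    root = Path(dataset_dir)
    problems = []
    if not root.name.startswith("Dataset"):
        problems.append(f"folder name {root.name!r} should look like Dataset123_Abbr")
    # Standard nnU-Net v2 top-level contents
    for required in ("imagesTr", "labelsTr", "dataset.json"):
        if not (root / required).exists():
            problems.append(f"missing {required}")
    return problems

# Example: a freshly created empty folder is missing everything
tmp = Path("Dataset900_Demo")
tmp.mkdir(exist_ok=True)
print(check_nnunetv2_layout(str(tmp)))
# → ['missing imagesTr', 'missing labelsTr', 'missing dataset.json']
```

Run it once per `Dataset{ID}_{Abbr}/` folder under `$camylanet_raw/` before launching.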
Pick an idea from `ideas/` (31 ready-to-use dataset descriptions) and launch:

```bash
python launch_camyla.py \
    --config config.yaml \
    --load_ideas ideas/900.json \
    --idea_idx 0
```

On first run for a given dataset, Camyla automatically runs the baseline pipeline (trainer screening + full training for everything that passes) under `$camylanet_results`. Subsequent runs reuse the artifacts and skip this step.
The pipeline writes everything to `experiments/<date>_<idea>_attempt_<id>/`:

```
experiments/2026-04-12_liver_segmentation_attempt_0/
├── idea.json / idea.md        # task spec
├── config.yaml                # resolved config used for this run
├── logs/0-run/
│   ├── experiment_report.md   # human-readable summary
│   └── experiment_results/    # metrics, checkpoints, plots
├── research_proposals/        # auto-generated proposals
└── paper/                     # LaTeX + compiled PDF (if writeup enabled)
```
Common flags:
| Flag | Purpose |
|---|---|
| `--resume_from_checkpoint PATH` | Resume from a previous `checkpoint.pkl` |
| `--skip_writeup` / `--skip_review` | Run experiments only, skip paper generation |
| `--debug-baseline` | Fake the baseline metrics so Stage 2 runs immediately (dev only) |
| `--verbose` | DEBUG-level logging |
- Phase 1-3 (idea generation). Searches 1-4 literature sources, extracts open research challenges, and generates multiple proposals scored by an assessment LLM.
- Stage 1-3 (experiment). QWBE expands a tree of code variants. Stage 2 can run N-way model competition by listing multiple `experiment.code.candidates`.
- Paper Agent. Takes the final experiment results and produces a cited paper.
Every LLM call Camyla makes resolves through a two-layer system:
- `llm_endpoints` — your named LLM connections (one per provider/backend).
- `llm_roles` — every internal component ("role") picks an endpoint and optionally overrides its model, temperature, or `max_tokens`.

Anything you don't configure at the role level falls back to `default_endpoint`.
Each entry is an OpenAI-compatible endpoint. The shape is fixed:
```yaml
llm_endpoints:
  my_openrouter:
    api_key: ""                        # inline key (leave empty → use env)
    api_key_env: OPENROUTER_API_KEY    # env var name to read when api_key is empty
    base_url: "https://openrouter.ai/api/v1"
    model: "deepseek/deepseek-v3.2"    # default model for this endpoint
    temperature: 0.5                   # default temperature
```

Key resolution order: non-empty `api_key` > environment variable named by `api_key_env` > empty (error at first call).
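That resolution order can be expressed in a few lines. This is an illustrative sketch of the behavior described above, not the actual `model_config.py` code:

```python
import os

def resolve_api_key(endpoint: dict) -> str:
    """Resolve an endpoint's API key: inline api_key wins, then api_key_env, else error."""
    if endpoint.get("api_key"):                 # 1. non-empty inline key wins
        return endpoint["api_key"]
    env_var = endpoint.get("api_key_env", "")
    key = os.environ.get(env_var, "") if env_var else ""   # 2. fall back to env var
    if not key:                                  # 3. nothing found → error at first call
        raise RuntimeError(f"no API key: set api_key or export {env_var or 'an api_key_env'}")
    return key

os.environ["OPENROUTER_API_KEY"] = "sk-from-env"
print(resolve_api_key({"api_key": "", "api_key_env": "OPENROUTER_API_KEY"}))   # env var used
print(resolve_api_key({"api_key": "sk-inline"}))                               # inline key wins
```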
Naming is free-form: `my_openrouter`, `cheap_backend`, `gpt4` — all valid. Roles reference endpoints by name, so changing models across the whole pipeline is just swapping one string. Common backends we've tested:
| Backend | `base_url` | Notes |
|---|---|---|
| OpenRouter | `https://openrouter.ai/api/v1` | one key, 300+ models |
| DashScope (Qwen) | `https://coding.dashscope.aliyuncs.com/v1` | cheap Qwen/GLM routing |
| MiniMax | `https://api.minimaxi.com/v1` | M-series models |
| OpenAI | `https://api.openai.com/v1` | native |
| Local vLLM / Ollama | `http://localhost:8000/v1` | any OpenAI-compatible server |
The global fallback. Every role without an explicit endpoint uses this one.
```yaml
default_endpoint: my_openrouter
```

A role is one logical LLM use inside the pipeline (feedback, paper writer, idea generator, etc.). You only specify the fields you want to override:
```yaml
llm_roles:
  # Tree-search roles
  feedback: { temperature: 0.9, max_tokens: 8192 }
  log_summary: { temperature: 1.0 }

  # Idea-generation roles — swap to a cheaper model for these
  literature_backbone: { model: google/gemini-3-flash-preview }
  challenge_extraction: { temperature: 0.3 }

  # Paper agent sub-agents: `_default` applies to the whole group unless
  # an individual sub-agent overrides it.
  paper_agent:
    _default: { temperature: 0.6 }
    BibtexAgent: { model: z-ai/glm-4.7, temperature: 0.3 }
    IdeaGenerationAgent: { model: google/gemini-3-flash-preview, temperature: 0.8 }

  # Paper-writing roles can be routed to a different endpoint entirely.
  paper_writing:
    latex_editor: { endpoint: my_dashscope, temperature: 0.7 }
    image_generator: { endpoint: my_openrouter,
                       model: google/gemini-3.1-flash-image-preview,
                       aspect_ratio: "16:9", image_size: "2K" }
```

Override precedence for any given role: role fields > its endpoint's defaults. You can point a role at a completely different endpoint with `endpoint: <name>`.
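The precedence rule amounts to a shallow merge of the role's overrides on top of its endpoint's defaults. A minimal sketch of that behavior (illustrative, not the actual `get_role` implementation — names and dict shapes are assumptions):

```python
def resolve_role(role_name, llm_roles, llm_endpoints, default_endpoint):
    """Merge a role's overrides on top of its endpoint's defaults (role fields win)."""
    role = dict(llm_roles.get(role_name, {}))                # copy so pop() doesn't mutate config
    endpoint_name = role.pop("endpoint", default_endpoint)   # explicit endpoint, else global fallback
    resolved = dict(llm_endpoints[endpoint_name])            # endpoint defaults first…
    resolved.update(role)                                    # …then role overrides win
    return endpoint_name, resolved

endpoints = {
    "my_openrouter": {"model": "deepseek/deepseek-v3.2", "temperature": 0.5},
    "my_dashscope":  {"model": "qwen-max", "temperature": 0.7},
}
roles = {
    "feedback":     {"temperature": 0.9, "max_tokens": 8192},
    "latex_editor": {"endpoint": "my_dashscope", "temperature": 0.7},
}
print(resolve_role("feedback", roles, endpoints, "my_openrouter"))
print(resolve_role("log_summary", roles, endpoints, "my_openrouter"))  # unconfigured → pure endpoint defaults
```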
Under `experiment.code.candidates`, list the endpoint names that should compete as code authors for Stage 2. Camyla will run one branch per candidate and keep the strongest.
```yaml
experiment:
  code:
    candidates: [my_dashscope, my_minimax]
    max_tokens: 16384
```

Set it to a single-element list to disable competition.
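The "one branch per candidate, keep the strongest" selection reduces to an argmax over per-candidate scores. A toy sketch of that idea — `run_branch` is a hypothetical stand-in for launching and scoring a Stage-2 branch, not a real Camyla function:

```python
def run_competition(candidates, run_branch):
    """Run one experiment branch per candidate endpoint and keep the best score.

    `run_branch` stands in for training/evaluating a Stage-2 branch; it returns
    a metric where higher is better (e.g. mean Dice)."""
    scores = {name: run_branch(name) for name in candidates}
    winner = max(scores, key=scores.get)
    return winner, scores

# Toy stand-in scores — the real pipeline would train and evaluate each branch
fake_dice = {"my_dashscope": 0.861, "my_minimax": 0.874}
winner, scores = run_competition(["my_dashscope", "my_minimax"], fake_dice.get)
print(winner)  # the candidate whose branch scored higher
```

With a single-element `candidates` list, the loop runs once and the sole candidate trivially "wins" — which is why that setting disables competition.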
Literature search keys live under `api_keys` (same inline/env fallback pattern):

```yaml
api_keys:
  s2:   { value: "", env: S2_API_KEY }    # Semantic Scholar
  ncbi: { value: "", env: NCBI_API_KEY }  # PubMed
```

Camyla works without these, but rate limits will be tighter.
The rest of the config controls what Camyla does, not which LLMs it uses:
- `idea_generation.*` — how many papers to search, how to score proposals, how many generator personalities to ensemble
- `experiment.stages.*` — per-stage iteration budgets (Stage 1 = baseline replication, Stage 2 = creative research, Stage 3 = ablation)
- `experiment.openhands.*` — OpenHands coder settings (python path, iteration cap, condenser)
- `experiment.search.*` — QWBE hyperparameters (UCB constant, debug probability, draft count)
All of these have sensible defaults in `config_example.yaml`; you usually only touch `stages.*_max_iters` when you want a faster / cheaper run.
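For a faster or cheaper run, shrinking the per-stage iteration budgets is usually enough. A sketch of the shape (the exact `*_max_iters` key names under `experiment.stages` are illustrative — check `config_example.yaml` for the real ones):

```yaml
experiment:
  stages:
    stage1_max_iters: 5    # baseline replication
    stage2_max_iters: 10   # creative research
    stage3_max_iters: 5    # ablation
```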
```
camyla/
├── LICENSE
├── README.md                      # this file
├── config_example.yaml            # documented config template
├── requirements.txt
├── launch_camyla.py               # main entry point
├── ideas/                         # 31 ready-made idea JSONs
└── camyla/                        # core package
    ├── model_config.py            # LLM config loader (get_endpoint/get_role)
    ├── baseline/                  # screening + full training before QWBE
    ├── infrastructure/literature/ # arxiv / openalex / pubmed / multi-source
    ├── paper_agent/               # LaTeX writer + plotters + bibtex agent
    ├── tools/                     # OpenAlex / Semantic Scholar tools
    ├── treesearch/                # QWBE core, parallel agents, OpenHands coder
    └── utils/
```
- CamylaNet — segmentation framework built on nnU-Net v2, shipping a curated set of CNN / Transformer / state-space backbones. Camyla's baseline stage runs CamylaNet trainers.
- nnPrep — LLM agent that converts arbitrary medical segmentation datasets into the nnU-Net v2 format consumed by CamylaNet (and therefore by Camyla).
- The baseline stage currently uses CamylaNet trainers. Swapping in another baseline framework requires a corresponding skill under `skills/frameworks/`.
- OpenHands runs code in your local Python env — make sure you point `experiment.openhands.python_path` at an env with the needed packages.
- Long runs: a full idea (proposals → 3 stages → paper) typically takes several hours to a day on a single A100-class GPU.
If you find this project useful in academic work, please cite:
```bibtex
@misc{gao2026camyla,
  title         = {Camyla: Scaling Autonomous Research in Medical Image Segmentation},
  author        = {Gao, Yifan and Li, Haoyue and Yuan, Feng and Gao, Xin and Huang, Weiran and Wang, Xiaosong},
  year          = {2026},
  eprint        = {2604.10696},
  archivePrefix = {arXiv},
  primaryClass  = {cs.AI}
}
```

Camyla builds on ideas and code from two upstream projects:
- AI Scientist (Sakana AI) — pioneered the autonomous research-agent paradigm that inspired Camyla's overall pipeline design.
- nnU-Net — the self-configuring segmentation framework that Camyla's baseline stage (via CamylaNet) is built on.
Released under the Apache License, Version 2.0 — see LICENSE.


