Current validation practice undermines surgical AI development

This repository contains the code accompanying the paper Reinke et al., “Current validation practice undermines surgical AI development”. It provides fully documented analyses for three experiments that demonstrate common validation pitfalls in surgical AI benchmarking.

Reproducibility and data availability

The analyses in this repository operate on performance results from surgical AI benchmarks. These result tables cannot be shared publicly, as participant-level consent for data redistribution was not obtained.

As a consequence:

All experiments are implemented as Jupyter notebooks.
Running the notebooks end-to-end requires access to the corresponding anonymized challenge result tables, which can only be granted to reviewers.
Even without access to these data files, the notebooks display all results, figures, and outputs reported in the paper and allow full inspection of the analysis logic.

No raw images, videos, or annotations are accessed. All analyses are based exclusively on precomputed, tabular performance data.

Repository structure

The repository consists of three Jupyter notebooks, one per experiment:

Experiment 1 – Dependent test samples inflate confidence: exp1_dependent-test-samples-inflate-confidence.ipynb
Experiment 2 – Averages hide critical failures: exp2_averages-hide-critical-failures.ipynb
Experiment 3 – Aggregation choices can flip the winner: exp3_aggregation-choices-can-flip-the-winner.ipynb

Each notebook is self-contained and includes detailed methodological explanations, explicit documentation of data assumptions, and all analysis steps required to reproduce the reported figures and conclusions.

Experiment 1: Dependent test samples inflate confidence

This experiment demonstrates how ignoring hierarchical dependencies in temporally structured surgical video data leads to severely underestimated uncertainty, i.e., overly narrow confidence intervals.

Core idea

Surgical video data is inherently hierarchical: multiple correlated frames originate from the same patient case (video). We compare two resampling strategies for estimating 95% bootstrap confidence intervals:

Naive bootstrap: resampling individual frames, implicitly assuming independence.
Hierarchical bootstrap: resampling videos/patients first, then frames within each selected video, explicitly accounting for dependencies.
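
The two strategies can be sketched on synthetic data as follows. This is a minimal illustration, not the notebook code: the function names, the use of the mean as the summary statistic, the percentile interval, and the toy data (10 videos with 200 frames each and a per-video offset) are all assumptions made for this example.

```python
import numpy as np


def naive_bootstrap_ci(scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile CI of the mean, resampling frames as if independent."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    means = np.empty(n_boot)
    for b in range(n_boot):
        sample = rng.choice(scores, size=scores.size, replace=True)
        means[b] = sample.mean()
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])


def hierarchical_bootstrap_ci(scores, video_ids, n_boot=1000, alpha=0.05, seed=0):
    """Percentile CI of the mean, resampling videos first, then frames."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    video_ids = np.asarray(video_ids)
    # One array of frame scores per video.
    groups = [scores[video_ids == v] for v in np.unique(video_ids)]
    means = np.empty(n_boot)
    for b in range(n_boot):
        # Level 1: draw videos with replacement.
        chosen = rng.integers(0, len(groups), size=len(groups))
        # Level 2: draw frames with replacement within each chosen video.
        resampled = [
            rng.choice(groups[i], size=groups[i].size, replace=True)
            for i in chosen
        ]
        means[b] = np.concatenate(resampled).mean()
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])


# Synthetic example: frames within a video share a video-level offset.
data_rng = np.random.default_rng(42)
n_videos, n_frames = 10, 200
video_ids = np.repeat(np.arange(n_videos), n_frames)
video_level = data_rng.normal(0.8, 0.08, size=n_videos)
scores = np.clip(
    np.repeat(video_level, n_frames) + data_rng.normal(0, 0.02, size=video_ids.size),
    0.0,
    1.0,
)

naive_ci = naive_bootstrap_ci(scores)
hier_ci = hierarchical_bootstrap_ci(scores, video_ids)
print("naive CI width:       ", naive_ci[1] - naive_ci[0])
print("hierarchical CI width:", hier_ci[1] - hier_ci[0])
```

Because most of the variance in this toy data sits at the video level, the naive interval comes out much narrower than the hierarchical one, which is the effect the experiment examines on real challenge results.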

Tasks and datasets

Binary instrument segmentation (RobustMIS 2019)
- 10 challenge submissions
- Metrics: Dice Similarity Coefficient (DSC), Normalized Surface Dice (NSD)
- Hierarchy: patient/video level (n = 10)
Surgical action triplet recognition (CholecT45)
- Precomputed Swin-Base predictions
- Metrics: mean Average Precision (mAP), class-weighted mAP, top-5 accuracy
- Hierarchy: patient/video level (n = 45)
- Cross-validation folds handled separately

Experiment 2: Averages hide critical failures

This experiment shows that global (non-stratified) aggregation of performance metrics can conceal clinically critical failure modes that only become visible under stratified analysis.

Core idea

Performance is often summarized as a single global score, implicitly assuming that errors are evenly distributed across conditions. However, rare but safety-critical visual conditions can cause substantial performance drops that are masked by global aggregation.

The experiment contrasts:

Non-stratified aggregation: median performance over all frames.
Stratified aggregation: median performance restricted to frames exhibiting specific visual artifacts.
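
The contrast can be illustrated with a short sketch. The helper name `median_by_condition`, the dictionary of boolean artifact masks, and the toy scores are hypothetical and chosen only for illustration; the notebook's actual data layout may differ.

```python
import numpy as np


def median_by_condition(scores, artifact_flags):
    """Return the global median plus one median per artifact subset.

    scores: per-frame metric values.
    artifact_flags: dict mapping an artifact name to a boolean mask over frames.
    Subsets may overlap, since a frame can show several artifacts.
    """
    scores = np.asarray(scores, dtype=float)
    summary = {"all frames": float(np.median(scores))}
    for name, mask in artifact_flags.items():
        mask = np.asarray(mask, dtype=bool)
        # An empty subset has no defined median.
        summary[name] = float(np.median(scores[mask])) if mask.any() else float("nan")
    return summary


# Toy data: 95 clean frames score well, 5 frames with smoke score poorly.
scores = np.array([0.9] * 95 + [0.3] * 5)
artifact_flags = {"smoke": np.array([False] * 95 + [True] * 5)}

summary = median_by_condition(scores, artifact_flags)
print(summary)
```

In this toy case the global median is unaffected by the five smoke frames, while the median restricted to those frames exposes the drop.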

Task and dataset

Task: Multi-instance instrument segmentation
Dataset: RobustMIS 2019 challenge results
Algorithms: 7 challenge submissions
Metric: Multi-instance Dice Similarity Coefficient (MI_DSC)

Artifact-based stratification

Stratification is performed using structured frame-level metadata describing visual artifacts and image properties. Each frame may contain multiple artifacts; subsets are therefore not mutually exclusive. Considered conditions include:

Blood
Motion
Reflections
Smoke
Instrument(s) covered by material
Overexposed instruments
Underexposed instruments
Intersecting instruments
Low-artifact scenes (≤ 1 annotated artifact)

Uncertainty estimation

Uncertainty of performance differences is estimated using hierarchical bootstrapping to calculate confidence intervals (see Experiment 1).

Experiment 3: Aggregation choices can flip the winner

This experiment illustrates how different, yet reasonable, aggregation strategies applied to the same fixed results can lead to substantially different algorithm rankings, including changes in the apparent winner.

Core idea

Surgical video analysis data is multi-level (frames, phases, videos). Reported performance scores and rankings depend critically on how results are aggregated across these levels, yet aggregation schemes are often underspecified or omitted in practice.

Experimental setup

Task: Binary instrument segmentation
Data: RobustMIS 2019 challenge results
Algorithms: 10 challenge submissions
Metric: DSC
Aggregation operator: 5th percentile (as used in the original challenge)

Six aggregation strategies are compared, including frame-wise, video-wise, phase-wise, and clinically weighted phase-wise aggregation.
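
A minimal sketch of how two of these choices can disagree is given below. It covers only a frame-wise and a video-wise variant; the function names, the decision to average per-video percentiles, and the toy scores for the hypothetical algorithms X and Y are assumptions for illustration and do not reproduce the notebook's six strategies.

```python
import numpy as np


def framewise_score(scores, video_ids, q=5):
    """Pool all frames, then take the q-th percentile."""
    return float(np.percentile(scores, q))


def videowise_score(scores, video_ids, q=5):
    """Take the q-th percentile per video, then average across videos."""
    per_video = [
        np.percentile(scores[video_ids == v], q) for v in np.unique(video_ids)
    ]
    return float(np.mean(per_video))


# Toy data: one long video (A) and one short video (B).
video_ids = np.array(["A"] * 1000 + ["B"] * 40)
results = {
    # X is strong on the long video but fails on the short one.
    "X": np.concatenate([np.full(1000, 0.8), np.full(40, 0.2)]),
    # Y is moderately good everywhere.
    "Y": np.full(1040, 0.7),
}


def winner(aggregate):
    """Name of the algorithm with the highest aggregated score."""
    return max(results, key=lambda name: aggregate(results[name], video_ids))


winner_framewise = winner(framewise_score)
winner_videowise = winner(videowise_score)
print("frame-wise winner:", winner_framewise)
print("video-wise winner:", winner_videowise)
```

The short video contributes fewer than 5% of all frames, so X's failure there does not reach the pooled 5th percentile and X ranks first frame-wise. Once each video carries equal weight, the failure counts fully and Y ranks first.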

Software dependencies

All notebooks use standard Python libraries for data handling, statistical resampling, metric computation, and visualization. All required imports are explicitly listed at the top of each notebook.

How to run

Each experiment is implemented as a Jupyter notebook.

Open the corresponding .ipynb file.
Run the notebook top-to-bottom (Kernel → Restart & Run All).

Execution requires access to the underlying challenge result tables. If these files are not available, the notebooks still allow full inspection of the analysis code and display the results reported in the paper.

Citation

If you use this code, please cite:

@article{reinke2025current,
  title={Current validation practice undermines surgical AI development},
  author={Reinke, Annika and Li, Ziying O and Tizabi, Minu D and Andr{\'e}, Pascaline and Knopp, Marcel and Rother, Mika M and Machado, Ines P and Altieri, Maria S and Alapatt, Deepak and Bano, Sophia and others},
  journal={arXiv preprint arXiv:2511.03769},
  year={2025}
}
