Commit caf4f7b

RESEARCH.md and improvements
1 parent 373073d commit caf4f7b

24 files changed

Lines changed: 1771 additions & 9 deletions

.github/workflows/tests.yml

Lines changed: 37 additions & 1 deletion
```diff
@@ -32,7 +32,7 @@ jobs:
     strategy:
       fail-fast: false
       matrix:
-        python-version: ["3.10", "3.11", "3.12"]
+        python-version: ["3.10", "3.11", "3.12", "3.13"]
 
     env:
       PYTHONUTF8: "1"
@@ -100,3 +100,39 @@ jobs:
       - name: Run full test suite
         run: |
           pytest ${{ inputs['pytest-args'] || '-q -o addopts=' }}
+
+  numpy2-compat:
+    name: NumPy 2 compatibility probe
+    runs-on: ubuntu-latest
+    if: github.event_name == 'schedule' || github.event_name == 'workflow_dispatch'
+    continue-on-error: true
+
+    env:
+      PYTHONUTF8: "1"
+
+    steps:
+      - name: Checkout repository
+        uses: actions/checkout@v4
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.13"
+          cache: "pip"
+          cache-dependency-path: |
+            pyproject.toml
+
+      - name: Install with NumPy 2 override
+        run: |
+          python -m pip install --upgrade pip
+          python -m pip install -e ".[dev]" pyscf
+          python -m pip install --upgrade --no-deps "numpy>=2,<3" "scipy>=1.17,<2"
+
+      - name: Debug environment
+        run: |
+          python --version
+          python -m pip freeze | sed -n '1,200p'
+
+      - name: Run fast tests against NumPy 2
+        run: |
+          pytest -q
```

MANIFEST.in

Lines changed: 1 addition & 0 deletions
```diff
@@ -2,6 +2,7 @@
 include README.md
 include USAGE.md
 include THEORY.md
+include RESEARCH.md
 include CHANGELOG.md
 include LICENSE
 include pyproject.toml
```

README.md

Lines changed: 20 additions & 1 deletion
````diff
@@ -370,7 +370,8 @@ Use these in order:
 1. [`README.md`](README.md) for orientation and quickstart
 2. [`USAGE.md`](USAGE.md) for CLI and Python entrypoints
 3. [`THEORY.md`](THEORY.md) for algorithmic background
-4. [`notebooks/README_notebooks.md`](notebooks/README_notebooks.md) for notebook navigation
+4. [`RESEARCH.md`](RESEARCH.md) for benchmark evidence standards
+5. [`notebooks/README_notebooks.md`](notebooks/README_notebooks.md) for notebook navigation
 
 Deeper implementation notes:
 
@@ -413,6 +414,24 @@ Run the full suite, including slow integration coverage, with:
 pytest -q -o addopts=''
 ```
 
+Run a registered benchmark suite and write reproducible artifacts with:
+
+```bash
+python -m common.benchmarks run --suite expert-z-cross-method --out benchmark_runs
+```
+
+List available suites with:
+
+```bash
+python -m common.benchmarks list
+```
+
+Compare two benchmark runs with:
+
+```bash
+python -m common.benchmarks compare --base old_run/h2-cross-method --head new_run/h2-cross-method
+```
+
 ---
 
 ## Support development
````

RESEARCH.md

Lines changed: 106 additions & 0 deletions
New file:

# Research Use

This repository is most useful as a reproducible evidence generator for
small-system quantum simulation studies. It is not a claim that any one method
is universally best, and it is not production chemistry software.

For the problem scope and user-facing workflows, read `PROBLEM.md`. For the
current benchmark inventory, read `notebooks/benchmarks/SUMMARY.md`. For
published tables and figures, read `notebooks/benchmarks/RESULTS.md`.
## Research Claims This Repo Can Support

Use this repo to support claims of these forms:

- for a named small molecule or low-qubit Hamiltonian, method A was more
  accurate, faster, or more stable than method B under a stated configuration
- a solver default is reasonable for a documented calibration panel
- a configuration is sensitive to seed, shots, noise channel, mapping, ansatz,
  optimizer, or active-space choice
- a non-chemistry Hamiltonian can be run through the same expert-mode API as the
  chemistry benchmarks

Do not use this repo, by itself, to claim:

- chemical accuracy for large molecules
- hardware performance or device-readiness
- universal algorithm rankings across problem classes
- production-quality quantum chemistry results
## Evidence Standard

A result should be treated as research evidence only when it records:

- the resolved problem: molecule or model, geometry, charge, basis, mapping,
  active space, qubit count, and Hamiltonian-term count where available
- the reference: exact diagonalization, Hartree-Fock, analytical model result,
  or an explicit statement that no reference is used
- the solver configuration: method, ansatz, optimizer, step counts, step sizes,
  QPE evolution settings, shots, seeds, and noise model
- the metrics: energy, absolute error against the reference where meaningful,
  runtime, cache-hit state, and method-specific diagnostics
- the statistical design: seed list, shot list, repetitions, failure criteria,
  and aggregation method for stochastic or optimizer-sensitive studies
- the environment: package version and relevant dependency versions for
  release-grade benchmark artifacts

The benchmark row contract is documented in
`notebooks/benchmarks/SCHEMA.md`.
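The evidence standard above can be made concrete as a single recorded row. The sketch below is an assumption-laden illustration: the field names and values are invented for this example, and the real row contract is the one defined in `notebooks/benchmarks/SCHEMA.md`.

```python
# A hypothetical evidence row illustrating the standard above. Field names
# and values are illustrative only; the real contract is documented in
# notebooks/benchmarks/SCHEMA.md.
evidence_row = {
    "problem": {
        "molecule": "H2", "basis": "sto-3g", "mapping": "jordan-wigner",
        "active_space": None, "n_qubits": 4, "n_hamiltonian_terms": 15,
    },
    "reference": {"kind": "exact_diagonalization", "energy": -1.1373},
    "solver": {
        "method": "vqe", "ansatz": "uccsd", "optimizer": "cobyla",
        "shots": None, "seed": 7, "noise_model": None,
    },
    "metrics": {
        "energy": -1.1370, "abs_error": 3.0e-4,
        "runtime_s": 1.8, "cache_hit": False,
    },
    "environment": {"package_version": "1.2.3", "numpy_version": "2.1.0"},
}

# A row is usable as evidence only if every top-level block is present.
required = {"problem", "reference", "solver", "metrics", "environment"}
assert required <= evidence_row.keys()
```

Recording an explicit `None` (no shots, no noise model) rather than omitting the key keeps the row self-describing, which is what makes later cross-run comparison meaningful.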
## Claim Levels

| Level | Meaning | Minimum evidence |
| --- | --- | --- |
| Smoke | The API path runs. | One tiny deterministic case. |
| Case study | A method behaves as reported on one problem. | One problem, fixed config, reference where available. |
| Benchmark | A comparison is decision-useful. | Multiple methods or settings, common reference, runtime/cache metadata, documented metrics. |
| Reproducibility study | Stability is measured. | Multiple seeds or shots, aggregate statistics, failure-rate notes. |
| Release-grade evidence | A result can be cited. | Curated artifact export, versioned code, clean validation checks, and documented limitations. |
## Benchmark Acceptance Checklist

Before treating a new notebook as a benchmark, verify that it:

- asks one explicit research or method-selection question
- avoids duplicating an existing notebook unless it replaces or generalizes it
- follows `notebooks/benchmarks/TEMPLATE.md` for question, scope, reference,
  metrics, aggregation, limitations, and artifact-export notes
- uses the shared problem-resolution and Hamiltonian pipeline
- compares against an exact or clearly documented reference when feasible
- reports cache hits separately from compute runtime
- exports any published table or figure through
  `scripts/export_benchmark_artifacts.py`
- states limitations in the notebook or nearby docs when a method is known to
  be noiseless-only, calibration-specific, or small-system-only
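The "cache hits separately from compute runtime" requirement fits in a few lines. This is a minimal sketch of the bookkeeping, not the repo's actual pipeline: the cache, the key, and the stand-in solver are all assumptions for illustration.

```python
import time

# Stand-in result cache; the repo's real caching layer may differ.
_cache = {}

def timed_solve(key, solve):
    """Run (or reuse) a solve and report (result, runtime_s, cache_hit)."""
    start = time.perf_counter()
    if key in _cache:
        # A hit costs only lookup time; report it as a hit, not as compute.
        return _cache[key], time.perf_counter() - start, True
    result = solve()
    _cache[key] = result
    return result, time.perf_counter() - start, False

# First call computes; the second call is a cache hit.
r1, t1, hit1 = timed_solve("h2", lambda: 42)
r2, t2, hit2 = timed_solve("h2", lambda: 42)
assert (hit1, hit2) == (False, True)
```

Keeping the hit flag in the benchmark row lets a reader discard near-zero "runtimes" that only measure dictionary lookup.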
## Release Protocol

For a release intended to be useful in research:

1. Run the default and full test suites.
2. Run registered benchmark suites that should become release evidence. Use
   `python -m common.benchmarks list` to inspect available suites, then run a
   selected suite, for example
   `python -m common.benchmarks run --suite h2-cross-method`.
3. Compare refreshed suite outputs against the previous evidence set with
   `python -m common.benchmarks compare --base old_run --head new_run`.
4. Rerun benchmark notebooks whose results are being refreshed.
5. Export curated artifacts with `python scripts/export_benchmark_artifacts.py`.
6. Confirm `notebooks/benchmarks/_artifacts/benchmark_manifest.json` describes
   the published artifact set.
7. Build docs with `python -m sphinx -W -b html docs docs/_build/html`.
8. Tag the code and attach or archive the curated artifact set if the release is
   meant to be cited directly.
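The comparison in step 3 is conceptually a drift check between two evidence sets. The sketch below illustrates that idea only; the name-to-energy row shape and the tolerance are assumptions, not the repo's actual `compare` implementation.

```python
# Hypothetical drift check in the spirit of the compare step; the mapping of
# case name -> energy and the tolerance are assumptions, not the repo's schema.
def detect_regressions(base, head, tol=1e-6):
    """Return case names whose head energy is missing or drifted beyond tol."""
    return [
        name for name, base_energy in base.items()
        if head.get(name) is None or abs(head[name] - base_energy) > tol
    ]

base_run = {"h2_sto3g_vqe": -1.137000, "h2_sto3g_qpe": -1.137100}
head_run = {"h2_sto3g_vqe": -1.137000, "h2_sto3g_qpe": -1.140000}
assert detect_regressions(base_run, head_run) == ["h2_sto3g_qpe"]
```

Treating a missing case as a regression (rather than skipping it) is a deliberate choice here: silently dropped benchmarks are as suspect as drifted ones.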
## Document Boundaries

The markdown files intentionally have separate jobs:

- `README.md`: installation, orientation, and quickstart
- `PROBLEM.md`: what practical problems the repo is for
- `THEORY.md`: algorithm background
- `USAGE.md`: API and CLI usage
- `RESEARCH.md`: evidence standards and benchmark acceptance rules
- `notebooks/benchmarks/SUMMARY.md`: benchmark inventory
- `notebooks/benchmarks/RESULTS.md`: curated result surfaces
- `notebooks/benchmarks/SCHEMA.md`: benchmark row and artifact metadata fields

common/__init__.py

Lines changed: 8 additions & 0 deletions
```diff
@@ -47,6 +47,14 @@
     ),
     "ionization_energy_panel": ("common.benchmarks", "ionization_energy_panel"),
     "summarize_problem": ("common.benchmarks", "summarize_problem"),
+    "compare_benchmark_runs": ("common.benchmarks", "compare_benchmark_runs"),
+    "list_benchmark_suites": ("common.benchmarks", "list_benchmark_suites"),
+    "run_benchmark_suite": ("common.benchmarks", "run_benchmark_suite"),
+    "environment_metadata": ("common.environment", "environment_metadata"),
+    "ensure_environment_metadata": (
+        "common.environment",
+        "ensure_environment_metadata",
+    ),
     "compute_fidelity": ("common.metrics", "compute_fidelity"),
     "ResolvedProblem": ("common.problem", "ResolvedProblem"),
     "resolve_problem": ("common.problem", "resolve_problem"),
```
