Commit e9e7030
Multi-run dashboard, IC-based breeding, configurable run workflow

Dashboard:
- Takes simulation_dir instead of STDOUT path; discovers all runs with STDOUT.0000 and presents a dropdown selector
- Per-run SLURM job ID via slurm_job_id file in each run directory
- All endpoints accept ?run= parameter for run-specific data

Converter & plotter:
- Both take simulation_dir and process all discovered runs
- Converter uses atomic writes (tmp + rename) to prevent partial reads
- Per-run plot/conversion state tracked independently

Bred vectors:
- Rewritten to use IC files (T/S/U/V/Eta.init.bin) instead of pickups
- Each cycle: overwrite member ICs, run from nIter0=0 for 30 days
- breed_vectors.sh: fresh run dir each cycle, copies member ICs, generates member-specific data file with nIter0=0 and nTimeSteps
- Production runs start from the pickup at t=30 days

Run workflow:
- RUN_DIR and SIMULATION_DIR configurable via environment variables
- Writes slurm_job_id to run directory for dashboard association
- Syncs data* files to local input dir before each run

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 352c554 commit e9e7030

8 files changed

Lines changed: 806 additions & 858 deletions
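The converter's "atomic writes (tmp + rename)" mentioned in the commit message can be sketched in Python. This is a minimal illustration of the pattern, not the actual converter code, and `atomic_write` is a hypothetical helper name:

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Write data to path so readers never observe a partial file.

    The bytes go to a temporary file in the same directory first;
    os.replace() then swaps it into place, which is atomic on POSIX
    for same-filesystem renames.
    """
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # make sure bytes hit disk before the rename
        os.replace(tmp_path, path)  # readers see the old or new file, never a partial one
    except BaseException:
        os.unlink(tmp_path)
        raise
```

Writing to the final path directly would let the dashboard's plotter read a half-written file; the rename makes the swap all-or-nothing.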

simulations/glorysv12-curvilinear/ensemble/README.md

Lines changed: 83 additions & 66 deletions

````diff
@@ -23,24 +23,39 @@ far more realistic initial condition perturbations than random noise alone.
 
 ### Breeding cycle
 
+Each member has its own set of initial condition files (`T.init.bin`,
+`S.init.bin`, `U.init.bin`, `V.init.bin`, `Eta.init.bin`). The cycle operates
+on these IC files directly:
+
 ```
-Control:   ─────────────────────────────────────────────►
-               │                    │
-               │ perturb            │ rescale & perturb
-               ▼                    ▼
-Member i:  ─────●════════════════════●════════════════════●───►
-             cycle 0              cycle 1              cycle 2 ...
-             (30 days)            (30 days)
+Control ICs:  T.init.bin, S.init.bin, U.init.bin, V.init.bin, Eta.init.bin
+
+      │ add perturbation (breed_vectors.py init)
+
+Member ICs:   T.init.bin, S.init.bin, ... (in member_NNN/)
+
+      │ run MITgcm from nIter0=0 for 30 days
+
+Member pickup at t=30 days (pickup.0000007200.data)
+
+      │ bred_vector = member_pickup - control_pickup
+      │ rescale by target_RMS / actual_T_RMS
+      │ new_IC = control_IC + rescaled_bred_vector
+
+Member ICs:   overwritten with new perturbation → next cycle
 ```
 
-Each cycle:
-1. Member starts from `control_state + perturbation`
-2. Runs forward for 30 days (configurable)
-3. At end: `bred_vector = member_state - control_state`
-4. Rescale factor = `target_RMS / actual_RMS` (computed from temperature)
-5. **Same rescale factor applied to ALL variables** (T, S, U, V, SSH) to preserve
-   geostrophic and hydrostatic balance
-6. New perturbation: `control_state + rescale_factor × bred_vector`
+Key points:
+- Every cycle starts from **nIter0=0** — the member's IC files are the
+  perturbation mechanism, not pickup files
+- The **same forcing, grid, and namelist files** are shared across all members
+  (symlinked from the master input directory). Only the IC files differ.
+- Bred vectors are computed from the **pickup at t=30 days**, which captures
+  how the perturbation grew over the cycle
+- The **same rescale factor** (derived from temperature RMS) is applied to
+  all variables to preserve dynamical balance
+- For **production runs** after breeding converges, each member restarts from
+  its pickup at t=30 days (iteration 7200)
 
 ### Design choices
 
@@ -98,42 +113,57 @@ vectors readjust within the first few weeks of the production run regardless.
 
 ### Prerequisites
 
-- Completed control spinup run with at least one permanent pickup file
-- Set `pickup_iter` in `breed_config.yaml` (or leave null to auto-detect latest)
+- Completed control spinup run (1 year)
+- Control run must also be run from nIter0=0 for 30 days (same as members)
+  to produce the reference pickup for bred vector computation
 
 ### Steps
 
 ```bash
 cd simulations/glorysv12-curvilinear
 
-# 1. Initialize 50 perturbed pickups from the control state
+# 1. Initialize 50 perturbed IC files from the control ICs
 uv run python ../../spectre_utils/breed_vectors.py init ensemble/breed_config.yaml
 
-# 2. Run all 50 members for one breeding cycle (SLURM array job)
+# 2. Run all 50 members for one 30-day cycle (SLURM array job)
+#    Each member starts from nIter0=0 with its perturbed ICs
 sbatch --chdir=$(pwd) workflows/breed_vectors.sh
 
-# 3. After all members complete — compute bred vectors and rescale
+# 3. After all members complete — compute bred vectors and overwrite ICs
 uv run python ../../spectre_utils/breed_vectors.py rescale ensemble/breed_config.yaml --cycle 1
 
 # 4. Check convergence (per-variable RMS table)
-uv run python ../../spectre_utils/breed_vectors.py status ensemble/breed_config.yaml --cycle 1
+uv run python ../../spectre_utils/breed_vectors.py status ensemble/breed_config.yaml
 
-# 5. Repeat steps 2–4 for each cycle
-#    Update --cycle 2, 3, ... 8
+# 5. Repeat steps 2–4 for each cycle (2, 3, ... 8)
 ```
 
-### GCP deployment
+### Running the control alongside members
 
-Each member directory (`member_001/` through `member_050/`) is self-contained:
-- Perturbed pickup file (`.data` + `.meta`)
-- `nIter0.txt` with the starting iteration
+The control must also produce a pickup at iteration 7200 (30 days from nIter0=0)
+for the bred vector computation. This can be done:
+- As a separate single run with the same `data` settings as the members
+- On one of the 17 compute nodes alongside 2 members (3 sims per node)
 
-To run on GCP:
-1. Copy the input deck and member pickups to each compute node's local disk
-2. Each member runs standard MITgcm with the member's pickup as the restart file
-3. After all members finish, copy pickups back and run the `rescale` step
+### Transitioning to production
+
+After breeding converges (cycle 5–8):
+1. Each member has a pickup at `member_NNN/run/pickup.0000007200.data`
+2. Copy this pickup to the member's production run directory
+3. Set `nIter0=7200` and full production `endTime`/`nTimeSteps`
+4. Run the production ensemble
 
-## GCP Cost Estimate
+## GCP deployment
+
+Each member directory (`member_001/` through `member_050/`) contains:
+- `T.init.bin`, `S.init.bin`, `U.init.bin`, `V.init.bin`, `Eta.init.bin`
+
+To run on GCP:
+1. Copy the master input deck to each compute node's local disk (one copy per node)
+2. Copy each member's IC files to the node
+3. Set up the member run directory: symlink master input, replace IC symlinks with copies
+4. Run MITgcm with `nIter0=0`, `nTimeSteps=7200`
+5. After all members finish, copy pickups back and run the `rescale` step
 
 ### Cluster configuration
 
@@ -143,28 +173,7 @@ To run on GCP:
 | Login | n1-standard-2 | 1 | SSH access, job submission |
 | Controller | n1-standard-2 | 1 | Slurm controller |
 
-### Compute requirements per cycle
-
-- 50 members ÷ 3 per node = **17 nodes** per cycle
-- 30 sim-days at 12–20 sim-days/wall-hr = **1.5–2.5 wall hours** per cycle
-- 8 cycles × 2.5 hrs = **~20 hours** total wall time (plus ~30 min rescaling between cycles)
-- Total compute: 17 nodes × 20 hrs = **340 node-hours** (conservative)
-
-### Local disk per node
-
-| Data | Size |
-|------|------|
-| EXF forcing (8 variables × 54 GB) | 432 GB |
-| OBC boundary files | 20 GB |
-| Grid, bathymetry, initial conditions | 2 GB |
-| Pickup files (3 members) | 6 GB |
-| Output headroom (diagnostics, pickups) | 40 GB |
-| **Total** | **~500 GB** |
-
-Recommend **1 TB pd-ssd** per compute node, or local NVMe SSD if available
-on the machine type.
-
-### Cost breakdown
+### Cost estimate
 
 | Item | On-demand | Spot (~70% discount) |
 |------|-----------|---------------------|
@@ -173,16 +182,16 @@ on the machine type.
 | n1-standard-2 × 2 × 24 hrs @ $0.095/hr | $5 | $5 |
 | **Total** | **~$3,400** | **~$1,100** |
 
-### Notes
+### Local disk per node
 
-- Spot/preemptible instances are viable since each breeding cycle is only
-  1.5–2.5 hours — short enough to avoid most preemptions
-- The 30-min rescaling step between cycles runs on a single node and is
-  negligible cost
-- Data transfer: ~500 GB input deck upload (one-time) + ~100 MB pickups per
-  cycle (negligible)
-- The control run must also advance 30 days per cycle to provide the reference
-  state — this can run on one of the 17 compute nodes
+| Data | Size |
+|------|------|
+| EXF forcing (8 variables × 54 GB) | 432 GB |
+| OBC boundary files | 20 GB |
+| Grid, bathymetry, other input | 5 GB |
+| Member IC files (3 members × 5 files × 130 MB) | 2 GB |
+| Output headroom (pickups) | 40 GB |
+| **Total** | **~500 GB** |
 
 ## Configuration
 
@@ -213,11 +222,19 @@ consider a shorter `cycle_length_days` to accelerate convergence.
 ```
 ensemble/
 ├── breed_config.yaml          # Breeding parameters
+├── convergence.json           # Per-cycle RMS diagnostics (written by rescale)
 ├── README.md                  # This file
 ├── member_001/                # Member 1
-│   ├── pickup.NNNNNNNNNN.data
-│   ├── pickup.NNNNNNNNNN.meta
-│   └── nIter0.txt
+│   ├── T.init.bin             # Perturbed temperature IC
+│   ├── S.init.bin             # Perturbed salinity IC
+│   ├── U.init.bin             # Perturbed zonal velocity IC
+│   ├── V.init.bin             # Perturbed meridional velocity IC
+│   ├── Eta.init.bin           # Perturbed SSH IC
+│   └── run/                   # MITgcm run directory (created by breed_vectors.sh)
+│       ├── *.bin → /input/    # Symlinks to master input (forcing, grid, OBC)
+│       ├── T.init.bin         # Copied (not symlinked) from member dir
+│       ├── data               # Member-specific (nIter0=0, nTimeSteps=7200)
+│       └── pickup.0000007200.data  # Output: state at t=30 days
 ├── member_002/
 │   └── ...
 └── member_050/
````
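The breeding arithmetic in the README hunk above (difference at t=30 days, temperature-derived factor, same factor for every variable) can be sketched in pure Python. The dict-of-flat-lists layout and the function names are illustrative stand-ins for the real binary-file I/O in `breed_vectors.py`:

```python
import math


def rms(field):
    """Root-mean-square of a flat list of floats."""
    return math.sqrt(sum(x * x for x in field) / len(field))


def rescale_perturbations(control, member, target_rms):
    """One rescale step: bred_vector = member - control at t=30 days,
    factor = target_RMS / actual_T_RMS, new_IC = control + factor * bred.

    The SAME temperature-derived factor scales T, S, U, V and Eta, so the
    perturbation keeps its dynamical balance across variables.
    """
    bred = {v: [m - c for m, c in zip(member[v], control[v])]
            for v in control}
    factor = target_rms / rms(bred["T"])
    return {v: [c + factor * b for c, b in zip(control[v], bred[v])]
            for v in control}
```

By construction the new temperature perturbation has RMS exactly `target_rms`, while the other variables are shrunk (or grown) by the same ratio rather than normalized independently.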

simulations/glorysv12-curvilinear/ensemble/breed_config.yaml

Lines changed: 2 additions & 5 deletions

```diff
@@ -12,11 +12,9 @@ grid:
   Ny: 424
   Nr: 50
 
-# Pickup file from the control run to use as the base state
+# Control run directory (relative to ensemble/)
+# Must contain a pickup at iteration 7200 (30 days from nIter0=0)
 control:
-  pickup_prefix: "pickup"
-  # Iteration number of the pickup to use (set after spinup completes)
-  pickup_iter: null
   run_dir: "../test-run-03252026"
 
 # MITgcm run parameters for each breeding member
@@ -28,5 +26,4 @@ member_run:
 
 # Output directories
 paths:
-  ensemble_dir: "."
   member_prefix: "member"
```
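The iteration number in the config comment above is plain arithmetic from the 360 s timestep the member runs use: 30 days × 86400 s/day ÷ 360 s/step = 7200 steps. As a sanity check (`pickup_iteration` is a hypothetical helper, not part of the repo):

```python
def pickup_iteration(cycle_days: int, dt_seconds: int) -> int:
    """Iteration of the pickup written at the end of a breeding cycle,
    starting from nIter0=0."""
    total_seconds = cycle_days * 86400
    # The cycle must be a whole number of timesteps for the pickup to land
    # exactly on the cycle boundary.
    assert total_seconds % dt_seconds == 0
    return total_seconds // dt_seconds
```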

simulations/glorysv12-curvilinear/workflows/breed_vectors.sh

Lines changed: 54 additions & 31 deletions

```diff
@@ -7,11 +7,16 @@
 #SBATCH --output=breed_%A_%a.out
 #SBATCH --error=breed_%A_%a.out
 
-# Each array task runs one breeding member.
+# Each array task runs one breeding member for one 30-day cycle.
 # SLURM_ARRAY_TASK_ID = member number (1-50)
+#
+# Each member starts from nIter0=0 with its own perturbed IC files
+# (T.init.bin, S.init.bin, U.init.bin, V.init.bin, Eta.init.bin)
+# and runs forward for nTimeSteps (default 7200 = 30 days at dt=360s).
+#
+# The perturbed ICs are created/updated by breed_vectors.py (init or rescale).
 
 MEMBER_ID=$(printf "%03d" $SLURM_ARRAY_TASK_ID)
-MEMBER_DIR="ensemble/member_${MEMBER_ID}"
 
 if [ -n "${SLURM_JOB_ID:-}" ]; then
   SCRIPT_PATH=$(scontrol show job "$SLURM_JOB_ID" --json | jq -r '.jobs[0].command')
@@ -24,50 +29,68 @@ fi
 
 source $SCRIPT_DIR/env.sh
 
+MEMBER_DIR="${SIMULATION_DIR}/ensemble/member_${MEMBER_ID}"
+MEMBER_RUN_DIR="${MEMBER_DIR}/run"
+NTIMESTEPS=${BREED_NTIMESTEPS:-7200}
+IC_FILES="T.init.bin S.init.bin U.init.bin V.init.bin Eta.init.bin"
+
 echo "======================================="
 echo " Breeding member: ${MEMBER_ID}"
-echo " Simulation dir:  ${SIMULATION_DIR}"
-echo " Member dir:      ${MEMBER_DIR}"
+echo " nIter0:          0"
+echo " nTimeSteps:      ${NTIMESTEPS}"
 echo "======================================="
 
-# Read nIter0 for this member
-NITER0=$(cat ${SIMULATION_DIR}/${MEMBER_DIR}/nIter0.txt)
-echo "Starting from iteration: ${NITER0}"
-
 ###############################################################################
-# Set up member run directory if needed
+# Set up member run directory (fresh each cycle)
 ###############################################################################
-if [[ ! -d "${SIMULATION_DIR}/${MEMBER_DIR}/run" ]]; then
-  echo "Setting up member run directory..."
-  mkdir -p ${SIMULATION_DIR}/${MEMBER_DIR}/run
+rm -rf ${MEMBER_RUN_DIR}
+mkdir -p ${MEMBER_RUN_DIR}
 
-  # Symlink input files from the main input directory
-  for f in ${SIMULATION_INPUT_DIR}/*; do
-    ln -sf $f ${SIMULATION_DIR}/${MEMBER_DIR}/run/$(basename $f)
-  done
+# Symlink all files from the master input directory
+for f in ${SIMULATION_INPUT_DIR}/*; do
+  ln -sf $f ${MEMBER_RUN_DIR}/$(basename $f)
+done
 
-  # Symlink namelist files
-  for f in data data.cal data.exf data.kpp data.mnc data.obcs data.pkg data.diagnostics eedata; do
-    ln -sf ${SIMULATION_INPUT_DIR}/$f ${SIMULATION_DIR}/${MEMBER_DIR}/run/$f 2>/dev/null
-  done
+# Remove symlinks for IC files — these will be member-specific copies
+for f in ${IC_FILES}; do
+  rm -f ${MEMBER_RUN_DIR}/$f
+done
 
-  # Override pickup with the member's perturbed pickup
-  ln -sf ${SIMULATION_DIR}/${MEMBER_DIR}/pickup.*.data ${SIMULATION_DIR}/${MEMBER_DIR}/run/
-  ln -sf ${SIMULATION_DIR}/${MEMBER_DIR}/pickup.*.meta ${SIMULATION_DIR}/${MEMBER_DIR}/run/
+# Copy member's perturbed IC files (created by breed_vectors.py)
+for f in ${IC_FILES}; do
+  if [[ -f ${MEMBER_DIR}/$f ]]; then
+    cp ${MEMBER_DIR}/$f ${MEMBER_RUN_DIR}/$f
+  else
+    echo "WARNING: ${MEMBER_DIR}/$f not found"
+  fi
+done
 
-  # Create a member-specific data file with correct nIter0 and nTimeSteps
-  sed "s/nIter0=.*/nIter0=${NITER0},/" ${SIMULATION_INPUT_DIR}/data > ${SIMULATION_DIR}/${MEMBER_DIR}/run/data
+# Copy namelist files from beegfs (latest config, not stale local copy)
+for f in data.cal data.exf data.kpp data.mnc data.obcs data.pkg data.diagnostics eedata; do
+  cp ${SIMULATION_DIR}/input/$f ${MEMBER_RUN_DIR}/$f 2>/dev/null
+done
 
-  echo "Done."
-fi
+# Generate member-specific 'data' file:
+#   nIter0=0, nTimeSteps=NTIMESTEPS, single pickup at end
+CYCLE_SECONDS=$((NTIMESTEPS * 360))
+cat ${SIMULATION_DIR}/input/data | \
+  sed -e "s/^ nIter0=.*/ nIter0=0,/" \
+      -e "s/^ endTime=.*/ nTimeSteps=${NTIMESTEPS},/" \
+      -e "s/^ pChkptFreq=.*/ pChkptFreq=${CYCLE_SECONDS}.0,/" \
+      -e "s/^ chkptFreq=.*/ chkptFreq=0.0,/" \
+      -e "s/^ dumpFreq=.*/ dumpFreq=0.0,/" \
+  > ${MEMBER_RUN_DIR}/data
+
+echo "--- Member data file (key params) ---"
+grep -E '^ nIter0|^ nTimeSteps|^ pChkptFreq|^ chkptFreq|^ deltaT|^ dumpFreq' ${MEMBER_RUN_DIR}/data
+echo "--------------------------------------"
 
 ###############################################################################
-# Run MITgcm for this member
+# Run MITgcm
 ###############################################################################
-cd ${SIMULATION_DIR}/${MEMBER_DIR}/run
+cd ${MEMBER_RUN_DIR}
 
 srun --mpi=pmix \
   --container-image=$MITGCM_BASE_IMG \
-  --container-mounts=${SIMULATION_INPUT_DIR}:/input,${SIMULATION_DIR}:/workspace:rw \
-  --container-env=MEMBER_DIR,NITER0 \
+  --container-mounts=${SIMULATION_INPUT_DIR}:/input:ro,${SIMULATION_DIR}:/workspace:rw \
   /opt/mitgcm/mitgcmuv
```
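The sed pipeline that generates the member-specific `data` namelist can be mirrored in Python for clarity. This is a sketch over a toy namelist string, not the production script; `member_data_namelist` is a hypothetical name, and the leading-space anchor matches the namelist formatting the sed expressions assume:

```python
import re


def member_data_namelist(text: str, n_timesteps: int, dt_seconds: int = 360) -> str:
    """Rewrite a MITgcm 'data' namelist for one breeding cycle:
    start from iteration 0, run n_timesteps steps, write a single
    permanent pickup at the cycle end, and disable rolling
    checkpoints and dumps."""
    cycle_seconds = n_timesteps * dt_seconds
    subs = [
        (r"(?m)^ nIter0=.*$", " nIter0=0,"),
        # endTime is replaced outright by an nTimeSteps line, as in the script
        (r"(?m)^ endTime=.*$", f" nTimeSteps={n_timesteps},"),
        (r"(?m)^ pChkptFreq=.*$", f" pChkptFreq={cycle_seconds}.0,"),
        (r"(?m)^ chkptFreq=.*$", " chkptFreq=0.0,"),
        (r"(?m)^ dumpFreq=.*$", " dumpFreq=0.0,"),
    ]
    for pattern, repl in subs:
        text = re.sub(pattern, repl, text)
    return text
```

Setting `pChkptFreq` equal to the cycle length is what makes exactly one permanent pickup appear, at iteration `n_timesteps`.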

simulations/glorysv12-curvilinear/workflows/run.sh

Lines changed: 19 additions & 2 deletions

```diff
@@ -7,15 +7,16 @@
 #SBATCH --output=%x-%A.out
 #SBATCH --error=%x-%A.out
 
-export RUN_DIR="test-run-03252026/"
+export RUN_DIR="${RUN_DIR:-test-run-03252026/}"
 
 if [ -n "${SLURM_JOB_ID:-}" ]; then
   SCRIPT_PATH=$(scontrol show job "$SLURM_JOB_ID" --json | jq -r '.jobs[0].command' )
   SCRIPT_DIR=$(dirname "$(readlink -f "$SCRIPT_PATH")")
-  SIMULATION_DIR=$(dirname $SCRIPT_DIR)
+  SIMULATION_DIR="${SIMULATION_DIR:-$(dirname $SCRIPT_DIR)}"
 else
   # Fallback for when running the script outside of a Slurm job
   SCRIPT_DIR=$(dirname "$(readlink -f "$0")")
+  SIMULATION_DIR="${SIMULATION_DIR:-$(dirname $SCRIPT_DIR)}"
 fi
 
 source $SCRIPT_DIR/env.sh
@@ -25,9 +26,25 @@ echo ""
 echo " Using simulation directory : ${SIMULATION_DIR}"
 echo " Using run directory        : ${RUN_DIR}"
 echo " Using MITgcm base image    : ${MITGCM_BASE_IMG}"
+echo " SLURM Job ID               : ${SLURM_JOB_ID}"
 echo ""
 echo "======================================="
 
+# Write job ID into the run directory so the dashboard can find it
+mkdir -p ${RUN_DIR}
+echo ${SLURM_JOB_ID} > ${RUN_DIR}/slurm_job_id
+
+###############################################################################
+# Sync namelist files (data*) from beegfs to local input directory
+# This ensures the local disk copy always has the latest configuration
+###############################################################################
+echo "-------------------------------------"
+echo " > Syncing namelist files to local input directory..."
+cp -v ${SIMULATION_DIR}/input/data* ${SIMULATION_INPUT_DIR}/ 2>/dev/null
+cp -v ${SIMULATION_DIR}/input/eedata ${SIMULATION_INPUT_DIR}/ 2>/dev/null
+echo " > Done syncing."
+echo "-------------------------------------"
+
 ###############################################################################
 # Set up run directory
 ###############################################################################
```
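The dashboard-side run discovery described in the commit message (every directory containing `STDOUT.0000` is a run; its job id comes from the `slurm_job_id` file written by run.sh) can be sketched as follows. The flat one-level directory layout and the `discover_runs` name are assumptions, not taken from the actual dashboard code:

```python
from pathlib import Path


def discover_runs(simulation_dir: str) -> dict:
    """Map run-directory name -> SLURM job id string (or None).

    A 'run' is any immediate subdirectory of simulation_dir that
    contains an MITgcm STDOUT.0000; the job id is read from the
    slurm_job_id file when present.
    """
    runs = {}
    for stdout in sorted(Path(simulation_dir).glob("*/STDOUT.0000")):
        run_dir = stdout.parent
        job_file = run_dir / "slurm_job_id"
        job_id = job_file.read_text().strip() if job_file.exists() else None
        runs[run_dir.name] = job_id
    return runs
```

A `?run=` query parameter on each endpoint would then select one key of this mapping.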
