Commit 8159cf3

Add bred vector ensemble workflow and fix surface field plotter
- breed_vectors.py: init/rescale/status subcommands for bred vector ensemble generation with configurable cycle length, target amplitude, and per-variable RMS monitoring
- breed_config.yaml: 50 members, 8 cycles, 30-day cycle, 0.05°C target
- breed_vectors.sh: SLURM array job (1-50) for parallel member runs
- ensemble/README.md: methodology, GCP cost estimate (~$1.1K spot), convergence monitoring, and deployment guide
- Fix plot_surface_fields.py: correct tile layout using X/Y coordinate indices, handle MNC Z dimension naming (Zmd000050)
- Fix monitor_dashboard.py: handle query strings in /plots and /img URL paths

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 0a4664d commit 8159cf3

6 files changed

Lines changed: 749 additions & 13 deletions

ensemble/README.md
Lines changed: 195 additions & 0 deletions

# Bred Vector Ensemble Generation

This directory contains the configuration and member directories for generating
50 independent ensemble initial conditions using the **bred vector method**.

## Methodology

### Why bred vectors?

Ocean models are chaotic — small differences in initial conditions grow
exponentially, projecting onto the system's fastest-growing instability modes.
The bred vector method exploits this by:

1. Perturbing a control state with small noise
2. Running the model forward to let perturbations grow
3. Rescaling the perturbation back to a target amplitude
4. Repeating until the perturbation "locks on" to the dominant growing modes

After several breeding cycles, the perturbations are no longer random noise —
they represent physically meaningful, dynamically balanced uncertainty structures
(Gulf Stream meanders, baroclinic eddies, frontal instabilities). These are
far more realistic initial condition perturbations than random noise alone.

### Breeding cycle

```
Control:  ──────────────────────────────────────────────────────►
               │                    │
               │ perturb            │ rescale & perturb
               ▼                    ▼
Member i: ─────●════════════════════●════════════════════●───►
                   cycle 0              cycle 1            cycle 2 ...
                  (30 days)            (30 days)
```

Each cycle:

1. Member starts from `control_state + perturbation`
2. Runs forward for 30 days (configurable)
3. At end: `bred_vector = member_state - control_state`
4. Rescale factor = `target_RMS / actual_RMS` (computed from temperature)
5. **Same rescale factor applied to ALL variables** (T, S, U, V, SSH) to preserve
   geostrophic and hydrostatic balance
6. New perturbation: `control_state + rescale_factor × bred_vector` (see the sketch below)

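The rescale step boils down to a few lines of NumPy. This is a minimal sketch, not the
actual `breed_vectors.py` implementation; the dict-of-arrays layout and the variable
keys are placeholders for whatever the real pickup I/O returns.

```python
import numpy as np

TARGET_T_RMS = 0.05  # deg C, from breed_config.yaml (target_amplitude.temperature_rms)

def rescale_member(control, member):
    """Rescale one member's bred vector back to the target amplitude.

    `control` and `member` are dicts of NumPy arrays keyed by variable name
    (hypothetical layout; the real pickup reading/writing lives in breed_vectors.py).
    """
    # 1. Bred vector = member state minus control state, per variable
    bred = {v: member[v] - control[v] for v in control}

    # 2. One rescale factor, computed from the temperature field only
    actual_t_rms = np.sqrt(np.mean(bred["T"] ** 2))
    factor = TARGET_T_RMS / actual_t_rms

    # 3. Apply the SAME factor to T, S, U, V, SSH to preserve internal balance
    return {v: control[v] + factor * bred[v] for v in control}
```
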
### Design choices

**50 independent streams**: Each member has its own random seed and evolves
independently. This maximizes the diversity of growing modes captured.

**Single rescaling factor from temperature**: Rather than rescaling each variable
independently (which would break dynamical consistency), we compute one factor
from the temperature field and apply it uniformly. The bred vector's internal
balance between T, S, U, V, and SSH is preserved.

**Target amplitude 0.05°C RMS**: This is the standard for mesoscale-resolving
North Atlantic ensembles — large enough to seed growing instabilities but small
enough to remain in the linear growth regime.

**30-day cycle length**: Captures mesoscale eddy growth and Gulf Stream
instabilities (Rossby deformation timescale). Shorter cycles (7 days) emphasize
fast barotropic modes; longer cycles (60+ days) allow slower baroclinic modes.
Configurable in `breed_config.yaml`.

**8 breeding cycles**: Empirically sufficient for convergence. Monitor with
`breed_vectors.py status` — the per-variable RMS should stabilize by cycle 5–6.

## Workflow

### Prerequisites

- Completed control spinup run with at least one permanent pickup file
- Set `pickup_iter` in `breed_config.yaml` (or leave null to auto-detect latest)

### Steps

```bash
cd simulations/glorysv12-curvilinear

# 1. Initialize 50 perturbed pickups from the control state
uv run python ../../spectre_utils/breed_vectors.py init ensemble/breed_config.yaml

# 2. Run all 50 members for one breeding cycle (SLURM array job)
sbatch --chdir=$(pwd) workflows/breed_vectors.sh

# 3. After all members complete — compute bred vectors and rescale
uv run python ../../spectre_utils/breed_vectors.py rescale ensemble/breed_config.yaml --cycle 1

# 4. Check convergence (per-variable RMS table)
uv run python ../../spectre_utils/breed_vectors.py status ensemble/breed_config.yaml --cycle 1

# 5. Repeat steps 2–4 for each cycle
#    Update --cycle 2, 3, ... 8
```

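Steps 2–4 can also be driven by a small wrapper. The sketch below is illustrative only:
it assumes it is launched from the simulation directory and that `sbatch --wait` is used
to block until the array job finishes; the manual steps above remain the canonical workflow.

```python
import subprocess

CONFIG = "ensemble/breed_config.yaml"
N_CYCLES = 8  # matches breeding.n_cycles in breed_config.yaml

for cycle in range(1, N_CYCLES + 1):
    # Run all 50 members for this cycle; --wait blocks until the array job ends
    subprocess.run(["sbatch", "--wait", "workflows/breed_vectors.sh"], check=True)

    # Compute bred vectors, rescale, and print the per-variable RMS table
    for subcommand in ("rescale", "status"):
        subprocess.run(
            ["uv", "run", "python", "../../spectre_utils/breed_vectors.py",
             subcommand, CONFIG, "--cycle", str(cycle)],
            check=True,
        )
```
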
### GCP deployment

Each member directory (`member_001/` through `member_050/`) is self-contained:

- Perturbed pickup file (`.data` + `.meta`)
- `nIter0.txt` with the starting iteration

To run on GCP:

1. Copy the input deck and member pickups to each compute node's local disk
2. Each member runs standard MITgcm with the member's pickup as the restart file
3. After all members finish, copy pickups back and run the `rescale` step

## GCP Cost Estimate

### Cluster configuration

| Component  | Machine type     | Count | Purpose                                |
|------------|------------------|-------|----------------------------------------|
| Compute    | h4d-standard-192 | 17    | 3 simulations per node (64 cores each) |
| Login      | n1-standard-2    | 1     | SSH access, job submission             |
| Controller | n1-standard-2    | 1     | Slurm controller                       |

### Compute requirements per cycle

- 50 members ÷ 3 per node = **17 nodes** per cycle
- 30 sim-days at 12–20 sim-days/wall-hr = **1.5–2.5 wall hours** per cycle
- 8 cycles × 2.5 hrs = **~20 hours** total wall time (plus ~30 min rescaling between cycles)
- Total compute: 17 nodes × 20 hrs = **340 node-hours** (conservative)

### Local disk per node

| Data                                    | Size        |
|-----------------------------------------|-------------|
| EXF forcing (8 variables × 54 GB)       | 432 GB      |
| OBC boundary files                      | 20 GB       |
| Grid, bathymetry, initial conditions    | 2 GB        |
| Pickup files (3 members)                | 6 GB        |
| Output headroom (diagnostics, pickups)  | 40 GB       |
| **Total**                               | **~500 GB** |

Recommend **1 TB pd-ssd** per compute node, or local NVMe SSD if available
on the machine type.

### Cost breakdown

| Item                                        | On-demand   | Spot (~70% discount) |
|---------------------------------------------|-------------|----------------------|
| h4d-standard-192 × 340 node-hrs @ $9.64/hr  | $3,278      | $983                 |
| pd-ssd 1 TB × 17 nodes × 20 hrs @ $0.23/hr  | $78         | $78                  |
| n1-standard-2 × 2 × 24 hrs @ $0.095/hr      | $5          | $5                   |
| **Total**                                   | **~$3,400** | **~$1,100**          |

### Notes

- Spot/preemptible instances are viable since each breeding cycle is only
  1.5–2.5 hours — short enough to avoid most preemptions
- The 30-min rescaling step between cycles runs on a single node and is
  negligible cost
- Data transfer: ~500 GB input deck upload (one-time) + ~100 MB pickups per
  cycle (negligible)
- The control run must also advance 30 days per cycle to provide the reference
  state — this can run on one of the 17 compute nodes

## Configuration

All parameters are in `breed_config.yaml`:

```yaml
breeding:
  n_members: 50
  n_cycles: 8
  cycle_length_days: 30   # configurable
  target_amplitude:
    temperature_rms: 0.05 # °C
```

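For scripting against the config, the keys can be read with PyYAML. A minimal sketch
(the derived-value check just mirrors the comment in `breed_config.yaml`; none of this
is part of `breed_vectors.py` itself):

```python
import yaml

with open("ensemble/breed_config.yaml") as f:
    cfg = yaml.safe_load(f)

n_members = cfg["breeding"]["n_members"]            # 50
cycle_days = cfg["breeding"]["cycle_length_days"]   # 30
target_rms = cfg["breeding"]["target_amplitude"]["temperature_rms"]  # 0.05 °C

# Consistency check: nTimeSteps should equal cycle_length_days * 86400 / deltaT
delta_t = cfg["member_run"]["deltaT"]               # 360.0 s
expected_steps = int(cycle_days * 86400 / delta_t)  # 30 * 86400 / 360 = 7200
assert expected_steps == cfg["member_run"]["nTimeSteps"]
```
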
## Convergence monitoring

Run `breed_vectors.py status` after each cycle. You should see:

- **Cycles 1–3**: RMS ratios between variables shift as perturbations reorganize
- **Cycles 4–6**: Per-variable RMS stabilizes — bred vectors are converging
- **Cycles 7–8**: Minimal change — bred vectors have locked onto growing modes

If temperature RMS doesn't stabilize by cycle 8, increase `n_cycles` or
consider a shorter `cycle_length_days` to accelerate convergence.

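One simple way to quantify "stabilizes" is the relative change in bred-vector temperature
RMS between consecutive cycles. The sketch below uses made-up RMS values and an arbitrary
2% tolerance; substitute the numbers reported by `breed_vectors.py status`.

```python
# Hypothetical end-of-cycle (pre-rescale) temperature RMS values in °C,
# transcribed by hand from the `breed_vectors.py status` output for cycles 1..8.
t_rms_by_cycle = [0.081, 0.066, 0.058, 0.053, 0.051, 0.050, 0.050, 0.050]

TOLERANCE = 0.02  # hypothetical: <2% cycle-to-cycle change counts as converged

for cycle, (prev, curr) in enumerate(zip(t_rms_by_cycle, t_rms_by_cycle[1:]), start=2):
    rel_change = abs(curr - prev) / prev
    status = "converged" if rel_change < TOLERANCE else "still adjusting"
    print(f"cycle {cycle}: T RMS {curr:.3f} °C, change {rel_change:.1%} -> {status}")
```
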
## Directory structure

```
ensemble/
├── breed_config.yaml            # Breeding parameters
├── README.md                    # This file
├── member_001/                  # Member 1
│   ├── pickup.NNNNNNNNNN.data
│   ├── pickup.NNNNNNNNNN.meta
│   └── nIter0.txt
├── member_002/
│   └── ...
└── member_050/
    └── ...
```

ensemble/breed_config.yaml
Lines changed: 32 additions & 0 deletions

# Bred vector ensemble configuration
breeding:
  n_members: 50
  n_cycles: 8
  cycle_length_days: 30
  target_amplitude:
    temperature_rms: 0.05 # °C — rescaling factor derived from T, applied to all fields

# Model grid
grid:
  Nx: 768
  Ny: 424
  Nr: 50

# Pickup file from the control run to use as the base state
control:
  pickup_prefix: "pickup"
  # Iteration number of the pickup to use (set after spinup completes)
  pickup_iter: null
  run_dir: "../test-run-03252026"

# MITgcm run parameters for each breeding member
member_run:
  deltaT: 360.0
  # nTimeSteps = cycle_length_days * 86400 / deltaT
  # 30 * 86400 / 360 = 7200
  nTimeSteps: 7200

# Output directories
paths:
  ensemble_dir: "."
  member_prefix: "member"

workflows/breed_vectors.sh
Lines changed: 73 additions & 0 deletions

#!/bin/bash
#SBATCH --array=1-50
#SBATCH -n64
#SBATCH -c1
#SBATCH --time=12:00:00
#SBATCH --job-name=spectre_breed
#SBATCH --output=breed_%A_%a.out
#SBATCH --error=breed_%A_%a.out

# Each array task runs one breeding member.
# SLURM_ARRAY_TASK_ID = member number (1-50)

MEMBER_ID=$(printf "%03d" $SLURM_ARRAY_TASK_ID)
MEMBER_DIR="ensemble/member_${MEMBER_ID}"

if [ -n "${SLURM_JOB_ID:-}" ]; then
    SCRIPT_PATH=$(scontrol show job "$SLURM_JOB_ID" --json | jq -r '.jobs[0].command')
    SCRIPT_DIR=$(dirname "$(readlink -f "$SCRIPT_PATH")")
    SIMULATION_DIR=$(dirname $SCRIPT_DIR)
else
    SCRIPT_DIR=$(dirname "$(readlink -f "$0")")
    SIMULATION_DIR=$(dirname $SCRIPT_DIR)
fi

source $SCRIPT_DIR/env.sh

echo "======================================="
echo " Breeding member: ${MEMBER_ID}"
echo " Simulation dir: ${SIMULATION_DIR}"
echo " Member dir: ${MEMBER_DIR}"
echo "======================================="

# Read nIter0 for this member
NITER0=$(cat ${SIMULATION_DIR}/${MEMBER_DIR}/nIter0.txt)
echo "Starting from iteration: ${NITER0}"

###############################################################################
# Set up member run directory if needed
###############################################################################
if [[ ! -d "${SIMULATION_DIR}/${MEMBER_DIR}/run" ]]; then
    echo "Setting up member run directory..."
    mkdir -p ${SIMULATION_DIR}/${MEMBER_DIR}/run

    # Symlink input files from the main input directory
    for f in ${SIMULATION_INPUT_DIR}/*; do
        ln -sf $f ${SIMULATION_DIR}/${MEMBER_DIR}/run/$(basename $f)
    done

    # Symlink namelist files
    for f in data data.cal data.exf data.kpp data.mnc data.obcs data.pkg data.diagnostics eedata; do
        ln -sf ${SIMULATION_INPUT_DIR}/$f ${SIMULATION_DIR}/${MEMBER_DIR}/run/$f 2>/dev/null
    done

    # Override pickup with the member's perturbed pickup
    ln -sf ${SIMULATION_DIR}/${MEMBER_DIR}/pickup.*.data ${SIMULATION_DIR}/${MEMBER_DIR}/run/
    ln -sf ${SIMULATION_DIR}/${MEMBER_DIR}/pickup.*.meta ${SIMULATION_DIR}/${MEMBER_DIR}/run/

    # Create a member-specific data file carrying this member's nIter0.
    # Remove the shared `data` symlink first so the redirect does not write
    # through it and clobber the input-deck copy.
    rm -f ${SIMULATION_DIR}/${MEMBER_DIR}/run/data
    sed "s/nIter0=.*/nIter0=${NITER0},/" ${SIMULATION_INPUT_DIR}/data > ${SIMULATION_DIR}/${MEMBER_DIR}/run/data

    echo "Done."
fi

###############################################################################
# Run MITgcm for this member
###############################################################################
cd ${SIMULATION_DIR}/${MEMBER_DIR}/run

# Export the member context so --container-env can pass it into the container
# (unexported shell variables never reach the srun environment)
export MEMBER_DIR NITER0

srun --mpi=pmix \
    --container-image=$MITGCM_BASE_IMG \
    --container-mounts=${SIMULATION_INPUT_DIR}:/input,${SIMULATION_DIR}:/workspace:rw \
    --container-env=MEMBER_DIR,NITER0 \
    /opt/mitgcm/mitgcmuv
