far more realistic initial condition perturbations than random noise alone.

### Breeding cycle

Each member has its own set of initial condition files (`T.init.bin`,
`S.init.bin`, `U.init.bin`, `V.init.bin`, `Eta.init.bin`). The cycle operates
on these IC files directly:

```
Control ICs:  T.init.bin, S.init.bin, U.init.bin, V.init.bin, Eta.init.bin
      │
      │  add perturbation (breed_vectors.py init)
      ▼
Member ICs:   T.init.bin, S.init.bin, ... (in member_NNN/)
      │
      │  run MITgcm from nIter0=0 for 30 days
      ▼
Member pickup at t=30 days (pickup.0000007200.data)
      │
      │  bred_vector = member_pickup - control_pickup
      │  rescale by target_RMS / actual_T_RMS
      │  new_IC = control_IC + rescaled_bred_vector
      ▼
Member ICs:   overwritten with new perturbation → next cycle
```

Key points:
- Every cycle starts from **nIter0=0** — the member's IC files are the
  perturbation mechanism, not pickup files.
- The **same forcing, grid, and namelist files** are shared across all members
  (symlinked from the master input directory). Only the IC files differ.
- Bred vectors are computed from the **pickup at t=30 days**, which captures
  how the perturbation grew over the cycle.
- The **same rescale factor** (derived from the temperature RMS) is applied to
  all variables to preserve dynamical balance.
- For **production runs** after breeding converges, each member restarts from
  its pickup at t=30 days (iteration 7200).

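The rescale arithmetic in the diagram can be sketched on synthetic arrays. This is a minimal illustration only; the actual implementation lives in `spectre_utils/breed_vectors.py`, and `rescale_member_ic` is a hypothetical name:

```python
import numpy as np

def rescale_member_ic(control_ic, control_pickup, member_pickup, target_rms):
    """One rescale step: bred vector = difference of the 30-day pickups,
    scaled so its RMS matches target_rms, then re-added to the control IC."""
    bred = member_pickup - control_pickup
    actual_rms = np.sqrt(np.mean(bred ** 2))  # in practice computed from T only
    factor = target_rms / actual_rms          # same factor reused for S, U, V, Eta
    return control_ic + factor * bred

# Synthetic demo: a perturbation that grew during the cycle is pulled
# back to the target amplitude.
rng = np.random.default_rng(0)
control_ic = rng.normal(size=(16, 16))
control_pickup = rng.normal(size=(16, 16))
member_pickup = control_pickup + rng.normal(scale=0.4, size=(16, 16))

new_ic = rescale_member_ic(control_ic, control_pickup, member_pickup, target_rms=0.1)
new_rms = float(np.sqrt(np.mean((new_ic - control_ic) ** 2)))
print(round(new_rms, 6))  # prints 0.1 (the target RMS, by construction)
```

Because the factor is a single scalar, applying it to all variables leaves the relative structure of the perturbation (and hence its dynamical balance) untouched.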
### Design choices

vectors readjust within the first few weeks of the production run regardless.

### Prerequisites

- Completed control spinup run (1 year)
- The control must also be run from nIter0=0 for 30 days (same as the members)
  to produce the reference pickup for the bred vector computation

### Steps

```bash
cd simulations/glorysv12-curvilinear

# 1. Initialize 50 perturbed IC files from the control ICs
uv run python ../../spectre_utils/breed_vectors.py init ensemble/breed_config.yaml

# 2. Run all 50 members for one 30-day cycle (SLURM array job)
#    Each member starts from nIter0=0 with its perturbed ICs
sbatch --chdir=$(pwd) workflows/breed_vectors.sh

# 3. After all members complete — compute bred vectors and overwrite ICs
uv run python ../../spectre_utils/breed_vectors.py rescale ensemble/breed_config.yaml --cycle 1

# 4. Check convergence (per-variable RMS table)
uv run python ../../spectre_utils/breed_vectors.py status ensemble/breed_config.yaml

# 5. Repeat steps 2–4 for each cycle (--cycle 2, 3, ... 8)
```

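Step 1 (`init`) works roughly as follows. This is an illustrative sketch, not the actual `breed_vectors.py` code: the function name, the noise amplitude, and the assumption that the ICs are flat big-endian float32 (`>f4`) MITgcm `.bin` files perturbed with plain Gaussian noise are all assumptions here:

```python
import pathlib
import tempfile

import numpy as np

IC_NAMES = ["T.init.bin", "S.init.bin", "U.init.bin", "V.init.bin", "Eta.init.bin"]

def init_members(control_dir, ensemble_dir, n_members, noise_rms, seed=0):
    """Write member_NNN/ IC files as control IC + small Gaussian noise."""
    rng = np.random.default_rng(seed)
    for m in range(1, n_members + 1):
        mdir = pathlib.Path(ensemble_dir) / f"member_{m:03d}"
        mdir.mkdir(parents=True, exist_ok=True)
        for name in IC_NAMES:
            ic = np.fromfile(pathlib.Path(control_dir) / name, dtype=">f4")
            perturbed = ic + rng.normal(scale=noise_rms, size=ic.shape)
            perturbed.astype(">f4").tofile(mdir / name)

# Demo on tiny zero-filled control ICs
with tempfile.TemporaryDirectory() as tmp:
    ctrl = pathlib.Path(tmp) / "control"
    ctrl.mkdir()
    for name in IC_NAMES:
        np.zeros(16, dtype=">f4").tofile(ctrl / name)
    init_members(ctrl, pathlib.Path(tmp) / "ensemble", n_members=2, noise_rms=0.01)
    t = np.fromfile(pathlib.Path(tmp) / "ensemble" / "member_001" / "T.init.bin",
                    dtype=">f4")
    perturbation_rms = float(np.sqrt(np.mean(t ** 2)))
print(perturbation_rms)  # small but nonzero, on the order of noise_rms
```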
### Running the control alongside members

The control must also produce a pickup at iteration 7200 (30 days from nIter0=0)
for the bred vector computation. This can be done in one of two ways:
- As a separate single run with the same `data` settings as the members
- On one of the 17 compute nodes alongside 2 members (3 simulations per node)

### Transitioning to production

After breeding converges (cycle 5–8):
1. Each member has a pickup at `member_NNN/run/pickup.0000007200.data`
2. Copy this pickup to the member's production run directory
3. Set `nIter0=7200` and the full production `endTime`/`nTimeSteps`
4. Run the production ensemble

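Concretely, step 3 amounts to a `data` namelist (PARM03) change along these lines. This is a sketch only; the production run length and checkpoint frequencies depend on the target configuration and are left as placeholders:

```
 &PARM03
# Restart from the breeding pickup at iteration 7200 (t = 30 days)
 nIter0=7200,
# Replace with the real production run length
 nTimeSteps=<production length>,
 &
```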
## GCP deployment

Each member directory (`member_001/` through `member_050/`) contains:
- `T.init.bin`, `S.init.bin`, `U.init.bin`, `V.init.bin`, `Eta.init.bin`

To run on GCP:
1. Copy the master input deck to each compute node's local disk (one copy per node)
2. Copy each member's IC files to the node
3. Set up the member run directory: symlink the master input, then replace the IC symlinks with copies
4. Run MITgcm with `nIter0=0`, `nTimeSteps=7200`
5. After all members finish, copy the pickups back and run the `rescale` step

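Steps 2–3 can be sketched as a small setup helper. This is hypothetical glue code, not part of the repository; directory names are stand-ins for the real master input and member paths:

```python
import pathlib
import shutil
import tempfile

IC_NAMES = {"T.init.bin", "S.init.bin", "U.init.bin", "V.init.bin", "Eta.init.bin"}

def setup_run_dir(master_input, member_dir, run_dir):
    """Symlink shared input (forcing, grid, OBC) into the run dir, but copy
    the member's IC files so each member sees its own perturbation."""
    run_dir = pathlib.Path(run_dir)
    run_dir.mkdir(parents=True, exist_ok=True)
    for f in pathlib.Path(master_input).iterdir():
        if f.name not in IC_NAMES:               # shared input: symlink
            (run_dir / f.name).symlink_to(f)
    for name in IC_NAMES:                        # member ICs: real copies
        shutil.copy2(pathlib.Path(member_dir) / name, run_dir / name)

# Demo with stand-in files
with tempfile.TemporaryDirectory() as tmp:
    tmp = pathlib.Path(tmp)
    (tmp / "input").mkdir()
    (tmp / "member_001").mkdir()
    (tmp / "input" / "forcing.bin").write_bytes(b"\x00")
    for name in IC_NAMES:
        (tmp / "member_001" / name).write_bytes(b"\x00")
    setup_run_dir(tmp / "input", tmp / "member_001", tmp / "member_001" / "run")
    forcing_is_link = (tmp / "member_001" / "run" / "forcing.bin").is_symlink()
    ic_is_link = (tmp / "member_001" / "run" / "T.init.bin").is_symlink()
print(forcing_is_link, ic_is_link)  # prints True False
```

Copying (rather than symlinking) the IC files matters because the `rescale` step overwrites them between cycles per member.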
### Cluster configuration

| Role | Machine type | Count | Purpose |
|------|--------------|-------|---------|
| Login | n1-standard-2 | 1 | SSH access, job submission |
| Controller | n1-standard-2 | 1 | Slurm controller |

### Cost estimate

| Item | On-demand | Spot (~70% discount) |
|------|-----------|----------------------|
| n1-standard-2 × 2 × 24 hrs @ $0.095/hr | $5 | $5 |
| **Total** | **~$3,400** | **~$1,100** |

### Local disk per node

| Data | Size |
|------|------|
| EXF forcing (8 variables × 54 GB) | 432 GB |
| OBC boundary files | 20 GB |
| Grid, bathymetry, other input | 5 GB |
| Member IC files (3 members × 5 files × 130 MB) | 2 GB |
| Output headroom (pickups) | 40 GB |
| **Total** | **~500 GB** |

## Configuration

consider a shorter `cycle_length_days` to accelerate convergence.

```
ensemble/
├── breed_config.yaml              # Breeding parameters
├── convergence.json               # Per-cycle RMS diagnostics (written by rescale)
├── README.md                      # This file
├── member_001/                    # Member 1
│   ├── T.init.bin                 # Perturbed temperature IC
│   ├── S.init.bin                 # Perturbed salinity IC
│   ├── U.init.bin                 # Perturbed zonal velocity IC
│   ├── V.init.bin                 # Perturbed meridional velocity IC
│   ├── Eta.init.bin               # Perturbed SSH IC
│   └── run/                       # MITgcm run directory (created by breed_vectors.sh)
│       ├── *.bin → /input/        # Symlinks to master input (forcing, grid, OBC)
│       ├── T.init.bin             # Copied (not symlinked) from member dir
│       ├── data                   # Member-specific (nIter0=0, nTimeSteps=7200)
│       └── pickup.0000007200.data # Output: state at t=30 days
├── member_002/
│   └── ...
└── member_050/
```