NUOPC cap fails to handle parallel restart files #22

@minghangli-uni

Description

When MOM6 is run through the NUOPC driver with PARALLEL_RESTARTFILES = True, it successfully writes the per-rank
restart slices (this will be enabled once payu-org/payu#601 is merged):

access-om3.mom6.r.1900-01-02-00000.nc.0000
access-om3.mom6.r.1900-01-02-00000.nc.0001
...

but rpointer.ocn contains only the basename, e.g. access-om3.mom6.r.1900-01-02-00000.nc.
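
To make the mismatch concrete, here is a minimal Python sketch (not part of MOM6 or the NUOPC cap) that maps the basename in a pointer file to the per-rank slices on disk. It assumes the pointer file holds a single filename, that the slices sit in the same directory, and that the rank suffix is four digits as in the listing above; `find_restart_slices` is a hypothetical helper name.

```python
import re
from pathlib import Path


def find_restart_slices(pointer_file: Path) -> list[Path]:
    """Return the per-rank slices matching the basename in the pointer file.

    Assumptions (illustrative only): the pointer file's first token is the
    restart basename, and slices are named "<basename>.NNNN" in the same
    directory as the pointer file.
    """
    basename = Path(pointer_file.read_text().split()[0]).name
    # Match exactly "<basename>." followed by a four-digit rank suffix.
    pattern = re.compile(re.escape(basename) + r"\.\d{4}$")
    run_dir = pointer_file.parent
    return sorted(p for p in run_dir.iterdir() if pattern.match(p.name))
```

A check like this can help confirm that the slices exist even though the basename itself does not name any file on disk.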

On the next run, MOM6/NUOPC cannot locate the set of restart files and fails with:

WARNING: MOM_restart: Unable to find restart file : ...nc.nc
FATAL  : MOM_restart: Unable to find any restart files specified by ...

Manually editing rpointer.ocn to enumerate the .nc.000? files avoids the first fatal error, but startup then crashes with:

NetCDF: Index exceeds dimension bound   (variable: Temp)

This is because each slice is opened as a single, standalone file, so each rank assumes the file holds the whole grid and tries to read beyond the slice's local dimensions.
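
The failure mode can be sketched with a toy model in Python. This is not MOM6 or FMS code: the grid size, rank count, and the helper names `read_as_single_file` and `read_decomposed` are all made up for illustration. It assumes an even 1-D decomposition in which each slice stores only its own rank's columns.

```python
NX_GLOBAL = 8                      # hypothetical global grid width
NRANKS = 2                         # hypothetical number of ranks
NX_LOCAL = NX_GLOBAL // NRANKS     # columns stored in each slice

# Each slice holds only the columns owned by its rank.
global_temp = list(range(NX_GLOBAL))
slices = [global_temp[r * NX_LOCAL:(r + 1) * NX_LOCAL] for r in range(NRANKS)]


def read_as_single_file(slice_data, start, count):
    """Bounds-checked read that applies the given indices to the slice as-is."""
    if start < 0 or start + count > len(slice_data):
        raise IndexError("Index exceeds dimension bound")
    return slice_data[start:start + count]


def read_decomposed(slice_data, rank, start, count):
    """Translate a global start index into a slice-local offset, then read."""
    local_start = start - rank * NX_LOCAL
    return read_as_single_file(slice_data, local_start, count)
```

In this toy setup, rank 1 owns global columns 4 to 7. Calling `read_as_single_file(slices[1], 4, 4)` raises `IndexError`, because the slice has only four columns, whereas `read_decomposed(slices[1], 1, 4, 4)` succeeds because the offset is shifted into the slice's local index range.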


The current fix is to keep the basename in rpointer.ocn but let MOM6 open it in decomposed mode, so that each rank can safely read its own piece. I'll wrap this fix up in a follow-up PR.


Further discussion can be found in ACCESS-NRI/access-om3-configs#592, ACCESS-NRI/access-om3-configs#637, payu-org/payu#601, and payu-org/payu#600.

Labels: bug (Something isn't working)

Status: Done