A Snakemake pipeline for pre-processing CMIP6 climate model data. Given a set of CMIP6 variables and a time range, the pipeline subsets, interpolates, and regrids the data to a target grid.
The pipeline runs in four main stages for each variable and time window:
- Subset — extracts the relevant time slice from CMIP6 NetCDF files
- Interpolate (atmospheric variables only) — re-interpolates from native pressure levels to target levels
- Regrid — re-projects data to a target lat/lon grid using either xarray linear interpolation or xESMF
- Manifest — writes a summary file describing the processed outputs
Surface and atmospheric variables are handled as separate branches and can be processed in parallel. Intermediate subset files are marked temporary and cleaned up automatically.
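As an illustration of the Interpolate stage, here is a minimal sketch of linear interpolation of a single vertical profile from native to target pressure levels. It is pure Python and assumes interpolation that is linear in pressure with no extrapolation; the pipeline itself may interpolate differently (for example in log-pressure) and operates on whole arrays. The function name `interp_profile` and the sample values are hypothetical.

```python
from bisect import bisect_left


def interp_profile(native_levels, values, target_levels):
    """Linearly interpolate one vertical profile onto target pressure levels.

    native_levels: pressures (e.g. in hPa), in any order, without duplicates.
    values: one value per native level.
    target_levels: pressures to interpolate to. Targets outside the native
    range yield None instead of being extrapolated.
    """
    # Sort by pressure so bisect can locate the bracketing levels.
    pairs = sorted(zip(native_levels, values))
    levels = [p for p, _ in pairs]
    vals = [v for _, v in pairs]

    out = []
    for target in target_levels:
        if target < levels[0] or target > levels[-1]:
            out.append(None)  # no extrapolation in this sketch
            continue
        i = bisect_left(levels, target)
        if levels[i] == target:
            out.append(vals[i])  # target coincides with a native level
            continue
        p0, p1 = levels[i - 1], levels[i]
        weight = (target - p0) / (p1 - p0)
        out.append(vals[i - 1] + weight * (vals[i] - vals[i - 1]))
    return out


# Hypothetical temperature profile (K) on three native levels (hPa).
profile = interp_profile([1000, 850, 500], [288.0, 280.0, 252.0], [925, 700, 300])
print(profile)
```

The 300 hPa target lies outside the native range, so the sketch returns `None` for it rather than guessing a value.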
The pipeline uses a conda environment defined in environment.yml. micromamba is recommended for fast installs:
micromamba env create -f environment.yml
micromamba activate regrid
Or with conda:
conda env create -f environment.yml
conda activate regrid
Key dependencies include Python 3.12, xarray, dask, scipy, netCDF4, and optionally xESMF for conservative/bilinear regridding.
Copy and edit the example config before running:
cp config/preprocess.example.yaml config/preprocess.yaml
The config file controls:
- cmip6: path to the CMIP6 data root and DRS identifiers (activity, institution, source, experiment, member, etc.)
- selection: time range, variables to process, target pressure levels, and optional time window frequency
- regridding: regridding engine (xarray_interp or xesmf), method, and target grid (either resolution_degrees or explicit lat/lon counts)
- runtime: dask chunking, weight caching, log level
- outputs: paths for intermediate and final outputs
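To make the resolution_degrees option more concrete, the sketch below builds a global grid of cell centers from a single resolution value. The cell-centered layout, the 0 to 360 longitude convention, and the function name `target_grid` are assumptions made for illustration; the pipeline's own grid construction may differ.

```python
def target_grid(resolution_degrees):
    """Return (lats, lons) cell centers for a global regular grid.

    Assumes the resolution divides 180 evenly, latitudes run south to
    north, and longitudes start at 0 (all illustrative choices).
    """
    n_lat = round(180 / resolution_degrees)
    n_lon = round(360 / resolution_degrees)
    half = resolution_degrees / 2
    lats = [-90 + half + i * resolution_degrees for i in range(n_lat)]
    lons = [half + j * resolution_degrees for j in range(n_lon)]
    return lats, lons


lats, lons = target_grid(2.5)
print(len(lats), len(lons))  # number of latitude and longitude cell centers
```

Under these assumptions, a 2.5 degree resolution corresponds to explicit counts of 72 latitudes by 144 longitudes.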
Run with Snakemake from the repo root:
snakemake --cores 4
To do a dry run first:
snakemake --cores 4 --dry-run
To use more parallelism, increase --cores. Each regrid rule uses 4 threads internally.
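The interaction between --cores and the per-rule thread count can be estimated with a little arithmetic. Snakemake reduces a rule's threads to the number of available cores when fewer are provided, and it runs jobs concurrently only while their combined threads fit within --cores. The helper below is a hypothetical illustration of that bookkeeping, not part of the pipeline, and it ignores other jobs competing for the same cores.

```python
def concurrent_regrid_jobs(cores, threads_per_job=4):
    """Estimate how many regrid jobs can run at the same time."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    # Snakemake scales a rule's threads down when fewer cores are available.
    effective_threads = min(threads_per_job, cores)
    return cores // effective_threads


for cores in (2, 4, 8, 16):
    print(cores, "cores ->", concurrent_regrid_jobs(cores), "regrid job(s) at once")
```

With --cores 4, only one regrid job runs at a time; --cores 8 leaves room for two.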
build/
surface.regridded/{var}/{var}_{window}.{resolution}.regridded.nc
atmos.regridded/{var}/{var}_{window}.{resolution}.regridded.nc
preprocess_manifest.txt
logs/
subset_surface/, subset_atmos/, interpolate_levels/
regrid_surface/, regrid_atmos/
manifest.log
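The regridded output paths in the layout above follow a fixed template, which can be filled in with `str.format`. In this sketch the variable name, the window label, and the resolution label are hypothetical placeholders; how the pipeline actually formats {window} and {resolution} is not specified here.

```python
from pathlib import PurePosixPath

# Template copied from the output layout; branch is "surface" or "atmos".
TEMPLATE = "{branch}.regridded/{var}/{var}_{window}.{resolution}.regridded.nc"


def regridded_path(branch, var, window, resolution):
    """Fill the output template for one variable and time window."""
    if branch not in ("surface", "atmos"):
        raise ValueError(f"unknown branch: {branch!r}")
    return PurePosixPath(
        TEMPLATE.format(branch=branch, var=var, window=window, resolution=resolution)
    )


# Hypothetical values: the window and resolution labels are made up.
path = regridded_path("surface", "tas", "2000-2009", "1deg")
print(path)
```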
python -m unittest discover tests
Tests create a synthetic CMIP6 directory structure and exercise the full pipeline end-to-end, including subsetting, interpolation, regridding, and manifest generation.
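For orientation, a synthetic CMIP6 tree of the kind the tests rely on can be built with the standard library alone. The sketch below follows the CMIP6 DRS directory and filename conventions; the identifier values are arbitrary examples, the file is an empty placeholder rather than real NetCDF, and the repository's tests may build their fixtures differently. The function name `make_synthetic_cmip6` is hypothetical.

```python
import tempfile
from pathlib import Path


def make_synthetic_cmip6(root, ids, time_range):
    """Create an empty placeholder file at its CMIP6 DRS location."""
    directory = Path(root).joinpath(
        "CMIP6",
        ids["activity_id"],
        ids["institution_id"],
        ids["source_id"],
        ids["experiment_id"],
        ids["member_id"],
        ids["table_id"],
        ids["variable_id"],
        ids["grid_label"],
        ids["version"],
    )
    directory.mkdir(parents=True)
    stem = "_".join(
        [
            ids["variable_id"],
            ids["table_id"],
            ids["source_id"],
            ids["experiment_id"],
            ids["member_id"],
            ids["grid_label"],
            time_range,
        ]
    )
    path = directory / f"{stem}.nc"
    path.touch()  # placeholder only; a real test would write NetCDF data
    return path


# Example identifiers, chosen only for illustration.
ids = {
    "activity_id": "CMIP",
    "institution_id": "NCAR",
    "source_id": "CESM2",
    "experiment_id": "historical",
    "member_id": "r1i1p1f1",
    "table_id": "Amon",
    "variable_id": "tas",
    "grid_label": "gn",
    "version": "v20190308",
}

with tempfile.TemporaryDirectory() as tmp:
    created = make_synthetic_cmip6(tmp, ids, "200001-200912")
    exists = created.is_file()
    relative = created.relative_to(tmp).as_posix()

print(relative)
```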