perf: OpenMP fork-join consolidation, SIMD hints, I/O timer-attribution barrier by JanStreffing · Pull Request #897 · FESOM/fesom2

JanStreffing · 2026-04-27T15:14:10Z

Summary

Cuts FESOM-side timestep overhead by reducing the OpenMP fork-join count and giving the compiler explicit vectorization hints in the hottest loops. Also fixes timer attribution in the I/O block.

Theme A — adjacent `!$OMP PARALLEL DO` regions merged into one `!$OMP PARALLEL` with multiple `!$OMP DO` blocks

Each pair previously paid two fork-joins per timestep; now one. Touched:

oce_ale_mixing_kpp — KPP coefficient setup + boundary-layer compute
oce_ale_tracer — Fer_GM bolus add / subtract
oce_ale_vel_rhs — RHS build + ke_cor diagnostic
oce_tracer_mod — init_tracers_AB + AB interpolation
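The merge pattern can be sketched as follows. This is a minimal illustration in C rather than Fortran, assuming `#pragma omp parallel for` and `#pragma omp parallel` / `#pragma omp for` as the C counterparts of `!$OMP PARALLEL DO` and `!$OMP PARALLEL` / `!$OMP DO`. The function names and loop bodies are hypothetical and are not taken from the FESOM sources.

```c
#include <assert.h>

/* Before: two separate parallel regions, so two fork-joins per call. */
void update_split(double *a, double *b, int n) {
    assert(n >= 0);

    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        a[i] = 2.0 * a[i];
    }

    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        b[i] = a[i] + 1.0;
    }
}

/* After: one parallel region containing two worksharing loops,
 * so the thread team is forked and joined once per call. */
void update_merged(double *a, double *b, int n) {
    assert(n >= 0);

    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < n; i++) {
            a[i] = 2.0 * a[i];
        }
        /* The implicit barrier at the end of the first "omp for"
         * guarantees a[] is complete before the second loop reads it. */
        #pragma omp for
        for (int i = 0; i < n; i++) {
            b[i] = a[i] + 1.0;
        }
    }
}
```

Both variants should produce identical results; only the number of parallel regions differs. Whether the saved fork-join is measurable depends on the compiler and OpenMP runtime.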

Theme B — `io_meandata.update_means`

Replaces the per-stream PARALLEL DO with a single outer PARALLEL DO SCHEDULE(dynamic) over io_NSTREAMS, lifts the size() calls out of the inner loop into szI/szJ, and adds !$OMP SIMD on the inner I axis so the compiler can vectorize the float adds. At HR, with ~50 streams and ~6700 calls per simulated month, the original code spawned roughly 50 × 6700 ≈ 335,000 OMP regions per rank per month; now each call pays one outer fork-join instead of one per stream.
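A minimal sketch of that loop structure, written in C as an analog of the Fortran directives. The `stream_t` layout, its field names, and `update_means_sketch` are hypothetical stand-ins for illustration; they do not reproduce the actual `io_meandata` data structures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one mean-output stream. */
typedef struct {
    int          n_i;        /* extent of the contiguous inner axis   */
    int          n_j;        /* extent of the outer axis              */
    float       *local;      /* n_j * n_i running sums                */
    const float *field;      /* n_j * n_i instantaneous values        */
    int          addcounter; /* number of samples accumulated so far  */
} stream_t;

void update_means_sketch(stream_t *streams, int n_streams) {
    assert(n_streams >= 0);

    /* One parallel region for all streams. The dynamic schedule hands
     * streams to threads one at a time, which helps when stream sizes
     * differ a lot. */
    #pragma omp parallel for schedule(dynamic)
    for (int s = 0; s < n_streams; s++) {
        stream_t *st = &streams[s];

        /* Extents read once per stream, outside the inner loops. */
        const int sz_i = st->n_i;
        const int sz_j = st->n_j;

        for (int j = 0; j < sz_j; j++) {
            float       *acc = st->local + (size_t)j * (size_t)sz_i;
            const float *src = st->field + (size_t)j * (size_t)sz_i;

            /* Vectorization hint on the contiguous axis. */
            #pragma omp simd
            for (int i = 0; i < sz_i; i++) {
                acc[i] += src[i];
            }
        }
        st->addcounter += 1;
    }
}
```

Each stream is touched by exactly one thread, so the accumulation needs no further synchronization in this sketch.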

Theme C — `io_xios` mask helpers

io_xios_apply_wet_2d / _2d_elem / _3d (_r4 and _r8) and io_xios_apply_ice_mask_2d_elem now run their fill loops under !$OMP PARALLEL DO. The size() / min() calls are lifted out of the loop header into nn / ne so they are not re-evaluated on every iteration.
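The shape of such a fill loop might look like the sketch below, again in C as an analog. The function name, the argument list, and the convention that a mask value of 0 means "dry, overwrite with the fill value" are assumptions made for illustration, not the actual `io_xios` interface.

```c
#include <assert.h>

/* Hypothetical mask helper: overwrite dry points with a fill value. */
void apply_wet_mask_sketch(double *field, int field_len,
                           const int *wet, int wet_len, double fill) {
    assert(field_len >= 0 && wet_len >= 0);

    /* Loop bound computed once, before the parallel loop
     * (the analog of lifting size() / min() into nn). */
    const int nn = field_len < wet_len ? field_len : wet_len;

    #pragma omp parallel for
    for (int n = 0; n < nn; n++) {
        if (!wet[n]) {
            field[n] = fill;
        }
    }
}
```

Iterations are independent, so the loop parallelizes without reductions or locks. Entries beyond the shorter of the two arrays are left untouched.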

Theme D — `fesom_module`

Insert one MPI_Barrier(MPI_COMM_FESOM) at the end of the I/O block, immediately before f%t5 = MPI_Wtime(). XIOS-client / OASIS asymmetry stalls that would otherwise be absorbed by the first halo exchange in ice_timestep are now captured in rtime_write_means instead of leaking into rtime_fullice. No wall-time change — the wait happens anyway, this just makes timer accounting honest.
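The attribution shift can be illustrated with a toy model in plain C (no MPI required). It assumes, as a simplification, that the first halo exchange behaves like a full synchronization across ranks; `phase_timers`, `rank_timers`, and the input numbers are hypothetical and only mimic what `rtime_write_means` and `rtime_fullice` would record on one rank.

```c
#include <assert.h>

/* What one rank's timers would record for the two phases. */
typedef struct {
    double write_means; /* time attributed to the I/O block   */
    double fullice;     /* time attributed to the ice timestep */
} phase_timers;

/* Latest I/O completion time over all ranks. */
static double max_of(const double *x, int n) {
    double m = x[0];
    for (int i = 1; i < n; i++) {
        if (x[i] > m) m = x[i];
    }
    return m;
}

/* io_done[r]: time at which rank r finishes its own I/O work.
 * ice_work:   pure compute time of the ice step on every rank.
 * barrier_before_t5: nonzero models the barrier added before t5. */
phase_timers rank_timers(const double *io_done, int n_ranks, int rank,
                         double ice_work, int barrier_before_t5) {
    assert(n_ranks > 0 && rank >= 0 && rank < n_ranks);

    const double sync = max_of(io_done, n_ranks);
    phase_timers t;
    if (barrier_before_t5) {
        /* Every rank waits inside the I/O block for the slowest rank. */
        t.write_means = sync;
        t.fullice     = ice_work;
    } else {
        /* The wait happens later, inside the first halo exchange. */
        t.write_means = io_done[rank];
        t.fullice     = (sync - io_done[rank]) + ice_work;
    }
    return t;
}
```

In this model the sum of the two timers is the same with and without the barrier; only the split between them changes, which matches the "no wall-time change" statement above.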

Theme E — whitespace

oce_ale.F90 and oce_setup_step.F90 carry trailing-whitespace cleanup that landed on the same lines as the OMP rework; pulled in to keep the OMP commits hunk-clean.

…tion barrier

Cuts FESOM-side timestep overhead by reducing OpenMP fork-join count and giving the compiler explicit vectorization hints in the hottest streams in tracers, dynamics, ALE mixing, and the mean-output bookkeeping.

Theme A: combine adjacent !$OMP PARALLEL DO regions into a single !$OMP PARALLEL with multiple !$OMP DO blocks. Each pair previously paid two fork-joins per timestep; now one. Touched: oce_ale_mixing_kpp (KPP coefficient + boundary-layer setup), oce_ale_tracer (Fer_GM bolus add / subtract), oce_ale_vel_rhs (rhs build + ke_cor diagnostic), oce_tracer_mod (init_tracers_AB + AB interpolation).

Theme B: io_meandata.update_means: replaced the per-stream PARALLEL DO with one outer PARALLEL DO over io_NSTREAMS and lifted the size() calls out of the inner loop into szI/szJ; added !$OMP SIMD on the inner I-axis so the compiler can vectorize the float adds. At HR the original code spawned ~50 streams * ~6700 calls/sim-month worth of OMP regions per rank per month; one outer fork-join replaces the lot.

Theme C: io_xios mask helpers: io_xios_apply_wet_2d / 2d_elem / 3d (_r4 and _r8) and io_xios_apply_ice_mask_2d_elem now run their fill-loops under !$OMP PARALLEL DO. Lifts size() / min() out of the loop header (nn / ne) so OMP doesn't have to re-evaluate every iteration.

Theme D: fesom_module: insert one MPI_Barrier at the end of the I/O block before f%t5 = MPI_Wtime(). XIOS-client / OASIS asymmetry stalls that would otherwise be absorbed by the first halo exchange in ice_timestep are now captured in rtime_write_means instead of leaking into rtime_fullice. No wall-time change, only honest accounting.

Theme E: oce_ale.F90 and oce_setup_step.F90: trailing-whitespace cleanup that landed on the same lines as the OMP rework; pulled in to keep the OMP commits hunk-clean.

patrickscholz · 2026-04-28T09:09:04Z

Does this bring anything in terms of performance? Most of this looks cosmetic to me.

suvarchal · 2026-05-05T07:26:52Z

Unfortunately, I have noticed that the performance effect of any OpenMP change is compiler-dependent. I guess we need performance numbers, and the reproducibility tests need to be checked with more than one thread. That said, SIMD hints are great: with Intel compilers they can give about twice the performance or more for the same kernel.

JanStreffing · 2026-05-05T07:33:41Z

> Unfortunately, I have noticed that the performance effect of any OpenMP change is compiler-dependent. I guess we need performance numbers, and the reproducibility tests need to be checked with more than one thread. That said, SIMD hints are great: with Intel compilers they can give about twice the performance or more for the same kernel.

Agreed. I have shelved this for now. To be tested later.

suvarchal · 2026-05-05T07:56:01Z

@JanStreffing this looks like a very good PR. Can we separate it into two parts? 1. The I/O-related changes (io_meandata and XIOS), which can go in already because they are easy to test. 2. The other OpenMP changes in the core code (oce_ale, tracer, etc.), which need a careful look at reproducibility; the review comment from @patrickscholz can also be dealt with later.

JanStreffing requested review from patrickscholz and suvarchal and removed request for suvarchal April 27, 2026 17:38

JanStreffing self-assigned this Apr 27, 2026

JanStreffing added the enhancement New feature or request label Apr 27, 2026

patrickscholz reviewed Apr 28, 2026

Comment thread src/oce_tracer_mod.F90

JanStreffing marked this pull request as draft May 3, 2026 19:37

JanStreffing wants to merge 1 commit into main from perf/fesom-omp-and-vectorization

3 participants
