perf: OpenMP fork-join consolidation, SIMD hints, I/O timer-attribution barrier #897
JanStreffing wants to merge 1 commit into
…tion barrier

Cuts FESOM-side timestep overhead by reducing OpenMP fork-join count and giving the compiler explicit vectorization hints in the hottest streams in tracers, dynamics, ALE mixing, and the mean-output bookkeeping.

Theme A — combine adjacent !$OMP PARALLEL DO regions into a single !$OMP PARALLEL with multiple !$OMP DO blocks. Each pair previously paid two fork-joins per timestep; now one. Touched: oce_ale_mixing_kpp (KPP coefficient + boundary-layer setup), oce_ale_tracer (Fer_GM bolus add / subtract), oce_ale_vel_rhs (rhs build + ke_cor diagnostic), oce_tracer_mod (init_tracers_AB + AB interpolation).

Theme B — io_meandata.update_means: replaced the per-stream PARALLEL DO with one outer PARALLEL DO over io_NSTREAMS and lifted the size() calls out of the inner loop into szI/szJ; added !$OMP SIMD on the inner I-axis so the compiler can vectorize the float adds. At HR the original code spawned ~50 streams * ~6700 calls/sim-month worth of OMP regions per rank per month; one outer fork-join replaces the lot.

Theme C — io_xios mask helpers: io_xios_apply_wet_2d / 2d_elem / 3d (_r4 and _r8) and io_xios_apply_ice_mask_2d_elem now run their fill-loops under !$OMP PARALLEL DO. Lifts size() / min() out of the loop header (nn / ne) so OMP doesn't have to re-evaluate every iteration.

Theme D — fesom_module: insert one MPI_Barrier at the end of the I/O block before f%t5 = MPI_Wtime(). XIOS-client / OASIS asymmetry stalls that would otherwise be absorbed by the first halo exchange in ice_timestep are now captured in rtime_write_means instead of leaking into rtime_fullice. No wall-time change, only honest accounting.

Theme E — oce_ale.F90 and oce_setup_step.F90: trailing-whitespace cleanup that landed on the same lines as the OMP rework; pulled in to keep the OMP commits hunk-clean.
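To make the Theme D reasoning concrete, here is a minimal standalone sketch, not the FESOM code. It assumes a generic MPI program on `MPI_COMM_WORLD` (the PR itself uses `MPI_COMM_FESOM`), and the phase names, the 0.5 s busy-wait, and the variable names are all hypothetical. It only illustrates how a barrier placed before a timer stamp charges the waiting time to the phase that caused it.

```fortran
! Sketch only: barrier before a timer stamp, so that waiting time is charged
! to the phase that caused it. Phase names and timings are hypothetical.
program barrier_timer_sketch
  use mpi
  implicit none
  integer :: ierr, rank
  double precision :: t_start, t_io_end, t_next_end, t_spin

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  t_start = MPI_Wtime()

  ! Stand-in for an asymmetric I/O phase: only rank 0 is slow here.
  if (rank == 0) then
     t_spin = MPI_Wtime()
     do while (MPI_Wtime() - t_spin < 0.5d0)
     end do
  end if

  ! Without this barrier the fast ranks would stamp t_io_end immediately and
  ! then wait for rank 0 inside the next phase's first communication.
  call MPI_Barrier(MPI_COMM_WORLD, ierr)
  t_io_end = MPI_Wtime()

  ! Stand-in for the next phase, which begins with a synchronizing call.
  call MPI_Barrier(MPI_COMM_WORLD, ierr)
  t_next_end = MPI_Wtime()

  print '(a,i0,a,f6.3,a,f6.3,a)', 'rank ', rank, ': io = ', &
        t_io_end - t_start, ' s, next = ', t_next_end - t_io_end, ' s'

  call MPI_Finalize(ierr)
end program barrier_timer_sketch
```

Run on two or more ranks, every rank should report the stall under `io`. Removing the first barrier should move the stall into `next` on the fast ranks while leaving total wall time roughly unchanged.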
Does this bring anything in terms of performance? Most of this looks like cosmetics to me.
Unfortunately, I have noticed that the performance effect of any OpenMP change is compiler-dependent. I guess we need performance numbers, plus reproducibility tests run with more than one thread. That said, SIMD hints are great for getting about twice the performance or more with Intel compilers for the same kernel.
Agreed. I have shelved this for now. To be tested later. |
@JanStreffing this looks like a very good PR. Can we separate it into two parts: (1) the I/O changes (io_meandata and xios), which can go in already because they are easy to test, and (2) the other OpenMP changes in core code (oce_ale, tracer, etc.), which need a careful look for reproducibility? The review comment from @patrickscholz can also be dealt with later.
Summary
Cuts FESOM-side timestep overhead by reducing OpenMP fork-join count and giving the compiler explicit vectorization hints in the hottest loops. Also fixes timer attribution in the I/O block.
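The fork-join consolidation can be pictured with a small standalone sketch. It is not taken from the FESOM sources: the arrays `a` and `b`, their size, and the loop bodies are hypothetical. It shows two adjacent `!$OMP PARALLEL DO` regions rewritten as one `!$OMP PARALLEL` region with two `!$OMP DO` loops, and checks that both versions give the same result.

```fortran
! Sketch only: merging two adjacent PARALLEL DO regions into one PARALLEL
! region. Array names, sizes, and loop bodies are hypothetical.
program fork_join_merge
  implicit none
  integer, parameter :: n = 100000
  real(kind=8), allocatable :: a(:), b(:), a_ref(:), b_ref(:)
  integer :: i

  allocate(a(n), b(n), a_ref(n), b_ref(n))

  ! Before: two back-to-back parallel regions, i.e. two fork-joins.
  !$OMP PARALLEL DO
  do i = 1, n
     a_ref(i) = 2.0d0 * real(i, kind=8)
  end do
  !$OMP END PARALLEL DO
  !$OMP PARALLEL DO
  do i = 1, n
     b_ref(i) = a_ref(i) + 1.0d0
  end do
  !$OMP END PARALLEL DO

  ! After: one parallel region with two worksharing loops, i.e. one fork-join.
  ! The implicit barrier at the end of the first DO is kept (no NOWAIT), so
  ! the second loop only starts once every a(i) has been written.
  !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i)
  !$OMP DO
  do i = 1, n
     a(i) = 2.0d0 * real(i, kind=8)
  end do
  !$OMP END DO
  !$OMP DO
  do i = 1, n
     b(i) = a(i) + 1.0d0
  end do
  !$OMP END DO
  !$OMP END PARALLEL

  if (any(a /= a_ref) .or. any(b /= b_ref)) error stop 'merged result differs'
  print *, 'merged and split versions agree'
end program fork_join_merge
```

The barrier between the two `!$OMP DO` loops is deliberately kept. Whether `NOWAIT` would be safe depends on the data dependencies between the real loops, which this sketch does not model.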
Theme A — adjacent `!$OMP PARALLEL DO` regions merged into one `!$OMP PARALLEL` with multiple `!$OMP DO` blocks

Each pair previously paid two fork-joins per timestep; now one. Touched:

- `oce_ale_mixing_kpp` — KPP coefficient setup + boundary-layer compute
- `oce_ale_tracer` — Fer_GM bolus add / subtract
- `oce_ale_vel_rhs` — RHS build + `ke_cor` diagnostic
- `oce_tracer_mod` — `init_tracers_AB` + AB interpolation

Theme B — `io_meandata.update_means`

Replaces the per-stream `PARALLEL DO` with a single outer `PARALLEL DO SCHEDULE(dynamic)` over `io_NSTREAMS`, lifts the `size()` calls out of the inner loop into `szI` / `szJ`, and adds `!$OMP SIMD` on the inner `I` axis so the compiler can vectorize the float adds. At HR with ~50 streams and a per-step call rate, the original code spawned a large number of OMP regions per rank per month; one outer fork-join replaces the lot.

Theme C — `io_xios` mask helpers

`io_xios_apply_wet_2d` / `_2d_elem` / `_3d` (`_r4` and `_r8`) and `io_xios_apply_ice_mask_2d_elem` now run their fill-loops under `!$OMP PARALLEL DO`. Lifts `size()` / `min()` out of the loop header (`nn` / `ne`) so OMP doesn't have to re-evaluate every iteration.

Theme D — `fesom_module`

Insert one `MPI_Barrier(MPI_COMM_FESOM)` at the end of the I/O block, immediately before `f%t5 = MPI_Wtime()`. XIOS-client / OASIS asymmetry stalls that would otherwise be absorbed by the first halo exchange in `ice_timestep` are now captured in `rtime_write_means` instead of leaking into `rtime_fullice`. No wall-time change — the wait happens anyway, this just makes timer accounting honest.

Theme E — whitespace

`oce_ale.F90` and `oce_setup_step.F90` carry trailing-whitespace cleanup that landed on the same lines as the OMP rework; pulled in to keep the OMP commits hunk-clean.
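The Theme B restructuring can be sketched in isolation as follows. This is an illustration under stated assumptions, not the `io_meandata` code: the derived type `stream_t`, its components, the stream count, and the array shapes are hypothetical, and the `I` axis is assumed to be the first (contiguous) array dimension. It shows one outer `PARALLEL DO SCHEDULE(dynamic)` over the streams, the `size()` calls hoisted into `szI` / `szJ`, and `!$OMP SIMD` on the innermost loop.

```fortran
! Sketch only: one outer parallel loop over output streams with a SIMD inner
! loop. stream_t and all sizes are hypothetical stand-ins, not FESOM types.
program update_means_sketch
  implicit none
  integer, parameter :: nstreams = 8
  type :: stream_t
     real(kind=4), allocatable :: accum(:,:)   ! running sum for the mean
     real(kind=4), allocatable :: field(:,:)   ! current model field
     integer :: addcounter = 0                 ! number of accumulated steps
  end type stream_t
  type(stream_t) :: streams(nstreams)
  integer :: n, i, j, szI, szJ

  do n = 1, nstreams
     allocate(streams(n)%accum(16*n, 40), streams(n)%field(16*n, 40))
     streams(n)%accum = 0.0
     streams(n)%field = real(n)
  end do

  ! One fork-join for all streams. Streams differ in size, hence
  ! SCHEDULE(dynamic). Each iteration writes only to its own stream.
  !$OMP PARALLEL DO SCHEDULE(dynamic) DEFAULT(SHARED) PRIVATE(n, i, j, szI, szJ)
  do n = 1, nstreams
     szI = size(streams(n)%accum, 1)   ! hoisted out of the loop nest
     szJ = size(streams(n)%accum, 2)
     do j = 1, szJ
        !$OMP SIMD
        do i = 1, szI
           streams(n)%accum(i, j) = streams(n)%accum(i, j) + streams(n)%field(i, j)
        end do
     end do
     streams(n)%addcounter = streams(n)%addcounter + 1
  end do
  !$OMP END PARALLEL DO

  do n = 1, nstreams
     if (any(streams(n)%accum /= real(n))) error stop 'unexpected accumulation'
     if (streams(n)%addcounter /= 1) error stop 'unexpected counter'
  end do
  print *, 'all streams updated once'
end program update_means_sketch
```

Parallelizing over streams rather than within each stream trades fine-grained load balance for fewer parallel regions. Whether that pays off presumably depends on the stream count, the size spread, and the compiler, which is why measurements with more than one thread were requested in the review.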