job_submit/lua - add "core_spec" to job_desc HPC-12614 by iddecker · Pull Request #79 · hpcugent/slurm

iddecker · 2026-03-27T17:43:03Z

No description provided.

Regression from 2586a90. Changelog: slurmctld - Avoid persistent connection hangs when enable_async_reply is configured. Issue: 50737 Ticket: 24550, 24516, 24581 Cherry-picked: e248aec

When validating a node config, if there is some problem with the configuration, the node could end up with topo_cnt greater than the actual gres configured. This is a problem because it could lead to the controller writing a corrupt state save: a topo_cnt-sized array with fewer than topo_cnt values initialized, which can cause a SIGSEGV when individual values are accessed. The fix properly updates topo_cnt when there are config errors to prevent that potential mismatch. Ticket: 24025 Changelog: Prevent potential controller segfault when reconfiguring after gres file updates. Cherry-picked: 0cb29ac

See merge request SchedMD/dev/slurm!2925

See merge request SchedMD/dev/slurm!2924

Ticket: 23467 Cherry-picked: 440cc02

When starting slurmd as a service, move the process to a subcgroup of the slurmd systemd unit cgroup. According to systemd docs the top cgroup is owned by systemd and it will eventually try to reset slurmd limits, e.g. on a systemctl daemon-reload. This can cause systemd errors in the system log, so we are only allowed to touch the limits in a sub-cgroup. This emulates DelegateSubgroup=slurmd only available after systemd 255. Ticket: 23467 Cherry-picked: 212bbf0

Ticket: 23467 Changelog: Reparent slurmd to a subcgroup to avoid conflicting with systemd. Cherry-picked: cb62c2f

See merge request SchedMD/dev/slurm!2929

sprio was unable to parse a comma-separated list of jobs despite documentation stating that this was possible. The entire job list was passed to the unfmt_job_id_string() function, but this function is unable to parse job lists. Switching to only pass the single job to unfmt_job_id_string() fixes this issue. Regression introduced in 4f21612. Changelog: Fix sprio regression not handling comma separated list of jobids. Ticket: 24625 Cherry-picked: 80beef6
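The shape of the fix described above can be sketched in a few lines of Python: split the list on commas first, then hand each single ID to a single-job parser. This is only an illustration, not Slurm's C code. `parse_single_job_id` is a hypothetical stand-in for unfmt_job_id_string() and, as an assumption, accepts only the plain `123` and array-task `123_4` forms.

```python
from __future__ import annotations


def parse_single_job_id(token: str) -> tuple[int, int | None]:
    """Hypothetical single-job parser: accepts '123' or '123_4' only."""
    job, sep, task = token.partition("_")
    if not job.isdigit() or (sep and not task.isdigit()):
        raise ValueError(f"invalid job id: {token!r}")
    return int(job), (int(task) if sep else None)


def parse_job_list(arg: str) -> list[tuple[int, int | None]]:
    """Split the comma-separated list first, then parse one job at a time."""
    return [parse_single_job_id(tok.strip()) for tok in arg.split(",") if tok.strip()]
```

Passing the whole string `"10,11_2"` to the single-job parser raises `ValueError` in this sketch, which mirrors the failure mode the commit message describes.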

See merge request SchedMD/dev/slurm!2950

Changelog: slurmctld,slurmd - Fix memory leak when container ID is populated. Issue: 50190 Cherry-picked: 13aba98

Enhanced _remove_ecores() to use a two-pass approach that collects all P-core frequencies instead of just the first one found. This fixes P-core detection on processors where different P-cores report different FrequencyMaxMHz values.
The previous implementation (commit a80d6ba) used single-frequency matching which incorrectly excluded P-cores with different frequencies. On Intel Core Ultra 7 268V, some P-cores report 5000 MHz while others report 4900 MHz, causing only 2 of 4 P-cores to be detected.
Implementation:
- First pass: Find all CPU Kinds with CoreType=IntelCore and store all their distinct FrequencyMaxMHz values
- Second pass: Include any CPU Kinds matching any collected frequency, even without CoreType=IntelCore attribute
The second pass is necessary for hwloc < 2.10 where P-cores restricted by cpuset lose their CoreType attribute but retain FrequencyMaxMHz. This was fixed in hwloc 2.10 (commit 971ea80f9) which added PMU-based CoreType detection. The two-pass approach maintains compatibility with older hwloc versions while fixing the multiple-frequency limitation that broke detection on newer Intel processors.
Changelog: slurmd - Fix P-core detection on processors with varying P-core frequencies and in cpuset-restricted environments. Ticket: 24590 Cherry-picked: c5bd248
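A minimal Python sketch of the two-pass idea follows. It is not the C code in _remove_ecores(). As an assumption, each CPU kind is modeled as a plain dict with optional `CoreType` and `FrequencyMaxMHz` keys standing in for hwloc's CPU-kind info attributes, and the function name `select_pcore_kinds` is hypothetical.

```python
from __future__ import annotations


def select_pcore_kinds(kinds: list[dict]) -> list[int]:
    """Return the indices of CPU kinds treated as P-cores (illustrative model)."""
    # First pass: collect every distinct max frequency reported by a kind
    # that is explicitly tagged CoreType=IntelCore.
    pcore_freqs = {
        k["FrequencyMaxMHz"]
        for k in kinds
        if k.get("CoreType") == "IntelCore" and "FrequencyMaxMHz" in k
    }
    # Second pass: keep any kind whose max frequency matches one of the
    # collected values, even if its CoreType attribute is missing.
    return [i for i, k in enumerate(kinds) if k.get("FrequencyMaxMHz") in pcore_freqs]
```

With kinds reporting 4900 MHz and 5000 MHz under CoreType=IntelCore, an untagged kind at 5000 MHz is still selected, while a kind at an unrelated frequency is not.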

See merge request SchedMD/dev/slurm!2957

See merge request SchedMD/dev/slurm!2955

This option allows namespace/linux to continue to operate even if bpf tokens are not supported on the system, user namespaces are enabled, and the cgroup constrain devices option is enabled. In this mode of operation, any devices that would be constrained will only be constrained to the job and not individual steps. Ticket: 21718 Changelog: namespace/linux - add disable_bpf_token option. Cherry-picked: a70b2d8

Ticket: 24652 Cherry-picked: 82ca853

Cherry-picked: 33e19b9

Cherry-picked: c239700

Changelog: slurmctld - Avoid expedited requeue triggering a job to requeue when job exit code was zero. Ticket: 24564 Cherry-picked: c255c7d

batch_requeue_fini() should only be called after the epilog scripts have completed. Add logic to catch and log if batch_requeue_fini() is ever called before the epilog is complete. Changelog: slurmctld - Avoid expedited requeue of jobs while waiting for job epilog script to complete. Ticket: 24564 Cherry-picked: 1511a97

Fair-Share factor reported only for users. Ticket: 24618 Cherry-picked: f0ce621

…nfig Prevent cloud nodes that are configured in topology.[conf|yaml] from being removed from the topology when the node is powered down if it does not use Topology=... in its node configuration line in slurm.conf or in the --conf slurmd option. If node_ptr->config_ptr->topology_str, node_ptr->topology_str, and node_ptr->topology_orig_str are NULL, then the slurm.conf and --conf option did not specify a topology for the node and thus the topology of the node does not need to be reset by node_mgr_set_node_topology(). Changelog: slurmctld - Prevent removing cloud nodes from the topology when putting them in the POWERED_DOWN state if they are present in topology.conf or topology.yaml and their node configuration did not specify the Topology option. Ticket: 24405 Cherry-picked: b36d9b4

Previously, topology_g_add_rm_node would only clear a node from a topology if topology_str was NULL. When it was not NULL, it would only modify the topologies listed in topology_str while potentially leaving the node in another topology. This change makes it remove the node from all topologies not listed in topology_str. Changelog: interfaces/topology - When modifying a node's topology with the Topology option in slurm.conf or the slurmd --conf Topology, change the topology to fully match the new topology. Ticket: 24405 Cherry-picked: fafc23e
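The behavioral difference can be illustrated with sets. In this sketch topology membership is modeled as a dict mapping a topology name to its set of node names. That model and the name `apply_node_topologies` are assumptions made for illustration, and the real topology_str format is not reproduced.

```python
from __future__ import annotations


def apply_node_topologies(
    membership: dict[str, set[str]], node: str, wanted: set[str]
) -> None:
    """Make `node` a member of exactly the topologies named in `wanted`."""
    for name, nodes in membership.items():
        if name in wanted:
            nodes.add(node)
        else:
            # Also remove the node from every topology that is not listed,
            # rather than touching only the listed ones.
            nodes.discard(node)
```

Touching only the listed topologies would leave `node` behind in any topology it previously belonged to, which is the stale state the commit message describes.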

If the node configuration line in slurm.conf specifies the Topology option, this change allows modifications to that option to take effect on a reconfig/restart. The same applies to changes to topology.[conf|yaml]. Changelog: slurmctld - Allow changes to topology.conf or topology.yaml, and slurm.conf node configuration Topology option to take effect on a reconfigure or restart when power saving is enabled. Ticket: 24405 Cherry-picked: 8f1b2d2

See merge request SchedMD/dev/slurm!2963

See merge request SchedMD/dev/slurm!2937

See merge request SchedMD/dev/slurm!2961

This fixes the logic in slurm_bf_licenses_equal() to validate that both license lists are equal. Previously, only licenses in the first license list were checked to see if they were also the same in the second license list. It did not verify that all licenses from the second license list were in the first license list. Changelog: slurmctld - Prevent backfill from combining future timeslots if they have different license reservations. Cherry-picked: fd0ef0d
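The difference between a one-directional check and a true equality check can be shown with a small Python sketch. License lists are modeled here as dicts from license name to count, which is an assumption for illustration and not the data structure used by slurm_bf_licenses_equal(). Both function names are hypothetical.

```python
from __future__ import annotations


def covered_by(a: dict[str, int], b: dict[str, int]) -> bool:
    """One-directional check: every license in `a` has the same count in `b`."""
    return all(b.get(name) == count for name, count in a.items())


def licenses_equal(a: dict[str, int], b: dict[str, int]) -> bool:
    """Two-directional check: neither list has a license the other lacks."""
    return covered_by(a, b) and covered_by(b, a)
```

If the second list carries an extra license, `covered_by(a, b)` alone still reports a match, whereas `licenses_equal(a, b)` does not.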

If a slurmctld reconfig/restart happens while a cloud node is POWER_DOWN, the node's addr is reset, and, because it is CLOUD, it can end up FUTURE. Add POWER_DOWN to the other POWER states to exclude the node from addr reset. Ticket: 24358 Changelog: Fix CLOUD nodes infrequently becoming FUTURE on slurmctld restart. Cherry-picked: 49e3ba7

See merge request SchedMD/dev/slurm!2968

When rem_nodes is 0 in _get_block_level(), log2 resulted in -inf. Previously, this caused a crash. Check for negative before accessing the block_levels bitstring. Changelog: Fix handling of 0 node test allocations in topology/block. Ticket: 24552 Cherry-picked: 0b5f64f

See merge request SchedMD/dev/slurm!2973

See merge request SchedMD/dev/slurm!2962

Function returns the count of licenses in cluster_license_list. This is in preparation for following commit. Ticket: 24594 Cherry-picked: 95659c0

Sort the appended licenses from the advanced license reservations such that they come after the non-resv licenses, sorted by resv_id and lic_id. This is in preparation for the following commit. Ticket: 24594 Cherry-picked: aaa02a4
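The described ordering can be sketched as a stable sort with a composite key. The `License` record and its field layout are hypothetical. Only the names resv_id and lic_id come from the commit message, and treating `resv_id is None` as the mark of a non-reservation license is an assumption.

```python
from __future__ import annotations

from typing import NamedTuple, Optional


class License(NamedTuple):
    name: str
    lic_id: int
    resv_id: Optional[int] = None  # None marks a non-reservation license (assumed)


def sort_licenses(licenses: list[License]) -> list[License]:
    """Non-reservation licenses first, then reservation ones by (resv_id, lic_id)."""

    def key(lic: License) -> tuple[int, int, int]:
        if lic.resv_id is None:
            # Constant key: the stable sort keeps their existing relative order.
            return (0, 0, 0)
        return (1, lic.resv_id, lic.lic_id)

    return sorted(licenses, key=key)
```

Because Python's `sorted` is stable, the non-reservation licenses keep the order they already had, and only the appended reservation licenses are reordered.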

If the remaining count of a HRes license that the job requests changes in the next node_space table entry, set later_start. Skip doing so in all other cases with regard to licenses. This is because HRes licenses mask out nodes in the available bitmap depending on remaining counts. This prevents setting later_start unnecessarily, which led to many unneeded calls to _try_sched(). Changelog: slurmctld - In backfill, prevent unnecessarily testing jobs at future times using the select plugin if it is guaranteed to fail. Ticket: 24594 Cherry-picked: a23a0a2

See merge request SchedMD/dev/slurm!2981

See merge request SchedMD/dev/slurm!2977

This fixes a regression added by commit a23a0a2. When a node_space table entry's licenses are NULL, don't return true to indicate an increase in requested licenses in the next time slot. Ticket: 24594 Cherry-picked: 7321137

See merge request SchedMD/dev/slurm!2984

Update slurm.spec and debian/changelog as well.

slurmrestd does not handle SIGHUP; sending HUP terminates the daemon instead of reconfiguring. Drop ExecReload so systemctl reload is not advertised and does not kill the process. Changelog: slurmrestd - Remove ExecReload from unit file since the daemon does not handle SIGHUP (reload would terminate the process). Ticket: 24667 Cherry-picked: cd319ef

See merge request SchedMD/dev/slurm!2996

…es it Call _archive_table() for the main purge_type before archiving PURGE_JOB associated data (job env, script, etc.). Commit 4dcca88 had reversed this order; period_start is only set by the main archive path, while archive_write_file at the end of _archive_table uses it. The job env/script code paths (_pack_archive_job_env, _pack_archive_job_script) do not set it, so the main archive must run first. Changelog: Prevent "period_start should already be set" errors when purging slurmdbd data and fix file names for archives of purged slurmdbd data. Ticket: 24523 Cherry-picked: b263ac6

See merge request SchedMD/dev/slurm!3002

Ticket: 24439 Changelog: Skip x11 shutdown when x11 functionality was not requested. Cherry-picked: 165fe40

See merge request SchedMD/dev/slurm!3011

Changelog: Fix build errors with recent versions of libcurl (8.16+). Cherry-picked: 0ee19c3

See merge request SchedMD/dev/slurm!3018

Ticket: 24676 Cherry-picked: 372543c

See merge request SchedMD/dev/slurm!3021

After swapping alloc->environment with state.job_env in _alloc_job(), alloc->env_size was left with its original value while alloc->environment might be smaller, causing a segfault when trying to access it out-of-bounds in the later env_array_for_job() call. This failed with step_mgr and in 25.11 after 87e4a70. Both of these inject environment variables. Changelog: Fix scrun segfault with step_mgr and if environment is set. Ticket: 24662 Cherry-picked: 3ded4ee

See merge request SchedMD/dev/slurm!3028

Located in the job info struct. Ticket: 24674 Changelog: Fix two memory leaks located in the job info struct. Cherry-picked: 3012ffe

See merge request SchedMD/dev/slurm!3039

naterini and others added 30 commits February 9, 2026 14:54

slurmctld - Mark REQUEST_PERSIST_INIT as keep_msg

a284be2

Regression from 2586a90. Changelog: slurmctld - Avoid persistent connection hangs when enable_async_reply is configured. Issue: 50737 Ticket: 24550, 24516, 24581 Cherry-picked: e248aec

Merge branch 'cherrypick-2876-25.11' into 'slurm-25.11'

4869f18

See merge request SchedMD/dev/slurm!2925

Merge branch 'cherrypick-2908-25.11' into 'slurm-25.11'

ae01e4d

See merge request SchedMD/dev/slurm!2924

cgroup/v2 - Make function more generic

acb1cb7

Ticket: 23467 Cherry-picked: 440cc02

Add changelog for the previous 2 commits

a955ffd

Ticket: 23467 Changelog: Reparent slurmd to a subcgroup to avoid conflicting with systemd. Cherry-picked: cb62c2f

Merge branch 'cherrypick-2902-25.11' into 'slurm-25.11'

5afa677

See merge request SchedMD/dev/slurm!2929

Merge branch 'cherrypick-2936-25.11' into 'slurm-25.11'

dbd3778

See merge request SchedMD/dev/slurm!2950

Fix container_id memory leaks

2eb4156

Changelog: slurmctld,slurmd - Fix memory leak when container ID is populated. Issue: 50190 Cherry-picked: 13aba98

Merge branch 'cherrypick-2906-25.11' into 'slurm-25.11'

38075a1

See merge request SchedMD/dev/slurm!2957

Merge branch 'cherrypick-2919-25.11' into 'slurm-25.11'

2a752a9

See merge request SchedMD/dev/slurm!2955

Docs - Fix namespace link and title on documentation page

02d4da2

Ticket: 24652 Cherry-picked: 82ca853

Docs - Update position for namespace entry after page rename

27d5a80

Cherry-picked: 33e19b9

Docs - Update explanatory hint for namespace page

236c1fa

Cherry-picked: c239700

slurmctld - For EXPEDITED_REQUEUE, only requeue on failure

66f074b

Changelog: slurmctld - Avoid expedited requeue triggering a job to requeue when job exit code was zero. Ticket: 24564 Cherry-picked: c255c7d

Docs - Update sshare man page

a5d2b0f

Fair-Share factor reported only for users. Ticket: 24618 Cherry-picked: f0ce621

Merge branch 'cherrypick-2806-25.11' into 'slurm-25.11'

2fff9a1

See merge request SchedMD/dev/slurm!2963

Merge branch 'cherrypick-2884-25.11' into 'slurm-25.11'

dd450d1

See merge request SchedMD/dev/slurm!2937

Merge branch 'cherrypick-2895-25.11' into 'slurm-25.11'

3d306d1

See merge request SchedMD/dev/slurm!2961

Merge branch 'cherrypick-2868-25.11' into 'slurm-25.11'

3c8658f

See merge request SchedMD/dev/slurm!2968

Nathan Bulloch and others added 29 commits February 18, 2026 06:03

Merge branch 'cherrypick-2698-25.11' into 'slurm-25.11'

b683415

See merge request SchedMD/dev/slurm!2973

Merge branch 'cherrypick-2952-25.11' into 'slurm-25.11'

b95b346

See merge request SchedMD/dev/slurm!2962

Add cluster_license_count()

e5e4864

Function returns the count of licenses in cluster_license_list. This is in preparation for following commit. Ticket: 24594 Cherry-picked: 95659c0

Merge branch 'cherrypick-2954-25.11' into 'slurm-25.11'

cbc1fee

See merge request SchedMD/dev/slurm!2981

Merge branch 'cherrypick-2932-25.11' into 'slurm-25.11'

0936f91

See merge request SchedMD/dev/slurm!2977

Merge branch 'cherrypick-2983-25.11' into 'slurm-25.11'

fdb5702

See merge request SchedMD/dev/slurm!2984

Docs - Update REST API reference and changelog for 25.11.3

db99d2b

Populate CHANGELOG for 25.11.3

0f1347f

Update META for 25.11.3

c21c011

Update slurm.spec and debian/changelog as well.

Merge branch 'cherrypick-2985-25.11' into 'slurm-25.11'

ac1b121

See merge request SchedMD/dev/slurm!2996

Merge branch 'cherrypick-2885-25.11' into 'slurm-25.11'

6aaef98

See merge request SchedMD/dev/slurm!3002

Skip x11 shutdown when x11 functionality was not requested

5c795f9

Ticket: 24439 Changelog: Skip x11 shutdown when x11 functionality was not requested. Cherry-picked: 165fe40

Merge branch 'cherrypick-2823-25.11' into 'slurm-25.11'

5081044

See merge request SchedMD/dev/slurm!3011

Fix curl build errors with libcurl 8.16+

84ea30a

Changelog: Fix build errors with recent versions of libcurl (8.16+). Cherry-picked: 0ee19c3

Merge branch 'cherrypick-3006-25.11' into 'slurm-25.11'

136794e

See merge request SchedMD/dev/slurm!3018

Docs - Update restrictions on placement and number of hostlist ranges

a705202

Ticket: 24676 Cherry-picked: 372543c

Merge branch 'cherrypick-3014-25.11' into 'slurm-25.11'

6a26b65

See merge request SchedMD/dev/slurm!3021

Merge branch 'cherrypick-2959-25.11' into 'slurm-25.11'

c420393

See merge request SchedMD/dev/slurm!3028

Fix two memory leaks

0023297

Located in the job info struct. Ticket: 24674 Changelog: Fix two memory leaks located in the job info struct. Cherry-picked: 3012ffe

Merge branch 'cherrypick-3000-25.11' into 'slurm-25.11'

7e239fc

See merge request SchedMD/dev/slurm!3039

Merge branch 'slurm-25.11' into 25.11.ug

b607020

job_submit/lua - add "core_spec" to job_desc

0825419

iddecker marked this pull request as draft March 27, 2026 17:43

iddecker wants to merge 76 commits into hpcugent:25.11.ug from iddecker:core_spec_lua


19 participants
