Commit 5f7628a
Add K8s failure classification, terminationReason plumbing, and retryOn directive
When running Nextflow pipelines on Kubernetes, all task failures are treated
identically — there is no distinction between infrastructure failures (OOMKilled,
Evicted, Preempted) and application errors (exit code 1, script bugs). This makes
it impossible to implement smart retry strategies like "retry OOM with more memory"
or "fail fast on application errors" without resorting to fragile exit-code matching
in error strategy closures.
Additionally, task.terminationReason was not accessible in error strategy closures
because the value was set on TaskRun but never plumbed through to TaskConfig (the
object exposed as `task` in closures).
The K8s task handler now extracts the container termination reason from the K8s API
and classifies failures into categories:
- **Application failures** (OOMKilled, Error, etc.): `terminationReason` is set on
the task but there is no special handling; the user's error strategy applies.
- **Infrastructure failures** (Evicted, Preempting, DeadlineExceeded, Shutdown):
the handler sets `task.aborted = true` and throws `NodeTerminationException` for
automatic retry, since these are transient platform issues not caused by the task itself.
- **Inferred reasons**: When K8s doesn't provide a reason but the exit code is
informative (137 = SIGKILL/OOM, 143 = SIGTERM), a synthetic reason is set
(e.g. `OOMKilled(exit137)`).
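The classification above can be sketched roughly as follows. This is a minimal illustration, not the actual `K8sTaskHandler` code: the class `FailureClassifier` and its method names are hypothetical, and only the exit-137 inference is shown because the commit message does not spell out the synthetic reason used for exit code 143.

```groovy
// Hypothetical sketch; class and method names are assumptions for illustration.
class FailureClassifier {

    // Reasons the commit message lists as transient platform issues
    static final Set<String> INFRA_REASONS =
        ['Evicted', 'Preempting', 'DeadlineExceeded', 'Shutdown'] as Set

    // Prefer the reason reported by K8s; otherwise infer one from the exit code.
    // Returns null when nothing informative is available.
    static String resolveReason(String reported, Integer exitCode) {
        if (reported)
            return reported
        if (exitCode == 137)
            return 'OOMKilled(exit137)'   // synthetic reason named in the commit message
        return null
    }

    // Infrastructure failures are the ones that trigger automatic retry
    static boolean isInfrastructure(String reason) {
        return INFRA_REASONS.contains(reason)
    }
}

assert FailureClassifier.resolveReason(null, 137) == 'OOMKilled(exit137)'
assert FailureClassifier.isInfrastructure('Evicted')
assert !FailureClassifier.isInfrastructure('OOMKilled')
```

Keeping the reason lookup separate from the infrastructure check mirrors the description: every failure gets a `terminationReason`, but only the infrastructure set changes how the task is handled.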
Also fixes the `exitcode` → `exitCode` typo in `K8sClient.jobStateFallback0()`
and the `0 ?: readExitFile()` Groovy truthiness bug (0 is falsy in Groovy, so
exit code 0 was incorrectly falling through to readExitFile).
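The truthiness bug is easy to reproduce in isolation. In the sketch below, `readExitFile` is a stand-in closure returning an arbitrary value (99), not the real method, and the null check is just one possible way to avoid the problem, not necessarily the change made in this commit.

```groovy
// Stand-in for reading the exit file; the value 99 is arbitrary
def readExitFile = { -> 99 }

Integer exitCode = 0

// Buggy: the Elvis operator tests Groovy truth, and 0 is falsy,
// so a successful exit code falls through to the fallback
def buggy = exitCode ?: readExitFile()

// Safer: fall back only when the value is actually missing
def fixed = exitCode != null ? exitCode : readExitFile()

assert buggy == 99
assert fixed == 0
```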
This builds on the work in #6436 and #6442, which introduced K8s exit code reading
from the container terminated state.
- Added `volatile String terminationReason` field to `TaskRun`
- Plumbed `terminationReason` and `aborted` from `TaskRun` to `TaskConfig` in
`TaskProcessor.resumeOrDie()`, following the existing `exitStatus` pattern
- `terminationReason` is now accessible as `task.terminationReason` in error
strategy closures
- Native logging: retry messages now include `[reason: OOMKilled]` and hard
failure error blocks include a "Termination reason" section — no custom
error strategy closure needed for visibility
A new process directive, `retryOn`, provides a declarative way to retry based on
termination reasons without writing Groovy closures:
    process FOO {
        retryOn 'OOMKilled'
        memory { 2.GB * task.attempt }
        ...
    }
Or in config files (assignment syntax):
    process {
        retryOn = ['OOMKilled', 'OOMKilled(exit137)']
    }
When the task's `terminationReason` matches any value in the `retryOn` list, the
error strategy is overridden to RETRY. The user's error strategy closure is still
evaluated first (preserving side effects like logging), but the return value is
overridden.
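The override order described above can be sketched as follows. The helper `resolveStrategy` and the plain string results are hypothetical, and the sketch assumes exact string matching against the `retryOn` list; the real logic lives in `TaskProcessor.checkErrorStrategy()` and may differ in detail.

```groovy
// Hypothetical helper; names and string return values are illustrative only
String resolveStrategy(Closure<String> userStrategy, List<String> retryOn, String terminationReason) {
    // Evaluate the user's strategy first so side effects such as logging still happen
    String result = userStrategy.call()
    // Override the result when the termination reason is in the retryOn list
    if (terminationReason && retryOn?.contains(terminationReason))
        return 'RETRY'
    return result
}

def calls = []
def strategy = { calls << 'evaluated'; 'TERMINATE' }

assert resolveStrategy(strategy, ['OOMKilled'], 'OOMKilled') == 'RETRY'
assert resolveStrategy(strategy, ['OOMKilled'], 'Error') == 'TERMINATE'
assert calls.size() == 2   // the closure ran in both cases
```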
Tested on GKE with nf-core/oncoanalyser using a config that forces OOM:
    process {
        withName: "BWAMEM2_ALIGN" {
            memory = { 1500.MB * task.attempt }
            retryOn = ['OOMKilled']
        }
    }
BWAMEM2_ALIGN runs out of memory at 1500MB, and Nextflow logs:
    [2b/54e1be] NOTE: Process `BWAMEM2_ALIGN (...)` terminated with an error
    exit status (137) [reason: OOMKilled] -- Execution is retried (1)
The task is retried with 3000MB (attempt 2), then 4500MB (attempt 3) if needed.
- TaskRun.groovy: added terminationReason field, cleared on makeCopy()
- TaskConfig.groovy: added getRetryOn() getter
- TaskProcessor.groovy: plumbed terminationReason/aborted to config, implemented
retryOn override in checkErrorStrategy(), added native terminationReason logging
- ProcessBuilder.groovy: registered retryOn as valid directive
- K8sTaskHandler.groovy: failure classification, terminationReason extraction,
infrastructure failure detection, K8sOutOfCpu/MemoryException in catch clause
- K8sClient.groovy: fixed exitcode→exitCode typo, explicit exitCode: 0
- K8sTaskHandlerTest.groovy: 6 new tests + 1 updated test
- K8sClientTest.groovy: 1 new test for job fallback exit code
Signed-off-by: Rayan Hassaine <r.hassaine@hartwigmedicalfoundation.nl>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Rayan Hassaïne <r.hassaine@hartwigmedicalfoundation.nl>
1 parent c70376d commit 5f7628a
10 files changed
Lines changed: 231 additions & 16 deletions
File tree
- modules
  - nextflow/src/main/groovy/nextflow
    - processor
    - script/dsl
  - nf-lang/src/main/java/nextflow/script/dsl
- plugins/nf-k8s/src
  - main/nextflow/k8s
    - client
  - test/nextflow/k8s
    - client
Lines changed: 12 additions & 0 deletions
Lines changed: 4 additions & 0 deletions
Lines changed: 11 additions & 1 deletion
Lines changed: 7 additions & 0 deletions
Lines changed: 1 addition & 0 deletions
Lines changed: 6 additions & 0 deletions
Lines changed: 40 additions & 3 deletions
Lines changed: 1 addition & 0 deletions
Lines changed: 140 additions & 0 deletions
Lines changed: 9 additions & 12 deletions
0 commit comments