CC: @ccober6, @sebrowne, @trilinos/framework
Description
Anecdotal evidence seems to suggest that random test failures, including random timeouts, are bringing down PR testing iterations fairly regularly. When this happens, all of the builds need to be run again from scratch, wasting testing computing resources, blocking PR testing iterations for other PRs, and delaying the merge of PRs.
For example, this query over the last two months suggests that random test timeouts took out PR testing iterations for the following PRs:
Note that the test timeouts for PRs #12050 and #12297 shown in that query don't appear to be random. Filtering out those PRs yields this reduced query over the last two months, which shows the following 7 randomly failing tests:
| Site | Build Name | Test Name | Status | Time | Proc Time | Details | Build Time | Processors |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ascic164 | PR-12388-test-rhel7_sems-intel-2021.3-sems-openmpi-4.0.5_release-debug_shared_no-kokkos-arch_no-asan_no-complex_fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-off_no-package-enables-1204 | ROL_example_PDE-OPT_ginzburg-landau_example_01_MPI_4 | Failed | 10m 130ms | 40m 520ms | Completed (Timeout) | 2023-10-10T12:08:21 MDT | 4 |
| ascicgpu036 | PR-12372-test-rhel7_sems-cuda-11.4.2-sems-gnu-10.1.0-sems-openmpi-4.0.5_release_static_Volta70_no-asan_complex_no-fpic_mpi_pt_no-rdc_no-uvm_deprecated-on_no-package-enables-2552 | Adelus_vector_random_npr3_rhs4_MPI_3 | Failed | 10m 60ms | 30m 180ms | Completed (Timeout) | 2023-10-06T10:29:53 MDT | 3 |
| ascic114 | PR-12367-test-rhel7_sems-intel-2021.3-sems-openmpi-4.0.5_release-debug_shared_no-kokkos-arch_no-asan_no-complex_fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-off_no-package-enables-1181 | MueLu_ParameterListInterpreterTpetra_MPI_4 | Failed | 10m 90ms | 40m 360ms | Completed (Timeout) | 2023-10-05T09:36:07 MDT | 4 |
| ascic166 | PR-12281-test-rhel7_sems-gnu-8.3.0-openmpi-1.10.1-serial_debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-1452 | Tempus_IMEX_RK_Partitioned_Staggered_FSA_Partitioned_IMEX_RK_1st_Order_MPI_1 | Failed | 10m 40ms | 10m 40ms | Completed (Timeout) | 2023-09-25T11:50:41 MDT | 1 |
| ascic166 | PR-12259-test-rhel7_sems-gnu-8.3.0-openmpi-1.10.1-serial_debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-1347 | Tempus_BackwardEuler_MPI_1 | Failed | 10m 30ms | 10m 30ms | Completed (Timeout) | 2023-09-14T00:58:50 MDT | 1 |
| ascic166 | PR-12223-test-rhel7_sems-gnu-8.3.0-openmpi-1.10.1-serial_debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-1317 | Tempus_BackwardEuler_MPI_1 | Failed | 10m 40ms | 10m 40ms | Completed (Timeout) | 2023-09-11T14:11:26 MDT | 1 |
| ascic164 | PR-12103-test-rhel7_sems-intel-2021.3-sems-openmpi-4.0.5_release-debug_shared_no-kokkos-arch_no-asan_no-complex_fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-off_no-package-enables-784 | ROL_test_algorithm_TypeE_StabilizedLCL_MPI_1 | Failed | 10m 120ms | 10m 120ms | Completed (Timeout) | 2023-08-10T11:19:22 MDT | 1 |
NOTE: Further analysis would be needed to confirm that all of these timeouts were actually random. But I believe a tool could be written to automatically determine whether a timeout (or any test failure) was random; it would actually not be that hard to do.
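To illustrate that such a tool is straightforward, here is a minimal hypothetical sketch. It assumes test results can be pulled from CDash into `(test_name, pr_id, status)` tuples (that input format and the function name are my own assumptions, not an existing API); a test is flagged as randomly failing if it both passed and failed within builds of the same PR:

```python
# Hypothetical sketch of a random-failure classifier.
# Input format is assumed: (test_name, pr_id, status) tuples extracted
# from CDash query results, with status "Passed" or "Failed".
from collections import defaultdict

def random_failures(results):
    """Return the set of test names that both passed and failed within
    at least one PR's testing iterations, i.e. likely random failures."""
    seen = defaultdict(set)  # (test_name, pr_id) -> set of statuses observed
    for test, pr, status in results:
        seen[(test, pr)].add(status)
    return {test for (test, pr), statuses in seen.items()
            if {"Passed", "Failed"} <= statuses}
```

A consistently failing test (one that never passes for a given PR) would not be flagged, which matches the intent of filtering out PRs like #12050 and #12297 above.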
Suggested solution
The simple solution would seem to be for the `ctest -S` driver to rerun the failing tests, in serial, to avoid the timeouts. For example, CTest directly supports this with the `--repeat after-timeout:<n>` command-line option and the `ctest_test()` argument `REPEAT AFTER_TIMEOUT:<n>`.
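A minimal sketch of what this could look like in a `ctest -S` driver script (the repeat count of 2 and the parallelism value shown are assumptions for illustration, not the current Trilinos driver configuration; `REPEAT` requires CMake >= 3.17):

```cmake
# Sketch of a ctest -S driver fragment: any test that times out is
# automatically rerun, up to 2 extra attempts.
ctest_test(
  PARALLEL_LEVEL 16        # assumed value, not the actual Trilinos setting
  REPEAT AFTER_TIMEOUT:2   # retry timed-out tests up to 2 more times
  RETURN_VALUE test_rc
)
```

The equivalent for a plain `ctest` invocation is `ctest --repeat after-timeout:2`. Note that `REPEAT AFTER_TIMEOUT` reruns the test at the same parallelism; rerunning still-failing tests serially (e.g. a second `ctest_test()` pass with `PARALLEL_LEVEL 1`) would need a bit of extra driver logic.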