CC: @ccober6, @sebrowne, @trilinos/framework
Description
Anecdotal evidence seems to suggest that random test failures, including random timeouts, are bringing down PR testing iterations fairly regularly. When this happens, all of the builds need to be run again from scratch, wasting testing computing resources, blocking PR testing iterations for other PRs, and delaying the merge of PRs.
For example, this query over the last two months suggests that random test timeouts took out PR testing iterations for the following PRs:
Note that the test timeouts for PRs #12050 and #12297 shown in that query don't appear to be random. Filtering out those PRs yields this reduced query over the last two months, which shows the following 7 randomly failing tests:
| Site | Build Name | Test Name | Status | Time | Proc Time | Details | Build Time | Processors |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ascic164 | PR-12388-test-rhel7_sems-intel-2021.3-sems-openmpi-4.0.5_release-debug_shared_no-kokkos-arch_no-asan_no-complex_fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-off_no-package-enables-1204 | ROL_example_PDE-OPT_ginzburg-landau_example_01_MPI_4 | Failed | 10m 130ms | 40m 520ms | Completed (Timeout) | 2023-10-10T12:08:21 MDT | 4 |
| ascicgpu036 | PR-12372-test-rhel7_sems-cuda-11.4.2-sems-gnu-10.1.0-sems-openmpi-4.0.5_release_static_Volta70_no-asan_complex_no-fpic_mpi_pt_no-rdc_no-uvm_deprecated-on_no-package-enables-2552 | Adelus_vector_random_npr3_rhs4_MPI_3 | Failed | 10m 60ms | 30m 180ms | Completed (Timeout) | 2023-10-06T10:29:53 MDT | 3 |
| ascic114 | PR-12367-test-rhel7_sems-intel-2021.3-sems-openmpi-4.0.5_release-debug_shared_no-kokkos-arch_no-asan_no-complex_fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-off_no-package-enables-1181 | MueLu_ParameterListInterpreterTpetra_MPI_4 | Failed | 10m 90ms | 40m 360ms | Completed (Timeout) | 2023-10-05T09:36:07 MDT | 4 |
| ascic166 | PR-12281-test-rhel7_sems-gnu-8.3.0-openmpi-1.10.1-serial_debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-1452 | Tempus_IMEX_RK_Partitioned_Staggered_FSA_Partitioned_IMEX_RK_1st_Order_MPI_1 | Failed | 10m 40ms | 10m 40ms | Completed (Timeout) | 2023-09-25T11:50:41 MDT | 1 |
| ascic166 | PR-12259-test-rhel7_sems-gnu-8.3.0-openmpi-1.10.1-serial_debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-1347 | Tempus_BackwardEuler_MPI_1 | Failed | 10m 30ms | 10m 30ms | Completed (Timeout) | 2023-09-14T00:58:50 MDT | 1 |
| ascic166 | PR-12223-test-rhel7_sems-gnu-8.3.0-openmpi-1.10.1-serial_debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-1317 | Tempus_BackwardEuler_MPI_1 | Failed | 10m 40ms | 10m 40ms | Completed (Timeout) | 2023-09-11T14:11:26 MDT | 1 |
| ascic164 | PR-12103-test-rhel7_sems-intel-2021.3-sems-openmpi-4.0.5_release-debug_shared_no-kokkos-arch_no-asan_no-complex_fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-off_no-package-enables-784 | ROL_test_algorithm_TypeE_StabilizedLCL_MPI_1 | Failed | 10m 120ms | 10m 120ms | Completed (Timeout) | 2023-08-10T11:19:22 MDT | 1 |
NOTE: Further analysis would be needed to confirm that all of these timeouts were actually random. But I believe a tool could be written to automatically determine whether a timeout (or any test failure) was random; it would actually not be that hard to do.
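To illustrate that such a tool is straightforward, here is a minimal hypothetical sketch. It assumes test results can be pulled from CDash into `(test_name, pr_id, status)` tuples (that input format and the function name are my own assumptions, not an existing API); a test is flagged as randomly failing if it both passed and failed within builds of the same PR:

```python
# Hypothetical sketch of a random-failure classifier.
# Input format is assumed: (test_name, pr_id, status) tuples extracted
# from CDash query results, with status "Passed" or "Failed".
from collections import defaultdict

def random_failures(results):
    """Return the set of test names that both passed and failed within
    at least one PR's testing iterations, i.e. likely random failures."""
    seen = defaultdict(set)  # (test_name, pr_id) -> set of statuses observed
    for test, pr, status in results:
        seen[(test, pr)].add(status)
    return {test for (test, pr), statuses in seen.items()
            if {"Passed", "Failed"} <= statuses}
```

A consistently failing test (one that never passes for a given PR) would not be flagged, which matches the intent of filtering out PRs like #12050 and #12297 above.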
Suggested solution
The simple solution would seem to be for the `ctest -S` driver to rerun the failing tests, in serial, to avoid the timeouts. For example, CTest directly supports this with the `--repeat after-timeout:<n>` command-line option and the `ctest_test()` argument `REPEAT AFTER_TIMEOUT:<n>`.
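A minimal sketch of what this could look like in a `ctest -S` driver script (the repeat count of 2 and the parallelism value shown are assumptions for illustration, not the current Trilinos driver configuration; `REPEAT` requires CMake >= 3.17):

```cmake
# Sketch of a ctest -S driver fragment: any test that times out is
# automatically rerun, up to 2 extra attempts.
ctest_test(
  PARALLEL_LEVEL 16        # assumed value, not the actual Trilinos setting
  REPEAT AFTER_TIMEOUT:2   # retry timed-out tests up to 2 more times
  RETURN_VALUE test_rc
)
```

The equivalent for a plain `ctest` invocation is `ctest --repeat after-timeout:2`. Note that `REPEAT AFTER_TIMEOUT` reruns the test at the same parallelism; rerunning still-failing tests serially (e.g. a second `ctest_test()` pass with `PARALLEL_LEVEL 1`) would need a bit of extra driver logic.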