Random test timeouts bringing down PR build & test iterations 2023 #12391

@bartlettroscoe

Description

CC: @ccober6, @sebrowne, @trilinos/framework

Anecdotal evidence suggests that random test failures, including random timeouts, are bringing down PR testing iterations fairly regularly. When this happens, all of the builds must be rerun from scratch, which wastes testing compute resources, blocks PR testing iterations for other PRs, and delays the merge of PRs.

For example, this query over the last two months suggests that random test timeouts took out PR testing iterations for several PRs.

Note that the test timeouts for the PRs #12050 and #12297 shown in that query don't appear to be random. Filtering out those PRs yields this reduced query over the last two months, which shows 7 apparently random test timeouts:

| Site | Build Name | Test Name | Status | Time | Proc Time | Details | Build Time | Processors |
|---|---|---|---|---|---|---|---|---|
| ascic164 | PR-12388-test-rhel7_sems-intel-2021.3-sems-openmpi-4.0.5_release-debug_shared_no-kokkos-arch_no-asan_no-complex_fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-off_no-package-enables-1204 | ROL_example_PDE-OPT_ginzburg-landau_example_01_MPI_4 | Failed | 10m 130ms | 40m 520ms | Completed (Timeout) | 2023-10-10T12:08:21 MDT | 4 |
| ascicgpu036 | PR-12372-test-rhel7_sems-cuda-11.4.2-sems-gnu-10.1.0-sems-openmpi-4.0.5_release_static_Volta70_no-asan_complex_no-fpic_mpi_pt_no-rdc_no-uvm_deprecated-on_no-package-enables-2552 | Adelus_vector_random_npr3_rhs4_MPI_3 | Failed | 10m 60ms | 30m 180ms | Completed (Timeout) | 2023-10-06T10:29:53 MDT | 3 |
| ascic114 | PR-12367-test-rhel7_sems-intel-2021.3-sems-openmpi-4.0.5_release-debug_shared_no-kokkos-arch_no-asan_no-complex_fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-off_no-package-enables-1181 | MueLu_ParameterListInterpreterTpetra_MPI_4 | Failed | 10m 90ms | 40m 360ms | Completed (Timeout) | 2023-10-05T09:36:07 MDT | 4 |
| ascic166 | PR-12281-test-rhel7_sems-gnu-8.3.0-openmpi-1.10.1-serial_debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-1452 | Tempus_IMEX_RK_Partitioned_Staggered_FSA_Partitioned_IMEX_RK_1st_Order_MPI_1 | Failed | 10m 40ms | 10m 40ms | Completed (Timeout) | 2023-09-25T11:50:41 MDT | 1 |
| ascic166 | PR-12259-test-rhel7_sems-gnu-8.3.0-openmpi-1.10.1-serial_debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-1347 | Tempus_BackwardEuler_MPI_1 | Failed | 10m 30ms | 10m 30ms | Completed (Timeout) | 2023-09-14T00:58:50 MDT | 1 |
| ascic166 | PR-12223-test-rhel7_sems-gnu-8.3.0-openmpi-1.10.1-serial_debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-1317 | Tempus_BackwardEuler_MPI_1 | Failed | 10m 40ms | 10m 40ms | Completed (Timeout) | 2023-09-11T14:11:26 MDT | 1 |
| ascic164 | PR-12103-test-rhel7_sems-intel-2021.3-sems-openmpi-4.0.5_release-debug_shared_no-kokkos-arch_no-asan_no-complex_fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-off_no-package-enables-784 | ROL_test_algorithm_TypeE_StabilizedLCL_MPI_1 | Failed | 10m 120ms | 10m 120ms | Completed (Timeout) | 2023-08-10T11:19:22 MDT | 1 |

NOTE: Further analysis would be needed to confirm that all of these timeouts were random. However, I believe a tool could be written to automatically determine whether a timeout (or any test failure) was random, and it would not be that hard to do.

Suggested solution

The simple solution would seem to be for the ctest -S driver to rerun the failing tests, in serial, to avoid the timeouts. For example, CTest directly supports rerunning timed-out tests with the --repeat after-timeout:<n> command-line argument and the ctest_test() argument REPEAT AFTER_TIMEOUT:<n> (the mode name is written in upper case in the CMake documentation for ctest_test()).
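
As a rough illustration of the suggestion above, a ctest -S driver script could pass the repeat option to ctest_test() along the following lines. This is only a sketch and not the actual Trilinos PR driver: the paths, site name, build name, generator, parallel level, and repeat count of 3 are all placeholder assumptions. The REPEAT argument requires CMake 3.17 or newer.

```cmake
# Hypothetical ctest -S driver fragment (NOT the real Trilinos PR driver).
# REPEAT was added to ctest_test() in CMake 3.17.
cmake_minimum_required(VERSION 3.17)

# Placeholder values (assumptions); a real driver would take these from its
# environment or configuration.
set(CTEST_SOURCE_DIRECTORY "$ENV{WORKSPACE}/Trilinos")
set(CTEST_BINARY_DIRECTORY "$ENV{WORKSPACE}/build")
set(CTEST_SITE "example-site")
set(CTEST_BUILD_NAME "example-pr-build")
set(CTEST_CMAKE_GENERATOR "Ninja")

ctest_start("Experimental")
ctest_configure()
ctest_build()

# Run the test suite. A test that times out is allowed up to 3 runs in total
# in order to pass; tests that fail for any other reason are not repeated.
# Command-line equivalent: ctest --repeat after-timeout:3
ctest_test(
  PARALLEL_LEVEL 16          # assumed parallel level
  REPEAT AFTER_TIMEOUT:3     # assumed repeat count
  RETURN_VALUE test_result   # nonzero if any test still fails
)
```

Note that, as far as I can tell, the repeated runs happen inside the same ctest_test() invocation at the same parallel level, so this is not quite the serial rerun described above; a separate serial pass over the failed tests would need additional driver logic.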

Metadata

    Labels

    CLOSED_DUE_TO_INACTIVITY: Issue or PR has been closed by the GitHub Actions bot due to inactivity.
    MARKED_FOR_CLOSURE: Issue or PR is marked for auto-closure by the GitHub Actions bot.
    PA: Framework: Issues that fall under the Trilinos Framework Product Area
