Add tool for analyzing and reporting random CDash test failures #600

@achauphan

Description

Random failures can bring down an entire CI iteration on a regular basis and waste resources whenever a retest is requested in order to pass the various checks of a pull request.

Spotting a randomly failing test requires a lot of manual CDash querying and analysis by the developer. In most cases, however, a developer does not have the time to trace, identify, and report the randomly failing test and instead opts to ignore it and request a retest, which wastes resources as described above. This lack of reporting also leads to a bigger issue: the randomly failing test lingers in the code base and continues to affect developers in the future.

Proposed Solution

This issue proposes a new tool (which for now would live inside of TriBITS under tribits/ci_support) that can run automatically to query, scrape, analyze, and report tests that are deemed to be "randomly failing" to an operations team, either via email or via automated issue creation in the repository.

A randomly failing test is defined as a test that intermittently reports as passing or failing between CI testing iterations without any changes made to the topic or target branch being tested (that is, the topic and target tip SHA1s are the same across those iterations).
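The definition above can be sketched in Python as a grouping step. This is only a minimal illustration, not the proposed implementation: the `CiTestResult` record layout, its field names, and the `"Passed"`/`"Failed"` status strings are assumptions made for this example, and a real tool would fill such records from CDash query results.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class CiTestResult(NamedTuple):
    """One test outcome from one CI iteration (hypothetical record layout)."""
    test_name: str
    build_name: str
    topic_sha1: str   # tip SHA1 of the topic branch that was tested
    target_sha1: str  # tip SHA1 of the target branch that was tested
    status: str       # assumed to be "Passed" or "Failed"


def find_random_failures(results: Iterable[CiTestResult]) -> list[tuple[str, str]]:
    """Return sorted (test_name, build_name) pairs that both passed and
    failed for an identical (topic_sha1, target_sha1) pair."""
    statuses = defaultdict(set)
    for r in results:
        # Results are only comparable when nothing changed on either branch.
        key = (r.test_name, r.build_name, r.topic_sha1, r.target_sha1)
        statuses[key].add(r.status)
    flagged = {
        (test, build)
        for (test, build, _topic, _target), seen in statuses.items()
        if {"Passed", "Failed"} <= seen
    }
    return sorted(flagged)


results = [
    CiTestResult("testA", "gcc-build", "aaa", "bbb", "Passed"),
    CiTestResult("testA", "gcc-build", "aaa", "bbb", "Failed"),  # same SHA1s
    CiTestResult("testB", "gcc-build", "aaa", "bbb", "Failed"),
    CiTestResult("testB", "gcc-build", "ccc", "bbb", "Passed"),  # topic changed
]
print(find_random_failures(results))  # [('testA', 'gcc-build')]
```

Here `testB` is not flagged because its topic SHA1 changed between the two iterations, so the differing outcome could be explained by a code change.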

Fortunately, a lot of existing Python code inside of tribits/ci_support can be leveraged to build this tool. Notable examples are the module CreateIssueTrackerFromCDashQuery.py, which is used in the template example example_test_failure_github_issue.py, and the module CDashQueryAnalyzeReport.py, which contains most of the heavy CDash querying functionality. Thus, once these existing modules are reused, the core remaining work will be to implement the algorithm that determines a random failure in a way that is customizable on a per-project basis.

The goal is for this tool to be able to look for randomly failing tests in any project that posts its test results to CDash. How the tool gathers the version information of the builds in CDash will be unique to each project and will require implementation on a per-project basis.
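One possible way to keep the project-specific part isolated is a small strategy interface that each project implements. The sketch below is only an assumption about how this could look: `BuildVersion`, `VersionInfoStrategy`, `DictKeysVersionInfo`, and the `topic_sha1`/`target_sha1` keys are hypothetical names that do not exist in tribits/ci_support.

```python
from abc import ABC, abstractmethod
from typing import Mapping, NamedTuple, Optional


class BuildVersion(NamedTuple):
    """Version info needed to compare two CI iterations (hypothetical)."""
    topic_sha1: str
    target_sha1: str


class VersionInfoStrategy(ABC):
    """Project-specific hook: how to recover SHA1s for one CDash build."""

    @abstractmethod
    def get_build_version(self, build: Mapping[str, str]) -> Optional[BuildVersion]:
        """Return the version info for a build, or None if it is unavailable."""


class DictKeysVersionInfo(VersionInfoStrategy):
    """Example project implementation that assumes the SHA1s are already
    present under hypothetical 'topic_sha1' and 'target_sha1' keys."""

    def get_build_version(self, build: Mapping[str, str]) -> Optional[BuildVersion]:
        topic = build.get("topic_sha1")
        target = build.get("target_sha1")
        if not topic or not target:
            return None  # builds without version info cannot be compared
        return BuildVersion(topic, target)
```

With this split, the generic analysis code would depend only on `VersionInfoStrategy`, and a project whose version information lives elsewhere would supply its own subclass.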

Ideally, this tool can be extended to analyze and report randomly failing configures, builds, and tests. However, starting with randomly failing tests should lead to a framework that can be reused for those other cases.

Requirements

  • posts a GitHub issue upon identifying a randomly failing test (TRILFRAME-614 requirement for any post starting with an email first)
  • is able to query CDash results over a period of time
  • has all functionality tested
  • has documented usage
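As a rough illustration of the "query over a period of time" requirement, the sketch below generates one CDash test-query URL per testing day. It assumes the CDash `api/v1/queryTests.php` endpoint with `project` and `date` parameters. The function name `cdash_query_urls`, the site URL, and the project name are placeholders, and the existing CDashQueryAnalyzeReport.py functionality would likely be the real starting point.

```python
from datetime import date, timedelta
from urllib.parse import urlencode


def cdash_query_urls(cdash_site_url: str, project: str,
                     start: date, num_days: int) -> list[str]:
    """Build one queryTests.php URL per day, starting at `start`."""
    urls = []
    for offset in range(num_days):
        day = start + timedelta(days=offset)
        params = urlencode({"project": project, "date": day.isoformat()})
        urls.append(f"{cdash_site_url}/api/v1/queryTests.php?{params}")
    return urls


# Placeholder site and project names, for illustration only.
for url in cdash_query_urls("https://cdash.example.org", "MyProject",
                            date(2024, 1, 30), 3):
    print(url)
```

Each URL could then be fetched and its results accumulated across days before looking for tests whose status flips while the tested SHA1s stay the same.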
