agents: add pd ci flaky triage skill by okJiang · Pull Request #10328 · tikv/pd

okJiang · 2026-03-10T09:39:13Z

What problem does this PR solve?

PD needs a repository-maintained skill for triaging recent CI flaky failures from raw logs, reviewing likely flaky tests, and preparing the right GitHub issue actions.

Issue Number: ref #10159

What is changed and how does it work?

This PR updates the checked-in pd-ci-flaky-triage skill under .agents/skills/ to match the current local workflow.

The refreshed skill now:

- collects recent Prow and GitHub Actions failures with a dedicated prepare_logs.py step
- keeps source-specific raw-log collection separate before merging reviewed failure items
- lets agents review raw logs, candidate flaky tests, and issue matches before any GitHub write
- filters collection to failures that target master
- supports fixed test windows with --start-from plus --days
- removes the old snippet-validator path that no longer belongs to the current workflow
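
To make the fixed-window option concrete, here is a minimal sketch of how `--start-from` plus `--days` could resolve to a time window. The function names, the `YYYY-MM-DD` date format, and the half-open interval are assumptions for illustration; the PR does not show how prepare_logs.py implements this.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple


def resolve_window(days: int, start_from: Optional[str] = None,
                   now: Optional[datetime] = None) -> Tuple[datetime, datetime]:
    """Return (start, end) in UTC.

    Assumed semantics: with start_from, the window is fixed at
    [start_from, start_from + days); without it, the window is the
    trailing `days` ending at `now`.
    """
    if days <= 0:
        raise ValueError("days must be positive")
    if start_from:
        # Assumed date format; the real flag may accept something else.
        start = datetime.strptime(start_from, "%Y-%m-%d").replace(tzinfo=timezone.utc)
        return start, start + timedelta(days=days)
    end = now or datetime.now(timezone.utc)
    return end - timedelta(days=days), end


def in_window(ts: datetime, window: Tuple[datetime, datetime]) -> bool:
    """Half-open check: the start is included, the end is excluded."""
    start, end = window
    return start <= ts < end
```

A fixed window makes a run reproducible: the same `--start-from` and `--days` select the same failures no matter when the command is executed.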

It also refreshes the AGENTS.md skill index entry so the repo description matches the checked-in skill behavior.

Check List

Tests

Unit test
Manual test

Code changes

Has the configuration change: No
Has HTTP APIs changed (Don't forget to add the declarative for the new API): No
Has persistent data change: No

Side effects

Possible performance regression: No
Increased code complexity: No
Breaking backward compatibility: No

Related changes

PR to update pingcap/docs/pingcap/docs-cn: N/A
PR to update pingcap/tiup: N/A
Need to cherry-pick to the release branch: No

Release note

None.

Summary by CodeRabbit

New Features
- Added a new automated workflow for triaging recent CI test failures from Prow and GitHub Actions, categorizing them by failure type, and creating or updating issues for identified flaky tests. The workflow includes structured failure analysis with excerpt extraction and environment filtering to improve flaky test visibility and resolution.

Vendor the pd-ci-flaky-triage skill into .agents/skills, make its commands repo-local, and index it in AGENTS.md.

Signed-off-by: okjiang <819421878@qq.com>

ti-chi-bot · 2026-03-10T09:39:19Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign cabinfeverb for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai · 2026-03-10T09:39:34Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Walkthrough

This PR introduces a new agent skill pd-ci-flaky-triage for automating the triage of recent PD CI failures across Prow and GitHub Actions sources. It includes orchestration scripts to collect failures and download logs, helper modules for CI integration, extraction guidelines for failure snippets, and comprehensive test coverage for core functionality.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Skill Definition: `.agents/skills/pd-ci-flaky-triage/SKILL.md`, `AGENTS.md` | Documents the end-to-end workflow stages, control-flow constraints (auth verification, environment filtering, GitHub action justification), artifact contracts for `failure_items.json`, `env_filtered.json`, and `flaky_tests.json`, and registers the skill in the agent registry. |
| Orchestration & Helpers: `.agents/skills/pd-ci-flaky-triage/scripts/prepare_logs.py`, `.agents/skills/pd-ci-flaky-triage/scripts/triage_pd_ci_flaky.py` | Main pipeline script that resolves time windows, collects failures from both Prow and Actions sources, fetches and spools raw logs concurrently, and writes source-scoped JSON artifacts; legacy helper module providing dataclasses and functions for CI failure collection, log fetching, and authentication. |
| Reference Materials: `.agents/skills/pd-ci-flaky-triage/references/stack_snippet_guidelines.md`, `.agents/skills/pd-ci-flaky-triage/references/stack_snippet_examples.jsonl` | Guidelines for manual failure snippet extraction, failure family classification, noise filtering, and excerpt validation; sample failure entries covering assertion, timeout, goleak, panic, deadlock, and data_race categories. |
| Test Coverage: `.agents/skills/pd-ci-flaky-triage/scripts/tests/test_triage_pd_ci_flaky.py`, `.agents/skills/pd-ci-flaky-triage/scripts/tests/test_prepare_logs.py` | Unit tests exercising URL conversion, Prow/Actions failure collection logic, log fetching, and window/timestamp resolution. |

Sequence Diagram

sequenceDiagram
    participant Script as prepare_logs.py
    participant Auth as gh auth
    participant Prow as Prow API
    participant Actions as GitHub Actions
    participant LogStore as Log Storage
    participant JSON as JSON Artifacts

    Script->>Auth: ensure_gh_auth()
    Auth-->>Script: ✓ authenticated or ✗ fail
    
    Script->>Prow: collect_prow_failures(since, max_pages)
    Prow-->>Script: list of FailureRecords
    
    Script->>Actions: collect_actions_failures(since, max_runs)
    Actions-->>Script: list of FailureRecords
    
    Script->>LogStore: fetch_and_spool_log() [parallel, 8 workers]
    LogStore-->>Script: DownloadedLog or error
    
    Script->>JSON: write prow_failures.json
    Script->>JSON: write actions_failures.json
    Script->>JSON: write prow_logs.json
    Script->>JSON: write actions_logs.json
    JSON-->>Script: RUN_DIR printed
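
The parallel fetch step in the diagram can be pictured with a small sketch. It assumes a thread pool with 8 workers, as the diagram notes; `fetch_all` and `fake_fetch` are hypothetical names, and the stand-in fetcher replaces any real download so the example runs offline. It is not the code in prepare_logs.py.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, List, Tuple


def fetch_all(urls: Iterable[str], fetch: Callable[[str], str],
              workers: int = 8) -> Tuple[Dict[str, str], List[dict]]:
    """Fetch every URL concurrently and return (logs, errors).

    A failed fetch is recorded as an error instead of aborting the run.
    """
    logs: Dict[str, str] = {}
    errors: List[dict] = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                logs[url] = future.result()
            except Exception as exc:  # keep going; the gap is reported later
                errors.append({"url": url, "error": str(exc)})
    return logs, errors


def fake_fetch(url: str) -> str:
    """Stand-in for a real download (hypothetical)."""
    if url.endswith("/broken"):
        raise RuntimeError("download failed")
    return f"log for {url}"


logs, errors = fetch_all(["job/1", "job/2", "job/broken"], fake_fetch)
print(sorted(logs), errors)
```

Collecting errors per URL rather than raising keeps one bad log from hiding the rest of the window's failures.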

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

The PR introduces substantial, multi-faceted additions with varying logic density: a large orchestration script with concurrent log fetching, a helper module with extensive CI API integration and dataclass definitions, and comprehensive test coverage. The changes span diverse concerns (CI platforms, file I/O, timestamp handling, async operations) and require independent reasoning across different components.

Possibly related PRs

docs: add comprehensive AGENTS.md #10147: Modifies AGENTS.md to register or update agent skill entries, directly related to skill registration.

Suggested labels

size/L, ok-to-test, lgtm, approved

Suggested reviewers

rleungx
JmPotato
bufferflies

Poem

🐰 Hop! Hop! Through logs we sift,
Failures sorted, gifts unwrapped,
Prow and Actions, GitHub bless,
CI flakiness—now redressed! 🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: adding a new pd ci flaky triage skill to the agents directory. |
| Description check | ✅ Passed | The description covers the problem statement, detailed explanation of changes, and includes appropriate checklist items. Required sections are present and adequately filled. |
| Linked Issues check | ✅ Passed | The PR description includes 'ref `#10159`' linking to the related issue, establishing traceability for the feature work. |
| Out of Scope Changes check | ✅ Passed | All changes are directly related to adding the pd-ci-flaky-triage skill: documentation, implementation scripts, tests, and AGENTS.md update. No unrelated modifications present. |

codecov · 2026-03-10T09:50:58Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 78.88%. Comparing base (c1f3166) to head (47f87f2).
⚠️ Report is 38 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master   #10328      +/-   ##
==========================================
+ Coverage   78.78%   78.88%   +0.10%     
==========================================
  Files         527      532       +5     
  Lines       70916    71862     +946     
==========================================
+ Hits        55870    56689     +819     
- Misses      11026    11137     +111     
- Partials     4020     4036      +16

Flag	Coverage Δ
unittests	`78.88% <ø> (+0.10%)`	⬆️

rleungx · 2026-03-11T07:11:27Z

.agents/skills/pd-ci-flaky-triage/references/stack_snippet_examples.jsonl

@@ -0,0 +1,7 @@
+{"id":"good-assertion-store-list","quality":"good","failure_type":"assertion","target_kind":"test","target":"TestStoreTestSuite/TestStoresList","source_job":"https://github.com/tikv/pd/actions/runs/22709114896/job/65842590307","feature_tokens":["=== NAME  TestStoreTestSuite/TestStoresList","Error Trace:","Error:","Test:"],"recommended_line_budget":[5,20],"excerpt":"=== NAME  TestStoreTestSuite/TestStoresList\n    store_test.go:575: \n        Error Trace: /home/runner/work/pd/pd/tests/server/api/store_test.go:575\n                     /home/runner/work/pd/pd/tests/server/api/store_test.go:87\n                     /home/runner/work/pd/pd/tests/testutil.go:578\n                     /home/runner/work/pd/pd/tests/testutil.go:405\n                     /home/runner/work/pd/pd/tests/server/api/store_test.go:64\n        Error:      \"[0xc0086ad0d0 0xc0086ad0f0]\" should have 3 item(s), but has 2\n        Test:       TestStoreTestSuite/TestStoresList","notes":"Keep the full assertion block. Do not replace it with the later suite summary."}


Are these files necessary?

Yes, they are necessary. The agent needs a few examples as references; without them it is more likely to generate an unexpected issue body.
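
For readers unfamiliar with the example format, the sketch below shows one way to read an entry like the one in the diff. It assumes that `feature_tokens` are strings expected to appear in `excerpt` and that `recommended_line_budget` is a `[min, max]` line count; both readings are guesses from the field names. `check_example` is a hypothetical helper and is not a reconstruction of the snippet validator that this PR removes.

```python
import json
from typing import List


def check_example(line: str) -> List[str]:
    """Return the problems found in one JSONL example entry."""
    entry = json.loads(line)
    excerpt = entry["excerpt"]
    low, high = entry["recommended_line_budget"]  # assumed to mean [min, max] lines
    problems = [
        f"token not in excerpt: {token}"
        for token in entry["feature_tokens"]
        if token not in excerpt
    ]
    n_lines = len(excerpt.splitlines())
    if not low <= n_lines <= high:
        problems.append(f"excerpt has {n_lines} lines, budget is {low}-{high}")
    return problems


# A shortened, made-up entry in the same shape as the checked-in example.
sample = json.dumps({
    "feature_tokens": ["Error Trace:", "Error:", "Test:"],
    "recommended_line_budget": [5, 20],
    "excerpt": "=== NAME  TestX\n    x_test.go:1: \n        Error Trace: x_test.go:1\n"
               "        Error:      boom\n        Test:       TestX",
})
print(check_example(sample))
```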

okJiang · 2026-03-11T10:23:39Z

/retest

Signed-off-by: okjiang <819421878@qq.com>

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

.agents/skills/pd-ci-flaky-triage/scripts/triage_pd_ci_flaky.py (1)

884-892: Consider using tempfile.gettempdir() for better cross-platform support.

The hardcoded /tmp/pd-ci-flaky path works on Unix but could be more portable. Since --log-spool-dir already allows overriding this, it's a minor concern.

♻️ Optional: Use tempfile module for default path

+import tempfile
+
 def resolve_log_spool_dir(base_dir: str) -> Path:
     run_id = f"{now_utc().strftime('%Y%m%dT%H%M%SZ')}-{os.getpid()}"
     if base_dir:
         root = Path(base_dir)
     else:
-        root = Path("/tmp/pd-ci-flaky")
+        root = Path(tempfile.gettempdir()) / "pd-ci-flaky"
     spool_dir = root / run_id
     spool_dir.mkdir(parents=True, exist_ok=True)
     return spool_dir

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In @.agents/skills/pd-ci-flaky-triage/scripts/triage_pd_ci_flaky.py around lines
884 - 892, The default path in resolve_log_spool_dir is hardcoded to
"/tmp/pd-ci-flaky", which is Unix-specific; change it to use
tempfile.gettempdir() for cross-platform compatibility (e.g. derive root as
Path(tempfile.gettempdir()) / "pd-ci-flaky" when base_dir is empty). Update
resolve_log_spool_dir to import tempfile, compute root from
tempfile.gettempdir() instead of the literal "/tmp/pd-ci-flaky", keep the
existing run_id/spool_dir creation and mkdir logic unchanged.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.agents/skills/pd-ci-flaky-triage/scripts/triage_pd_ci_flaky.py:
- Around line 1236-1243: The decide_flaky function currently accepts a
closed_issue parameter that is intentionally unused for decision logic; update
the code to explicitly indicate this to silence linters by either adding a short
docstring on decide_flaky stating "closed_issue used for routing, not decision"
or by adding a no-op reference like "_ = closed_issue" near the top of
decide_flaky; reference the decide_flaky signature and the test
test_decide_flaky_does_not_treat_closed_issue_as_sufficient_evidence to ensure
the behavior/intent is preserved.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0ff25767-1a2b-4374-82bf-89b2119eb99c

📥 Commits

Reviewing files that changed from the base of the PR and between 2465640 and 3e65175.

📒 Files selected for processing (3)

.agents/skills/pd-ci-flaky-triage/SKILL.md
.agents/skills/pd-ci-flaky-triage/scripts/tests/test_triage_pd_ci_flaky.py
.agents/skills/pd-ci-flaky-triage/scripts/triage_pd_ci_flaky.py

okJiang · 2026-03-15T20:01:21Z

/retest

Sync the repo-local pd-ci-flaky-triage skill with the current local version.

Replace the older single-script and snippet-validator flow with the current staged workflow: prepare source-specific log artifacts first, let agents review raw logs and candidate flaky tests, and keep the final GitHub output narrowly focused on defended flaky issue actions.

Update the checked-in scripts and tests to match the new collection rules, including master-only PR filtering, fixed-window support for prepare_logs, and removal of the old validator path.

Refresh the AGENTS.md skill index entry so it describes the current skill behavior and prerequisites.

Signed-off-by: okjiang <819421878@qq.com>

JmPotato · 2026-04-01T12:03:57Z

.agents/skills/pd-ci-flaky-triage/SKILL.md

+
+## Review File Contracts
+
+`/tmp/failure_items.json` is an agent-written artifact merged after step 3. The file must contain:


These handoff files are all fixed /tmp/*.json paths. That makes retries and concurrent runs overwrite each other. Since the log spool is already run-scoped, I think the JSON artifacts should also live under a per-run directory.

Changed in 883d7e6. The handoff JSON files are now run-scoped instead of fixed /tmp paths. Step 2 records a fresh RUN_DIR and every later artifact in the skill uses $RUN_DIR/*. prepare_logs.py also defaults its JSON outputs under that run directory, and test_prepare_logs.py now covers the default path resolution.
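
A rough sketch of the run-scoped layout follows. The directory naming mirrors the timestamp-plus-PID `run_id` that `resolve_log_spool_dir` uses elsewhere in this PR; `make_run_dir` and `artifact_path` are hypothetical helpers, not functions from prepare_logs.py.

```python
import os
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def make_run_dir(base_dir: Optional[str] = None) -> Path:
    """Create and return a fresh per-run directory."""
    root = Path(base_dir) if base_dir else Path(tempfile.gettempdir()) / "pd-ci-flaky"
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    run_dir = root / f"{stamp}-{os.getpid()}"
    run_dir.mkdir(parents=True, exist_ok=True)
    return run_dir


def artifact_path(run_dir: Path, name: str) -> Path:
    """Resolve a handoff file under the run directory."""
    if Path(name).name != name:
        # Reject names that would escape the run directory.
        raise ValueError("artifact name must be a bare file name")
    return run_dir / name
```

Because every artifact path is derived from the run directory, a retry or a concurrent run writes to its own directory instead of overwriting a shared `/tmp/*.json` file.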

JmPotato · 2026-04-01T12:03:57Z

.agents/skills/pd-ci-flaky-triage/SKILL.md

+- `/tmp/prow_logs.json`: `Prow` failures with local `log_ref` paths that point to downloaded raw logs
+- `/tmp/actions_logs.json`: `GitHub Actions` failures with local `log_ref` paths that point to downloaded raw logs
+
+Intermediate artifact note:


Please add an explicit validation gate before step 3. prepare_logs.py can finish with skipped downloads or command_failed_after_retries, so the current flow may continue with partial data and still produce GitHub writes.

Changed in 883d7e6. I did not make this a fail-closed gate. The skill now treats skipped downloads and command_failed_after_retries as collection gaps that must be carried into the final summary for manual review. Step 2 explicitly says to keep running, and the final output template now reports log fetch failures and retry-exhausted commands so they are visible before anyone trusts the run completely.
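
The "carry gaps into the summary" behavior might look roughly like this. The entry shape (`job` and `status` keys) is an assumption; only the two status names, skipped downloads and `command_failed_after_retries`, come from the discussion above.

```python
from collections import Counter
from typing import Dict, Iterable, List

# Statuses treated as collection gaps rather than hard failures.
GAP_STATUSES = {"skipped", "command_failed_after_retries"}


def summarize_gaps(entries: Iterable[dict]) -> Dict[str, object]:
    """Count gap statuses and list the affected jobs for manual review."""
    counts: Counter = Counter()
    affected: List[str] = []
    for entry in entries:
        status = entry.get("status", "ok")
        if status in GAP_STATUSES:
            counts[status] += 1
            affected.append(entry.get("job", "<unknown>"))
    return {"complete": not affected, "counts": dict(counts), "affected": affected}
```

The run continues either way; the `complete` flag and the affected list only make the partial coverage visible in the final output.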

JmPotato · 2026-04-01T12:03:58Z

.agents/skills/pd-ci-flaky-triage/SKILL.md

+
+3. Parse raw logs, extract failure items, and select GitHub-facing excerpts.
+
+You should delegate the work to two subagents (one for `Prow` and one for `GitHub Actions`). Each subagent should return structured results for its own source. Do not let either subagent write the merged output files directly.


This handoff is fragile, but the instruction only says return structured results. Please add a minimal JSON skeleton for the per-source subagent output; otherwise small field drift may only surface when the main agent merges both sources.

Changed in 883d7e6. Step 3 now defines explicit per-source handoff files, $RUN_DIR/prow_source_review.json and $RUN_DIR/actions_source_review.json, plus a minimal JSON skeleton with source, window, counts, failure_items, and env_filtered. The main agent only merges after those source-specific handoffs exist.
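
To illustrate why a fixed skeleton helps, here is a sketch of checking the per-source handoffs before merging. The five top-level keys come from the reply above; the source values `prow` and `actions`, the list types, and both function names are assumptions for illustration.

```python
from typing import Dict, List

REQUIRED_KEYS = ("source", "window", "counts", "failure_items", "env_filtered")


def validate_review(review: dict, expected_source: str) -> List[str]:
    """Return a list of problems; an empty list means the skeleton is intact."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in review]
    if review.get("source") != expected_source:
        problems.append(f"source is {review.get('source')!r}, expected {expected_source!r}")
    for key in ("failure_items", "env_filtered"):
        if key in review and not isinstance(review[key], list):
            problems.append(f"{key} must be a list")
    return problems


def merge_reviews(reviews: Dict[str, dict]) -> dict:
    """Merge per-source reviews only if every one of them validates."""
    for source, review in reviews.items():
        problems = validate_review(review, source)
        if problems:
            raise ValueError(f"{source}: {'; '.join(problems)}")
    return {
        "failure_items": [item for r in reviews.values() for item in r["failure_items"]],
        "env_filtered": [item for r in reviews.values() for item in r["env_filtered"]],
    }
```

Validating each source before the merge surfaces field drift in the subagent that produced it, rather than in the merged output.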

Signed-off-by: okjiang <819421878@qq.com>

ti-chi-bot · 2026-04-03T10:13:38Z

@okJiang: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-unit-test-next-gen-3 | `47f87f2` | link | true | `/test pull-unit-test-next-gen-3` |

agents: add pd ci flaky triage skill

2465640

Vendor the pd-ci-flaky-triage skill into .agents/skills, make its commands repo-local, and index it in AGENTS.md. Signed-off-by: okjiang <819421878@qq.com>

ti-chi-bot bot added release-note-none Denotes a PR that doesn't merit a release note. dco-signoff: yes Indicates the PR's author has signed the dco. labels Mar 10, 2026

ti-chi-bot bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Mar 10, 2026

rleungx reviewed Mar 11, 2026

agents: tighten flaky triage reopen checks

3e65175

Signed-off-by: okjiang <819421878@qq.com>

coderabbitai bot reviewed Mar 12, 2026

JmPotato reviewed Apr 1, 2026

okJiang added 3 commits April 2, 2026 17:47

agents: harden flaky triage skill handoffs

883d7e6

Signed-off-by: okjiang <819421878@qq.com>

agents: drop flaky triage skill doc tests

bb05adc

Signed-off-by: okjiang <819421878@qq.com>

agents: filter pre-fix flaky evidence

47f87f2

Signed-off-by: okjiang <819421878@qq.com>
