
agents: add pd ci flaky triage skill #10328

Open
okJiang wants to merge 6 commits into tikv:master from okJiang:codex/add-pd-ci-flaky-triage-skill

Conversation

@okJiang
Member

@okJiang okJiang commented Mar 10, 2026

What problem does this PR solve?

PD needs a repository-maintained skill for triaging recent CI flaky failures from raw logs, reviewing likely flaky tests, and preparing the right GitHub issue actions.

Issue Number: ref #10159

What is changed and how does it work?

This PR updates the checked-in pd-ci-flaky-triage skill under .agents/skills/ to match the current local workflow.

The refreshed skill now:

  • collects recent Prow and GitHub Actions failures with a dedicated prepare_logs.py step
  • keeps source-specific raw-log collection separate before merging reviewed failure items
  • lets agents review raw logs, candidate flaky tests, and issue matches before any GitHub write
  • filters collection to failures that target master
  • supports fixed test windows with --start-from plus --days
  • removes the old snippet-validator path that no longer belongs to the current workflow
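
The fixed-window option in the list above could be resolved roughly as follows. This is a minimal sketch, not the code in prepare_logs.py: the function name `resolve_window`, the `YYYY-MM-DD` input format, and the interpretation that `--start-from` marks the beginning of a window lasting `--days` days (falling back to the most recent `--days` days when it is omitted) are all assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple


def resolve_window(
    days: int,
    start_from: Optional[str] = None,
    now: Optional[datetime] = None,
) -> Tuple[datetime, datetime]:
    """Return a (since, until) pair in UTC.

    Hypothetical semantics: with start_from, the window is
    [start_from, start_from + days); without it, the window is the
    most recent `days` days ending at `now`.
    """
    if days <= 0:
        raise ValueError("days must be positive")
    now = now or datetime.now(timezone.utc)
    if start_from:
        # Assumed input format: YYYY-MM-DD, interpreted as UTC midnight.
        since = datetime.strptime(start_from, "%Y-%m-%d").replace(tzinfo=timezone.utc)
        until = since + timedelta(days=days)
    else:
        until = now
        since = until - timedelta(days=days)
    return since, until
```

Pinning both ends of the window keeps a re-run over the same period reproducible, which is presumably why a fixed window is useful for testing.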

It also refreshes the AGENTS.md skill index entry so the repo description matches the checked-in skill behavior.

Check List

Tests

  • Unit test
  • Manual test

Code changes

  • Has the configuration change: No
  • Has HTTP APIs changed (Don't forget to add the declarative for the new API): No
  • Has persistent data change: No

Side effects

  • Possible performance regression: No
  • Increased code complexity: No
  • Breaking backward compatibility: No

Related changes

Release note

None.

Summary by CodeRabbit

  • New Features
    • Added a new automated workflow for triaging recent CI test failures from Prow and GitHub Actions, categorizing them by failure type, and creating or updating issues for identified flaky tests. The workflow includes structured failure analysis with excerpt extraction and environment filtering to improve flaky test visibility and resolution.

Vendor the pd-ci-flaky-triage skill into .agents/skills, make its commands repo-local, and index it in AGENTS.md.

Signed-off-by: okjiang <819421878@qq.com>
@ti-chi-bot added the release-note-none and dco-signoff: yes labels on Mar 10, 2026
@ti-chi-bot
Contributor

ti-chi-bot bot commented Mar 10, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign cabinfeverb for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot added the size/XXL label on Mar 10, 2026
@coderabbitai

coderabbitai bot commented Mar 10, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough


This PR introduces a new agent skill pd-ci-flaky-triage for automating the triage of recent PD CI failures across Prow and GitHub Actions sources. It includes orchestration scripts to collect failures and download logs, helper modules for CI integration, extraction guidelines for failure snippets, and comprehensive test coverage for core functionality.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Skill Definition: .agents/skills/pd-ci-flaky-triage/SKILL.md, AGENTS.md | Documents the end-to-end workflow stages, control-flow constraints (auth verification, environment filtering, GitHub action justification), artifact contracts for failure_items.json, env_filtered.json, and flaky_tests.json, and registers the skill in the agent registry. |
| Orchestration & Helpers: .agents/skills/pd-ci-flaky-triage/scripts/prepare_logs.py, .agents/skills/pd-ci-flaky-triage/scripts/triage_pd_ci_flaky.py | Main pipeline script that resolves time windows, collects failures from both Prow and Actions sources, fetches and spools raw logs concurrently, and writes source-scoped JSON artifacts; legacy helper module providing dataclasses and functions for CI failure collection, log fetching, and authentication. |
| Reference Materials: .agents/skills/pd-ci-flaky-triage/references/stack_snippet_guidelines.md, .agents/skills/pd-ci-flaky-triage/references/stack_snippet_examples.jsonl | Guidelines for manual failure snippet extraction, failure family classification, noise filtering, and excerpt validation; sample failure entries covering assertion, timeout, goleak, panic, deadlock, and data_race categories. |
| Test Coverage: .agents/skills/pd-ci-flaky-triage/scripts/tests/test_triage_pd_ci_flaky.py, .agents/skills/pd-ci-flaky-triage/scripts/tests/test_prepare_logs.py | Unit tests exercising URL conversion, Prow/Actions failure collection logic, log fetching, and window/timestamp resolution. |

Sequence Diagram

sequenceDiagram
    participant Script as prepare_logs.py
    participant Auth as gh auth
    participant Prow as Prow API
    participant Actions as GitHub Actions
    participant LogStore as Log Storage
    participant JSON as JSON Artifacts

    Script->>Auth: ensure_gh_auth()
    Auth-->>Script: ✓ authenticated or ✗ fail
    
    Script->>Prow: collect_prow_failures(since, max_pages)
    Prow-->>Script: list of FailureRecords
    
    Script->>Actions: collect_actions_failures(since, max_runs)
    Actions-->>Script: list of FailureRecords
    
    Script->>LogStore: fetch_and_spool_log() [parallel, 8 workers]
    LogStore-->>Script: DownloadedLog or error
    
    Script->>JSON: write prow_failures.json
    Script->>JSON: write actions_failures.json
    Script->>JSON: write prow_logs.json
    Script->>JSON: write actions_logs.json
    JSON-->>Script: RUN_DIR printed
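
The parallel fetch step in the diagram could look roughly like the sketch below. It is illustrative only: the real `fetch_and_spool_log` signature is not shown in this thread, so the download function is passed in as a plain callable, and the names `fetch_all`, `fetched`, and `errors` are hypothetical. The worker count of 8 is taken from the diagram.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, Tuple


def fetch_all(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    workers: int = 8,
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Fetch every URL in parallel; return (log_ref_by_url, error_by_url)."""
    fetched: Dict[str, str] = {}
    errors: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map each future back to its URL so results can be attributed.
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                fetched[url] = future.result()
            except Exception as exc:
                # One failed download is recorded and does not abort the rest.
                errors[url] = str(exc)
    return fetched, errors
```

Recording errors per URL instead of raising matches the "DownloadedLog or error" return shown in the diagram, though how the actual script represents errors is an assumption here.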

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

The PR introduces substantial, multi-faceted additions with varying logic density: a large orchestration script with concurrent log fetching, a helper module with extensive CI API integration and dataclass definitions, and comprehensive test coverage. The changes span diverse concerns (CI platforms, file I/O, timestamp handling, async operations) and require independent reasoning across different components.

Possibly related PRs

Suggested labels

size/L, ok-to-test, lgtm, approved

Suggested reviewers

  • rleungx
  • JmPotato
  • bufferflies

Poem

🐰 Hop! Hop! Through logs we sift,
Failures sorted, gifts unwrapped,
Prow and Actions, GitHub bless,
CI flakiness—now redressed! 🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: adding a new pd ci flaky triage skill to the agents directory. |
| Description check | ✅ Passed | The description covers the problem statement, detailed explanation of changes, and includes appropriate checklist items. Required sections are present and adequately filled. |
| Linked Issues check | ✅ Passed | The PR description includes 'ref #10159' linking to the related issue, establishing traceability for the feature work. |
| Out of Scope Changes check | ✅ Passed | All changes are directly related to adding the pd-ci-flaky-triage skill: documentation, implementation scripts, tests, and AGENTS.md update. No unrelated modifications present. |


@codecov

codecov bot commented Mar 10, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 78.88%. Comparing base (c1f3166) to head (47f87f2).
⚠️ Report is 38 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10328      +/-   ##
==========================================
+ Coverage   78.78%   78.88%   +0.10%     
==========================================
  Files         527      532       +5     
  Lines       70916    71862     +946     
==========================================
+ Hits        55870    56689     +819     
- Misses      11026    11137     +111     
- Partials     4020     4036      +16     
| Flag | Coverage Δ |
| --- | --- |
| unittests | 78.88% <ø> (+0.10%) ⬆️ |


@@ -0,0 +1,7 @@
{"id":"good-assertion-store-list","quality":"good","failure_type":"assertion","target_kind":"test","target":"TestStoreTestSuite/TestStoresList","source_job":"https://github.com/tikv/pd/actions/runs/22709114896/job/65842590307","feature_tokens":["=== NAME TestStoreTestSuite/TestStoresList","Error Trace:","Error:","Test:"],"recommended_line_budget":[5,20],"excerpt":"=== NAME TestStoreTestSuite/TestStoresList\n store_test.go:575: \n Error Trace: /home/runner/work/pd/pd/tests/server/api/store_test.go:575\n /home/runner/work/pd/pd/tests/server/api/store_test.go:87\n /home/runner/work/pd/pd/tests/testutil.go:578\n /home/runner/work/pd/pd/tests/testutil.go:405\n /home/runner/work/pd/pd/tests/server/api/store_test.go:64\n Error: \"[0xc0086ad0d0 0xc0086ad0f0]\" should have 3 item(s), but has 2\n Test: TestStoreTestSuite/TestStoresList","notes":"Keep the full assertion block. Do not replace it with the later suite summary."}
Member


Are these files necessary?

Member Author


Yes, they are necessary. The agent needs a few examples as references; without them it is more likely to generate an unexpected issue body.
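
One way to keep such example records honest is a small consistency check over the JSONL file. The sketch below is hypothetical and not part of the PR: the field names `recommended_line_budget`, `feature_tokens`, and `excerpt` come from the sample record in the diff above, but reading the budget as a `[min, max]` line count and the tokens as substrings that must appear in the excerpt are assumptions.

```python
import json
from typing import List


def check_examples(jsonl_text: str) -> List[str]:
    """Return a list of problems found in JSONL example records."""
    problems = []
    for lineno, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines between records
        record = json.loads(line)
        # Assumed meaning: [minimum, maximum] number of excerpt lines.
        low, high = record["recommended_line_budget"]
        n_lines = len(record["excerpt"].splitlines())
        if not low <= n_lines <= high:
            problems.append(
                f"line {lineno}: excerpt has {n_lines} lines, budget is {low}-{high}"
            )
        # Assumed meaning: every feature token should occur in the excerpt.
        missing = [tok for tok in record["feature_tokens"] if tok not in record["excerpt"]]
        if missing:
            problems.append(f"line {lineno}: excerpt lacks tokens {missing}")
    return problems
```

An empty result means every record is internally consistent under those assumed rules; it says nothing about whether the excerpts are good references for the agent.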

@okJiang
Member Author

okJiang commented Mar 11, 2026

/retest

Signed-off-by: okjiang <819421878@qq.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
.agents/skills/pd-ci-flaky-triage/scripts/triage_pd_ci_flaky.py (1)

884-892: Consider using tempfile.gettempdir() for better cross-platform support.

The hardcoded /tmp/pd-ci-flaky path works on Unix but could be more portable. Since --log-spool-dir already allows overriding this, it's a minor concern.

♻️ Optional: Use tempfile module for default path
+import tempfile
+
 def resolve_log_spool_dir(base_dir: str) -> Path:
     run_id = f"{now_utc().strftime('%Y%m%dT%H%M%SZ')}-{os.getpid()}"
     if base_dir:
         root = Path(base_dir)
     else:
-        root = Path("/tmp/pd-ci-flaky")
+        root = Path(tempfile.gettempdir()) / "pd-ci-flaky"
     spool_dir = root / run_id
     spool_dir.mkdir(parents=True, exist_ok=True)
     return spool_dir
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/skills/pd-ci-flaky-triage/scripts/triage_pd_ci_flaky.py around lines
884 - 892, The default path in resolve_log_spool_dir is hardcoded to
"/tmp/pd-ci-flaky", which is Unix-specific; change it to use
tempfile.gettempdir() for cross-platform compatibility (e.g. derive root as
Path(tempfile.gettempdir()) / "pd-ci-flaky" when base_dir is empty). Update
resolve_log_spool_dir to import tempfile, compute root from
tempfile.gettempdir() instead of the literal "/tmp/pd-ci-flaky", keep the
existing run_id/spool_dir creation and mkdir logic unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.agents/skills/pd-ci-flaky-triage/scripts/triage_pd_ci_flaky.py:
- Around line 1236-1243: The decide_flaky function currently accepts a
closed_issue parameter that is intentionally unused for decision logic; update
the code to explicitly indicate this to silence linters by either adding a short
docstring on decide_flaky stating "closed_issue used for routing, not decision"
or by adding a no-op reference like "_ = closed_issue" near the top of
decide_flaky; reference the decide_flaky signature and the test
test_decide_flaky_does_not_treat_closed_issue_as_sufficient_evidence to ensure
the behavior/intent is preserved.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0ff25767-1a2b-4374-82bf-89b2119eb99c

📥 Commits

Reviewing files that changed from the base of the PR and between 2465640 and 3e65175.

📒 Files selected for processing (3)
  • .agents/skills/pd-ci-flaky-triage/SKILL.md
  • .agents/skills/pd-ci-flaky-triage/scripts/tests/test_triage_pd_ci_flaky.py
  • .agents/skills/pd-ci-flaky-triage/scripts/triage_pd_ci_flaky.py

@okJiang
Member Author

okJiang commented Mar 15, 2026

/retest

Sync the repo-local pd-ci-flaky-triage skill with the current local version.

Replace the older single-script and snippet-validator flow with the current
staged workflow: prepare source-specific log artifacts first, let agents review
raw logs and candidate flaky tests, and keep the final GitHub output narrowly
focused on defended flaky issue actions.

Update the checked-in scripts and tests to match the new collection rules,
including master-only PR filtering, fixed-window support for prepare_logs, and
removal of the old validator path. Refresh the AGENTS.md skill index entry so
it describes the current skill behavior and prerequisites.

Signed-off-by: okjiang <819421878@qq.com>

## Review File Contracts

`/tmp/failure_items.json` is an agent-written artifact merged after step 3. The file must contain:
Member


These handoff files are all fixed /tmp/*.json paths. That makes retries and concurrent runs overwrite each other. Since the log spool is already run-scoped, I think the JSON artifacts should also live under a per-run directory.

Member Author


Changed in 883d7e6. The handoff JSON files are now run-scoped instead of fixed /tmp paths. Step 2 records a fresh RUN_DIR and every later artifact in the skill uses $RUN_DIR/*. prepare_logs.py also defaults its JSON outputs under that run directory, and test_prepare_logs.py now covers the default path resolution.
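
The run-scoped layout described in this reply can be sketched as follows. This is an illustration under assumptions, not the code from 883d7e6: the helper names `make_run_dir` and `default_outputs` are hypothetical, the artifact file names are taken from the sequence diagram earlier in the thread, and the directory naming (UTC timestamp plus PID plus a random suffix) only loosely follows the `run_id` format visible in the `resolve_log_spool_dir` diff.

```python
import os
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from typing import Dict, Optional

ARTIFACT_NAMES = (
    "prow_failures.json",
    "actions_failures.json",
    "prow_logs.json",
    "actions_logs.json",
)


def make_run_dir(base_dir: Optional[str] = None) -> Path:
    """Create and return a fresh directory that belongs to one run only."""
    root = Path(base_dir) if base_dir else Path(tempfile.gettempdir()) / "pd-ci-flaky"
    root.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    # mkdtemp appends a random suffix, so two runs started in the same
    # second by the same process still get distinct directories.
    return Path(tempfile.mkdtemp(prefix=f"{stamp}-{os.getpid()}-", dir=str(root)))


def default_outputs(run_dir: Path) -> Dict[str, Path]:
    """Place every JSON artifact under the run directory by default."""
    return {name: run_dir / name for name in ARTIFACT_NAMES}
```

Because every path derives from the run directory, a retry or a concurrent run cannot overwrite another run's handoff files, which is the concern raised in the review comment.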

- `/tmp/prow_logs.json`: `Prow` failures with local `log_ref` paths that point to downloaded raw logs
- `/tmp/actions_logs.json`: `GitHub Actions` failures with local `log_ref` paths that point to downloaded raw logs

Intermediate artifact note:
Member


Please add an explicit validation gate before step 3. prepare_logs.py can finish with skipped downloads or command_failed_after_retries, so the current flow may continue with partial data and still produce GitHub writes.

Member Author


Changed in 883d7e6. I did not make this a fail-closed gate. The skill now treats skipped downloads and command_failed_after_retries as collection gaps that must be carried into the final summary for manual review. Step 2 explicitly says to keep running, and the final output template now reports log fetch failures and retry-exhausted commands so they are visible before anyone trusts the run completely.
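
A non-blocking treatment of collection gaps, as described in this reply, might be summarized like this. The sketch is hypothetical: the per-entry `status` and `url` fields and the `"ok"` and `"skipped"` values are assumed for illustration, and only `command_failed_after_retries` is a name that appears in this thread.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List


def summarize_gaps(entries: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Count log entries by status and list the ones needing manual review."""
    counts: Counter = Counter()
    gaps: List[Dict[str, Any]] = []
    for entry in entries:
        # Hypothetical field: entries without a status are treated as usable.
        status = entry.get("status", "ok")
        counts[status] += 1
        if status != "ok":
            gaps.append({"url": entry.get("url"), "status": status})
    # The run keeps going either way; `complete` only flags partial data.
    return {"counts": dict(counts), "gaps": gaps, "complete": not gaps}
```

The result is meant to be copied into the final summary so that a reader sees the gaps before trusting the run, without the pipeline stopping on the first failed download.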


3. Parse raw logs, extract failure items, and select GitHub-facing excerpts.

You should delegate the work to two subagents (one for `Prow` and one for `GitHub Actions`). Each subagent should return structured results for its own source. Do not let either subagent write the merged output files directly.
Member


This handoff is fragile, yet the instruction only says to "return structured results". Please add a minimal JSON skeleton for the per-source subagent output; otherwise small field drift may only surface when the main agent merges both sources.

Member Author


Changed in 883d7e6. Step 3 now defines explicit per-source handoff files, $RUN_DIR/prow_source_review.json and $RUN_DIR/actions_source_review.json, plus a minimal JSON skeleton with source, window, counts, failure_items, and env_filtered. The main agent only merges after those source-specific handoffs exist.
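
A check that the main agent could run on each handoff before merging is sketched below. The five top-level keys are the ones named in this reply; the value types, the `"prow"`/`"actions"` source strings, and the function name `validate_source_review` are assumptions, since the actual skeleton in SKILL.md is not shown here.

```python
from typing import Any, Dict, List

# Top-level keys from the reply above; the expected types are assumed.
REQUIRED_KEYS = {
    "source": str,
    "window": dict,
    "counts": dict,
    "failure_items": list,
    "env_filtered": list,
}


def validate_source_review(doc: Dict[str, Any], expected_source: str) -> List[str]:
    """Return a list of skeleton violations; an empty list means it is mergeable."""
    errors = []
    for key, expected_type in REQUIRED_KEYS.items():
        if key not in doc:
            errors.append(f"missing key: {key}")
        elif not isinstance(doc[key], expected_type):
            errors.append(f"{key} should be {expected_type.__name__}")
    if isinstance(doc.get("source"), str) and doc["source"] != expected_source:
        errors.append(f"source is {doc['source']!r}, expected {expected_source!r}")
    # Unknown keys are reported so that field drift shows up before the merge.
    extra = sorted(set(doc) - set(REQUIRED_KEYS))
    if extra:
        errors.append(f"unexpected keys: {extra}")
    return errors
```

Running it once per source file surfaces a misspelled or missing field at the handoff boundary instead of during the merge.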

okJiang added 3 commits April 2, 2026 17:47
Signed-off-by: okjiang <819421878@qq.com>
Signed-off-by: okjiang <819421878@qq.com>
Signed-off-by: okjiang <819421878@qq.com>
@ti-chi-bot
Contributor

ti-chi-bot bot commented Apr 3, 2026

@okJiang: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-unit-test-next-gen-3 | 47f87f2 | link | true | /test pull-unit-test-next-gen-3 |


Labels

dco-signoff: yes Indicates the PR's author has signed the dco. release-note-none Denotes a PR that doesn't merit a release note. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
