Conversation
Vendor the pd-ci-flaky-triage skill into .agents/skills, make its commands repo-local, and index it in AGENTS.md. Signed-off-by: okjiang <819421878@qq.com>
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. Use the following commands to manage reviews:
📝 Walkthrough

This PR introduces a new agent skill.

Changes
Sequence Diagram

```mermaid
sequenceDiagram
    participant Script as prepare_logs.py
    participant Auth as gh auth
    participant Prow as Prow API
    participant Actions as GitHub Actions
    participant LogStore as Log Storage
    participant JSON as JSON Artifacts
    Script->>Auth: ensure_gh_auth()
    Auth-->>Script: ✓ authenticated or ✗ fail
    Script->>Prow: collect_prow_failures(since, max_pages)
    Prow-->>Script: list of FailureRecords
    Script->>Actions: collect_actions_failures(since, max_runs)
    Actions-->>Script: list of FailureRecords
    Script->>LogStore: fetch_and_spool_log() [parallel, 8 workers]
    LogStore-->>Script: DownloadedLog or error
    Script->>JSON: write prow_failures.json
    Script->>JSON: write actions_failures.json
    Script->>JSON: write prow_logs.json
    Script->>JSON: write actions_logs.json
    JSON-->>Script: RUN_DIR printed
```
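The eight-worker fetch step in the diagram can be sketched with a thread pool. This is a minimal sketch, not the script's actual implementation: `fetch_and_spool_log` here is a hypothetical stand-in, and the record shape is assumed.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_and_spool_log(record):
    # Hypothetical stand-in: the real step would download the record's raw log;
    # here we just return a local spool path for illustration.
    return {"id": record["id"], "log_ref": f"/tmp/spool/{record['id']}.log"}

def spool_all(records, workers=8):
    """Fetch every failure log concurrently; collect results and errors separately."""
    results, errors = [], []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fetch_and_spool_log, r): r for r in records}
        for fut in as_completed(futures):
            try:
                results.append(fut.result())
            except Exception as exc:  # a failed download is a collection gap, not a fatal error
                errors.append({"id": futures[fut]["id"], "error": str(exc)})
    return results, errors

results, errors = spool_all([{"id": "a"}, {"id": "b"}])
```

Collecting per-record errors instead of raising matches the diagram's "DownloadedLog or error" arrow: one bad log should not abort the whole run.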
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

The PR introduces substantial, multi-faceted additions with varying logic density: a large orchestration script with concurrent log fetching, a helper module with extensive CI API integration and dataclass definitions, and comprehensive test coverage. The changes span diverse concerns (CI platforms, file I/O, timestamp handling, async operations) and require independent reasoning across different components.
🚥 Pre-merge checks: ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

✅ Passed checks (4 passed)
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@ Coverage Diff @@
##           master   #10328      +/-   ##
==========================================
+ Coverage   78.78%   78.88%   +0.10%
==========================================
  Files         527      532       +5
  Lines       70916    71862     +946
==========================================
+ Hits        55870    56689     +819
- Misses      11026    11137     +111
- Partials     4020     4036      +16
```

Flags with carried forward coverage won't be shown.
```
@@ -0,0 +1,7 @@
{"id":"good-assertion-store-list","quality":"good","failure_type":"assertion","target_kind":"test","target":"TestStoreTestSuite/TestStoresList","source_job":"https://github.com/tikv/pd/actions/runs/22709114896/job/65842590307","feature_tokens":["=== NAME TestStoreTestSuite/TestStoresList","Error Trace:","Error:","Test:"],"recommended_line_budget":[5,20],"excerpt":"=== NAME TestStoreTestSuite/TestStoresList\n    store_test.go:575: \n        Error Trace: /home/runner/work/pd/pd/tests/server/api/store_test.go:575\n        /home/runner/work/pd/pd/tests/server/api/store_test.go:87\n        /home/runner/work/pd/pd/tests/testutil.go:578\n        /home/runner/work/pd/pd/tests/testutil.go:405\n        /home/runner/work/pd/pd/tests/server/api/store_test.go:64\n        Error: \"[0xc0086ad0d0 0xc0086ad0f0]\" should have 3 item(s), but has 2\n        Test: TestStoreTestSuite/TestStoresList","notes":"Keep the full assertion block. Do not replace it with the later suite summary."}
```
This is necessary: the agent needs some examples as references; otherwise, there is a higher chance of generating an unexpected issue body.
/retest
Signed-off-by: okjiang <819421878@qq.com>
Actionable comments posted: 1
🧹 Nitpick comments (1)
.agents/skills/pd-ci-flaky-triage/scripts/triage_pd_ci_flaky.py (1)
884-892: Consider using `tempfile.gettempdir()` for better cross-platform support.

The hardcoded `/tmp/pd-ci-flaky` path works on Unix but could be more portable. Since `--log-spool-dir` already allows overriding this, it's a minor concern.

♻️ Optional: Use tempfile module for default path

```diff
+import tempfile
+
 def resolve_log_spool_dir(base_dir: str) -> Path:
     run_id = f"{now_utc().strftime('%Y%m%dT%H%M%SZ')}-{os.getpid()}"
     if base_dir:
         root = Path(base_dir)
     else:
-        root = Path("/tmp/pd-ci-flaky")
+        root = Path(tempfile.gettempdir()) / "pd-ci-flaky"
     spool_dir = root / run_id
     spool_dir.mkdir(parents=True, exist_ok=True)
     return spool_dir
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.agents/skills/pd-ci-flaky-triage/scripts/triage_pd_ci_flaky.py around lines 884 - 892, The default path in resolve_log_spool_dir is hardcoded to "/tmp/pd-ci-flaky", which is Unix-specific; change it to use tempfile.gettempdir() for cross-platform compatibility (e.g. derive root as Path(tempfile.gettempdir()) / "pd-ci-flaky" when base_dir is empty). Update resolve_log_spool_dir to import tempfile, compute root from tempfile.gettempdir() instead of the literal "/tmp/pd-ci-flaky", keep the existing run_id/spool_dir creation and mkdir logic unchanged.
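For reference, a self-contained version of the suggested function: this sketch replaces the script's `now_utc()` helper with a direct `datetime` call so it runs on its own, and otherwise follows the logic described in the review comment.

```python
import os
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def resolve_log_spool_dir(base_dir: str) -> Path:
    """Create and return a per-run spool directory, portable across platforms."""
    run_id = f"{datetime.now(timezone.utc).strftime('%Y%m%dT%H%M%SZ')}-{os.getpid()}"
    # An empty base_dir falls back to the platform temp directory instead of a
    # hardcoded /tmp path, which is the point of the review suggestion.
    root = Path(base_dir) if base_dir else Path(tempfile.gettempdir()) / "pd-ci-flaky"
    spool_dir = root / run_id
    spool_dir.mkdir(parents=True, exist_ok=True)
    return spool_dir

spool = resolve_log_spool_dir("")
```

The timestamp-plus-PID run ID keeps concurrent runs from colliding even when they start within the same second on different processes.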
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.agents/skills/pd-ci-flaky-triage/scripts/triage_pd_ci_flaky.py:
- Around line 1236-1243: The decide_flaky function currently accepts a
closed_issue parameter that is intentionally unused for decision logic; update
the code to explicitly indicate this to silence linters by either adding a short
docstring on decide_flaky stating "closed_issue used for routing, not decision"
or by adding a no-op reference like "_ = closed_issue" near the top of
decide_flaky; reference the decide_flaky signature and the test
test_decide_flaky_does_not_treat_closed_issue_as_sufficient_evidence to ensure
the behavior/intent is preserved.
---
Nitpick comments:
In @.agents/skills/pd-ci-flaky-triage/scripts/triage_pd_ci_flaky.py:
- Around line 884-892: The default path in resolve_log_spool_dir is hardcoded to
"/tmp/pd-ci-flaky", which is Unix-specific; change it to use
tempfile.gettempdir() for cross-platform compatibility (e.g. derive root as
Path(tempfile.gettempdir()) / "pd-ci-flaky" when base_dir is empty). Update
resolve_log_spool_dir to import tempfile, compute root from
tempfile.gettempdir() instead of the literal "/tmp/pd-ci-flaky", keep the
existing run_id/spool_dir creation and mkdir logic unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 0ff25767-1a2b-4374-82bf-89b2119eb99c
📒 Files selected for processing (3)
- .agents/skills/pd-ci-flaky-triage/SKILL.md
- .agents/skills/pd-ci-flaky-triage/scripts/tests/test_triage_pd_ci_flaky.py
- .agents/skills/pd-ci-flaky-triage/scripts/triage_pd_ci_flaky.py
/retest
Sync the repo-local pd-ci-flaky-triage skill with the current local version. Replace the older single-script and snippet-validator flow with the current staged workflow: prepare source-specific log artifacts first, let agents review raw logs and candidate flaky tests, and keep the final GitHub output narrowly focused on defended flaky issue actions. Update the checked-in scripts and tests to match the new collection rules, including master-only PR filtering, fixed-window support for prepare_logs, and removal of the old validator path. Refresh the AGENTS.md skill index entry so it describes the current skill behavior and prerequisites. Signed-off-by: okjiang <819421878@qq.com>
## Review File Contracts
`/tmp/failure_items.json` is an agent-written artifact merged after step 3. The file must contain:
These handoff files are all fixed /tmp/*.json paths. That makes retries and concurrent runs overwrite each other. Since the log spool is already run-scoped, I think the JSON artifacts should also live under a per-run directory.
Changed in 883d7e6. The handoff JSON files are now run-scoped instead of fixed /tmp paths. Step 2 records a fresh RUN_DIR and every later artifact in the skill uses $RUN_DIR/*. prepare_logs.py also defaults its JSON outputs under that run directory, and test_prepare_logs.py now covers the default path resolution.
- `/tmp/prow_logs.json`: `Prow` failures with local `log_ref` paths that point to downloaded raw logs
- `/tmp/actions_logs.json`: `GitHub Actions` failures with local `log_ref` paths that point to downloaded raw logs
Intermediate artifact note:
Please add an explicit validation gate before step 3. prepare_logs.py can finish with skipped downloads or command_failed_after_retries, so the current flow may continue with partial data and still produce GitHub writes.
Changed in 883d7e6. I did not make this a fail-closed gate. The skill now treats skipped downloads and command_failed_after_retries as collection gaps that must be carried into the final summary for manual review. Step 2 explicitly says to keep running, and the final output template now reports log fetch failures and retry-exhausted commands so they are visible before anyone trusts the run completely.
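The carry-forward behavior described in this reply can be sketched as a small scan over the spooled-log records; the `status` field name and value strings here are assumptions based on the statuses the reviewer mentions, not the script's actual schema.

```python
def collect_gaps(log_records):
    """Return records that must be surfaced in the final summary instead of failing the run."""
    gap_statuses = {"skipped", "command_failed_after_retries"}
    return [r for r in log_records if r.get("status") in gap_statuses]

records = [
    {"id": "a", "status": "ok"},
    {"id": "b", "status": "skipped"},
    {"id": "c", "status": "command_failed_after_retries"},
]
gaps = collect_gaps(records)
```

Reporting gaps rather than raising keeps the run fail-open, matching the decision to carry collection gaps into the final summary for manual review.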
3. Parse raw logs, extract failure items, and select GitHub-facing excerpts.
You should delegate the work to two subagents (one for `Prow` and one for `GitHub Actions`). Each subagent should return structured results for its own source. Do not let either subagent write the merged output files directly.
This handoff is fragile, but the instruction only says to return structured results. Please add a minimal JSON skeleton for the per-source subagent output; otherwise, small field drift may only surface when the main agent merges both sources.
Changed in 883d7e6. Step 3 now defines explicit per-source handoff files, $RUN_DIR/prow_source_review.json and $RUN_DIR/actions_source_review.json, plus a minimal JSON skeleton with source, window, counts, failure_items, and env_filtered. The main agent only merges after those source-specific handoffs exist.
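A minimal validator over that skeleton's top-level keys; the key names come from this reply, while the value shapes in the example are illustrative assumptions.

```python
REQUIRED_KEYS = {"source", "window", "counts", "failure_items", "env_filtered"}

def validate_source_review(doc: dict) -> list[str]:
    """Return the sorted list of missing top-level keys so field drift fails fast at merge time."""
    return sorted(REQUIRED_KEYS - doc.keys())

# Illustrative per-source handoff document (value shapes are assumptions).
skeleton = {
    "source": "prow",
    "window": {"since": "2025-01-01T00:00:00Z"},
    "counts": {"failures": 0},
    "failure_items": [],
    "env_filtered": [],
}
missing = validate_source_review(skeleton)
```

Running this check before the merge step turns silent field drift into an explicit per-source error.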
Signed-off-by: okjiang <819421878@qq.com>
Signed-off-by: okjiang <819421878@qq.com>
Signed-off-by: okjiang <819421878@qq.com>
@okJiang: The following test failed, say

Full PR test history. Your PR dashboard.

Details: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
What problem does this PR solve?
PD needs a repository-maintained skill for triaging recent CI flaky failures from raw logs, reviewing likely flaky tests, and preparing the right GitHub issue actions.
Issue Number: ref #10159
What is changed and how does it work?
This PR updates the checked-in `pd-ci-flaky-triage` skill under `.agents/skills/` to match the current local workflow. The refreshed skill now:

- collects recent `Prow` and `GitHub Actions` failures with a dedicated `prepare_logs.py` step
- filters PR failures to `master` only
- supports a fixed collection window via `--start-from` plus `--days`

It also refreshes the `AGENTS.md` skill index entry so the repo description matches the checked-in skill behavior.

Check List
Tests
Code changes
Side effects
Related changes
- pingcap/docs / pingcap/docs-cn: N/A
- pingcap/tiup: N/A