feat: review system v2 — shared skills, bugfixes, analysis #304
Conversation
Plugin Version Status: Versions are auto-bumped in PRs. Manual bumps to higher versions are preserved.
⚠️ Good improvements — a few issues to address before merge
⚠️ Prompt template diverges from CI prompt — missing verdict criteria that the CI prompt includes (see inline thread)
⚠️ self-review SKILL.md has conflicting/arbitrary scoring rules (lines 64-65) that clash with CI prompt format requirements
✅ CDATA prohibition is now explicit and prominent in both prompts
✅ Verdict criteria restructured into clear decision tree
✅ Non-blocking feedback bar raised appropriately with justification requirement
✅ Shared skills architecture with correct symlinks
✅ Version bump 0.1.13 → 0.2.0 appropriate for new features
✅ Analysis doc and migration guide are comprehensive and accurate
✅ All referenced workflow/action files verified to exist
✅ PR description accurately matches code changes
Code Quality (90%)
The PR makes well-structured improvements across the review system. The bugfixes (CDATA, verdict overuse) address real problems observed in production reviews. The shared skills architecture is a meaningful DRY improvement — extracting the multi-agent review logic from a project-level skill into shared/skills/ and symlinking into the plugin.
Strengths:
- The verdict criteria restructuring (from flat CRITICAL lines to a structured decision tree) is significantly more readable and actionable
- The footnote section cleanup (separating "must include" from "must not include") removes a confusing numbered list where items 3-5 were exclusions mixed into an inclusion list
- The follow-up recommendation format now requires justification, which prevents lazy "non-blocking" feedback
Score rationale: 90% because the core changes are solid, but the template sync issue (see below) introduces a quality gap between the distributable template and the CI prompt.
Template Synchronization (Key Issue)
The CI prompt (.github/prompts/claude-code-review.md) now includes substantially more verdict guidance than the plugin template (prompt-template.md). Specifically, the template is missing:
- Style/convention violation criteria for REQUEST_CHANGES
- Maintainability/complexity criteria for REQUEST_CHANGES
- The "previous feedback adequately addressed" criterion for APPROVE
- The "bar for non-blocking feedback" paragraph
- The APPROVE "if code changes will break something, do NOT approve" CRITICAL line
The analysis document (§3.1, §3.5) acknowledges this divergence as known debt, but this PR actively widens the gap by adding content to the CI prompt that isn't reflected in the template. See the inline comment on prompt-template.md for a concrete suggestion.
Shared Skills Architecture
The self-review and parallel-review skills are well-structured. The symlinks are correct:
- `plugins/scm-utils/skills/self-review/SKILL.md` → `../../../../shared/skills/self-review/SKILL.md` ✅
- `plugins/scm-utils/skills/parallel-review/SKILL.md` → `../../../../shared/skills/parallel-review/SKILL.md` ✅
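The symlink layout can be sanity-checked with a short sketch. This is a hypothetical helper (the `check_link` name is ours, not from the PR); the paths mirror this PR's tree, recreated in a throwaway directory so the snippet is self-contained:

```python
import os
import tempfile
from pathlib import Path

def check_link(path: Path) -> str:
    """Report whether a symlink exists and resolves to a real file."""
    if path.is_symlink() and path.exists():
        return f"ok: {path.name} -> {os.readlink(path)}"
    return f"BROKEN: {path}"

# Throwaway tree mirroring the PR's layout (helper name is illustrative)
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    target = root / "shared/skills/self-review/SKILL.md"
    target.parent.mkdir(parents=True)
    target.write_text("# self-review skill\n")
    link = root / "plugins/scm-utils/skills/self-review/SKILL.md"
    link.parent.mkdir(parents=True)
    # Four levels up from skills/self-review/ lands at the repo root
    link.symlink_to("../../../../shared/skills/self-review/SKILL.md")
    print(check_link(link))
    # → ok: SKILL.md -> ../../../../shared/skills/self-review/SKILL.md
```

The relative target needs exactly four `..` segments because the link lives four directories below the repo root, which is why both symlinks in the PR share the same prefix.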
The removed .claude/skills/code-review/SKILL.md content is fully captured in the new shared/skills/self-review/SKILL.md with improvements (verdict criteria, CDATA prohibition, inline comment conventions).
However, self-review/SKILL.md has two scoring rules (lines 64-65) that are problematic — see inline comments for details.
Documentation
Both new docs are comprehensive and accurate:
- `docs/review-system-analysis.md` — thorough architecture overview with honest assessment of weaknesses
- `docs/review-system-v2-migration-guide.md` — clear migration steps with before/after diagrams and rollback instructions
All component references in the analysis doc were verified against the actual filesystem.
Simplicity (85%)
The PR adds 899 lines across 11 files. The two documentation files (524 lines combined) are valuable as system documentation but add maintenance surface. The shared skills + symlink pattern is the right approach for DRY. Score reduced from 100% because the docs are comprehensive but lengthy for what is largely a prompt restructuring + architecture reorganization.
Confidence (85%)
High confidence in the prompt/template changes — the diffs are clear and the improvements are straightforward. Moderate confidence in how the shared skills will behave at runtime, since skills are interpreted by the agent and the scoring rules in self-review could produce unexpected interactions with the CI prompt's format requirements.
Recommended follow-ups (non-blocking — each item MUST explain why it shouldn't be addressed in this PR):
- Unify CI prompt and template from a single source — requires architectural work (§5.1 of analysis) beyond this PR's scope; the analysis doc already tracks this as high priority
- Add output validation for review markdown (strip CDATA, verify details/summary structure) — requires a new post-processing hook or action, tracked as §5.3 in analysis
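The proposed output-validation step could be as small as the sketch below. Everything here is hypothetical (the function name and the exact checks are illustrative, not part of this PR); it only shows the shape of a post-processing check that flags CDATA wrappers and malformed `<details>`/`<summary>` structure:

```python
import re

def validate_review_markdown(text: str) -> list[str]:
    """Return a list of problems found in a generated review body (sketch)."""
    problems = []
    if "<![CDATA[" in text:
        problems.append("contains a CDATA wrapper")
    # Every <details> block should be balanced and open with a <summary>
    if text.count("<details>") != text.count("</details>"):
        problems.append("unbalanced <details> tags")
    for block in re.findall(r"<details>(.*?)</details>", text, re.DOTALL):
        if "<summary>" not in block:
            problems.append("<details> block missing <summary>")
    return problems
```

A real hook would presumably run this between review generation and comment posting, failing the job (or stripping the offending markup) when the list is non-empty.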
Footnotes
> When all agents complete, review each report. Compare results across dimensions to build a complete picture. Create one overall report including:
> - Score table with emoji indicators:
The 94% cap here is an arbitrary threshold. Consider either:
- Removing this rule and letting the synthesized score reflect the actual category scores
- Documenting the rationale (e.g., "a warning in any category means the PR has room for improvement, so the overall should reflect that")
As-is, this creates a magic number without justification, making it harder for agents to reason about scoring consistently.
Suggested change:
> - Score table with emoji indicators:
> - If any category has ⚠️, the overall score should reflect that no category achieved ✅ level
Applied the suggested change — replaced the arbitrary 94% cap with "the overall score should reflect that no category achieved ✅ level" in f97c75f.
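The agreed rule can be illustrated with a small sketch (hypothetical helper; assumes the ✅ band starts at 85% and the ⚠️ band tops out at 84%, matching the 65/85 boundaries and 84% ceiling adopted later in this PR):

```python
def overall_score(category_scores: list[int]) -> int:
    """Synthesize an overall score from per-category scores (illustrative)."""
    raw = round(sum(category_scores) / len(category_scores))
    if any(s < 85 for s in category_scores):
        # No category reached the ✅ band, so the overall stays in ⚠️ (<= 84)
        raw = min(raw, 84)
    return raw

print(overall_score([90, 90, 90]))  # → 90, all categories ✅
print(overall_score([95, 95, 80]))  # → 84, one ⚠️ caps the overall
```

Unlike a fixed 94% cap, the ceiling here falls out of the band definitions themselves, so an agent can derive it rather than memorize a magic number.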
> When all agents complete, review each report. Compare results across dimensions to build a complete picture. Create one overall report including:
> - Score table with emoji indicators:
> - 🚨 Score below 70%
The CI prompt mandates `<details>`/`<summary>` blocks, footnotes, and badges in every review. An agent following both the CI prompt and this skill would get contradictory instructions.
Consider aligning this with the CI prompt's mandatory format, or adding a qualifier like "when used outside CI":
Suggested change:
> - 🚨 Score below 70%
> - If overall score is >95%, the detailed section can be minimal (just the table and a brief summary)
Applied — changed to "the detailed section can be minimal (just the table and a brief summary)" to avoid conflicting with CI format requirements. f97c75f
> ### 3.1 Review Logic Split (Improved in v2)
>
> Previously, review instructions existed in three divergent places. In v2, the architecture was restructured:
✅ Good self-awareness here — this is the key architectural debt. The PR improves the situation (shared skills, DRY for review logic) but the CI prompt ↔ template divergence remains. The analysis correctly identifies this as a future improvement area rather than trying to solve it in this PR.
> - Security, performance, or correctness issues must be fixed before merging
> - The code would improve meaningfully from a suggested change and the change is straightforward to make
> - CRITICAL: If a change would make the code better and it's reasonable to do before merge, it's a requested change, not a suggestion
>
> Use **"APPROVE"** when:
> - The PR is ready to merge as-is with no outstanding issues
>
> Use **"COMMENT"** only when ALL of these are true:
> - The code won't break if merged as-is
> - The suggestions are genuinely optional and there is a clear reason why each should NOT be addressed in this PR
> - CRITICAL: If a suggestion would improve the code and is reasonable to implement, use REQUEST_CHANGES. Non-blocking feedback must have a justification.
The CI prompt (.github/prompts/claude-code-review.md) has significantly more detail in its verdict section — including style/convention violations, maintainability criteria, and the full "bar for non-blocking feedback" paragraph. The template here is a stripped-down version.
This asymmetry means repos using the template will get a less rigorous review than this repo's CI prompt. If the template is the canonical distributable version, it should include the full verdict criteria. The extra ~10 lines from the CI prompt would make the template complete:
Suggested change (sync the template's verdict section with the full CI prompt criteria):
> Use **"REQUEST_CHANGES"** when:
> - Security, performance, or correctness issues must be fixed before merging
> - Style or convention violations need correction before merging
> - Maintainability or complexity issues would degrade the codebase if merged
> - The code would improve meaningfully from a suggested change and the change is straightforward to make
> - CRITICAL: If a change would make the code better and it's reasonable to do before merge, it's a requested change, not a suggestion
>
> Use **"APPROVE"** when:
> - The PR is ready to merge as-is with no outstanding issues
> - Previous feedback has been adequately addressed (e.g., author clarified logic in PR body, commit messages, or code comments)
> - CRITICAL: If the code changes will break something after merge, do NOT approve
>
> Use **"COMMENT"** only when ALL of these are true:
> - The code won't break if merged as-is
> - The suggestions are genuinely optional and there is a clear reason why each should NOT be addressed in this PR
> - CRITICAL: If a suggestion would improve the code and is reasonable to implement, use REQUEST_CHANGES. Non-blocking feedback must have a justification.
>
> **The bar for non-blocking feedback:**
> The goal is to merge high-quality code. If your feedback would improve the code, and the improvement is within the scope of the PR and straightforward to implement, then it IS blocking and should use REQUEST_CHANGES. Non-blocking feedback should be reserved for items that genuinely warrant discussion about whether they should be addressed at all, or that would require work significantly beyond the PR's scope.
Synced the template with the full CI prompt verdict criteria in f97c75f. Now includes: style/convention violations, maintainability criteria for REQUEST_CHANGES; previous feedback and break-after-merge criteria for APPROVE; and the full "bar for non-blocking feedback" paragraph.
Force-pushed ac53a62 to f97c75f
- Fix CDATA wrapper bug: restructured footnote instructions to make CDATA prohibition prominent and explicit (was buried in a confusing numbered list that mixed "include" and "exclude" items)
- Fix non-blocking feedback overuse: raised the bar for COMMENT verdicts to require justification. If a suggestion would improve code and is reasonable to implement, it's now REQUEST_CHANGES, not COMMENT. Non-blocking follow-up items must explain why they shouldn't be addressed in the current PR.
- Extract multi-agent review logic from project-level skill into shared/skills/self-review/ (symlinked into scm-utils plugin)
- Add parallel-review shared skill for Agent Teams fan-out reviews (8 concurrent teammates, one per quality dimension)
- Remove project-level .claude/skills/code-review/SKILL.md
- Update scm-utils code-review skill to reference self-review for local fallback
- Bump scm-utils to v0.2.0 (new skills, breaking: project skill removed)
- docs/review-system-analysis.md: comprehensive analysis of review system architecture, strengths, weaknesses, and improvement suggestions
- docs/review-system-v2-migration-guide.md: migration guide from v1 to v2 with step-by-step instructions and architecture diagrams

https://claude.ai/code/session_01RB1y75VoeNNB7sucmusnmM
- Replace arbitrary 94% scoring cap with "should reflect that no category achieved ✅ level" (removes magic number)
- Change >95% "just the table" instruction to "minimal (table + brief summary)" to avoid conflicting with CI prompt format requirements
- Sync prompt-template.md verdict criteria with CI prompt: add style/convention, maintainability criteria for REQUEST_CHANGES; add previous feedback and break-after-merge criteria for APPROVE; add "bar for non-blocking feedback" paragraph

https://claude.ai/code/session_01RB1y75VoeNNB7sucmusnmM
- Remove parallel-review skill (167 lines) — experimental, can't run in CI, violates YAGNI
- Remove migration guide (246 lines) — over-engineered for an internal refactor
- Trim analysis doc — remove parallel-review refs, replace 7-item roadmap with brief note pointing to GitHub issues
- Fix scoring threshold inconsistency — align emoji boundaries (now 65/85) with badge color boundaries
- Add concrete scoring ceiling: max 84% overall when any ⚠️ exists
- Add scoring calibration rubric for sub-agents (0-39/40-64/65-84/85-89/90-100)
- De-duplicate verdict criteria from self-review — now references prompt template instead of maintaining a third copy
- De-duplicate design principles — brief reference instead of full list
- Update code-review entry point verdict table to v2 phrasing

Net: -497 lines removed, +22 added

https://claude.ai/code/session_01RB1y75VoeNNB7sucmusnmM
Force-pushed e5b94c5 to 1be3350
The 189-line review system analysis doc is valuable documentation but disproportionate to this PR's functional changes (bugfixes + skill extraction). Moving it out of scope to keep this PR focused. The analysis will be submitted as a separate documentation PR. Also fixes stale "three overlapping locations" text that was already edited in the working tree. https://claude.ai/code/session_01RB1y75VoeNNB7sucmusnmM
Force-pushed f775b5c to 571d839
Summary
- `<![CDATA[...]]>` prohibition is prominent and explicit
- Review logic extracted into `shared/skills/self-review/`, symlinked into scm-utils

Files Changed (8 files, +185/-68)
- `.github/prompts/claude-code-review.md` — CDATA prohibition + verdict criteria restructured
- `plugins/scm-utils/.../prompt-template.md` — verdict criteria synced from CI prompt
- `shared/skills/self-review/SKILL.md` — NEW: shared multi-agent review procedure (107 lines)
- `plugins/scm-utils/skills/self-review/` — symlink to shared
- `plugins/scm-utils/skills/code-review/SKILL.md` — verdict table updated to v2 phrasing
- `.claude/skills/code-review/SKILL.md` — REMOVED (replaced by shared skill)
- `plugins/scm-utils/.claude-plugin/plugin.json` — version bump to 0.2.0
- `.claude/.gitignore` — ignore ephemeral pr-review reports

Test plan
- `self-review` skill invocation locally

https://claude.ai/code/session_01RB1y75VoeNNB7sucmusnmM