
cpu-o3: 2 entry size resolve queue for evaluation#670

Closed
Yakkhini wants to merge 3 commits into xs-dev from resolve-queue-align

Conversation

@Yakkhini (Collaborator) commented Dec 23, 2025

Change-Id: Ie1936843e32037b77c45b6134a175563b3e6833a

Summary by CodeRabbit

  • Refactor

    • Improved resolve-queue processing with per-item merging, tighter overflow handling, and more accurate enqueue/dequeue accounting; added post-processing of the front resolve entry with conditional dequeue on success.
    • Introduced a lightweight "dry-run" prediction hook invoked in squash and outstanding-prediction paths to better track prediction state without committing changes.
  • Chores

    • Added per-core resolve-queue sizing option for finer pipeline tuning.


@coderabbitai bot commented Dec 23, 2025

📝 Walkthrough

Per-core configs add cpu.resolveQueueSize = 2. Fetch-stage resolve handling rewritten to per-item merge/enqueue logic with queue-full gating, enqueue/dequeue/stat tracking, and front-entry resolve-update processing. BTB predictors gain a dryRunCycle API and are invoked from the decoupled predictor in specific paths.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Configuration adjustments**<br/>`configs/example/idealkmhv3.py`, `configs/example/kmhv3.py` | Add `cpu.resolveQueueSize = 2` to per-core parameter sets (inserted near fetch queue params). |
| **Fetch-stage resolve queue logic**<br/>`src/cpu/o3/fetch.cc` | Replace bulk enqueue logic with per-item processing: merge by FSQ ID, per-item full-queue checks against `resolveQueueSize`, per-item enqueue/dequeue/stat updates; perform the resolve-update for the front entry and dequeue only on success. |
| **BTB predictor API additions**<br/>`src/cpu/pred/btb/btb_ittage.hh`, `src/cpu/pred/btb/btb_ittage.cc`, `src/cpu/pred/btb/btb_tage.hh`, `src/cpu/pred/btb/btb_tage.cc`, `src/cpu/pred/btb/timed_base_pred.hh` | Add a `dryRunCycle(Addr)` declaration to the predictor base/headers and implement a no-op / lightweight dry run in the ITTAGE/TAGE implementations. |
| **Predictor integration**<br/>`src/cpu/pred/btb/decoupled_bpred.cc` | Invoke `tage->dryRunCycle(s0PC)` in squash and prediction-outstanding/override-bubble paths to trigger the new dry-run hook. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  participant IEW as IEW (producer)
  participant Fetch as Fetch Stage
  participant ResolveQ as ResolveQueue
  participant DBPBTB as DBP/BTB (updater)
  participant DecoupledBpred as Decoupled BPred
  participant TAGE as TAGE Predictor

  IEW->>Fetch: send resolved CFIs (list)
  note right of Fetch: For each CFI:<br/>- try merge by FSQ ID<br/>- if merged, append PC<br/>- else if ResolveQ.size >= resolveQueueSize, record enqueue-fail<br/>- else create & enqueue entry
  Fetch->>ResolveQ: merge/enqueue (per item)
  ResolveQ-->>Fetch: ack / full

  alt ResolveQ non-empty after enqueue
    Fetch->>DBPBTB: prepare resolveUpdate for front entry
    DBPBTB-->>Fetch: update result (success / failure)
    alt success
      Fetch->>ResolveQ: dequeue front entry
      ResolveQ-->>Fetch: dequeued
    else failure
      Fetch->>Fetch: notify failure (do not dequeue)
    end
  end

  %% Predictor dry-run interactions (separate flow)
  Note over DecoupledBpred,TAGE: On squash or override-bubble paths
  DecoupledBpred->>TAGE: dryRunCycle(s0PC)
  TAGE-->>DecoupledBpred: returns (updates predBank tracking)
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested labels

align-kmhv3

Suggested reviewers

  • jensen-yan

Poem

"I hopped through queues with nimble paws,
Merged PCs and watched the counters pause.
Two cozy slots I gently placed,
Front-first resolves—neat and chaste. 🐇"

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 23.08%, below the required threshold of 80.00%. | Run `@coderabbitai generate docstrings` to improve docstring coverage. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title 'cpu-o3: 2 entry size resolve queue for evaluation' directly and specifically describes the main change: setting the resolve queue to 2 entries across multiple configuration files and in the fetch.cc implementation. |


@coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8ef695e and d328815.

📒 Files selected for processing (3)
  • configs/example/idealkmhv3.py
  • configs/example/kmhv3.py
  • src/cpu/o3/fetch.cc
🧰 Additional context used
🧬 Code graph analysis (1)
src/cpu/o3/fetch.cc (1)
src/cpu/o3/fetch.hh (1)
  • fetchStats (1120-1120)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Quick Build, Unit Tests & Smoke Test
  • GitHub Check: perf_test / XS-GEM5 - Run performance test (spec06-0.3c)
🔇 Additional comments (5)
configs/example/idealkmhv3.py (1)

32-32: LGTM: Resolve queue size configuration added.

The addition of cpu.resolveQueueSize = 2 is correctly placed within the fetch configuration block and aligns with the PR objective of implementing a 2-entry resolve queue for evaluation.

src/cpu/o3/fetch.cc (3)

1515-1541: Approve the per-item resolve queue logic, pending duplicate stat fix.

The refactored resolve queue handling correctly implements per-item processing with:

  • Merge behavior for matching FSQ IDs to avoid duplicates
  • Full-queue overflow checks per item
  • Separate tracking of enqueue counts for new entries only

The logic flow is sound, though the duplicate statistic increment on line 1532 must be fixed.


1547-1562: LGTM: Resolve-update/dequeue post-processing flow.

The post-processing logic correctly:

  • Guards against empty queue before accessing front()
  • Prepares and marks CFIs as resolved via dbpbtb
  • Dequeues on successful resolveUpdate() with appropriate statistics
  • Notifies failure when update cannot be completed

The single-entry-per-tick processing is appropriate for pipelined operation.


1543-1545: LGTM: Statistics emission for enqueue and occupancy.

The statistics correctly capture:

  • enqueueCount: Number of new entries added in this cycle
  • resolveQueue.size(): Current queue occupancy

These metrics are emitted after processing all incoming CFIs, providing accurate evaluation data for the 2-entry resolve queue.

configs/example/kmhv3.py (1)

30-30: The fetchQueueSize reduction from 64 to 2 entries (32x reduction) is intentional for evaluation of a 2-entry resolve queue, as documented in the git history. This configuration applies system-wide to all CPUs and may impact fetch stage buffering, but this is expected for the evaluation scenario.

@github-actions

🚀 Coremark Smoke Test Results

| Branch | IPC | Change |
| --- | --- | --- |
| Base (xs-dev) | 2.0659 | - |
| This PR | 1.3785 | 📉 -0.6874 (-33.27%) |

✅ Difftest smoke test passed!

@Yakkhini force-pushed the resolve-queue-align branch from d328815 to ada8778 on December 23, 2025 06:39
@github-actions

🚀 Coremark Smoke Test Results

| Branch | IPC | Change |
| --- | --- | --- |
| Base (xs-dev) | 2.0659 | - |
| This PR | 1.3785 | 📉 -0.6874 (-33.27%) |

✅ Difftest smoke test passed!

@Yakkhini added the perf label Dec 23, 2025
@github-actions

🚀 Performance test triggered: spec06-0.8c

@Yakkhini Yakkhini force-pushed the resolve-queue-align branch from ada8778 to 65606de Compare December 24, 2025 06:59
@coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
src/cpu/o3/fetch.cc (1)

1515-1527: LGTM with optional optimization note.

The merging logic correctly consolidates resolved CFIs with matching FSQ IDs. The O(n·m) nested loop is acceptable given the 2-entry queue size mentioned in the PR.

Optional: Hash-based lookup for larger queues

If the queue size increases significantly in the future, consider using a hash map to avoid the O(n·m) nested loop:

```cpp
// Build a map for O(1) lookups
std::unordered_map<uint64_t, size_t> fsqIdToIndex;
for (size_t i = 0; i < resolveQueue.size(); i++) {
    fsqIdToIndex[resolveQueue[i].resolvedFSQId] = i;
}

for (const auto &resolved : incoming) {
    auto it = fsqIdToIndex.find(resolved.fsqId);
    if (it != fsqIdToIndex.end()) {
        resolveQueue[it->second].resolvedInstPC.push_back(resolved.pc);
        continue;
    }
    // ... rest of enqueue logic
}
```

However, this adds complexity that isn't justified for a 2-entry queue.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ada8778 and 65606de.

📒 Files selected for processing (3)
  • configs/example/idealkmhv3.py
  • configs/example/kmhv3.py
  • src/cpu/o3/fetch.cc
🚧 Files skipped from review as they are similar to previous changes (2)
  • configs/example/kmhv3.py
  • configs/example/idealkmhv3.py
🧰 Additional context used
🧬 Code graph analysis (1)
src/cpu/o3/fetch.cc (1)
src/cpu/o3/fetch.hh (1)
  • fetchStats (1120-1120)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Quick Build, Unit Tests & Smoke Test
  • GitHub Check: perf_test / XS-GEM5 - Run performance test (spec06-0.3c)
🔇 Additional comments (3)
src/cpu/o3/fetch.cc (3)

1530-1534: Duplicate increment fixed - LGTM.

The duplicate increment issue flagged in the previous review has been correctly addressed. Both statistics now increment exactly once when the queue is full and an enqueue attempt fails.


1536-1543: Verify enqueueCount semantics for merged entries.

The enqueueCount is incremented only when a new entry is created (line 1540), not when resolved CFIs are merged into existing entries (lines 1519-1520). Please confirm whether this is the intended behavior.

  • If the statistic should track "new queue entries created," the current implementation is correct.
  • If it should track "total resolved CFIs processed," then merged CFIs should also increment the count.
Alternative if merged entries should be counted
```diff
 for (const auto &resolved : incoming) {
     bool merged = false;
     for (auto &queued : resolveQueue) {
         if (queued.resolvedFSQId == resolved.fsqId) {
             queued.resolvedInstPC.push_back(resolved.pc);
             merged = true;
+            enqueueCount++;  // Count merged entries too
             break;
         }
     }
```

1547-1562: Front entry processing logic is correct.

The front queue entry processing correctly implements retry-on-failure semantics: the entry remains in the queue when resolveUpdate fails and is retried on the next tick. This provides natural backpressure when the BTB cannot accept updates.

@github-actions

🚀 Coremark Smoke Test Results

| Branch | IPC | Change |
| --- | --- | --- |
| Base (xs-dev) | 2.0659 | - |
| This PR | 2.0397 | 📉 -0.0262 (-1.27%) |

✅ Difftest smoke test passed!

@XiangShanRobot

[Generated by GEM5 Performance Robot]
commit: 65606de
workflow: gem5 Align BTB Performance Test(0.3c)

Standard Performance

Overall Score

| | PR | Master | Diff(%) |
| --- | --- | --- | --- |
| Score | 16.85 | 16.87 | -0.16 🔴 |

@Yakkhini (Collaborator, Author) commented:

[screenshot attachment]

Change-Id: Idbdcff6c75d797501ee831d143aa968eccaf6bdf
Change-Id: I4d3b8a805a64a04dfe55e97707b55ed481c1ed04
@Yakkhini force-pushed the resolve-queue-align branch from 65606de to f3a0a83 on December 31, 2025 03:00
Change-Id: Ie1936843e32037b77c45b6134a175563b3e6833a
@coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (4)
src/cpu/pred/btb/timed_base_pred.hh (1)

56-63: Base dryRunCycle hook is well-placed; minor style nit only

Adding virtual void dryRunCycle(Addr startAddr) {} in the base class cleanly supports optional dry-run behavior in derived predictors. If you want to avoid unused-parameter warnings in stricter builds, you could drop the parameter name or comment it (Addr /*startAddr*/), but that’s purely cosmetic.

src/cpu/pred/btb/btb_tage.cc (1)

336-343: dryRunCycle correctly integrates with the BTBTAGE bank-conflict model

Recording lastPredBankId/predBankValid in dryRunCycle matches what putPCHistory does and cleanly feeds canResolveUpdate’s bank-conflict check for cycles where the predictor doesn’t do a full lookup but still occupies a bank (e.g., override bubbles, squash handling). The logic looks self-consistent with enableBankConflict gating in canResolveUpdate.

If you ever want to micro-optimize, you could early-return in dryRunCycle when !enableBankConflict, but it’s not necessary.

Also applies to: 359-367, 658-681

src/cpu/pred/btb/decoupled_bpred.cc (1)

132-140: dryRunCycle call sites align with intended “no-prediction” cycles

Invoking tage->dryRunCycle(s0PC):

  • On squash (after resetting BPU state), and
  • While in PREDICTION_OUTSTANDING with pending numOverrideBubbles

nicely keeps BTBTAGE’s bank-conflict bookkeeping in step with the pipeline even when no new prediction is requested. This should make canResolveUpdate’s conflict detection more realistic without changing visible behavior.

You might optionally guard the calls with if (tage->isEnabled()) for symmetry with other TAGE usage, but functionally this is sound.

Also applies to: 166-168

src/cpu/o3/fetch.cc (1)

254-337: Resolve-queue statistics wiring matches the new behavior

The added stats:

  • resolveQueueFullEvents and resolveEnqueueFailEvent are incremented only on failed enqueues, which matches their descriptions.
  • resolveEnqueueCount.init(1, 8, 1) + sample(enqueueCount) turns this into a per-cycle “entries enqueued” distribution, while resolveQueueOccupancy.init(0, 32, 1) tracks instantaneous queue depth.

Functionally this is fine. If you intend resolveEnqueueCount to be a simple cumulative counter rather than a distribution, you’d instead increment a scalar; otherwise, consider updating the description comment to clarify it’s a per-cycle histogram.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 65606de and f3a0a83.

📒 Files selected for processing (9)
  • configs/example/idealkmhv3.py
  • configs/example/kmhv3.py
  • src/cpu/o3/fetch.cc
  • src/cpu/pred/btb/btb_ittage.cc
  • src/cpu/pred/btb/btb_ittage.hh
  • src/cpu/pred/btb/btb_tage.cc
  • src/cpu/pred/btb/btb_tage.hh
  • src/cpu/pred/btb/decoupled_bpred.cc
  • src/cpu/pred/btb/timed_base_pred.hh
🚧 Files skipped from review as they are similar to previous changes (1)
  • configs/example/kmhv3.py
🧰 Additional context used
🧬 Code graph analysis (6)
src/cpu/pred/btb/timed_base_pred.hh (2)
src/cpu/pred/btb/btb_ittage.hh (2)
  • startAddr (96-96)
  • startAddr (98-100)
src/cpu/pred/btb/btb_tage.hh (2)
  • startAddr (121-121)
  • startAddr (123-125)
src/cpu/pred/btb/btb_tage.hh (2)
src/cpu/pred/btb/btb_ittage.hh (2)
  • startAddr (96-96)
  • startAddr (98-100)
src/cpu/pred/btb/timed_base_pred.hh (3)
  • startAddr (58-58)
  • startAddr (60-62)
  • startAddr (60-62)
src/cpu/pred/btb/btb_ittage.cc (1)
src/cpu/pred/btb/btb_tage.cc (2)
  • dryRunCycle (336-344)
  • dryRunCycle (337-337)
src/cpu/pred/btb/btb_ittage.hh (2)
src/cpu/pred/btb/btb_tage.hh (2)
  • startAddr (121-121)
  • startAddr (123-125)
src/cpu/pred/btb/timed_base_pred.hh (3)
  • startAddr (58-58)
  • startAddr (60-62)
  • startAddr (60-62)
src/cpu/pred/btb/btb_tage.cc (1)
src/cpu/pred/btb/btb_ittage.cc (2)
  • dryRunCycle (159-162)
  • dryRunCycle (160-160)
src/cpu/pred/btb/decoupled_bpred.cc (1)
src/cpu/pred/btb/test/btb_tage.test.cc (1)
  • tage (276-282)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Quick Build, Unit Tests & Smoke Test
  • GitHub Check: perf_test / XS-GEM5 - Run performance test (spec06-0.3c)
🔇 Additional comments (5)
configs/example/idealkmhv3.py (1)

31-31: Resolve queue size wiring for ideal KMHV3 looks consistent

Setting cpu.resolveQueueSize = 2 alongside fetchQueueSize keeps this config aligned with the new resolve-queue behavior; no issues from this change alone.

src/cpu/pred/btb/btb_tage.hh (1)

118-125: BTBTAGE dry-run hook declaration is consistent with the base API

dryRunCycle(Addr startAddr) override cleanly extends TimedBaseBTBPredictor and matches the implementation in btb_tage.cc; interface-wise this is sound.

src/cpu/pred/btb/btb_ittage.hh (1)

93-101: ITTAGE dry-run override keeps the BTB predictor API uniform

dryRunCycle(Addr startAddr) override for BTBITTAGE matches the base and other predictors; with a no-op implementation this is harmless and keeps the interface consistent.

src/cpu/pred/btb/btb_ittage.cc (1)

159-162: No-op ITTAGE dryRunCycle is fine

A trivial BTBITTAGE::dryRunCycle(Addr) implementation that just returns is appropriate here given ITTAGE doesn’t model bank conflicts; it satisfies the common interface without side effects.

src/cpu/o3/fetch.cc (1)

1505-1563: Per-entry resolve queue handling looks correct and capacity-safe

The new handleIEWSignals() flow:

  • Merges incoming resolvedCFIs by fsqId into existing resolveQueue entries, so each FSQ has at most one queue entry with a vector of PCs.
  • When no merge is possible and resolveQueue.size() >= resolveQueueSize, it cleanly drops the item while bumping resolveQueueFullEvents and resolveEnqueueFailEvent.
  • Otherwise it creates a ResolveQueueEntry, pushes it, and tracks enqueueCount for stats.
  • After enqueuing, it processes only resolveQueue.front() via prepareResolveUpdateEntries, markCFIResolved for each PC, and resolveUpdate, then pops on success and counts dequeues.

This preserves head-of-queue ordering, avoids overfilling the queue, and keeps stats consistent with the actual operations. Semantics look sound with the new resolveQueueSize parameter.

@github-actions

🚀 Coremark Smoke Test Results

| Branch | IPC | Change |
| --- | --- | --- |
| Base (xs-dev) | 2.1772 | - |
| This PR | 2.1039 | 📉 -0.0733 (-3.37%) |

✅ Difftest smoke test passed!

@XiangShanRobot

[Generated by GEM5 Performance Robot]
commit: f3a0a83
workflow: gem5 Align BTB Performance Test(0.3c)

Align BTB Performance

Overall Score

| | PR | Master | Diff(%) |
| --- | --- | --- | --- |
| Score | 17.44 | 17.53 | -0.50 🔴 |

@Yakkhini (Collaborator, Author) commented Jan 5, 2026

[screenshot attachment]

@Yakkhini closed this Jan 14, 2026
