
cpu-o3: 2 entry size resolve queue for evaluation#670

Closed
Yakkhini wants to merge 3 commits into xs-dev from resolve-queue-align

Conversation

@Yakkhini (Collaborator) commented Dec 23, 2025

Change-Id: Ie1936843e32037b77c45b6134a175563b3e6833a

Summary by CodeRabbit

  • Refactor

    • Improved resolve-queue processing with per-item merging, tighter overflow handling, and more accurate enqueue/dequeue accounting; added post-processing of the front resolve entry with conditional dequeue on success.
    • Introduced a lightweight "dry-run" prediction hook invoked in squash and outstanding-prediction paths to better track prediction state without committing changes.
  • Chores

    • Added per-core resolve-queue sizing option for finer pipeline tuning.


@coderabbitai bot commented Dec 23, 2025

📝 Walkthrough

Per-core configs add cpu.resolveQueueSize = 2. Fetch-stage resolve handling rewritten to per-item merge/enqueue logic with queue-full gating, enqueue/dequeue/stat tracking, and front-entry resolve-update processing. BTB predictors gain a dryRunCycle API and are invoked from the decoupled predictor in specific paths.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Configuration adjustments**<br/>`configs/example/idealkmhv3.py`, `configs/example/kmhv3.py` | Add `cpu.resolveQueueSize = 2` to per-core parameter sets (inserted near fetch queue params). |
| **Fetch-stage resolve queue logic**<br/>`src/cpu/o3/fetch.cc` | Replace bulk enqueue logic with per-item processing: merge by FSQ ID, per-item full-queue checks against `resolveQueueSize`, per-item enqueue/dequeue/stat updates; perform the resolve-update for the front entry and dequeue only on success. |
| **BTB predictor API additions**<br/>`src/cpu/pred/btb/btb_ittage.hh`, `src/cpu/pred/btb/btb_ittage.cc`, `src/cpu/pred/btb/btb_tage.hh`, `src/cpu/pred/btb/btb_tage.cc`, `src/cpu/pred/btb/timed_base_pred.hh` | Add a `dryRunCycle(Addr)` declaration to the predictor base/headers and implement a no-op / lightweight dry run in the ITTAGE/TAGE implementations. |
| **Predictor integration**<br/>`src/cpu/pred/btb/decoupled_bpred.cc` | Invoke `tage->dryRunCycle(s0PC)` in squash and prediction-outstanding/override-bubble paths to trigger the new dry-run hook. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  participant IEW as IEW (producer)
  participant Fetch as Fetch Stage
  participant ResolveQ as ResolveQueue
  participant DBPBTB as DBP/BTB (updater)
  participant DecoupledBpred as Decoupled BPred
  participant TAGE as TAGE Predictor

  IEW->>Fetch: send resolved CFIs (list)
  note right of Fetch: For each CFI:<br/>- try merge by FSQ ID<br/>- if merged, append PC<br/>- else if ResolveQ.size >= resolveQueueSize, record enqueue-fail<br/>- else create & enqueue entry
  Fetch->>ResolveQ: merge/enqueue (per item)
  ResolveQ-->>Fetch: ack / full

  alt ResolveQ non-empty after enqueue
    Fetch->>DBPBTB: prepare resolveUpdate for front entry
    DBPBTB-->>Fetch: update result (success / failure)
    alt success
      Fetch->>ResolveQ: dequeue front entry
      ResolveQ-->>Fetch: dequeued
    else failure
      Fetch->>Fetch: notify failure (do not dequeue)
    end
  end

  %% Predictor dry-run interactions (separate flow)
  Note over DecoupledBpred,TAGE: On squash or override-bubble paths
  DecoupledBpred->>TAGE: dryRunCycle(s0PC)
  TAGE-->>DecoupledBpred: returns (updates predBank tracking)
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested labels

align-kmhv3

Suggested reviewers

  • jensen-yan

Poem

"I hopped through queues with nimble paws,
Merged PCs and watched the counters pause.
Two cozy slots I gently placed,
Front-first resolves—neat and chaste. 🐇"

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 23.08%, below the required threshold of 80.00%. | Run `@coderabbitai generate docstrings` to improve docstring coverage. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title 'cpu-o3: 2 entry size resolve queue for evaluation' directly and specifically describes the main change: setting the resolve queue to 2 entries across multiple configuration files and in the fetch.cc implementation. |


@coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8ef695e and d328815.

📒 Files selected for processing (3)
  • configs/example/idealkmhv3.py
  • configs/example/kmhv3.py
  • src/cpu/o3/fetch.cc
🧰 Additional context used
🧬 Code graph analysis (1)
src/cpu/o3/fetch.cc (1)
src/cpu/o3/fetch.hh (1)
  • fetchStats (1120-1120)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Quick Build, Unit Tests & Smoke Test
  • GitHub Check: perf_test / XS-GEM5 - Run performance test (spec06-0.3c)
🔇 Additional comments (5)
configs/example/idealkmhv3.py (1)

32-32: LGTM: Resolve queue size configuration added.

The addition of cpu.resolveQueueSize = 2 is correctly placed within the fetch configuration block and aligns with the PR objective of implementing a 2-entry resolve queue for evaluation.

src/cpu/o3/fetch.cc (3)

1515-1541: Approve the per-item resolve queue logic, pending duplicate stat fix.

The refactored resolve queue handling correctly implements per-item processing with:

  • Merge behavior for matching FSQ IDs to avoid duplicates
  • Full-queue overflow checks per item
  • Separate tracking of enqueue counts for new entries only

The logic flow is sound, though the duplicate statistic increment on line 1532 must be fixed.


1547-1562: LGTM: Resolve-update/dequeue post-processing flow.

The post-processing logic correctly:

  • Guards against empty queue before accessing front()
  • Prepares and marks CFIs as resolved via dbpbtb
  • Dequeues on successful resolveUpdate() with appropriate statistics
  • Notifies failure when update cannot be completed

The single-entry-per-tick processing is appropriate for pipelined operation.


1543-1545: LGTM: Statistics emission for enqueue and occupancy.

The statistics correctly capture:

  • enqueueCount: Number of new entries added in this cycle
  • resolveQueue.size(): Current queue occupancy

These metrics are emitted after processing all incoming CFIs, providing accurate evaluation data for the 2-entry resolve queue.

configs/example/kmhv3.py (1)

30-30: The fetchQueueSize reduction from 64 to 2 entries (32x reduction) is intentional for evaluation of a 2-entry resolve queue, as documented in the git history. This configuration applies system-wide to all CPUs and may impact fetch stage buffering, but this is expected for the evaluation scenario.

@github-actions

🚀 Coremark Smoke Test Results

| Branch | IPC | Change |
| --- | --- | --- |
| Base (xs-dev) | 2.0659 | - |
| This PR | 1.3785 | 📉 -0.6874 (-33.27%) |

✅ Difftest smoke test passed!

@Yakkhini force-pushed the resolve-queue-align branch from d328815 to ada8778 on December 23, 2025 06:39
@github-actions

🚀 Coremark Smoke Test Results

| Branch | IPC | Change |
| --- | --- | --- |
| Base (xs-dev) | 2.0659 | - |
| This PR | 1.3785 | 📉 -0.6874 (-33.27%) |

✅ Difftest smoke test passed!

@Yakkhini added the perf label Dec 23, 2025
@github-actions

🚀 Performance test triggered: spec06-0.8c

@Yakkhini Yakkhini force-pushed the resolve-queue-align branch from ada8778 to 65606de Compare December 24, 2025 06:59
@coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
src/cpu/o3/fetch.cc (1)

1515-1527: LGTM with optional optimization note.

The merging logic correctly consolidates resolved CFIs with matching FSQ IDs. The O(n·m) nested loop is acceptable given the 2-entry queue size mentioned in the PR.

Optional: Hash-based lookup for larger queues

If the queue size increases significantly in the future, consider using a hash map to avoid the O(n·m) nested loop:

```cpp
// Build a map for O(1) lookups
std::unordered_map<uint64_t, size_t> fsqIdToIndex;
for (size_t i = 0; i < resolveQueue.size(); i++) {
    fsqIdToIndex[resolveQueue[i].resolvedFSQId] = i;
}

for (const auto &resolved : incoming) {
    auto it = fsqIdToIndex.find(resolved.fsqId);
    if (it != fsqIdToIndex.end()) {
        resolveQueue[it->second].resolvedInstPC.push_back(resolved.pc);
        continue;
    }
    // ... rest of enqueue logic
}
```

However, this adds complexity that isn't justified for a 2-entry queue.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ada8778 and 65606de.

📒 Files selected for processing (3)
  • configs/example/idealkmhv3.py
  • configs/example/kmhv3.py
  • src/cpu/o3/fetch.cc
🚧 Files skipped from review as they are similar to previous changes (2)
  • configs/example/kmhv3.py
  • configs/example/idealkmhv3.py
🧰 Additional context used
🧬 Code graph analysis (1)
src/cpu/o3/fetch.cc (1)
src/cpu/o3/fetch.hh (1)
  • fetchStats (1120-1120)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Quick Build, Unit Tests & Smoke Test
  • GitHub Check: perf_test / XS-GEM5 - Run performance test (spec06-0.3c)
🔇 Additional comments (3)
src/cpu/o3/fetch.cc (3)

1530-1534: Duplicate increment fixed - LGTM.

The duplicate increment issue flagged in the previous review has been correctly addressed. Both statistics now increment exactly once when the queue is full and an enqueue attempt fails.


1536-1543: Verify enqueueCount semantics for merged entries.

The enqueueCount is incremented only when a new entry is created (line 1540), not when resolved CFIs are merged into existing entries (lines 1519-1520). Please confirm whether this is the intended behavior.

  • If the statistic should track "new queue entries created," the current implementation is correct.
  • If it should track "total resolved CFIs processed," then merged CFIs should also increment the count.
Alternative if merged entries should be counted
```diff
 for (const auto &resolved : incoming) {
     bool merged = false;
     for (auto &queued : resolveQueue) {
         if (queued.resolvedFSQId == resolved.fsqId) {
             queued.resolvedInstPC.push_back(resolved.pc);
             merged = true;
+            enqueueCount++;  // Count merged entries too
             break;
         }
     }
```

1547-1562: Front entry processing logic is correct.

The front queue entry processing correctly implements retry-on-failure semantics: the entry remains in the queue when resolveUpdate fails and is retried on the next tick. This provides natural backpressure when the BTB cannot accept updates.

@github-actions

🚀 Coremark Smoke Test Results

| Branch | IPC | Change |
| --- | --- | --- |
| Base (xs-dev) | 2.0659 | - |
| This PR | 2.0397 | 📉 -0.0262 (-1.27%) |

✅ Difftest smoke test passed!

@XiangShanRobot

[Generated by GEM5 Performance Robot]
commit: 65606de
workflow: gem5 Align BTB Performance Test(0.3c)

Standard Performance

Overall Score

| | PR | Master | Diff(%) |
| --- | --- | --- | --- |
| Score | 16.85 | 16.87 | -0.16 🔴 |

@Yakkhini (Collaborator, Author) commented:

[screenshot attachment]

Change-Id: Idbdcff6c75d797501ee831d143aa968eccaf6bdf
Change-Id: I4d3b8a805a64a04dfe55e97707b55ed481c1ed04
@Yakkhini force-pushed the resolve-queue-align branch from 65606de to f3a0a83 on December 31, 2025 03:00
Change-Id: Ie1936843e32037b77c45b6134a175563b3e6833a
@coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (4)
src/cpu/pred/btb/timed_base_pred.hh (1)

56-63: Base dryRunCycle hook is well-placed; minor style nit only

Adding virtual void dryRunCycle(Addr startAddr) {} in the base class cleanly supports optional dry-run behavior in derived predictors. If you want to avoid unused-parameter warnings in stricter builds, you could drop the parameter name or comment it (Addr /*startAddr*/), but that’s purely cosmetic.

src/cpu/pred/btb/btb_tage.cc (1)

336-343: dryRunCycle correctly integrates with the BTBTAGE bank-conflict model

Recording lastPredBankId/predBankValid in dryRunCycle matches what putPCHistory does and cleanly feeds canResolveUpdate’s bank-conflict check for cycles where the predictor doesn’t do a full lookup but still occupies a bank (e.g., override bubbles, squash handling). The logic looks self-consistent with enableBankConflict gating in canResolveUpdate.

If you ever want to micro-optimize, you could early-return in dryRunCycle when !enableBankConflict, but it’s not necessary.

Also applies to: 359-367, 658-681

src/cpu/pred/btb/decoupled_bpred.cc (1)

132-140: dryRunCycle call sites align with intended “no-prediction” cycles

Invoking tage->dryRunCycle(s0PC):

  • On squash (after resetting BPU state), and
  • While in PREDICTION_OUTSTANDING with pending numOverrideBubbles

nicely keeps BTBTAGE’s bank-conflict bookkeeping in step with the pipeline even when no new prediction is requested. This should make canResolveUpdate’s conflict detection more realistic without changing visible behavior.

You might optionally guard the calls with if (tage->isEnabled()) for symmetry with other TAGE usage, but functionally this is sound.

Also applies to: 166-168

src/cpu/o3/fetch.cc (1)

254-337: Resolve-queue statistics wiring matches the new behavior

The added stats:

  • resolveQueueFullEvents and resolveEnqueueFailEvent are incremented only on failed enqueues, which matches their descriptions.
  • resolveEnqueueCount.init(1, 8, 1) + sample(enqueueCount) turns this into a per-cycle “entries enqueued” distribution, while resolveQueueOccupancy.init(0, 32, 1) tracks instantaneous queue depth.

Functionally this is fine. If you intend resolveEnqueueCount to be a simple cumulative counter rather than a distribution, you’d instead increment a scalar; otherwise, consider updating the description comment to clarify it’s a per-cycle histogram.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 65606de and f3a0a83.

📒 Files selected for processing (9)
  • configs/example/idealkmhv3.py
  • configs/example/kmhv3.py
  • src/cpu/o3/fetch.cc
  • src/cpu/pred/btb/btb_ittage.cc
  • src/cpu/pred/btb/btb_ittage.hh
  • src/cpu/pred/btb/btb_tage.cc
  • src/cpu/pred/btb/btb_tage.hh
  • src/cpu/pred/btb/decoupled_bpred.cc
  • src/cpu/pred/btb/timed_base_pred.hh
🚧 Files skipped from review as they are similar to previous changes (1)
  • configs/example/kmhv3.py
🧰 Additional context used
🧬 Code graph analysis (6)
src/cpu/pred/btb/timed_base_pred.hh (2)
src/cpu/pred/btb/btb_ittage.hh (2)
  • startAddr (96-96)
  • startAddr (98-100)
src/cpu/pred/btb/btb_tage.hh (2)
  • startAddr (121-121)
  • startAddr (123-125)
src/cpu/pred/btb/btb_tage.hh (2)
src/cpu/pred/btb/btb_ittage.hh (2)
  • startAddr (96-96)
  • startAddr (98-100)
src/cpu/pred/btb/timed_base_pred.hh (3)
  • startAddr (58-58)
  • startAddr (60-62)
  • startAddr (60-62)
src/cpu/pred/btb/btb_ittage.cc (1)
src/cpu/pred/btb/btb_tage.cc (2)
  • dryRunCycle (336-344)
  • dryRunCycle (337-337)
src/cpu/pred/btb/btb_ittage.hh (2)
src/cpu/pred/btb/btb_tage.hh (2)
  • startAddr (121-121)
  • startAddr (123-125)
src/cpu/pred/btb/timed_base_pred.hh (3)
  • startAddr (58-58)
  • startAddr (60-62)
  • startAddr (60-62)
src/cpu/pred/btb/btb_tage.cc (1)
src/cpu/pred/btb/btb_ittage.cc (2)
  • dryRunCycle (159-162)
  • dryRunCycle (160-160)
src/cpu/pred/btb/decoupled_bpred.cc (1)
src/cpu/pred/btb/test/btb_tage.test.cc (1)
  • tage (276-282)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Quick Build, Unit Tests & Smoke Test
  • GitHub Check: perf_test / XS-GEM5 - Run performance test (spec06-0.3c)
🔇 Additional comments (5)
configs/example/idealkmhv3.py (1)

31-31: Resolve queue size wiring for ideal KMHV3 looks consistent

Setting cpu.resolveQueueSize = 2 alongside fetchQueueSize keeps this config aligned with the new resolve-queue behavior; no issues from this change alone.

src/cpu/pred/btb/btb_tage.hh (1)

118-125: BTBTAGE dry-run hook declaration is consistent with the base API

dryRunCycle(Addr startAddr) override cleanly extends TimedBaseBTBPredictor and matches the implementation in btb_tage.cc; interface-wise this is sound.

src/cpu/pred/btb/btb_ittage.hh (1)

93-101: ITTAGE dry-run override keeps the BTB predictor API uniform

dryRunCycle(Addr startAddr) override for BTBITTAGE matches the base and other predictors; with a no-op implementation this is harmless and keeps the interface consistent.

src/cpu/pred/btb/btb_ittage.cc (1)

159-162: No-op ITTAGE dryRunCycle is fine

A trivial BTBITTAGE::dryRunCycle(Addr) implementation that just returns is appropriate here given ITTAGE doesn’t model bank conflicts; it satisfies the common interface without side effects.

src/cpu/o3/fetch.cc (1)

1505-1563: Per-entry resolve queue handling looks correct and capacity-safe

The new handleIEWSignals() flow:

  • Merges incoming resolvedCFIs by fsqId into existing resolveQueue entries, so each FSQ has at most one queue entry with a vector of PCs.
  • When no merge is possible and resolveQueue.size() >= resolveQueueSize, it cleanly drops the item while bumping resolveQueueFullEvents and resolveEnqueueFailEvent.
  • Otherwise it creates a ResolveQueueEntry, pushes it, and tracks enqueueCount for stats.
  • After enqueuing, it processes only resolveQueue.front() via prepareResolveUpdateEntries, markCFIResolved for each PC, and resolveUpdate, then pops on success and counts dequeues.

This preserves head-of-queue ordering, avoids overfilling the queue, and keeps stats consistent with the actual operations. Semantics look sound with the new resolveQueueSize parameter.

@github-actions

🚀 Coremark Smoke Test Results

| Branch | IPC | Change |
| --- | --- | --- |
| Base (xs-dev) | 2.1772 | - |
| This PR | 2.1039 | 📉 -0.0733 (-3.37%) |

✅ Difftest smoke test passed!

@XiangShanRobot

[Generated by GEM5 Performance Robot]
commit: f3a0a83
workflow: gem5 Align BTB Performance Test(0.3c)

Align BTB Performance

Overall Score

| | PR | Master | Diff(%) |
| --- | --- | --- | --- |
| Score | 17.44 | 17.53 | -0.50 🔴 |

@Yakkhini (Collaborator, Author) commented Jan 5, 2026

[screenshot attachment]

@Yakkhini closed this Jan 14, 2026
