cpu-o3: 2 entry size resolve queue for evaluation (#670)
Conversation
📝 Walkthrough

Per-core configs add a `resolveQueueSize` parameter, the fetch stage gains per-item resolve-queue merge/enqueue handling, and the BTB predictors gain a `dryRunCycle` hook.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant IEW as IEW (producer)
    participant Fetch as Fetch Stage
    participant ResolveQ as ResolveQueue
    participant DBPBTB as DBP/BTB (updater)
    participant TAGE as TAGE Predictor
    IEW->>Fetch: send resolved CFIs (list)
    note right of Fetch: For each CFI:<br/>- try merge by FSQ ID<br/>- if merged -> append PC<br/>- else if ResolveQ.size >= resolveQueueSize -> record enqueue-fail<br/>- else -> create & enqueue entry
    Fetch->>ResolveQ: merge/enqueue (per-item)
    ResolveQ-->>Fetch: ack / full
    alt ResolveQ non-empty after enqueue
        Fetch->>DBPBTB: prepare resolveUpdate for front entry
        DBPBTB-->>Fetch: update result (success / failure)
        alt success
            Fetch->>ResolveQ: dequeue front entry
            ResolveQ-->>Fetch: dequeued
        else failure
            Fetch->>Fetch: notify failure (do not dequeue)
        end
    end
    %% Predictor dry-run interactions (separate flow)
    Note over DecoupledBpred,TAGE: On squash or override-bubble paths
    DecoupledBpred->>TAGE: dryRunCycle(s0PC)
    TAGE-->>DecoupledBpred: returns (updates predBank tracking)
```
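The merge/enqueue note in the diagram can be sketched in standalone form. This is a minimal illustration, not the gem5 implementation; the type and field names (`ResolvedCFI`, `ResolveQueueEntry`, `resolvedInstPC`) are borrowed from the review's wording, and the counters stand in for the real stats objects.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the structures the review describes; the
// real gem5 types live in src/cpu/o3/fetch.hh and differ in detail.
struct ResolvedCFI { uint64_t fsqId; uint64_t pc; };
struct ResolveQueueEntry {
    uint64_t resolvedFSQId;
    std::vector<uint64_t> resolvedInstPC;
};

// Per-item merge-or-enqueue, as in the diagram note: merge by FSQ ID,
// otherwise enqueue if there is room, otherwise drop and count the
// failure. Returns the number of NEW entries created (merges excluded).
size_t mergeOrEnqueue(std::vector<ResolveQueueEntry> &queue,
                      const std::vector<ResolvedCFI> &incoming,
                      size_t capacity, size_t &dropped)
{
    size_t enqueued = 0;
    for (const auto &cfi : incoming) {
        bool merged = false;
        for (auto &entry : queue) {
            if (entry.resolvedFSQId == cfi.fsqId) {
                entry.resolvedInstPC.push_back(cfi.pc);
                merged = true;
                break;
            }
        }
        if (merged)
            continue;
        if (queue.size() >= capacity) {
            dropped++;  // would bump resolveQueueFullEvents here
            continue;
        }
        queue.push_back({cfi.fsqId, {cfi.pc}});
        enqueued++;
    }
    return enqueued;
}
```

With a capacity of 2, a third distinct FSQ ID in one batch is dropped while same-ID items still merge into their existing entry.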
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks and finishing touches
- ❌ Failed checks (1 warning)
- ✅ Passed checks (2 passed)
- ✨ Finishing touches
Actionable comments posted: 1
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- configs/example/idealkmhv3.py
- configs/example/kmhv3.py
- src/cpu/o3/fetch.cc
🧰 Additional context used
🧬 Code graph analysis (1)
src/cpu/o3/fetch.cc (1)
src/cpu/o3/fetch.hh (1)
fetchStats(1120-1120)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Quick Build, Unit Tests & Smoke Test
- GitHub Check: perf_test / XS-GEM5 - Run performance test (spec06-0.3c)
🔇 Additional comments (5)
configs/example/idealkmhv3.py (1)
32-32: LGTM: Resolve queue size configuration added.

The addition of `cpu.resolveQueueSize = 2` is correctly placed within the fetch configuration block and aligns with the PR objective of implementing a 2-entry resolve queue for evaluation.

src/cpu/o3/fetch.cc (3)
1515-1541: Approve the per-item resolve queue logic, pending duplicate stat fix.

The refactored resolve queue handling correctly implements per-item processing with:
- Merge behavior for matching FSQ IDs to avoid duplicates
- Full-queue overflow checks per item
- Separate tracking of enqueue counts for new entries only
The logic flow is sound, though the duplicate statistic increment on line 1532 must be fixed.
1547-1562: LGTM: Resolve-update/dequeue post-processing flow.

The post-processing logic correctly:
- Guards against empty queue before accessing `front()`
- Prepares and marks CFIs as resolved via dbpbtb
- Dequeues on successful `resolveUpdate()` with appropriate statistics
- Notifies failure when update cannot be completed
The single-entry-per-tick processing is appropriate for pipelined operation.
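The success/failure branches above amount to a single-entry-per-tick retry loop. A hedged standalone sketch follows, with a function-pointer `resolveUpdate` standing in for the real DBP/BTB call; the `Entry` shape is assumed from the review, not taken from the gem5 source.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Simplified stand-in for the review's ResolveQueueEntry.
struct Entry { uint64_t fsqId; std::vector<uint64_t> pcs; };

// Try to retire the front entry once per tick; on failure the entry
// stays queued and is retried next tick, giving natural backpressure.
bool processFront(std::deque<Entry> &queue,
                  bool (*resolveUpdate)(const Entry &),
                  size_t &dequeued)
{
    if (queue.empty())
        return false;        // guard before front()
    const Entry &front = queue.front();
    if (!resolveUpdate(front))
        return false;        // keep entry; retry next tick
    queue.pop_front();
    dequeued++;              // stat: successful dequeues
    return true;
}
```

A failed update leaves the queue untouched; a later successful update retires the same entry, which is exactly the retry-on-failure semantics described above.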
1543-1545: LGTM: Statistics emission for enqueue and occupancy.

The statistics correctly capture:
- `enqueueCount`: Number of new entries added in this cycle
- `resolveQueue.size()`: Current queue occupancy

These metrics are emitted after processing all incoming CFIs, providing accurate evaluation data for the 2-entry resolve queue.
configs/example/kmhv3.py (1)
30-30: The fetchQueueSize reduction from 64 to 2 entries (32x reduction) is intentional for evaluation of a 2-entry resolve queue, as documented in the git history. This configuration applies system-wide to all CPUs and may impact fetch stage buffering, but this is expected for the evaluation scenario.
🚀 Coremark Smoke Test Results
✅ Difftest smoke test passed!

Force-pushed: d328815 to ada8778
🚀 Coremark Smoke Test Results
✅ Difftest smoke test passed!

🚀 Performance test triggered: spec06-0.8c

Force-pushed: ada8778 to 65606de
Actionable comments posted: 0
🧹 Nitpick comments (1)
src/cpu/o3/fetch.cc (1)
1515-1527: LGTM with optional optimization note.

The merging logic correctly consolidates resolved CFIs with matching FSQ IDs. The O(n·m) nested loop is acceptable given the 2-entry queue size mentioned in the PR.
Optional: Hash-based lookup for larger queues
If the queue size increases significantly in the future, consider using a hash map to avoid the O(n·m) nested loop:
```cpp
// Build a map for O(1) lookups
std::unordered_map<uint64_t, size_t> fsqIdToIndex;
for (size_t i = 0; i < resolveQueue.size(); i++) {
    fsqIdToIndex[resolveQueue[i].resolvedFSQId] = i;
}

for (const auto &resolved : incoming) {
    auto it = fsqIdToIndex.find(resolved.fsqId);
    if (it != fsqIdToIndex.end()) {
        resolveQueue[it->second].resolvedInstPC.push_back(resolved.pc);
        continue;
    }
    // ... rest of enqueue logic
}
```

However, this adds complexity that isn't justified for a 2-entry queue.
📜 Review details
📒 Files selected for processing (3)
- configs/example/idealkmhv3.py
- configs/example/kmhv3.py
- src/cpu/o3/fetch.cc
🚧 Files skipped from review as they are similar to previous changes (2)
- configs/example/kmhv3.py
- configs/example/idealkmhv3.py
🧰 Additional context used
🧬 Code graph analysis (1)
src/cpu/o3/fetch.cc (1)
src/cpu/o3/fetch.hh (1)
fetchStats(1120-1120)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Quick Build, Unit Tests & Smoke Test
- GitHub Check: perf_test / XS-GEM5 - Run performance test (spec06-0.3c)
🔇 Additional comments (3)
src/cpu/o3/fetch.cc (3)
1530-1534: Duplicate increment fixed - LGTM.

The duplicate increment issue flagged in the previous review has been correctly addressed. Both statistics now increment exactly once when the queue is full and an enqueue attempt fails.
1536-1543: Verify enqueueCount semantics for merged entries.

The `enqueueCount` is incremented only when a new entry is created (line 1540), not when resolved CFIs are merged into existing entries (lines 1519-1520). Please confirm whether this is the intended behavior.
- If the statistic should track "new queue entries created," the current implementation is correct.
- If it should track "total resolved CFIs processed," then merged CFIs should also increment the count.
Alternative if merged entries should be counted
```diff
 for (const auto &resolved : incoming) {
     bool merged = false;
     for (auto &queued : resolveQueue) {
         if (queued.resolvedFSQId == resolved.fsqId) {
             queued.resolvedInstPC.push_back(resolved.pc);
             merged = true;
+            enqueueCount++; // Count merged entries too
             break;
         }
     }
```
1547-1562: Front entry processing logic is correct.

The front queue entry processing correctly implements retry-on-failure semantics: the entry remains in the queue when `resolveUpdate` fails and is retried on the next tick. This provides natural backpressure when the BTB cannot accept updates.
🚀 Coremark Smoke Test Results
✅ Difftest smoke test passed!

[Generated by GEM5 Performance Robot] Standard Performance: Overall Score
Change-Id: Idbdcff6c75d797501ee831d143aa968eccaf6bdf
Change-Id: I4d3b8a805a64a04dfe55e97707b55ed481c1ed04
Force-pushed: 65606de to f3a0a83
Change-Id: Ie1936843e32037b77c45b6134a175563b3e6833a
Actionable comments posted: 0
🧹 Nitpick comments (4)
src/cpu/pred/btb/timed_base_pred.hh (1)
56-63: Base `dryRunCycle` hook is well-placed; minor style nit only

Adding `virtual void dryRunCycle(Addr startAddr) {}` in the base class cleanly supports optional dry-run behavior in derived predictors. If you want to avoid unused-parameter warnings in stricter builds, you could drop the parameter name or comment it (`Addr /*startAddr*/`), but that's purely cosmetic.

src/cpu/pred/btb/btb_tage.cc (1)
336-343: `dryRunCycle` correctly integrates with the BTBTAGE bank-conflict model

Recording `lastPredBankId`/`predBankValid` in `dryRunCycle` matches what `putPCHistory` does and cleanly feeds `canResolveUpdate`'s bank-conflict check for cycles where the predictor doesn't do a full lookup but still occupies a bank (e.g., override bubbles, squash handling). The logic looks self-consistent with the `enableBankConflict` gating in `canResolveUpdate`.

If you ever want to micro-optimize, you could early-return in `dryRunCycle` when `!enableBankConflict`, but it's not necessary.

Also applies to: 359-367, 658-681
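A rough standalone model of that bookkeeping may help: each cycle (real lookup or dry run) records which bank the predictor occupied, and a resolve update is rejected if it would hit the same bank. The modulo bank hash below is an assumption for illustration; the real BTBTAGE indexing and bank count differ.

```cpp
#include <cassert>
#include <cstdint>

using Addr = uint64_t;

// Hypothetical sketch of bank-conflict tracking; field names follow
// the review (lastPredBankId, predBankValid, enableBankConflict).
struct BankTracker {
    static constexpr unsigned numBanks = 4;
    unsigned lastPredBankId = 0;
    bool predBankValid = false;
    bool enableBankConflict = true;

    // Assumed toy hash: real gem5 indexing is different.
    static unsigned bankOf(Addr pc) { return (pc >> 1) % numBanks; }

    // Record the bank occupied this cycle, even without a full lookup.
    void dryRunCycle(Addr s0PC) {
        lastPredBankId = bankOf(s0PC);
        predBankValid = true;
    }

    // A resolve update is refused if it collides with the recorded bank.
    bool canResolveUpdate(Addr updatePC) const {
        if (!enableBankConflict || !predBankValid)
            return true;
        return bankOf(updatePC) != lastPredBankId;
    }
};
```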
src/cpu/pred/btb/decoupled_bpred.cc (1)
132-140: `dryRunCycle` call sites align with intended "no-prediction" cycles

Invoking `tage->dryRunCycle(s0PC)`:
- On squash (after resetting BPU state), and
- While in `PREDICTION_OUTSTANDING` with pending `numOverrideBubbles`

nicely keeps BTBTAGE's bank-conflict bookkeeping in step with the pipeline even when no new prediction is requested. This should make `canResolveUpdate`'s conflict detection more realistic without changing visible behavior.

You might optionally guard the calls with `if (tage->isEnabled())` for symmetry with other TAGE usage, but functionally this is sound.

Also applies to: 166-168
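The base-class hook pattern behind these call sites can be shown in isolation. The class names below are hypothetical sketches; only the shape of the virtual hook (no-op default, override where bookkeeping is needed, unnamed parameter to avoid warnings) mirrors the patch.

```cpp
#include <cassert>
#include <cstdint>

using Addr = uint64_t;

// No-op default in the base: derived predictors opt in selectively.
struct TimedBasePredictorSketch {
    virtual ~TimedBasePredictorSketch() = default;
    virtual void dryRunCycle(Addr /* startAddr */) {}
};

// A TAGE-like predictor overrides the hook to do per-cycle bookkeeping
// (the real code records bank occupancy; this just records the call).
struct TagePredictorSketch : TimedBasePredictorSketch {
    Addr lastStart = 0;
    bool sawCycle = false;
    void dryRunCycle(Addr startAddr) override {
        lastStart = startAddr;
        sawCycle = true;
    }
};
```

Callers hold a base pointer and call the hook unconditionally; predictors without per-cycle state (like ITTAGE here) inherit the harmless no-op.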
src/cpu/o3/fetch.cc (1)
254-337: Resolve-queue statistics wiring matches the new behavior

The added stats:
- `resolveQueueFullEvents` and `resolveEnqueueFailEvent` are incremented only on failed enqueues, which matches their descriptions.
- `resolveEnqueueCount.init(1, 8, 1)` + `sample(enqueueCount)` turns this into a per-cycle "entries enqueued" distribution, while `resolveQueueOccupancy.init(0, 32, 1)` tracks instantaneous queue depth.

Functionally this is fine. If you intend `resolveEnqueueCount` to be a simple cumulative counter rather than a distribution, you'd instead increment a scalar; otherwise, consider updating the description comment to clarify it's a per-cycle histogram.
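The scalar-vs-distribution distinction can be illustrated without the gem5 Stats framework. The bucketing below loosely mimics `init(1, 8, 1)` (values 1..8 in steps of 1); everything else is a hypothetical plain-C++ stand-in, not the Stats API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Two stat shapes side by side: a cumulative scalar (total entries
// ever enqueued) and a per-cycle histogram (how many cycles enqueued
// exactly k entries, for k in [lo, hi]).
struct EnqueueStats {
    size_t cumulative = 0;        // scalar counter
    std::vector<size_t> buckets;  // per-cycle distribution
    size_t lo, hi;

    EnqueueStats(size_t lo_, size_t hi_)
        : buckets(hi_ - lo_ + 1, 0), lo(lo_), hi(hi_) {}

    // Called once per cycle with that cycle's enqueue count.
    void sample(size_t perCycleCount) {
        cumulative += perCycleCount;
        if (perCycleCount >= lo && perCycleCount <= hi)
            buckets[perCycleCount - lo]++;
    }
};
```

The same `sample()` stream feeds both views; the reviewer's point is that the stat's description should say which of the two it reports.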
📜 Review details
📒 Files selected for processing (9)
- configs/example/idealkmhv3.py
- configs/example/kmhv3.py
- src/cpu/o3/fetch.cc
- src/cpu/pred/btb/btb_ittage.cc
- src/cpu/pred/btb/btb_ittage.hh
- src/cpu/pred/btb/btb_tage.cc
- src/cpu/pred/btb/btb_tage.hh
- src/cpu/pred/btb/decoupled_bpred.cc
- src/cpu/pred/btb/timed_base_pred.hh
🚧 Files skipped from review as they are similar to previous changes (1)
- configs/example/kmhv3.py
🧰 Additional context used
🧬 Code graph analysis (6)
src/cpu/pred/btb/timed_base_pred.hh (2)
- src/cpu/pred/btb/btb_ittage.hh (2): startAddr(96-96), startAddr(98-100)
- src/cpu/pred/btb/btb_tage.hh (2): startAddr(121-121), startAddr(123-125)

src/cpu/pred/btb/btb_tage.hh (2)
- src/cpu/pred/btb/btb_ittage.hh (2): startAddr(96-96), startAddr(98-100)
- src/cpu/pred/btb/timed_base_pred.hh (3): startAddr(58-58), startAddr(60-62), startAddr(60-62)

src/cpu/pred/btb/btb_ittage.cc (1)
- src/cpu/pred/btb/btb_tage.cc (2): dryRunCycle(336-344), dryRunCycle(337-337)

src/cpu/pred/btb/btb_ittage.hh (2)
- src/cpu/pred/btb/btb_tage.hh (2): startAddr(121-121), startAddr(123-125)
- src/cpu/pred/btb/timed_base_pred.hh (3): startAddr(58-58), startAddr(60-62), startAddr(60-62)

src/cpu/pred/btb/btb_tage.cc (1)
- src/cpu/pred/btb/btb_ittage.cc (2): dryRunCycle(159-162), dryRunCycle(160-160)

src/cpu/pred/btb/decoupled_bpred.cc (1)
- src/cpu/pred/btb/test/btb_tage.test.cc (1): tage(276-282)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Quick Build, Unit Tests & Smoke Test
- GitHub Check: perf_test / XS-GEM5 - Run performance test (spec06-0.3c)
🔇 Additional comments (5)
configs/example/idealkmhv3.py (1)

31-31: Resolve queue size wiring for ideal KMHV3 looks consistent

Setting `cpu.resolveQueueSize = 2` alongside `fetchQueueSize` keeps this config aligned with the new resolve-queue behavior; no issues from this change alone.

src/cpu/pred/btb/btb_tage.hh (1)

118-125: BTBTAGE dry-run hook declaration is consistent with the base API

`dryRunCycle(Addr startAddr) override` cleanly extends `TimedBaseBTBPredictor` and matches the implementation in `btb_tage.cc`; interface-wise this is sound.

src/cpu/pred/btb/btb_ittage.hh (1)

93-101: ITTAGE dry-run override keeps the BTB predictor API uniform

`dryRunCycle(Addr startAddr) override` for `BTBITTAGE` matches the base and other predictors; with a no-op implementation this is harmless and keeps the interface consistent.

src/cpu/pred/btb/btb_ittage.cc (1)

159-162: No-op ITTAGE dryRunCycle is fine

A trivial `BTBITTAGE::dryRunCycle(Addr)` implementation that just returns is appropriate here given ITTAGE doesn't model bank conflicts; it satisfies the common interface without side effects.

src/cpu/o3/fetch.cc (1)
1505-1563: Per-entry resolve queue handling looks correct and capacity-safe

The new `handleIEWSignals()` flow:
- Merges incoming `resolvedCFIs` by `fsqId` into existing `resolveQueue` entries, so each FSQ has at most one queue entry with a vector of PCs.
- When no merge is possible and `resolveQueue.size() >= resolveQueueSize`, it cleanly drops the item while bumping `resolveQueueFullEvents` and `resolveEnqueueFailEvent`.
- Otherwise it creates a `ResolveQueueEntry`, pushes it, and tracks `enqueueCount` for stats.
- After enqueuing, it processes only `resolveQueue.front()` via `prepareResolveUpdateEntries`, `markCFIResolved` for each PC, and `resolveUpdate`, then pops on success and counts dequeues.

This preserves head-of-queue ordering, avoids overfilling the queue, and keeps stats consistent with the actual operations. Semantics look sound with the new `resolveQueueSize` parameter.
🚀 Coremark Smoke Test Results
✅ Difftest smoke test passed!

[Generated by GEM5 Performance Robot] Align BTB Performance: Overall Score
Change-Id: Ie1936843e32037b77c45b6134a175563b3e6833a
Summary by CodeRabbit
Refactor
Chores