
Kmh smt #745

Merged
tastynoob merged 3 commits into xs-dev from kmh-smt
Mar 12, 2026

Conversation

@tastynoob
Collaborator

@tastynoob tastynoob commented Jan 29, 2026

Do not merge; this PR can't build successfully yet.

Summary by CodeRabbit

  • Refactor

    • Fetch and branch-prediction are now thread-aware with per-thread state and pipelines for improved multi-threaded handling.
    • Fetch-target queues reworked to manage per-thread queues with round‑robin selection for fairer fetch scheduling.
    • Commit, tracing, and prediction paths updated to use thread-local target access, improving per-thread statistics, traces, and fetch accuracy.
  • New Features

    • Added SMT scheduling framework offering multiple thread-selection policies (instruction-count, delayed/round‑robin, multi-priority).

@tastynoob tastynoob requested a review from jensen-yan January 29, 2026 09:44
@coderabbitai

coderabbitai bot commented Jan 29, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Refactors fetch and BTB prediction to be thread-aware: introduces per-thread FetchBuffer and FetchTargetQueue, migrates FTQ/BTB and predictor APIs to accept ThreadID, replaces global fetch/head state with per-thread equivalents, and updates callsites to use per-thread commit/resolve semantics.

Changes

Cohort / File(s) / Summary

  • O3 fetch core (src/cpu/o3/fetch.cc, src/cpu/o3/fetch.hh, src/cpu/o3/trace/TraceFetch.cc)
    Replace global/per-core fetch state with a per-thread FetchBuffer threads[tid] (fetchpc, data, valid, startPC, cacheReq); remove pc[]/selectFetchThread; adapt fetch loops, multi-cacheline logic, and commit/update callsites to use per-thread APIs and members.
  • Decoupled BTB predictor (src/cpu/pred/btb/decoupled_bpred.hh, src/cpu/pred/btb/decoupled_bpred.cc, src/cpu/pred/btb/decoupled_bpred_stats.cc)
    Migrate predictor internals to per-thread state: add per-thread storage, add ThreadID parameters to many APIs, replace global FTQ head usage with per-thread FTQ accessors, rename update() → commit() (taking a tid), and update squash/resolve/prepare flows to be thread-aware.
  • Fetch Target Queue (FTQ) (src/cpu/pred/btb/ftq.hh, src/cpu/pred/btb/ftq.cc)
    Add FetchTargetQueue managing per-thread deques, per-thread ftqSize/baseTargetId/fetchptr, round-robin getTargetTid(), and implementations of insert, finishTarget, commitTarget, and squashAfter operating per-thread.
  • BTB structures & headers (src/cpu/pred/btb/common.hh, src/cpu/pred/btb/ftq.hh)
    Add ThreadID tid to FullBTBPrediction; expose per-thread FTQ accessors (front/back/fetch/get/hasTarget/size/empty/full) in the FTQ header.
  • Build integration (src/cpu/pred/SConscript)
    Add btb/ftq.cc to the prediction build sources.
  • Stats / tracing updates (src/cpu/pred/btb/decoupled_bpred_stats.cc)
    Switch target lookups to ftq.get(ftqId, threadNumber) for per-thread target retrieval in commit and trace paths.
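The per-thread FTQ with round-robin getTargetTid() described above can be sketched as follows. This is a hedged, standalone illustration: the type names (FetchTargetQueueSketch, PerThreadQueue, FetchTargetStub) are hypothetical stand-ins, not the actual gem5 classes; only the queue[tid].cap / baseTargetId member shapes follow the review comments.

```cpp
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical, simplified sketch of the per-thread fetch-target queue.
using ThreadID = int16_t;
using FetchTargetId = uint64_t;

struct FetchTargetStub { FetchTargetId id; };

class FetchTargetQueueSketch
{
    struct PerThreadQueue {
        std::deque<FetchTargetStub> cap;  // pending targets for this thread
        FetchTargetId baseTargetId = 1;   // id of cap.front()
    };
    std::vector<PerThreadQueue> queue;
    ThreadID lastTid = -1;  // round-robin cursor

  public:
    explicit FetchTargetQueueSketch(int numThreads) : queue(numThreads) {}

    void insert(ThreadID tid, FetchTargetStub t) { queue[tid].cap.push_back(t); }

    bool hasTarget(ThreadID tid) const { return !queue[tid].cap.empty(); }

    // Round-robin selection: start after the last-served thread and return
    // the first thread with pending fetch targets, or -1 if none has any.
    ThreadID getTargetTid()
    {
        const int n = static_cast<int>(queue.size());
        for (int i = 1; i <= n; ++i) {
            ThreadID tid = static_cast<ThreadID>((lastTid + i) % n);
            if (hasTarget(tid)) {
                lastTid = tid;
                return tid;
            }
        }
        return -1;
    }
};
```

Starting the scan one past the last-served thread is what gives the fairer fetch scheduling the summary mentions: a thread with a long backlog cannot monopolize fetch while another thread also has targets queued.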

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Fetch as O3 Fetch
    participant FTQ as FetchTargetQueue
    participant Cache as CacheSubsystem
    participant BPU as DecoupledBPUWithBTB
    participant BTB as BTB Predictor

    Fetch->>FTQ: ftqHasFetching(tid) / ftqFetchingTarget(tid)
    FTQ-->>Fetch: return FetchTarget (per-thread)
    Fetch->>Cache: prepare/issue threads[tid].cacheReq
    Cache-->>Fetch: deliver data -> threads[tid].data / valid
    Fetch->>BPU: hand off FetchTarget / threads[tid].fetchpc (tid)
    BPU->>BTB: requestNewPrediction/processNewPrediction (tid)
    BTB-->>BPU: per-thread prediction (FullBTBPrediction.tid)
    BPU->>FTQ: commitTarget / finishTarget / squashAfter (tid)
    BPU->>BTB: commit(pred_id, tid)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • Yakkhini
  • jensen-yan

Poem

🐇 I hopped thread by thread through queues and queues,
I tucked PCs in pockets and buffered little news.
Targets found their tid, commits fit like glue,
Fetches hum in burrows, each rabbit knows what to do.
A tiny hop—per-thread rhythms, fresh and new.

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage — ⚠️ Warning. Docstring coverage is 19.81%, below the required 80.00% threshold. Resolution: write docstrings for the functions that are missing them.
  • Title check — ❓ Inconclusive. The title 'Kmh smt' is vague and non-descriptive; its abbreviations convey little to a team member scanning history. Resolution: replace it with a descriptive title such as 'Add SMT scheduler implementations (ICount, DelayedICount, MultiPriority)' or 'Refactor fetch to per-thread state with SMT schedulers'.

✅ Passed checks (1 passed)

  • Description Check — ✅ Passed. Check skipped: CodeRabbit's high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/cpu/pred/btb/common.hh (1)

452-484: Initialize the new tid member in the default constructor.

The newly added tid field is not initialized in the default constructor, which could lead to undefined behavior if read before being explicitly set. Other members like bbStart are initialized to 0.

🛡️ Proposed fix
     FullBTBPrediction() :
+        tid(0),
         bbStart(0),
         btbEntries(),
src/cpu/o3/fetch.cc (1)

632-638: Pass the tid parameter to per-thread FTQ accessors.

The methods ftqHasFetching() and ftqFetchingTarget() in decoupled_bpred.hh require a ThreadID tid parameter (lines 404-406), but these calls omit it. The variable tid is available in all these contexts, so update the calls to pass it:

dbpbtb->ftqHasFetching(tid)
dbpbtb->ftqFetchingTarget(tid)

This same pattern appears in multiple locations: line 632-633, 785-786, 1760, and 2054 in fetch.cc, as well as line 971 in fetch.hh.

🤖 Fix all issues with AI agents
In `@src/cpu/pred/btb/decoupled_bpred.cc`:
- Line 381: Remove the stray extra semicolon after the assignment to s0PC:
change threads[tid].finalPred.getTarget(predictWidth);; to the single-semicolon
form threads[tid].finalPred.getTarget(predictWidth); where s0PC is assigned
(references: s0PC, threads, finalPred, getTarget, predictWidth), and rebuild to
confirm no warnings.
- Line 757: The entry's thread id is incorrectly hardcoded to 0; change the
assignment from setting entry.tid = 0 to using the function parameter (entry.tid
= tid) so the entry records the correct thread; this ensures subsequent calls
that use entry.tid (fillAheadPipeline and updateHistoryForPrediction) access the
correct per-thread state rather than always thread 0.

In `@src/cpu/pred/btb/decoupled_bpred.hh`:
- Line 150: scheduleThread() currently always returns 0 which forces
single-threaded behavior; update the scheduleThread implementation (declared in
decoupled_bpred.hh) to compute and return the correct ThreadID based on the
branch predictor's thread scheduling state (e.g., using the class's
last-scheduled thread counter, a thread queue, or the CPU/ThreadContext APIs
already used elsewhere in decoupled_bpred.cc TODO), making it multi-thread aware
and thread-safe rather than hard-coding 0.

In `@src/cpu/pred/btb/ftq.cc`:
- Around line 40-45: Add a guard/assertion in FetchTargetQueue::commitTarget to
ensure you never call queue[tid].cap.pop_front() on an empty deque: check that
queue[tid].cap is not empty (e.g., assert(!queue[tid].cap.empty()) or an if with
an error/log) before calling pop_front() and only increment
queue[tid].baseTargetId when the pop actually happens; reference the function
FetchTargetQueue::commitTarget and the members queue[tid].cap and
queue[tid].baseTargetId when adding this check.

In `@src/cpu/pred/btb/ftq.hh`:
- Line 46: The backId(ThreadID tid) method can underflow when
queue[tid].cap.size() is 0 because it computes baseTargetId + 0 - 1; update
backId in src/cpu/pred/btb/ftq.hh to guard against empty queues by checking
queue[tid].cap.empty(): either assert/throw a clear precondition (e.g.,
assert(!queue[tid].cap.empty()) or throw std::out_of_range) or return a defined
invalid/sentinel FetchTargetId value instead; reference the backId function,
queue[tid].baseTargetId and queue[tid].cap.size()/cap.empty() when making the
change.
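Taken together, the two FTQ guards requested above (commitTarget and backId) might look like the following sketch. The member names (queue[tid].cap, baseTargetId) mirror the review prompts, but the class itself and the InvalidTargetId sentinel are hypothetical, not the actual gem5 implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <vector>

using ThreadID = int16_t;
using FetchTargetId = uint64_t;
constexpr FetchTargetId InvalidTargetId = 0;  // assumed sentinel value

struct Target { FetchTargetId id; };

class FtqSketch
{
    struct PerThread {
        std::deque<Target> cap;
        FetchTargetId baseTargetId = 1;
    };
    std::vector<PerThread> queue;

  public:
    explicit FtqSketch(int numThreads) : queue(numThreads) {}

    void insert(ThreadID tid, Target t) { queue[tid].cap.push_back(t); }

    bool empty(ThreadID tid) const { return queue[tid].cap.empty(); }

    // commitTarget: only pop and advance baseTargetId when the deque is
    // non-empty; pop_front() on an empty std::deque is undefined behavior.
    void commitTarget(ThreadID tid)
    {
        assert(!queue[tid].cap.empty());
        queue[tid].cap.pop_front();
        queue[tid].baseTargetId++;
    }

    // backId: guard the baseTargetId + size() - 1 computation so an empty
    // queue yields a defined sentinel instead of underflowing.
    FetchTargetId backId(ThreadID tid) const
    {
        if (queue[tid].cap.empty())
            return InvalidTargetId;
        return queue[tid].baseTargetId + queue[tid].cap.size() - 1;
    }
};
```

Callers that currently treat backId() as an implicit emptiness probe would instead compare against the sentinel (or call empty(tid) directly) before touching tail accessors.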
🧹 Nitpick comments (3)
src/cpu/pred/btb/ftq.hh (1)

62-68: Consider renaming anyEmpty() to allEmpty() for clarity.

The current implementation returns true only when ALL queues are empty (i.e., no queue has any targets). The name anyEmpty() suggests it would return true if ANY single queue is empty, which is the opposite behavior. Consider renaming to allEmpty() to match the semantics.

src/cpu/pred/btb/decoupled_bpred.hh (1)

70-78: Hardcoded numThreads = 2 should be parameterized.

The number of threads should be configurable via params rather than hardcoded, especially since MaxThreads is defined as a limit. Also consider removing the commented-out code rather than leaving it in the codebase.

♻️ Suggested change
-    // unsigned fetchTargetQueueSize;
-    // std::deque<FetchTarget> fetchTargetQueue;
-    // FetchTargetId fetchTargetBaseId{1}; // ID of fetchTargetQueue.front()
-    // FetchTargetId ftqId{1}; // next FSQ id to allocate (monotonic)
-    // FetchTargetId fetchHeadFtqId{1}; // next FSQ id to be consumed by fetch

     CPU *cpu;

-    const int numThreads = 2;
+    const int numThreads;  // Set from params in constructor
src/cpu/pred/btb/decoupled_bpred.cc (1)

27-30: Hardcoded thread ID needs to be addressed for SMT support.

The tid = 0 is hardcoded with a TODO comment. For proper SMT support, this function should receive ThreadID as a parameter, similar to other refactored functions in this PR.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/cpu/pred/btb/common.hh (1)

452-485: ⚠️ Potential issue | 🟠 Major

Initialize FullBTBPrediction::tid to avoid indeterminate values.

Line 454 adds tid but the constructor doesn’t set it, so it can carry garbage until assigned. This is risky if any code reads it before initialization.

🛠 Proposed fix
-    FullBTBPrediction() :
-        bbStart(0),
+    FullBTBPrediction() :
+        tid(InvalidThreadID),
+        bbStart(0),
         btbEntries(),
         condTakens(),
         indirectTargets(),
         returnTarget(0),
         tageInfoForMgscs(),
         predSource(0),
         predTick(0),
         s1Source(-1),
         s3Source(-1) {}
🤖 Fix all issues with AI agents
In `@src/cpu/o3/fetch.cc`:
- Around line 783-787: The code computes ThreadID tid from inst->threadNumber
but then uses a hard-coded thread 0 for FTQ operations; change calls to use tid
instead of 0 so SMT threads access the correct FTQ: replace
dbpbtb->ftqHasFetching(0) with dbpbtb->ftqHasFetching(tid) and
dbpbtb->ftqFetchingTarget(0) with dbpbtb->ftqFetchingTarget(tid) (ensure tid is
the same ThreadID variable computed earlier).

In `@src/cpu/o3/fetch.hh`:
- Around line 969-972: The ftqEmpty() helper currently calls
dbpbtb->ftqHasFetching(0) with a hard-coded thread 0 which breaks SMT; change
ftqEmpty to accept a ThreadID (or tid) parameter and forward that to
dbpbtb->ftqHasFetching(tid) so the check is per-thread, and then update all call
sites in fetch.cc that call ftqEmpty() to pass the current thread id (e.g., the
local/current tid variable used in fetch code) so SMT fetch uses the correct FTQ
state.

In `@src/cpu/pred/btb/decoupled_bpred.cc`:
- Line 50: The FTQ is being constructed with a hardcoded thread count (ftq(2,
p.ftq_size)); change this to use the configured thread count variable
(numThreads) instead so the FTQ is initialized consistently (e.g.,
ftq(numThreads, p.ftq_size)). Locate the FTQ construction in the decoupled_bpred
constructor/initializer list where ftq(...) is used and replace the literal 2
with the numThreads identifier so the FTQ respects the current thread
configuration.
- Around line 635-640: The method DecoupledBPUWithBTB::blockPredictionOnce
currently hardcodes threads[0].blockPredictionPending = true which breaks SMT;
change the API to accept a ThreadID parameter (e.g., ThreadID tid) and set
threads[tid].blockPredictionPending = true, validating tid is in range, or
alternatively iterate all threads and set threads[i].blockPredictionPending =
true to block everyone; update all call sites of blockPredictionOnce() to pass
the ThreadID (or rely on the new all-threads behavior) and add bounds checks
where appropriate.

In `@src/cpu/pred/btb/ftq.hh`:
- Around line 62-68: The method anyEmpty() currently returns true only when
every queue is empty (it checks queue[i].cap.empty() for all threads), so rename
the function to allEmpty() to match its behavior (and update all callers/tests),
or alternatively change its implementation to return true as soon as any
queue[i].cap.empty() is found if you prefer the original anyEmpty name; locate
the symbol anyEmpty(), the loop over numThreads, and the queue[].cap.empty()
check to make the corrective change and adjust usages accordingly.
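The thread-aware blockPredictionOnce() change requested above can be sketched as follows. ThreadState and BpuSketch are hypothetical stand-ins for the per-thread predictor state in decoupled_bpred; only the blockPredictionPending member name follows the review comment.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using ThreadID = int16_t;

struct ThreadState { bool blockPredictionPending = false; };

struct BpuSketch
{
    std::vector<ThreadState> threads;

    explicit BpuSketch(int numThreads) : threads(numThreads) {}

    // Block only the offending thread instead of hardcoding threads[0],
    // validating that tid is in range first.
    void blockPredictionOnce(ThreadID tid)
    {
        assert(tid >= 0 && tid < static_cast<ThreadID>(threads.size()));
        threads[tid].blockPredictionPending = true;
    }
};
```

If the design instead intends global backpressure, the loop variant (setting the flag on every thread) is the alternative the comment mentions; either way, call sites need to pass the originating ThreadID.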
🧹 Nitpick comments (5)
src/cpu/pred/btb/decoupled_bpred_stats.cc (2)

801-808: Avoid copying FetchTarget here; use a const reference.

ftq.get(...) returns a FetchTarget. Copying it can be expensive and can desync if callers expect the live entry.

♻️ Proposed refactor
-    auto entry = ftq.get(inst->ftqId, inst->threadNumber);
+    const auto &entry = ftq.get(inst->ftqId, inst->threadNumber);

913-924: Reduce double lookup by caching the FTQ entry reference.

You call ftq.get(...) twice; caching it avoids repeated lookups and keeps logging consistent with the increment.

♻️ Proposed refactor
-    ftq.get(inst->ftqId, inst->threadNumber).commitInstNum++;
+    auto &entry = ftq.get(inst->ftqId, inst->threadNumber);
+    entry.commitInstNum++;
     ...
-            ftq.get(inst->ftqId, inst->threadNumber).commitInstNum);
+            entry.commitInstNum);
src/cpu/pred/btb/ftq.hh (1)

45-61: Missing bounds validation on tid parameter.

All accessor methods use tid to index queue[tid] without validation. If tid >= numThreads, the code accesses uninitialized state (or causes out-of-bounds access if tid >= MaxThreads). Consider adding debug assertions.

🛡️ Proposed fix example
     inline FetchTargetId frontId(ThreadID tid) const {
+        assert(tid < numThreads && "Invalid thread ID");
         return queue[tid].baseTargetId;
     }

Apply similar checks to other methods or create a helper macro/inline function.

src/cpu/pred/btb/decoupled_bpred.cc (1)

24-31: Hardcoded tid = 0 limits multi-threading support.

The function doesn't accept ThreadID as a parameter and uses a hardcoded value. This should be addressed when completing SMT support.

♻️ Suggested signature change
 void
-DecoupledBPUWithBTB::consumeFetchTarget(unsigned fetched_inst_num)
+DecoupledBPUWithBTB::consumeFetchTarget(unsigned fetched_inst_num, ThreadID tid)
 {
-    ThreadID tid = 0; // TODO: multi-threading
     ftq.fetching(tid).fetchInstNum = fetched_inst_num;
     ftq.finishTarget(tid);
 }
src/cpu/pred/btb/decoupled_bpred.hh (1)

78-78: Hardcoded numThreads should be configurable.

numThreads is hardcoded to 2 instead of being derived from simulation parameters. This limits flexibility and could cause inconsistencies if changed in one place but not others (e.g., ftq(2, ...) in the constructor).

♻️ Consider parameterizing
-    const int numThreads = 2;
+    const int numThreads;  // Initialize from params in constructor

Then in constructor initialization list:

numThreads(p.numThreads),

@github-actions

🚀 Coremark Smoke Test Results

Branch IPC Change
Base (xs-dev) 2.1727 -
This PR 2.1907 📈 +0.0181 (+0.83%)

✅ Difftest smoke test passed!


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@src/cpu/o3/fetch.cc`:
- Around line 459-461: The BPU seeding uses thread 0's PC only
(dbpbtb->resetPC(threads[0].fetchpc->instAddr())), which mis-initializes SMT
threads with different start PCs; update the reset logic to either add/use a
per-thread reset API (e.g., a resetPC method that accepts a thread id or
per-thread predictor instance) or iterate over all threads and call
dbpbtb->resetPC(threads[i].fetchpc->instAddr()) for each thread so every
thread's fetchpc is used to seed the predictor.
- Around line 1537-1545: handleIEWSignals is using iewInfo[0] and passing a
hardcoded 0 to dbpbtb instead of the current thread; update the resolve-update
path to accept and use the ThreadID from checkSignalsAndUpdate(ThreadID tid) by
adding a tid parameter to handleIEWSignals, read resolved CFIs from
fromIEW->iewInfo[tid].resolvedCFIs, and replace the hardcoded 0 with tid in the
dbpbtb calls (prepareResolveUpdateEntries, markCFIResolved, resolveUpdate); also
consider adding ThreadID into ResolvedCFIEntry if you need to preserve thread
context per-entry.

In `@src/cpu/pred/btb/decoupled_bpred.cc`:
- Around line 303-305: The current check uses ftq.backId(tid) and ftq.back(tid)
which underflow/are invalid when the FTQ is empty; change the guard to test
emptiness (e.g. !ftq.empty(tid) or ftq.size(tid) > 0) before calling
ftq.backId/tid or ftq.back(tid). Update the conditional around abtb->isEnabled()
to first ensure the FTQ has entries, then obtain previous_block_startpc from
ftq.back(tid) and call abtb->updateUsingS3Pred(...).
🧹 Nitpick comments (1)
src/cpu/o3/fetch.hh (1)

851-875: Clarify ownership of FetchBuffer::data to avoid leaks.

data is allocated with new[] in Fetch but there’s no corresponding cleanup. Consider explicit deletion in Fetch::~Fetch() (or switch to RAII).

🧹 Suggested fix (explicit cleanup in Fetch::~Fetch)
-Fetch::~Fetch() = default;
+Fetch::~Fetch()
+{
+    for (ThreadID tid = 0; tid < numThreads; ++tid) {
+        delete[] threads[tid].data;
+        threads[tid].data = nullptr;
+    }
+}

@github-actions

github-actions bot commented Feb 3, 2026

🚀 Coremark Smoke Test Results

Branch IPC Change
Base (xs-dev) 2.1727 -
This PR 2.1907 📈 +0.0181 (+0.83%)

✅ Difftest smoke test passed!

@tastynoob tastynoob added the perf label Feb 3, 2026
@github-actions

github-actions bot commented Feb 3, 2026

🚀 Performance test triggered: spec06-0.8c

@XiangShanRobot

[Generated by GEM5 Performance Robot]
commit: 7fc14cb
workflow: On-Demand SPEC Test (Tier 1.5)

Ideal BTB Performance

Overall Score

PR Master Diff(%)
Score 20.27 20.37 -0.48 🔴

jensen-yan
jensen-yan previously approved these changes Feb 6, 2026
Change-Id: Ie21dbd89df9c1d5d80106d71ae4f9b51415be897
@github-actions

github-actions bot commented Mar 9, 2026

🚀 Coremark Smoke Test Results

Branch IPC Change
Base (xs-dev) 2.2354 -
This PR 2.2354 ➡️ 0.0000 (0.00%)

✅ Difftest smoke test passed!


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (9)
src/cpu/pred/btb/decoupled_bpred.cc (3)

303-305: ⚠️ Potential issue | 🟠 Major

Don't use backId()/back() as the empty-queue check.

backId() is derived from cap.size() - 1, so on an empty FTQ this condition underflows and ftq.back(tid) becomes invalid. Guard with !ftq.empty(tid) before touching tail accessors.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cpu/pred/btb/decoupled_bpred.cc` around lines 303 - 305, The code
currently checks ftq.backId(tid) which underflows on an empty FTQ; replace that
guard with a proper non-empty check (use !ftq.empty(tid)) before accessing
ftq.back(tid). Specifically, in the block where you call abtb->isEnabled() and
then use ftq.back(tid).startPC to call
abtb->updateUsingS3Pred(predsOfEachStage[numStages - 1],
previous_block_startpc), add a check for !ftq.empty(tid) (or combine it with
abtb->isEnabled()) so you never call backId() or back() on an empty queue.

624-638: ⚠️ Potential issue | 🟠 Major

Resolve backpressure still blocks only thread 0.

A resolve failure on thread 1 will set threads[0].blockPredictionPending, so the offending thread keeps predicting while another thread gets stalled. blockPredictionOnce() needs thread context, or it needs to block all threads intentionally.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cpu/pred/btb/decoupled_bpred.cc` around lines 624 - 638,
notifyResolveFailure currently always sets threads[0].blockPredictionPending in
blockPredictionOnce, which incorrectly blocks only thread 0; modify
notifyResolveFailure and/or blockPredictionOnce to take the originating thread
id (or otherwise use the caller's thread context) and set
threads[tid].blockPredictionPending for that thread instead of threads[0], or if
the design intends global backpressure, explicitly set blockPredictionPending on
all threads; update the signatures of DecoupledBPUWithBTB::blockPredictionOnce
(and callers) to accept an int thread id (or iterate all threads) and apply the
pending flag to the correct thread(s).

48-48: ⚠️ Potential issue | 🟠 Major

FTQ is still constructed for exactly 2 threads.

Even if the CPU is configured with a different thread count, the predictor-side FTQ only allocates/schedules two lanes here. That will misroute fetch targets and break the rest of the per-thread refactor.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cpu/pred/btb/decoupled_bpred.cc` at line 48, The FTQ is being constructed
with a hardcoded lane count (ftq(2, p.ftq_size)) which forces exactly two
threads; replace the literal 2 with the configured thread count so the
predictor-side FTQ matches the CPU thread configuration (use the parameter that
holds the CPU thread count, e.g., p.numThreads or the appropriate p.* field that
represents thread count) when constructing ftq in decoupled_bpred so per-thread
fetch lanes are allocated correctly.
src/cpu/o3/fetch.cc (3)

453-455: ⚠️ Potential issue | 🟠 Major

Predictor reset is still seeded from thread 0 only.

This path initializes the BPU with threads[0].fetchpc, and the current resetPC() implementation broadcasts that PC to every predictor thread. If SMT threads start from different PCs, non-zero threads begin with the wrong predictor state.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cpu/o3/fetch.cc` around lines 453 - 455, The predictor is being seeded
only from thread 0 via dbpbtb->resetPC(threads[0].fetchpc->instAddr()), which
misinitializes SMT threads that start at different PCs; update the reset logic
to seed each predictor thread from its own fetch PC by iterating over the
threads array and calling resetPC with threads[i].fetchpc->instAddr() for each
thread (or modify dbpbtb->resetPC to accept a thread index/PC and invoke it per
thread) so every predictor thread gets the correct initial PC.

1418-1418: ⚠️ Potential issue | 🟠 Major

Resolve-stage training still runs against thread 0.

checkSignalsAndUpdate(tid) is per-thread, but handleIEWSignals() ignores tid, consumes a single iewInfo lane, and forwards 0 into prepareResolveUpdateEntries(), markCFIResolved(), and resolveUpdate(). That merges resolved CFIs into the wrong queue and trains the wrong predictor state under SMT.

Also applies to: 1453-1506

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cpu/o3/fetch.cc` at line 1418, handleIEWSignals() currently ignores the
thread id and consumes a single iewInfo lane, causing resolve-stage training to
always use thread 0; update it to operate per-thread: change handleIEWSignals()
to accept a tid (or iterate over threads) and call checkSignalsAndUpdate(tid)
and then forward the same tid into prepareResolveUpdateEntries(tid),
markCFIResolved(tid) and resolveUpdate(tid) (or use per-thread variants) so each
SMT thread uses its own iewInfo lane and queues, ensuring resolved CFIs and
predictor training are applied to the correct thread rather than merged into
thread 0.

621-623: ⚠️ Potential issue | 🟠 Major

Hardcoded thread 0 still leaks into FTQ access.

These call sites still validate/fetch the FTQ stream and stamp DynInst::ftqId with thread 0 instead of the current tid. In SMT, that can assert on the wrong thread, compare a fetch buffer against the wrong stream, and later commit/squash against another thread's FTQ entry.

Also applies to: 758-759, 1660-1661

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cpu/o3/fetch.cc` around lines 621 - 623, The FTQ accesses use a hardcoded
thread index (0) causing wrong-thread FTQ validation and ftqId stamping; replace
all uses of dbpbtb->ftqHasFetching(0) and dbpbtb->ftqFetchingTarget(0) (e.g.,
the checks around threads[tid].valid and threads[tid].startPC) with
dbpbtb->ftqHasFetching(tid) and dbpbtb->ftqFetchingTarget(tid), and ensure any
DynInst::ftqId or similar FTQ stamping uses the current tid rather than 0; apply
the same fix to the other occurrences noted (around the regions corresponding to
lines 758-759 and 1660-1661).
src/cpu/pred/btb/decoupled_bpred.hh (2)

78-78: ⚠️ Potential issue | 🟠 Major

Don't hardcode the predictor to 2 threads.

numThreads now drives per-thread initialization, tick(), and resetPC(), so any run with params.numThreads != 2 will leave predictor state out of sync with fetch/FTQ state. This needs to come from the configured thread count, not a literal.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cpu/pred/btb/decoupled_bpred.hh` at line 78, The predictor currently
hardcodes "const int numThreads = 2;" which desynchronizes per-thread
init/tick/resetPC when params.numThreads != 2; replace the literal with the
configured thread count (use params.numThreads / params->numThreads consistent
with the class constructor) and propagate that value to all places that rely on
numThreads (per-thread initialization in the DecoupledBPred constructor, tick(),
and resetPC()) so the predictor uses the real configured thread count.

147-147: ⚠️ Potential issue | 🟠 Major

scheduleThread() still forces thread 0.

Only thread 0 can request fresh predictions with this implementation, so other threads will stop refilling once their current FTQ work is consumed. That breaks SMT correctness and can starve non-zero threads.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cpu/pred/btb/decoupled_bpred.hh` at line 147, scheduleThread() currently
always returns ThreadID 0, preventing non-zero threads from requesting new
predictions and starving SMT threads; change scheduleThread() to select an
appropriate runnable thread instead of hardcoding 0—for example implement a
round-robin or priority chooser that checks per-thread FTQ/refill need and
returns the first ThreadID with work to refill (or uses the existing thread
selection helper if available). Update scheduleThread() to iterate over all
thread indices (or call the thread scheduler helper) and return the chosen
ThreadID so all threads can request fresh predictions rather than always
returning 0.
src/cpu/pred/btb/ftq.cc (1)

41-44: ⚠️ Potential issue | 🟡 Minor

Guard commitTarget() against empty queues.

pop_front() on an empty deque is undefined behavior. Even if current callers usually check first, this queue primitive should assert or bail out before advancing baseTargetId.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cpu/pred/btb/ftq.cc` around lines 41 - 44, The commitTarget() method
(FetchTargetQueue::commitTarget) must guard against empty per-thread queues
before calling queue[tid].cap.pop_front() and incrementing
queue[tid].baseTargetId; add an explicit check/assert for queue[tid].cap.empty()
and either assert/fail early or return without modifying baseTargetId when the
deque is empty so pop_front() is never invoked on an empty container.
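The round-robin scheduleThread() fix requested in the decoupled_bpred.hh comments above could be sketched like this. SchedulerSketch and the needsRefill predicate are hypothetical: needsRefill stands in for whatever per-thread condition (FTQ not full, prediction not blocked) the real predictor would check before issuing a new prediction request.

```cpp
#include <cstdint>
#include <vector>

using ThreadID = int16_t;
constexpr ThreadID InvalidThreadID = -1;

class SchedulerSketch
{
    std::vector<bool> needsRefill;   // per-thread "can accept a prediction"
    ThreadID lastScheduled = -1;     // round-robin cursor

  public:
    explicit SchedulerSketch(int numThreads) : needsRefill(numThreads, true) {}

    void setNeedsRefill(ThreadID tid, bool v) { needsRefill[tid] = v; }

    // Rotate through threads starting one past the last thread scheduled
    // and return the first that can accept a new prediction, or the
    // invalid sentinel if no thread needs one this cycle.
    ThreadID scheduleThread()
    {
        const int n = static_cast<int>(needsRefill.size());
        for (int i = 1; i <= n; ++i) {
            ThreadID tid = static_cast<ThreadID>((lastScheduled + i) % n);
            if (needsRefill[tid]) {
                lastScheduled = tid;
                return tid;
            }
        }
        return InvalidThreadID;
    }
};
```

Because the scan starts after the last-scheduled thread, no runnable thread can be starved: every thread whose FTQ has room is visited within one full rotation, which addresses the non-zero-thread starvation the review flags.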
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/cpu/pred/btb/decoupled_bpred.cc`:
- Around line 379-384: The prediction trace is using ftq.backId(tid) before
ftq.insert(entry) so it records the previous tail (or underflows on empty); move
the predTraceManager->write_record(PredictionTrace(...)) call to after
ftq.insert(entry) or else compute the new FTQ id explicitly (e.g., call
ftq.insert(entry) or ftq.push and use the returned/new id) and then call
predTraceManager->write_record with that id; update the code around
enablePredFSQTrace, predTraceManager->write_record, PredictionTrace, and
ftq.insert to ensure the trace uses the post-insert FTQ id.

---

Duplicate comments:
In `@src/cpu/o3/fetch.cc`:
- Around line 453-455: The predictor is being seeded only from thread 0 via
dbpbtb->resetPC(threads[0].fetchpc->instAddr()), which misinitializes SMT
threads that start at different PCs; update the reset logic to seed each
predictor thread from its own fetch PC by iterating over the threads array and
calling resetPC with threads[i].fetchpc->instAddr() for each thread (or modify
dbpbtb->resetPC to accept a thread index/PC and invoke it per thread) so every
predictor thread gets the correct initial PC.
- Line 1418: handleIEWSignals() currently ignores the thread id and consumes a
single iewInfo lane, causing resolve-stage training to always use thread 0;
update it to operate per-thread: change handleIEWSignals() to accept a tid (or
iterate over threads) and call checkSignalsAndUpdate(tid) and then forward the
same tid into prepareResolveUpdateEntries(tid), markCFIResolved(tid) and
resolveUpdate(tid) (or use per-thread variants) so each SMT thread uses its own
iewInfo lane and queues, ensuring resolved CFIs and predictor training are
applied to the correct thread rather than merged into thread 0.
- Around line 621-623: The FTQ accesses use a hardcoded thread index (0) causing
wrong-thread FTQ validation and ftqId stamping; replace all uses of
dbpbtb->ftqHasFetching(0) and dbpbtb->ftqFetchingTarget(0) (e.g., the checks
around threads[tid].valid and threads[tid].startPC) with
dbpbtb->ftqHasFetching(tid) and dbpbtb->ftqFetchingTarget(tid), and ensure any
DynInst::ftqId or similar FTQ stamping uses the current tid rather than 0; apply
the same fix to the other occurrences noted (around the regions corresponding to
lines 758-759 and 1660-1661).
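For the first finding above (per-thread predictor seeding), a minimal sketch of iterating the threads array; `Predictor`, `ThreadState`, and the per-thread `resetPC` overload are illustrative stand-ins for the gem5 types, not the actual API.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative shapes standing in for the gem5 types involved.
struct ThreadState { uint64_t fetchPC; };
struct Predictor {
    std::vector<uint64_t> startPC;
    // A per-thread resetPC overload, assumed to exist after the fix.
    void resetPC(int tid, uint64_t pc) { startPC[tid] = pc; }
};

// Seed every predictor thread from its own fetch PC, not thread 0's.
void seedPredictor(Predictor &bp, const std::vector<ThreadState> &threads)
{
    for (int tid = 0; tid < static_cast<int>(threads.size()); ++tid)
        bp.resetPC(tid, threads[tid].fetchPC);
}
```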

In `@src/cpu/pred/btb/decoupled_bpred.cc`:
- Around line 303-305: The code currently checks ftq.backId(tid) which
underflows on an empty FTQ; replace that guard with a proper non-empty check
(use !ftq.empty(tid)) before accessing ftq.back(tid). Specifically, in the block
where you call abtb->isEnabled() and then use ftq.back(tid).startPC to call
abtb->updateUsingS3Pred(predsOfEachStage[numStages - 1],
previous_block_startpc), add a check for !ftq.empty(tid) (or combine it with
abtb->isEnabled()) so you never call backId() or back() on an empty queue.
- Around line 624-638: notifyResolveFailure currently always sets
threads[0].blockPredictionPending in blockPredictionOnce, which incorrectly
blocks only thread 0; modify notifyResolveFailure and/or blockPredictionOnce to
take the originating thread id (or otherwise use the caller's thread context)
and set threads[tid].blockPredictionPending for that thread instead of
threads[0], or if the design intends global backpressure, explicitly set
blockPredictionPending on all threads; update the signatures of
DecoupledBPUWithBTB::blockPredictionOnce (and callers) to accept an int thread
id (or iterate all threads) and apply the pending flag to the correct thread(s).
- Line 48: The FTQ is being constructed with a hardcoded lane count (ftq(2,
p.ftq_size)) which forces exactly two threads; replace the literal 2 with the
configured thread count so the predictor-side FTQ matches the CPU thread
configuration (use the parameter that holds the CPU thread count, e.g.,
p.numThreads or the appropriate p.* field that represents thread count) when
constructing ftq in decoupled_bpred so per-thread fetch lanes are allocated
correctly.
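For the `notifyResolveFailure` finding, a sketch of routing the originating tid into the backpressure flag rather than always flagging thread 0; all names here are illustrative stand-ins for the real members.

```cpp
#include <cassert>
#include <vector>

// Sketch: the resolve-failure path carries the thread id through to the
// per-thread pending flag instead of hardcoding index 0.
struct Bpu {
    std::vector<bool> blockPredictionPending;
    void blockPredictionOnce(int tid) { blockPredictionPending[tid] = true; }
    void notifyResolveFailure(int tid) { blockPredictionOnce(tid); }
};
```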

In `@src/cpu/pred/btb/decoupled_bpred.hh`:
- Line 78: The predictor currently hardcodes "const int numThreads = 2;" which
desynchronizes per-thread init/tick/resetPC when params.numThreads != 2; replace
the literal with the configured thread count (use params.numThreads /
params->numThreads consistent with the class constructor) and propagate that
value to all places that rely on numThreads (per-thread initialization in the
DecoupledBPred constructor, tick(), and resetPC()) so the predictor uses the
real configured thread count.
- Line 147: scheduleThread() currently always returns ThreadID 0, preventing
non-zero threads from requesting new predictions and starving SMT threads;
change scheduleThread() to select an appropriate runnable thread instead of
hardcoding 0—for example implement a round-robin or priority chooser that checks
per-thread FTQ/refill need and returns the first ThreadID with work to refill
(or uses the existing thread selection helper if available). Update
scheduleThread() to iterate over all thread indices (or call the thread
scheduler helper) and return the chosen ThreadID so all threads can request
fresh predictions rather than always returning 0.
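A round-robin sketch of the `scheduleThread()` suggestion above; `needRefill` is an assumed stand-in for a real per-thread FTQ fullness check, and `-1` stands in for an invalid-thread sentinel.

```cpp
#include <vector>

// Round-robin chooser: return the first thread, starting after the last one
// served, whose FTQ lane still needs a refill; -1 means no thread has work.
struct Sched {
    int numThreads = 0;
    int lastTid = 0;
    std::vector<bool> needRefill;   // stands in for per-thread FTQ state

    int scheduleThread() {
        for (int i = 1; i <= numThreads; ++i) {
            int tid = (lastTid + i) % numThreads;
            if (needRefill[tid]) { lastTid = tid; return tid; }
        }
        return -1;                  // nothing to refill this cycle
    }
};
```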

In `@src/cpu/pred/btb/ftq.cc`:
- Around line 41-44: The commitTarget() method (FetchTargetQueue::commitTarget)
must guard against empty per-thread queues before calling
queue[tid].cap.pop_front() and incrementing queue[tid].baseTargetId; add an
explicit check/assert for queue[tid].cap.empty() and either assert/fail early or
return without modifying baseTargetId when the deque is empty so pop_front() is
never invoked on an empty container.
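The guarded `commitTarget()` above can be sketched as follows; `FtqLane` is an illustrative stand-in for the real per-thread queue state.

```cpp
#include <cstdint>
#include <deque>

// Sketch of a guarded commit: refuse to pop an empty per-thread deque,
// so baseTargetId is never advanced past entries that were never there.
struct FtqLane { uint64_t baseTargetId = 0; std::deque<int> cap; };

bool commitTarget(FtqLane &lane)
{
    if (lane.cap.empty())
        return false;   // or assert(!lane.cap.empty()) to fail fast instead
    lane.cap.pop_front();
    lane.baseTargetId++;
    return true;
}
```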

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: feb473fe-f345-495b-90c6-84a1b6e0520c

📥 Commits

Reviewing files that changed from the base of the PR and between 7fc14cb and a0be485.

📒 Files selected for processing (10)
  • src/cpu/o3/fetch.cc
  • src/cpu/o3/fetch.hh
  • src/cpu/o3/trace/TraceFetch.cc
  • src/cpu/pred/SConscript
  • src/cpu/pred/btb/common.hh
  • src/cpu/pred/btb/decoupled_bpred.cc
  • src/cpu/pred/btb/decoupled_bpred.hh
  • src/cpu/pred/btb/decoupled_bpred_stats.cc
  • src/cpu/pred/btb/ftq.cc
  • src/cpu/pred/btb/ftq.hh
🚧 Files skipped from review as they are similar to previous changes (4)
  • src/cpu/pred/btb/decoupled_bpred_stats.cc
  • src/cpu/pred/btb/ftq.hh
  • src/cpu/pred/btb/common.hh
  • src/cpu/pred/SConscript

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/cpu/o3/smt_sched.hh`:
- Around line 38-39: Validate and enforce the numThreads invariant in the
SMTScheduler constructor: ensure numThreads is at least 1 and no greater than
MaxThreads (the fixed allocation size used by InstsCounter and ICountScheduler).
In the SMTScheduler(int numThreads) constructor, check the incoming value and
either clamp it into the valid range or fail fast (e.g., assert/fatal) with a
clear message; update the constructor SMTScheduler and references to numThreads
so subsequent uses (getThread, ICountScheduler loops and any code assuming
numThreads) cannot walk past the fixed array or run with zero threads.
- Around line 101-119: MultiPrioritySched::getThread currently scans all
counters equally and dereferences counter[0] even if the initializer list is
empty; change it to honor the priority order (highest → lowest) by first
checking counter.empty() and handling that case (return a default ThreadID or
assert/throw), then iterate counters in priority order: for the first
(highest-priority) counter, compute the set of candidate tids with the minimum
count; for each subsequent (lower-priority) counter, filter that candidate set
by the minimum count in that counter (i.e., use it only as a tie-breaker) until
a single tid remains or all counters are used, and return the selected tid;
update getThread to reference counter.size(), numThreads, and ThreadID
accordingly and avoid unconditional counter[0] dereference.
- Around line 75-89: The constructor DelayedICountScheduler currently accepts
any int delay which can be zero or negative, causing timebuffer to be empty or
set_capacity to receive a huge unsigned value; add validation at the start of
DelayedICountScheduler(int numThreads, InstsCounter* counter, int delay) to
require delay > 0 (e.g., throw std::invalid_argument or assert) and return/fail
fast for invalid values, and when calling timebuffer.set_capacity use a safe
cast (static_cast<size_t>(delay)) to avoid unsigned wrap; additionally, add a
defensive check in getThread() that if timebuffer is empty (shouldn't happen) it
falls back to ICountScheduler::getThread() instead of calling
front()/pop_front().
- Around line 5-7: The header uses std::initializer_list<InstsCounter*> in the
SMT scheduler constructor (the constructor referenced at line 102) but doesn't
include <initializer_list>; add an explicit `#include <initializer_list>` near
the other includes at the top of src/cpu/o3/smt_sched.hh so the constructor
signature and any uses of std::initializer_list compile reliably without relying
on transitive includes.
- Around line 33-40: Change the abstract contract for SMTScheduler so getThread
is purely virtual: update the declaration of SMTScheduler::getThread in class
SMTScheduler to be a pure virtual function (i.e., use "= 0") so derived
schedulers continue to override it and the linker error is resolved; ensure no
out-of-class definition remains for getThread.
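For the `MultiPrioritySched::getThread` finding above, a self-contained sketch of priority-ordered selection with tie-breaking; plain vectors stand in for `InstsCounter` objects, and the empty-counter case is handled with an assert rather than a hidden dereference.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// The highest-priority counter selects the candidate tids with the minimal
// count; each lower-priority counter only narrows ties among those candidates.
int pickThread(const std::vector<std::vector<int>> &counters, int numThreads)
{
    assert(!counters.empty() && numThreads > 0);
    std::vector<int> cand;
    for (int tid = 0; tid < numThreads; ++tid)
        cand.push_back(tid);
    for (const auto &c : counters) {
        int best = c[cand.front()];
        for (int tid : cand)
            best = std::min(best, c[tid]);
        std::vector<int> next;
        for (int tid : cand)
            if (c[tid] == best)
                next.push_back(tid);
        cand.swap(next);
        if (cand.size() == 1)
            break;          // unique winner; lower priorities are ignored
    }
    return cand.front();    // still tied after all counters: lowest tid wins
}
```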

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a54f6e21-22d6-4a62-9e71-176d4452b535

📥 Commits

Reviewing files that changed from the base of the PR and between a0be485 and f4a84b6.

📒 Files selected for processing (1)
  • src/cpu/o3/smt_sched.hh

@github-actions

github-actions bot commented Mar 9, 2026

🚀 Coremark Smoke Test Results

| Branch | IPC | Change |
| --- | --- | --- |
| Base (xs-dev) | 2.2354 | - |
| This PR | 2.2354 | ➡️ 0.0000 (0.00%) |

✅ Difftest smoke test passed!

Change-Id: Ie567751c9af4d48d3681e30c3c6fd80ecaf190a9

@tastynoob tastynoob requested a review from jensen-yan March 12, 2026 09:28
@jensen-yan
Collaborator

I understand that this PR is still in an intermediate state rather than a finished version, so the comments below are not asking this version to immediately meet the bar for merging. Instead, I want to be explicit that, in its current state, this PR is still at the stage where SMT threading is only partially complete: it is not yet ready for a final correctness review, and it should not be used to conclude that the feature is ready.

To me, the defining issue is not "a few corners still need filling in" but that the thread dimension has not yet been closed on the critical paths of fetch and the decoupled BPU. That is, this PR has already split a fair amount of formerly single-threaded state into per-thread structures, but several core entry points and recovery paths still carry a thread-0 assumption, so the overall semantics remain half-threaded. I think such an intermediate state is an acceptable part of development, but it needs to be clearly marked WIP in the PR; otherwise, from a reviewer's perspective, it looks like "the feature is basically done, only small fixes remain," which is not actually the case.

The main blockers I currently care about:

  • The BPU side has not actually started SMT scheduling. scheduleThread() still always returns 0, so predictions are only generated for tid=0. Even though the fetch side has started trying to select a thread by FTQ, non-zero threads never receive a complete prediction stream and will functionally starve.
  • The fetch-to-decode dequeue path still only consumes tid=0's fetchQueue. Even when a non-zero thread successfully fetches instructions, those instructions are never delivered to decode. In other words, the newly introduced "fetch from other threads" path never closes the loop on its back half.
  • The fetch critical path still uses thread 0's FTQ head directly in several places. This is more serious than "scheduling not wired up yet," because it lets thread 1 read the wrong stream on some paths, or assert incorrectly when only thread 1 is valid. In other words, this is not merely unfinished wiring: cross-thread reference bugs already exist.
  • The speculative history still appears to be global and indexed only by ftqId/streamId, while the FTQ has become per-thread and each thread allocates ids starting from its own base. The direct consequence is that FTQ entries with the same id on different threads will commit / squash each other, cross-contaminating the GHR, the RAS, and the history-recovery logic. I consider this a core blocker for SMT correctness.
  • Reset / initialization semantics are also not yet threaded. It still looks like thread 0's PC is used to reset the predictor front-end state of all threads. As soon as an SMT workload has threads starting at different PCs, non-zero threads will build prediction state from the wrong context.
  • The resolve-backpressure state is still global, and the blocking logic retains an obvious tid=0 assumption. Even if this does not immediately break correctness, it will cause hard-to-explain performance interference between threads, so it is best not to leave it for last.

Taken together, my current read is that this PR is more like "stage one of splitting single-threaded decoupled frontend / BPU state toward SMT" than a complete feature PR that can be reviewed and merged on its own. My suggestion is not to treat it as a formally reviewable version until at least the following are done:

  • Actually wire up thread selection instead of pinning it to tid=0.
  • Thread the tid through the entire fetch / decode / FTQ / DynInst.ftqId chain.
  • Make the history manager genuinely free of cross-thread interference: at minimum, include thread information in the commit / squash / recover keys, or make it per-thread by construction.
  • Thread the recovery paths as well (reset, squash, resolve backpressure), not just the steady-state prediction path.
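One way to address the commit/squash key concern above is sketched here; `HistKey` and the map-based lookup are illustrative names under the assumption of a shared history manager, not the actual gem5 structures.

```cpp
#include <cstdint>
#include <map>

// Sketch: key shared history-recovery state by (tid, ftqId) instead of ftqId
// alone, so same-numbered FTQ entries on different threads cannot alias.
struct HistKey {
    int tid;
    uint64_t ftqId;
    bool operator<(const HistKey &o) const {
        return tid != o.tid ? tid < o.tid : ftqId < o.ftqId;
    }
};

int lookupHist(const std::map<HistKey, int> &hist, int tid, uint64_t ftqId)
{
    auto it = hist.find({tid, ftqId});
    return it == hist.end() ? -1 : it->second;  // -1: no entry for this key
}
```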

One more minor but worthwhile point: smt_sched.hh already contains definitions for several schedulers, but judging from the current code paths, they are not yet hooked into the fetch/BPU runtime logic. So if the PR title / commit message claims that "3 SMT schedulers have been added," I would tone the wording down a bit to avoid the impression that the scheduling-policy part is complete and usable.

In short, I am not saying the direction is wrong or that this refactoring is not worth continuing; on the contrary, I think splitting out per-thread state and extracting the FTQ from a single queue are the right moves. It is just that, judging from the current commits, this is clearly an in-development intermediate version. I suggest marking it WIP explicitly in the PR description or a comment, and tracking the unclosed thread-dimension points above as the main follow-up tasks. Doing a full review after those main paths are complete will be more effective, and will make it easier to distinguish "design not finished" from "finished but still buggy."

If I had to give a short conclusion right now: this PR demonstrates the direction of the SMT refactor, but it has not yet formed a self-consistent SMT semantic loop; what it needs most is not detail polish, but closing the thread dimension across the main paths of prediction, fetch, dequeue, history recovery, and reset.

@tastynoob tastynoob merged commit 68b5340 into xs-dev Mar 12, 2026
2 checks passed
@tastynoob tastynoob deleted the kmh-smt branch March 12, 2026 10:21
@tastynoob tastynoob restored the kmh-smt branch March 13, 2026 06:09
@coderabbitai coderabbitai bot mentioned this pull request Mar 16, 2026