docs: add task workflow spec and test cautions (#845)

klin02 · web-flow · commit 7e4f304673fc · 2026-03-31T22:32:08.000+08:00
- docs/workflow.md: new document defining plan/progress spec, execution
  practices (context recovery, askQuestions, sub-agent delegation), and
  debugging escalation order
- docs/test.md: add FPGA_SIM precautions (residual process check, log
  preservation), reference DB cross-comparison workflow (hex conversion,
  ATTACH queries), and phased verification strategy (test ladder, per-phase
  gating)
- docs/README.md: add workflow.md to document index and reading order
- AGENTS.md: reference workflow.md for complex tasks
diff --git a/AGENTS.md b/AGENTS.md
@@ -1,3 +1,5 @@
 # Difftest Working Guidelines
 
 Before working under `difftest/`, determine whether the task is a potentially complex, multi-file change or requires iterative testing/debugging. If so, review the relevant files in `difftest/docs/` as needed before proceeding.
+
+For complex tasks, follow the plan/progress workflow defined in [`difftest/docs/workflow.md`](docs/workflow.md): create a plan, create a progress log, execute in phases, and use `askQuestions` to resolve ambiguities and to confirm next steps at the end of each conversation.
diff --git a/docs/README.md b/docs/README.md
@@ -7,11 +7,13 @@
 | [layout.md](./layout.md) | Project directory structure, key files, modification guide |
 | [hw-flow.md](./hw-flow.md) | Hardware transport pipeline: Preprocess → Squash → Delta -> Batch → Sink |
 | [sw-check.md](./sw-check.md) | Software checking flow: difftest_step, checkers, reference model, DPI-C |
-| [test.md](./test.md) | Build / run / debug commands: EMU, simv, FPGA Sim, and debug workflow |
+| [test.md](./test.md) | Build / run / debug commands: EMU, simv, FPGA Sim, reference DB comparison, phased verification |
+| [workflow.md](./workflow.md) | Task workflow: plan/progress specs, execution practices, sub-agent delegation, debugging escalation |
 
 ## Recommended Reading Order
 
 1. [layout.md](./layout.md) — Understand the overall project structure
 2. [hw-flow.md](./hw-flow.md) — Understand the hardware-side data flow
 3. [sw-check.md](./sw-check.md) — Understand the software-side checking logic
 4. [test.md](./test.md) — Reference when building, running, or debugging
+5. [workflow.md](./workflow.md) — Follow when executing complex multi-phase tasks
diff --git a/docs/test.md b/docs/test.md
@@ -180,6 +180,18 @@ bash difftest/scripts/fpga_sim/cosim.sh WORKLOAD=$WORKLOAD DIFF=$REF_SO WAVE=1
 | `DIFF=PATH` | Reference SO (required) |
 | `WAVE=1` | Enable waveform dump |
 
+#### Precautions
+
+- **Residual process check**: before every FPGA Sim run, check for leftover shared memory and processes:
+  ```bash
+  lsof /dev/shm/xdma_sim* 2>/dev/null
+  ```
+  If there is output, kill the listed PIDs and remove stale files (`rm -f /dev/shm/xdma_sim*`) before proceeding. Stale processes can cause the next run to hang or to silently produce incorrect results.
+
+- **Wait for full completion**: always use `bash cosim.sh ... 2>&1 | tee <logfile>` and wait for the script to exit completely before inspecting results. Do not interrupt the run (e.g. with `Ctrl-C`) or send it to the background; doing so may leave orphan processes.
+
+- **Log preservation**: use `tee` to save each run's output to a distinct log file. Name logs after the change being tested (e.g. `build/cosim-phaseN-microbench.log`) so they can be compared across runs.
+
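The residual check above could be wrapped in a small pre-run guard. The following is a minimal sketch and not part of the repository: `check_residual` is a hypothetical helper name, and it only tests whether files matching the `/dev/shm/xdma_sim*` pattern exist. Finding and killing the owning PIDs is still left to the `lsof` step.

```bash
# Hypothetical pre-run guard: fail if any path matching the pattern exists.
# The default pattern is the one quoted in the precaution above.
check_residual() {
  local pattern="${1:-/dev/shm/xdma_sim*}"
  local stale=0
  local f
  # $pattern is deliberately unquoted so the shell expands the glob.
  # An unmatched glob stays literal, so each candidate is tested with -e.
  for f in $pattern; do
    if [ -e "$f" ]; then
      echo "stale: $f" >&2
      stale=1
    fi
  done
  [ "$stale" -eq 0 ]
}
```

One possible use is to gate the run on it, for example `check_residual && bash difftest/scripts/fpga_sim/cosim.sh WORKLOAD=$WORKLOAD DIFF=$REF_SO 2>&1 | tee build/cosim-phaseN-microbench.log`, so that the simulation never starts on top of leftover state.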
 #### Cleanup
 
 ```bash
@@ -246,6 +258,46 @@ Relevant code: [`src/test/csrc/common/query.cpp`](../src/test/csrc/common/query.
 | simv | `make clean && make simv DIFFTEST_QUERY=1 VCS=verilator -j2` | `./build/simv +workload=$WORKLOAD +diff=$REF_SO` |
 | FPGA Sim | Add `DIFFTEST_QUERY=1` to Step 2 (fpga-build) in §2.3 | `bash difftest/scripts/fpga_sim/cosim.sh WORKLOAD=$WORKLOAD DIFF=$REF_SO` |
 
+#### Reference DB Comparison
+
+When debugging transport-stage issues (squash/batch/delta), comparing a suspect Query DB against a known-good **reference DB** is the most effective approach.
+
+**Creating a reference DB:**
+
+Run the same workload with difftest on a known-good code revision (or with a simpler `--difftest-config` that bypasses the suspect stage). Save the resulting `build/difftest_query.db` as your reference:
+
+```bash
+cp build/difftest_query.db ref-microbench.db   # or ref-linux.db
+```
+
+**Hex conversion (required before comparison):**
+
+Query DB stores values in raw format. Convert both the suspect DB and the reference DB to hex for readable comparison:
+
+```bash
+python3 difftest/scripts/query/convert_hex.py build/difftest_query.db
+# Produces: build/difftest_query_hex.db
+
+python3 difftest/scripts/query/convert_hex.py ref-microbench.db
+# Produces: ref-microbench-hex.db
+```
+
+**Cross-DB comparison with ATTACH:**
+
+Use SQLite's `ATTACH` to join tables across the suspect and reference DBs in a single query:
+
+```bash
+sqlite3 build/difftest_query_hex.db \
+  "ATTACH 'ref-microbench-hex.db' AS ref;
+   SELECT a.STEP, a.NFUSED AS dut, b.NFUSED AS ref
+   FROM main.InstrCommit a
+   JOIN ref.InstrCommit b ON a.STEP=b.STEP AND a.MY_INDEX=b.MY_INDEX
+   WHERE a.NFUSED != b.NFUSED
+   ORDER BY a.STEP LIMIT 20;"
+```
+
+Replace the table name (`InstrCommit`) and column (`NFUSED`) with the checker and field that diverged. The first divergent STEP typically points to the root cause.
+
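Since the comparison query is meant to be reused with other tables and columns, it could be generated rather than edited by hand. This is a sketch under assumptions: `compare_sql` is a hypothetical helper, and it assumes the target table has `STEP` and `MY_INDEX` columns, as `InstrCommit` does in the example. Table and column names are interpolated without escaping, so it is only suitable for trusted input.

```bash
# Print the ATTACH-based comparison query for one table/column pair.
# Arguments: table name, column name, path to the hex-converted reference DB.
compare_sql() {
  local table="$1" column="$2" ref_db="$3"
  # The heredoc expands $table, $column and $ref_db; the SQL shape is
  # the same as the InstrCommit/NFUSED example.
  cat <<EOF
ATTACH '$ref_db' AS ref;
SELECT a.STEP, a.$column AS dut, b.$column AS ref
FROM main.$table a
JOIN ref.$table b ON a.STEP=b.STEP AND a.MY_INDEX=b.MY_INDEX
WHERE a.$column != b.$column
ORDER BY a.STEP LIMIT 20;
EOF
}
```

The generated SQL can be piped into the suspect DB, for example `compare_sql InstrCommit NFUSED ref-microbench-hex.db | sqlite3 build/difftest_query_hex.db`.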
 ### 3.3 Waveforms
 
 Waveforms capture hardware signal transitions and are used to inspect timing and event ordering at the RTL level. Build with waveform support enabled, then dump at runtime.
@@ -288,3 +340,37 @@ EMU runtime waveform options:
 2. **Locate** the checker source in [`src/test/csrc/difftest/checkers/`](../src/test/csrc/difftest/checkers). Read its comparison logic and understand which fields diverged from the printed DUT/REF state.
 3. **Query DB** (if transport stages are suspected): rebuild with `DIFFTEST_QUERY=1`, collect `build/difftest_query.db` from at least two runs (e.g. different `--difftest-config` settings or code revisions). Compare the DBs to narrow which transport stage is implicated.
 4. **Waveform** (for RTL-level verification): after forming a hypothesis, rebuild with `EMU_TRACE=1` (or `EMU_TRACE=fst` for simv). Dump focused waveforms for the suspect time range to validate timing and signal ordering.
+
+---
+
+## 5. Phased Verification Strategy
+
+When making multi-step changes to the hardware transport pipeline (e.g. modifying Squash, Batch, and Delta together), use a phased approach to isolate regressions early.
+
+### Principles
+
+- **One logical change per phase.** Each phase should modify one module or one aspect of the pipeline. Do not combine unrelated changes in the same phase.
+- **Gate on tests before proceeding.** A phase is complete only when all required tests pass. Never start the next phase on a failing baseline.
+- **Use a progress log.** Record each phase's changes, test results, and any debugging notes in a dedicated progress file (e.g. `.github/difftest-progress.md`). This provides a clear audit trail and makes it easier to bisect regressions.
+
+### Test Ladder
+
+Run tests in order of increasing cost. Stop at the first failure.
+
+| Step | Test | Pass Criteria | Approx. Time |
+|------|------|---------------|---------------|
+| 1 | microbench | `HIT GOOD TRAP`, no `ABORT`/`mismatch` | ~1–2 min |
+| 2 | linux (short) | No `ABORT`/`mismatch` within a `timeout 300` window | 5 min |
+| 3 | linux (long) | No `ABORT`/`mismatch` within a `timeout 600` window | 10 min |
+
+- Step 1 catches most functional regressions quickly.
+- Step 2 exercises more complex boot code paths and interrupt handling.
+- Step 3 is a final confidence check for timing-sensitive or rare-event issues. Only run after all phases pass Steps 1–2.
+
+### Typical Workflow
+
+1. **Compile** (full three-step build for FPGA Sim, or `make emu` for EMU).
+2. **Run microbench.** If it fails, debug and fix before running linux.
+3. **Run linux (5 min).** If it fails, the issue is likely related to more complex instruction sequences or interrupt timing.
+4. After **all phases pass** Steps 1–2, run the **final 10-min linux test** once as the acceptance gate.
+5. Record results in the progress log after each step.
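The pass criteria in the Test Ladder can be checked mechanically against a saved log. The helper below is a hedged sketch: `log_passes` is a hypothetical name, and it assumes the markers appear in the log exactly as written in the table (`HIT GOOD TRAP`, `ABORT`, `mismatch`, case-sensitive).

```bash
# Apply the Test Ladder pass criteria to a saved run log.
# kind "microbench" additionally requires HIT GOOD TRAP;
# any other kind is treated as a linux run (absence of failures only).
log_passes() {
  local log="$1" kind="$2"
  # An ABORT or mismatch line fails the run regardless of test kind.
  if grep -Eq 'ABORT|mismatch' "$log"; then
    return 1
  fi
  if [ "$kind" = "microbench" ]; then
    grep -q 'HIT GOOD TRAP' "$log" || return 1
  fi
  return 0
}
```

Gating each rung on the previous one then becomes a chain such as `log_passes build/cosim-phaseN-microbench.log microbench && <run linux>`, which stops at the first failure as the ladder requires.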
diff --git a/docs/workflow.md b/docs/workflow.md
@@ -0,0 +1,150 @@
+# DiffTest Task Workflow
+
+This document defines the standard workflow for complex difftest tasks — those involving multi-file changes, pipeline modifications, or iterative debugging. Simple one-shot edits do not require this process.
+
+## Overview
+
+```
+Task received
+  │
+  ▼
+Create Plan (.github/<task>-plan.md)
+  │
+  ▼
+Create Progress (.github/<task>-progress.md)
+  │
+  ▼
+┌─ Per Phase ──────────────────────────┐
+│  Re-read plan + progress             │
+│  │                                   │
+│  ▼                                   │
+│  Implement changes                   │
+│  │                                   │
+│  ▼                                   │
+│  Test (microbench → linux)           │
+│  │                                   │
+│  ├─ PASS → update progress, next     │
+│  └─ FAIL → debug, fix, re-test       │
+└──────────────────────────────────────┘
+  │
+  ▼
+Final verification (10-min linux)
+  │
+  ▼
+askQuestions: confirm completion / next steps
+```
+
+---
+
+## 1. Plan Document
+
+Every non-trivial task must start with a plan stored at `.github/<task>-plan.md`.
+
+### Required Sections
+
+| Section | Contents |
+|---------|----------|
+| **Header** | Task title, modification scope (which directories/files may be changed), target configuration, execution requirements |
+| **Design Rationale** | High-level description of *what* changes and *why*; core principles and key trade-offs |
+| **Prerequisites** | Environment variables, reference files, common build commands, common test commands, debug workflow |
+| **Phases** | One section per Phase: files to modify, specific changes with before/after logic, expected behavior, test instructions |
+| **Final Verification** | Acceptance test (typically 10-min linux) and exit criteria |
+
+### Guidelines
+
+- **Explicit build and test commands.** Every Phase must include the exact commands to build and test. Do not rely on "same as before" — copy the commands so each Phase is self-contained.
+- **Explicit pass/fail criteria.** For each test, state what constitutes PASS (e.g. `HIT GOOD TRAP`, no `ABORT`/`mismatch`) and FAIL.
+- **Logic description before code.** Describe the intended behavior change in words before showing code snippets. State the invariants that must hold.
+- **Scope boundaries.** Clearly state which files/directories are in-scope and which are off-limits (e.g. "do not modify `src/test/`").
+
+### Versioning
+
+If a plan needs significant revision (not minor fixes), create a new version: `<task>-plan-v2.md`, `<task>-plan-v3.md`, etc. Keep old versions for reference.
+
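The versioning rule can be made concrete with a helper that picks the next free plan filename. This is an illustrative sketch only: `next_plan_path` is a hypothetical name, and it assumes plan versions are numbered consecutively with no gaps.

```bash
# Print the path the next plan revision should use:
# <task>-plan.md first, then <task>-plan-v2.md, <task>-plan-v3.md, ...
next_plan_path() {
  local dir="$1" task="$2"
  local candidate="$dir/$task-plan.md"
  local version=2
  # Walk forward until an unused filename is found; old versions are kept.
  while [ -e "$candidate" ]; do
    candidate="$dir/$task-plan-v$version.md"
    version=$((version + 1))
  done
  echo "$candidate"
}
```

For example, `next_plan_path .github squash-decouple` prints `.github/squash-decouple-plan.md` when no plan exists yet (the task name here is made up for illustration).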
+---
+
+## 2. Progress Document
+
+Every task with a plan must have a corresponding progress file at `.github/<task>-progress.md` (versioned to match the plan).
+
+### Required Sections Per Phase
+
+| Section | Contents |
+|---------|----------|
+| **Changes Made** | Bullet list of actual modifications (file, what changed) |
+| **Test Results** | Each test with PASS/FAIL, key metrics (instrCnt, cycleCnt, IPC, duration) |
+| **Issues & Debugging** | Problems encountered, root-cause analysis, attempted fixes (including failed ones), and final resolution |
+
+### Status Table
+
+Use a summary table at the top for quick overview:
+
+```markdown
+| Phase | Change Summary | Microbench | Linux 5min | Linux 10min |
+|-------|---------------|-----------|-----------|-------------|
+| 1     | Squasher Decoupled | ✅ 783387 | ✅ 0 errors | — |
+| 2     | DeltaSplitter Decoupled | ✅ 783387 | ✅ 0 errors | — |
+| Final | All changes | — | — | ✅ 600s |
+```
+
+### Guidelines
+
+- **Record failed attempts.** When a fix fails, document the attempt, the failure mode, and why it was wrong. This prevents repeating the same mistake and preserves debugging context.
+- **Update immediately.** Write progress entries as each phase completes, not at the end. If a conversation is interrupted, the progress file is the recovery point.
+- **Include quantitative data.** Log exact instrCnt, cycleCnt, step numbers, error messages — not just "passed" or "failed".
+
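To make "update immediately" cheap, the per-phase sections could be stamped out before work on a phase begins. The sketch below is an assumption about layout, not a format the document prescribes: `progress_phase_stub` is a hypothetical helper, and the heading levels are chosen for illustration. Only the three section names come from the table above.

```bash
# Print an empty per-phase entry with the three required sections.
progress_phase_stub() {
  local phase="$1"
  # The quoted heredoc delimiter is not needed here because $phase
  # must expand; the section names contain no shell metacharacters.
  cat <<EOF
## Phase $phase

### Changes Made

### Test Results

### Issues & Debugging
EOF
}
```

Appending the stub with something like `progress_phase_stub 2 >> ".github/${TASK}-progress.md"` (where `TASK` is a hypothetical variable holding the task name) leaves empty sections that are filled in as the phase proceeds.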
+---
+
+## 3. Execution Practices
+
+### Context Recovery
+
+At the start of each conversation or after a context reset:
+
+1. **Re-read the plan** (`.github/<task>-plan.md`) to restore the task definition and Phase structure.
+2. **Re-read the progress** (`.github/<task>-progress.md`) to determine which Phase is current and what has already been tried.
+3. **Re-read relevant docs** (`difftest/docs/`) if the task involves unfamiliar modules.
+
+Do not rely on memory or assumptions about prior state. The plan and progress files are the source of truth.
+
+### Confirming Ambiguities
+
+When encountering unclear requirements, design choices, or unexpected test results:
+
+- Use `askQuestions` to confirm details with the user before proceeding.
+- Prefer asking early (before implementing a speculative fix) over asking late (after a failed debugging cycle).
+- Typical situations: scope clarification, which approach to take when multiple are viable, whether a failing test is a known issue, whether to proceed to the next Phase despite a partial result.
+
+### End-of-Conversation Check
+
+At the end of every conversation:
+
+- Use `askQuestions` to confirm whether there are further requirements, open questions, or next steps.
+- This ensures nothing is left implicit and gives the user a chance to redirect before the context is lost.
+
+### Sub-Agent Delegation
+
+Delegate the following to sub-agents (e.g. an Explore agent) to avoid filling the main conversation's context window:
+
+| Task | Why Delegate |
+|------|-------------|
+| Reading and analyzing log files (mismatch logs, build logs) | Logs are large and detailed; extracting the root cause is a focused subtask |
+| Query DB inspection (sqlite3 queries, cross-DB comparisons) | Involves multiple queries and iterative interpretation |
+| Waveform analysis (forming and checking hypotheses) | Requires reading signal traces and correlating with RTL logic |
+| Large-scale code reading for audit or review | Reviewing many files produces verbose context |
+
+The sub-agent should return only the **conclusion and key evidence** (e.g. "mismatch at step 4482, field NFUSED: DUT=65, REF=64, caused by ..."), not raw query output.
+
+---
+
+## 4. Debugging Escalation
+
+When a test fails, follow this escalation order:
+
+1. **Console output** — read the last ~100 lines for checker name, cycle, and DUT/REF state.
+2. **Query DB comparison** — convert to hex, compare with reference DB using `ATTACH`. Identify the first divergent STEP.
+3. **Waveform** — rebuild with `EMU_TRACE=fst`, dump the suspect time range, inspect signal transitions.
+
+At each level, form a hypothesis before escalating. If the hypothesis can be verified without the next level, do so.
+
+Detailed command references: see [test.md §3 (Debugging Artifacts)](./test.md#3-debugging-artifacts) and [test.md §4 (Debug Workflow)](./test.md#4-debug-workflow).
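The first escalation level (console output) can be narrowed to the lines that matter. This is a minimal sketch: `console_triage` is a hypothetical helper, the 100-line window follows the "last ~100 lines" guidance, and the `ABORT`/`mismatch` markers are borrowed from the pass criteria in test.md rather than from any confirmed log format.

```bash
# Print failure markers found in the tail of a run log.
# Arguments: log file, optional window size (default 100 lines).
# Returns non-zero when no marker is found in the window.
console_triage() {
  local log="$1"
  local window="${2:-100}"
  # The function's exit status is grep's: 0 if a marker was found.
  tail -n "$window" "$log" | grep -E 'ABORT|mismatch'
}
```

If this prints nothing, the failure is probably not visible in the final console output, which is a reasonable cue to move on to the Query DB comparison.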