|
| 1 | +# General |
| 2 | + |
| 3 | +> **Meta-rule**: Rules in this file include their rationale. Rules with rationales are followed more reliably and transfer better to novel situations. |
| 4 | +- **Documentation is dual-use**: Agents start each session with no memory. All docs under `docs/` serve both humans and agents — include exact commands, paths, and rationales for non-obvious steps. |
| 5 | +- **Do not be a yes-man**: Humans make bad decisions and forget to give the whole picture. Ask the user to clarify and use data to improve your decisions. |
| 6 | +- **Markdown line style**: One sentence per line, no hard-wrapping at 80 columns. Sentence-per-line keeps diffs readable and lets editors soft-wrap to the viewer's width. |
| 7 | + |
| 8 | +# Codebase Stage & Work Modes |
| 9 | + |
| 10 | +jumanpp v2 is a released, mature C++ morphological analyzer — **but the codebase is under active evolution**. |
| 11 | +Treat existing code as *provisional where it serves the new direction* and *load-bearing where downstream users depend on it* (CLI, output format, model-file compatibility with released tarballs). |
| 12 | + |
| 13 | +**Continuous refactor is the correct default, not the exception.** |
| 14 | +Past experience with minimum-diff / patch-mode on this codebase produced technical-debt accumulation that later refactors had to pay off at higher cost. |
| 15 | +"Prefer minimal edits" means: minimize comprehension cost for session N+5, not line count in this diff. |
| 16 | + |
| 17 | +**Work modes** (user sets at session start or switches mid-session): |
| 18 | +- **Evolve** (~80%): evolve domain model and codebase toward correct modeling. Refactors and rewrites welcome, including revolutionary ones. **Default mode.** |
| 19 | +- **Analyze** (~15%): read code/logs, make plans, no code changes. |
| 20 | +- **Meta** (~5%): improve interaction workflows (this file, `docs/`). |
| 21 | +- **Patch**: minimal diff for a specific problem. **Never assume this mode** — user must request it explicitly. Defensive minimalism on this codebase has historically produced debt, not safety. |
| 22 | + |
| 23 | +**Design rules** (evolve mode): |
| 24 | +- Domain objects over god services. If logic only needs one object's data, it belongs on that object. |
| 25 | +- Make invalid states non-representable. If two values are meaningless without each other, they're one type. If a pipeline has stages, the stage outputs are types. |
| 26 | +- Concepts map to domain objects. Nouns are types. Verbs can be both methods and types. |
| 27 | + |
| 28 | +# Project Rules & Guidelines |
| 29 | + |
| 30 | +## Environment & Configuration |
| 31 | +- **Git Protocol**: User intent is always partial; they edit files between turns. Never commit until triggered. Inspect actual state when committing. |
| 32 | +- **Commit Prefixes** (match existing log style): |
| 33 | + - `fix:` bug fix |
| 34 | + - `build:` CMake / dependency / packaging |
| 35 | +  - `refactor:` structural change, no behavior change |
| 36 | + - `feat:` new user-visible capability |
| 37 | + - `docs:` human-and-agent system knowledge under `docs/` or README |
| 38 | + - `agent:` changes to `CLAUDE.md` or agent-only artifacts |
| 39 | + - `test:` test-only changes |
| 40 | + |
| 41 | +## Workflow & Planning |
| 42 | +- **Hypothesis Protocol**: When investigating: |
| 43 | + 1. State multiple conflicting hypotheses. A single hypothesis is a conclusion in disguise. |
| 44 | + 2. State them to the user before running off to validate for 10 minutes. |
| 45 | + 3. Persist what survives in `docs/` (e.g. `docs/knowledge/` — create if needed). Session-scoped hypotheses stay in the plan/issue. |
| 46 | +- **Review Mode**: For non-trivial changes: propose in text, read-only checks only, wait for approval. Surface everything relevant, not just what was asked. |
| 47 | +- **Incremental Implementation**: Don't assume you know full scope. Work on small sub-tasks, verify alignment after each. |
| 48 | +- **Plan Fidelity**: Don't silently reduce plan scope. If part of the plan turns out to be wrong mid-implementation, stop and update the plan; don't deliver partial work and call it done. |
| 49 | +- **Scope Decisions Are Not Yours to Make Silently**: Related work is either included (if excluding breaks coherence) or asked about. "That's separate work" is never a silent conclusion. |
| 50 | +- **Meta Feedback**: On `meta:` prefix, pause immediately. Propose instruction-file changes via Review Mode, apply after approval, resume. Update this file, not memory. |
| 51 | + |
| 52 | +## Plans |
| 53 | +Use GitHub Issues on `ku-nlp/jumanpp` for plan tracking. |
| 54 | +Never use per-project memory — it's local only. |
| 55 | + |
| 56 | +# Build & Test |
| 57 | + |
| 58 | +Out-of-source build (CMake refuses in-source): |
| 59 | + |
| 60 | +```bash |
| 61 | +cmake -S . -B build -DCMAKE_BUILD_TYPE=Debug # or Release for perf work |
| 62 | +cmake --build build -j |
| 63 | +ctest --test-dir build --output-on-failure |
| 64 | +``` |
| 65 | + |
| 66 | +For formatting before commits: `./do_format.sh` (clang-format). |
| 67 | + |
| 68 | +Ubuntu 22.04 needs `libprotobuf-dev protobuf-compiler`. |
| 69 | + |
| 70 | +**Model compatibility warning**: Current git HEAD is not compatible with released model files (2.0-rc1 / rc2). |
| 71 | +End-to-end analysis requires either rebuilding the dictionary or using a matching release tarball. |
| 72 | +Do not commit model or dictionary binaries. |
| 73 | + |
| 74 | +# Subsystem Map |
| 75 | + |
| 76 | +- `src/core/` — analysis engine (lattice), feature computation, dictionary compilation, training, codegen, spec DSL. |
| 77 | + - `analysis/` — runtime lattice, beam search, char lattice, analyzer |
| 78 | + - `spec/` — DSL for declaring dictionary fields, features, unks |
| 79 | + - `training/` — structured perceptron / loss |
| 80 | + - `codegen/` — generated feature-compute C++ |
| 81 | + - `dic/` — dictionary builder & reader |
| 82 | +- `src/jumandic/` — Juman dictionary schema + `jumanpp` CLI. |
| 83 | +- `src/rnn/` — RNNLM scorer. **Experimental replacement target** (transformer). |
| 84 | +- `src/util/` — containers, mmap, serialization, flatmap, logging. |
| 85 | +- `src/testing/` — standalone test harness used by `*_test.cc` files. |
| 86 | + |
| 87 | +# Language & Style |
| 88 | + |
| 89 | +- **C++14 baseline.** Supported by every toolchain we build with (gcc, clang, MSVC, mingw64). There is no reason to artificially avoid its features, and no reason to reach for C++17/20 without discussing it first: CMake and CI expect C++14. |
| 90 | +- Headers and sources colocated under `src/`; tests sit next to the code they test as `*_test.cc`. |
| 91 | +- Run `./do_format.sh` before committing. It wraps `script/git-clang-format.py` and formats only changed hunks, not full files. |
| 92 | +- **Do not mass-reformat the codebase.** Formatting migrates per-hunk as files are edited, so the tree gradually picks up whatever clang-format version contributors have locally. Full-file passes break `git blame` and produce churn with no real benefit. |
| 93 | + |
| 94 | +# Testing |
| 95 | + |
| 96 | +- In-tree harness: `src/testing/standalone_test.h` wraps Catch-style `TEST_CASE`. Tests are `*_test.cc` files next to their subject and are discovered by CMake. |
| 97 | +- Run the full suite: `ctest --test-dir build --output-on-failure`. For one test: `ctest --test-dir build -R <name> --output-on-failure`. |
| 98 | + |
| 99 | +# Documentation |
| 100 | + |
| 101 | +> Agents start cold. Docs exist so session N+5 doesn't re-derive session N. |
| 102 | + |
| 103 | +- **System docs** (`docs/`): existing files — `analysis.md`, `building.md`, `dictionary.md`, `output.md`, `spec.md`. High-level *what* and *why*; implementation lives in code. |
| 104 | +- **Knowledge** (`docs/knowledge/`): instructive dead-ends and confirmed non-obvious facts from investigations. Create the directory on first use. |
| 105 | +- **Code comments**: describe the *goal*, not the mechanism. If a comment only narrates what the code already says, delete it. Non-obvious code with no clear goal is a refactor target, not a comment target. |
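
The design rules listed for evolve mode (logic on the object that owns the data, values that are meaningless apart become one type, stage outputs are types) can be illustrated with a small C++14 sketch.
Every name here (`Span`, `TokenizedText`, `tokenize`, `surfaces`) is hypothetical and does not come from the jumanpp sources, and the whitespace tokenizer is only a stand-in for a real pipeline stage.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// NOTE: all types and functions below are hypothetical illustrations,
// not part of jumanpp.

// An offset and a length are meaningless without each other, so they are one type.
class Span {
 public:
  Span(std::size_t begin, std::size_t length) : begin_(begin), length_(length) {}

  std::size_t begin() const { return begin_; }
  std::size_t length() const { return length_; }
  std::size_t end() const { return begin_ + length_; }

  // This logic needs only the span's own data, so it lives on the span.
  bool contains(std::size_t pos) const { return pos >= begin_ && pos < end(); }

 private:
  std::size_t begin_;
  std::size_t length_;
};

// Output of the first pipeline stage. The constructor is private, so the only
// way to obtain a TokenizedText is to run tokenize(): a later stage cannot be
// handed text whose spans were computed for some other string.
class TokenizedText {
 public:
  const std::string& text() const { return text_; }
  const std::vector<Span>& tokens() const { return tokens_; }

 private:
  friend TokenizedText tokenize(std::string text);

  TokenizedText(std::string text, std::vector<Span> tokens)
      : text_(std::move(text)), tokens_(std::move(tokens)) {}

  std::string text_;
  std::vector<Span> tokens_;
};

// Stage 1: split on ASCII spaces, skipping empty tokens.
TokenizedText tokenize(std::string text) {
  std::vector<Span> tokens;
  std::size_t start = 0;
  for (std::size_t i = 0; i <= text.size(); ++i) {
    const bool atBoundary = (i == text.size()) || (text[i] == ' ');
    if (atBoundary) {
      if (i > start) tokens.emplace_back(start, i - start);
      start = i + 1;
    }
  }
  return TokenizedText(std::move(text), std::move(tokens));
}

// Stage 2: accepts only the stage-1 output type, never a raw string plus offsets.
std::vector<std::string> surfaces(const TokenizedText& tokenized) {
  std::vector<std::string> out;
  out.reserve(tokenized.tokens().size());
  for (const Span& s : tokenized.tokens()) {
    assert(s.end() <= tokenized.text().size());  // guaranteed by tokenize()
    out.push_back(tokenized.text().substr(s.begin(), s.length()));
  }
  return out;
}
```

A caller would write `surfaces(tokenize("a bc  d"))`; passing a bare `std::string` to `surfaces` does not compile, which is the sense in which the invalid state is non-representable.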