This document holds the paper-grounded derivations for Naga's five engines.
One section per engine listed in CLAUDE.md § Algorithms.
Reference: Zhang K. and Shasha D. (1989), "Simple fast algorithms for the editing distance between trees and related problems", SIAM Journal on Computing 18(6):1245-1262.
Signature: tree_edit_distance(source_tree: ast.AST, target_tree: ast.AST) -> int
For ordered labeled trees T1 and T2, the tree edit distance is the minimum cost of transforming T1 into T2 via a sequence of insert / delete / relabel operations on individual nodes. Zhang and Shasha showed that the problem admits an O(|T1| * |T2| * min(depth(T1), leaves(T1)) * min(depth(T2), leaves(T2))) solution via dynamic programming over postorder-numbered subtrees.
Naga's implementation uses the simplified Wagner-Fischer DP over postorder sequences (cost = 1 per insert / delete / relabel) — fast, deterministic, and sufficient for the structural-shape signature we need. Full Zhang-Shasha is a future optimisation if profiling shows DP cost dominates.
- Stdlib only.
ast.parse()for Python sources; postorder traversal viaast.iter_child_nodes. - Adjacency: implicit via the AST tree structure.
- For very large source trees (> 50 KB), the caller should chunk the source into top-level definitions and hash each subtree to a bounded representation.
- SyntaxError on source.
ast.parseraises; caller falls back to text-only N2 + N3 fingerprint and marksast_available: false. - Cross-language source/target. AST-shape does not transfer between Python and markdown; naga-orchestrator decides the relaxation set.
Reference: Spaerck Jones K. (1972), "A statistical interpretation of term specificity and its application in retrieval", Journal of Documentation 28(1):11-21.
Signature: tfidf_vector(tokens: list, corpus_df: dict, n_docs: int) -> dict
For term t in document d within a corpus C of N documents, the TF-IDF weight is:
w(t, d) = tf(t, d) * idf(t, C)
where tf(t, d) = count(t, d) / sum_{t'} count(t', d) and
idf(t, C) = log((N + 1) / (df(t) + 1)) + 1 (smoothed to handle unseen
terms). The intuition: terms common across the corpus get suppressed;
terms specific to this document get amplified.
- Stdlib only.
collections.Counterfor term frequency;math.logfor IDF. - Stopword stripping via a frozen set of low-information tokens.
corpus_dfis pre-computed per pattern-class and cached atstate/corpus_df.json. Cold-start: emptycorpus_df-> all terms getidf = log(N+1) + 1.
- Empty token list -> empty vector (callers must handle).
- Cold-start corpus -> over-weights every term equally; mark
cold_start: trueuntiln_docs >= 5.
Reference: Levenshtein V.I. (1966), "Binary codes capable of correcting deletions, insertions, and reversals", Soviet Physics Doklady 10(8):707-710.
Signature: name_distance(source_name: str, candidate_name: str) -> int
The Levenshtein distance between strings a and b is the minimum number of single-character insertions, deletions, or substitutions required to transform a into b. The Wagner-Fischer DP (1974) computes it in O(|a| * |b|) time and O(min(|a|, |b|)) space via a rolling row.
- Stdlib only. Explicit DP via two list rows — no third-party Levenshtein package.
- For [0, 1] similarity, the
name_similarityhelper falls back todifflib.SequenceMatcher.ratio()— quicker and more forgiving for N4-vector entry.
- Empty string in either argument -> distance = length of the other.
- Unicode normalisation -> caller is responsible; Naga compares strings as-is.
Reference: Salton G., Wong A., Yang C.S. (1975), "A vector space model for automatic indexing", Communications of the ACM 18(11):613-620.
Signature: cosine_similarity(vec_a: dict, vec_b: dict) -> float
For two vectors a and b in a shared term-space, the cosine similarity is:
cos(a, b) = (a . b) / (||a|| * ||b||)
bounded in [-1, 1]. For non-negative weight vectors (as in TF-IDF), the output is bounded in [0, 1]: 1.0 means identical direction, 0.0 means orthogonal.
- Stdlib only. Pure-Python list comprehensions over the intersection of keys
for the dot product;
math.sqrtfor norms. - Output clamped to [0, 1] to handle floating-point drift.
- No numpy, no scipy.
- Empty vector on either side -> 0.0 (callers must distinguish from legitimate zero-similarity).
- Disjoint key sets -> 0.0; single-axis matchers will trigger this. Counter: ensure N1 + N2 + N3 features all populate the vector.
Reference: Gauss C.F. (1809), "Theoria motus corporum coelestium in sectionibus conicis solem ambientium" (least-squares foundation for recursive EMA-with-posterior updates).
Ecosystem precedent: Wixie F6, Emu A7 (renamed EMA), Crow H6 (renamed EMA), Djinn D5, Gorgon G5.
Signature: update_posterior(prior: dict, observation: dict, alpha: float = 0.3) -> dict
For a stream of fidelity observations {x_t}, maintain the EMA mean and an EMA-of-squared-deviation variance estimator:
mu_t = (1 - alpha) * mu_{t-1} + alpha * x_t
sigma2_t = (1 - alpha) * sigma2_{t-1} + alpha * (x_t - mu_{t-1})^2
sigma_t = sqrt(sigma2_t)
The posterior 10th-percentile threshold under a normal approximation is:
p10_t = max(0, mu_t - 1.2816 * sigma_t)
p10_t is the per-(pattern-class, target-domain) acceptable-fidelity floor
that the shaper must clear. Half-life ~ 30 observations at alpha = 0.3.
- Stdlib only. JSONL append-only learnings via
state_io.append_jsonl; posterior persisted viastate_io.atomic_write_json. - Pattern classes (canonical): {claude-md, python-module, plugin-json, hook-script, agent-md, test-module, docs-md, generic}.
- Cold-start: first observation seeds the posterior with
sigma = 0,p10 = max(0, x - 0.1). Subsequent observations refine.
- Class drift — same pattern_class label across two repos with divergent
conventions. Sigma inflates; surface as
class_drift_detected: truerather than averaging away the divergence. - Cold-start advisory — when
n_observations < 5, markcold_start: trueand hold the threshold at the global default p10 = 0.6.