This document holds the paper-grounded derivations for Gorgon's five engines.
One section per engine listed in CLAUDE.md § Algorithms.
Reference: Tarjan R.E. (1972), "Depth-first search and linear graph algorithms", SIAM Journal on Computing 1(2):146-160.
Signature: tarjan_scc(adj: dict[str, list[str]]) -> list[list[str]]
Given a directed graph G = (V, E), a strongly-connected component is a maximal subset S ⊆ V such that ∀ u, v ∈ S there exists a directed path from u to v. Tarjan's algorithm performs a single depth-first traversal maintaining two values per node:
index[v]— DFS visit order.lowlink[v]— smallest index reachable from v's DFS subtree (including v).
When lowlink[v] == index[v], every node currently above v on the auxiliary
stack belongs to a single SCC rooted at v. Linear time O(V + E).
- Stdlib only. Iterative DFS via an explicit work-stack of
(node, child_index)tuples — no Python recursion-limit risk on large repos (sys.setrecursionlimitworkaround discouraged). - Adjacency:
dict[str, list[str]]. Destination-only nodes (referenced but not keys) are seeded into the visit set so their SCC is captured.
- Self-loop edge counts as a 1-node SCC (size = 1). Treat as non-trivial cycle by SCC size > 1 OR self-loop check.
Reference: McCabe T.J. (1976), "A Complexity Measure", IEEE Transactions on Software Engineering SE-2(4):308-320.
Signature: cyclomatic(tree: ast.AST) -> dict[str, int]
For a control-flow graph CFG = (N, E) of a single-entry single-exit function body, cyclomatic complexity is:
M = E − N + 2P (where P = number of connected components, here P = 1)
Equivalently, M = (decision points + 1). Decision points include if, for,
while, try/except, boolean operators (each additional operand in a
chain), conditional expressions, and assertions.
- Stdlib
ast.walkover the function body. - Counted node types:
If,For,AsyncFor,While,Try,ExceptHandler,With,AsyncWith,IfExp,Assert,comprehension, plusBoolOp(each additional operand adds one branch). - Lambda is folded into its enclosing scope (no separate score).
- Match/case (3.10+). Each
caseadds a decision point — track separately if expanding to that node type. - Class-level statements. Not scored — McCabe is per-function.
Reference: Brin S. and Page L. (1998), "The anatomy of a large-scale hypertextual Web search engine", Computer Networks and ISDN Systems 30(1-7):107-117.
Signature: pagerank(adj: dict[str, dict[str, float]] | dict[str, list[str]], damping: float = 0.85, eps: float = 1e-6, max_iter: int = 200) -> dict[str, float]
PageRank computes the stationary distribution of a random surfer model on a directed graph. Power iteration:
PR(v) = (1−d)/N + d · Σ_{u ∈ in(v)} PR(u) / |out(u)|
with damping factor d = 0.85. Dangling nodes (no out-edges) redistribute their mass uniformly across all N nodes — the standard Brin-Page treatment.
- Stdlib only. Sparse representation:
dict[node, list[neighbour]]ordict[node, dict[neighbour, weight]]. - Convergence: ‖PR_t − PR_{t-1}‖_1 < eps (default 1e-6).
- Hard cap: 200 iterations regardless of eps (large-repo guard). On cap,
caller may surface
convergence_warning: truein the snapshot payload. - Sum-to-one preserved within float error (~1e-9 on 10k-node graphs).
- Disconnected components. Each receives independent rank mass; both groups are scored coherently within themselves.
- Self-loop on a leaf. PageRank concentrates on the leaf; treat as a pathological hot-spot in advisory text.
Reference: Halstead M.H. (1977), "Elements of Software Science", Elsevier North-Holland.
Signature: halstead(source: str) -> dict[str, float]
(keys: n1, n2, N1, N2, vocabulary, length, volume).
Given:
- n1 = distinct operators
- n2 = distinct operands
- N1 = total operator occurrences
- N2 = total operand occurrences
Then:
- vocabulary η = n1 + n2
- length N = N1 + N2
- volume V = N · log2(η)
Volume captures the information content needed to encode the program in the chosen vocabulary. Complementary to G2 — Halstead catches dense data-shuffling code that McCabe under-weights.
- Stdlib
tokenize.generate_tokensover the source string. - Operators include:
OPtokens plus Python control-flow keywords (if, elif, else, for, while, try, except, finally, with, as, return, yield, break, continue, pass, import, from, raise, and, or, not, is, in, lambda, def, class, global, nonlocal, assert, del, async, await). - Operands include:
NAME(non-keyword),NUMBER,STRING, f-string parts. - Comments, indentation tokens, encoding markers excluded.
- Fail-open on
TokenizeError/IndentationError/SyntaxError— return zeros rather than aborting the snapshot.
- Single-token files. vocabulary = 1 → log2(1) = 0 → volume = 0. Returned as-is; not an error.
Reference: Gauss C.F. (1809), "Theoria motus corporum coelestium in sectionibus conicis solem ambientium" (least-squares foundation for recursive EMA-with-posterior updates). Ecosystem precedent: Wixie F6, Emu A7, Crow H6, Djinn D5.
Signature: update_posterior(prior: dict, observation: dict, alpha: float = 0.3) -> dict
Per-(repo × hotspot-kind) drift posterior maintained as an EMA mean and an EMA-of-squared-deviation variance estimator (Welford-like recurrence on the EMA — approximate but stable):
μ_t = (1 − α) · μ_{t−1} + α · x_t
σ²_t = (1 − α) · σ²_{t−1} + α · (x_t − μ_{t−1})²
with α = 0.3 (half-life ≈ 30 snapshots). Top-N stability is the Jaccard similarity of the current top-N set to the previous snapshot's top-N.
- Stdlib
math.sqrt. Atomic JSONL append forlearnings.jsonlviastate_io.append_jsonl. - Posterior persists in
plugins/gorgon-learning/state/posterior.jsonvia write-tmp-rename throughstate_io.atomic_write_json.
- First observation seeds the posterior with σ = 0 (cannot estimate variance from one sample). Subsequent observations accumulate.
- Cold-start drift. A new repo's first 3-5 snapshots produce wide bands. This is correct — the posterior tells the developer "I do not yet have a stable expected envelope".