Auditor: Kraken (Project Reality Manager)
Audit Date: 2026-03-06
Scope: Tracks 2C (Phase I Skills 3+9) + 2D (Phase I Skill 7) post-implementation verification
Branches: build/2c-epistemic-layer (merged), build/2d-contradiction-detector
Tracks 2C and 2D are APPROVED. All three Phase I Epistemic Layer modules are authentically implemented with real logic, no stubs, and full test coverage.
Overall confidence: 90%
The 10% gap reflects:
- `EpistemicMonitor.extract_premises()` uses simple string search: multi-sentence extraction may miss complex sentence structures or create false positives.
- `DisagreementSurfacer` uses word-level Jaccard (not embeddings): low accuracy for semantically equivalent but lexically different statements.
- `ContradictionDetector.classify_contradiction()` temporal detection requires the year/date keyword in BOTH facts: missed if only one fact has a date reference.
None of these gaps blocks deployment, and all are documented.
Test counts:
- Track 2C: 1132 passed, 1 skipped, 0 failed
- Track 2D: 1112 passed, 1 skipped, 0 failed
- Unified 2C+2D+2A+2B: 1162 passed, 1 skipped, 0 failed
File: cato/orchestrator/epistemic_monitor.py
Status: CONFIRMED
Evidence:
- Premise extraction via 5 marker phrases: "because", "since", "assuming", "given that", "the fact that" — splits on ". " and "\n" boundaries
- `_premise_confidence_map` keyed by lowercase normalized premise text
- `get_gaps()` returns premises where stored confidence < `self.threshold`
- `can_interrupt()` enforces `_interrupt_count < max_interrupts` budget
- `record_unresolved()` logs gap with `timestamp: time.time()` and confidence
- `reset_session()` clears map + resets count: clean per-session isolation
- `get_unresolved_summary()` returns `{"total": N, "gaps": [...]}`
Tests verified: extract_premises finds markers across multi-sentence text; lowercase normalization makes "Python" == "python"; interrupt budget enforced; unresolved summary has correct total after multiple records.
Gap: Extraction splits on ". " (period+space) but not on "!" or "?". Complex academic text with these terminators may cause missed premises or over-long premise strings that dilute the marker matching.
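
To make the splitting gap concrete, here is a minimal sketch of marker-based premise extraction. It is not the audited module's code: `split_segments`, the `extended` flag, and the trailing-punctuation stripping are hypothetical, and only the five marker phrases and the ". " / newline boundaries come from the evidence above.

```python
import re

# Marker phrases as listed in the audit evidence.
MARKERS = ("because", "since", "assuming", "given that", "the fact that")


def split_segments(text: str, extended: bool = False) -> list[str]:
    """Split text into candidate premise segments.

    extended=False mirrors the audited boundaries (". " and newline).
    extended=True is a hypothetical variant that also splits after "!" and "?".
    """
    pattern = r"(?<=[.!?])\s+|\n" if extended else r"\. |\n"
    # Strip whitespace and trailing terminators so both modes return comparable text.
    cleaned = (seg.strip().rstrip(".!?") for seg in re.split(pattern, text))
    return [seg for seg in cleaned if seg]


def extract_premises(text: str, extended: bool = False) -> list[str]:
    """Return lowercase segments containing at least one marker phrase.

    Plain substring matching is used, so "since" also matches inside
    "sincere"; that is one source of the false positives noted above.
    """
    premises = []
    for segment in split_segments(text, extended):
        normalized = segment.lower()
        if any(marker in normalized for marker in MARKERS):
            premises.append(normalized)
    return premises
```

With the default boundaries, "Is it safe? We ship now because tests pass. Done" yields one over-long premise that still contains the question; with `extended=True` the question is split off and only the "because" sentence is returned.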
File: cato/orchestrator/disagreement_surfacer.py
Status: CONFIRMED
Evidence:
- `_jaccard(a, b)`: set-based word token overlap, `|a∩b| / |a∪b|`, no external deps
- `_stdev(values)`: pure `math.sqrt` standard deviation, no numpy needed
- `compute_disagreement_score()`: `0.6 * max_semantic_distance + 0.4 * confidence_stdev`
  - semantic_distance = 1 - jaccard (higher = more different)
  - all pairwise distances computed, max taken
  - rounded to 4 decimal places
- `is_disagreement()`: `score > threshold[task_type]`
- `classify_disagreement()`: keyword scanning on combined output text, 4-way
- `surface()`: returns None when no disagreement; else structured dict with consensus_view (highest-confidence model), minority_view (lowest-confidence), disagreement_type, score, recommended_action
Tests verified: identical outputs → score ~0; very different → score > 0; consensus = highest-confidence model output confirmed; minority = lowest confirmed.
Gap: Jaccard distance is lexical only. "The cat sat" vs "The feline rested" would score high disagreement (low overlap) despite identical meaning. For code-generation outputs this is acceptable (code is lexically precise), but for natural language the score may be noisy.
Thresholds confirmed: code=0.30, research=0.40, decision=0.25, default=0.35
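
The scoring formula and thresholds above can be sketched as follows. This is an illustration under stated assumptions, not the module's source: the function names are hypothetical, tokens are lowercased before comparison, two empty outputs are treated as identical, and the population form of the standard deviation is used (the audit does not say which form `_stdev` implements).

```python
import math
from itertools import combinations

# Thresholds as confirmed in the audit.
THRESHOLDS = {"code": 0.30, "research": 0.40, "decision": 0.25, "default": 0.35}


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0  # assumption: two empty outputs count as identical
    return len(set_a & set_b) / len(set_a | set_b)


def stdev(values: list[float]) -> float:
    """Population standard deviation using only math.sqrt (assumed form)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


def disagreement_score(outputs: list[str], confidences: list[float]) -> float:
    """0.6 * max pairwise (1 - jaccard) + 0.4 * stdev(confidences), rounded to 4 places."""
    distances = [1.0 - jaccard(a, b) for a, b in combinations(outputs, 2)]
    max_distance = max(distances, default=0.0)
    return round(0.6 * max_distance + 0.4 * stdev(confidences), 4)


def is_disagreement(score: float, task_type: str = "default") -> bool:
    """Compare a score against the per-task threshold, falling back to default."""
    return score > THRESHOLDS.get(task_type, THRESHOLDS["default"])
```

For the paraphrase example above, "The cat sat" and "The feline rested" share one token out of five, so the Jaccard distance is 0.8 and, with equal confidences, the score is 0.6 × 0.8 = 0.48, which exceeds every configured threshold despite the two sentences meaning the same thing.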
File: cato/memory/contradiction_detector.py
Status: CONFIRMED
Evidence:
- `SAME_TOPIC_THRESHOLD = 0.35`: Jaccard must exceed this to consider same topic
- `check_and_log()`: for each existing_fact above threshold, classify → log if type != NONE
- `already_detected()`: bidirectional pair check `(A,B) OR (B,A)`, prevents duplicates
- `classify_contradiction()`: keyword sets for TEMPORAL (year/date words), SOURCE (attribution phrases), PREFERENCE (want/like/prefer etc.), FACTUAL (fallback)
- `generate_explanation()`: `f"{type} contradiction: '{a[:80]}' vs '{b[:80]}'"`
- `resolve()` marks `resolved=1`, stores resolution text, returns `rowcount > 0`
- `get_health_summary()`: total/unresolved/by_type counts + top-3 entities by count
- WAL mode, 3 indexes (resolved, entity, type)
Tests verified: TEMPORAL detected on year mismatch; SOURCE on "according to" phrases; PREFERENCE on "prefer"; FACTUAL on same-topic factual conflict; duplicate pair prevention confirmed (2nd call returns []); resolve marks as resolved; health summary keys present with correct counts.
Gap: TEMPORAL classification requires date/time keywords to appear in BOTH fact_a and fact_b. If only fact_b has "2023" and fact_a doesn't mention a date, the pattern won't fire TEMPORAL — it will fall through to FACTUAL. This is acceptable behavior (FACTUAL is a safe fallback) but means temporal contradictions with only one dated fact are misclassified.
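
The bidirectional duplicate check listed in the evidence can be illustrated with a small SQLite sketch. The table layout, column names, and the `open_db` / `log_once` helpers are hypothetical; only the (A,B) OR (B,A) lookup and the "second call returns []" behavior come from the audit.

```python
import sqlite3


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open a database with a hypothetical minimal contradictions table."""
    conn = sqlite3.connect(path)
    # WAL applies to file-backed databases; for ":memory:" it has no effect.
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS contradictions ("
        "id INTEGER PRIMARY KEY, fact_a TEXT, fact_b TEXT, "
        "type TEXT, resolved INTEGER DEFAULT 0)"
    )
    return conn


def already_detected(conn: sqlite3.Connection, fact_a: str, fact_b: str) -> bool:
    """Return True if the pair was logged in either order."""
    row = conn.execute(
        "SELECT 1 FROM contradictions "
        "WHERE (fact_a = ? AND fact_b = ?) OR (fact_a = ? AND fact_b = ?) LIMIT 1",
        (fact_a, fact_b, fact_b, fact_a),
    ).fetchone()
    return row is not None


def log_once(conn: sqlite3.Connection, fact_a: str, fact_b: str, kind: str) -> list[int]:
    """Insert the pair unless it is already present; return the new row ids."""
    if already_detected(conn, fact_a, fact_b):
        return []
    cur = conn.execute(
        "INSERT INTO contradictions (fact_a, fact_b, type) VALUES (?, ?, ?)",
        (fact_a, fact_b, kind),
    )
    conn.commit()
    return [cur.lastrowid]
```

Checking both orders in one query keeps the pair symmetric without requiring callers to normalize which fact is stored first.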
test_epistemic_monitor.py::test_can_interrupt_false_at_max:
Creates monitor with max_interrupts=2, calls consume_interrupt() twice,
asserts can_interrupt() == False. Live in-memory state — not mocked.
test_disagreement_surfacer.py::test_surface_returns_none_when_no_disagreement:
Identical output text for all 3 models → Jaccard distance = 0 → score = 0.0 →
below threshold → surface() returns None. Confirmed.
test_contradiction_detector.py::test_check_and_log_prevents_duplicate:
Same pair checked twice. First call → [id]. Second call → []. DB query in
already_detected() confirmed via live SQLite in tmp_path.
test_contradiction_detector.py::test_health_summary_most_contradicted_entities_top3:
Writes 5 contradictions for "entityA", 3 for "entityB", 1 for "entityC".
most_contradicted_entities = ["entityA", "entityB", "entityC"]. Confirmed
correct ordering by count DESC.
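
The top-3 ordering verified by the last test can be expressed as a single grouped query. This is a sketch against a hypothetical one-column schema; the real table layout is not shown in the audit, and the alphabetical tie-break is an assumption added here to make the result deterministic.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory table with a hypothetical minimal schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE contradictions (id INTEGER PRIMARY KEY, entity TEXT NOT NULL)"
    )
    return conn


def most_contradicted_entities(conn: sqlite3.Connection, limit: int = 3) -> list[str]:
    """Return entity names ordered by contradiction count, highest first."""
    rows = conn.execute(
        "SELECT entity FROM contradictions "
        "GROUP BY entity ORDER BY COUNT(*) DESC, entity ASC LIMIT ?",
        (limit,),
    ).fetchall()
    return [row[0] for row in rows]
```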
Severity: Low (by-design limitation)
DisagreementSurfacer (2C) and ContradictionDetector (2D) both use word-level
Jaccard as a proxy for semantic distance. This is accurate for code and
technical text but inaccurate for paraphrased natural language. Production
upgrade path: swap in sentence-transformers cosine similarity (already a
Cato dependency).
Severity: Low
A temporal contradiction is classified as FACTUAL instead of TEMPORAL when
only one fact is dated.
Fix: Change check to any(kw in combined for kw in temporal_keywords)
where combined = fact_a.lower() + " " + fact_b.lower(). One-line fix.
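
The difference between the current both-fact check and the proposed combined check can be sketched as below. The keyword tuple and the year regex are hypothetical stand-ins for the module's actual temporal keyword set, which the audit does not list.

```python
import re

# Hypothetical keyword set; the real temporal_keywords are not shown in the audit.
TEMPORAL_KEYWORDS = ("year", "date", "yesterday", "today", "tomorrow")
YEAR_PATTERN = re.compile(r"\b(?:19|20)\d{2}\b")  # assumed four-digit year detection


def has_temporal(text: str) -> bool:
    """True if the text contains a year or a temporal keyword (substring match)."""
    lowered = text.lower()
    return bool(YEAR_PATTERN.search(lowered)) or any(
        kw in lowered for kw in TEMPORAL_KEYWORDS
    )


def is_temporal_both(fact_a: str, fact_b: str) -> bool:
    """Audited behavior: both facts must carry a temporal reference."""
    return has_temporal(fact_a) and has_temporal(fact_b)


def is_temporal_either(fact_a: str, fact_b: str) -> bool:
    """Proposed fix: test the combined text, so one dated fact is enough."""
    combined = fact_a.lower() + " " + fact_b.lower()
    return has_temporal(combined)
```

One trade-off to keep in mind: the combined check also fires when a date is incidental to the conflict, so how it interacts with the SOURCE and PREFERENCE checks depends on the classifier's evaluation order, which is not specified here.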
Severity: Low
_unresolved_gaps is in-memory only. Gaps are lost on process restart.
Fix: Add SQLite persistence (use existing cato.db, new epistemic_gaps table)
in a future release when the weekly digest feature is implemented.
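
As a rough idea of what that persistence could look like, the sketch below stores gaps in an `epistemic_gaps` table. The column names and function signatures are assumptions for illustration only; the audit names the table and the database but does not specify a schema.

```python
import sqlite3
import time


def init_gaps_table(conn: sqlite3.Connection) -> None:
    """Create the epistemic_gaps table (hypothetical columns) if missing."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS epistemic_gaps ("
        "id INTEGER PRIMARY KEY, premise TEXT NOT NULL, "
        "confidence REAL NOT NULL, timestamp REAL NOT NULL)"
    )
    conn.commit()


def record_unresolved(conn: sqlite3.Connection, premise: str, confidence: float) -> int:
    """Persist one gap with a lowercase premise and a time.time() timestamp."""
    cur = conn.execute(
        "INSERT INTO epistemic_gaps (premise, confidence, timestamp) VALUES (?, ?, ?)",
        (premise.lower(), confidence, time.time()),
    )
    conn.commit()
    return cur.lastrowid


def get_unresolved_summary(conn: sqlite3.Connection) -> dict:
    """Return the same {"total": N, "gaps": [...]} shape as the in-memory version."""
    rows = conn.execute(
        "SELECT premise, confidence, timestamp FROM epistemic_gaps ORDER BY id"
    ).fetchall()
    gaps = [{"premise": p, "confidence": c, "timestamp": t} for p, c, t in rows]
    return {"total": len(gaps), "gaps": gaps}
```

Keeping the summary shape identical to the in-memory version would let a future digest feature read persisted gaps without changes to its callers.
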
| Category | Result |
|---|---|
| EpistemicMonitor (premise extraction + gaps) | CONFIRMED |
| EpistemicMonitor (interrupt budget + unresolved) | CONFIRMED |
| DisagreementSurfacer (Jaccard + stdev score) | CONFIRMED |
| DisagreementSurfacer (classify + surface) | CONFIRMED |
| ContradictionDetector (Jaccard topic matching) | CONFIRMED |
| ContradictionDetector (4-type classification) | CONFIRMED |
| ContradictionDetector (duplicate prevention) | CONFIRMED |
| ContradictionDetector (health summary) | CONFIRMED |
| Semantic similarity (embeddings) | NOT USED — Jaccard only (documented) |
| TEMPORAL both-fact date requirement | PARTIAL — documented gap |
| New tests (2C) | 50 tests (25 epistemic, 25 disagreement) |
| New tests (2D) | 30 tests |
| Unified total | 1162 passed, 1 skipped, 0 failed |
| Final confidence score | 90% |
APPROVED
All three Phase I modules are real, tested, and functionally correct for their stated purpose. The Jaccard limitations are known and acceptable for the current use case (the alternative would add 200ms+ latency per turn). The ContradictionDetector's TEMPORAL gap is low-severity. All 80 new tests use live in-memory or tmp_path SQLite — no mocks for behavioral assertions.
Signed: Kraken — 2026-03-06