
SCCE 2.0

SCCE (Self-Contained Cognitive Engine) is a local-first, citation-driven question-answering system. It answers questions over an ingested corpus using classical retrieval (BM25 + entity graph + spectral), a planner-driven verification loop, and a local n-gram synthesizer. No remote LLM calls are required on the answering path.

Quickstart (one command)

git clone <this-repo> scce && cd scce
npm install -g pnpm@9        # if you don't have pnpm
pnpm install:local           # installs deps, builds, runs migrations if Postgres is up
pnpm dev:server              # terminal A
pnpm dev:web                 # terminal B  →  http://localhost:5173

pnpm install:local is idempotent. If Postgres isn't running, it prints exactly which commands to run (Homebrew / Docker / apt) and exits non-zero so CI can detect it. Re-run after starting Postgres.

Reviewer fast-path (10 seconds, no DB required)

A hostile reviewer with the brief "show me, don't tell me" runs:

pnpm install && pnpm build       # ~30s, deps + 9-package build
pnpm publishable:check           # ~5s — 18/18 audit + 4-axis eval + fuzz + bench
pnpm seed:test                   # ~5ms — bootstraps a 5-domain brain

Reproducible numbers (April 2026 baseline, M4 / Node 25):

  • 18/18 hostile-review guards pass
  • 100% capability_correctness on the 16-Q embedded eval
  • 5,000 fuzz iters, 0 contract violations
  • 25,000 q/s, p99 0.06ms end-to-end (16-triple brain)
  • 99 pass / 0 fail / 3 skipped on the unit suite (clean checkout)

Optional diagnostic (NOT a release gate):

pnpm test:integration            # runs 45 gated end-to-end tests on a 5-domain seed

The integration suite stresses the extraction/reasoning heuristics on a deliberately tiny inline corpus; ~30% of them fail by design (the heuristics are tuned for live corpora ≥10× larger). The suite is a development-time aid, not a pass/fail signal — see the test file header for why. The actual release gate is pnpm publishable:check.

Framing: see docs/POST_TOKEN_ECONOMY.md. Honest gap closure log: docs/HOSTILE_AUDIT_PRODUCTIZATION.md §14.

The Brain Bundle

Everything SCCE has learned — the n-gram models, the linguistic-primitives lexicon, the concept graph (including Wikipedia-mined common-sense edges) — can be packed into a single portable file:

pnpm brain:export                      # → ./scce-<timestamp>.brain
pnpm brain:import ./scce-<ts>.brain    # CRC-verified, atomic, idempotent

Or from the GUI: Brain tab → Download .brain / drag-and-drop a file to inspect → review entries → Apply. Two-step import means you always see the manifest (and per-entry CRC32) before disk state is overwritten.

The bundle format is documented in brainBundle.ts: 6-byte magic, version, JSON manifest, length-prefixed entries each with their own CRC32, and a footer CRC32 over the concatenation. Hostile-audit posture is documented in the file header.
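For illustration, the length-prefixed, CRC-checked entry layout can be sketched in TypeScript. The 4-byte big-endian length field, the helper names, and the standalone CRC32 implementation are assumptions made for this sketch — brainBundle.ts remains the authoritative spec:

```typescript
// Hypothetical sketch of one length-prefixed, CRC-checked bundle entry:
// [4-byte length][payload][4-byte CRC32], big-endian. Field widths are
// illustrative; see brainBundle.ts for the real layout.

function crc32(buf: Uint8Array): number {
  let crc = 0xffffffff;
  for (const byte of buf) {
    crc ^= byte;
    for (let i = 0; i < 8; i++) {
      crc = (crc >>> 1) ^ (0xedb88320 & -(crc & 1));
    }
  }
  return (crc ^ 0xffffffff) >>> 0;
}

// Pack one entry with its own trailing CRC32 over the payload.
function packEntry(payload: Uint8Array): Uint8Array {
  const out = new Uint8Array(8 + payload.length);
  const view = new DataView(out.buffer);
  view.setUint32(0, payload.length);
  out.set(payload, 4);
  view.setUint32(4 + payload.length, crc32(payload));
  return out;
}

// Unpack and verify; throwing on mismatch is what keeps imports atomic —
// a corrupt entry is rejected before any disk state changes.
function unpackEntry(buf: Uint8Array): Uint8Array {
  const view = new DataView(buf.buffer, buf.byteOffset);
  const len = view.getUint32(0);
  const payload = buf.subarray(4, 4 + len);
  if (view.getUint32(4 + len) !== crc32(payload)) {
    throw new Error("entry CRC32 mismatch");
  }
  return payload;
}
```

The per-entry CRC is what lets the GUI show a verified manifest before anything is overwritten.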

What It Is, And What It Is Not

SCCE is:

  • A local-first QA system that answers from documents you ingest.
  • A retrieval pipeline that fuses lexical (BM25), graph (entity co-occurrence), and spectral (TF-IDF + truncated SVD) channels.
  • A planner that hypothesizes, verifies against retrieved spans, and emits per-sentence provenance.
  • A Fastify server with explicit migrations, controlled shutdown, SSE streaming, and a job queue.

SCCE is not:

  • A general-purpose LLM. Its synthesizer is a Kneser-Ney n-gram model; fluency is bounded by what it has seen.
  • A reasoning engine in the chain-of-thought sense. "Reasoning" here means structured retrieval + verification, not free-form deduction.
  • A drop-in replacement for hosted models. It is built for use cases where traceability and offline operation outweigh stylistic polish.

Provenance is a hard requirement of the verification path: sentences without supporting span overlap are flagged, not hidden.

Known Honesty Bounds

Four gaps are documented openly because the project's posture is "say what we don't do" rather than paper over weaknesses:

  1. Surface fluency is bounded by mined sentence templates plus the 6-gram synthesizer. SCCE now ships a fluency realizer that walks a proof tree over the concept graph and slots its claims into English frames mined directly from Wikipedia (sentenceTemplates.ts). Output is then polished, perplexity-ranked against the local n-gram model, and run through the self-evaluator, which can force-abstain when coverage, citations, or completeness fall below threshold. Answers in well-covered domains read like English; in thin domains the system prefers "I don't know" to fluent guessing.

  2. Zero-shot generalization to unseen tokens is handled by the honest analogy engine — morphological + compositional + structural analogy with hard per-kind confidence caps (≤ 0.6). When no analogy can be honestly drawn, the planner says "I don't know."

  3. Multi-step reasoning is capped at 4 hops by default and may extend to 8 hops only when the caller supplies a per-step entailment verifier (cite-or-stop). See multiHopWalker.ts.

  4. Common-sense breadth is mined from Wikipedia ITSELF, not from crowdsourced graphs (no ConceptNet ingestion). Five novel signals — list-page enumeration, category co-membership, superlative typicality, infobox value priors, hyperlink-anchor aliasing — are implemented in commonSenseMiner.ts and committed with empirical confidence and per-signal provenance.
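The hard per-kind confidence caps in item 2 can be sketched concretely. The three analogy kinds and the 0.6 ceiling come from the text; the scoring inputs, the floor, and the function names are hypothetical:

```typescript
// Illustrative sketch of hard per-kind confidence caps on the analogy
// engine. Kind names and the 0.6 cap mirror the README; everything else
// (raw scores, floor value) is an assumption for this example.

type AnalogyKind = "morphological" | "compositional" | "structural";

const KIND_CAPS: Record<AnalogyKind, number> = {
  morphological: 0.6,
  compositional: 0.6,
  structural: 0.6,
};

interface Analogy { kind: AnalogyKind; rawScore: number }

// Clamp every candidate to its kind's cap; return null ("I don't know")
// when no candidate survives a minimum floor.
function bestAnalogy(
  cands: Analogy[],
  floor = 0.2,
): { kind: AnalogyKind; confidence: number } | null {
  let best: { kind: AnalogyKind; confidence: number } | null = null;
  for (const c of cands) {
    const confidence = Math.min(c.rawScore, KIND_CAPS[c.kind]);
    if (confidence >= floor && (!best || confidence > best.confidence)) {
      best = { kind: c.kind, confidence };
    }
  }
  return best;
}
```

The cap is applied before comparison, so even a near-perfect raw score can never exceed 0.6 — the clamp is the honesty guarantee, not the scoring.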

What SCCE Does

SCCE combines five capabilities into one deployable system:

  1. Corpus ingestion across mixed sources (documents, spreadsheets, code, wiki-style corpora).
  2. Knowledge structuring via entities, relations, and spectral projections.
  3. Multi-channel retrieval (lexical, graph, spectral) with diversity-aware fusion.
  4. Planner-driven reasoning loop that tests and refines candidate claims.
  5. Local synthesis with quality gates, provenance checks, and uncertainty signaling.
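Mechanically, fusing the three retrieval channels (capability 3, minus the diversity term) might look like the sketch below. The weights, the per-channel max-normalization, and the function shape are illustrative assumptions, not SCCE's actual fusion logic:

```typescript
// Minimal sketch of multi-channel score fusion: normalize each channel's
// scores to [0, 1], then take a weighted sum per span. Weights here are
// made up for illustration.

type Channel = "bm25" | "graph" | "spectral";
type Scores = Map<string, number>; // spanId -> raw channel score

function fuse(
  channels: Record<Channel, Scores>,
  weights: Record<Channel, number> = { bm25: 0.5, graph: 0.25, spectral: 0.25 },
): Array<[string, number]> {
  const fused = new Map<string, number>();
  for (const ch of Object.keys(channels) as Channel[]) {
    const scores = channels[ch];
    const max = Math.max(...scores.values(), 1e-9); // per-channel max-normalize
    for (const [id, s] of scores) {
      fused.set(id, (fused.get(id) ?? 0) + weights[ch] * (s / max));
    }
  }
  // Highest fused score first.
  return [...fused.entries()].sort((a, b) => b[1] - a[1]);
}
```

Normalizing per channel before summing keeps BM25's unbounded scores from drowning out the bounded spectral similarities.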

Reasoning, Fluency, and Abstention

On top of retrieval, SCCE has a small post-LLM cognitive layer:

  • Proof-tree reasoner (reasoner.ts) resolves anchors in the question, harvests 1-hop edges from the concept graph, then beam-searches multi-hop chains with the multi-hop walker. Every claim carries citations and a confidence; the tree exposes a completeness score and an answer | abstain recommendation.
  • Sentence templates (sentenceTemplates.ts) are mined from Wikipedia during ingestion (first 10 sentences per document) and round-trip inside the .brain bundle as templates.json. Frames are keyed by predicate (is-a, worked-with, wrote, ...) so realization stays grounded in attested phrasings.
  • Fluency realizer (fluencyRealizer.ts) plans (definition → property → relation → multi-hop), slots each claim into a mined frame, joins with connectives ("Additionally", "By extension"), polishes (a/an, capitalization, punctuation), ranks candidates by n-gram perplexity, and appends a (sources: ...) suffix.
  • Self-evaluator (selfEval.ts) scores six honesty signals — coverage, citation density, unverified-chain risk, confidence floor, completeness, fabrication risk vs. excerpts — and can hard-abstain at severity ≥ 0.999.
  • Bundle federation (bundleFederation.ts) loads N signed .brain bundles in priority order into a single live brain, so a 50 GB shipped knowledge pack can ride alongside user-trained ones with per-bundle CRC and signature verification.
  • Code & environment readers (codeReader.ts, environmentReader.ts) ingest TypeScript / JavaScript / Python source and project trees (respecting .gitignore) into the same concept graph as prose, so the brain can reason about the project it lives in.

The full path is exposed at POST /api/reason (see docs/API_REFERENCE.md) and exercised end-to-end by pnpm smoke:post-llm.
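The realizer's perplexity-ranking step can be illustrated with a toy model. SCCE's ranker uses its Kneser-Ney n-gram model; the add-one-smoothed bigram below is a simplified stand-in to show the mechanics only:

```typescript
// Toy bigram model + perplexity ranking: among candidate realizations,
// keep the one the model finds least surprising. Add-one smoothing is a
// stand-in for the real Kneser-Ney estimator.

function trainBigrams(corpus: string[]) {
  const big = new Map<string, number>();
  const uni = new Map<string, number>();
  const vocab = new Set<string>();
  for (const sent of corpus) {
    const toks = ["<s>", ...sent.toLowerCase().split(/\s+/), "</s>"];
    toks.forEach((t) => vocab.add(t));
    for (let i = 0; i + 1 < toks.length; i++) {
      uni.set(toks[i], (uni.get(toks[i]) ?? 0) + 1);
      const key = `${toks[i]} ${toks[i + 1]}`;
      big.set(key, (big.get(key) ?? 0) + 1);
    }
  }
  return { big, uni, vocab };
}

function perplexity(model: ReturnType<typeof trainBigrams>, sent: string): number {
  const toks = ["<s>", ...sent.toLowerCase().split(/\s+/), "</s>"];
  let logp = 0;
  for (let i = 0; i + 1 < toks.length; i++) {
    const num = (model.big.get(`${toks[i]} ${toks[i + 1]}`) ?? 0) + 1; // add-one
    const den = (model.uni.get(toks[i]) ?? 0) + model.vocab.size;
    logp += Math.log(num / den);
  }
  return Math.exp(-logp / (toks.length - 1)); // geometric mean inverse prob.
}

// Lowest perplexity wins the ranking.
function bestCandidate(model: ReturnType<typeof trainBigrams>, candidates: string[]): string {
  return candidates.slice().sort((a, b) => perplexity(model, a) - perplexity(model, b))[0];
}
```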

End-to-End Pipeline

At a high level:

  1. Ingest files into documents/spans/chunks.
  2. Correlate entities and relations.
  3. Build and refresh spectral basis/projections.
  4. Train and load local n-gram models.
  5. Resolve queries through perception, retrieval, planning, verification, and synthesis.
  6. Return response text plus source-linked context.

This is implemented as a stable server runtime with background jobs and API visibility for each operational phase.

Operational Posture

SCCE is structured for real operations, not just demos. Hardening is opt-in via environment, and unsafe defaults fail closed in production.

  • Stateful service with explicit DB + model dependencies
  • Startup migration safety and controlled shutdown persistence
  • Async chat mode with SSE streaming and status events
  • Job queue control for indexing/training/spectral refresh
  • Operational endpoints for status, topology, activity, and audit export
  • Runbook coverage for backups, restore, incidents, and handoff

See full operating details in docs/OPERATIONS.md and docs/PRODUCTION_HANDOFF.md.

Architecture at a Glance

  • apps/server: Fastify API, startup/shutdown lifecycle, routes, worker orchestration
  • apps/web: React UI for chat, vault, training, artifacts, and system monitoring
  • packages/core: ingestion, correlation, retrieval, planner, synthesis, spectral logic
  • packages/db: PostgreSQL access and migration layer
  • packages/types: shared TypeScript types and contracts
  • packages/compute: parallel pipeline and compute dispatch utilities
  • packages/security: policy and audit support
  • packages/plugins: renderer and webapp template infrastructure
  • packages/sketches: probabilistic structures used by supporting workflows
  • data: local models, uploads, corpora, artifacts, and runtime state

Prerequisites

  • Node.js >= 20
  • pnpm >= 8 (via corepack)
  • PostgreSQL >= 14

Quick Start (Local Development)

  1. Install dependencies.
corepack enable
pnpm install
  2. Configure environment. For development, auth is bypassed when NODE_ENV=development (or SCCE_DEV_MODE=1):
export SCCE_DB_URL="postgres://scce_app:scce_app@localhost:5432/scce"
export NODE_ENV=development
  3. Build all packages.
pnpm -r build
  4. Start the server and the web app in separate terminals.
pnpm dev:server
pnpm dev:web
  5. Verify runtime health.
curl http://127.0.0.1:3000/health
curl http://127.0.0.1:3000/api/system/status

Fast Local Bootstrap

For a full local bootstrap (DB path, demo seeding, ingest, training triggers, and validation request):

pnpm tsx scripts/setup-complete-system.ts

First API Interaction

Synchronous chat (no attachments):

curl -X POST http://127.0.0.1:3000/api/chat \
  -H "Content-Type: application/json" \
  -d '{"message":"What is in the vault?","conversationId":null,"attachments":[]}'

Asynchronous chat pattern (attachments -> SSE):

  1. POST /api/chat with attachments.
  2. Read conversationId from response.
  3. Stream events from GET /api/events/:conversationId.

See detailed contracts and payload shapes in docs/API_REFERENCE.md.
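Client-side, the streaming half of that pattern boils down to parsing the SSE wire format. The event names SCCE emits are defined in docs/API_REFERENCE.md, not assumed here; the parser below handles only the generic `event:`/`data:` framing:

```typescript
// Minimal SSE frame parser: blocks are separated by a blank line; each
// block carries an optional `event:` name and one or more `data:` lines.

interface SseEvent { event: string; data: string }

function parseSse(raw: string): SseEvent[] {
  const events: SseEvent[] = [];
  for (const block of raw.split("\n\n")) {
    let event = "message"; // SSE default event name
    const data: string[] = [];
    for (const line of block.split("\n")) {
      if (line.startsWith("event:")) event = line.slice(6).trim();
      else if (line.startsWith("data:")) data.push(line.slice(5).trim());
    }
    if (data.length) events.push({ event, data: data.join("\n") });
  }
  return events;
}

// Usage against the endpoints above (not executed here):
//   const res = await fetch("http://127.0.0.1:3000/api/chat", { method: "POST", ... });
//   const { conversationId } = await res.json();
//   // then stream GET /api/events/<conversationId> and feed chunks to parseSse
```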

Core Scripts

  • pnpm db-setup: create/apply database schema
  • pnpm smoke-test: validate key runtime paths
  • pnpm seed: seed demo corpus
  • pnpm status: report current system status
  • pnpm ingest:wiki: run wiki ingestion/training pipeline
  • pnpm brain:export / pnpm brain:import: pack / unpack the live brain as a portable .brain bundle
  • pnpm brain:keygen: generate an Ed25519 keypair for signed bundles
  • pnpm brain:federate <dir-or-file...>: load N .brain bundles into one live brain (50 GB ship path)
  • pnpm env:scan <root>: ingest a project directory (source + manifests + docs) into the concept graph
  • pnpm smoke:post-llm: end-to-end reasoner → realizer → self-eval smoke test
  • pnpm quality:check: headers + architecture checks
  • pnpm eval: run the gold-set QA evaluation harness
  • pnpm eval:strict: same, but exit non-zero if quality floors are not met
  • pnpm quality:deep: quality checks + hostile audit suite + strict eval

Evaluation

SCCE ships a gold-set runner at scripts/eval-qa.ts. It exercises the live /api/chat endpoint against a curated set of questions and computes:

  • retrieval_recall@k — fraction of gold documents present in top-k retrieved spans
  • provenance_precision — fraction of cited spans whose source actually appears in the answer
  • provenance_coverage — fraction of answer sentences with at least one supporting citation
  • answer_keyword_recall — fraction of expected keywords present in the answer
  • latency_ms_p50 / latency_ms_p95 — end-to-end answer latency
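Two of these metrics are simple enough to show inline. The sketches below are illustrative — the harness's actual tokenization and citation-marker conventions live in scripts/eval-qa.ts and may differ (the inline `[n]` marker in particular is an assumption):

```typescript
// answer_keyword_recall: fraction of expected keywords present in the
// answer (case-insensitive substring match, an assumption here).
function keywordRecall(answer: string, expected: string[]): number {
  if (expected.length === 0) return 1;
  const hay = answer.toLowerCase();
  const hits = expected.filter((k) => hay.includes(k.toLowerCase())).length;
  return hits / expected.length;
}

// provenance_coverage: fraction of answer sentences carrying at least
// one citation, assuming citations appear inline as [n] markers.
function provenanceCoverage(answer: string): number {
  const sentences = answer.split(/(?<=[.!?])\s+/).filter((s) => s.trim());
  if (sentences.length === 0) return 0;
  const cited = sentences.filter((s) => /\[\d+\]/.test(s)).length;
  return cited / sentences.length;
}
```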

Gold sets live at data/eval/gold.json. A placeholder is auto-seeded on first run. Each entry has the shape:

{
  "id": "q1",
  "question": "What does SCCE use for retrieval?",
  "expected_doc_paths": ["docs/ARCHITECTURE.md"],
  "expected_keywords": ["BM25", "spectral"],
  "min_provenance": 1
}

Run:

pnpm eval            # writes a timestamped JSON report under data/eval/reports/
pnpm eval:strict     # additionally fails the process if quality floors are not met

Strict-mode floors (configurable in the script):

  • answer_present_rate >= 0.8
  • provenance_precision_avg >= 0.7
  • provenance_coverage_avg >= 0.5

Security and Trust Model

SCCE fails closed in production and is configured entirely through environment variables. There are no hard-coded credentials and no implicit allow-all behavior outside development.

Authentication

  • SCCE_API_KEYS — comma-separated list of accepted API keys. Required in production. Requests must send one of these as Authorization: Bearer <key> or x-api-key: <key>.
  • SCCE_DEV_MODE=1 — explicit dev/test bypass. Auth is skipped. Never set this in production.
  • NODE_ENV — when development or test, auth is bypassed automatically. Production requires NODE_ENV=production and a populated SCCE_API_KEYS, or the server refuses to start.
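The rules above reduce to a small pure check. This is a sketch of the documented behavior, not SCCE's actual middleware — the env names match the text, while the function shape and header handling are illustrative:

```typescript
// Sketch of the documented auth rule: dev/test bypass, otherwise a key
// must match via Authorization: Bearer or x-api-key, failing closed.

function isAuthorized(
  headers: Record<string, string | undefined>,
  env: { NODE_ENV?: string; SCCE_DEV_MODE?: string; SCCE_API_KEYS?: string },
): boolean {
  // Explicit dev bypass, exactly as documented. Never in production.
  if (env.SCCE_DEV_MODE === "1") return true;
  if (env.NODE_ENV === "development" || env.NODE_ENV === "test") return true;

  const keys = new Set(
    (env.SCCE_API_KEYS ?? "").split(",").map((k) => k.trim()).filter(Boolean),
  );
  if (keys.size === 0) return false; // fail closed: no keys configured

  const bearer = headers["authorization"]?.replace(/^Bearer\s+/i, "");
  const apiKey = headers["x-api-key"];
  return (bearer !== undefined && keys.has(bearer)) ||
         (apiKey !== undefined && keys.has(apiKey));
}
```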

CORS

  • SCCE_CORS_ORIGINS — comma-separated list of exact-match allowed origins (e.g. https://app.example.com,https://admin.example.com).
  • In dev/test, localhost origins on any port are allowed automatically.
  • The null origin is always rejected.
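Those three rules fit in one predicate. A minimal sketch assuming the behavior described above (SCCE's real CORS hook may differ in detail):

```typescript
// Exact-match allow-list; localhost on any port auto-allowed in dev/test;
// the `null` origin always rejected.

function originAllowed(
  origin: string | undefined,
  allowList: string[],
  devMode: boolean,
): boolean {
  if (!origin || origin === "null") return false; // null origin: always rejected
  if (allowList.includes(origin)) return true;    // exact match only, no wildcards
  if (devMode && /^https?:\/\/(localhost|127\.0\.0\.1)(:\d+)?$/.test(origin)) {
    return true; // dev/test convenience: any localhost port
  }
  return false;
}
```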

Self-training

  • SCCE_ALLOW_SELF_TRAIN=1 — opt-in to letting the synthesizer learn from its own answers. Off by default to avoid model collapse. Per-call overrides exist for explicit user feedback (positive feedback only) and for force-trained corpus material.

Database

  • SCCE_DB_URL — PostgreSQL connection string. Required.
  • SCCE_DB_STATEMENT_TIMEOUT_MS — optional per-statement timeout (default applied at pool init).

Other operational guarantees

  • Upload/ingest paths are validated before filesystem operations.
  • Duplicate controls reduce accidental corpus bloat and replay noise.
  • Provenance verification is content-aware: cited spans are resolved against the underlying chunk text and rejected if there is no token-level overlap with the cited sentence.
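A minimal sketch of that token-level overlap check — the tokenizer and the overlap threshold are illustrative choices for this example, not SCCE's exact rule:

```typescript
// A citation is accepted only if the cited span's text shares enough
// tokens with the sentence citing it. Threshold is an assumption.

function tokens(s: string): Set<string> {
  return new Set(s.toLowerCase().match(/[a-z0-9]+/g) ?? []);
}

function citationSupported(sentence: string, spanText: string, minOverlap = 0.2): boolean {
  const a = tokens(sentence);
  const b = tokens(spanText);
  if (a.size === 0 || b.size === 0) return false;
  let shared = 0;
  for (const t of a) if (b.has(t)) shared++;
  // Fraction of the sentence's tokens attested in the cited span.
  return shared / a.size >= minOverlap;
}
```

Because the check resolves the span's actual chunk text, a citation to the right document but the wrong passage still fails.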

Operating SCCE in Production

Operational priorities:

  1. keep DB and model backups current
  2. monitor chat error and timeout rates
  3. watch training/job queue health
  4. track ingestion growth and duplicate trends
  5. validate release upgrades against migration path

Use docs/OPERATIONS.md and docs/PRODUCTION_HANDOFF.md as your source of truth.

Contributing and Engineering Standards

SCCE expects disciplined, auditable changes.

  • keep changes scoped and reversible
  • preserve API contracts or document intentional changes
  • keep SQL parameterized and input validation explicit
  • update docs alongside behavior changes
  • validate with build/smoke/quality scripts before merge

Contributor workflow references and the full documentation index live under docs/.

License

Proprietary. See LICENSE for terms.
