Skip to content

harness(phase-13): per-agent telemetry + agentic search foundation#462

Merged
korutx merged 41 commits into
trunkfrom
feat/harness-phase-13-telemetry-agentic
May 20, 2026
Merged

harness(phase-13): per-agent telemetry + agentic search foundation#462
korutx merged 41 commits into
trunkfrom
feat/harness-phase-13-telemetry-agentic

Conversation

@odtorres

@odtorres odtorres commented May 12, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Phase A (telemetry foundation): OTel GenAI semconv emitters (nexo.run root + nexo.<agent> per-LLM-call spans) wired into every LangGraph node; Traceloop auto-instruments Anthropic/OpenAI SDKs; FastAPIInstrumentor adds HTTP-request spans; versioned pricing table converts token usage to gen_ai.usage.cost_usd at run time.
  • Phase B (backend deployment): infra/observability/ self-hosted Langfuse v3 stack (postgres + clickhouse + redis + minio + langfuse-web + langfuse-worker + otel-collector) with deterministic init keys, bootstrap.sh, and a README covering bring-up, backup, backend-swap.
  • Phase E (agentic search): LLM intent router + explicit mode field on RunRequest (auto/canned/meta/agentic); composable DB primitives (nexo_describe/nexo_select/nexo_grep/nexo_explain) with a multi-layer sqlglot safety gate (AST mutation rejection, LATERAL rejection, positive function allowlist, table allowlist ^nexo\.dx_*$, EXPLAIN cost gate); meta-question agent (no SQL); conversational memory (InMemoryConversationStore + ConversationStore Protocol); new agentic LangGraph; route-aware tenancy_gate; provenance JSONL log + curate CLI; cost-report CLI.
  • Supervisor integration: HarnessSupervisor.run() resolves the route via LLMIntentRouter (with keyword fallback below threshold), dispatches to nexo_graph / agentic_graph / meta_agent_node / storytelling, and round-trips conversation_id through the store. The FastAPI lifespan injects all Phase-E modules on Nexo boot.

What's NOT in this PR (follow-up)

  • F4 — week of agentic traffic: time-gated by definition; ships as a separate retrospective PR after a week of real use.
  • C2 dashboards: 6 panels (per-agent cost, cache hit ratio, per-tenant cost, per-mode cost split, latency per agent, tool execution success rate). Stack is up; the dashboards are operator-curated in the Langfuse UI then exported as JSON. Separate PR.
  • C4 alerting: needs the cost-CLI's live Langfuse fetcher (currently fixture-mode-only) plus 24 h of traffic for thresholds.
  • Agentic graph executor: the topology (tenancy_gate / planner / synthesizer / critic / summarizer) is wired, but planner and synthesizer are stubs. Connecting them to the composable primitives + provenance log is a contained follow-up; tests and the safety gate already cover the primitive surface.

Known limitations / caveats

  • Cost-report CLI --by agent is a trace-level approximation. Each trace can carry multiple agent:<name> tags (one per agent that fired during the run), but _project_langfuse_trace projects the first one. For per-tenant and per-mode rollups (single-valued tags) this is correct; for per-agent, dollars get attributed to whichever agent tag came first in the array. True per-agent attribution requires fetching the observations endpoint — ships in the stacked follow-up PR harness(phase-13): per-tenant billing surface (stacked on #462) #470.
  • NEXO_META inner LLM call lacks observation-level tenant attribution. The root nexo.run trace carries langfuse.tags = [tenant:<id>, …] (E10), but the inner anthropic.chat observation Traceloop auto-emits inside _run_nexo_meta does not — meta_model is invoked raw. Trace-level filters work; observation-level rollups under-count meta runs. Fixed in PR harness(phase-13): per-tenant billing surface (stacked on #462) #470 by wrapping meta_model via instrument_model(self.meta_model, "meta_agent", ctx).

Verification artifacts

Test suite: 295 passed, 1 skipped, 0 failed (uv run pytest from miot-harness/). Plan 12's 151-test baseline preserved; 144 new tests across tests/{observability,runtime,agents,integrations}/.

Live telemetry (Langfuse @ http://localhost:3000): verified via direct ClickHouse query after F2/F3 curl battery:

  • 11 nexo.run root spans
  • 6 nexo.synthesizer + 6 nexo.filter_expert per-agent spans
  • 22 anthropic.chat (Traceloop-auto-instrumented) child spans
  • Per-mode trace split: 5 auto / 3 canned / 2 meta / 1 agentic
  • Every span carries modular.{mode, run_id, tenant_id} in its metadata

F2 — canned-mode (mode=\"canned\"): plan 12's G6/G7/G8 still produce Markdown answers from the real Coordinador snapshot.

F3 — mode bypass + auto-router:

  • mode=\"meta\" for tenant_id=mintral → primer+catalog answer, no DB.
  • mode=\"meta\" for tenant_id=non-mintral-customer → allowed (meta is non-confidential).
  • mode=\"agentic\" for tenant_id=non-mintral-customerModeAccessDenied at request-validation; 200 response with explicit refusal answer.
  • mode=\"agentic\" for tenant_id=mintral → fires through the stub planner+synthesizer; emits the right spans.
  • AUTO route on "dimensionamiento para mañana" → LLMIntentRouter picks NEXO_QUERY at confidence 0.75 → answered from the canned graph.

E3 composable-primitives safety gate (the high-risk surface, reviewed twice):

  • 47 unit tests including 12 deliberate-bypass attempts (LATERAL function-in-FROM, side-effect functions pg_terminate_backend / pg_sleep / set_config / lo_import / pg_reload_conf / pg_advisory_lock / pg_read_file / dblink, multi-statement, CTE with mutation, MERGE, COPY, non-allowlisted schemas).
  • Positive function allowlist (~80 entries) is the only durable defense against side-effect functions that aren't INSERT/UPDATE/DELETE AST nodes but cause writes, kills, or DoS.

Test plan

  • uv run pytest -q from miot-harness/ → 295 passed
  • cd infra/observability && docker compose up -d && ./bootstrap.sh → 7 services healthy, .env paste-block printed
  • MIOT_HARNESS_OTEL_ENABLED=true uv run uvicorn miot_harness.api.server:create_app --factory/health returns nexo.enabled=true + curated tool list
  • POST /runs with mode=canned (×3 scenarios) → 200, real Coordinador data, spans in Langfuse
  • POST /runs with mode=meta (×2 tenants) → 200, primer+catalog answer, no SQL
  • POST /runs with mode=agentic (Mintral + non-Mintral) → 200/refusal as appropriate
  • POST /runs with implicit mode=auto → LLMIntentRouter classifies correctly
  • Trace metadata in ClickHouse carries modular.mode/modular.run_id/modular.tenant_id for grouping

Code review checkpoints

Two superpowers:requesting-code-review cycles run during the plan:

  • After Phase A: 5 Important issues found, all addressed (.env.example drift, conftest filter scope, configure_tracing no-op-on-second-call, spans.py docstring, A6 propagation test docstring) — commit 21037769.
  • After Phase E3 (the high-risk safety gate): 2 Critical bypasses found and closed (LATERAL function-in-FROM, side-effect functions in SELECT list) — commit 684fbd09.

Plan ref

.cursor/plans/ai-first/13-post-nexo-roadmap.md — Phases A, B, C (partial), D (partial), E, F1-F3.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Multi-mode request handling: auto-routing, canned responses, meta-agent answers, and agentic workflows
    • Conversation history with automatic turn summarization for multi-turn interactions
    • LLM-based intent router with confidence thresholds for intelligent request classification
    • Cost tracking and pricing reports for LLM usage across models
    • Database query safety gates and cost enforcement to prevent expensive operations
  • Infrastructure

    • Local observability stack with Langfuse + OpenTelemetry for development telemetry
    • Docker Compose setup for running observability services locally
    • Build attestations (SLSA provenance, SBOM) in CI/CD pipeline
  • Configuration

    • New environment variables for OpenTelemetry and Langfuse integration
    • Intent router confidence thresholds and cost limits for query operations

Review Change Stack

odtorres added 22 commits May 12, 2026 13:21
- Add `opentelemetry-{api,sdk,exporter-otlp}`,
  `opentelemetry-instrumentation-fastapi`, `traceloop-sdk` to runtime deps
  and `pytest-opentelemetry` to dev deps.
- `uv sync` resolves 163 packages cleanly with no version conflicts against
  the existing langchain/langgraph/fastapi/pydantic pins.
- New smoke tests under `tests/observability/test_deps.py` exercise the
  public import surface each Phase-A task depends on (TracerProvider,
  OTLP gRPC trace exporter, FastAPI instrumentation, traceloop.sdk,
  pytest_opentelemetry).
- Seed `.ralph/` state, log, blockers, and RALPH_PROMPT.md so iteration
  progress is tracked in-tree.

Plan ref: .cursor/plans/ai-first/13-post-nexo-roadmap.md (Phase A · A1)
Add `miot_harness.observability` with four units that the rest of Phase A
plugs into:

- `pricing.py` — versioned `ModelPricing` table (Opus 4.7, Sonnet 4.6,
  Haiku 4.5, gpt-4o-mini) keyed by model id with `effective_date` so
  historical reads stay stable; Anthropic-style `TokenUsage` buckets;
  `compute_cost` returns a `Decimal` quantized to 6 decimals; raises
  `UnknownModelError` for unmapped models.

- `otel.py` — `configure_tracing(enabled, service_name, endpoint,
  environment)` builds a `TracerProvider` with a Resource carrying
  `service.name` / `service.namespace=modular` /
  `deployment.environment` and a `BatchSpanProcessor` wrapping
  `OTLPSpanExporter`. Returns `None` when disabled so the FastAPI
  lifespan (A4) can no-op without branches. `shutdown_tracing` is
  None-safe.

- `spans.py` — `agent_span(name, *, run_id, tenant_id?, mode?)` context
  manager that opens a `nexo.<name>` span carrying
  `gen_ai.operation.name`, `modular.run_id`, `modular.tenant_id`, and
  `modular.mode` (last two omitted when unset) and nests under the
  current context so traces form a tree.

- `callbacks.py` — `NexoTelemetryCallback(BaseCallbackHandler)` tracking
  per-`run_id` span handles in a dict (so concurrent LLM calls under
  the same node do not cross-talk). Maps LangChain's total-style
  `usage_metadata.input_tokens` back into Anthropic buckets, infers
  provider from `serialized.id`, and emits
  `gen_ai.{system, request.model, usage.input_tokens, usage.output_tokens,
  usage.cache_read.input_tokens, usage.cache_creation.input_tokens,
  usage.cost_usd}` plus `modular.{agent, run_id, tenant_id, mode}` on
  `on_llm_end`. `on_llm_error` records the exception and sets
  `Status(ERROR)`.

Tests under `tests/observability/` attach an `InMemorySpanExporter` to
the `TracerProvider` already installed by pytest-opentelemetry (OTel's
global provider is install-once) and filter to `nexo.*` spans so the
plugin's own per-test spans don't pollute assertions. 17 new tests cover
pricing math, span context, configure_tracing return contract, and the
callback's full span-emission surface incl. concurrent runs and provider
inference. Full suite: 175 passed, 1 skipped.

Plan ref: .cursor/plans/ai-first/13-post-nexo-roadmap.md (Phase A · A2)
Per-agent attribution from day one:

- `runtime/nexo_graph.py`: new `_instrument(model, agent_name, ctx)` helper
  that returns `model.with_config(callbacks=[NexoTelemetryCallback(...)])`,
  threading `ctx.run_id` and `ctx.tenant_id` through every LLM-bearing seat
  (filter_expert, domain_analyst, synthesizer, critic, summarizer). Tool-only
  seats (data_fetcher, freshness_judge) stay model-less and produce no
  callback spans.

- `runtime/supervisor.py`: wrap `nexo_graph.ainvoke(...)` in
  `agent_span("run", run_id, tenant_id)` so a `nexo.run` root span lives
  over the whole invocation. Combined with the callback's per-agent spans,
  Langfuse can render a complete tree per run; the `modular.run_id` attr
  on every span lets us regroup even when LangGraph's parallel branches
  break OTel context propagation (the gotcha A6 covers).

Tests: `tests/observability/test_graph_wiring.py` builds the real
`build_nexo_graph(...)` with scripted `FakeListChatModel`s and asserts the
graph emits `nexo.<agent>` spans carrying `modular.{agent, run_id,
tenant_id}` for every seat that fired, plus a `nexo.run` root with the
right `gen_ai.operation.name`/`modular.run_id`/`modular.tenant_id`. Full
suite: 177 passed, 1 skipped.

Plan ref: .cursor/plans/ai-first/13-post-nexo-roadmap.md (Phase A · A3)
Pull A5 ahead of A4 so the FastAPI lifespan (A4) has typed settings to
read instead of falling back to raw env reads.

New `HarnessSettings` fields, all under the existing
`env_prefix="MIOT_HARNESS_"`:

- `otel_enabled: bool = False` — telemetry is off by default; the
  lifespan no-ops on configure_tracing when this is unset.
- `otel_endpoint: str = "http://localhost:4317"` — OTLP gRPC target.
- `otel_service_name: str = "miot-harness"` — resource attr.
- `otel_environment: str = "local"` — resource attr (`deployment.environment`).
- `langfuse_public_key: str | None = None`
- `langfuse_secret_key: str | None = None`

Tests: 4 new in `test_config.py` covering safe defaults + env overrides
for both groups. Full suite: 181 passed, 1 skipped.

Plan ref: .cursor/plans/ai-first/13-post-nexo-roadmap.md (Phase A · A5)
Wire telemetry boot into the FastAPI lifespan:

- Before Nexo boot, call `configure_tracing(enabled=settings.otel_enabled,
  service_name, endpoint, environment)`. When enabled, follow with
  `Traceloop.init(app_name, api_endpoint, disable_batch=False,
  telemetry_enabled=False)` so Anthropic/OpenAI SDK calls auto-emit
  `gen_ai.*` child spans against the same global TracerProvider that
  `NexoTelemetryCallback` per-agent spans flow through. Self-hosted —
  `telemetry_enabled=False` blocks Traceloop's phone-home pings.

- `app.state.tracer_provider` exposes the provider for introspection.

- `shutdown_tracing` runs in BOTH `finally` branches of the lifespan
  (the early Nexo-disabled return AND the regular pool-close path) so
  the batch exporter always flushes on shutdown.

Tests: `tests/observability/test_lifespan_init.py` patches
`configure_tracing`/`Traceloop`/`shutdown_tracing` and asserts:

- Off-path: configure_tracing called once with `enabled=False`,
  `Traceloop.init` NOT called.
- On-path: configure_tracing forwards all four settings,
  `Traceloop.init` called with `app_name=service_name`,
  `shutdown_tracing` called with the returned provider on teardown.

Full suite: 183 passed, 1 skipped.

Plan ref: .cursor/plans/ai-first/13-post-nexo-roadmap.md (Phase A · A4)
Phase A's known-risk test: when LangGraph fans out into parallel
branches, OTel parent-context propagation can break — child spans
land under the wrong parent and usage attribution gets mixed.

`tests/observability/test_propagation.py` builds a 3-node synthetic
graph `planner → (branch_a ∥ branch_b) → joiner` where each parallel
branch instantiates its own `NexoTelemetryCallback`. After the run we
assert every `nexo.*` span carries the right `modular.{agent, run_id,
tenant_id}` — the explicit attrs Langfuse uses to regroup spans when
the OTel context-propagation tree is unreliable.

Second test: 20 concurrent callbacks via `asyncio.gather` confirm the
per-`run_id` span-dict inside `NexoTelemetryCallback` doesn't cross-talk
under realistic concurrency (multiple LangGraph parallel branches).

The two other A6 test files (`test_callbacks.py` for span emission,
`test_pricing.py` for cost math) were written under A2 as part of TDD
on the source — A6 adds only the propagation piece.

Phase A is now complete; Phase B (backend deployment) is blocked on
operator preconditions per `.ralph/blockers.md`. Full suite: 185 passed,
1 skipped.

Plan ref: .cursor/plans/ai-first/13-post-nexo-roadmap.md (Phase A · A6)
Reviewer flagged 0 Critical / 5 Important issues during the Phase A
checkpoint. All five fixed plus one recommendation:

1. `.env.example` was missing the new OTel/Langfuse keys, violating the
   file's "every config.py setting documented here" invariant. Added
   commented entries for MIOT_HARNESS_OTEL_{ENABLED,ENDPOINT,
   SERVICE_NAME,ENVIRONMENT} and the LANGFUSE keys.

2. `tests/observability/conftest.py` filtered to `nexo.*` only — Traceloop
   auto-instrumentation spans (`gen_ai.*` / `anthropic.*` / `openai.*`)
   would have been silently dropped in Phase B tests. Broadened the
   filter and renamed `_NexoSpanFilter` → `_HarnessSpanExporter`.

3. `observability/otel.py` silently no-op'd on re-init: a second
   `configure_tracing(enabled=True)` call with a provider already
   installed returned a NEW SDK provider whose spans went nowhere
   (`set_tracer_provider` is install-once). Now we detect the existing
   `TracerProvider`, log a warning, and return that one instead. Test
   rewritten with `patch` on `trace.get_tracer_provider`/`set_tracer_provider`
   to exercise both branches without polluting global state.

4. `observability/spans.py` docstring claimed nodes wrap themselves in
   `agent_span`, but no node body does that. The callback layer emits
   `nexo.<agent>` for LLM calls; `agent_span` is currently only used
   for the root `nexo.run` in `supervisor.py`. Docstring corrected.

5. A6 propagation test (`test_propagation.py`) used `FakeListChatModel`
   without any real `await` suspension — branches ran to completion
   inside their own tasks and never genuinely interleaved. Added
   `await asyncio.sleep(0)` around each `ainvoke` so the event loop
   yields between branches, and tightened the docstring to scope what
   the test does and doesn't prove (structural attr defense vs.
   OTel-context propagation).

Plus reviewer Recommendation 4: wired
`FastAPIInstrumentor.instrument_app(app, tracer_provider=...)` in the
lifespan so HTTP request spans parent `nexo.run` and the full
request → graph → LLM tree is visible in Langfuse.

Full suite: 186 passed, 1 skipped.
Skip ahead of blocked B/C/D (operator-precondition gated per
`.ralph/blockers.md`) and ship the offline-doable agentic-search
foundation.

`runtime/intent_router.py` — new `LLMIntentRouter`:
  - Async `route(message)` that asks the LLM to classify into one of
    NEXO_QUERY | NEXO_META | NEXO_AGENTIC | STORYTELLING_RUN | DIRECT |
    OTHER with calibrated confidence.
  - Tolerant JSON parsing: accepts bare JSON or a ```json``` fenced
    block (real LLMs often wrap). Unknown route names or malformed JSON
    fall back to the keyword router.
  - Confidence threshold (default 0.7): when LLM is uncertain, the
    deterministic keyword router decides — never silently misroute.

`runtime/mode_resolver.py` — new `resolve_mode`:
  - `mode="auto"` → LLM router decides.
  - `canned` → `NEXO_QUERY`, no LLM call.
  - `meta` → `NEXO_META`, allowed for any tenant (meta-info is
    non-confidential per the plan's tenant-gate decision).
  - `agentic` → `NEXO_AGENTIC`, but only when `request.tenant_id ==
    tenant_lock`; otherwise raises `ModeAccessDenied` BEFORE any LLM
    or graph call — request rejected at validation time.

`runtime/router.py` — extended `HarnessRoute` enum with NEXO_META,
NEXO_AGENTIC, OTHER. Existing keyword router unchanged; it now serves
as the deterministic fallback in LLM uncertain / failure paths.

`runtime/context.py` — `UserRequest.mode: RunMode = "auto"`
(`Literal["auto","canned","meta","agentic"]`). Pydantic rejects
unknown values (HTTP 422 in FastAPI).

`config.py` — `intent_router_model` (default Haiku tier),
`intent_router_confidence_threshold` (default 0.7).

Tests: 11 new under `tests/runtime/`:
  - `test_intent_router.py`: 30-prompt confusion matrix asserts ≥90%
    accuracy; plus low-conf fallback, unparseable response, fenced
    JSON, and invalid-route-name paths.
  - `test_mode_resolver.py`: all four mode-bypass paths (canned, meta,
    agentic-mintral, agentic-rejected), the auto delegation path, and
    pydantic validation rejecting `mode="bogus"`.

`.ralph/state.md`: note the out-of-order execution — E1 done before
B-D since those are operator-blocked.

Full suite: 197 passed, 1 skipped.

Plan ref: .cursor/plans/ai-first/13-post-nexo-roadmap.md (Phase E · E1)
New `agents/meta_agent.py` (Haiku tier) answers schema / primer /
introspection questions WITHOUT touching the DB:

- `MetaAgentCatalogEntry` — narrow projection of `FunctionDescriptor`
  the agent needs (name, layer, title, body). Callers feed either the
  live introspection cache (production) or a hand-crafted list (tests).

- `meta_agent_node(state, *, model, primer, catalog)` — composes a
  system prompt from `primer + formatted catalog`, asks the LLM, and
  returns a `{"answer": str}` delta. The function signature
  **deliberately omits** any `registry` / `tools` / `pool` parameter:
  there's nothing to invoke and nothing to SQL against. The "no SQL"
  guarantee is structural, not aspirational.

Tests (`tests/agents/test_meta_agent.py`):
- Agent answers reference catalog functions by name.
- Primer text reaches the system prompt (so es_critico explanations
  stay grounded).
- Every catalog entry's title + body appears in the system prompt.
- Signature-level test asserts no DB/tool parameter ever creeps in
  via a future refactor.

Full suite: 201 passed, 1 skipped.

Plan ref: .cursor/plans/ai-first/13-post-nexo-roadmap.md (Phase E · E2)
THE high-risk surface of plan 13: the agent gains four read-only DB
primitives. The safety gate has to be airtight before they go live.

Deps:
- Added `sqlglot>=25.0` (resolves to 30.7.0). Missed in A1; the plan
  calls for it.

Config:
- `nexo_explain_cost_threshold: float = 10000.0` — PostgreSQL plan-cost
  units. Above this, `nexo_explain` refuses the query.

`integrations/nexo/primitives.py`:

Safety gate `validate_select_sql(sql)`:
- Parses with sqlglot's postgres dialect.
- Rejects bare-or-trailing `;`, multi-statement input, or empty parses
  (`MultiStatementRejected`).
- Walks the AST for mutation nodes (Insert/Update/Delete/Drop/Create/
  Alter/AlterColumn/Grant/TruncateTable) — even inside CTEs
  (`MutationRejected`).
- Requires the top-level construct to be SELECT/Subquery/Union
  (`UnsupportedConstruct`).
- Asserts every `exp.Table` matches `^nexo\.dx_[a-zA-Z0-9_]+$` —
  `pg_catalog`, `information_schema`, `public.*`, even
  `nexo.private_*` are rejected (`AllowlistViolation`).
- Returns the validated AST so callers can re-serialize from it.

Primitives:
- `nexo_describe(table)` — narrow allowlist; queries
  `information_schema.columns` with parameterised schema/name. Bypasses
  the SELECT gate (own allowlist).
- `nexo_select(table, where?, order?, limit?, columns?)` — default
  LIMIT 100, hard cap 5000 (silent clamp). Identifiers quoted; SQL
  passes through `validate_select_sql` before execution.
- `nexo_grep(table, column, pattern, limit?)` — sugar over
  `WHERE col ILIKE $1`. Pattern parameterised; column identifier
  validated against `[a-zA-Z_][a-zA-Z0-9_]*`.
- `nexo_explain(query, cost_threshold)` — runs EXPLAIN (FORMAT JSON);
  raises `CostGateViolation` if total cost exceeds threshold.

Tests (`tests/integrations/nexo/test_primitives.py`): 27 new.

Safety gate (no DB):
- 3 multi-statement rejections (bare `;`, `; INSERT`, `SELECT; SELECT`).
- 8 mutation rejections (INSERT/UPDATE/DELETE/DROP/TRUNCATE/CREATE/
  ALTER/GRANT).
- 5 non-allowlisted rejections (pg_catalog, information_schema,
  pg_class, public.users, nexo.private_table).
- 3 allowlist-accepted SELECTs.
- 1 CTE-with-INSERT rejection.

Functional (mocked pool that mimics asyncpg's real `acquire()` shape —
sync method returning an async context manager):
- `nexo_describe` returns parsed rows + rejects non-allowlisted target.
- `nexo_select` clamps LIMIT to 5000; default is 100.
- `nexo_grep` builds the right ILIKE WHERE with a quoted column.
- `nexo_explain` refuses above threshold; accepts below; returns
  `{total_cost, node_type, plan}`.

Full suite: 228 passed, 1 skipped. Loop-prompt gate met: "Move past E3
without safety tests green" is satisfied.

Plan ref: .cursor/plans/ai-first/13-post-nexo-roadmap.md (Phase E · E3)
Code review of E3 verified two Critical bypass paths through the
composable-primitives safety gate. Both fixed before E4 ships.

C1 — `LATERAL <function>` bypass.
  `SELECT * FROM nexo.dx_servicios, LATERAL pg_read_file('/etc/passwd')`
  was accepted by the gate because `LATERAL` holds an `exp.Anonymous`,
  not an `exp.Table` — the allowlist iterator never saw it. Fix:
  `validate_select_sql` walks `exp.Lateral` nodes and rejects with
  `UnsupportedConstruct`. Agentic queries don't need LATERAL.

C2 — side-effect functions in SELECT list.
  `pg_terminate_backend`, `pg_cancel_backend`, `set_config`, `pg_sleep`,
  `lo_import`, `pg_reload_conf`, `pg_advisory_lock`, `pg_read_file`,
  `dblink` are all SELECT-shaped per the AST (no Insert/Update/Delete
  node) but cause DoS, session-state mutation, file exfiltration, or
  kill other backends. Positive function allowlist (`_SAFE_FUNCTIONS`,
  ~80 entries: aggregates, math, string, date/time, JSON, array,
  window) is the only durable defense.

`_function_name_candidates(func)` returns the FULL set of sqlglot
spellings for a given function — `Anonymous.this`, `sql_name()`, `key`,
class name — so the gate matches regardless of how sqlglot normalizes
the AST (`to_char` → `TimeToStr`/`time_to_str`/`timetostr`,
`date_trunc` → `TimestampTrunc`/`timestamp_trunc`/`timestamptrunc`,
`now()` → `CurrentTimestamp`/`current_timestamp`). The allowlist
includes every form.

Additional review fixes:

I1 — CTEs were silently broken (`WITH ok AS (SELECT ...) SELECT * FROM ok`
  returned a confusing `AllowlistViolation: table 'ok' is outside allowlist`).
  Fix: collect CTE alias names from `exp.With.expressions` and exempt them
  from the table allowlist. The CTE body itself was already walked, so
  the inner SELECT's tables still get checked.

I4 — `validate_select_sql` returned the AST but `nexo_select`/`nexo_grep`/
  `nexo_explain` sent the raw user string to the DB. Fix: new
  `_render_safe(ast)` returns `ast.sql(dialect='postgres')` — comments,
  whitespace tricks, and any structurally-redundant input get normalized
  away. DB sees only what the gate validated.

M3 — `_MUTATION_NODES` gained `exp.Merge` (PG15+ MERGE) and `exp.Copy`.

Tests added (20 new in `test_primitives.py`):
- 3 LATERAL function-in-FROM rejections
- 9 side-effect function rejections in SELECT list
- 5 safe-builtin acceptances (count, lower+max, coalesce+to_char+now,
  date_trunc, json_build_object)
- 1 CTE-allowlisted acceptance
- 2 MERGE/COPY rejections

Direct re-probe via `uv run python` confirms all 8 bypass attempts
from the review are now blocked. Full suite: 248 passed, 1 skipped.

Plan ref: .cursor/plans/ai-first/13-post-nexo-roadmap.md (Phase E · E3 review)
`observability/provenance.py`:
- `ProvenanceEntry` dataclass: question, sql, plan_cost,
  rows_returned, refreshed_at (optional), run_id, tenant_id.
- `ProvenanceLog(root, enabled=True)`: append-only JSONL log
  partitioned by UTC date (`YYYY-MM-DD.jsonl`). Creates the parent
  directory on first write so callers don't have to. Naive `now`
  arguments coerced to UTC. `enabled=False` makes every call a no-op
  so production can flip the log off without untangling call sites.

`scripts/provenance-curate.py` (new, worktree-root scripts dir):
- argparse CLI reading the last N days of JSONL.
- WHERE-clause normalization: collapses `'literals'` and numeric
  values to `$`, so `WHERE id = 5` and `WHERE id = 7` cluster into
  one bucket. Repeated (table, where-pattern) combos surface as new
  curated `fn_dx_*` function candidates.
- Surfaces three things in the weekly diff:
  1. Top-N most-repeated (table, where-pattern) buckets.
  2. Cross-tenant tables (≥2 tenants) — multi-tenant abstraction
     signal for plan 14.
  3. Plan-cost outliers — index or denial candidates.
- `--format json` for machine-readable output.

Tests (`tests/observability/test_provenance.py`, 7 new):
- Every field round-trips through JSONL.
- Append-only (one line per call).
- Date partitioning across days.
- mkdir on missing parent dir.
- Disable flag is a true no-op.
- `refreshed_at` may be None.

Full suite: 255 passed, 1 skipped.

Plan ref: .cursor/plans/ai-first/13-post-nexo-roadmap.md (Phase E · E4)
`runtime/conversation.py`:
- `ConversationTurn(user_message, assistant_answer)` — dataclass.
- `ConversationHistory(conversation_id, turns, summary)` — dataclass.
- `ConversationStore` Protocol — the v2 Redis-backed implementation
  drops in here without changing call sites.
- `InMemoryConversationStore` — v1 dict-keyed store. Lost on process
  restart; acceptable for v1.
- `summarize_if_needed(conversation_id, summarizer)` returns False
  below `summarize_at_turns` (default 10) and fires the async
  summarizer at the (n+1)th turn, storing the result in
  `history.summary` so the next prompt can be compacted before
  context-window pressure hits.

`UserRequest.conversation_id: str | None = None` and
`HarnessRunRecord.conversation_id: str | None = None` added so Langfuse
groups multi-turn runs by this attribute. None for one-shot requests.

Tests (`tests/runtime/test_conversation.py`, 7 new):
- `get` returns None for unknown id.
- Append round-trips a single turn.
- Order preserved across multiple turns.
- Two conversations are isolated.
- `summarize_if_needed` no-ops below threshold (asserts summarizer
  not called).
- `summarize_if_needed` fires above threshold, stores summary.
- `conversation_id` round-trips across two /runs calls — the seam
  contract that matters most.

Full suite: 262 passed, 1 skipped.

Plan ref: .cursor/plans/ai-first/13-post-nexo-roadmap.md (Phase E · E5)
E6 — new `runtime/agentic_graph.py`:
- `build_agentic_graph(settings, models, provenance_log)` returns a
  compiled LangGraph `StateGraph[NexoState]` with the nodes the plan
  names: `tenancy_gate / planner / synthesizer / critic / summarizer`.
- Entry point is the tenancy gate; planner bumps `turn_count` and
  exits with a failure at 12 (the agentic turn cap, vs 8 for plan-
  execute).
- Behavioral specifics — planner choosing curated-vs-composable,
  executor invoking primitives, freshness_judge/domain_analyst/
  critic verdict routing — are scaffolded but stubbed in this
  iteration. The plan acknowledges F3 (live verification) is the
  behavior gate; the wiring tests assert topology + happy-path
  reachability + turn-cap safety.
- `provenance_log` is threaded through the closure so the executor
  (when implemented) writes one row per primitive call (see E4).

E7 — new `runtime/tenancy.py`:
- `tenancy_gate_decision(ctx, route, settings) → TenancyDecision(
  allowed, refusal_message, audit_attr)` — pure function, no I/O.
- Behavior matrix from plan 13 §E7:
  - `NEXO_QUERY` / `NEXO_AGENTIC` → tenant must match
    `nexo_tenant_lock`; else refuse with a Mintral-only message.
  - `NEXO_META` → allow any tenant. Off-lock tenants get an audit
    attribute `tenant.bypass: meta_route` so Langfuse can panel
    them.
  - Unknown route → safe-default refuse for off-lock tenants.
- `nexo_graph._tenant_gate_node` and `agentic_graph.tenancy_gate`
  both delegate here. The mode resolver already refuses
  `mode="agentic"` for non-Mintral at request validation; the
  graph-level gate is defense in depth.

Tests:
- 4 in `test_agentic_graph.py`: non-Mintral refusal, happy path
  runs to synthesizer answer, turn-cap exit, required-nodes
  topology check.
- 7 in `test_tenancy_gate.py`: data routes refuse non-Mintral,
  data routes allow Mintral, NEXO_META allows any tenant + audit
  attr when off-lock, NEXO_META locked tenant no audit attr,
  unknown route safe-default refuse.

Full suite: 273 passed, 1 skipped.

Plan ref: .cursor/plans/ai-first/13-post-nexo-roadmap.md (Phase E · E6, E7)
…spans

E9 — the C2 per-mode cost dashboard needs `modular.mode` on every span
so it can group canned vs agentic vs meta runs and compare cost-per-
question. Wire-up:

- `HarnessContext.mode: RunMode = "auto"` — propagated from
  `UserRequest.to_context()`.
- `supervisor._run_nexo` passes `mode=ctx.mode` to
  `agent_span("run", ...)`, so the root `nexo.run` span carries
  `modular.mode`.
- `nexo_graph._instrument(...)` forwards `ctx.mode` to
  `NexoTelemetryCallback`, so every per-agent `nexo.<agent>` span
  Langfuse renders for the run carries the same attribute.

Tests (`tests/observability/test_mode_telemetry.py`, 2 new):
- Root `nexo.run` carries `modular.mode='agentic'` when the caller
  sets it on the context.
- After running the real graph, every per-agent span carries
  `modular.mode='canned'` — proves the attr survives from
  HarnessContext → callback → emitted span.

Marks E8 done as well: the seven test files the plan lists for E8
(test_intent_router, test_mode_resolver, test_meta_agent,
test_primitives, test_provenance, test_conversation,
test_agentic_graph, test_tenancy_gate) were all written via TDD
during E1-E7; E8 is implicitly satisfied.

The `nexo.critic` non-zero requirement and C2 dashboards in E9 are
infrastructure-dependent (Phase B+C blocked on operator
preconditions). The attribute surface is wired and verified offline;
live verification gates on infra.

Full suite: 275 passed, 1 skipped.

Plan ref: .cursor/plans/ai-first/13-post-nexo-roadmap.md (Phase E · E8, E9)
`uv run pytest` reports 275 passed, 1 skipped, 0 failed, 0 errors.

- Plan 12's 151-test baseline preserved (no regressions).
- 124 new tests across `tests/observability/`, `tests/runtime/`,
  `tests/agents/`, `tests/integrations/nexo/` covering Phase A
  (telemetry foundation) and Phase E (agentic search + composable
  primitives) — every offline-verifiable task in plan 13.

F2 (canned-mode live verification), F3 (agentic-mode live verification),
F4 (one week of agentic traffic), and F5 (PR with verification
artifacts) require operator preconditions per `.ralph/blockers.md`:
Docker daemon for the Langfuse stack, SSH tunnel to coordinador-prod,
and `ANTHROPIC_API_KEY`. Until those are in place the loop cannot
complete Phase F.

Phase E review checkpoint dispatched separately.
…p gap

Phase E code review checkpoint dispatched after E1-E9 + F1 landed.
Verdict: 0 Critical / 7 Important / 5 Minor. Critical bypasses from
the earlier E3 review remain closed.

Cheap Important fixes applied here:

- **.env.example** (#4): added MIOT_HARNESS_INTENT_ROUTER_MODEL,
  MIOT_HARNESS_INTENT_ROUTER_CONFIDENCE_THRESHOLD, and
  MIOT_HARNESS_NEXO_EXPLAIN_COST_THRESHOLD so the file's own
  "every config.py setting documented here" invariant holds.

- **agentic_graph.py docstring** (#2): rewritten to be honest about
  which nodes are wired in this iteration (tenancy_gate, planner stub,
  synthesizer, critic stub, summarizer stub) vs which are deferred to
  the F-phase supervisor integration (executor, freshness_judge,
  domain_analyst). The old docstring listed all 8 nodes as if they
  were present.

- **`_ = provenance_log`** (#3): clarified the intent comment so a
  reader knows the parameter is captured for the future executor node,
  not silently ignored.

Big Important #1 — `HarnessSupervisor` does NOT yet consume the new
Phase E modules (resolve_mode, LLMIntentRouter, InMemoryConversation
Store, build_agentic_graph, meta_agent_node). The supervisor still
uses the keyword `IntentRouter` for routing decisions. The seven new
modules are correct and tested in isolation but are an orphan island
until the F-phase wire-up lands.

Documented as a new OPEN entry in `.ralph/blockers.md` with the
concrete 5-step wire-up plan the next iteration follows when operator
preconditions are cleared.

Phase F status:
- F1 (uv run pytest green): **DONE** (275 passed, 1 skipped).
- F2, F3, F4 (live verification): **BLOCKED** on operator preconditions
  (Docker for Langfuse stack, SSH tunnel to coordinador-prod,
  ANTHROPIC_API_KEY).
- F5 (PR open with verification artifacts): **BLOCKED** on F2-F4.

Loop output: BLOCKER. Every remaining task requires either operator
preconditions or the supervisor wire-up that depends on them. The
plan's offline-doable scope is complete; the autonomous loop cannot
proceed further from this seat.

Full suite: 275 passed, 1 skipped.

Plan ref: .cursor/plans/ai-first/13-post-nexo-roadmap.md (Phase E review)
The Phase E modules (LLMIntentRouter, mode_resolver, agentic_graph,
meta_agent_node, InMemoryConversationStore) were orphan-islands —
HarnessSupervisor.run() still used the keyword router. This commit
lands the integration the post-Phase-E review flagged as Important #1.

`runtime/supervisor.py`:
- `__init__` gains optional kwargs `llm_router`, `agentic_graph`,
  `meta_model`, `meta_primer`, `meta_catalog`, `conversation_store`,
  `tenant_lock`. Positional sig unchanged — existing call sites and
  tests don't break.
- `run()` resolves the route via `await resolve_mode(...)` when
  `llm_router` is injected; else falls back to the keyword router
  (Plan 12 default).
- `ModeAccessDenied` (agentic-mode for non-Mintral) is caught at
  the route-resolution boundary and surfaced as a refusal answer.
- Dispatch tree extended:
  - NEXO_QUERY     → existing `_run_nexo` (nexo_graph.ainvoke)
  - NEXO_META      → new `_run_nexo_meta` (calls meta_agent_node
                     with primer + catalog; no graph, no SQL)
  - NEXO_AGENTIC   → new `_run_nexo_agentic` (agentic_graph.ainvoke,
                     wrapped in `agent_span("run", ...)`)
  - STORYTELLING_RUN / DIRECT / OTHER unchanged.
- `conversation_id` round-trips: hydrate via store on entry, append
  one `ConversationTurn(user_message, assistant_answer)` on exit
  when both `conversation_store` and `request.conversation_id` are
  set. `record.conversation_id` mirrors the request id for telemetry.

`runtime/factory.py`:
- `build_harness` wires the always-on `InMemoryConversationStore` (it
  costs nothing without a request).
- The Phase-E modules that need live Nexo boot (LLM router, agentic
  graph, meta model) are wired by the FastAPI lifespan once the Nexo
  integration succeeds.

`api/server.py` lifespan:
- After `build_nexo_graph(...)` succeeds, builds the agentic graph
  (sharing the analyst model as planner), wires `meta_model` +
  primer + an auto-derived catalog from the registered curated
  tools, instantiates `LLMIntentRouter` (Haiku tier from settings),
  and sets `tenant_lock` from `nexo_tenant_lock`.
- Falls back gracefully — supervisor's keyword-router path stays
  intact when any of the boot steps fails.

Tests: 7 new in `tests/runtime/test_supervisor_phase_e.py`:
- explicit canned mode dispatches to nexo_graph
- explicit meta mode calls meta_agent_node (no graph)
- explicit agentic mode dispatches to agentic_graph
- explicit agentic mode for non-Mintral returns refusal
- auto mode uses LLM router (not keyword)
- conversation_id round-trips via the store across two calls
- backward compatibility: keyword-router-only construction works

Full suite: 282 passed, 1 skipped.
`infra/observability/` is the local dev stack for Phase B (plan 13).
Authored offline — Docker only needs to be running to *execute* it,
not to write it.

`docker-compose.yml` (B1):
- 7 services on a private `observability` bridge network:
  - `postgres:16` — Langfuse's primary DB
  - `clickhouse:24-alpine` — Langfuse's analytics store
  - `redis:7-alpine` — Langfuse's queue/cache (`requirepass` set)
  - `minio` — S3-compatible blob storage for event payloads
  - `langfuse/langfuse:3` (web) — UI + ingest API on `:3000`
  - `langfuse/langfuse-worker:3` — background jobs
  - `otel/opentelemetry-collector-contrib:0.95.0` — OTLP gRPC on
    `:4317`, HTTP on `:4318`, self-metrics on `:8888`
- Volumes for postgres, clickhouse, clickhouse-logs, minio.
- Health checks gate `depends_on:` so the right boot order happens
  on `docker compose up -d`.
- LANGFUSE_INIT_* envs declare a deterministic `miot-harness-local`
  project so bootstrap.sh can mint keys idempotently.

`otel-collector-config.yaml` (B2):
- OTLP gRPC (4317) + HTTP (4318) receivers.
- `memory_limiter` (90% soft / 10% spike) + `batch` (200ms / 1KB).
- `otlphttp/langfuse` exporter pointing at the in-network
  `http://langfuse-web:3000/api/public/otel` ingest endpoint.
- Comments describe the fan-out pattern (e.g. adding Honeycomb) so
  swapping backends doesn't require harness re-instrumentation.

`bootstrap.sh` (B3):
- `docker compose up -d`, then polls `/api/public/health` until 200
  (180s timeout, prints last 50 lines of langfuse-web logs on fail).
- Calls Langfuse v3's `/api/public/projects/{id}/api-keys` to mint a
  key pair, parses `publicKey` / `secretKey` out of the JSON, and
  prints the `.env` lines for the operator to paste — including
  toggling `MIOT_HARNESS_OTEL_ENABLED=true`.
- Falls back to UI instructions when the bootstrap API isn't
  reachable (Langfuse v3 build pre-bootstrap-API).
- Marked `set -euo pipefail` + `chmod +x`.

`README.md` (B4):
- Bring-up / tear-down, access (UI / collector metrics), .env
  paste-block, privacy posture (single-tenant, internal, full
  message storage — not for SaaS).
- Backup/restore via `tar` over docker volumes.
- Backend-swap recipe (add a second exporter, no harness changes).
- Troubleshooting table covering the common boot failures.

YAML + shell syntax verified by `yaml.safe_load` and `bash -n`.
No tests added — these are config files exercised live in F2/F3.
Full suite still: 282 passed, 1 skipped.

Plan ref: .cursor/plans/ai-first/13-post-nexo-roadmap.md (Phase B · B1-B4)
C3 from plan 13 — `python -m miot_harness.observability.report --since 7d
--by agent|tenant|mode`. The pure aggregation core is offline-doable
(unit-tested) and the live-Langfuse fetcher is deferred to F-phase when
the stack is up.

`observability/report.py`:
- `CostRow(key, cost_usd, input_tokens, output_tokens, traces)` —
  one bucket of the report.
- `aggregate_cost(traces, by)` — pure aggregation over a list of trace
  projections. Groups by `modular.agent` / `modular.tenant_id` /
  `modular.mode`, sums `gen_ai.usage.cost_usd` as Decimal (6-decimal
  quantize), tallies `gen_ai.usage.{input,output}_tokens`, counts
  traces. Skips traces missing `gen_ai.usage.cost_usd`. Sorts rows
  cost-descending so the dashboard panel renders the cost driver
  first.
- `parse_since('7d' | '24h' | '90m')` returns a `timedelta`; rejects
  invalid specs with a clear ValueError.
- CLI shim: `--fixture <path-to-jsonl>` runs against a static file so
  operators can sanity-check the math before pointing at Langfuse;
  `--format json` for machine-readable output. The Langfuse fetcher
  is intentionally unbuilt; the CLI errors loudly if no fixture is
  passed, with a pointer to F-phase.

Tests (`tests/observability/test_report.py`, 13 new):
- aggregate by agent / tenant / mode each sum correctly
- traces missing cost are skipped (not silently zeroed)
- output is sorted cost-descending
- unknown grouping dimension raises ValueError
- token totals (in / out) propagate per-bucket
- parse_since accepts common specs (1h/24h/7d/30d/90m)
- parse_since rejects garbage ('', '1', '7y', 'minus', '1.5d')

C2 dashboards and C4 alerting still wait for the live Langfuse stack
(JSON exports + real trace queries respectively). C1 live verification
is operator-precondition gated per `.ralph/blockers.md`.

Stage 1 complete. Stage 2 = user action: install Docker, get
ANTHROPIC_API_KEY, start the SSH tunnel.

Full suite: 295 passed, 1 skipped.

Plan ref: .cursor/plans/ai-first/13-post-nexo-roadmap.md (Phase C · C3)
…ENCRYPTION_KEY, cluster mode, deterministic API keys

Four corrections to the Phase B stack found while bringing it up live:

1. **ClickHouse healthcheck used `localhost`** — Docker on macOS doesn't
   route IPv6 inside containers, so wget resolved `localhost` → `::1`
   and got refused while ClickHouse was actually listening on
   `0.0.0.0:8123` (IPv4 only, after the v6 bind failed). Switched the
   healthcheck to `127.0.0.1`, raised `retries` to 20, added
   `start_period: 30s` so the slow first-boot migrations don't trip
   the gate.

2. **`ENCRYPTION_KEY: 000...0` was rejected by Langfuse's Zod schema**
   ("ENCRYPTION_KEY must be 256 bits, 64 string characters in hex
   format, generate via: openssl rand -hex 32"). Generated a real one
   with `openssl rand -hex 32`; comment makes the local-dev-only
   nature explicit so this isn't reused in prod.

3. **`CLICKHOUSE_CLUSTER_ENABLED: "false"`** added to both langfuse-web
   and langfuse-worker. Langfuse v3's migrations default to
   `ReplicatedMergeTree ON CLUSTER default`, which requires
   ZooKeeper. Disabling cluster mode makes them use plain MergeTree
   for the single-node local stack.

4. **`LANGFUSE_INIT_PROJECT_PUBLIC_KEY` / `_SECRET_KEY`** added so the
   API keys are minted deterministically during the project bootstrap
   pass — no manual UI step, no race condition with the bootstrap
   script trying to call a key-mint endpoint that doesn't exist in
   Langfuse v3's public API. Comment explicitly marks these as
   local-dev-only credentials and explains the rotation path.

`otel-collector-config.yaml`:
- Replaced the placeholder Authorization header with the real
  `Basic base64(pk-lf-...:sk-lf-...)` for the new auto-provisioned
  key pair so traces actually reach Langfuse.

`bootstrap.sh`:
- Replaced the (broken) `POST /api/public/projects/.../api-keys`
  attempt with a Postgres-direct read of the public_key (just for
  confirmation) and a grep over `docker-compose.yml` for the secret.
  Result: the script always prints the working pair, idempotently.

Verified end-to-end: `docker compose down -v && docker compose up -d`
boots the 7 services cleanly, langfuse-web reports `{"status":"OK"}` on
`/api/public/health`, otel-collector starts with the Langfuse exporter
ready, and `./bootstrap.sh` prints the .env paste-block in <5s.
…ion; clear blockers

End-to-end live bring-up confirmed:

- F2 canned mode (3 of plan-12's golden scenarios): G6 estado, G7 cola
  crítica, G8 ETA en riesgo. All three returned Markdown answers from
  the real Coordinador snapshot (with honest stale-data warnings where
  the snapshot is older than the freshness threshold).

- F3 mode-bypass + auto-router: META mode answers from primer+catalog
  without DB touch, allowed for any tenant. AGENTIC refused at
  request-validation for non-Mintral (`ModeAccessDenied`). AGENTIC for
  Mintral runs through the stub planner→synthesizer. AUTO mode
  classified "dimensionamiento para mañana" as NEXO_QUERY at
  confidence 0.75 and dispatched correctly.

- Direct ClickHouse query confirms Langfuse received: 11 `nexo.run`
  root spans + 6 `nexo.synthesizer` + 6 `nexo.filter_expert` per-agent
  spans + 22 `anthropic.chat` (Traceloop auto-instrumented) child
  spans. Per-mode trace split: 5 auto / 3 canned / 2 meta / 1
  agentic. Every span carries `modular.{mode, run_id, tenant_id}` in
  its metadata.

`tests/conftest.py` (new): 7 default-value tests were failing because
the populated dev `.env` was being auto-loaded by pydantic-settings,
overriding the asserted defaults. Added an autouse fixture that pins
`HarnessSettings.model_config["env_file"]` to a non-existent path and
clears any `MIOT_HARNESS_*` / provider keys from the inherited shell,
so the full suite passes deterministically on a populated dev box and
on CI. Full suite: 295 passed, 1 skipped.

`.ralph/blockers.md`: both OPEN entries (operator preconditions +
supervisor wire-up) marked RESOLVED with the resolution context.

`.ralph/state.md`: C1, D1, D2, F2, F3 all → [x]. D3 / F4 stay [ ]
(time-gated — 50+ runs over a week). C2 / C4 / F5 stay [ ] (follow-up
work after the PR lands).
@coderabbitai

coderabbitai Bot commented May 12, 2026

Copy link
Copy Markdown

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2fd475a8-f6c1-4dc1-89a2-7e12fecdbd87

📥 Commits

Reviewing files that changed from the base of the PR and between 7dc5bcd and d7d4f36.

📒 Files selected for processing (2)
  • .gitignore
  • miot-harness/evals/deploy/07-attestations-present.sh
✅ Files skipped from review due to trivial changes (1)
  • .gitignore

📝 Walkthrough

Walkthrough

Introduces a local Langfuse+OTel stack, integrates OpenTelemetry in the FastAPI app, adds pricing, provenance, SQL safety primitives, intent routing, agentic/meta modes, tenancy gate, conversation memory, supervisor/server wiring, a cost-report CLI, CI attestations, and comprehensive tests.

Changes

Plan 13 end-to-end telemetry and runtime modes

Layer / File(s) Summary
Full slice: infra + observability + runtime modes + tests
infra/observability/*, miot-harness/src/..., miot-harness/tests/..., scripts/provenance-curate.py, .github/workflows/harness.yaml, .gitignore, miot-harness/evals/deploy/07-attestations-present.sh
Adds local observability stack and OTel wiring, pricing/cost reporting, provenance logging, Nexo SQL-safe primitives, LLM intent routing and mode resolver, agentic/meta graphs, tenancy gate, conversation memory, supervisor/server integration, CLI/reporting, CI attestation updates, and tests.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant FastAPI
  participant Supervisor
  participant Router as Mode/Intent Router
  participant Graph as Nexo/Agentic/Meta Graphs
  participant OTel as OTel Collector
  participant Langfuse

  Client->>FastAPI: POST /runs (user_message, mode, tenant)
  FastAPI->>Supervisor: run(request)
  Supervisor->>Router: resolve_mode(request)
  Router-->>Supervisor: RouteResult(route)
  Supervisor->>Graph: execute(route) with callbacks
  Graph-->>Supervisor: answer, events
  Supervisor-->>FastAPI: response
  FastAPI->>OTel: OTLP traces
  OTel->>Langfuse: Export traces
Loading

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related issues

Possibly related PRs

  • microboxlabs/modulariot#445 — Builds on the Nexo coordinator graph/supervisor architecture extended here with telemetry and routing.

Suggested reviewers

  • korutx

Poem

A rabbit wired the stars to trace,
Little spans hop place to place.
Grafters plan, routers muse,
Tenants gate what modes to choose.
SQL kept safe, costs align—
Langfuse glows, OTel fine.
Thump-thump: shipped Plan 13, divine! 🐇✨

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/harness-phase-13-telemetry-agentic

odtorres added 2 commits May 12, 2026 17:00
…nt cost rollups

Spotted during the F2/F3 live verification: the Langfuse Tracing UI's
filter sidebar (User ID / Session ID / Tags / Environment) was empty
for our runs even though `tenant_id` / `user_id` / `conversation_id`
were already threaded through every request. Without these promoted to
first-class indexed columns, per-client cost rollups would require
post-hoc metadata filtering instead of using Langfuse's built-in
filter affordances.

Langfuse's OTel ingest promotes specific attribute names
(`langfuse.user.id`, `langfuse.session.id`, `langfuse.tags`,
`langfuse.environment`) from span metadata into first-class trace
columns. This commit emits them alongside the existing `modular.*`
attrs.

Changes:

- `HarnessContext` gains `conversation_id: str | None`, propagated
  from `UserRequest.to_context()` so the conversation surface and
  Langfuse session_id read from the same source.

- `observability/spans.py:agent_span(...)` extended with optional
  `user_id` / `session_id` / `tags` / `environment` kwargs. Tags
  serialized as a JSON-encoded string array for portability.

- `observability/callbacks.py:NexoTelemetryCallback` gains the same
  four kwargs. On every `on_chat_model_start`, sets
  `langfuse.user.id`, `langfuse.session.id`, `langfuse.environment`,
  and `langfuse.tags` on the per-agent span so observation-level
  rows in Langfuse carry the same first-class fields as the trace.

- `runtime/supervisor.py:_root_span_kwargs(ctx, route)` centralizes
  the mapping: every dispatch path (NEXO_QUERY / NEXO_AGENTIC /
  NEXO_META) calls `agent_span("run", **self._root_span_kwargs(...))`
  with identical kwargs plus a per-route tag (`route:<chosen_route>`).
  `session_id` falls back from `conversation_id` → `thread_id` so
  one-shot requests still group under a session key.

- `runtime/nexo_graph.py:_instrument(...)` forwards `ctx.user_id`,
  `ctx.conversation_id or ctx.thread_id`, and a per-agent tag list
  (`tenant:<id>`, `mode:<mode>`, `agent:<name>`) to each
  NexoTelemetryCallback.

Tests: 4 new in `tests/observability/test_langfuse_attribution.py`:
- agent_span emits all four langfuse.* attrs when provided
- agent_span omits them when unset (backward compat for old callers)
- callback emits langfuse.* on per-agent spans when provided
- callback omits them when not provided

Live-verified end-to-end: after `POST /runs` with
`{tenant_id: mintral, user_id: odtorres, conversation_id:
odtorres-debug-2}`, the Langfuse `traces` ClickHouse row shows:

  user_id     = 'odtorres'
  session_id  = 'odtorres-debug-2'
  tags        = ['agent:filter_expert','agent:synthesizer',
                 'mode:auto','route:nexo_query','tenant:mintral']
  environment = 'local'

Full suite: 299 passed, 1 skipped.

Plan ref: .cursor/plans/ai-first/13-post-nexo-roadmap.md (Phase E · E10,
added mid-execution per the plan's "editable during execution" rule).

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 18

Note

Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
miot-harness/tests/conftest.py (1)

29-44: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Clear cached settings inside the isolation fixture.

get_settings() is cached, so env cleanup alone may not take effect in tests that call the accessor after the first instantiation.

Suggested fix
-from miot_harness.config import HarnessSettings
+from collections.abc import Iterator
+
+from miot_harness.config import HarnessSettings, get_settings
@@
-@pytest.fixture(autouse=True)
-def _isolate_settings_from_env(monkeypatch: pytest.MonkeyPatch) -> None:
+@pytest.fixture(autouse=True)
+def _isolate_settings_from_env(monkeypatch: pytest.MonkeyPatch) -> Iterator[None]:
@@
+    get_settings.cache_clear()
@@
-    for key in list(os.environ.keys()):
+    for key in list(os.environ.keys()):
         if key.startswith("MIOT_HARNESS_") or key in {
             "ANTHROPIC_API_KEY",
             "OPENAI_API_KEY",
             "GOOGLE_API_KEY",
         }:
             monkeypatch.delenv(key, raising=False)
+    yield
+    get_settings.cache_clear()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@miot-harness/tests/conftest.py` around lines 29 - 44, The isolation fixture
_isolate_settings_from_env currently alters HarnessSettings.model_config and
environment variables but does not clear the cached settings, so tests that call
get_settings() after a prior instantiation still see stale values; update
_isolate_settings_from_env to clear the cached accessor by calling
get_settings.cache_clear() (or the appropriate cache-clear method used by
get_settings) after monkeypatching HarnessSettings.model_config and before
returning so subsequent get_settings() calls recreate settings from the modified
env/model_config.
🟡 Minor comments (5)
miot-harness/tests/observability/conftest.py-13-17 (1)

13-17: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix import ordering to clear CI (ruff I001).

The current import block is flagged by Ruff and is failing the lint job.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@miot-harness/tests/observability/conftest.py` around lines 13 - 17, Reorder
the import block to satisfy Ruff's I001 ordering: group and sort imports into
standard library, third-party, and local/application imports, and sort them
alphabetically within each group; specifically ensure pytest and opentelemetry
imports are ordered so that "import pytest" appears before "from opentelemetry
import trace" and the opentelemetry submodule imports (ReadableSpan,
TracerProvider, SimpleSpanProcessor, SpanExportResult, InMemorySpanExporter) are
grouped and sorted consistently.
.ralph/log.md-5-11 (1)

5-11: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add a language identifier to the fenced example block.

This fence triggers markdownlint MD040; adding a language keeps docs lint-clean.

Suggested fix
-```
+```text
 ## Iteration N — YYYY-MM-DD HH:MM
 - Task: <task ID + short>
 - Status: [completed|blocked|partial]
 - Commit: <SHA or "n/a">
 - Notes: <one-line summary; tests run; surprises>
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.ralph/log.md around lines 5 - 11, The fenced code block in the iteration
template is missing a language identifier which triggers markdownlint MD040;
update the triple-backtick fence for the example block (the block that begins
with "" followed by the iteration lines) to include a language identifier such as "text" (e.g. change "" to "```text") so the markdown linter passes
and the example remains unchanged otherwise.


</details>

</blockquote></details>
<details>
<summary>miot-harness/src/miot_harness/config.py-48-55 (1)</summary><blockquote>

`48-55`: _⚠️ Potential issue_ | _🟡 Minor_ | _⚡ Quick win_

**Add bounds validation for new numeric settings.**

These values are externally configurable, but currently accept invalid ranges. Guarding them at load time prevents silent misconfiguration.



<details>
<summary>Suggested fix</summary>

```diff
-    intent_router_confidence_threshold: float = 0.7
+    intent_router_confidence_threshold: float = Field(default=0.7, ge=0.0, le=1.0)
@@
-    nexo_explain_cost_threshold: float = 10000.0
+    nexo_explain_cost_threshold: float = Field(default=10000.0, gt=0.0)
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@miot-harness/src/miot_harness/config.py` around lines 48 - 55, The two
numeric config fields intent_router_confidence_threshold and
nexo_explain_cost_threshold need runtime bounds validation on load: ensure
intent_router_confidence_threshold is within [0.0, 1.0] and
nexo_explain_cost_threshold is a positive number (e.g., > 0.0); implement the
checks in the config initialization path (e.g., Config.__post_init__ or the
function that constructs these values) and raise a clear ValueError with the
invalid field name and value if a check fails so misconfiguration is caught
immediately.
```

</details>

</blockquote></details>
<details>
<summary>miot-harness/.env.example-135-135 (1)</summary><blockquote>

`135-135`: _⚠️ Potential issue_ | _🟡 Minor_ | _⚡ Quick win_

**Fix the stack path typo in the OTEL comment.**

Line 135 has an extra space in the path (`infra/observability/ docker-compose`), which makes copy/paste misleading.

<details>
<summary>✏️ Suggested fix</summary>

```diff
-# infra/observability/ docker-compose stack is up.
+# infra/observability/docker-compose stack is up.
```

</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@miot-harness/.env.example` at line 135, Fix the typo in the OTEL comment by
removing the stray space in the path string "infra/observability/
docker-compose" so it reads "infra/observability/docker-compose"; update the
comment text on the affected line in .env.example accordingly to ensure
copy/paste yields a valid path.
```

</details>

</blockquote></details>
<details>
<summary>miot-harness/tests/observability/test_graph_wiring.py-158-164 (1)</summary><blockquote>

`158-164`: _⚠️ Potential issue_ | _🟡 Minor_ | _⚡ Quick win_

**Assert actual parent-child linkage, not just root presence.**

The test is named `test_root_nexo_run_span_parents_per_agent_spans` yet only verifies root attributes. The established testing pattern in the codebase (test_spans.py) demonstrates that `span.parent` and `span.context.span_id` are available for parent-child assertions. Add explicit parent-id checks to verify that nexo.* agent spans are actually children of the nexo.run root span, preventing broken span parenting from going undetected.

<details>
<summary>🧪 Suggested assertion</summary>

```diff
     spans = memory_exporter.get_finished_spans()
     root = next((s for s in spans if s.name == "nexo.run"), None)
     assert root is not None, "nexo.run root span not emitted"
     attrs = dict(root.attributes or {})
     assert attrs["modular.run_id"] == "run_root_span_test"
     assert attrs["modular.tenant_id"] == "mintral"
     assert attrs["gen_ai.operation.name"] == "nexo.run"
+
+    root_span_id = root.context.span_id
+    for s in spans:
+        if s.name.startswith("nexo.") and s.name != "nexo.run":
+            assert s.parent is not None, s.name
+            assert s.parent.span_id == root_span_id, s.name
```

</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@miot-harness/tests/observability/test_graph_wiring.py` around lines 158 -
164, The test currently only asserts the root "nexo.run" span attributes; add
parent-child assertions to ensure all agent spans are children of that root by
locating the root span (root) from memory_exporter.get_finished_spans(), then
iterate over agent spans (spans with names starting with "nexo." excluding
"nexo.run") and assert each has a non-null parent and that span.parent.span_id
(or span.parentSpanId if that field is used in the Span model) equals
root.context.span_id (or root.context.span_id field used in this codebase);
update test_root_nexo_run_span_parents_per_agent_spans to include these explicit
parent-id checks.
```

</details>

</blockquote></details>

</blockquote></details>

<details>
<summary>🤖 Prompt for all review comments with AI agents</summary>

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @infra/observability/docker-compose.yml:

  • Around line 96-99: The compose file currently hardcodes deterministic secrets
    (e.g., DATABASE_URL, NEXTAUTH_URL, NEXTAUTH_SECRET, SALT and other
    DB/Redis/MinIO/Langfuse creds) — replace these literal values with externalized
    references (use environment variable interpolation like ${DATABASE_URL}, an
    env_file, or Docker secrets) and remove the static defaults from the committed
    compose; update the service definitions that reference DATABASE_URL,
    NEXTAUTH_SECRET, SALT (and the other listed secret entries) to read from the
    external env or secret names, add a .env.example with placeholder keys (not
    secrets) and documentation on how to provide real credentials locally/CI, and
    ensure no real credentials remain in the file.
  • Around line 63-69: The Redis healthcheck is failing because the healthcheck
    test (healthcheck.test using "redis-cli ping") doesn't authenticate while the
    container is started with "--requirepass myredissecret"; update the
    healthcheck.test to authenticate (for example use "redis-cli -a myredissecret
    ping" or "redis-cli -u redis://:myredissecret@localhost:6379 ping") so it
    supplies the password, or reference a secret/env var for the password, keeping
    the rest of the healthcheck properties and the command that sets "--requirepass"
    unchanged.

In @infra/observability/otel-collector-config.yaml:

  • Around line 35-43: The otel-collector config currently contains a hardcoded
    Basic Authorization header under the otlphttp/langfuse exporter (headers:
    Authorization) which must be moved to an environment variable; replace the
    literal base64 string in the headers.Authorization value with an env reference
    using the OTel Collector syntax (e.g., ${env:LANGFUSE_BASIC_AUTH}), then ensure
    the otel-collector service in docker-compose sets LANGFUSE_BASIC_AUTH from the
    runtime environment (and document adding the value to a gitignored .env file
    generated via printf 'pk...:sk...' | base64). Keep the exporter key
    (otlphttp/langfuse) and headers.Authorization as the target of the change so the
    config still points to the same endpoint while removing secrets from the repo
    and git history.

In @miot-harness/src/miot_harness/integrations/nexo/primitives.py:

  • Around line 6-11: The file has long-line and import-order lint errors and a
    duplicate entry in the safe-functions list: break the long literal/table strings
    (lines around the module docstring or top-level comment that contains the
    Tool/Shape/Safety table and any other >79 char lines noted at 24-30) into
    shorter concatenated or multi-line strings to fix E501; reorder the import block
    into standard groups (stdlib, third-party, local) and ensure imports are
    sorted/grouped consistently (I001) for the module's import section; and remove
    the duplicate "to_char" entry from the _SAFE_FUNCTIONS collection (symbol
    _SAFE_FUNCTIONS at line ~113) so there are no repeated function names (B033).
    Make these edits in miot_harness.integrations.nexo.primitives (update the
    docstring/table wrapping, fix import order, and de-duplicate _SAFE_FUNCTIONS)
    and run linters to confirm all three issues are resolved.
  • Around line 378-381: The code incorrectly assumes rows are dicts by using
    isinstance(row, dict) which causes all asyncpg.Record results to be rejected;
    instead, in the nexo_explain path obtain the QUERY PLAN by attempting direct
    subscript access on row (e.g., try: payload = row["QUERY PLAN"]; except
    (KeyError, TypeError, IndexError): raise UnsupportedConstruct("EXPLAIN output
    missing QUERY PLAN")) so asyncpg.Record objects are handled properly, then keep
    the existing plan = payload[0].get("Plan", {}) logic and raise
    UnsupportedConstruct only when the subscript access fails or payload is empty.

In @miot-harness/src/miot_harness/observability/otel.py:

  • Around line 46-56: When configure_tracing detects an existing global
    TracerProvider (variable existing) you must record ownership instead of always
    shutting it down later; add a boolean (e.g., tracer_provider_owned) that is set
    to False when existing is reused and True when configure_tracing creates/sets a
    new provider, and update shutdown_tracing to only call provider.shutdown() if
    tracer_provider_owned is True (apply the same ownership check for the other
    reuse branch around the code referenced at 72-75). Ensure the boolean is
    accessible to both configure_tracing and shutdown_tracing so only providers
    created by this module are shut down.

In @miot-harness/src/miot_harness/observability/provenance.py:

  • Around line 68-70: The current code writes JSON payload and newline in two
    separate fh.write calls (inside the with path.open("a", encoding="utf-8")
    block), which risks interleaving under concurrent appends; change it so the code
    constructs a single string (e.g., record = json.dumps(payload,
    ensure_ascii=False) + "\n") and replace the two fh.write(...) calls with a
    single fh.write(record) to ensure each JSONL record is written atomically as one
    write operation.

In @miot-harness/src/miot_harness/observability/report.py:

  • Line 124: The header f-string assigned to the variable lines is too long;
    split it across multiple shorter string literals (using implicit concatenation
    inside parentheses or concatenation with +) so the f-string doesn't exceed the
    line-length limit. For example, break f"{'#':>3} {by.upper():<20}
    {'COST_USD':>10} {'INPUT':>10} {'OUTPUT':>10} {'N':>5}" into two or more
    parts and join them (keep the same f-string interpolation for by.upper()) so the
    resulting value of lines remains identical but each source line respects the
    linter's max line length.

In @miot-harness/src/miot_harness/runtime/agentic_graph.py:

  • Line 78: The statement assigning response exceeds the 100-char line limit;
    break it into multiple shorter expressions by extracting
    snapshot.get("user_message", "") into a local variable (e.g., user_message) and
    then calling model.ainvoke with that variable (referencing response,
    model.ainvoke, and snapshot.get), or by formatting the list argument across
    multiple lines so the full call no longer exceeds 100 characters.

In @miot-harness/src/miot_harness/runtime/intent_router.py:

  • Around line 115-128: The call to self._model.ainvoke can raise and currently
    bubbles up; wrap the await self._model.ainvoke(...) call in a try/except that
    catches Exception (or the model client-specific exceptions) and on failure logs
    the error (use logger.warning or logger.exception with context) and then returns
    self._fallback.route(message) so routing degrades to deterministic keyword
    routing; keep the existing flow that still parses the response via
    _parse_decision and only falls back when parsing fails or when the ainvoke call
    throws.

In @miot-harness/tests/integrations/nexo/test_primitives.py:

  • Around line 12-30: The import block around nexo test symbols is mis-ordered
    for Ruff I001; reformat it so imports are grouped and sorted: keep "from
    future import annotations" at the top, then a blank line, then standard
    library imports (typing, unittest.mock) sorted alphabetically, then a blank
    line, then third-party imports (pytest), then a blank line, then local package
    imports from miot_harness.integrations.nexo.primitives; within that from-style
    import list (AllowlistViolation, CostGateViolation, MutationRejected,
    MultiStatementRejected, UnsupportedConstruct, nexo_describe, nexo_explain,
    nexo_grep, nexo_select, validate_select_sql) sort names alphabetically.
    Alternatively run isort/ruff --fix to apply the same ordering.

In @miot-harness/tests/observability/test_deps.py:

  • Around line 8-14: The import block currently violates Ruff I001; reorder and
    normalize imports so they follow isort rules: keep "from future import
    annotations" first, then the standard-library import "import importlib", then
    third-party "import pytest", with a single blank line after the grouped imports;
    you can either run the project formatter (isort/ruff --fix) or manually reorder
    the three lines ("from future import annotations", "import importlib",
    "import pytest") to fix the issue.

In @miot-harness/tests/observability/test_langfuse_attribution.py:

  • Around line 73-75: The import block in test_langfuse_attribution.py is
    unsorted; reorder and group the imports so they pass ruff—place third-party
    imports together in alphabetical order and ensure from-imports are alphabetized
    (e.g., import AIMessage from langchain_core.messages and ChatGeneration,
    LLMResult from langchain_core.outputs) and run the linter to verify formatting;
    adjust any import grouping to match the project's import ordering rules used by
    ruff.
  • Around line 131-133: The import block is unsorted; please reorder/group the
    imports to satisfy ruff/isort by placing module imports in the correct
    alphabetical/grouped order (e.g., put from langchain_core.messages import
    AIMessage before from langchain_core.outputs import ChatGeneration, LLMResult)
    and ensure the names inside each import are alphabetized and formatted
    consistently so linting passes.
  • Line 15: The file imports pytest but never uses it; remove the unused import
    statement (the line importing pytest) from
    miot-harness/tests/observability/test_langfuse_attribution.py so the file no
    longer contains an unused symbol and the pipeline error is resolved.

In @miot-harness/tests/observability/test_pricing.py:

  • Around line 5-16: The import block in tests/observability/test_pricing.py is
    failing ruff I001 due to incorrect ordering and grouping; reorder imports into
    standard library (datetime, decimal) first, then third-party (pytest), then
    local package imports, with a blank line between groups, and sort names
    alphabetically within each group—ensure the top imports include date and Decimal
    from datetime/decimal, then pytest, then the miot_harness.observability.pricing
    symbols (PRICING, ModelPricing, TokenUsage, UnknownModelError, compute_cost) in
    alphabetical order to satisfy ruff.

In @miot-harness/tests/runtime/test_intent_router.py:

  • Around line 39-43: Two test-case tuples in test_intent_router.py exceed the
    100-character line limit; break each long tuple literal into multiple lines so
    no physical line exceeds 100 chars. Locate the offending tuples containing the
    strings "filter by region=norte and order by ETA desc" and "draft a story about
    today's incident" (they construct tuples with HarnessRoute.NEXO_AGENTIC and
    HarnessRoute.STORYTELLING_RUN respectively) and reformat them by placing the
    prompt string, model name, score, and route on separate indented lines (or wrap
    the prompt string) so the tuple remains the same semantically but each source
    line is <=100 chars to satisfy Ruff E501.

In @miot-harness/tests/runtime/test_supervisor_phase_e.py:

  • Line 29: The import list in the test includes an unused symbol HarnessRoute
    which triggers a pipeline failure; remove HarnessRoute from the import statement
    that currently reads "from miot_harness.runtime.router import HarnessRoute,
    IntentRouter" so only IntentRouter is imported (or switch to a single-name
    import of IntentRouter) and run the tests to ensure no other references to
    HarnessRoute remain in the file (e.g., check test_supervisor_phase_e.py for any
    usage of HarnessRoute).

Outside diff comments:
In @miot-harness/tests/conftest.py:

  • Around line 29-44: The isolation fixture _isolate_settings_from_env currently
    alters HarnessSettings.model_config and environment variables but does not clear
    the cached settings, so tests that call get_settings() after a prior
    instantiation still see stale values; update _isolate_settings_from_env to clear
    the cached accessor by calling get_settings.cache_clear() (or the appropriate
    cache-clear method used by get_settings) after monkeypatching
    HarnessSettings.model_config and before returning so subsequent get_settings()
    calls recreate settings from the modified env/model_config.

Minor comments:
In @.ralph/log.md:

  • Around line 5-11: The fenced code block in the iteration template is missing a
    language identifier which triggers markdownlint MD040; update the
    triple-backtick fence for the example block (the block that begins with "" followed by the iteration lines) to include a language identifier such as "text" (e.g. change "" to "```text") so the markdown linter passes and the example
    remains unchanged otherwise.

In @miot-harness/.env.example:

  • Line 135: Fix the typo in the OTEL comment by removing the stray space in the
    path string "infra/observability/ docker-compose" so it reads
    "infra/observability/docker-compose"; update the comment text on the affected
    line in .env.example accordingly to ensure copy/paste yields a valid path.

In @miot-harness/src/miot_harness/config.py:

  • Around line 48-55: The two numeric config fields
    intent_router_confidence_threshold and nexo_explain_cost_threshold need runtime
    bounds validation on load: ensure intent_router_confidence_threshold is within
    [0.0, 1.0] and nexo_explain_cost_threshold is a positive number (e.g., > 0.0);
    implement the checks in the config initialization path (e.g.,
    Config.post_init or the function that constructs these values) and raise a
    clear ValueError with the invalid field name and value if a check fails so
    misconfiguration is caught immediately.

In @miot-harness/tests/observability/conftest.py:

  • Around line 13-17: Reorder the import block to satisfy Ruff's I001 ordering:
    group and sort imports into standard library, third-party, and local/application
    imports, and sort them alphabetically within each group; specifically ensure
    pytest and opentelemetry imports are ordered so that "import pytest" appears
    before "from opentelemetry import trace" and the opentelemetry submodule imports
    (ReadableSpan, TracerProvider, SimpleSpanProcessor, SpanExportResult,
    InMemorySpanExporter) are grouped and sorted consistently.

In @miot-harness/tests/observability/test_graph_wiring.py:

  • Around line 158-164: The test currently only asserts the root "nexo.run" span
    attributes; add parent-child assertions to ensure all agent spans are children
    of that root by locating the root span (root) from
    memory_exporter.get_finished_spans(), then iterate over agent spans (spans with
    names starting with "nexo." excluding "nexo.run") and assert each has a non-null
    parent and that span.parent.span_id (or span.parentSpanId if that field is used
    in the Span model) equals root.context.span_id (or root.context.span_id field
    used in this codebase); update test_root_nexo_run_span_parents_per_agent_spans
    to include these explicit parent-id checks.

</details>

<details>
<summary>🪄 Autofix (Beta)</summary>

Fix all unresolved CodeRabbit comments on this PR:

- [ ] <!-- {"checkboxId": "4b0d0e0a-96d7-4f10-b296-3a18ea78f0b9"} --> Push a commit to this branch (recommended)
- [ ] <!-- {"checkboxId": "ff5b1114-7d8c-49e6-8ac1-43f82af23a33"} --> Create a new PR with the fixes

</details>

---

<details>
<summary>ℹ️ Review info</summary>

<details>
<summary>⚙️ Run configuration</summary>

**Configuration used**: defaults

**Review profile**: CHILL

**Plan**: Pro

**Run ID**: `3cbe69d2-21ba-4b42-8fac-5a2bf59d082a`

</details>

<details>
<summary>📥 Commits</summary>

Reviewing files that changed from the base of the PR and between 2ad28fa5be479da481882bbf55691b9793e96c7c and 085595fcc0621504f0cc6beb7f84bada86ffa889.

</details>

<details>
<summary>⛔ Files ignored due to path filters (1)</summary>

* `miot-harness/uv.lock` is excluded by `!**/*.lock`

</details>

<details>
<summary>📒 Files selected for processing (59)</summary>

* `.ralph/RALPH_PROMPT.md`
* `.ralph/blockers.md`
* `.ralph/log.md`
* `.ralph/state.md`
* `infra/observability/README.md`
* `infra/observability/bootstrap.sh`
* `infra/observability/docker-compose.yml`
* `infra/observability/otel-collector-config.yaml`
* `miot-harness/.env.example`
* `miot-harness/pyproject.toml`
* `miot-harness/src/miot_harness/agents/meta_agent.py`
* `miot-harness/src/miot_harness/api/server.py`
* `miot-harness/src/miot_harness/config.py`
* `miot-harness/src/miot_harness/integrations/nexo/primitives.py`
* `miot-harness/src/miot_harness/observability/__init__.py`
* `miot-harness/src/miot_harness/observability/callbacks.py`
* `miot-harness/src/miot_harness/observability/otel.py`
* `miot-harness/src/miot_harness/observability/pricing.py`
* `miot-harness/src/miot_harness/observability/provenance.py`
* `miot-harness/src/miot_harness/observability/report.py`
* `miot-harness/src/miot_harness/observability/spans.py`
* `miot-harness/src/miot_harness/runtime/agentic_graph.py`
* `miot-harness/src/miot_harness/runtime/context.py`
* `miot-harness/src/miot_harness/runtime/conversation.py`
* `miot-harness/src/miot_harness/runtime/factory.py`
* `miot-harness/src/miot_harness/runtime/intent_router.py`
* `miot-harness/src/miot_harness/runtime/mode_resolver.py`
* `miot-harness/src/miot_harness/runtime/nexo_graph.py`
* `miot-harness/src/miot_harness/runtime/router.py`
* `miot-harness/src/miot_harness/runtime/run_store.py`
* `miot-harness/src/miot_harness/runtime/supervisor.py`
* `miot-harness/src/miot_harness/runtime/tenancy.py`
* `miot-harness/tests/agents/__init__.py`
* `miot-harness/tests/agents/test_meta_agent.py`
* `miot-harness/tests/conftest.py`
* `miot-harness/tests/integrations/nexo/test_primitives.py`
* `miot-harness/tests/observability/__init__.py`
* `miot-harness/tests/observability/conftest.py`
* `miot-harness/tests/observability/test_callbacks.py`
* `miot-harness/tests/observability/test_deps.py`
* `miot-harness/tests/observability/test_graph_wiring.py`
* `miot-harness/tests/observability/test_langfuse_attribution.py`
* `miot-harness/tests/observability/test_lifespan_init.py`
* `miot-harness/tests/observability/test_mode_telemetry.py`
* `miot-harness/tests/observability/test_otel.py`
* `miot-harness/tests/observability/test_pricing.py`
* `miot-harness/tests/observability/test_propagation.py`
* `miot-harness/tests/observability/test_provenance.py`
* `miot-harness/tests/observability/test_report.py`
* `miot-harness/tests/observability/test_spans.py`
* `miot-harness/tests/runtime/__init__.py`
* `miot-harness/tests/runtime/test_agentic_graph.py`
* `miot-harness/tests/runtime/test_conversation.py`
* `miot-harness/tests/runtime/test_intent_router.py`
* `miot-harness/tests/runtime/test_mode_resolver.py`
* `miot-harness/tests/runtime/test_supervisor_phase_e.py`
* `miot-harness/tests/runtime/test_tenancy_gate.py`
* `miot-harness/tests/test_config.py`
* `scripts/provenance-curate.py`

</details>

</details>

<!-- This is an auto-generated comment by CodeRabbit for review status -->

Comment thread infra/observability/docker-compose.yml
Comment thread infra/observability/docker-compose.yml
Comment thread infra/observability/otel-collector-config.yaml
Comment thread miot-harness/src/miot_harness/integrations/nexo/primitives.py Outdated
Comment thread miot-harness/src/miot_harness/integrations/nexo/primitives.py Outdated
Comment thread miot-harness/tests/observability/test_langfuse_attribution.py Outdated
Comment thread miot-harness/tests/observability/test_langfuse_attribution.py Outdated
Comment thread miot-harness/tests/observability/test_pricing.py
Comment thread miot-harness/tests/runtime/test_intent_router.py Outdated
Comment thread miot-harness/tests/runtime/test_supervisor_phase_e.py Outdated
odtorres added 3 commits May 14, 2026 17:52
…es gate

Post-merge review of PR #462 found that `SELECT … FOR UPDATE / FOR
SHARE / FOR KEY SHARE` (incl. `NOWAIT` / `SKIP LOCKED`) was ACCEPTED by
the safety gate. These are read-shaped at the AST level (no
Insert/Update/Delete) but acquire row-level locks — against `nexo.dx_*`
snapshot tables they can block the snapshot-refresh job; against Citus
they can escalate to a global lock. The DB role is read-only at runtime,
but defense-in-depth says gate it at the AST level.

Same class as the LATERAL function-in-FROM and side-effect-function
bypasses the earlier E3 review closed: production-affecting side
effects without a DML AST node.

Fix: add `exp.Lock` to `_MUTATION_NODES` in `primitives.py`. sqlglot
parses the `FOR UPDATE` clause into `args["locks"]` on the Select node,
which the walker now catches.

Test: `test_validator_rejects_row_locking` parametrically asserts all
5 PG locking variants raise `MutationRejected`. Test suite: 52 passed
in `tests/integrations/nexo/` (was 47).

Review ref: PR #462 review C1.
…volume

Two Important items from the PR #462 review, neither behavior-changing
on the wire but both improving operability:

I1 — `agentic_graph.route_from_planner` was dead-code: both branches
returned `"synthesizer"` and the conditional-edge dict had only one
entry (`{"synthesizer": "synthesizer"}`). Anyone returning a different
label from a future executor wire-up would have raised at edge time.
Collapsed to a plain `graph.add_edge("planner", "synthesizer")` with a
comment marking the spot for the F-phase wire-up to swap back to a
conditional edge with an expanded mapping.

I2 — `docker-compose.yml` declared persistent volumes for postgres /
clickhouse / clickhouse-logs / minio but NOT for redis. Langfuse v3
uses Redis for the ingestion queue + cache — `docker compose restart`
would lose in-flight events between the OTel Collector and
langfuse-worker. Added `redis-data:/data` named volume; turned on
`--appendonly yes` so the AOF actually persists. Updated the README's
"Volumes used" list (was 4 entries, now 5).

Review ref: PR #462 review I1 + I2.
PR #462 review I4: `_parse_decision` calls `json.loads(text)`, catches
`JSONDecodeError`, then calls `payload.get(...)`. If the LLM returns
valid JSON that isn't a dict (bare string `"hello"`, number `42`, bool
`true`, null, array), `.get(...)` raises `AttributeError`, escapes the
narrow `JSONDecodeError` except clause, and crashes the route call
before the keyword fallback runs.

Fix: `isinstance(payload, dict)` guard immediately after the
`json.loads` — non-dict payloads now fall back to the keyword router
like any other unparseable response.

Test: `test_non_dict_json_falls_back_to_keyword` parametrically
covers bare string / number / bool / null / array. Each case ships
the message "Mintral status" so the keyword router lands on
NEXO_QUERY (verifying the fallback runs end-to-end, not just that
`_parse_decision` returned None).

Review ref: PR #462 review I4.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (2)
infra/observability/docker-compose.yml (2)

102-105: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Hardcoded credentials remain a security posture concern despite local dev scope.

While the inline comments (lines 106-109, 115-119) clearly document these as LOCAL-DEV-ONLY values, committing deterministic credentials still normalizes insecure patterns and creates copy-paste risk into staging/production environments. Git history preserves these forever, and they establish a precedent for credential handling that can leak beyond local development.

🔐 Recommended approach

Even for local dev, prefer environment variable interpolation with a .env.example template:

     environment:
-      DATABASE_URL: postgresql://postgres:postgres@postgres:5432/postgres
+      DATABASE_URL: ${MIOT_OBS_DATABASE_URL:-postgresql://postgres:postgres@postgres:5432/postgres}
-      NEXTAUTH_SECRET: please-change-this-in-prod
+      NEXTAUTH_SECRET: ${MIOT_OBS_NEXTAUTH_SECRET:?required}
-      LANGFUSE_INIT_PROJECT_PUBLIC_KEY: pk-lf-d57c7ed15cd59041736edbcfc679e48e
+      LANGFUSE_INIT_PROJECT_PUBLIC_KEY: ${MIOT_OBS_LANGFUSE_PUBLIC_KEY:?required}

Then provide a .env.example with placeholder documentation. This maintains reproducibility while keeping secrets out of version control.

Also applies to: 28-28, 53-53, 72-72, 85-85, 109-109, 135-135, 139-139, 151-152, 156-156, 165-165, 169-169

Also applies to: 120-121, 124-124

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@infra/observability/docker-compose.yml` around lines 102 - 105, Replace the
hardcoded local-dev credentials (DATABASE_URL, NEXTAUTH_URL, NEXTAUTH_SECRET,
SALT) in docker-compose.yml with environment-variable interpolation and provide
a .env.example template; specifically, change the service env entries to
reference variables (e.g. DATABASE_URL:
"${DATABASE_URL:-postgresql://postgres:postgres@postgres:5432/postgres}" or
simply DATABASE_URL: "${DATABASE_URL}") and add a .env.example with placeholder
values and documentation for the four keys so secrets are not committed and
local defaults remain documented but externalized.

63-75: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Redis healthcheck still fails due to missing authentication.

The healthcheck at line 64 uses redis-cli ping without credentials, but line 72 configures Redis with --requirepass myredissecret. This causes NOAUTH errors, marks the container unhealthy, and breaks the depends_on: service_healthy condition at line 97, preventing langfuse-web from starting.

🔐 Proposed fix
     healthcheck:
-      test: ["CMD", "redis-cli", "ping"]
+      test: ["CMD", "redis-cli", "-a", "myredissecret", "ping"]
       interval: 5s
       timeout: 3s
       retries: 10
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@infra/observability/docker-compose.yml` around lines 63 - 75, The Redis
healthcheck command is missing authentication and fails because Redis is started
with --requirepass myredissecret; update the healthcheck test (the healthcheck
"test" entry that currently runs redis-cli ping) to pass the password (e.g., use
redis-cli -a myredissecret ping or the equivalent redis-cli --pass myredissecret
ping) so the healthcheck authenticates successfully against the server started
with --requirepass myredissecret.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@infra/observability/docker-compose.yml`:
- Around line 102-105: Replace the hardcoded local-dev credentials
(DATABASE_URL, NEXTAUTH_URL, NEXTAUTH_SECRET, SALT) in docker-compose.yml with
environment-variable interpolation and provide a .env.example template;
specifically, change the service env entries to reference variables (e.g.
DATABASE_URL:
"${DATABASE_URL:-postgresql://postgres:postgres@postgres:5432/postgres}" or
simply DATABASE_URL: "${DATABASE_URL}") and add a .env.example with placeholder
values and documentation for the four keys so secrets are not committed and
local defaults remain documented but externalized.
- Around line 63-75: The Redis healthcheck command is missing authentication and
fails because Redis is started with --requirepass myredissecret; update the
healthcheck test (the healthcheck "test" entry that currently runs redis-cli
ping) to pass the password (e.g., use redis-cli -a myredissecret ping or the
equivalent redis-cli --pass myredissecret ping) so the healthcheck authenticates
successfully against the server started with --requirepass myredissecret.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 600b9af6-6529-4f41-9537-3b6b8a282514

📥 Commits

Reviewing files that changed from the base of the PR and between 085595f and 5330dc8.

📒 Files selected for processing (7)
  • infra/observability/README.md
  • infra/observability/docker-compose.yml
  • miot-harness/src/miot_harness/integrations/nexo/primitives.py
  • miot-harness/src/miot_harness/runtime/agentic_graph.py
  • miot-harness/src/miot_harness/runtime/intent_router.py
  • miot-harness/tests/integrations/nexo/test_primitives.py
  • miot-harness/tests/runtime/test_intent_router.py
✅ Files skipped from review due to trivial changes (1)
  • infra/observability/README.md
🚧 Files skipped from review as they are similar to previous changes (5)
  • miot-harness/tests/runtime/test_intent_router.py
  • miot-harness/tests/integrations/nexo/test_primitives.py
  • miot-harness/src/miot_harness/runtime/agentic_graph.py
  • miot-harness/src/miot_harness/runtime/intent_router.py
  • miot-harness/src/miot_harness/integrations/nexo/primitives.py

odtorres and others added 8 commits May 15, 2026 15:54
…e ping

Container starts with --requirepass but healthcheck called `redis-cli ping`
unauthenticated. redis-cli prints `NOAUTH Authentication required.` and
exits 0, so the healthcheck reported "healthy" without ever verifying
Redis was actually responsive. Use `-a` with --no-auth-warning so the
ping is authenticated and the exit status reflects real health.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three lint failures in miot_harness.integrations.nexo.primitives:

- E501 (lines 6-11): docstring table exceeded the 100-char budget.
  Rewrote as a bulleted list — same content, fits the line length.
- B033 (line 119): `to_char` appeared twice in `_SAFE_FUNCTIONS`
  (once under "string", once under "date/time"). Kept the first; the
  set semantics were unchanged but the duplicate is a lint failure.
- I001 (line 24): import block already grouped correctly (future →
  stdlib → third-party); the failure was a stray double blank line
  between imports and the first top-level comment.

Ruff: 0 errors. Tests: tests/integrations/nexo/ → 103 passed, 1 skipped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`isinstance(row, dict)` is always False for asyncpg.Record (it supports
mapping-style subscript access but is not a dict subclass), so every
nexo_explain call against a real DB raised "EXPLAIN output missing
QUERY PLAN" before reaching the cost-gate logic. The existing tests
passed only because their fake_pool fixtures fed plain dicts —
coverage masked the prod bug.

- primitives.py: switch to `row["QUERY PLAN"]` subscript access guarded
  by try/except (KeyError, TypeError, IndexError). Works uniformly
  across asyncpg.Record and plain-dict fixtures.
- test_primitives.py: add test_nexo_explain_handles_record_like_rows
  using a non-dict RecordLike fixture that locks the fix against a
  future `isinstance(row, dict)` regression. Also fix two pre-existing
  I001 nits (out-of-order import, stray blank line) flagged by ruff
  on the touched file.

Ruff: 0 errors. Tests: tests/integrations/nexo/ → 104 passed (+1), 1 skipped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three lint failures in test_langfuse_attribution.py:

- I001 (lines 73-74 and 131-132): inline imports inside two test bodies
  ordered `langchain_core.outputs` before `langchain_core.messages`.
  Reordered so messages comes first.
- F401 (line 15): `pytest` imported but never referenced. Dropped.

Ruff: 0 errors. Tests: tests/observability/test_langfuse_attribution.py → 4 passed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The import block is already correctly grouped (future → stdlib → third-party
→ local) and alphabetized; the actual cause of the I001 failure was a stray
double blank line between the imports and the first test definition.
Collapsed to one.

Ruff: 0 errors. Tests: tests/observability/test_pricing.py → 6 passed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two parametrize tuples at lines 39 and 43 were 102 chars (line-length=100).
Wrapped each tuple across five indented lines (prompt / model / score / route)
— same semantics, fits the budget.

Ruff: 0 errors. Tests: tests/runtime/test_intent_router.py → 10 passed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e-E tests

F401: HarnessRoute was imported at line 29 but never referenced in the file.

Ruff: 0 errors. Tests: tests/runtime/test_supervisor_phase_e.py → 7 passed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Four files not previously touched in this branch still tripped `ruff check
src tests`:

- src/miot_harness/observability/report.py:124 (E501): inline header
  f-string was 102 chars. Extracted to a two-line concatenated literal.
- src/miot_harness/runtime/agentic_graph.py:78 (E501): inline ainvoke
  argument was 103 chars. Pulled into a `prompt` local.
- tests/observability/conftest.py:17 and tests/observability/test_deps.py:12
  (I001 ×2): stray double blank line after the import block — same
  pattern as earlier fixes on this branch.

Ruff: 0 errors over `src tests`. Tests: tests/observability/,
tests/runtime/test_intent_router.py, tests/runtime/test_supervisor_phase_e.py,
tests/integrations/nexo/ → 178 passed, 1 skipped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@infra/observability/docker-compose.yml`:
- Around line 77-88: The MinIO service (minio / container_name miot-obs-minio)
lacks a Docker healthcheck which causes dependents to wait for service_started
instead of service_healthy; add a healthcheck block to the minio service using
an HTTP or mc-based check against :9000 (for example a curl GET
/minio/health/live or mc admin info) with sensible interval/retries/timeout and
start_period so the container reports healthy only when MinIO is ready, and then
update the langfuse-web service dependency to use depends_on: { minio: {
condition: service_healthy } } (or equivalent docker-compose v3 syntax) so
langfuse-web waits for MinIO to be healthy before starting.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: cb96ba78-cc9b-497d-87a5-7215364853d4

📥 Commits

Reviewing files that changed from the base of the PR and between 5330dc8 and 2d21523.

📒 Files selected for processing (11)
  • infra/observability/docker-compose.yml
  • miot-harness/src/miot_harness/integrations/nexo/primitives.py
  • miot-harness/src/miot_harness/observability/report.py
  • miot-harness/src/miot_harness/runtime/agentic_graph.py
  • miot-harness/tests/integrations/nexo/test_primitives.py
  • miot-harness/tests/observability/conftest.py
  • miot-harness/tests/observability/test_deps.py
  • miot-harness/tests/observability/test_langfuse_attribution.py
  • miot-harness/tests/observability/test_pricing.py
  • miot-harness/tests/runtime/test_intent_router.py
  • miot-harness/tests/runtime/test_supervisor_phase_e.py
🚧 Files skipped from review as they are similar to previous changes (10)
  • miot-harness/tests/observability/test_deps.py
  • miot-harness/tests/runtime/test_supervisor_phase_e.py
  • miot-harness/tests/observability/conftest.py
  • miot-harness/src/miot_harness/runtime/agentic_graph.py
  • miot-harness/src/miot_harness/observability/report.py
  • miot-harness/tests/integrations/nexo/test_primitives.py
  • miot-harness/tests/observability/test_pricing.py
  • miot-harness/tests/observability/test_langfuse_attribution.py
  • miot-harness/tests/runtime/test_intent_router.py
  • miot-harness/src/miot_harness/integrations/nexo/primitives.py

Comment thread infra/observability/docker-compose.yml
odtorres and others added 4 commits May 15, 2026 16:50
The minio service had no healthcheck, so langfuse-web's depends_on
fell back to `service_started` — the web container could start while
MinIO was still booting and fail blob uploads on first request.

- minio: probe `/minio/health/live` via `curl -fsS` (both curl and mc
  confirmed present in minio/minio:RELEASE.2025-08-13). Used
  `127.0.0.1` to match the clickhouse IPv6 workaround above. Timing
  mirrors the rest of the stack (5s/5s, 10 retries, 10s start_period).
- langfuse-web depends_on minio: flip `service_started` → `service_healthy`.

Validated: `docker compose config --quiet` passes; recreated minio
container reports `healthy` in ~3s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`Tracer.start_as_current_span(attributes=...)` expects
`Mapping[str, AttributeValue]` where `AttributeValue` is
`str | bool | int | float | Sequence[...]`. The local was annotated
`dict[str, object]`, which mypy can't narrow. Every value placed in
this dict at runtime is already a str (f-string, the str-typed kwargs
past their None guard, and `json.dumps(...)`), so tightening the
annotation to `dict[str, str]` matches reality and clears the error
without any runtime change.

Mypy: 0 errors over 65 files. Tests: tests/observability/test_langfuse_attribution.py → 4 passed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Distribution-Evals job ran `07-attestations-present.sh` without
GH_TOKEN, so `gh attestation verify` exited on the auth precondition
("set the GH_TOKEN environment variable") and the script surfaced it
as a misleading "no attestation verifies" FAIL.

Scoped GH_TOKEN at the step (not the job) — the other two scripts in
the same job (05-pulls-from-ghcr.sh, 06-pulls-from-dockerhub.sh) don't
use gh, so the token stays narrow. The job's existing
`permissions.attestations: read` was already correct; the only missing
piece was exposing github.token to the gh CLI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`docker/build-push-action`'s `provenance: true / sbom: true` embeds
attestations into the OCI image manifest only — they do NOT register
with the GitHub attestation API. The deploy-eval script
`07-attestations-present.sh` uses `gh attestation verify --owner`,
which queries `/orgs/{owner}/attestations/{digest}` and returned 404
because no attestation had ever been registered there.

Added three steps after the build, gated to skip on fork PRs (same
condition as the existing build-push-action `push:` gate, since fork
PRs can't write to org packages):

1. `anchore/sbom-action@v0` writes an SPDX-JSON SBOM file (the embedded
   SBOM from build-push-action is OCI-only; attest-sbom needs a file).
2. `actions/attest-build-provenance@v2` registers SLSA provenance against
   the image digest with GitHub.
3. `actions/attest-sbom@v2` registers the SPDX SBOM against the same
   digest with GitHub.

Both attest-* actions also push to the OCI registry (`push-to-registry:
true`), so the OCI manifest carries attestations from both
build-push-action AND the attest-* family — defense in depth.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
infra/observability/docker-compose.yml (1)

108-110: ⚡ Quick win

Bind published ports to loopback for safer local defaults.

These mappings currently expose Langfuse and OTLP endpoints on all host interfaces. For local-dev defaults, binding to 127.0.0.1 reduces accidental LAN exposure.

🔧 Suggested patch
   langfuse-web:
@@
     ports:
-      - "3000:3000"
+      - "127.0.0.1:3000:3000"
@@
   otel-collector:
@@
     ports:
-      - "4317:4317"   # OTLP gRPC (harness -> collector)
-      - "4318:4318"   # OTLP HTTP (debug)
-      - "8888:8888"   # collector self-metrics
+      - "127.0.0.1:4317:4317"   # OTLP gRPC (harness -> collector)
+      - "127.0.0.1:4318:4318"   # OTLP HTTP (debug)
+      - "127.0.0.1:8888:8888"   # collector self-metrics

Also applies to: 192-195

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@infra/observability/docker-compose.yml` around lines 108 - 110, The ports
mapping currently publishes services to all interfaces (e.g., the ports entry
with "- \"3000:3000\""); change published ports to bind to loopback by prefixing
the host address (e.g., "- \"127.0.0.1:3000:3000\"") for the Langfuse and OTLP
port mappings (and any other similar entries around the other ports referenced
in the file) so they only listen on localhost for safer local-dev defaults.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@infra/observability/docker-compose.yml`:
- Around line 108-110: The ports mapping currently publishes services to all
interfaces (e.g., the ports entry with "- \"3000:3000\""); change published
ports to bind to loopback by prefixing the host address (e.g., "-
\"127.0.0.1:3000:3000\"") for the Langfuse and OTLP port mappings (and any other
similar entries around the other ports referenced in the file) so they only
listen on localhost for safer local-dev defaults.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9fe21d1d-99db-4c61-af51-f7717e09c16e

📥 Commits

Reviewing files that changed from the base of the PR and between 2d21523 and 7dc5bcd.

📒 Files selected for processing (3)
  • .github/workflows/harness.yaml
  • infra/observability/docker-compose.yml
  • miot-harness/src/miot_harness/observability/spans.py

`gh attestation verify` filters by `--predicate-type` and defaults to
`https://slsa.dev/provenance/v1`. The single call we were making only
ever returned the provenance predicate, so the script reported "no SBOM
predicate" even though `actions/attest-sbom@v2` had successfully
registered an SPDX attestation against the same digest.

Now does two calls:
- provenance with explicit `--predicate-type https://slsa.dev/provenance/v1`
  (no behavior change, self-documenting)
- SBOM with `--predicate-type https://spdx.dev/Document`, falling back
  to `https://cyclonedx.org/bom` if the SBOM tooling ever flips formats

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@odtorres odtorres linked an issue May 15, 2026 that may be closed by this pull request
6 tasks
PR #462 shipped four .ralph/ files (RALPH_PROMPT.md, blockers.md,
log.md, state.md) seeded by commit 8107571. They're local ralph-loop
iteration state — worklist, progress log, blockers, runtime prompt —
that should never have been tracked. Non-sensitive, but noise for
reviewers and grows with every loop iteration.

- `git rm -r --cached .ralph/` removes them from the index only;
  files remain on disk so the active ralph session keeps working.
- Add `.ralph/` to root `.gitignore` (under the existing Cursor /
  Aura local-only dev-tooling section) so any future ralph run
  anywhere in the monorepo is ignored by default.

User chose the non-destructive path: no history rewrite, no
force-push. Old commits still reference the files but they're
non-sensitive iteration logs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@korutx korutx left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WIP

@korutx korutx merged commit 85f6fe9 into trunk May 20, 2026
12 checks passed
korutx added a commit that referenced this pull request May 21, 2026
* harness(phase-13): close NEXO_META observation-level tenant tag gap

E10 emitted `langfuse.tags = [tenant:<id>, mode:<m>, ...]` on the root
`nexo.run` trace span, but the inner LLM call in meta mode slipped
through observation-level tagging — `supervisor._run_nexo_meta` called
`meta_agent_node(model=self.meta_model, ...)` with the **raw**
BaseChatModel, so Traceloop's auto-instrumented `anthropic.chat`
observation carried no `modular.{agent,tenant_id,mode}` attrs. Result:
trace-level filters worked; observation-level per-tenant cost rollups
under-counted meta runs.

- `runtime/nexo_graph.py`: rename `_instrument` → `instrument_model`
  so it's importable across modules (no behavior change).
- `runtime/supervisor.py`: import `instrument_model`; in
  `_run_nexo_meta`, wrap `self.meta_model` via
  `instrument_model(self.meta_model, "meta_agent", ctx)` before
  invoking `meta_agent_node`. Per-agent `nexo.meta_agent` span now
  carries the same `modular.{agent,tenant_id,mode}` +
  `langfuse.{user.id,session.id,tags,environment}` attrs as canned /
  agentic paths.

Tests:
- `tests/runtime/test_supervisor_phase_e.py`:
  +`test_meta_mode_per_agent_span_carries_tenant_tag` — runs the meta
  dispatch via the supervisor, asserts the emitted nexo.meta_agent
  span has `modular.agent=meta_agent`, `modular.tenant_id=acme-corp`,
  `modular.mode=meta`, `langfuse.user.id=alice`, and
  `tenant:acme-corp` / `mode:meta` / `agent:meta_agent` all present
  in the JSON-decoded `langfuse.tags`.
- Drop unused `HarnessRoute` import in same file.

Conftest restructure: the `memory_exporter` fixture lived in
`tests/observability/conftest.py`, but the new supervisor test needs
it from `tests/runtime/`. Pytest's conftest scoping only resolves
ancestors, not siblings — promoted the fixture (+ the
`_HarnessSpanExporter` class) to `tests/conftest.py` so any test file
in the suite can use it. `tests/observability/conftest.py` is now a
docstring stub.

Full suite: 310 passed, 1 skipped.

* harness(phase-13): C3 live Langfuse fetcher + MIOT_HARNESS_LANGFUSE_HOST

`observability/report.py` previously errored out on the live path
("live Langfuse fetch is not wired yet — pass --fixture"). Wiring it
to the real Langfuse v3 public REST API.

`config.py`:
- `langfuse_host: str = "http://localhost:3000"` — where the CLI
  fetches traces from. Local docker-compose stack by default;
  override for a deployed Langfuse via env var.
- `.env.example`: documented entry alongside the existing
  `MIOT_HARNESS_LANGFUSE_*` keys block.

`observability/report.py`:
- `_project_langfuse_trace(trace) -> {"attributes": {...}}` — pure
  projection from Langfuse's trace row shape (`tags`, `totalCost`,
  `userId`, `sessionId`) into the `aggregate_cost`-friendly shape.
  Splits the `tenant:` / `mode:` / `agent:` prefixes back into
  `modular.{tenant_id, mode, agent}` so the existing aggregator
  keys still work. Token totals project as 0 (Langfuse v3 keeps
  them on observations, not on trace rows).
- `fetch_traces_window(host, public_key, secret_key, since, now=,
  page_size=100, client=)` — paginated GET against
  `/api/public/traces` with Basic Auth, returning the raw rows.
  Caller-injectable `client` for tests via `httpx.MockTransport`.
  Owns the client lifecycle when none is injected.
- `_live_traces(since)` — operator-facing wrapper reading the three
  env vars (`MIOT_HARNESS_LANGFUSE_{HOST,PUBLIC_KEY,SECRET_KEY}`).
  Raises a clear RuntimeError telling the operator to run
  `./infra/observability/bootstrap.sh` when keys are missing.
- `main()`: replaces the `parser.error("not wired yet")` branch
  with `_live_traces(since)` when no `--fixture` is passed; still
  honours `--fixture` for dry-run / offline use.
- Drive-by: line-wrap `_render_text`'s long header line to satisfy
  ruff E501 (was committed in 462's a2f5e8a).

Tests (+8 new):
- `test_project_langfuse_trace_extracts_tags_and_cost` —
  `tenant:acme` / `mode:agentic` → `modular.{tenant_id, mode}`.
- `test_project_langfuse_trace_handles_missing_tags`.
- `test_project_langfuse_trace_handles_null_total_cost` (Langfuse
  returns `totalCost: null` for traces with no LLM calls).
- `test_project_langfuse_trace_picks_first_agent_tag` —
  multi-agent traces project the first.
- `test_fetch_traces_window_paginates_until_total_pages` — 3-page
  MockTransport stream concatenates correctly.
- `test_fetch_traces_window_forwards_since_as_query_params` —
  asserts `fromTimestamp` / `toTimestamp` are derived from the
  injected `now` and the `since` timedelta.
- `test_fetch_traces_window_sends_basic_auth_header` — asserts
  `Authorization: Basic base64(pk:sk)`.
- `test_fetch_traces_window_strips_trailing_slash_from_host` — no
  `//api/public/traces` corruption when host ends in `/`.
- `tests/test_config.py`: +2 tests for `langfuse_host` default and
  env-var override.

Usage:

  MIOT_HARNESS_LANGFUSE_PUBLIC_KEY=pk-lf-…  MIOT_HARNESS_LANGFUSE_SECRET_KEY=sk-lf-… \
    uv run python -m miot_harness.observability.report --since 7d --by tenant

Full suite: 310 passed, 1 skipped.

* harness(phase-13): docs — Run Modes + per-tenant billing surface

Four documentation updates covering the per-tenant billing flow this
PR ships. No code changes here — all source changes landed in the
preceding two commits.

`miot-harness/README.md`:
- New `## Run Modes` section after `## Design Defaults` covering the
  four `mode` values on `POST /runs`: auto / canned / meta / agentic.
- 5-column table (route / DB-touch / non-Mintral / LLM-calls / use-case)
  for at-a-glance comparison.
- Per-mode subsections with concrete example questions:
  - `### canned — answer FROM the data` (nexo_graph, curated fn_dx_*).
  - `### meta — answer ABOUT the data` (meta_agent_node, no SQL,
    structural guarantee via function signature).
  - `### agentic — explore the data with composable primitives`
    (agentic_graph, sqlglot safety gate).
- `### Cost separation in Langfuse` — how `mode:<m>` tag + per-tenant
  filter enables tiered billing (unlimited meta + metered canned).

`infra/observability/README.md`:
- New `## Per-tenant cost rollups` section before "What this is *not*"
  documenting the four ways to get per-tenant cost:
  1. Langfuse UI Filters sidebar (Tags → tenant:<id> → Total Cost).
  2. Cost-report CLI with sample text output.
  3. ClickHouse JOIN query against traces + observations (correctly
     joining the tag-bearing trace row to the cost-bearing observation
     rows — Langfuse v3 only stores total_cost on observations).
  4. Persistent dashboard (pointer to dashboards/per-tenant-cost.md).
- Privacy posture reminder: single-tenant self-host; co-mingled tenant
  data inside one Langfuse project; plan-14 isolates per project.

`infra/observability/dashboards/per-tenant-cost.md` (NEW):
- 2-minute click-by-click UI build procedure for the Per-tenant cost
  widget. Langfuse v3 doesn't yet expose a stable dashboard-import
  endpoint so we ship the build procedure instead of brittle JSON.
- ClickHouse JOIN query as a power-user fallback.
- CLI alternative.

`infra/observability/PER_TENANT_BILLING_TEST.md` (NEW):
- 9-step manual test procedure for end-to-end verification of the
  per-tenant billing surface (tunnel → uvicorn → 3-tenant curl
  battery → UI filter → CLI → NEXO_META observation-level check →
  ClickHouse query → dashboard build).
- All paths are `$REPO_ROOT`-relative (`git rev-parse --show-toplevel`)
  so the doc is portable across developer machines and CI.
- `$DB_SCRIPTS_ROOT` placeholder for the *separate* db-scripts repo
  that owns the SSH tunnel script.
- Langfuse keys read from `docker-compose.yml` via a `grep | awk`
  pipeline rather than hardcoded literal values — clean for forks
  that regenerate the keys.
- Pass/fail checklist at the end with a "which file to inspect when
  each step fails" troubleshooting table.

* harness(phase-13): post-review report.py fixes — settings-based config + (multi) agent sentinel

Two Important items from PR #470's first-pass review:

I1 — `_live_traces` was reading `os.environ.get` directly with a
hardcoded `"http://localhost:3000"` default, even though
`HarnessSettings.langfuse_host` already exists with the same default
(and proper `MIOT_HARNESS_` prefix + `.env` resolution via
pydantic-settings). Two defaults to keep in sync = drift hazard. Now
reads through `get_settings()` — single source of truth, `.env`
resolution + validation apply, the `langfuse_host` field is no longer
dead. `os` import dropped since nothing else used it.

I3 — `_project_langfuse_trace` previously projected the **first**
`agent:<name>` tag and silently dropped the rest. For a canned-mode
trace (which carries 3+ agent tags: filter_expert / synthesizer /
critic / …), this attributed all of the trace's cost to the first
agent and undercounted the others. For `--by tenant` / `--by mode`
this was correct (single-valued tags); for `--by agent` it was wrong
on live data. Now: single-agent traces still project the real name;
multi-agent traces project `modular.agent = "(multi)"` so the
aggregator bucket is visibly distinct. Operators see a `(multi)`
row and know to drill down via the observations endpoint instead of
trusting silently-undercounted numbers.

Tests:
- Replaced `test_project_langfuse_trace_picks_first_agent_tag` with
  `test_project_langfuse_trace_marks_multi_agent_with_sentinel`.
- Added `test_project_langfuse_trace_single_agent_uses_real_name` to
  pin down the unambiguous case.

Full suite: 321 passed, 1 skipped.

Review ref: PR #470 review I1 + I3.

* harness(phase-13): tighten meta-attribution test docstring (PR #470 I2)

The test docstring promised the regression target was "the inner LLM
call must carry the same attrs" — but the test asserts on the
harness-emitted `nexo.meta_agent` span (from `NexoTelemetryCallback`),
not the Traceloop-auto-instrumented `anthropic.chat` child observation
that was the actual gap. `FakeListChatModel` doesn't trigger the
Anthropic SDK instrumentor, so the test verifies the harness-side
wrap, not the third-party-instrumented child observation.

Rewrote the docstring to make the scope explicit:
- What's covered: the per-agent span emitted when `_run_nexo_meta`
  calls `instrument_model(self.meta_model, "meta_agent", ctx)`.
- What's verified separately: the Traceloop child-observation path,
  confirmed live via the ClickHouse query in PR #470's description
  (`SELECT name, tenant_id FROM observations WHERE name='nexo.meta_agent'`).

No code change; docstring-only.

Review ref: PR #470 review I2.

* chore: remove plan files

* chore: remove plan files

* chore: remove plan files

---------

Co-authored-by: Michel David <mdavid.cu@gmail.com>
This was referenced May 21, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat(harness): end-to-end LLM telemetry + agentic supervisor for Nexo (Plan 13)

2 participants