harness(phase-13): per-agent telemetry + agentic search foundation by odtorres · Pull Request #462 · microboxlabs/modulariot

odtorres · 2026-05-12T21:00:10Z

Summary

Phase A (telemetry foundation): OTel GenAI semconv emitters (nexo.run root + nexo.<agent> per-LLM-call spans) wired into every LangGraph node; Traceloop auto-instruments Anthropic/OpenAI SDKs; FastAPIInstrumentor adds HTTP-request spans; versioned pricing table converts token usage to gen_ai.usage.cost_usd at run time.
Phase B (backend deployment): infra/observability/ self-hosted Langfuse v3 stack (postgres + clickhouse + redis + minio + langfuse-web + langfuse-worker + otel-collector) with deterministic init keys, bootstrap.sh, and a README covering bring-up, backup, backend-swap.
Phase E (agentic search): LLM intent router + explicit mode field on RunRequest (auto/canned/meta/agentic); composable DB primitives (nexo_describe/nexo_select/nexo_grep/nexo_explain) with a multi-layer sqlglot safety gate (AST mutation rejection, LATERAL rejection, positive function allowlist, table allowlist ^nexo\.dx_*$, EXPLAIN cost gate); meta-question agent (no SQL); conversational memory (InMemoryConversationStore + ConversationStore Protocol); new agentic LangGraph; route-aware tenancy_gate; provenance JSONL log + curate CLI; cost-report CLI.
Supervisor integration: HarnessSupervisor.run() resolves the route via LLMIntentRouter (with keyword fallback below threshold), dispatches to nexo_graph / agentic_graph / meta_agent_node / storytelling, and round-trips conversation_id through the store. The FastAPI lifespan injects all Phase-E modules on Nexo boot.

What's NOT in this PR (follow-up)

F4 — week of agentic traffic: time-gated by definition; ships as a separate retrospective PR after a week of real use.
C2 dashboards: 6 panels (per-agent cost, cache hit ratio, per-tenant cost, per-mode cost split, latency per agent, tool execution success rate). Stack is up; the dashboards are operator-curated in the Langfuse UI then exported as JSON. Separate PR.
C4 alerting: needs the cost-CLI's live Langfuse fetcher (currently fixture-mode-only) plus 24 h of traffic for thresholds.
Agentic graph executor: the topology (tenancy_gate / planner / synthesizer / critic / summarizer) is wired, but planner and synthesizer are stubs. Connecting them to the composable primitives + provenance log is a contained follow-up; tests and the safety gate already cover the primitive surface.

Known limitations / caveats

Cost-report CLI --by agent is a trace-level approximation. Each trace can carry multiple agent:<name> tags (one per agent that fired during the run), but _project_langfuse_trace projects the first one. For per-tenant and per-mode rollups (single-valued tags) this is correct; for per-agent, dollars get attributed to whichever agent tag came first in the array. True per-agent attribution requires fetching the observations endpoint — ships in the stacked follow-up PR harness(phase-13): per-tenant billing surface (stacked on #462) #470.
NEXO_META inner LLM call lacks observation-level tenant attribution. The root nexo.run trace carries langfuse.tags = [tenant:<id>, …] (E10), but the inner anthropic.chat observation Traceloop auto-emits inside _run_nexo_meta does not — meta_model is invoked raw. Trace-level filters work; observation-level rollups under-count meta runs. Fixed in PR harness(phase-13): per-tenant billing surface (stacked on #462) #470 by wrapping meta_model via instrument_model(self.meta_model, "meta_agent", ctx).

Verification artifacts

Test suite: 295 passed, 1 skipped, 0 failed (uv run pytest from miot-harness/). Plan 12's 151-test baseline preserved; 144 new tests across tests/{observability,runtime,agents,integrations}/.

Live telemetry (Langfuse @ http://localhost:3000): verified via direct ClickHouse query after F2/F3 curl battery:

11 nexo.run root spans
6 nexo.synthesizer + 6 nexo.filter_expert per-agent spans
22 anthropic.chat (Traceloop-auto-instrumented) child spans
Per-mode trace split: 5 auto / 3 canned / 2 meta / 1 agentic
Every span carries modular.{mode, run_id, tenant_id} in its metadata

F2 — canned-mode (mode=\"canned\"): plan 12's G6/G7/G8 still produce Markdown answers from the real Coordinador snapshot.

F3 — mode bypass + auto-router:

mode=\"meta\" for tenant_id=mintral → primer+catalog answer, no DB.
mode=\"meta\" for tenant_id=non-mintral-customer → allowed (meta is non-confidential).
mode=\"agentic\" for tenant_id=non-mintral-customer → ModeAccessDenied at request-validation; 200 response with explicit refusal answer.
mode=\"agentic\" for tenant_id=mintral → fires through the stub planner+synthesizer; emits the right spans.
AUTO route on "dimensionamiento para mañana" → LLMIntentRouter picks NEXO_QUERY at confidence 0.75 → answered from the canned graph.

E3 composable-primitives safety gate (the high-risk surface, reviewed twice):

47 unit tests including 12 deliberate-bypass attempts (LATERAL function-in-FROM, side-effect functions pg_terminate_backend / pg_sleep / set_config / lo_import / pg_reload_conf / pg_advisory_lock / pg_read_file / dblink, multi-statement, CTE with mutation, MERGE, COPY, non-allowlisted schemas).
Positive function allowlist (~80 entries) is the only durable defense against side-effect functions that aren't INSERT/UPDATE/DELETE AST nodes but cause writes, kills, or DoS.

Test plan

uv run pytest -q from miot-harness/ → 295 passed
cd infra/observability && docker compose up -d && ./bootstrap.sh → 7 services healthy, .env paste-block printed
MIOT_HARNESS_OTEL_ENABLED=true uv run uvicorn miot_harness.api.server:create_app --factory → /health returns nexo.enabled=true + curated tool list
POST /runs with mode=canned (×3 scenarios) → 200, real Coordinador data, spans in Langfuse
POST /runs with mode=meta (×2 tenants) → 200, primer+catalog answer, no SQL
POST /runs with mode=agentic (Mintral + non-Mintral) → 200/refusal as appropriate
POST /runs with implicit mode=auto → LLMIntentRouter classifies correctly
Trace metadata in ClickHouse carries modular.mode/modular.run_id/modular.tenant_id for grouping

Code review checkpoints

Two superpowers:requesting-code-review cycles run during the plan:

After Phase A: 5 Important issues found, all addressed (.env.example drift, conftest filter scope, configure_tracing no-op-on-second-call, spans.py docstring, A6 propagation test docstring) — commit 21037769.
After Phase E3 (the high-risk safety gate): 2 Critical bypasses found and closed (LATERAL function-in-FROM, side-effect functions in SELECT list) — commit 684fbd09.

Plan ref

.cursor/plans/ai-first/13-post-nexo-roadmap.md — Phases A, B, C (partial), D (partial), E, F1-F3.

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- Multi-mode request handling: auto-routing, canned responses, meta-agent answers, and agentic workflows
- Conversation history with automatic turn summarization for multi-turn interactions
- LLM-based intent router with confidence thresholds for intelligent request classification
- Cost tracking and pricing reports for LLM usage across models
- Database query safety gates and cost enforcement to prevent expensive operations
Infrastructure
- Local observability stack with Langfuse + OpenTelemetry for development telemetry
- Docker Compose setup for running observability services locally
- Build attestations (SLSA provenance, SBOM) in CI/CD pipeline
Configuration
- New environment variables for OpenTelemetry and Langfuse integration
- Intent router confidence thresholds and cost limits for query operations

- Add `opentelemetry-{api,sdk,exporter-otlp}`, `opentelemetry-instrumentation-fastapi`, `traceloop-sdk` to runtime deps and `pytest-opentelemetry` to dev deps. - `uv sync` resolves 163 packages cleanly with no version conflicts against the existing langchain/langgraph/fastapi/pydantic pins. - New smoke tests under `tests/observability/test_deps.py` exercise the public import surface each Phase-A task depends on (TracerProvider, OTLP gRPC trace exporter, FastAPI instrumentation, traceloop.sdk, pytest_opentelemetry). - Seed `.ralph/` state, log, blockers, and RALPH_PROMPT.md so iteration progress is tracked in-tree. Plan ref: .cursor/plans/ai-first/13-post-nexo-roadmap.md (Phase A · A1)

Add `miot_harness.observability` with four units that the rest of Phase A plugs into: - `pricing.py` — versioned `ModelPricing` table (Opus 4.7, Sonnet 4.6, Haiku 4.5, gpt-4o-mini) keyed by model id with `effective_date` so historical reads stay stable; Anthropic-style `TokenUsage` buckets; `compute_cost` returns a `Decimal` quantized to 6 decimals; raises `UnknownModelError` for unmapped models. - `otel.py` — `configure_tracing(enabled, service_name, endpoint, environment)` builds a `TracerProvider` with a Resource carrying `service.name` / `service.namespace=modular` / `deployment.environment` and a `BatchSpanProcessor` wrapping `OTLPSpanExporter`. Returns `None` when disabled so the FastAPI lifespan (A4) can no-op without branches. `shutdown_tracing` is None-safe. - `spans.py` — `agent_span(name, *, run_id, tenant_id?, mode?)` context manager that opens a `nexo.<name>` span carrying `gen_ai.operation.name`, `modular.run_id`, `modular.tenant_id`, and `modular.mode` (last two omitted when unset) and nests under the current context so traces form a tree. - `callbacks.py` — `NexoTelemetryCallback(BaseCallbackHandler)` tracking per-`run_id` span handles in a dict (so concurrent LLM calls under the same node do not cross-talk). Maps LangChain's total-style `usage_metadata.input_tokens` back into Anthropic buckets, infers provider from `serialized.id`, and emits `gen_ai.{system, request.model, usage.input_tokens, usage.output_tokens, usage.cache_read.input_tokens, usage.cache_creation.input_tokens, usage.cost_usd}` plus `modular.{agent, run_id, tenant_id, mode}` on `on_llm_end`. `on_llm_error` records the exception and sets `Status(ERROR)`. Tests under `tests/observability/` attach an `InMemorySpanExporter` to the `TracerProvider` already installed by pytest-opentelemetry (OTel's global provider is install-once) and filter to `nexo.*` spans so the plugin's own per-test spans don't pollute assertions. 17 new tests cover pricing math, span context, configure_tracing return contract, and the callback's full span-emission surface incl. concurrent runs and provider inference. Full suite: 175 passed, 1 skipped. Plan ref: .cursor/plans/ai-first/13-post-nexo-roadmap.md (Phase A · A2)

Per-agent attribution from day one: - `runtime/nexo_graph.py`: new `_instrument(model, agent_name, ctx)` helper that returns `model.with_config(callbacks=[NexoTelemetryCallback(...)])`, threading `ctx.run_id` and `ctx.tenant_id` through every LLM-bearing seat (filter_expert, domain_analyst, synthesizer, critic, summarizer). Tool-only seats (data_fetcher, freshness_judge) stay model-less and produce no callback spans. - `runtime/supervisor.py`: wrap `nexo_graph.ainvoke(...)` in `agent_span("run", run_id, tenant_id)` so a `nexo.run` root span lives over the whole invocation. Combined with the callback's per-agent spans, Langfuse can render a complete tree per run; the `modular.run_id` attr on every span lets us regroup even when LangGraph's parallel branches break OTel context propagation (the gotcha A6 covers). Tests: `tests/observability/test_graph_wiring.py` builds the real `build_nexo_graph(...)` with scripted `FakeListChatModel`s and asserts the graph emits `nexo.<agent>` spans carrying `modular.{agent, run_id, tenant_id}` for every seat that fired, plus a `nexo.run` root with the right `gen_ai.operation.name`/`modular.run_id`/`modular.tenant_id`. Full suite: 177 passed, 1 skipped. Plan ref: .cursor/plans/ai-first/13-post-nexo-roadmap.md (Phase A · A3)

Pull A5 ahead of A4 so the FastAPI lifespan (A4) has typed settings to read instead of falling back to raw env reads. New `HarnessSettings` fields, all under the existing `env_prefix="MIOT_HARNESS_"`: - `otel_enabled: bool = False` — telemetry is off by default; the lifespan no-ops on configure_tracing when this is unset. - `otel_endpoint: str = "http://localhost:4317"` — OTLP gRPC target. - `otel_service_name: str = "miot-harness"` — resource attr. - `otel_environment: str = "local"` — resource attr (`deployment.environment`). - `langfuse_public_key: str | None = None` - `langfuse_secret_key: str | None = None` Tests: 4 new in `test_config.py` covering safe defaults + env overrides for both groups. Full suite: 181 passed, 1 skipped. Plan ref: .cursor/plans/ai-first/13-post-nexo-roadmap.md (Phase A · A5)

Wire telemetry boot into the FastAPI lifespan: - Before Nexo boot, call `configure_tracing(enabled=settings.otel_enabled, service_name, endpoint, environment)`. When enabled, follow with `Traceloop.init(app_name, api_endpoint, disable_batch=False, telemetry_enabled=False)` so Anthropic/OpenAI SDK calls auto-emit `gen_ai.*` child spans against the same global TracerProvider that `NexoTelemetryCallback` per-agent spans flow through. Self-hosted — `telemetry_enabled=False` blocks Traceloop's phone-home pings. - `app.state.tracer_provider` exposes the provider for introspection. - `shutdown_tracing` runs in BOTH `finally` branches of the lifespan (the early Nexo-disabled return AND the regular pool-close path) so the batch exporter always flushes on shutdown. Tests: `tests/observability/test_lifespan_init.py` patches `configure_tracing`/`Traceloop`/`shutdown_tracing` and asserts: - Off-path: configure_tracing called once with `enabled=False`, `Traceloop.init` NOT called. - On-path: configure_tracing forwards all four settings, `Traceloop.init` called with `app_name=service_name`, `shutdown_tracing` called with the returned provider on teardown. Full suite: 183 passed, 1 skipped. Plan ref: .cursor/plans/ai-first/13-post-nexo-roadmap.md (Phase A · A4)

Phase A's known-risk test: when LangGraph fans out into parallel branches, OTel parent-context propagation can break — child spans land under the wrong parent and usage attribution gets mixed. `tests/observability/test_propagation.py` builds a 3-node synthetic graph `planner → (branch_a ∥ branch_b) → joiner` where each parallel branch instantiates its own `NexoTelemetryCallback`. After the run we assert every `nexo.*` span carries the right `modular.{agent, run_id, tenant_id}` — the explicit attrs Langfuse uses to regroup spans when the OTel context-propagation tree is unreliable. Second test: 20 concurrent callbacks via `asyncio.gather` confirm the per-`run_id` span-dict inside `NexoTelemetryCallback` doesn't cross-talk under realistic concurrency (multiple LangGraph parallel branches). The two other A6 test files (`test_callbacks.py` for span emission, `test_pricing.py` for cost math) were written under A2 as part of TDD on the source — A6 adds only the propagation piece. Phase A is now complete; Phase B (backend deployment) is blocked on operator preconditions per `.ralph/blockers.md`. Full suite: 185 passed, 1 skipped. Plan ref: .cursor/plans/ai-first/13-post-nexo-roadmap.md (Phase A · A6)

Reviewer flagged 0 Critical / 5 Important issues during the Phase A checkpoint. All five fixed plus one recommendation: 1. `.env.example` was missing the new OTel/Langfuse keys, violating the file's "every config.py setting documented here" invariant. Added commented entries for MIOT_HARNESS_OTEL_{ENABLED,ENDPOINT, SERVICE_NAME,ENVIRONMENT} and the LANGFUSE keys. 2. `tests/observability/conftest.py` filtered to `nexo.*` only — Traceloop auto-instrumentation spans (`gen_ai.*` / `anthropic.*` / `openai.*`) would have been silently dropped in Phase B tests. Broadened the filter and renamed `_NexoSpanFilter` → `_HarnessSpanExporter`. 3. `observability/otel.py` silently no-op'd on re-init: a second `configure_tracing(enabled=True)` call with a provider already installed returned a NEW SDK provider whose spans went nowhere (`set_tracer_provider` is install-once). Now we detect the existing `TracerProvider`, log a warning, and return that one instead. Test rewritten with `patch` on `trace.get_tracer_provider`/`set_tracer_provider` to exercise both branches without polluting global state. 4. `observability/spans.py` docstring claimed nodes wrap themselves in `agent_span`, but no node body does that. The callback layer emits `nexo.<agent>` for LLM calls; `agent_span` is currently only used for the root `nexo.run` in `supervisor.py`. Docstring corrected. 5. A6 propagation test (`test_propagation.py`) used `FakeListChatModel` without any real `await` suspension — branches ran to completion inside their own tasks and never genuinely interleaved. Added `await asyncio.sleep(0)` around each `ainvoke` so the event loop yields between branches, and tightened the docstring to scope what the test does and doesn't prove (structural attr defense vs. OTel-context propagation). Plus reviewer Recommendation 4: wired `FastAPIInstrumentor.instrument_app(app, tracer_provider=...)` in the lifespan so HTTP request spans parent `nexo.run` and the full request → graph → LLM tree is visible in Langfuse. Full suite: 186 passed, 1 skipped.

Skip ahead of blocked B/C/D (operator-precondition gated per `.ralph/blockers.md`) and ship the offline-doable agentic-search foundation. `runtime/intent_router.py` — new `LLMIntentRouter`: - Async `route(message)` that asks the LLM to classify into one of NEXO_QUERY | NEXO_META | NEXO_AGENTIC | STORYTELLING_RUN | DIRECT | OTHER with calibrated confidence. - Tolerant JSON parsing: accepts bare JSON or a ```json``` fenced block (real LLMs often wrap). Unknown route names or malformed JSON fall back to the keyword router. - Confidence threshold (default 0.7): when LLM is uncertain, the deterministic keyword router decides — never silently misroute. `runtime/mode_resolver.py` — new `resolve_mode`: - `mode="auto"` → LLM router decides. - `canned` → `NEXO_QUERY`, no LLM call. - `meta` → `NEXO_META`, allowed for any tenant (meta-info is non-confidential per the plan's tenant-gate decision). - `agentic` → `NEXO_AGENTIC`, but only when `request.tenant_id == tenant_lock`; otherwise raises `ModeAccessDenied` BEFORE any LLM or graph call — request rejected at validation time. `runtime/router.py` — extended `HarnessRoute` enum with NEXO_META, NEXO_AGENTIC, OTHER. Existing keyword router unchanged; it now serves as the deterministic fallback in LLM uncertain / failure paths. `runtime/context.py` — `UserRequest.mode: RunMode = "auto"` (`Literal["auto","canned","meta","agentic"]`). Pydantic rejects unknown values (HTTP 422 in FastAPI). `config.py` — `intent_router_model` (default Haiku tier), `intent_router_confidence_threshold` (default 0.7). Tests: 11 new under `tests/runtime/`: - `test_intent_router.py`: 30-prompt confusion matrix asserts ≥90% accuracy; plus low-conf fallback, unparseable response, fenced JSON, and invalid-route-name paths. - `test_mode_resolver.py`: all four mode-bypass paths (canned, meta, agentic-mintral, agentic-rejected), the auto delegation path, and pydantic validation rejecting `mode="bogus"`. `.ralph/state.md`: note the out-of-order execution — E1 done before B-D since those are operator-blocked. Full suite: 197 passed, 1 skipped. Plan ref: .cursor/plans/ai-first/13-post-nexo-roadmap.md (Phase E · E1)

New `agents/meta_agent.py` (Haiku tier) answers schema / primer / introspection questions WITHOUT touching the DB: - `MetaAgentCatalogEntry` — narrow projection of `FunctionDescriptor` the agent needs (name, layer, title, body). Callers feed either the live introspection cache (production) or a hand-crafted list (tests). - `meta_agent_node(state, *, model, primer, catalog)` — composes a system prompt from `primer + formatted catalog`, asks the LLM, and returns a `{"answer": str}` delta. The function signature **deliberately omits** any `registry` / `tools` / `pool` parameter: there's nothing to invoke and nothing to SQL against. The "no SQL" guarantee is structural, not aspirational. Tests (`tests/agents/test_meta_agent.py`): - Agent answers reference catalog functions by name. - Primer text reaches the system prompt (so es_critico explanations stay grounded). - Every catalog entry's title + body appears in the system prompt. - Signature-level test asserts no DB/tool parameter ever creeps in via a future refactor. Full suite: 201 passed, 1 skipped. Plan ref: .cursor/plans/ai-first/13-post-nexo-roadmap.md (Phase E · E2)

THE high-risk surface of plan 13: the agent gains four read-only DB primitives. The safety gate has to be airtight before they go live. Deps: - Added `sqlglot>=25.0` (resolves to 30.7.0). Missed in A1; the plan calls for it. Config: - `nexo_explain_cost_threshold: float = 10000.0` — PostgreSQL plan-cost units. Above this, `nexo_explain` refuses the query. `integrations/nexo/primitives.py`: Safety gate `validate_select_sql(sql)`: - Parses with sqlglot's postgres dialect. - Rejects bare-or-trailing `;`, multi-statement input, or empty parses (`MultiStatementRejected`). - Walks the AST for mutation nodes (Insert/Update/Delete/Drop/Create/ Alter/AlterColumn/Grant/TruncateTable) — even inside CTEs (`MutationRejected`). - Requires the top-level construct to be SELECT/Subquery/Union (`UnsupportedConstruct`). - Asserts every `exp.Table` matches `^nexo\.dx_[a-zA-Z0-9_]+$` — `pg_catalog`, `information_schema`, `public.*`, even `nexo.private_*` are rejected (`AllowlistViolation`). - Returns the validated AST so callers can re-serialize from it. Primitives: - `nexo_describe(table)` — narrow allowlist; queries `information_schema.columns` with parameterised schema/name. Bypasses the SELECT gate (own allowlist). - `nexo_select(table, where?, order?, limit?, columns?)` — default LIMIT 100, hard cap 5000 (silent clamp). Identifiers quoted; SQL passes through `validate_select_sql` before execution. - `nexo_grep(table, column, pattern, limit?)` — sugar over `WHERE col ILIKE $1`. Pattern parameterised; column identifier validated against `[a-zA-Z_][a-zA-Z0-9_]*`. - `nexo_explain(query, cost_threshold)` — runs EXPLAIN (FORMAT JSON); raises `CostGateViolation` if total cost exceeds threshold. Tests (`tests/integrations/nexo/test_primitives.py`): 27 new. Safety gate (no DB): - 3 multi-statement rejections (bare `;`, `; INSERT`, `SELECT; SELECT`). - 8 mutation rejections (INSERT/UPDATE/DELETE/DROP/TRUNCATE/CREATE/ ALTER/GRANT). - 5 non-allowlisted rejections (pg_catalog, information_schema, pg_class, public.users, nexo.private_table). - 3 allowlist-accepted SELECTs. - 1 CTE-with-INSERT rejection. Functional (mocked pool that mimics asyncpg's real `acquire()` shape — sync method returning an async context manager): - `nexo_describe` returns parsed rows + rejects non-allowlisted target. - `nexo_select` clamps LIMIT to 5000; default is 100. - `nexo_grep` builds the right ILIKE WHERE with a quoted column. - `nexo_explain` refuses above threshold; accepts below; returns `{total_cost, node_type, plan}`. Full suite: 228 passed, 1 skipped. Loop-prompt gate met: "Move past E3 without safety tests green" is satisfied. Plan ref: .cursor/plans/ai-first/13-post-nexo-roadmap.md (Phase E · E3)

Code review of E3 verified two Critical bypass paths through the composable-primitives safety gate. Both fixed before E4 ships. C1 — `LATERAL <function>` bypass. `SELECT * FROM nexo.dx_servicios, LATERAL pg_read_file('/etc/passwd')` was accepted by the gate because `LATERAL` holds an `exp.Anonymous`, not an `exp.Table` — the allowlist iterator never saw it. Fix: `validate_select_sql` walks `exp.Lateral` nodes and rejects with `UnsupportedConstruct`. Agentic queries don't need LATERAL. C2 — side-effect functions in SELECT list. `pg_terminate_backend`, `pg_cancel_backend`, `set_config`, `pg_sleep`, `lo_import`, `pg_reload_conf`, `pg_advisory_lock`, `pg_read_file`, `dblink` are all SELECT-shaped per the AST (no Insert/Update/Delete node) but cause DoS, session-state mutation, file exfiltration, or kill other backends. Positive function allowlist (`_SAFE_FUNCTIONS`, ~80 entries: aggregates, math, string, date/time, JSON, array, window) is the only durable defense. `_function_name_candidates(func)` returns the FULL set of sqlglot spellings for a given function — `Anonymous.this`, `sql_name()`, `key`, class name — so the gate matches regardless of how sqlglot normalizes the AST (`to_char` → `TimeToStr`/`time_to_str`/`timetostr`, `date_trunc` → `TimestampTrunc`/`timestamp_trunc`/`timestamptrunc`, `now()` → `CurrentTimestamp`/`current_timestamp`). The allowlist includes every form. Additional review fixes: I1 — CTEs were silently broken (`WITH ok AS (SELECT ...) SELECT * FROM ok` returned a confusing `AllowlistViolation: table 'ok' is outside allowlist`). Fix: collect CTE alias names from `exp.With.expressions` and exempt them from the table allowlist. The CTE body itself was already walked, so the inner SELECT's tables still get checked. I4 — `validate_select_sql` returned the AST but `nexo_select`/`nexo_grep`/ `nexo_explain` sent the raw user string to the DB. Fix: new `_render_safe(ast)` returns `ast.sql(dialect='postgres')` — comments, whitespace tricks, and any structurally-redundant input get normalized away. DB sees only what the gate validated. M3 — `_MUTATION_NODES` gained `exp.Merge` (PG15+ MERGE) and `exp.Copy`. Tests added (20 new in `test_primitives.py`): - 3 LATERAL function-in-FROM rejections - 9 side-effect function rejections in SELECT list - 5 safe-builtin acceptances (count, lower+max, coalesce+to_char+now, date_trunc, json_build_object) - 1 CTE-allowlisted acceptance - 2 MERGE/COPY rejections Direct re-probe via `uv run python` confirms all 8 bypass attempts from the review are now blocked. Full suite: 248 passed, 1 skipped. Plan ref: .cursor/plans/ai-first/13-post-nexo-roadmap.md (Phase E · E3 review)

`observability/provenance.py`: - `ProvenanceEntry` dataclass: question, sql, plan_cost, rows_returned, refreshed_at (optional), run_id, tenant_id. - `ProvenanceLog(root, enabled=True)`: append-only JSONL log partitioned by UTC date (`YYYY-MM-DD.jsonl`). Creates the parent directory on first write so callers don't have to. Naive `now` arguments coerced to UTC. `enabled=False` makes every call a no-op so production can flip the log off without untangling call sites. `scripts/provenance-curate.py` (new, worktree-root scripts dir): - argparse CLI reading the last N days of JSONL. - WHERE-clause normalization: collapses `'literals'` and numeric values to `$`, so `WHERE id = 5` and `WHERE id = 7` cluster into one bucket. Repeated (table, where-pattern) combos surface as new curated `fn_dx_*` function candidates. - Surfaces three things in the weekly diff: 1. Top-N most-repeated (table, where-pattern) buckets. 2. Cross-tenant tables (≥2 tenants) — multi-tenant abstraction signal for plan 14. 3. Plan-cost outliers — index or denial candidates. - `--format json` for machine-readable output. Tests (`tests/observability/test_provenance.py`, 7 new): - Every field round-trips through JSONL. - Append-only (one line per call). - Date partitioning across days. - mkdir on missing parent dir. - Disable flag is a true no-op. - `refreshed_at` may be None. Full suite: 255 passed, 1 skipped. Plan ref: .cursor/plans/ai-first/13-post-nexo-roadmap.md (Phase E · E4)

`runtime/conversation.py`: - `ConversationTurn(user_message, assistant_answer)` — dataclass. - `ConversationHistory(conversation_id, turns, summary)` — dataclass. - `ConversationStore` Protocol — the v2 Redis-backed implementation drops in here without changing call sites. - `InMemoryConversationStore` — v1 dict-keyed store. Lost on process restart; acceptable for v1. - `summarize_if_needed(conversation_id, summarizer)` returns False below `summarize_at_turns` (default 10) and fires the async summarizer at the (n+1)th turn, storing the result in `history.summary` so the next prompt can be compacted before context-window pressure hits. `UserRequest.conversation_id: str | None = None` and `HarnessRunRecord.conversation_id: str | None = None` added so Langfuse groups multi-turn runs by this attribute. None for one-shot requests. Tests (`tests/runtime/test_conversation.py`, 7 new): - `get` returns None for unknown id. - Append round-trips a single turn. - Order preserved across multiple turns. - Two conversations are isolated. - `summarize_if_needed` no-ops below threshold (asserts summarizer not called). - `summarize_if_needed` fires above threshold, stores summary. - `conversation_id` round-trips across two /runs calls — the seam contract that matters most. Full suite: 262 passed, 1 skipped. Plan ref: .cursor/plans/ai-first/13-post-nexo-roadmap.md (Phase E · E5)

E6 — new `runtime/agentic_graph.py`: - `build_agentic_graph(settings, models, provenance_log)` returns a compiled LangGraph `StateGraph[NexoState]` with the nodes the plan names: `tenancy_gate / planner / synthesizer / critic / summarizer`. - Entry point is the tenancy gate; planner bumps `turn_count` and exits with a failure at 12 (the agentic turn cap, vs 8 for plan- execute). - Behavioral specifics — planner choosing curated-vs-composable, executor invoking primitives, freshness_judge/domain_analyst/ critic verdict routing — are scaffolded but stubbed in this iteration. The plan acknowledges F3 (live verification) is the behavior gate; the wiring tests assert topology + happy-path reachability + turn-cap safety. - `provenance_log` is threaded through the closure so the executor (when implemented) writes one row per primitive call (see E4). E7 — new `runtime/tenancy.py`: - `tenancy_gate_decision(ctx, route, settings) → TenancyDecision( allowed, refusal_message, audit_attr)` — pure function, no I/O. - Behavior matrix from plan 13 §E7: - `NEXO_QUERY` / `NEXO_AGENTIC` → tenant must match `nexo_tenant_lock`; else refuse with a Mintral-only message. - `NEXO_META` → allow any tenant. Off-lock tenants get an audit attribute `tenant.bypass: meta_route` so Langfuse can panel them. - Unknown route → safe-default refuse for off-lock tenants. - `nexo_graph._tenant_gate_node` and `agentic_graph.tenancy_gate` both delegate here. The mode resolver already refuses `mode="agentic"` for non-Mintral at request validation; the graph-level gate is defense in depth. Tests: - 4 in `test_agentic_graph.py`: non-Mintral refusal, happy path runs to synthesizer answer, turn-cap exit, required-nodes topology check. - 7 in `test_tenancy_gate.py`: data routes refuse non-Mintral, data routes allow Mintral, NEXO_META allows any tenant + audit attr when off-lock, NEXO_META locked tenant no audit attr, unknown route safe-default refuse. Full suite: 273 passed, 1 skipped. Plan ref: .cursor/plans/ai-first/13-post-nexo-roadmap.md (Phase E · E6, E7)

…spans E9 — the C2 per-mode cost dashboard needs `modular.mode` on every span so it can group canned vs agentic vs meta runs and compare cost-per- question. Wire-up: - `HarnessContext.mode: RunMode = "auto"` — propagated from `UserRequest.to_context()`. - `supervisor._run_nexo` passes `mode=ctx.mode` to `agent_span("run", ...)`, so the root `nexo.run` span carries `modular.mode`. - `nexo_graph._instrument(...)` forwards `ctx.mode` to `NexoTelemetryCallback`, so every per-agent `nexo.<agent>` span Langfuse renders for the run carries the same attribute. Tests (`tests/observability/test_mode_telemetry.py`, 2 new): - Root `nexo.run` carries `modular.mode='agentic'` when the caller sets it on the context. - After running the real graph, every per-agent span carries `modular.mode='canned'` — proves the attr survives from HarnessContext → callback → emitted span. Marks E8 done as well: the seven test files the plan lists for E8 (test_intent_router, test_mode_resolver, test_meta_agent, test_primitives, test_provenance, test_conversation, test_agentic_graph, test_tenancy_gate) were all written via TDD during E1-E7; E8 is implicitly satisfied. The `nexo.critic` non-zero requirement and C2 dashboards in E9 are infrastructure-dependent (Phase B+C blocked on operator preconditions). The attribute surface is wired and verified offline; live verification gates on infra. Full suite: 275 passed, 1 skipped. Plan ref: .cursor/plans/ai-first/13-post-nexo-roadmap.md (Phase E · E8, E9)

`uv run pytest` reports 275 passed, 1 skipped, 0 failed, 0 errors. - Plan 12's 151-test baseline preserved (no regressions). - 124 new tests across `tests/observability/`, `tests/runtime/`, `tests/agents/`, `tests/integrations/nexo/` covering Phase A (telemetry foundation) and Phase E (agentic search + composable primitives) — every offline-verifiable task in plan 13. F2 (canned-mode live verification), F3 (agentic-mode live verification), F4 (one week of agentic traffic), and F5 (PR with verification artifacts) require operator preconditions per `.ralph/blockers.md`: Docker daemon for the Langfuse stack, SSH tunnel to coordinador-prod, and `ANTHROPIC_API_KEY`. Until those are in place the loop cannot complete Phase F. Phase E review checkpoint dispatched separately.

…p gap Phase E code review checkpoint dispatched after E1-E9 + F1 landed. Verdict: 0 Critical / 7 Important / 5 Minor. Critical bypasses from the earlier E3 review remain closed. Cheap Important fixes applied here: - **.env.example** (#4): added MIOT_HARNESS_INTENT_ROUTER_MODEL, MIOT_HARNESS_INTENT_ROUTER_CONFIDENCE_THRESHOLD, and MIOT_HARNESS_NEXO_EXPLAIN_COST_THRESHOLD so the file's own "every config.py setting documented here" invariant holds. - **agentic_graph.py docstring** (#2): rewritten to be honest about which nodes are wired in this iteration (tenancy_gate, planner stub, synthesizer, critic stub, summarizer stub) vs which are deferred to the F-phase supervisor integration (executor, freshness_judge, domain_analyst). The old docstring listed all 8 nodes as if they were present. - **`_ = provenance_log`** (#3): clarified the intent comment so a reader knows the parameter is captured for the future executor node, not silently ignored. Big Important #1 — `HarnessSupervisor` does NOT yet consume the new Phase E modules (resolve_mode, LLMIntentRouter, InMemoryConversation Store, build_agentic_graph, meta_agent_node). The supervisor still uses the keyword `IntentRouter` for routing decisions. The seven new modules are correct and tested in isolation but are an orphan island until the F-phase wire-up lands. Documented as a new OPEN entry in `.ralph/blockers.md` with the concrete 5-step wire-up plan the next iteration follows when operator preconditions are cleared. Phase F status: - F1 (uv run pytest green): **DONE** (275 passed, 1 skipped). - F2, F3, F4 (live verification): **BLOCKED** on operator preconditions (Docker for Langfuse stack, SSH tunnel to coordinador-prod, ANTHROPIC_API_KEY). - F5 (PR open with verification artifacts): **BLOCKED** on F2-F4. Loop output: BLOCKER. Every remaining task requires either operator preconditions or the supervisor wire-up that depends on them. The plan's offline-doable scope is complete; the autonomous loop cannot proceed further from this seat. Full suite: 275 passed, 1 skipped. Plan ref: .cursor/plans/ai-first/13-post-nexo-roadmap.md (Phase E review)

The Phase E modules (LLMIntentRouter, mode_resolver, agentic_graph, meta_agent_node, InMemoryConversationStore) were orphan-islands — HarnessSupervisor.run() still used the keyword router. This commit lands the integration the post-Phase-E review flagged as Important #1. `runtime/supervisor.py`: - `__init__` gains optional kwargs `llm_router`, `agentic_graph`, `meta_model`, `meta_primer`, `meta_catalog`, `conversation_store`, `tenant_lock`. Positional sig unchanged — existing call sites and tests don't break. - `run()` resolves the route via `await resolve_mode(...)` when `llm_router` is injected; else falls back to the keyword router (Plan 12 default). - `ModeAccessDenied` (agentic-mode for non-Mintral) is caught at the route-resolution boundary and surfaced as a refusal answer. - Dispatch tree extended: - NEXO_QUERY → existing `_run_nexo` (nexo_graph.ainvoke) - NEXO_META → new `_run_nexo_meta` (calls meta_agent_node with primer + catalog; no graph, no SQL) - NEXO_AGENTIC → new `_run_nexo_agentic` (agentic_graph.ainvoke, wrapped in `agent_span("run", ...)`) - STORYTELLING_RUN / DIRECT / OTHER unchanged. - `conversation_id` round-trips: hydrate via store on entry, append one `ConversationTurn(user_message, assistant_answer)` on exit when both `conversation_store` and `request.conversation_id` are set. `record.conversation_id` mirrors the request id for telemetry. `runtime/factory.py`: - `build_harness` wires the always-on `InMemoryConversationStore` (it costs nothing without a request). - The Phase-E modules that need live Nexo boot (LLM router, agentic graph, meta model) are wired by the FastAPI lifespan once the Nexo integration succeeds. `api/server.py` lifespan: - After `build_nexo_graph(...)` succeeds, builds the agentic graph (sharing the analyst model as planner), wires `meta_model` + primer + an auto-derived catalog from the registered curated tools, instantiates `LLMIntentRouter` (Haiku tier from settings), and sets `tenant_lock` from `nexo_tenant_lock`. - Falls back gracefully — supervisor's keyword-router path stays intact when any of the boot steps fails. Tests: 7 new in `tests/runtime/test_supervisor_phase_e.py`: - explicit canned mode dispatches to nexo_graph - explicit meta mode calls meta_agent_node (no graph) - explicit agentic mode dispatches to agentic_graph - explicit agentic mode for non-Mintral returns refusal - auto mode uses LLM router (not keyword) - conversation_id round-trips via the store across two calls - backward compatibility: keyword-router-only construction works Full suite: 282 passed, 1 skipped.

`infra/observability/` is the local dev stack for Phase B (plan 13). Authored offline — Docker only needs to be running to *execute* it, not to write it. `docker-compose.yml` (B1): - 7 services on a private `observability` bridge network: - `postgres:16` — Langfuse's primary DB - `clickhouse:24-alpine` — Langfuse's analytics store - `redis:7-alpine` — Langfuse's queue/cache (`requirepass` set) - `minio` — S3-compatible blob storage for event payloads - `langfuse/langfuse:3` (web) — UI + ingest API on `:3000` - `langfuse/langfuse-worker:3` — background jobs - `otel/opentelemetry-collector-contrib:0.95.0` — OTLP gRPC on `:4317`, HTTP on `:4318`, self-metrics on `:8888` - Volumes for postgres, clickhouse, clickhouse-logs, minio. - Health checks gate `depends_on:` so the right boot order happens on `docker compose up -d`. - LANGFUSE_INIT_* envs declare a deterministic `miot-harness-local` project so bootstrap.sh can mint keys idempotently. `otel-collector-config.yaml` (B2): - OTLP gRPC (4317) + HTTP (4318) receivers. - `memory_limiter` (90% soft / 10% spike) + `batch` (200ms / 1KB). - `otlphttp/langfuse` exporter pointing at the in-network `http://langfuse-web:3000/api/public/otel` ingest endpoint. - Comments describe the fan-out pattern (e.g. adding Honeycomb) so swapping backends doesn't require harness re-instrumentation. `bootstrap.sh` (B3): - `docker compose up -d`, then polls `/api/public/health` until 200 (180s timeout, prints last 50 lines of langfuse-web logs on fail). - Calls Langfuse v3's `/api/public/projects/{id}/api-keys` to mint a key pair, parses `publicKey` / `secretKey` out of the JSON, and prints the `.env` lines for the operator to paste — including toggling `MIOT_HARNESS_OTEL_ENABLED=true`. - Falls back to UI instructions when the bootstrap API isn't reachable (Langfuse v3 build pre-bootstrap-API). - Marked `set -euo pipefail` + `chmod +x`. `README.md` (B4): - Bring-up / tear-down, access (UI / collector metrics), .env paste-block, privacy posture (single-tenant, internal, full message storage — not for SaaS). - Backup/restore via `tar` over docker volumes. - Backend-swap recipe (add a second exporter, no harness changes). - Troubleshooting table covering the common boot failures. YAML + shell syntax verified by `yaml.safe_load` and `bash -n`. No tests added — these are config files exercised live in F2/F3. Full suite still: 282 passed, 1 skipped. Plan ref: .cursor/plans/ai-first/13-post-nexo-roadmap.md (Phase B · B1-B4)

C3 from plan 13 — `python -m miot_harness.observability.report --since 7d --by agent|tenant|mode`. The pure aggregation core is offline-doable (unit-tested) and the live-Langfuse fetcher is deferred to F-phase when the stack is up. `observability/report.py`: - `CostRow(key, cost_usd, input_tokens, output_tokens, traces)` — one bucket of the report. - `aggregate_cost(traces, by)` — pure aggregation over a list of trace projections. Groups by `modular.agent` / `modular.tenant_id` / `modular.mode`, sums `gen_ai.usage.cost_usd` as Decimal (6-decimal quantize), tallies `gen_ai.usage.{input,output}_tokens`, counts traces. Skips traces missing `gen_ai.usage.cost_usd`. Sorts rows cost-descending so the dashboard panel renders the cost driver first. - `parse_since('7d' | '24h' | '90m')` returns a `timedelta`; rejects invalid specs with a clear ValueError. - CLI shim: `--fixture <path-to-jsonl>` runs against a static file so operators can sanity-check the math before pointing at Langfuse; `--format json` for machine-readable output. The Langfuse fetcher is intentionally unbuilt; the CLI errors loudly if no fixture is passed, with a pointer to F-phase. Tests (`tests/observability/test_report.py`, 13 new): - aggregate by agent / tenant / mode each sum correctly - traces missing cost are skipped (not silently zeroed) - output is sorted cost-descending - unknown grouping dimension raises ValueError - token totals (in / out) propagate per-bucket - parse_since accepts common specs (1h/24h/7d/30d/90m) - parse_since rejects garbage ('', '1', '7y', 'minus', '1.5d') C2 dashboards and C4 alerting still wait for the live Langfuse stack (JSON exports + real trace queries respectively). C1 live verification is operator-precondition gated per `.ralph/blockers.md`. Stage 1 complete. Stage 2 = user action: install Docker, get ANTHROPIC_API_KEY, start the SSH tunnel. Full suite: 295 passed, 1 skipped. Plan ref: .cursor/plans/ai-first/13-post-nexo-roadmap.md (Phase C · C3)

…ENCRYPTION_KEY, cluster mode, deterministic API keys Four corrections to the Phase B stack found while bringing it up live: 1. **ClickHouse healthcheck used `localhost`** — Docker on macOS doesn't route IPv6 inside containers, so wget resolved `localhost` → `::1` and got refused while ClickHouse was actually listening on `0.0.0.0:8123` (IPv4 only, after the v6 bind failed). Switched the healthcheck to `127.0.0.1`, raised `retries` to 20, added `start_period: 30s` so the slow first-boot migrations don't trip the gate. 2. **`ENCRYPTION_KEY: 000...0` was rejected by Langfuse's Zod schema** ("ENCRYPTION_KEY must be 256 bits, 64 string characters in hex format, generate via: openssl rand -hex 32"). Generated a real one with `openssl rand -hex 32`; comment makes the local-dev-only nature explicit so this isn't reused in prod. 3. **`CLICKHOUSE_CLUSTER_ENABLED: "false"`** added to both langfuse-web and langfuse-worker. Langfuse v3's migrations default to `ReplicatedMergeTree ON CLUSTER default`, which requires ZooKeeper. Disabling cluster mode makes them use plain MergeTree for the single-node local stack. 4. **`LANGFUSE_INIT_PROJECT_PUBLIC_KEY` / `_SECRET_KEY`** added so the API keys are minted deterministically during the project bootstrap pass — no manual UI step, no race condition with the bootstrap script trying to call a key-mint endpoint that doesn't exist in Langfuse v3's public API. Comment explicitly marks these as local-dev-only credentials and explains the rotation path. `otel-collector-config.yaml`: - Replaced the placeholder Authorization header with the real `Basic base64(pk-lf-...:sk-lf-...)` for the new auto-provisioned key pair so traces actually reach Langfuse. `bootstrap.sh`: - Replaced the (broken) `POST /api/public/projects/.../api-keys` attempt with a Postgres-direct read of the public_key (just for confirmation) and a grep over `docker-compose.yml` for the secret. Result: the script always prints the working pair, idempotently. Verified end-to-end: `docker compose down -v && docker compose up -d` boots the 7 services cleanly, langfuse-web reports `{"status":"OK"}` on `/api/public/health`, otel-collector starts with the Langfuse exporter ready, and `./bootstrap.sh` prints the .env paste-block in <5s.

…ion; clear blockers End-to-end live bring-up confirmed: - F2 canned mode (3 of plan-12's golden scenarios): G6 estado, G7 cola crítica, G8 ETA en riesgo. All three returned Markdown answers from the real Coordinador snapshot (with honest stale-data warnings where the snapshot is older than the freshness threshold). - F3 mode-bypass + auto-router: META mode answers from primer+catalog without DB touch, allowed for any tenant. AGENTIC refused at request-validation for non-Mintral (`ModeAccessDenied`). AGENTIC for Mintral runs through the stub planner→synthesizer. AUTO mode classified "dimensionamiento para mañana" as NEXO_QUERY at confidence 0.75 and dispatched correctly. - Direct ClickHouse query confirms Langfuse received: 11 `nexo.run` root spans + 6 `nexo.synthesizer` + 6 `nexo.filter_expert` per-agent spans + 22 `anthropic.chat` (Traceloop auto-instrumented) child spans. Per-mode trace split: 5 auto / 3 canned / 2 meta / 1 agentic. Every span carries `modular.{mode, run_id, tenant_id}` in its metadata. `tests/conftest.py` (new): 7 default-value tests were failing because the populated dev `.env` was being auto-loaded by pydantic-settings, overriding the asserted defaults. Added an autouse fixture that pins `HarnessSettings.model_config["env_file"]` to a non-existent path and clears any `MIOT_HARNESS_*` / provider keys from the inherited shell, so the full suite passes deterministically on a populated dev box and on CI. Full suite: 295 passed, 1 skipped. `.ralph/blockers.md`: both OPEN entries (operator preconditions + supervisor wire-up) marked RESOLVED with the resolution context. `.ralph/state.md`: C1, D1, D2, F2, F3 all → [x]. D3 / F4 stay [ ] (time-gated — 50+ runs over a week). C2 / C4 / F5 stay [ ] (follow-up work after the PR lands).

coderabbitai · 2026-05-12T21:00:28Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2fd475a8-f6c1-4dc1-89a2-7e12fecdbd87

📥 Commits

Reviewing files that changed from the base of the PR and between 7dc5bcd and d7d4f36.

📒 Files selected for processing (2)

.gitignore
miot-harness/evals/deploy/07-attestations-present.sh

✅ Files skipped from review due to trivial changes (1)

.gitignore

📝 Walkthrough

Walkthrough

Introduces a local Langfuse+OTel stack, integrates OpenTelemetry in the FastAPI app, adds pricing, provenance, SQL safety primitives, intent routing, agentic/meta modes, tenancy gate, conversation memory, supervisor/server wiring, a cost-report CLI, CI attestations, and comprehensive tests.

Changes

Plan 13 end-to-end telemetry and runtime modes

Layer / File(s)	Summary
Full slice: infra + observability + runtime modes + tests `infra/observability/*`, `miot-harness/src/...`, `miot-harness/tests/...`, `scripts/provenance-curate.py`, `.github/workflows/harness.yaml`, `.gitignore`, `miot-harness/evals/deploy/07-attestations-present.sh`	Adds local observability stack and OTel wiring, pricing/cost reporting, provenance logging, Nexo SQL-safe primitives, LLM intent routing and mode resolver, agentic/meta graphs, tenancy gate, conversation memory, supervisor/server integration, CLI/reporting, CI attestation updates, and tests.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant FastAPI
  participant Supervisor
  participant Router as Mode/Intent Router
  participant Graph as Nexo/Agentic/Meta Graphs
  participant OTel as OTel Collector
  participant Langfuse

  Client->>FastAPI: POST /runs (user_message, mode, tenant)
  FastAPI->>Supervisor: run(request)
  Supervisor->>Router: resolve_mode(request)
  Router-->>Supervisor: RouteResult(route)
  Supervisor->>Graph: execute(route) with callbacks
  Graph-->>Supervisor: answer, events
  Supervisor-->>FastAPI: response
  FastAPI->>OTel: OTLP traces
  OTel->>Langfuse: Export traces

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related issues

feat(harness): end-to-end LLM telemetry + agentic supervisor for Nexo (Plan 13) #484 — Implements Langfuse+OTel stack, telemetry hooks, agentic/meta routing, primitives, pricing/reporting, and CI attestations described there.
feat(harness): per-tenant billing surface for AI cost attribution #482 — Adds per-agent OTel instrumentation, self-hosted Langfuse, first-class fields, reporting CLI, and matching tests.

Possibly related PRs

microboxlabs/modulariot#445 — Builds on the Nexo coordinator graph/supervisor architecture extended here with telemetry and routing.

Suggested reviewers

korutx

Poem

A rabbit wired the stars to trace,
Little spans hop place to place.
Grafters plan, routers muse,
Tenants gate what modes to choose.
SQL kept safe, costs align—
Langfuse glows, OTel fine.
Thump-thump: shipped Plan 13, divine! 🐇✨

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/harness-phase-13-telemetry-agentic

…nt cost rollups Spotted during the F2/F3 live verification: the Langfuse Tracing UI's filter sidebar (User ID / Session ID / Tags / Environment) was empty for our runs even though `tenant_id` / `user_id` / `conversation_id` were already threaded through every request. Without these promoted to first-class indexed columns, per-client cost rollups would require post-hoc metadata filtering instead of using Langfuse's built-in filter affordances. Langfuse's OTel ingest promotes specific attribute names (`langfuse.user.id`, `langfuse.session.id`, `langfuse.tags`, `langfuse.environment`) from span metadata into first-class trace columns. This commit emits them alongside the existing `modular.*` attrs. Changes: - `HarnessContext` gains `conversation_id: str | None`, propagated from `UserRequest.to_context()` so the conversation surface and Langfuse session_id read from the same source. - `observability/spans.py:agent_span(...)` extended with optional `user_id` / `session_id` / `tags` / `environment` kwargs. Tags serialized as a JSON-encoded string array for portability. - `observability/callbacks.py:NexoTelemetryCallback` gains the same four kwargs. On every `on_chat_model_start`, sets `langfuse.user.id`, `langfuse.session.id`, `langfuse.environment`, and `langfuse.tags` on the per-agent span so observation-level rows in Langfuse carry the same first-class fields as the trace. - `runtime/supervisor.py:_root_span_kwargs(ctx, route)` centralizes the mapping: every dispatch path (NEXO_QUERY / NEXO_AGENTIC / NEXO_META) calls `agent_span("run", **self._root_span_kwargs(...))` with identical kwargs plus a per-route tag (`route:<chosen_route>`). `session_id` falls back from `conversation_id` → `thread_id` so one-shot requests still group under a session key. - `runtime/nexo_graph.py:_instrument(...)` forwards `ctx.user_id`, `ctx.conversation_id or ctx.thread_id`, and a per-agent tag list (`tenant:<id>`, `mode:<mode>`, `agent:<name>`) to each NexoTelemetryCallback. Tests: 4 new in `tests/observability/test_langfuse_attribution.py`: - agent_span emits all four langfuse.* attrs when provided - agent_span omits them when unset (backward compat for old callers) - callback emits langfuse.* on per-agent spans when provided - callback omits them when not provided Live-verified end-to-end: after `POST /runs` with `{tenant_id: mintral, user_id: odtorres, conversation_id: odtorres-debug-2}`, the Langfuse `traces` ClickHouse row shows: user_id = 'odtorres' session_id = 'odtorres-debug-2' tags = ['agent:filter_expert','agent:synthesizer', 'mode:auto','route:nexo_query','tenant:mintral'] environment = 'local' Full suite: 299 passed, 1 skipped. Plan ref: .cursor/plans/ai-first/13-post-nexo-roadmap.md (Phase E · E10, added mid-execution per the plan's "editable during execution" rule).

coderabbitai

Actionable comments posted: 18

Note

Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

miot-harness/tests/conftest.py (1)

29-44: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Clear cached settings inside the isolation fixture.

get_settings() is cached, so env cleanup alone may not take effect in tests that call the accessor after the first instantiation.

Suggested fix

-from miot_harness.config import HarnessSettings
+from collections.abc import Iterator
+
+from miot_harness.config import HarnessSettings, get_settings
@@
-@pytest.fixture(autouse=True)
-def _isolate_settings_from_env(monkeypatch: pytest.MonkeyPatch) -> None:
+@pytest.fixture(autouse=True)
+def _isolate_settings_from_env(monkeypatch: pytest.MonkeyPatch) -> Iterator[None]:
@@
+    get_settings.cache_clear()
@@
-    for key in list(os.environ.keys()):
+    for key in list(os.environ.keys()):
         if key.startswith("MIOT_HARNESS_") or key in {
             "ANTHROPIC_API_KEY",
             "OPENAI_API_KEY",
             "GOOGLE_API_KEY",
         }:
             monkeypatch.delenv(key, raising=False)
+    yield
+    get_settings.cache_clear()

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@miot-harness/tests/conftest.py` around lines 29 - 44, The isolation fixture
_isolate_settings_from_env currently alters HarnessSettings.model_config and
environment variables but does not clear the cached settings, so tests that call
get_settings() after a prior instantiation still see stale values; update
_isolate_settings_from_env to clear the cached accessor by calling
get_settings.cache_clear() (or the appropriate cache-clear method used by
get_settings) after monkeypatching HarnessSettings.model_config and before
returning so subsequent get_settings() calls recreate settings from the modified
env/model_config.

🟡 Minor comments (5)

miot-harness/tests/observability/conftest.py-13-17 (1)
13-17: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix import ordering to clear CI (ruff I001).

The current import block is flagged by Ruff and is failing the lint job.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@miot-harness/tests/observability/conftest.py` around lines 13 - 17, Reorder
the import block to satisfy Ruff's I001 ordering: group and sort imports into
standard library, third-party, and local/application imports, and sort them
alphabetically within each group; specifically ensure pytest and opentelemetry
imports are ordered so that "import pytest" appears before "from opentelemetry
import trace" and the opentelemetry submodule imports (ReadableSpan,
TracerProvider, SimpleSpanProcessor, SpanExportResult, InMemorySpanExporter) are
grouped and sorted consistently.
.ralph/log.md-5-11 (1)
5-11: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add a language identifier to the fenced example block.

This fence triggers markdownlint MD040; adding a language keeps docs lint-clean.
Suggested fix
-```
+```text
 ## Iteration N — YYYY-MM-DD HH:MM
 - Task: <task ID + short>
 - Status: [completed|blocked|partial]
 - Commit: <SHA or "n/a">
 - Notes: <one-line summary; tests run; surprises>
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.ralph/log.md around lines 5 - 11, The fenced code block in the iteration
template is missing a language identifier which triggers markdownlint MD040;
update the triple-backtick fence for the example block (the block that begins
with "" followed by the iteration lines) to include a language identifier such as "text" (e.g. change "" to "```text") so the markdown linter passes
and the example remains unchanged otherwise.
</details>

</blockquote></details>
<details>
<summary>miot-harness/src/miot_harness/config.py-48-55 (1)</summary><blockquote>

`48-55`: _⚠️ Potential issue_ | _🟡 Minor_ | _⚡ Quick win_

**Add bounds validation for new numeric settings.**

These values are externally configurable, but currently accept invalid ranges. Guarding them at load time prevents silent misconfiguration.



<details>
<summary>Suggested fix</summary>

```diff
-    intent_router_confidence_threshold: float = 0.7
+    intent_router_confidence_threshold: float = Field(default=0.7, ge=0.0, le=1.0)
@@
-    nexo_explain_cost_threshold: float = 10000.0
+    nexo_explain_cost_threshold: float = Field(default=10000.0, gt=0.0)
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@miot-harness/src/miot_harness/config.py` around lines 48 - 55, The two
numeric config fields intent_router_confidence_threshold and
nexo_explain_cost_threshold need runtime bounds validation on load: ensure
intent_router_confidence_threshold is within [0.0, 1.0] and
nexo_explain_cost_threshold is a positive number (e.g., > 0.0); implement the
checks in the config initialization path (e.g., Config.__post_init__ or the
function that constructs these values) and raise a clear ValueError with the
invalid field name and value if a check fails so misconfiguration is caught
immediately.
```

</details>

</blockquote></details>
<details>
<summary>miot-harness/.env.example-135-135 (1)</summary><blockquote>

`135-135`: _⚠️ Potential issue_ | _🟡 Minor_ | _⚡ Quick win_

**Fix the stack path typo in the OTEL comment.**

Line 135 has an extra space in the path (`infra/observability/ docker-compose`), which makes copy/paste misleading.

<details>
<summary>✏️ Suggested fix</summary>

```diff
-# infra/observability/ docker-compose stack is up.
+# infra/observability/docker-compose stack is up.
```

</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@miot-harness/.env.example` at line 135, Fix the typo in the OTEL comment by
removing the stray space in the path string "infra/observability/
docker-compose" so it reads "infra/observability/docker-compose"; update the
comment text on the affected line in .env.example accordingly to ensure
copy/paste yields a valid path.
```

</details>

</blockquote></details>
<details>
<summary>miot-harness/tests/observability/test_graph_wiring.py-158-164 (1)</summary><blockquote>

`158-164`: _⚠️ Potential issue_ | _🟡 Minor_ | _⚡ Quick win_

**Assert actual parent-child linkage, not just root presence.**

The test is named `test_root_nexo_run_span_parents_per_agent_spans` yet only verifies root attributes. The established testing pattern in the codebase (test_spans.py) demonstrates that `span.parent` and `span.context.span_id` are available for parent-child assertions. Add explicit parent-id checks to verify that nexo.* agent spans are actually children of the nexo.run root span, preventing broken span parenting from going undetected.

<details>
<summary>🧪 Suggested assertion</summary>

```diff
     spans = memory_exporter.get_finished_spans()
     root = next((s for s in spans if s.name == "nexo.run"), None)
     assert root is not None, "nexo.run root span not emitted"
     attrs = dict(root.attributes or {})
     assert attrs["modular.run_id"] == "run_root_span_test"
     assert attrs["modular.tenant_id"] == "mintral"
     assert attrs["gen_ai.operation.name"] == "nexo.run"
+
+    root_span_id = root.context.span_id
+    for s in spans:
+        if s.name.startswith("nexo.") and s.name != "nexo.run":
+            assert s.parent is not None, s.name
+            assert s.parent.span_id == root_span_id, s.name
```

</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@miot-harness/tests/observability/test_graph_wiring.py` around lines 158 -
164, The test currently only asserts the root "nexo.run" span attributes; add
parent-child assertions to ensure all agent spans are children of that root by
locating the root span (root) from memory_exporter.get_finished_spans(), then
iterate over agent spans (spans with names starting with "nexo." excluding
"nexo.run") and assert each has a non-null parent and that span.parent.span_id
(or span.parentSpanId if that field is used in the Span model) equals
root.context.span_id (or root.context.span_id field used in this codebase);
update test_root_nexo_run_span_parents_per_agent_spans to include these explicit
parent-id checks.
```

</details>

</blockquote></details>

</blockquote></details>

<details>
<summary>🤖 Prompt for all review comments with AI agents</summary>
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @infra/observability/docker-compose.yml:

Around line 96-99: The compose file currently hardcodes deterministic secrets
(e.g., DATABASE_URL, NEXTAUTH_URL, NEXTAUTH_SECRET, SALT and other
DB/Redis/MinIO/Langfuse creds) — replace these literal values with externalized
references (use environment variable interpolation like ${DATABASE_URL}, an
env_file, or Docker secrets) and remove the static defaults from the committed
compose; update the service definitions that reference DATABASE_URL,
NEXTAUTH_SECRET, SALT (and the other listed secret entries) to read from the
external env or secret names, add a .env.example with placeholder keys (not
secrets) and documentation on how to provide real credentials locally/CI, and
ensure no real credentials remain in the file.

Around line 63-69: The Redis healthcheck is failing because the healthcheck
test (healthcheck.test using "redis-cli ping") doesn't authenticate while the
container is started with "--requirepass myredissecret"; update the
healthcheck.test to authenticate (for example use "redis-cli -a myredissecret
ping" or "redis-cli -u redis://:myredissecret@localhost:6379 ping") so it
supplies the password, or reference a secret/env var for the password, keeping
the rest of the healthcheck properties and the command that sets "--requirepass"
unchanged.

In @infra/observability/otel-collector-config.yaml:

Around line 35-43: The otel-collector config currently contains a hardcoded
Basic Authorization header under the otlphttp/langfuse exporter (headers:
Authorization) which must be moved to an environment variable; replace the
literal base64 string in the headers.Authorization value with an env reference
using the OTel Collector syntax (e.g., ${env:LANGFUSE_BASIC_AUTH}), then ensure
the otel-collector service in docker-compose sets LANGFUSE_BASIC_AUTH from the
runtime environment (and document adding the value to a gitignored .env file
generated via printf 'pk...:sk...' | base64). Keep the exporter key
(otlphttp/langfuse) and headers.Authorization as the target of the change so the
config still points to the same endpoint while removing secrets from the repo
and git history.

In @miot-harness/src/miot_harness/integrations/nexo/primitives.py:

Around line 6-11: The file has long-line and import-order lint errors and a
duplicate entry in the safe-functions list: break the long literal/table strings
(lines around the module docstring or top-level comment that contains the
Tool/Shape/Safety table and any other >79 char lines noted at 24-30) into
shorter concatenated or multi-line strings to fix E501; reorder the import block
into standard groups (stdlib, third-party, local) and ensure imports are
sorted/grouped consistently (I001) for the module's import section; and remove
the duplicate "to_char" entry from the _SAFE_FUNCTIONS collection (symbol
_SAFE_FUNCTIONS at line ~113) so there are no repeated function names (B033).
Make these edits in miot_harness.integrations.nexo.primitives (update the
docstring/table wrapping, fix import order, and de-duplicate _SAFE_FUNCTIONS)
and run linters to confirm all three issues are resolved.

Around line 378-381: The code incorrectly assumes rows are dicts by using
isinstance(row, dict) which causes all asyncpg.Record results to be rejected;
instead, in the nexo_explain path obtain the QUERY PLAN by attempting direct
subscript access on row (e.g., try: payload = row["QUERY PLAN"]; except
(KeyError, TypeError, IndexError): raise UnsupportedConstruct("EXPLAIN output
missing QUERY PLAN")) so asyncpg.Record objects are handled properly, then keep
the existing plan = payload[0].get("Plan", {}) logic and raise
UnsupportedConstruct only when the subscript access fails or payload is empty.

In @miot-harness/src/miot_harness/observability/otel.py:

Around line 46-56: When configure_tracing detects an existing global
TracerProvider (variable existing) you must record ownership instead of always
shutting it down later; add a boolean (e.g., tracer_provider_owned) that is set
to False when existing is reused and True when configure_tracing creates/sets a
new provider, and update shutdown_tracing to only call provider.shutdown() if
tracer_provider_owned is True (apply the same ownership check for the other
reuse branch around the code referenced at 72-75). Ensure the boolean is
accessible to both configure_tracing and shutdown_tracing so only providers
created by this module are shut down.

In @miot-harness/src/miot_harness/observability/provenance.py:

Around line 68-70: The current code writes JSON payload and newline in two
separate fh.write calls (inside the with path.open("a", encoding="utf-8")
block), which risks interleaving under concurrent appends; change it so the code
constructs a single string (e.g., record = json.dumps(payload,
ensure_ascii=False) + "\n") and replace the two fh.write(...) calls with a
single fh.write(record) to ensure each JSONL record is written atomically as one
write operation.

In @miot-harness/src/miot_harness/observability/report.py:

Line 124: The header f-string assigned to the variable lines is too long;
split it across multiple shorter string literals (using implicit concatenation
inside parentheses or concatenation with +) so the f-string doesn't exceed the
line-length limit. For example, break f"{'#':>3} {by.upper():<20}
{'COST_USD':>10} {'INPUT':>10} {'OUTPUT':>10} {'N':>5}" into two or more
parts and join them (keep the same f-string interpolation for by.upper()) so the
resulting value of lines remains identical but each source line respects the
linter's max line length.

In @miot-harness/src/miot_harness/runtime/agentic_graph.py:

Line 78: The statement assigning response exceeds the 100-char line limit;
break it into multiple shorter expressions by extracting
snapshot.get("user_message", "") into a local variable (e.g., user_message) and
then calling model.ainvoke with that variable (referencing response,
model.ainvoke, and snapshot.get), or by formatting the list argument across
multiple lines so the full call no longer exceeds 100 characters.

In @miot-harness/src/miot_harness/runtime/intent_router.py:

Around line 115-128: The call to self._model.ainvoke can raise and currently
bubbles up; wrap the await self._model.ainvoke(...) call in a try/except that
catches Exception (or the model client-specific exceptions) and on failure logs
the error (use logger.warning or logger.exception with context) and then returns
self._fallback.route(message) so routing degrades to deterministic keyword
routing; keep the existing flow that still parses the response via
_parse_decision and only falls back when parsing fails or when the ainvoke call
throws.

In @miot-harness/tests/integrations/nexo/test_primitives.py:

Around line 12-30: The import block around nexo test symbols is mis-ordered
for Ruff I001; reformat it so imports are grouped and sorted: keep "from
future import annotations" at the top, then a blank line, then standard
library imports (typing, unittest.mock) sorted alphabetically, then a blank
line, then third-party imports (pytest), then a blank line, then local package
imports from miot_harness.integrations.nexo.primitives; within that from-style
import list (AllowlistViolation, CostGateViolation, MutationRejected,
MultiStatementRejected, UnsupportedConstruct, nexo_describe, nexo_explain,
nexo_grep, nexo_select, validate_select_sql) sort names alphabetically.
Alternatively run isort/ruff --fix to apply the same ordering.

In @miot-harness/tests/observability/test_deps.py:

Around line 8-14: The import block currently violates Ruff I001; reorder and
normalize imports so they follow isort rules: keep "from future import
annotations" first, then the standard-library import "import importlib", then
third-party "import pytest", with a single blank line after the grouped imports;
you can either run the project formatter (isort/ruff --fix) or manually reorder
the three lines ("from future import annotations", "import importlib",
"import pytest") to fix the issue.

In @miot-harness/tests/observability/test_langfuse_attribution.py:

Around line 73-75: The import block in test_langfuse_attribution.py is
unsorted; reorder and group the imports so they pass ruff—place third-party
imports together in alphabetical order and ensure from-imports are alphabetized
(e.g., import AIMessage from langchain_core.messages and ChatGeneration,
LLMResult from langchain_core.outputs) and run the linter to verify formatting;
adjust any import grouping to match the project's import ordering rules used by
ruff.

Around line 131-133: The import block is unsorted; please reorder/group the
imports to satisfy ruff/isort by placing module imports in the correct
alphabetical/grouped order (e.g., put from langchain_core.messages import
AIMessage before from langchain_core.outputs import ChatGeneration, LLMResult)
and ensure the names inside each import are alphabetized and formatted
consistently so linting passes.

Line 15: The file imports pytest but never uses it; remove the unused import
statement (the line importing pytest) from
miot-harness/tests/observability/test_langfuse_attribution.py so the file no
longer contains an unused symbol and the pipeline error is resolved.

In @miot-harness/tests/observability/test_pricing.py:

Around line 5-16: The import block in tests/observability/test_pricing.py is
failing ruff I001 due to incorrect ordering and grouping; reorder imports into
standard library (datetime, decimal) first, then third-party (pytest), then
local package imports, with a blank line between groups, and sort names
alphabetically within each group—ensure the top imports include date and Decimal
from datetime/decimal, then pytest, then the miot_harness.observability.pricing
symbols (PRICING, ModelPricing, TokenUsage, UnknownModelError, compute_cost) in
alphabetical order to satisfy ruff.

In @miot-harness/tests/runtime/test_intent_router.py:

Around line 39-43: Two test-case tuples in test_intent_router.py exceed the
100-character line limit; break each long tuple literal into multiple lines so
no physical line exceeds 100 chars. Locate the offending tuples containing the
strings "filter by region=norte and order by ETA desc" and "draft a story about
today's incident" (they construct tuples with HarnessRoute.NEXO_AGENTIC and
HarnessRoute.STORYTELLING_RUN respectively) and reformat them by placing the
prompt string, model name, score, and route on separate indented lines (or wrap
the prompt string) so the tuple remains the same semantically but each source
line is <=100 chars to satisfy Ruff E501.

In @miot-harness/tests/runtime/test_supervisor_phase_e.py:

Line 29: The import list in the test includes an unused symbol HarnessRoute
which triggers a pipeline failure; remove HarnessRoute from the import statement
that currently reads "from miot_harness.runtime.router import HarnessRoute,
IntentRouter" so only IntentRouter is imported (or switch to a single-name
import of IntentRouter) and run the tests to ensure no other references to
HarnessRoute remain in the file (e.g., check test_supervisor_phase_e.py for any
usage of HarnessRoute).

Outside diff comments:
In @miot-harness/tests/conftest.py:

Around line 29-44: The isolation fixture _isolate_settings_from_env currently
alters HarnessSettings.model_config and environment variables but does not clear
the cached settings, so tests that call get_settings() after a prior
instantiation still see stale values; update _isolate_settings_from_env to clear
the cached accessor by calling get_settings.cache_clear() (or the appropriate
cache-clear method used by get_settings) after monkeypatching
HarnessSettings.model_config and before returning so subsequent get_settings()
calls recreate settings from the modified env/model_config.

Minor comments:
In @.ralph/log.md:

Around line 5-11: The fenced code block in the iteration template is missing a
language identifier which triggers markdownlint MD040; update the
triple-backtick fence for the example block (the block that begins with "" followed by the iteration lines) to include a language identifier such as "text" (e.g. change "" to "```text") so the markdown linter passes and the example
remains unchanged otherwise.

In @miot-harness/.env.example:

Line 135: Fix the typo in the OTEL comment by removing the stray space in the
path string "infra/observability/ docker-compose" so it reads
"infra/observability/docker-compose"; update the comment text on the affected
line in .env.example accordingly to ensure copy/paste yields a valid path.

In @miot-harness/src/miot_harness/config.py:

Around line 48-55: The two numeric config fields
intent_router_confidence_threshold and nexo_explain_cost_threshold need runtime
bounds validation on load: ensure intent_router_confidence_threshold is within
[0.0, 1.0] and nexo_explain_cost_threshold is a positive number (e.g., > 0.0);
implement the checks in the config initialization path (e.g.,
Config.post_init or the function that constructs these values) and raise a
clear ValueError with the invalid field name and value if a check fails so
misconfiguration is caught immediately.

In @miot-harness/tests/observability/conftest.py:

Around line 13-17: Reorder the import block to satisfy Ruff's I001 ordering:
group and sort imports into standard library, third-party, and local/application
imports, and sort them alphabetically within each group; specifically ensure
pytest and opentelemetry imports are ordered so that "import pytest" appears
before "from opentelemetry import trace" and the opentelemetry submodule imports
(ReadableSpan, TracerProvider, SimpleSpanProcessor, SpanExportResult,
InMemorySpanExporter) are grouped and sorted consistently.

In @miot-harness/tests/observability/test_graph_wiring.py:

Around line 158-164: The test currently only asserts the root "nexo.run" span
attributes; add parent-child assertions to ensure all agent spans are children
of that root by locating the root span (root) from
memory_exporter.get_finished_spans(), then iterate over agent spans (spans with
names starting with "nexo." excluding "nexo.run") and assert each has a non-null
parent and that span.parent.span_id (or span.parentSpanId if that field is used
in the Span model) equals root.context.span_id (or root.context.span_id field
used in this codebase); update test_root_nexo_run_span_parents_per_agent_spans
to include these explicit parent-id checks.
</details>

<details>
<summary>🪄 Autofix (Beta)</summary>

Fix all unresolved CodeRabbit comments on this PR:

- [ ]  Push a commit to this branch (recommended)
- [ ]  Create a new PR with the fixes

</details>

---

<details>
<summary>ℹ️ Review info</summary>

<details>
<summary>⚙️ Run configuration</summary>

**Configuration used**: defaults

**Review profile**: CHILL

**Plan**: Pro

**Run ID**: `3cbe69d2-21ba-4b42-8fac-5a2bf59d082a`

</details>

<details>
<summary>📥 Commits</summary>

Reviewing files that changed from the base of the PR and between 2ad28fa5be479da481882bbf55691b9793e96c7c and 085595fcc0621504f0cc6beb7f84bada86ffa889.

</details>

<details>
<summary>⛔ Files ignored due to path filters (1)</summary>

* `miot-harness/uv.lock` is excluded by `!**/*.lock`

</details>

<details>
<summary>📒 Files selected for processing (59)</summary>

* `.ralph/RALPH_PROMPT.md`
* `.ralph/blockers.md`
* `.ralph/log.md`
* `.ralph/state.md`
* `infra/observability/README.md`
* `infra/observability/bootstrap.sh`
* `infra/observability/docker-compose.yml`
* `infra/observability/otel-collector-config.yaml`
* `miot-harness/.env.example`
* `miot-harness/pyproject.toml`
* `miot-harness/src/miot_harness/agents/meta_agent.py`
* `miot-harness/src/miot_harness/api/server.py`
* `miot-harness/src/miot_harness/config.py`
* `miot-harness/src/miot_harness/integrations/nexo/primitives.py`
* `miot-harness/src/miot_harness/observability/__init__.py`
* `miot-harness/src/miot_harness/observability/callbacks.py`
* `miot-harness/src/miot_harness/observability/otel.py`
* `miot-harness/src/miot_harness/observability/pricing.py`
* `miot-harness/src/miot_harness/observability/provenance.py`
* `miot-harness/src/miot_harness/observability/report.py`
* `miot-harness/src/miot_harness/observability/spans.py`
* `miot-harness/src/miot_harness/runtime/agentic_graph.py`
* `miot-harness/src/miot_harness/runtime/context.py`
* `miot-harness/src/miot_harness/runtime/conversation.py`
* `miot-harness/src/miot_harness/runtime/factory.py`
* `miot-harness/src/miot_harness/runtime/intent_router.py`
* `miot-harness/src/miot_harness/runtime/mode_resolver.py`
* `miot-harness/src/miot_harness/runtime/nexo_graph.py`
* `miot-harness/src/miot_harness/runtime/router.py`
* `miot-harness/src/miot_harness/runtime/run_store.py`
* `miot-harness/src/miot_harness/runtime/supervisor.py`
* `miot-harness/src/miot_harness/runtime/tenancy.py`
* `miot-harness/tests/agents/__init__.py`
* `miot-harness/tests/agents/test_meta_agent.py`
* `miot-harness/tests/conftest.py`
* `miot-harness/tests/integrations/nexo/test_primitives.py`
* `miot-harness/tests/observability/__init__.py`
* `miot-harness/tests/observability/conftest.py`
* `miot-harness/tests/observability/test_callbacks.py`
* `miot-harness/tests/observability/test_deps.py`
* `miot-harness/tests/observability/test_graph_wiring.py`
* `miot-harness/tests/observability/test_langfuse_attribution.py`
* `miot-harness/tests/observability/test_lifespan_init.py`
* `miot-harness/tests/observability/test_mode_telemetry.py`
* `miot-harness/tests/observability/test_otel.py`
* `miot-harness/tests/observability/test_pricing.py`
* `miot-harness/tests/observability/test_propagation.py`
* `miot-harness/tests/observability/test_provenance.py`
* `miot-harness/tests/observability/test_report.py`
* `miot-harness/tests/observability/test_spans.py`
* `miot-harness/tests/runtime/__init__.py`
* `miot-harness/tests/runtime/test_agentic_graph.py`
* `miot-harness/tests/runtime/test_conversation.py`
* `miot-harness/tests/runtime/test_intent_router.py`
* `miot-harness/tests/runtime/test_mode_resolver.py`
* `miot-harness/tests/runtime/test_supervisor_phase_e.py`
* `miot-harness/tests/runtime/test_tenancy_gate.py`
* `miot-harness/tests/test_config.py`
* `scripts/provenance-curate.py`

</details>

</details>

…es gate Post-merge review of PR #462 found that `SELECT … FOR UPDATE / FOR SHARE / FOR KEY SHARE` (incl. `NOWAIT` / `SKIP LOCKED`) was ACCEPTED by the safety gate. These are read-shaped at the AST level (no Insert/Update/Delete) but acquire row-level locks — against `nexo.dx_*` snapshot tables they can block the snapshot-refresh job; against Citus they can escalate to a global lock. The DB role is read-only at runtime, but defense-in-depth says gate it at the AST level. Same class as the LATERAL function-in-FROM and side-effect-function bypasses the earlier E3 review closed: production-affecting side effects without a DML AST node. Fix: add `exp.Lock` to `_MUTATION_NODES` in `primitives.py`. sqlglot parses the `FOR UPDATE` clause into `args["locks"]` on the Select node, which the walker now catches. Test: `test_validator_rejects_row_locking` parametrically asserts all 5 PG locking variants raise `MutationRejected`. Test suite: 52 passed in `tests/integrations/nexo/` (was 47). Review ref: PR #462 review C1.

…volume Two Important items from the PR #462 review, neither behavior-changing on the wire but both improving operability: I1 — `agentic_graph.route_from_planner` was dead-code: both branches returned `"synthesizer"` and the conditional-edge dict had only one entry (`{"synthesizer": "synthesizer"}`). Anyone returning a different label from a future executor wire-up would have raised at edge time. Collapsed to a plain `graph.add_edge("planner", "synthesizer")` with a comment marking the spot for the F-phase wire-up to swap back to a conditional edge with an expanded mapping. I2 — `docker-compose.yml` declared persistent volumes for postgres / clickhouse / clickhouse-logs / minio but NOT for redis. Langfuse v3 uses Redis for the ingestion queue + cache — `docker compose restart` would lose in-flight events between the OTel Collector and langfuse-worker. Added `redis-data:/data` named volume; turned on `--appendonly yes` so the AOF actually persists. Updated the README's "Volumes used" list (was 4 entries, now 5). Review ref: PR #462 review I1 + I2.

PR #462 review I4: `_parse_decision` calls `json.loads(text)`, catches `JSONDecodeError`, then calls `payload.get(...)`. If the LLM returns valid JSON that isn't a dict (bare string `"hello"`, number `42`, bool `true`, null, array), `.get(...)` raises `AttributeError`, escapes the narrow `JSONDecodeError` except clause, and crashes the route call before the keyword fallback runs. Fix: `isinstance(payload, dict)` guard immediately after the `json.loads` — non-dict payloads now fall back to the keyword router like any other unparseable response. Test: `test_non_dict_json_falls_back_to_keyword` parametrically covers bare string / number / bool / null / array. Each case ships the message "Mintral status" so the keyword router lands on NEXO_QUERY (verifying the fallback runs end-to-end, not just that `_parse_decision` returned None). Review ref: PR #462 review I4.

coderabbitai

♻️ Duplicate comments (2)

infra/observability/docker-compose.yml (2)
102-105: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Hardcoded credentials remain a security posture concern despite local dev scope.

While the inline comments (lines 106-109, 115-119) clearly document these as LOCAL-DEV-ONLY values, committing deterministic credentials still normalizes insecure patterns and creates copy-paste risk into staging/production environments. Git history preserves these forever, and they establish a precedent for credential handling that can leak beyond local development.
🔐 Recommended approach

Even for local dev, prefer environment variable interpolation with a .env.example template:
     environment:
-      DATABASE_URL: postgresql://postgres:postgres@postgres:5432/postgres
+      DATABASE_URL: ${MIOT_OBS_DATABASE_URL:-postgresql://postgres:postgres@postgres:5432/postgres}
-      NEXTAUTH_SECRET: please-change-this-in-prod
+      NEXTAUTH_SECRET: ${MIOT_OBS_NEXTAUTH_SECRET:?required}
-      LANGFUSE_INIT_PROJECT_PUBLIC_KEY: pk-lf-d57c7ed15cd59041736edbcfc679e48e
+      LANGFUSE_INIT_PROJECT_PUBLIC_KEY: ${MIOT_OBS_LANGFUSE_PUBLIC_KEY:?required}
Then provide a .env.example with placeholder documentation. This maintains reproducibility while keeping secrets out of version control.
Also applies to: 28-28, 53-53, 72-72, 85-85, 109-109, 135-135, 139-139, 151-152, 156-156, 165-165, 169-169

Also applies to: 120-121, 124-124
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@infra/observability/docker-compose.yml` around lines 102 - 105, Replace the
hardcoded local-dev credentials (DATABASE_URL, NEXTAUTH_URL, NEXTAUTH_SECRET,
SALT) in docker-compose.yml with environment-variable interpolation and provide
a .env.example template; specifically, change the service env entries to
reference variables (e.g. DATABASE_URL:
"${DATABASE_URL:-postgresql://postgres:postgres@postgres:5432/postgres}" or
simply DATABASE_URL: "${DATABASE_URL}") and add a .env.example with placeholder
values and documentation for the four keys so secrets are not committed and
local defaults remain documented but externalized.
63-75: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Redis healthcheck still fails due to missing authentication.

The healthcheck at line 64 uses redis-cli ping without credentials, but line 72 configures Redis with --requirepass myredissecret. This causes NOAUTH errors, marks the container unhealthy, and breaks the depends_on: service_healthy condition at line 97, preventing langfuse-web from starting.
🔐 Proposed fix
     healthcheck:
-      test: ["CMD", "redis-cli", "ping"]
+      test: ["CMD", "redis-cli", "-a", "myredissecret", "ping"]
       interval: 5s
       timeout: 3s
       retries: 10
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@infra/observability/docker-compose.yml` around lines 63 - 75, The Redis
healthcheck command is missing authentication and fails because Redis is started
with --requirepass myredissecret; update the healthcheck test (the healthcheck
"test" entry that currently runs redis-cli ping) to pass the password (e.g., use
redis-cli -a myredissecret ping or the equivalent redis-cli --pass myredissecret
ping) so the healthcheck authenticates successfully against the server started
with --requirepass myredissecret.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@infra/observability/docker-compose.yml`:
- Around line 102-105: Replace the hardcoded local-dev credentials
(DATABASE_URL, NEXTAUTH_URL, NEXTAUTH_SECRET, SALT) in docker-compose.yml with
environment-variable interpolation and provide a .env.example template;
specifically, change the service env entries to reference variables (e.g.
DATABASE_URL:
"${DATABASE_URL:-postgresql://postgres:postgres@postgres:5432/postgres}" or
simply DATABASE_URL: "${DATABASE_URL}") and add a .env.example with placeholder
values and documentation for the four keys so secrets are not committed and
local defaults remain documented but externalized.
- Around line 63-75: The Redis healthcheck command is missing authentication and
fails because Redis is started with --requirepass myredissecret; update the
healthcheck test (the healthcheck "test" entry that currently runs redis-cli
ping) to pass the password (e.g., use redis-cli -a myredissecret ping or the
equivalent redis-cli --pass myredissecret ping) so the healthcheck authenticates
successfully against the server started with --requirepass myredissecret.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 600b9af6-6529-4f41-9537-3b6b8a282514

📥 Commits

Reviewing files that changed from the base of the PR and between 085595f and 5330dc8.

📒 Files selected for processing (7)

infra/observability/README.md
infra/observability/docker-compose.yml
miot-harness/src/miot_harness/integrations/nexo/primitives.py
miot-harness/src/miot_harness/runtime/agentic_graph.py
miot-harness/src/miot_harness/runtime/intent_router.py
miot-harness/tests/integrations/nexo/test_primitives.py
miot-harness/tests/runtime/test_intent_router.py

✅ Files skipped from review due to trivial changes (1)

infra/observability/README.md

🚧 Files skipped from review as they are similar to previous changes (5)

miot-harness/tests/runtime/test_intent_router.py
miot-harness/tests/integrations/nexo/test_primitives.py
miot-harness/src/miot_harness/runtime/agentic_graph.py
miot-harness/src/miot_harness/runtime/intent_router.py
miot-harness/src/miot_harness/integrations/nexo/primitives.py

…e ping Container starts with --requirepass but healthcheck called `redis-cli ping` unauthenticated. redis-cli prints `NOAUTH Authentication required.` and exits 0, so the healthcheck reported "healthy" without ever verifying Redis was actually responsive. Use `-a` with --no-auth-warning so the ping is authenticated and the exit status reflects real health. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three lint failures in miot_harness.integrations.nexo.primitives: - E501 (lines 6-11): docstring table exceeded the 100-char budget. Rewrote as a bulleted list — same content, fits the line length. - B033 (line 119): `to_char` appeared twice in `_SAFE_FUNCTIONS` (once under "string", once under "date/time"). Kept the first; the set semantics were unchanged but the duplicate is a lint failure. - I001 (line 24): import block already grouped correctly (future → stdlib → third-party); the failure was a stray double blank line between imports and the first top-level comment. Ruff: 0 errors. Tests: tests/integrations/nexo/ → 103 passed, 1 skipped. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`isinstance(row, dict)` is always False for asyncpg.Record (it supports mapping-style subscript access but is not a dict subclass), so every nexo_explain call against a real DB raised "EXPLAIN output missing QUERY PLAN" before reaching the cost-gate logic. The existing tests passed only because their fake_pool fixtures fed plain dicts — coverage masked the prod bug. - primitives.py: switch to `row["QUERY PLAN"]` subscript access guarded by try/except (KeyError, TypeError, IndexError). Works uniformly across asyncpg.Record and plain-dict fixtures. - test_primitives.py: add test_nexo_explain_handles_record_like_rows using a non-dict RecordLike fixture that locks the fix against a future `isinstance(row, dict)` regression. Also fix two pre-existing I001 nits (out-of-order import, stray blank line) flagged by ruff on the touched file. Ruff: 0 errors. Tests: tests/integrations/nexo/ → 104 passed (+1), 1 skipped. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three lint failures in test_langfuse_attribution.py: - I001 (lines 73-74 and 131-132): inline imports inside two test bodies ordered `langchain_core.outputs` before `langchain_core.messages`. Reordered so messages comes first. - F401 (line 15): `pytest` imported but never referenced. Dropped. Ruff: 0 errors. Tests: tests/observability/test_langfuse_attribution.py → 4 passed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The import block is already correctly grouped (future → stdlib → third-party → local) and alphabetized; the actual cause of the I001 failure was a stray double blank line between the imports and the first test definition. Collapsed to one. Ruff: 0 errors. Tests: tests/observability/test_pricing.py → 6 passed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two parametrize tuples at lines 39 and 43 were 102 chars (line-length=100). Wrapped each tuple across five indented lines (prompt / model / score / route) — same semantics, fits the budget. Ruff: 0 errors. Tests: tests/runtime/test_intent_router.py → 10 passed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…e-E tests F401: HarnessRoute was imported at line 29 but never referenced in the file. Ruff: 0 errors. Tests: tests/runtime/test_supervisor_phase_e.py → 7 passed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Four files not previously touched in this branch still tripped `ruff check src tests`: - src/miot_harness/observability/report.py:124 (E501): inline header f-string was 102 chars. Extracted to a two-line concatenated literal. - src/miot_harness/runtime/agentic_graph.py:78 (E501): inline ainvoke argument was 103 chars. Pulled into a `prompt` local. - tests/observability/conftest.py:17 and tests/observability/test_deps.py:12 (I001 ×2): stray double blank line after the import block — same pattern as earlier fixes on this branch. Ruff: 0 errors over `src tests`. Tests: tests/observability/, tests/runtime/test_intent_router.py, tests/runtime/test_supervisor_phase_e.py, tests/integrations/nexo/ → 178 passed, 1 skipped. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@infra/observability/docker-compose.yml`:
- Around line 77-88: The MinIO service (minio / container_name miot-obs-minio)
lacks a Docker healthcheck which causes dependents to wait for service_started
instead of service_healthy; add a healthcheck block to the minio service using
an HTTP or mc-based check against :9000 (for example a curl GET
/minio/health/live or mc admin info) with sensible interval/retries/timeout and
start_period so the container reports healthy only when MinIO is ready, and then
update the langfuse-web service dependency to use depends_on: { minio: {
condition: service_healthy } } (or equivalent docker-compose v3 syntax) so
langfuse-web waits for MinIO to be healthy before starting.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: cb96ba78-cc9b-497d-87a5-7215364853d4

📥 Commits

Reviewing files that changed from the base of the PR and between 5330dc8 and 2d21523.

📒 Files selected for processing (11)

infra/observability/docker-compose.yml
miot-harness/src/miot_harness/integrations/nexo/primitives.py
miot-harness/src/miot_harness/observability/report.py
miot-harness/src/miot_harness/runtime/agentic_graph.py
miot-harness/tests/integrations/nexo/test_primitives.py
miot-harness/tests/observability/conftest.py
miot-harness/tests/observability/test_deps.py
miot-harness/tests/observability/test_langfuse_attribution.py
miot-harness/tests/observability/test_pricing.py
miot-harness/tests/runtime/test_intent_router.py
miot-harness/tests/runtime/test_supervisor_phase_e.py

🚧 Files skipped from review as they are similar to previous changes (10)

miot-harness/tests/observability/test_deps.py
miot-harness/tests/runtime/test_supervisor_phase_e.py
miot-harness/tests/observability/conftest.py
miot-harness/src/miot_harness/runtime/agentic_graph.py
miot-harness/src/miot_harness/observability/report.py
miot-harness/tests/integrations/nexo/test_primitives.py
miot-harness/tests/observability/test_pricing.py
miot-harness/tests/observability/test_langfuse_attribution.py
miot-harness/tests/runtime/test_intent_router.py
miot-harness/src/miot_harness/integrations/nexo/primitives.py

The minio service had no healthcheck, so langfuse-web's depends_on fell back to `service_started` — the web container could start while MinIO was still booting and fail blob uploads on first request. - minio: probe `/minio/health/live` via `curl -fsS` (both curl and mc confirmed present in minio/minio:RELEASE.2025-08-13). Used `127.0.0.1` to match the clickhouse IPv6 workaround above. Timing mirrors the rest of the stack (5s/5s, 10 retries, 10s start_period). - langfuse-web depends_on minio: flip `service_started` → `service_healthy`. Validated: `docker compose config --quiet` passes; recreated minio container reports `healthy` in ~3s. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`Tracer.start_as_current_span(attributes=...)` expects `Mapping[str, AttributeValue]` where `AttributeValue` is `str | bool | int | float | Sequence[...]`. The local was annotated `dict[str, object]`, which mypy can't narrow. Every value placed in this dict at runtime is already a str (f-string, the str-typed kwargs past their None guard, and `json.dumps(...)`), so tightening the annotation to `dict[str, str]` matches reality and clears the error without any runtime change. Mypy: 0 errors over 65 files. Tests: tests/observability/test_langfuse_attribution.py → 4 passed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The Distribution-Evals job ran `07-attestations-present.sh` without GH_TOKEN, so `gh attestation verify` exited on the auth precondition ("set the GH_TOKEN environment variable") and the script surfaced it as a misleading "no attestation verifies" FAIL. Scoped GH_TOKEN at the step (not the job) — the other two scripts in the same job (05-pulls-from-ghcr.sh, 06-pulls-from-dockerhub.sh) don't use gh, so the token stays narrow. The job's existing `permissions.attestations: read` was already correct; the only missing piece was exposing github.token to the gh CLI. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`docker/build-push-action`'s `provenance: true / sbom: true` embeds attestations into the OCI image manifest only — they do NOT register with the GitHub attestation API. The deploy-eval script `07-attestations-present.sh` uses `gh attestation verify --owner`, which queries `/orgs/{owner}/attestations/{digest}` and returned 404 because no attestation had ever been registered there. Added three steps after the build, gated to skip on fork PRs (same condition as the existing build-push-action `push:` gate, since fork PRs can't write to org packages): 1. `anchore/sbom-action@v0` writes an SPDX-JSON SBOM file (the embedded SBOM from build-push-action is OCI-only; attest-sbom needs a file). 2. `actions/attest-build-provenance@v2` registers SLSA provenance against the image digest with GitHub. 3. `actions/attest-sbom@v2` registers the SPDX SBOM against the same digest with GitHub. Both attest-* actions also push to the OCI registry (`push-to-registry: true`), so the OCI manifest carries attestations from both build-push-action AND the attest-* family — defense in depth. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai

🧹 Nitpick comments (1)

infra/observability/docker-compose.yml (1)

108-110: ⚡ Quick win

Bind published ports to loopback for safer local defaults.

These mappings currently expose Langfuse and OTLP endpoints on all host interfaces. For local-dev defaults, binding to 127.0.0.1 reduces accidental LAN exposure.

🔧 Suggested patch

   langfuse-web:
@@
     ports:
-      - "3000:3000"
+      - "127.0.0.1:3000:3000"
@@
   otel-collector:
@@
     ports:
-      - "4317:4317"   # OTLP gRPC (harness -> collector)
-      - "4318:4318"   # OTLP HTTP (debug)
-      - "8888:8888"   # collector self-metrics
+      - "127.0.0.1:4317:4317"   # OTLP gRPC (harness -> collector)
+      - "127.0.0.1:4318:4318"   # OTLP HTTP (debug)
+      - "127.0.0.1:8888:8888"   # collector self-metrics

Also applies to: 192-195

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@infra/observability/docker-compose.yml` around lines 108 - 110, The ports
mapping currently publishes services to all interfaces (e.g., the ports entry
with "- \"3000:3000\""); change published ports to bind to loopback by prefixing
the host address (e.g., "- \"127.0.0.1:3000:3000\"") for the Langfuse and OTLP
port mappings (and any other similar entries around the other ports referenced
in the file) so they only listen on localhost for safer local-dev defaults.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@infra/observability/docker-compose.yml`:
- Around line 108-110: The ports mapping currently publishes services to all
interfaces (e.g., the ports entry with "- \"3000:3000\""); change published
ports to bind to loopback by prefixing the host address (e.g., "-
\"127.0.0.1:3000:3000\"") for the Langfuse and OTLP port mappings (and any other
similar entries around the other ports referenced in the file) so they only
listen on localhost for safer local-dev defaults.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9fe21d1d-99db-4c61-af51-f7717e09c16e

📥 Commits

Reviewing files that changed from the base of the PR and between 2d21523 and 7dc5bcd.

📒 Files selected for processing (3)

.github/workflows/harness.yaml
infra/observability/docker-compose.yml
miot-harness/src/miot_harness/observability/spans.py

`gh attestation verify` filters by `--predicate-type` and defaults to `https://slsa.dev/provenance/v1`. The single call we were making only ever returned the provenance predicate, so the script reported "no SBOM predicate" even though `actions/attest-sbom@v2` had successfully registered an SPDX attestation against the same digest. Now does two calls: - provenance with explicit `--predicate-type https://slsa.dev/provenance/v1` (no behavior change, self-documenting) - SBOM with `--predicate-type https://spdx.dev/Document`, falling back to `https://cyclonedx.org/bom` if the SBOM tooling ever flips formats Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

PR #462 shipped four .ralph/ files (RALPH_PROMPT.md, blockers.md, log.md, state.md) seeded by commit 8107571. They're local ralph-loop iteration state — worklist, progress log, blockers, runtime prompt — that should never have been tracked. Non-sensitive, but noise for reviewers and grows with every loop iteration. - `git rm -r --cached .ralph/` removes them from the index only; files remain on disk so the active ralph session keeps working. - Add `.ralph/` to root `.gitignore` (under the existing Cursor / Aura local-only dev-tooling section) so any future ralph run anywhere in the monorepo is ignored by default. User chose the non-destructive path: no history rewrite, no force-push. Old commits still reference the files but they're non-sensitive iteration logs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

korutx

WIP

* harness(phase-13): close NEXO_META observation-level tenant tag gap E10 emitted `langfuse.tags = [tenant:<id>, mode:<m>, ...]` on the root `nexo.run` trace span, but the inner LLM call in meta mode slipped through observation-level tagging — `supervisor._run_nexo_meta` called `meta_agent_node(model=self.meta_model, ...)` with the **raw** BaseChatModel, so Traceloop's auto-instrumented `anthropic.chat` observation carried no `modular.{agent,tenant_id,mode}` attrs. Result: trace-level filters worked; observation-level per-tenant cost rollups under-counted meta runs. - `runtime/nexo_graph.py`: rename `_instrument` → `instrument_model` so it's importable across modules (no behavior change). - `runtime/supervisor.py`: import `instrument_model`; in `_run_nexo_meta`, wrap `self.meta_model` via `instrument_model(self.meta_model, "meta_agent", ctx)` before invoking `meta_agent_node`. Per-agent `nexo.meta_agent` span now carries the same `modular.{agent,tenant_id,mode}` + `langfuse.{user.id,session.id,tags,environment}` attrs as canned / agentic paths. Tests: - `tests/runtime/test_supervisor_phase_e.py`: +`test_meta_mode_per_agent_span_carries_tenant_tag` — runs the meta dispatch via the supervisor, asserts the emitted nexo.meta_agent span has `modular.agent=meta_agent`, `modular.tenant_id=acme-corp`, `modular.mode=meta`, `langfuse.user.id=alice`, and `tenant:acme-corp` / `mode:meta` / `agent:meta_agent` all present in the JSON-decoded `langfuse.tags`. - Drop unused `HarnessRoute` import in same file. Conftest restructure: the `memory_exporter` fixture lived in `tests/observability/conftest.py`, but the new supervisor test needs it from `tests/runtime/`. Pytest's conftest scoping only resolves ancestors, not siblings — promoted the fixture (+ the `_HarnessSpanExporter` class) to `tests/conftest.py` so any test file in the suite can use it. `tests/observability/conftest.py` is now a docstring stub. Full suite: 310 passed, 1 skipped. * harness(phase-13): C3 live Langfuse fetcher + MIOT_HARNESS_LANGFUSE_HOST `observability/report.py` previously errored out on the live path ("live Langfuse fetch is not wired yet — pass --fixture"). Wiring it to the real Langfuse v3 public REST API. `config.py`: - `langfuse_host: str = "http://localhost:3000"` — where the CLI fetches traces from. Local docker-compose stack by default; override for a deployed Langfuse via env var. - `.env.example`: documented entry alongside the existing `MIOT_HARNESS_LANGFUSE_*` keys block. `observability/report.py`: - `_project_langfuse_trace(trace) -> {"attributes": {...}}` — pure projection from Langfuse's trace row shape (`tags`, `totalCost`, `userId`, `sessionId`) into the `aggregate_cost`-friendly shape. Splits the `tenant:` / `mode:` / `agent:` prefixes back into `modular.{tenant_id, mode, agent}` so the existing aggregator keys still work. Token totals project as 0 (Langfuse v3 keeps them on observations, not on trace rows). - `fetch_traces_window(host, public_key, secret_key, since, now=, page_size=100, client=)` — paginated GET against `/api/public/traces` with Basic Auth, returning the raw rows. Caller-injectable `client` for tests via `httpx.MockTransport`. Owns the client lifecycle when none is injected. - `_live_traces(since)` — operator-facing wrapper reading the three env vars (`MIOT_HARNESS_LANGFUSE_{HOST,PUBLIC_KEY,SECRET_KEY}`). Raises a clear RuntimeError telling the operator to run `./infra/observability/bootstrap.sh` when keys are missing. - `main()`: replaces the `parser.error("not wired yet")` branch with `_live_traces(since)` when no `--fixture` is passed; still honours `--fixture` for dry-run / offline use. - Drive-by: line-wrap `_render_text`'s long header line to satisfy ruff E501 (was committed in 462's a2f5e8a). Tests (+8 new): - `test_project_langfuse_trace_extracts_tags_and_cost` — `tenant:acme` / `mode:agentic` → `modular.{tenant_id, mode}`. - `test_project_langfuse_trace_handles_missing_tags`. - `test_project_langfuse_trace_handles_null_total_cost` (Langfuse returns `totalCost: null` for traces with no LLM calls). - `test_project_langfuse_trace_picks_first_agent_tag` — multi-agent traces project the first. - `test_fetch_traces_window_paginates_until_total_pages` — 3-page MockTransport stream concatenates correctly. - `test_fetch_traces_window_forwards_since_as_query_params` — asserts `fromTimestamp` / `toTimestamp` are derived from the injected `now` and the `since` timedelta. - `test_fetch_traces_window_sends_basic_auth_header` — asserts `Authorization: Basic base64(pk:sk)`. - `test_fetch_traces_window_strips_trailing_slash_from_host` — no `//api/public/traces` corruption when host ends in `/`. - `tests/test_config.py`: +2 tests for `langfuse_host` default and env-var override. Usage: MIOT_HARNESS_LANGFUSE_PUBLIC_KEY=pk-lf-… MIOT_HARNESS_LANGFUSE_SECRET_KEY=sk-lf-… \ uv run python -m miot_harness.observability.report --since 7d --by tenant Full suite: 310 passed, 1 skipped. * harness(phase-13): docs — Run Modes + per-tenant billing surface Four documentation updates covering the per-tenant billing flow this PR ships. No code changes here — all source changes landed in the preceding two commits. `miot-harness/README.md`: - New `## Run Modes` section after `## Design Defaults` covering the four `mode` values on `POST /runs`: auto / canned / meta / agentic. - 5-column table (route / DB-touch / non-Mintral / LLM-calls / use-case) for at-a-glance comparison. - Per-mode subsections with concrete example questions: - `### canned — answer FROM the data` (nexo_graph, curated fn_dx_*). - `### meta — answer ABOUT the data` (meta_agent_node, no SQL, structural guarantee via function signature). - `### agentic — explore the data with composable primitives` (agentic_graph, sqlglot safety gate). - `### Cost separation in Langfuse` — how `mode:<m>` tag + per-tenant filter enables tiered billing (unlimited meta + metered canned). `infra/observability/README.md`: - New `## Per-tenant cost rollups` section before "What this is *not*" documenting the four ways to get per-tenant cost: 1. Langfuse UI Filters sidebar (Tags → tenant:<id> → Total Cost). 2. Cost-report CLI with sample text output. 3. ClickHouse JOIN query against traces + observations (correctly joining the tag-bearing trace row to the cost-bearing observation rows — Langfuse v3 only stores total_cost on observations). 4. Persistent dashboard (pointer to dashboards/per-tenant-cost.md). - Privacy posture reminder: single-tenant self-host; co-mingled tenant data inside one Langfuse project; plan-14 isolates per project. `infra/observability/dashboards/per-tenant-cost.md` (NEW): - 2-minute click-by-click UI build procedure for the Per-tenant cost widget. Langfuse v3 doesn't yet expose a stable dashboard-import endpoint so we ship the build procedure instead of brittle JSON. - ClickHouse JOIN query as a power-user fallback. - CLI alternative. `infra/observability/PER_TENANT_BILLING_TEST.md` (NEW): - 9-step manual test procedure for end-to-end verification of the per-tenant billing surface (tunnel → uvicorn → 3-tenant curl battery → UI filter → CLI → NEXO_META observation-level check → ClickHouse query → dashboard build). - All paths are `$REPO_ROOT`-relative (`git rev-parse --show-toplevel`) so the doc is portable across developer machines and CI. - `$DB_SCRIPTS_ROOT` placeholder for the *separate* db-scripts repo that owns the SSH tunnel script. - Langfuse keys read from `docker-compose.yml` via a `grep | awk` pipeline rather than hardcoded literal values — clean for forks that regenerate the keys. - Pass/fail checklist at the end with a "which file to inspect when each step fails" troubleshooting table. * harness(phase-13): post-review report.py fixes — settings-based config + (multi) agent sentinel Two Important items from PR #470's first-pass review: I1 — `_live_traces` was reading `os.environ.get` directly with a hardcoded `"http://localhost:3000"` default, even though `HarnessSettings.langfuse_host` already exists with the same default (and proper `MIOT_HARNESS_` prefix + `.env` resolution via pydantic-settings). Two defaults to keep in sync = drift hazard. Now reads through `get_settings()` — single source of truth, `.env` resolution + validation apply, the `langfuse_host` field is no longer dead. `os` import dropped since nothing else used it. I3 — `_project_langfuse_trace` previously projected the **first** `agent:<name>` tag and silently dropped the rest. For a canned-mode trace (which carries 3+ agent tags: filter_expert / synthesizer / critic / …), this attributed all of the trace's cost to the first agent and undercounted the others. For `--by tenant` / `--by mode` this was correct (single-valued tags); for `--by agent` it was wrong on live data. Now: single-agent traces still project the real name; multi-agent traces project `modular.agent = "(multi)"` so the aggregator bucket is visibly distinct. Operators see a `(multi)` row and know to drill down via the observations endpoint instead of trusting silently-undercounted numbers. Tests: - Replaced `test_project_langfuse_trace_picks_first_agent_tag` with `test_project_langfuse_trace_marks_multi_agent_with_sentinel`. - Added `test_project_langfuse_trace_single_agent_uses_real_name` to pin down the unambiguous case. Full suite: 321 passed, 1 skipped. Review ref: PR #470 review I1 + I3. * harness(phase-13): tighten meta-attribution test docstring (PR #470 I2) The test docstring promised the regression target was "the inner LLM call must carry the same attrs" — but the test asserts on the harness-emitted `nexo.meta_agent` span (from `NexoTelemetryCallback`), not the Traceloop-auto-instrumented `anthropic.chat` child observation that was the actual gap. `FakeListChatModel` doesn't trigger the Anthropic SDK instrumentor, so the test verifies the harness-side wrap, not the third-party-instrumented child observation. Rewrote the docstring to make the scope explicit: - What's covered: the per-agent span emitted when `_run_nexo_meta` calls `instrument_model(self.meta_model, "meta_agent", ctx)`. - What's verified separately: the Traceloop child-observation path, confirmed live via the ClickHouse query in PR #470's description (`SELECT name, tenant_id FROM observations WHERE name='nexo.meta_agent'`). No code change; docstring-only. Review ref: PR #470 review I2. * chore: remove plan files * chore: remove plan files * chore: remove plan files --------- Co-authored-by: Michel David <mdavid.cu@gmail.com>

odtorres added 22 commits May 12, 2026 13:21

odtorres added 2 commits May 12, 2026 17:00

harness(phase-13): F5 done — PR #462 opened

416864b

coderabbitai Bot reviewed May 12, 2026

View reviewed changes

odtorres mentioned this pull request May 14, 2026

harness(phase-13): per-tenant billing surface (stacked on #462) #470

Merged

6 tasks

odtorres added 3 commits May 14, 2026 17:52

coderabbitai Bot reviewed May 14, 2026

View reviewed changes

odtorres mentioned this pull request May 14, 2026

harness(phase-13): hydrate conversation memory into LLM-bearing agents (stacked on #470) #472

Merged

odtorres self-assigned this May 15, 2026

odtorres and others added 8 commits May 15, 2026 15:54

coderabbitai Bot reviewed May 15, 2026

View reviewed changes

Comment thread infra/observability/docker-compose.yml

odtorres and others added 4 commits May 15, 2026 16:50

coderabbitai Bot reviewed May 15, 2026

View reviewed changes

odtorres linked an issue May 15, 2026 that may be closed by this pull request

feat(harness): end-to-end LLM telemetry + agentic supervisor for Nexo (Plan 13) #484

Closed

6 tasks

korutx approved these changes May 20, 2026

View reviewed changes

korutx merged commit 85f6fe9 into trunk May 20, 2026
12 checks passed

This was referenced May 21, 2026

Feat/harness sse streaming #496

Merged

Feat/harness sse rich events #516

Merged

coderabbitai Bot mentioned this pull request Jun 6, 2026

refactor(harness)!: datasource-agnostic core — Nexo becomes the first DataSourceProvider #604

Merged

7 tasks

Conversation

odtorres commented May 12, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What's NOT in this PR (follow-up)

Known limitations / caveats

Verification artifacts

Test plan

Code review checkpoints

Plan ref

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related issues

Possibly related PRs

Suggested reviewers

Poem

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

korutx left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

odtorres commented May 12, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 12, 2026 •

edited

Loading