harness(phase-13): per-agent telemetry + agentic search foundation#462
Conversation
- Add `opentelemetry-{api,sdk,exporter-otlp}`,
`opentelemetry-instrumentation-fastapi`, `traceloop-sdk` to runtime deps
and `pytest-opentelemetry` to dev deps.
- `uv sync` resolves 163 packages cleanly with no version conflicts against
the existing langchain/langgraph/fastapi/pydantic pins.
- New smoke tests under `tests/observability/test_deps.py` exercise the
public import surface each Phase-A task depends on (TracerProvider,
OTLP gRPC trace exporter, FastAPI instrumentation, traceloop.sdk,
pytest_opentelemetry).
- Seed `.ralph/` state, log, blockers, and RALPH_PROMPT.md so iteration
progress is tracked in-tree.
Plan ref: .cursor/plans/ai-first/13-post-nexo-roadmap.md (Phase A · A1)
Add `miot_harness.observability` with four units that the rest of Phase A
plugs into:
- `pricing.py` — versioned `ModelPricing` table (Opus 4.7, Sonnet 4.6,
Haiku 4.5, gpt-4o-mini) keyed by model id with `effective_date` so
historical reads stay stable; Anthropic-style `TokenUsage` buckets;
`compute_cost` returns a `Decimal` quantized to 6 decimals; raises
`UnknownModelError` for unmapped models.
- `otel.py` — `configure_tracing(enabled, service_name, endpoint,
environment)` builds a `TracerProvider` with a Resource carrying
`service.name` / `service.namespace=modular` /
`deployment.environment` and a `BatchSpanProcessor` wrapping
`OTLPSpanExporter`. Returns `None` when disabled so the FastAPI
lifespan (A4) can no-op without branches. `shutdown_tracing` is
None-safe.
- `spans.py` — `agent_span(name, *, run_id, tenant_id?, mode?)` context
manager that opens a `nexo.<name>` span carrying
`gen_ai.operation.name`, `modular.run_id`, `modular.tenant_id`, and
`modular.mode` (last two omitted when unset) and nests under the
current context so traces form a tree.
- `callbacks.py` — `NexoTelemetryCallback(BaseCallbackHandler)` tracking
per-`run_id` span handles in a dict (so concurrent LLM calls under
the same node do not cross-talk). Maps LangChain's total-style
`usage_metadata.input_tokens` back into Anthropic buckets, infers
provider from `serialized.id`, and emits
`gen_ai.{system, request.model, usage.input_tokens, usage.output_tokens,
usage.cache_read.input_tokens, usage.cache_creation.input_tokens,
usage.cost_usd}` plus `modular.{agent, run_id, tenant_id, mode}` on
`on_llm_end`. `on_llm_error` records the exception and sets
`Status(ERROR)`.
Tests under `tests/observability/` attach an `InMemorySpanExporter` to
the `TracerProvider` already installed by pytest-opentelemetry (OTel's
global provider is install-once) and filter to `nexo.*` spans so the
plugin's own per-test spans don't pollute assertions. 17 new tests cover
pricing math, span context, configure_tracing return contract, and the
callback's full span-emission surface incl. concurrent runs and provider
inference. Full suite: 175 passed, 1 skipped.
Plan ref: .cursor/plans/ai-first/13-post-nexo-roadmap.md (Phase A · A2)
Per-agent attribution from day one:
- `runtime/nexo_graph.py`: new `_instrument(model, agent_name, ctx)` helper
that returns `model.with_config(callbacks=[NexoTelemetryCallback(...)])`,
threading `ctx.run_id` and `ctx.tenant_id` through every LLM-bearing seat
(filter_expert, domain_analyst, synthesizer, critic, summarizer). Tool-only
seats (data_fetcher, freshness_judge) stay model-less and produce no
callback spans.
- `runtime/supervisor.py`: wrap `nexo_graph.ainvoke(...)` in
`agent_span("run", run_id, tenant_id)` so a `nexo.run` root span lives
over the whole invocation. Combined with the callback's per-agent spans,
Langfuse can render a complete tree per run; the `modular.run_id` attr
on every span lets us regroup even when LangGraph's parallel branches
break OTel context propagation (the gotcha A6 covers).
Tests: `tests/observability/test_graph_wiring.py` builds the real
`build_nexo_graph(...)` with scripted `FakeListChatModel`s and asserts the
graph emits `nexo.<agent>` spans carrying `modular.{agent, run_id,
tenant_id}` for every seat that fired, plus a `nexo.run` root with the
right `gen_ai.operation.name`/`modular.run_id`/`modular.tenant_id`. Full
suite: 177 passed, 1 skipped.
Plan ref: .cursor/plans/ai-first/13-post-nexo-roadmap.md (Phase A · A3)
Pull A5 ahead of A4 so the FastAPI lifespan (A4) has typed settings to read instead of falling back to raw env reads. New `HarnessSettings` fields, all under the existing `env_prefix="MIOT_HARNESS_"`: - `otel_enabled: bool = False` — telemetry is off by default; the lifespan no-ops on configure_tracing when this is unset. - `otel_endpoint: str = "http://localhost:4317"` — OTLP gRPC target. - `otel_service_name: str = "miot-harness"` — resource attr. - `otel_environment: str = "local"` — resource attr (`deployment.environment`). - `langfuse_public_key: str | None = None` - `langfuse_secret_key: str | None = None` Tests: 4 new in `test_config.py` covering safe defaults + env overrides for both groups. Full suite: 181 passed, 1 skipped. Plan ref: .cursor/plans/ai-first/13-post-nexo-roadmap.md (Phase A · A5)
Wire telemetry boot into the FastAPI lifespan: - Before Nexo boot, call `configure_tracing(enabled=settings.otel_enabled, service_name, endpoint, environment)`. When enabled, follow with `Traceloop.init(app_name, api_endpoint, disable_batch=False, telemetry_enabled=False)` so Anthropic/OpenAI SDK calls auto-emit `gen_ai.*` child spans against the same global TracerProvider that `NexoTelemetryCallback` per-agent spans flow through. Self-hosted — `telemetry_enabled=False` blocks Traceloop's phone-home pings. - `app.state.tracer_provider` exposes the provider for introspection. - `shutdown_tracing` runs in BOTH `finally` branches of the lifespan (the early Nexo-disabled return AND the regular pool-close path) so the batch exporter always flushes on shutdown. Tests: `tests/observability/test_lifespan_init.py` patches `configure_tracing`/`Traceloop`/`shutdown_tracing` and asserts: - Off-path: configure_tracing called once with `enabled=False`, `Traceloop.init` NOT called. - On-path: configure_tracing forwards all four settings, `Traceloop.init` called with `app_name=service_name`, `shutdown_tracing` called with the returned provider on teardown. Full suite: 183 passed, 1 skipped. Plan ref: .cursor/plans/ai-first/13-post-nexo-roadmap.md (Phase A · A4)
Phase A's known-risk test: when LangGraph fans out into parallel
branches, OTel parent-context propagation can break — child spans
land under the wrong parent and usage attribution gets mixed.
`tests/observability/test_propagation.py` builds a 3-node synthetic
graph `planner → (branch_a ∥ branch_b) → joiner` where each parallel
branch instantiates its own `NexoTelemetryCallback`. After the run we
assert every `nexo.*` span carries the right `modular.{agent, run_id,
tenant_id}` — the explicit attrs Langfuse uses to regroup spans when
the OTel context-propagation tree is unreliable.
Second test: 20 concurrent callbacks via `asyncio.gather` confirm the
per-`run_id` span-dict inside `NexoTelemetryCallback` doesn't cross-talk
under realistic concurrency (multiple LangGraph parallel branches).
The two other A6 test files (`test_callbacks.py` for span emission,
`test_pricing.py` for cost math) were written under A2 as part of TDD
on the source — A6 adds only the propagation piece.
Phase A is now complete; Phase B (backend deployment) is blocked on
operator preconditions per `.ralph/blockers.md`. Full suite: 185 passed,
1 skipped.
Plan ref: .cursor/plans/ai-first/13-post-nexo-roadmap.md (Phase A · A6)
Reviewer flagged 0 Critical / 5 Important issues during the Phase A
checkpoint. All five fixed plus one recommendation:
1. `.env.example` was missing the new OTel/Langfuse keys, violating the
file's "every config.py setting documented here" invariant. Added
commented entries for MIOT_HARNESS_OTEL_{ENABLED,ENDPOINT,
SERVICE_NAME,ENVIRONMENT} and the LANGFUSE keys.
2. `tests/observability/conftest.py` filtered to `nexo.*` only — Traceloop
auto-instrumentation spans (`gen_ai.*` / `anthropic.*` / `openai.*`)
would have been silently dropped in Phase B tests. Broadened the
filter and renamed `_NexoSpanFilter` → `_HarnessSpanExporter`.
3. `observability/otel.py` silently no-op'd on re-init: a second
`configure_tracing(enabled=True)` call with a provider already
installed returned a NEW SDK provider whose spans went nowhere
(`set_tracer_provider` is install-once). Now we detect the existing
`TracerProvider`, log a warning, and return that one instead. Test
rewritten with `patch` on `trace.get_tracer_provider`/`set_tracer_provider`
to exercise both branches without polluting global state.
4. `observability/spans.py` docstring claimed nodes wrap themselves in
`agent_span`, but no node body does that. The callback layer emits
`nexo.<agent>` for LLM calls; `agent_span` is currently only used
for the root `nexo.run` in `supervisor.py`. Docstring corrected.
5. A6 propagation test (`test_propagation.py`) used `FakeListChatModel`
without any real `await` suspension — branches ran to completion
inside their own tasks and never genuinely interleaved. Added
`await asyncio.sleep(0)` around each `ainvoke` so the event loop
yields between branches, and tightened the docstring to scope what
the test does and doesn't prove (structural attr defense vs.
OTel-context propagation).
Plus reviewer Recommendation 4: wired
`FastAPIInstrumentor.instrument_app(app, tracer_provider=...)` in the
lifespan so HTTP request spans parent `nexo.run` and the full
request → graph → LLM tree is visible in Langfuse.
Full suite: 186 passed, 1 skipped.
Skip ahead of blocked B/C/D (operator-precondition gated per
`.ralph/blockers.md`) and ship the offline-doable agentic-search
foundation.
`runtime/intent_router.py` — new `LLMIntentRouter`:
- Async `route(message)` that asks the LLM to classify into one of
NEXO_QUERY | NEXO_META | NEXO_AGENTIC | STORYTELLING_RUN | DIRECT |
OTHER with calibrated confidence.
- Tolerant JSON parsing: accepts bare JSON or a ```json``` fenced
block (real LLMs often wrap). Unknown route names or malformed JSON
fall back to the keyword router.
- Confidence threshold (default 0.7): when LLM is uncertain, the
deterministic keyword router decides — never silently misroute.
`runtime/mode_resolver.py` — new `resolve_mode`:
- `mode="auto"` → LLM router decides.
- `canned` → `NEXO_QUERY`, no LLM call.
- `meta` → `NEXO_META`, allowed for any tenant (meta-info is
non-confidential per the plan's tenant-gate decision).
- `agentic` → `NEXO_AGENTIC`, but only when `request.tenant_id ==
tenant_lock`; otherwise raises `ModeAccessDenied` BEFORE any LLM
or graph call — request rejected at validation time.
`runtime/router.py` — extended `HarnessRoute` enum with NEXO_META,
NEXO_AGENTIC, OTHER. Existing keyword router unchanged; it now serves
as the deterministic fallback in LLM uncertain / failure paths.
`runtime/context.py` — `UserRequest.mode: RunMode = "auto"`
(`Literal["auto","canned","meta","agentic"]`). Pydantic rejects
unknown values (HTTP 422 in FastAPI).
`config.py` — `intent_router_model` (default Haiku tier),
`intent_router_confidence_threshold` (default 0.7).
Tests: 11 new under `tests/runtime/`:
- `test_intent_router.py`: 30-prompt confusion matrix asserts ≥90%
accuracy; plus low-conf fallback, unparseable response, fenced
JSON, and invalid-route-name paths.
- `test_mode_resolver.py`: all four mode-bypass paths (canned, meta,
agentic-mintral, agentic-rejected), the auto delegation path, and
pydantic validation rejecting `mode="bogus"`.
`.ralph/state.md`: note the out-of-order execution — E1 done before
B-D since those are operator-blocked.
Full suite: 197 passed, 1 skipped.
Plan ref: .cursor/plans/ai-first/13-post-nexo-roadmap.md (Phase E · E1)
New `agents/meta_agent.py` (Haiku tier) answers schema / primer /
introspection questions WITHOUT touching the DB:
- `MetaAgentCatalogEntry` — narrow projection of `FunctionDescriptor`
the agent needs (name, layer, title, body). Callers feed either the
live introspection cache (production) or a hand-crafted list (tests).
- `meta_agent_node(state, *, model, primer, catalog)` — composes a
system prompt from `primer + formatted catalog`, asks the LLM, and
returns a `{"answer": str}` delta. The function signature
**deliberately omits** any `registry` / `tools` / `pool` parameter:
there's nothing to invoke and nothing to SQL against. The "no SQL"
guarantee is structural, not aspirational.
Tests (`tests/agents/test_meta_agent.py`):
- Agent answers reference catalog functions by name.
- Primer text reaches the system prompt (so es_critico explanations
stay grounded).
- Every catalog entry's title + body appears in the system prompt.
- Signature-level test asserts no DB/tool parameter ever creeps in
via a future refactor.
Full suite: 201 passed, 1 skipped.
Plan ref: .cursor/plans/ai-first/13-post-nexo-roadmap.md (Phase E · E2)
THE high-risk surface of plan 13: the agent gains four read-only DB
primitives. The safety gate has to be airtight before they go live.
Deps:
- Added `sqlglot>=25.0` (resolves to 30.7.0). Missed in A1; the plan
calls for it.
Config:
- `nexo_explain_cost_threshold: float = 10000.0` — PostgreSQL plan-cost
units. Above this, `nexo_explain` refuses the query.
`integrations/nexo/primitives.py`:
Safety gate `validate_select_sql(sql)`:
- Parses with sqlglot's postgres dialect.
- Rejects bare-or-trailing `;`, multi-statement input, or empty parses
(`MultiStatementRejected`).
- Walks the AST for mutation nodes (Insert/Update/Delete/Drop/Create/
Alter/AlterColumn/Grant/TruncateTable) — even inside CTEs
(`MutationRejected`).
- Requires the top-level construct to be SELECT/Subquery/Union
(`UnsupportedConstruct`).
- Asserts every `exp.Table` matches `^nexo\.dx_[a-zA-Z0-9_]+$` —
`pg_catalog`, `information_schema`, `public.*`, even
`nexo.private_*` are rejected (`AllowlistViolation`).
- Returns the validated AST so callers can re-serialize from it.
Primitives:
- `nexo_describe(table)` — narrow allowlist; queries
`information_schema.columns` with parameterised schema/name. Bypasses
the SELECT gate (own allowlist).
- `nexo_select(table, where?, order?, limit?, columns?)` — default
LIMIT 100, hard cap 5000 (silent clamp). Identifiers quoted; SQL
passes through `validate_select_sql` before execution.
- `nexo_grep(table, column, pattern, limit?)` — sugar over
`WHERE col ILIKE $1`. Pattern parameterised; column identifier
validated against `[a-zA-Z_][a-zA-Z0-9_]*`.
- `nexo_explain(query, cost_threshold)` — runs EXPLAIN (FORMAT JSON);
raises `CostGateViolation` if total cost exceeds threshold.
Tests (`tests/integrations/nexo/test_primitives.py`): 27 new.
Safety gate (no DB):
- 3 multi-statement rejections (bare `;`, `; INSERT`, `SELECT; SELECT`).
- 8 mutation rejections (INSERT/UPDATE/DELETE/DROP/TRUNCATE/CREATE/
ALTER/GRANT).
- 5 non-allowlisted rejections (pg_catalog, information_schema,
pg_class, public.users, nexo.private_table).
- 3 allowlist-accepted SELECTs.
- 1 CTE-with-INSERT rejection.
Functional (mocked pool that mimics asyncpg's real `acquire()` shape —
sync method returning an async context manager):
- `nexo_describe` returns parsed rows + rejects non-allowlisted target.
- `nexo_select` clamps LIMIT to 5000; default is 100.
- `nexo_grep` builds the right ILIKE WHERE with a quoted column.
- `nexo_explain` refuses above threshold; accepts below; returns
`{total_cost, node_type, plan}`.
Full suite: 228 passed, 1 skipped. Loop-prompt gate met: "Move past E3
without safety tests green" is satisfied.
Plan ref: .cursor/plans/ai-first/13-post-nexo-roadmap.md (Phase E · E3)
Code review of E3 verified two Critical bypass paths through the
composable-primitives safety gate. Both fixed before E4 ships.
C1 — `LATERAL <function>` bypass.
`SELECT * FROM nexo.dx_servicios, LATERAL pg_read_file('/etc/passwd')`
was accepted by the gate because `LATERAL` holds an `exp.Anonymous`,
not an `exp.Table` — the allowlist iterator never saw it. Fix:
`validate_select_sql` walks `exp.Lateral` nodes and rejects with
`UnsupportedConstruct`. Agentic queries don't need LATERAL.
C2 — side-effect functions in SELECT list.
`pg_terminate_backend`, `pg_cancel_backend`, `set_config`, `pg_sleep`,
`lo_import`, `pg_reload_conf`, `pg_advisory_lock`, `pg_read_file`,
`dblink` are all SELECT-shaped per the AST (no Insert/Update/Delete
node) but cause DoS, session-state mutation, file exfiltration, or
kill other backends. Positive function allowlist (`_SAFE_FUNCTIONS`,
~80 entries: aggregates, math, string, date/time, JSON, array,
window) is the only durable defense.
`_function_name_candidates(func)` returns the FULL set of sqlglot
spellings for a given function — `Anonymous.this`, `sql_name()`, `key`,
class name — so the gate matches regardless of how sqlglot normalizes
the AST (`to_char` → `TimeToStr`/`time_to_str`/`timetostr`,
`date_trunc` → `TimestampTrunc`/`timestamp_trunc`/`timestamptrunc`,
`now()` → `CurrentTimestamp`/`current_timestamp`). The allowlist
includes every form.
Additional review fixes:
I1 — CTEs were silently broken (`WITH ok AS (SELECT ...) SELECT * FROM ok`
returned a confusing `AllowlistViolation: table 'ok' is outside allowlist`).
Fix: collect CTE alias names from `exp.With.expressions` and exempt them
from the table allowlist. The CTE body itself was already walked, so
the inner SELECT's tables still get checked.
I4 — `validate_select_sql` returned the AST but `nexo_select`/`nexo_grep`/
`nexo_explain` sent the raw user string to the DB. Fix: new
`_render_safe(ast)` returns `ast.sql(dialect='postgres')` — comments,
whitespace tricks, and any structurally-redundant input get normalized
away. DB sees only what the gate validated.
M3 — `_MUTATION_NODES` gained `exp.Merge` (PG15+ MERGE) and `exp.Copy`.
Tests added (20 new in `test_primitives.py`):
- 3 LATERAL function-in-FROM rejections
- 9 side-effect function rejections in SELECT list
- 5 safe-builtin acceptances (count, lower+max, coalesce+to_char+now,
date_trunc, json_build_object)
- 1 CTE-allowlisted acceptance
- 2 MERGE/COPY rejections
Direct re-probe via `uv run python` confirms all 8 bypass attempts
from the review are now blocked. Full suite: 248 passed, 1 skipped.
Plan ref: .cursor/plans/ai-first/13-post-nexo-roadmap.md (Phase E · E3 review)
`observability/provenance.py`:
- `ProvenanceEntry` dataclass: question, sql, plan_cost,
rows_returned, refreshed_at (optional), run_id, tenant_id.
- `ProvenanceLog(root, enabled=True)`: append-only JSONL log
partitioned by UTC date (`YYYY-MM-DD.jsonl`). Creates the parent
directory on first write so callers don't have to. Naive `now`
arguments coerced to UTC. `enabled=False` makes every call a no-op
so production can flip the log off without untangling call sites.
`scripts/provenance-curate.py` (new, worktree-root scripts dir):
- argparse CLI reading the last N days of JSONL.
- WHERE-clause normalization: collapses `'literals'` and numeric
values to `$`, so `WHERE id = 5` and `WHERE id = 7` cluster into
one bucket. Repeated (table, where-pattern) combos surface as new
curated `fn_dx_*` function candidates.
- Surfaces three things in the weekly diff:
1. Top-N most-repeated (table, where-pattern) buckets.
2. Cross-tenant tables (≥2 tenants) — multi-tenant abstraction
signal for plan 14.
3. Plan-cost outliers — index or denial candidates.
- `--format json` for machine-readable output.
Tests (`tests/observability/test_provenance.py`, 7 new):
- Every field round-trips through JSONL.
- Append-only (one line per call).
- Date partitioning across days.
- mkdir on missing parent dir.
- Disable flag is a true no-op.
- `refreshed_at` may be None.
Full suite: 255 passed, 1 skipped.
Plan ref: .cursor/plans/ai-first/13-post-nexo-roadmap.md (Phase E · E4)
`runtime/conversation.py`: - `ConversationTurn(user_message, assistant_answer)` — dataclass. - `ConversationHistory(conversation_id, turns, summary)` — dataclass. - `ConversationStore` Protocol — the v2 Redis-backed implementation drops in here without changing call sites. - `InMemoryConversationStore` — v1 dict-keyed store. Lost on process restart; acceptable for v1. - `summarize_if_needed(conversation_id, summarizer)` returns False below `summarize_at_turns` (default 10) and fires the async summarizer at the (n+1)th turn, storing the result in `history.summary` so the next prompt can be compacted before context-window pressure hits. `UserRequest.conversation_id: str | None = None` and `HarnessRunRecord.conversation_id: str | None = None` added so Langfuse groups multi-turn runs by this attribute. None for one-shot requests. Tests (`tests/runtime/test_conversation.py`, 7 new): - `get` returns None for unknown id. - Append round-trips a single turn. - Order preserved across multiple turns. - Two conversations are isolated. - `summarize_if_needed` no-ops below threshold (asserts summarizer not called). - `summarize_if_needed` fires above threshold, stores summary. - `conversation_id` round-trips across two /runs calls — the seam contract that matters most. Full suite: 262 passed, 1 skipped. Plan ref: .cursor/plans/ai-first/13-post-nexo-roadmap.md (Phase E · E5)
E6 — new `runtime/agentic_graph.py`:
- `build_agentic_graph(settings, models, provenance_log)` returns a
compiled LangGraph `StateGraph[NexoState]` with the nodes the plan
names: `tenancy_gate / planner / synthesizer / critic / summarizer`.
- Entry point is the tenancy gate; planner bumps `turn_count` and
exits with a failure at 12 (the agentic turn cap, vs 8 for plan-
execute).
- Behavioral specifics — planner choosing curated-vs-composable,
executor invoking primitives, freshness_judge/domain_analyst/
critic verdict routing — are scaffolded but stubbed in this
iteration. The plan acknowledges F3 (live verification) is the
behavior gate; the wiring tests assert topology + happy-path
reachability + turn-cap safety.
- `provenance_log` is threaded through the closure so the executor
(when implemented) writes one row per primitive call (see E4).
E7 — new `runtime/tenancy.py`:
- `tenancy_gate_decision(ctx, route, settings) → TenancyDecision(
allowed, refusal_message, audit_attr)` — pure function, no I/O.
- Behavior matrix from plan 13 §E7:
- `NEXO_QUERY` / `NEXO_AGENTIC` → tenant must match
`nexo_tenant_lock`; else refuse with a Mintral-only message.
- `NEXO_META` → allow any tenant. Off-lock tenants get an audit
attribute `tenant.bypass: meta_route` so Langfuse can panel
them.
- Unknown route → safe-default refuse for off-lock tenants.
- `nexo_graph._tenant_gate_node` and `agentic_graph.tenancy_gate`
both delegate here. The mode resolver already refuses
`mode="agentic"` for non-Mintral at request validation; the
graph-level gate is defense in depth.
Tests:
- 4 in `test_agentic_graph.py`: non-Mintral refusal, happy path
runs to synthesizer answer, turn-cap exit, required-nodes
topology check.
- 7 in `test_tenancy_gate.py`: data routes refuse non-Mintral,
data routes allow Mintral, NEXO_META allows any tenant + audit
attr when off-lock, NEXO_META locked tenant no audit attr,
unknown route safe-default refuse.
Full suite: 273 passed, 1 skipped.
Plan ref: .cursor/plans/ai-first/13-post-nexo-roadmap.md (Phase E · E6, E7)
…spans
E9 — the C2 per-mode cost dashboard needs `modular.mode` on every span
so it can group canned vs agentic vs meta runs and compare cost-per-
question. Wire-up:
- `HarnessContext.mode: RunMode = "auto"` — propagated from
`UserRequest.to_context()`.
- `supervisor._run_nexo` passes `mode=ctx.mode` to
`agent_span("run", ...)`, so the root `nexo.run` span carries
`modular.mode`.
- `nexo_graph._instrument(...)` forwards `ctx.mode` to
`NexoTelemetryCallback`, so every per-agent `nexo.<agent>` span
Langfuse renders for the run carries the same attribute.
Tests (`tests/observability/test_mode_telemetry.py`, 2 new):
- Root `nexo.run` carries `modular.mode='agentic'` when the caller
sets it on the context.
- After running the real graph, every per-agent span carries
`modular.mode='canned'` — proves the attr survives from
HarnessContext → callback → emitted span.
Marks E8 done as well: the seven test files the plan lists for E8
(test_intent_router, test_mode_resolver, test_meta_agent,
test_primitives, test_provenance, test_conversation,
test_agentic_graph, test_tenancy_gate) were all written via TDD
during E1-E7; E8 is implicitly satisfied.
The `nexo.critic` non-zero requirement and C2 dashboards in E9 are
infrastructure-dependent (Phase B+C blocked on operator
preconditions). The attribute surface is wired and verified offline;
live verification gates on infra.
Full suite: 275 passed, 1 skipped.
Plan ref: .cursor/plans/ai-first/13-post-nexo-roadmap.md (Phase E · E8, E9)
`uv run pytest` reports 275 passed, 1 skipped, 0 failed, 0 errors. - Plan 12's 151-test baseline preserved (no regressions). - 124 new tests across `tests/observability/`, `tests/runtime/`, `tests/agents/`, `tests/integrations/nexo/` covering Phase A (telemetry foundation) and Phase E (agentic search + composable primitives) — every offline-verifiable task in plan 13. F2 (canned-mode live verification), F3 (agentic-mode live verification), F4 (one week of agentic traffic), and F5 (PR with verification artifacts) require operator preconditions per `.ralph/blockers.md`: Docker daemon for the Langfuse stack, SSH tunnel to coordinador-prod, and `ANTHROPIC_API_KEY`. Until those are in place the loop cannot complete Phase F. Phase E review checkpoint dispatched separately.
…p gap Phase E code review checkpoint dispatched after E1-E9 + F1 landed. Verdict: 0 Critical / 7 Important / 5 Minor. Critical bypasses from the earlier E3 review remain closed. Cheap Important fixes applied here: - **.env.example** (#4): added MIOT_HARNESS_INTENT_ROUTER_MODEL, MIOT_HARNESS_INTENT_ROUTER_CONFIDENCE_THRESHOLD, and MIOT_HARNESS_NEXO_EXPLAIN_COST_THRESHOLD so the file's own "every config.py setting documented here" invariant holds. - **agentic_graph.py docstring** (#2): rewritten to be honest about which nodes are wired in this iteration (tenancy_gate, planner stub, synthesizer, critic stub, summarizer stub) vs which are deferred to the F-phase supervisor integration (executor, freshness_judge, domain_analyst). The old docstring listed all 8 nodes as if they were present. - **`_ = provenance_log`** (#3): clarified the intent comment so a reader knows the parameter is captured for the future executor node, not silently ignored. Big Important #1 — `HarnessSupervisor` does NOT yet consume the new Phase E modules (resolve_mode, LLMIntentRouter, InMemoryConversation Store, build_agentic_graph, meta_agent_node). The supervisor still uses the keyword `IntentRouter` for routing decisions. The seven new modules are correct and tested in isolation but are an orphan island until the F-phase wire-up lands. Documented as a new OPEN entry in `.ralph/blockers.md` with the concrete 5-step wire-up plan the next iteration follows when operator preconditions are cleared. Phase F status: - F1 (uv run pytest green): **DONE** (275 passed, 1 skipped). - F2, F3, F4 (live verification): **BLOCKED** on operator preconditions (Docker for Langfuse stack, SSH tunnel to coordinador-prod, ANTHROPIC_API_KEY). - F5 (PR open with verification artifacts): **BLOCKED** on F2-F4. Loop output: BLOCKER. Every remaining task requires either operator preconditions or the supervisor wire-up that depends on them. The plan's offline-doable scope is complete; the autonomous loop cannot proceed further from this seat. Full suite: 275 passed, 1 skipped. Plan ref: .cursor/plans/ai-first/13-post-nexo-roadmap.md (Phase E review)
The Phase E modules (LLMIntentRouter, mode_resolver, agentic_graph, meta_agent_node, InMemoryConversationStore) were orphan-islands — HarnessSupervisor.run() still used the keyword router. This commit lands the integration the post-Phase-E review flagged as Important #1. `runtime/supervisor.py`: - `__init__` gains optional kwargs `llm_router`, `agentic_graph`, `meta_model`, `meta_primer`, `meta_catalog`, `conversation_store`, `tenant_lock`. Positional sig unchanged — existing call sites and tests don't break. - `run()` resolves the route via `await resolve_mode(...)` when `llm_router` is injected; else falls back to the keyword router (Plan 12 default). - `ModeAccessDenied` (agentic-mode for non-Mintral) is caught at the route-resolution boundary and surfaced as a refusal answer. - Dispatch tree extended: - NEXO_QUERY → existing `_run_nexo` (nexo_graph.ainvoke) - NEXO_META → new `_run_nexo_meta` (calls meta_agent_node with primer + catalog; no graph, no SQL) - NEXO_AGENTIC → new `_run_nexo_agentic` (agentic_graph.ainvoke, wrapped in `agent_span("run", ...)`) - STORYTELLING_RUN / DIRECT / OTHER unchanged. - `conversation_id` round-trips: hydrate via store on entry, append one `ConversationTurn(user_message, assistant_answer)` on exit when both `conversation_store` and `request.conversation_id` are set. `record.conversation_id` mirrors the request id for telemetry. `runtime/factory.py`: - `build_harness` wires the always-on `InMemoryConversationStore` (it costs nothing without a request). - The Phase-E modules that need live Nexo boot (LLM router, agentic graph, meta model) are wired by the FastAPI lifespan once the Nexo integration succeeds. `api/server.py` lifespan: - After `build_nexo_graph(...)` succeeds, builds the agentic graph (sharing the analyst model as planner), wires `meta_model` + primer + an auto-derived catalog from the registered curated tools, instantiates `LLMIntentRouter` (Haiku tier from settings), and sets `tenant_lock` from `nexo_tenant_lock`. - Falls back gracefully — supervisor's keyword-router path stays intact when any of the boot steps fails. Tests: 7 new in `tests/runtime/test_supervisor_phase_e.py`: - explicit canned mode dispatches to nexo_graph - explicit meta mode calls meta_agent_node (no graph) - explicit agentic mode dispatches to agentic_graph - explicit agentic mode for non-Mintral returns refusal - auto mode uses LLM router (not keyword) - conversation_id round-trips via the store across two calls - backward compatibility: keyword-router-only construction works Full suite: 282 passed, 1 skipped.
`infra/observability/` is the local dev stack for Phase B (plan 13).
Authored offline — Docker only needs to be running to *execute* it,
not to write it.
`docker-compose.yml` (B1):
- 7 services on a private `observability` bridge network:
- `postgres:16` — Langfuse's primary DB
- `clickhouse:24-alpine` — Langfuse's analytics store
- `redis:7-alpine` — Langfuse's queue/cache (`requirepass` set)
- `minio` — S3-compatible blob storage for event payloads
- `langfuse/langfuse:3` (web) — UI + ingest API on `:3000`
- `langfuse/langfuse-worker:3` — background jobs
- `otel/opentelemetry-collector-contrib:0.95.0` — OTLP gRPC on
`:4317`, HTTP on `:4318`, self-metrics on `:8888`
- Volumes for postgres, clickhouse, clickhouse-logs, minio.
- Health checks gate `depends_on:` so the right boot order happens
on `docker compose up -d`.
- LANGFUSE_INIT_* envs declare a deterministic `miot-harness-local`
project so bootstrap.sh can mint keys idempotently.
`otel-collector-config.yaml` (B2):
- OTLP gRPC (4317) + HTTP (4318) receivers.
- `memory_limiter` (90% soft / 10% spike) + `batch` (200ms / 1KB).
- `otlphttp/langfuse` exporter pointing at the in-network
`http://langfuse-web:3000/api/public/otel` ingest endpoint.
- Comments describe the fan-out pattern (e.g. adding Honeycomb) so
swapping backends doesn't require harness re-instrumentation.
`bootstrap.sh` (B3):
- `docker compose up -d`, then polls `/api/public/health` until 200
(180s timeout, prints last 50 lines of langfuse-web logs on fail).
- Calls Langfuse v3's `/api/public/projects/{id}/api-keys` to mint a
key pair, parses `publicKey` / `secretKey` out of the JSON, and
prints the `.env` lines for the operator to paste — including
toggling `MIOT_HARNESS_OTEL_ENABLED=true`.
- Falls back to UI instructions when the bootstrap API isn't
reachable (Langfuse v3 build pre-bootstrap-API).
- Marked `set -euo pipefail` + `chmod +x`.
`README.md` (B4):
- Bring-up / tear-down, access (UI / collector metrics), .env
paste-block, privacy posture (single-tenant, internal, full
message storage — not for SaaS).
- Backup/restore via `tar` over docker volumes.
- Backend-swap recipe (add a second exporter, no harness changes).
- Troubleshooting table covering the common boot failures.
YAML + shell syntax verified by `yaml.safe_load` and `bash -n`.
No tests added — these are config files exercised live in F2/F3.
Full suite still: 282 passed, 1 skipped.
Plan ref: .cursor/plans/ai-first/13-post-nexo-roadmap.md (Phase B · B1-B4)
C3 from plan 13 — `python -m miot_harness.observability.report --since 7d
--by agent|tenant|mode`. The pure aggregation core is offline-doable
(unit-tested) and the live-Langfuse fetcher is deferred to F-phase when
the stack is up.
`observability/report.py`:
- `CostRow(key, cost_usd, input_tokens, output_tokens, traces)` —
one bucket of the report.
- `aggregate_cost(traces, by)` — pure aggregation over a list of trace
projections. Groups by `modular.agent` / `modular.tenant_id` /
`modular.mode`, sums `gen_ai.usage.cost_usd` as Decimal (6-decimal
quantize), tallies `gen_ai.usage.{input,output}_tokens`, counts
traces. Skips traces missing `gen_ai.usage.cost_usd`. Sorts rows
cost-descending so the dashboard panel renders the cost driver
first.
- `parse_since('7d' | '24h' | '90m')` returns a `timedelta`; rejects
invalid specs with a clear ValueError.
- CLI shim: `--fixture <path-to-jsonl>` runs against a static file so
operators can sanity-check the math before pointing at Langfuse;
`--format json` for machine-readable output. The Langfuse fetcher
is intentionally unbuilt; the CLI errors loudly if no fixture is
passed, with a pointer to F-phase.
Tests (`tests/observability/test_report.py`, 13 new):
- aggregate by agent / tenant / mode each sum correctly
- traces missing cost are skipped (not silently zeroed)
- output is sorted cost-descending
- unknown grouping dimension raises ValueError
- token totals (in / out) propagate per-bucket
- parse_since accepts common specs (1h/24h/7d/30d/90m)
- parse_since rejects garbage ('', '1', '7y', 'minus', '1.5d')
C2 dashboards and C4 alerting still wait for the live Langfuse stack
(JSON exports + real trace queries respectively). C1 live verification
is operator-precondition gated per `.ralph/blockers.md`.
Stage 1 complete. Stage 2 = user action: install Docker, get
ANTHROPIC_API_KEY, start the SSH tunnel.
Full suite: 295 passed, 1 skipped.
Plan ref: .cursor/plans/ai-first/13-post-nexo-roadmap.md (Phase C · C3)
…ENCRYPTION_KEY, cluster mode, deterministic API keys
Four corrections to the Phase B stack found while bringing it up live:
1. **ClickHouse healthcheck used `localhost`** — Docker on macOS doesn't
route IPv6 inside containers, so wget resolved `localhost` → `::1`
and got refused while ClickHouse was actually listening on
`0.0.0.0:8123` (IPv4 only, after the v6 bind failed). Switched the
healthcheck to `127.0.0.1`, raised `retries` to 20, added
`start_period: 30s` so the slow first-boot migrations don't trip
the gate.
2. **`ENCRYPTION_KEY: 000...0` was rejected by Langfuse's Zod schema**
("ENCRYPTION_KEY must be 256 bits, 64 string characters in hex
format, generate via: openssl rand -hex 32"). Generated a real one
with `openssl rand -hex 32`; comment makes the local-dev-only
nature explicit so this isn't reused in prod.
3. **`CLICKHOUSE_CLUSTER_ENABLED: "false"`** added to both langfuse-web
and langfuse-worker. Langfuse v3's migrations default to
`ReplicatedMergeTree ON CLUSTER default`, which requires
ZooKeeper. Disabling cluster mode makes them use plain MergeTree
for the single-node local stack.
4. **`LANGFUSE_INIT_PROJECT_PUBLIC_KEY` / `_SECRET_KEY`** added so the
API keys are minted deterministically during the project bootstrap
pass — no manual UI step, no race condition with the bootstrap
script trying to call a key-mint endpoint that doesn't exist in
Langfuse v3's public API. Comment explicitly marks these as
local-dev-only credentials and explains the rotation path.
`otel-collector-config.yaml`:
- Replaced the placeholder Authorization header with the real
`Basic base64(pk-lf-...:sk-lf-...)` for the new auto-provisioned
key pair so traces actually reach Langfuse.
`bootstrap.sh`:
- Replaced the (broken) `POST /api/public/projects/.../api-keys`
attempt with a Postgres-direct read of the public_key (just for
confirmation) and a grep over `docker-compose.yml` for the secret.
Result: the script always prints the working pair, idempotently.
Verified end-to-end: `docker compose down -v && docker compose up -d`
boots the 7 services cleanly, langfuse-web reports `{"status":"OK"}` on
`/api/public/health`, otel-collector starts with the Langfuse exporter
ready, and `./bootstrap.sh` prints the .env paste-block in <5s.
…ion; clear blockers
End-to-end live bring-up confirmed:
- F2 canned mode (3 of plan-12's golden scenarios): G6 estado, G7 cola
crítica, G8 ETA en riesgo. All three returned Markdown answers from
the real Coordinador snapshot (with honest stale-data warnings where
the snapshot is older than the freshness threshold).
- F3 mode-bypass + auto-router: META mode answers from primer+catalog
without DB touch, allowed for any tenant. AGENTIC refused at
request-validation for non-Mintral (`ModeAccessDenied`). AGENTIC for
Mintral runs through the stub planner→synthesizer. AUTO mode
classified "dimensionamiento para mañana" as NEXO_QUERY at
confidence 0.75 and dispatched correctly.
- Direct ClickHouse query confirms Langfuse received: 11 `nexo.run`
root spans + 6 `nexo.synthesizer` + 6 `nexo.filter_expert` per-agent
spans + 22 `anthropic.chat` (Traceloop auto-instrumented) child
spans. Per-mode trace split: 5 auto / 3 canned / 2 meta / 1
agentic. Every span carries `modular.{mode, run_id, tenant_id}` in
its metadata.
`tests/conftest.py` (new): 7 default-value tests were failing because
the populated dev `.env` was being auto-loaded by pydantic-settings,
overriding the asserted defaults. Added an autouse fixture that pins
`HarnessSettings.model_config["env_file"]` to a non-existent path and
clears any `MIOT_HARNESS_*` / provider keys from the inherited shell,
so the full suite passes deterministically on a populated dev box and
on CI. Full suite: 295 passed, 1 skipped.
`.ralph/blockers.md`: both OPEN entries (operator preconditions +
supervisor wire-up) marked RESOLVED with the resolution context.
`.ralph/state.md`: C1, D1, D2, F2, F3 all → [x]. D3 / F4 stay [ ]
(time-gated — 50+ runs over a week). C2 / C4 / F5 stay [ ] (follow-up
work after the PR lands).
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
✅ Files skipped from review due to trivial changes (1)
📝 WalkthroughWalkthroughIntroduces a local Langfuse+OTel stack, integrates OpenTelemetry in the FastAPI app, adds pricing, provenance, SQL safety primitives, intent routing, agentic/meta modes, tenancy gate, conversation memory, supervisor/server wiring, a cost-report CLI, CI attestations, and comprehensive tests. ChangesPlan 13 end-to-end telemetry and runtime modes
Sequence Diagram(s)sequenceDiagram
participant Client
participant FastAPI
participant Supervisor
participant Router as Mode/Intent Router
participant Graph as Nexo/Agentic/Meta Graphs
participant OTel as OTel Collector
participant Langfuse
Client->>FastAPI: POST /runs (user_message, mode, tenant)
FastAPI->>Supervisor: run(request)
Supervisor->>Router: resolve_mode(request)
Router-->>Supervisor: RouteResult(route)
Supervisor->>Graph: execute(route) with callbacks
Graph-->>Supervisor: answer, events
Supervisor-->>FastAPI: response
FastAPI->>OTel: OTLP traces
OTel->>Langfuse: Export traces
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes Possibly related issues
Possibly related PRs
Suggested reviewers
Poem
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
|
…nt cost rollups
Spotted during the F2/F3 live verification: the Langfuse Tracing UI's
filter sidebar (User ID / Session ID / Tags / Environment) was empty
for our runs even though `tenant_id` / `user_id` / `conversation_id`
were already threaded through every request. Without these promoted to
first-class indexed columns, per-client cost rollups would require
post-hoc metadata filtering instead of using Langfuse's built-in
filter affordances.
Langfuse's OTel ingest promotes specific attribute names
(`langfuse.user.id`, `langfuse.session.id`, `langfuse.tags`,
`langfuse.environment`) from span metadata into first-class trace
columns. This commit emits them alongside the existing `modular.*`
attrs.
Changes:
- `HarnessContext` gains `conversation_id: str | None`, propagated
from `UserRequest.to_context()` so the conversation surface and
Langfuse session_id read from the same source.
- `observability/spans.py:agent_span(...)` extended with optional
`user_id` / `session_id` / `tags` / `environment` kwargs. Tags
serialized as a JSON-encoded string array for portability.
- `observability/callbacks.py:NexoTelemetryCallback` gains the same
four kwargs. On every `on_chat_model_start`, sets
`langfuse.user.id`, `langfuse.session.id`, `langfuse.environment`,
and `langfuse.tags` on the per-agent span so observation-level
rows in Langfuse carry the same first-class fields as the trace.
- `runtime/supervisor.py:_root_span_kwargs(ctx, route)` centralizes
the mapping: every dispatch path (NEXO_QUERY / NEXO_AGENTIC /
NEXO_META) calls `agent_span("run", **self._root_span_kwargs(...))`
with identical kwargs plus a per-route tag (`route:<chosen_route>`).
`session_id` falls back from `conversation_id` → `thread_id` so
one-shot requests still group under a session key.
- `runtime/nexo_graph.py:_instrument(...)` forwards `ctx.user_id`,
`ctx.conversation_id or ctx.thread_id`, and a per-agent tag list
(`tenant:<id>`, `mode:<mode>`, `agent:<name>`) to each
NexoTelemetryCallback.
Tests: 4 new in `tests/observability/test_langfuse_attribution.py`:
- agent_span emits all four langfuse.* attrs when provided
- agent_span omits them when unset (backward compat for old callers)
- callback emits langfuse.* on per-agent spans when provided
- callback omits them when not provided
Live-verified end-to-end: after `POST /runs` with
`{tenant_id: mintral, user_id: odtorres, conversation_id:
odtorres-debug-2}`, the Langfuse `traces` ClickHouse row shows:
user_id = 'odtorres'
session_id = 'odtorres-debug-2'
tags = ['agent:filter_expert','agent:synthesizer',
'mode:auto','route:nexo_query','tenant:mintral']
environment = 'local'
Full suite: 299 passed, 1 skipped.
Plan ref: .cursor/plans/ai-first/13-post-nexo-roadmap.md (Phase E · E10,
added mid-execution per the plan's "editable during execution" rule).
There was a problem hiding this comment.
Actionable comments posted: 18
Note
Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
miot-harness/tests/conftest.py (1)
29-44:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winClear cached settings inside the isolation fixture.
get_settings()is cached, so env cleanup alone may not take effect in tests that call the accessor after the first instantiation.Suggested fix
-from miot_harness.config import HarnessSettings +from collections.abc import Iterator + +from miot_harness.config import HarnessSettings, get_settings @@ -@pytest.fixture(autouse=True) -def _isolate_settings_from_env(monkeypatch: pytest.MonkeyPatch) -> None: +@pytest.fixture(autouse=True) +def _isolate_settings_from_env(monkeypatch: pytest.MonkeyPatch) -> Iterator[None]: @@ + get_settings.cache_clear() @@ - for key in list(os.environ.keys()): + for key in list(os.environ.keys()): if key.startswith("MIOT_HARNESS_") or key in { "ANTHROPIC_API_KEY", "OPENAI_API_KEY", "GOOGLE_API_KEY", }: monkeypatch.delenv(key, raising=False) + yield + get_settings.cache_clear()🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@miot-harness/tests/conftest.py` around lines 29 - 44, The isolation fixture _isolate_settings_from_env currently alters HarnessSettings.model_config and environment variables but does not clear the cached settings, so tests that call get_settings() after a prior instantiation still see stale values; update _isolate_settings_from_env to clear the cached accessor by calling get_settings.cache_clear() (or the appropriate cache-clear method used by get_settings) after monkeypatching HarnessSettings.model_config and before returning so subsequent get_settings() calls recreate settings from the modified env/model_config.
🟡 Minor comments (5)
miot-harness/tests/observability/conftest.py-13-17 (1)
13-17:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winFix import ordering to clear CI (
ruff I001).The current import block is flagged by Ruff and is failing the lint job.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@miot-harness/tests/observability/conftest.py` around lines 13 - 17, Reorder the import block to satisfy Ruff's I001 ordering: group and sort imports into standard library, third-party, and local/application imports, and sort them alphabetically within each group; specifically ensure pytest and opentelemetry imports are ordered so that "import pytest" appears before "from opentelemetry import trace" and the opentelemetry submodule imports (ReadableSpan, TracerProvider, SimpleSpanProcessor, SpanExportResult, InMemorySpanExporter) are grouped and sorted consistently..ralph/log.md-5-11 (1)
5-11:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winAdd a language identifier to the fenced example block.
This fence triggers markdownlint
MD040; adding a language keeps docs lint-clean.Suggested fix
-``` +```text ## Iteration N — YYYY-MM-DD HH:MM - Task: <task ID + short> - Status: [completed|blocked|partial] - Commit: <SHA or "n/a"> - Notes: <one-line summary; tests run; surprises></details> <details> <summary>🤖 Prompt for AI Agents</summary>Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.In @.ralph/log.md around lines 5 - 11, The fenced code block in the iteration
template is missing a language identifier which triggers markdownlint MD040;
update the triple-backtick fence for the example block (the block that begins
with "" followed by the iteration lines) to include a language identifier such as "text" (e.g. change "" to "```text") so the markdown linter passes
and the example remains unchanged otherwise.</details> </blockquote></details> <details> <summary>miot-harness/src/miot_harness/config.py-48-55 (1)</summary><blockquote> `48-55`: _⚠️ Potential issue_ | _🟡 Minor_ | _⚡ Quick win_ **Add bounds validation for new numeric settings.** These values are externally configurable, but currently accept invalid ranges. Guarding them at load time prevents silent misconfiguration. <details> <summary>Suggested fix</summary> ```diff - intent_router_confidence_threshold: float = 0.7 + intent_router_confidence_threshold: float = Field(default=0.7, ge=0.0, le=1.0) @@ - nexo_explain_cost_threshold: float = 10000.0 + nexo_explain_cost_threshold: float = Field(default=10000.0, gt=0.0) ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@miot-harness/src/miot_harness/config.py` around lines 48 - 55, The two numeric config fields intent_router_confidence_threshold and nexo_explain_cost_threshold need runtime bounds validation on load: ensure intent_router_confidence_threshold is within [0.0, 1.0] and nexo_explain_cost_threshold is a positive number (e.g., > 0.0); implement the checks in the config initialization path (e.g., Config.__post_init__ or the function that constructs these values) and raise a clear ValueError with the invalid field name and value if a check fails so misconfiguration is caught immediately. ``` </details> </blockquote></details> <details> <summary>miot-harness/.env.example-135-135 (1)</summary><blockquote> `135-135`: _⚠️ Potential issue_ | _🟡 Minor_ | _⚡ Quick win_ **Fix the stack path typo in the OTEL comment.** Line 135 has an extra space in the path (`infra/observability/ docker-compose`), which makes copy/paste misleading. <details> <summary>✏️ Suggested fix</summary> ```diff -# infra/observability/ docker-compose stack is up. +# infra/observability/docker-compose stack is up. ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@miot-harness/.env.example` at line 135, Fix the typo in the OTEL comment by removing the stray space in the path string "infra/observability/ docker-compose" so it reads "infra/observability/docker-compose"; update the comment text on the affected line in .env.example accordingly to ensure copy/paste yields a valid path. ``` </details> </blockquote></details> <details> <summary>miot-harness/tests/observability/test_graph_wiring.py-158-164 (1)</summary><blockquote> `158-164`: _⚠️ Potential issue_ | _🟡 Minor_ | _⚡ Quick win_ **Assert actual parent-child linkage, not just root presence.** The test is named `test_root_nexo_run_span_parents_per_agent_spans` yet only verifies root attributes. The established testing pattern in the codebase (test_spans.py) demonstrates that `span.parent` and `span.context.span_id` are available for parent-child assertions. Add explicit parent-id checks to verify that nexo.* agent spans are actually children of the nexo.run root span, preventing broken span parenting from going undetected. <details> <summary>🧪 Suggested assertion</summary> ```diff spans = memory_exporter.get_finished_spans() root = next((s for s in spans if s.name == "nexo.run"), None) assert root is not None, "nexo.run root span not emitted" attrs = dict(root.attributes or {}) assert attrs["modular.run_id"] == "run_root_span_test" assert attrs["modular.tenant_id"] == "mintral" assert attrs["gen_ai.operation.name"] == "nexo.run" + + root_span_id = root.context.span_id + for s in spans: + if s.name.startswith("nexo.") and s.name != "nexo.run": + assert s.parent is not None, s.name + assert s.parent.span_id == root_span_id, s.name ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@miot-harness/tests/observability/test_graph_wiring.py` around lines 158 - 164, The test currently only asserts the root "nexo.run" span attributes; add parent-child assertions to ensure all agent spans are children of that root by locating the root span (root) from memory_exporter.get_finished_spans(), then iterate over agent spans (spans with names starting with "nexo." excluding "nexo.run") and assert each has a non-null parent and that span.parent.span_id (or span.parentSpanId if that field is used in the Span model) equals root.context.span_id (or root.context.span_id field used in this codebase); update test_root_nexo_run_span_parents_per_agent_spans to include these explicit parent-id checks. ``` </details> </blockquote></details> </blockquote></details> <details> <summary>🤖 Prompt for all review comments with AI agents</summary>Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.Inline comments:
In@infra/observability/docker-compose.yml:
- Around line 96-99: The compose file currently hardcodes deterministic secrets
(e.g., DATABASE_URL, NEXTAUTH_URL, NEXTAUTH_SECRET, SALT and other
DB/Redis/MinIO/Langfuse creds) — replace these literal values with externalized
references (use environment variable interpolation like ${DATABASE_URL}, an
env_file, or Docker secrets) and remove the static defaults from the committed
compose; update the service definitions that reference DATABASE_URL,
NEXTAUTH_SECRET, SALT (and the other listed secret entries) to read from the
external env or secret names, add a .env.example with placeholder keys (not
secrets) and documentation on how to provide real credentials locally/CI, and
ensure no real credentials remain in the file.- Around line 63-69: The Redis healthcheck is failing because the healthcheck
test (healthcheck.test using "redis-cli ping") doesn't authenticate while the
container is started with "--requirepass myredissecret"; update the
healthcheck.test to authenticate (for example use "redis-cli -a myredissecret
ping" or "redis-cli -u redis://:myredissecret@localhost:6379 ping") so it
supplies the password, or reference a secret/env var for the password, keeping
the rest of the healthcheck properties and the command that sets "--requirepass"
unchanged.In
@infra/observability/otel-collector-config.yaml:
- Around line 35-43: The otel-collector config currently contains a hardcoded
Basic Authorization header under the otlphttp/langfuse exporter (headers:
Authorization) which must be moved to an environment variable; replace the
literal base64 string in the headers.Authorization value with an env reference
using the OTel Collector syntax (e.g., ${env:LANGFUSE_BASIC_AUTH}), then ensure
the otel-collector service in docker-compose sets LANGFUSE_BASIC_AUTH from the
runtime environment (and document adding the value to a gitignored .env file
generated via printf 'pk...:sk...' | base64). Keep the exporter key
(otlphttp/langfuse) and headers.Authorization as the target of the change so the
config still points to the same endpoint while removing secrets from the repo
and git history.In
@miot-harness/src/miot_harness/integrations/nexo/primitives.py:
- Around line 6-11: The file has long-line and import-order lint errors and a
duplicate entry in the safe-functions list: break the long literal/table strings
(lines around the module docstring or top-level comment that contains the
Tool/Shape/Safety table and any other >79 char lines noted at 24-30) into
shorter concatenated or multi-line strings to fix E501; reorder the import block
into standard groups (stdlib, third-party, local) and ensure imports are
sorted/grouped consistently (I001) for the module's import section; and remove
the duplicate "to_char" entry from the _SAFE_FUNCTIONS collection (symbol
_SAFE_FUNCTIONS at line ~113) so there are no repeated function names (B033).
Make these edits in miot_harness.integrations.nexo.primitives (update the
docstring/table wrapping, fix import order, and de-duplicate _SAFE_FUNCTIONS)
and run linters to confirm all three issues are resolved.- Around line 378-381: The code incorrectly assumes rows are dicts by using
isinstance(row, dict) which causes all asyncpg.Record results to be rejected;
instead, in the nexo_explain path obtain the QUERY PLAN by attempting direct
subscript access on row (e.g., try: payload = row["QUERY PLAN"]; except
(KeyError, TypeError, IndexError): raise UnsupportedConstruct("EXPLAIN output
missing QUERY PLAN")) so asyncpg.Record objects are handled properly, then keep
the existing plan = payload[0].get("Plan", {}) logic and raise
UnsupportedConstruct only when the subscript access fails or payload is empty.In
@miot-harness/src/miot_harness/observability/otel.py:
- Around line 46-56: When configure_tracing detects an existing global
TracerProvider (variable existing) you must record ownership instead of always
shutting it down later; add a boolean (e.g., tracer_provider_owned) that is set
to False when existing is reused and True when configure_tracing creates/sets a
new provider, and update shutdown_tracing to only call provider.shutdown() if
tracer_provider_owned is True (apply the same ownership check for the other
reuse branch around the code referenced at 72-75). Ensure the boolean is
accessible to both configure_tracing and shutdown_tracing so only providers
created by this module are shut down.In
@miot-harness/src/miot_harness/observability/provenance.py:
- Around line 68-70: The current code writes JSON payload and newline in two
separate fh.write calls (inside the with path.open("a", encoding="utf-8")
block), which risks interleaving under concurrent appends; change it so the code
constructs a single string (e.g., record = json.dumps(payload,
ensure_ascii=False) + "\n") and replace the two fh.write(...) calls with a
single fh.write(record) to ensure each JSONL record is written atomically as one
write operation.In
@miot-harness/src/miot_harness/observability/report.py:
- Line 124: The header f-string assigned to the variable lines is too long;
split it across multiple shorter string literals (using implicit concatenation
inside parentheses or concatenation with +) so the f-string doesn't exceed the
line-length limit. For example, break f"{'#':>3} {by.upper():<20}
{'COST_USD':>10} {'INPUT':>10} {'OUTPUT':>10} {'N':>5}" into two or more
parts and join them (keep the same f-string interpolation for by.upper()) so the
resulting value of lines remains identical but each source line respects the
linter's max line length.In
@miot-harness/src/miot_harness/runtime/agentic_graph.py:
- Line 78: The statement assigning response exceeds the 100-char line limit;
break it into multiple shorter expressions by extracting
snapshot.get("user_message", "") into a local variable (e.g., user_message) and
then calling model.ainvoke with that variable (referencing response,
model.ainvoke, and snapshot.get), or by formatting the list argument across
multiple lines so the full call no longer exceeds 100 characters.In
@miot-harness/src/miot_harness/runtime/intent_router.py:
- Around line 115-128: The call to self._model.ainvoke can raise and currently
bubbles up; wrap the await self._model.ainvoke(...) call in a try/except that
catches Exception (or the model client-specific exceptions) and on failure logs
the error (use logger.warning or logger.exception with context) and then returns
self._fallback.route(message) so routing degrades to deterministic keyword
routing; keep the existing flow that still parses the response via
_parse_decision and only falls back when parsing fails or when the ainvoke call
throws.In
@miot-harness/tests/integrations/nexo/test_primitives.py:
- Around line 12-30: The import block around nexo test symbols is mis-ordered
for Ruff I001; reformat it so imports are grouped and sorted: keep "from
future import annotations" at the top, then a blank line, then standard
library imports (typing, unittest.mock) sorted alphabetically, then a blank
line, then third-party imports (pytest), then a blank line, then local package
imports from miot_harness.integrations.nexo.primitives; within that from-style
import list (AllowlistViolation, CostGateViolation, MutationRejected,
MultiStatementRejected, UnsupportedConstruct, nexo_describe, nexo_explain,
nexo_grep, nexo_select, validate_select_sql) sort names alphabetically.
Alternatively run isort/ruff --fix to apply the same ordering.In
@miot-harness/tests/observability/test_deps.py:
- Around line 8-14: The import block currently violates Ruff I001; reorder and
normalize imports so they follow isort rules: keep "from future import
annotations" first, then the standard-library import "import importlib", then
third-party "import pytest", with a single blank line after the grouped imports;
you can either run the project formatter (isort/ruff --fix) or manually reorder
the three lines ("from future import annotations", "import importlib",
"import pytest") to fix the issue.In
@miot-harness/tests/observability/test_langfuse_attribution.py:
- Around line 73-75: The import block in test_langfuse_attribution.py is
unsorted; reorder and group the imports so they pass ruff—place third-party
imports together in alphabetical order and ensure from-imports are alphabetized
(e.g., import AIMessage from langchain_core.messages and ChatGeneration,
LLMResult from langchain_core.outputs) and run the linter to verify formatting;
adjust any import grouping to match the project's import ordering rules used by
ruff.- Around line 131-133: The import block is unsorted; please reorder/group the
imports to satisfy ruff/isort by placing module imports in the correct
alphabetical/grouped order (e.g., put from langchain_core.messages import
AIMessage before from langchain_core.outputs import ChatGeneration, LLMResult)
and ensure the names inside each import are alphabetized and formatted
consistently so linting passes.- Line 15: The file imports pytest but never uses it; remove the unused import
statement (the line importing pytest) from
miot-harness/tests/observability/test_langfuse_attribution.py so the file no
longer contains an unused symbol and the pipeline error is resolved.In
@miot-harness/tests/observability/test_pricing.py:
- Around line 5-16: The import block in tests/observability/test_pricing.py is
failing ruff I001 due to incorrect ordering and grouping; reorder imports into
standard library (datetime, decimal) first, then third-party (pytest), then
local package imports, with a blank line between groups, and sort names
alphabetically within each group—ensure the top imports include date and Decimal
from datetime/decimal, then pytest, then the miot_harness.observability.pricing
symbols (PRICING, ModelPricing, TokenUsage, UnknownModelError, compute_cost) in
alphabetical order to satisfy ruff.In
@miot-harness/tests/runtime/test_intent_router.py:
- Around line 39-43: Two test-case tuples in test_intent_router.py exceed the
100-character line limit; break each long tuple literal into multiple lines so
no physical line exceeds 100 chars. Locate the offending tuples containing the
strings "filter by region=norte and order by ETA desc" and "draft a story about
today's incident" (they construct tuples with HarnessRoute.NEXO_AGENTIC and
HarnessRoute.STORYTELLING_RUN respectively) and reformat them by placing the
prompt string, model name, score, and route on separate indented lines (or wrap
the prompt string) so the tuple remains the same semantically but each source
line is <=100 chars to satisfy Ruff E501.In
@miot-harness/tests/runtime/test_supervisor_phase_e.py:
- Line 29: The import list in the test includes an unused symbol HarnessRoute
which triggers a pipeline failure; remove HarnessRoute from the import statement
that currently reads "from miot_harness.runtime.router import HarnessRoute,
IntentRouter" so only IntentRouter is imported (or switch to a single-name
import of IntentRouter) and run the tests to ensure no other references to
HarnessRoute remain in the file (e.g., check test_supervisor_phase_e.py for any
usage of HarnessRoute).
Outside diff comments:
In@miot-harness/tests/conftest.py:
- Around line 29-44: The isolation fixture _isolate_settings_from_env currently
alters HarnessSettings.model_config and environment variables but does not clear
the cached settings, so tests that call get_settings() after a prior
instantiation still see stale values; update _isolate_settings_from_env to clear
the cached accessor by calling get_settings.cache_clear() (or the appropriate
cache-clear method used by get_settings) after monkeypatching
HarnessSettings.model_config and before returning so subsequent get_settings()
calls recreate settings from the modified env/model_config.
Minor comments:
In @.ralph/log.md:
- Around line 5-11: The fenced code block in the iteration template is missing a
language identifier which triggers markdownlint MD040; update the
triple-backtick fence for the example block (the block that begins with "" followed by the iteration lines) to include a language identifier such as "text" (e.g. change "" to "```text") so the markdown linter passes and the example
remains unchanged otherwise.In
@miot-harness/.env.example:
- Line 135: Fix the typo in the OTEL comment by removing the stray space in the
path string "infra/observability/ docker-compose" so it reads
"infra/observability/docker-compose"; update the comment text on the affected
line in .env.example accordingly to ensure copy/paste yields a valid path.In
@miot-harness/src/miot_harness/config.py:
- Around line 48-55: The two numeric config fields
intent_router_confidence_threshold and nexo_explain_cost_threshold need runtime
bounds validation on load: ensure intent_router_confidence_threshold is within
[0.0, 1.0] and nexo_explain_cost_threshold is a positive number (e.g., > 0.0);
implement the checks in the config initialization path (e.g.,
Config.post_init or the function that constructs these values) and raise a
clear ValueError with the invalid field name and value if a check fails so
misconfiguration is caught immediately.In
@miot-harness/tests/observability/conftest.py:
- Around line 13-17: Reorder the import block to satisfy Ruff's I001 ordering:
group and sort imports into standard library, third-party, and local/application
imports, and sort them alphabetically within each group; specifically ensure
pytest and opentelemetry imports are ordered so that "import pytest" appears
before "from opentelemetry import trace" and the opentelemetry submodule imports
(ReadableSpan, TracerProvider, SimpleSpanProcessor, SpanExportResult,
InMemorySpanExporter) are grouped and sorted consistently.In
@miot-harness/tests/observability/test_graph_wiring.py:
- Around line 158-164: The test currently only asserts the root "nexo.run" span
attributes; add parent-child assertions to ensure all agent spans are children
of that root by locating the root span (root) from
memory_exporter.get_finished_spans(), then iterate over agent spans (spans with
names starting with "nexo." excluding "nexo.run") and assert each has a non-null
parent and that span.parent.span_id (or span.parentSpanId if that field is used
in the Span model) equals root.context.span_id (or root.context.span_id field
used in this codebase); update test_root_nexo_run_span_parents_per_agent_spans
to include these explicit parent-id checks.</details> <details> <summary>🪄 Autofix (Beta)</summary> Fix all unresolved CodeRabbit comments on this PR: - [ ] <!-- {"checkboxId": "4b0d0e0a-96d7-4f10-b296-3a18ea78f0b9"} --> Push a commit to this branch (recommended) - [ ] <!-- {"checkboxId": "ff5b1114-7d8c-49e6-8ac1-43f82af23a33"} --> Create a new PR with the fixes </details> --- <details> <summary>ℹ️ Review info</summary> <details> <summary>⚙️ Run configuration</summary> **Configuration used**: defaults **Review profile**: CHILL **Plan**: Pro **Run ID**: `3cbe69d2-21ba-4b42-8fac-5a2bf59d082a` </details> <details> <summary>📥 Commits</summary> Reviewing files that changed from the base of the PR and between 2ad28fa5be479da481882bbf55691b9793e96c7c and 085595fcc0621504f0cc6beb7f84bada86ffa889. </details> <details> <summary>⛔ Files ignored due to path filters (1)</summary> * `miot-harness/uv.lock` is excluded by `!**/*.lock` </details> <details> <summary>📒 Files selected for processing (59)</summary> * `.ralph/RALPH_PROMPT.md` * `.ralph/blockers.md` * `.ralph/log.md` * `.ralph/state.md` * `infra/observability/README.md` * `infra/observability/bootstrap.sh` * `infra/observability/docker-compose.yml` * `infra/observability/otel-collector-config.yaml` * `miot-harness/.env.example` * `miot-harness/pyproject.toml` * `miot-harness/src/miot_harness/agents/meta_agent.py` * `miot-harness/src/miot_harness/api/server.py` * `miot-harness/src/miot_harness/config.py` * `miot-harness/src/miot_harness/integrations/nexo/primitives.py` * `miot-harness/src/miot_harness/observability/__init__.py` * `miot-harness/src/miot_harness/observability/callbacks.py` * `miot-harness/src/miot_harness/observability/otel.py` * `miot-harness/src/miot_harness/observability/pricing.py` * `miot-harness/src/miot_harness/observability/provenance.py` * `miot-harness/src/miot_harness/observability/report.py` * `miot-harness/src/miot_harness/observability/spans.py` * `miot-harness/src/miot_harness/runtime/agentic_graph.py` * `miot-harness/src/miot_harness/runtime/context.py` * `miot-harness/src/miot_harness/runtime/conversation.py` * `miot-harness/src/miot_harness/runtime/factory.py` * `miot-harness/src/miot_harness/runtime/intent_router.py` * `miot-harness/src/miot_harness/runtime/mode_resolver.py` * `miot-harness/src/miot_harness/runtime/nexo_graph.py` * `miot-harness/src/miot_harness/runtime/router.py` * `miot-harness/src/miot_harness/runtime/run_store.py` * `miot-harness/src/miot_harness/runtime/supervisor.py` * `miot-harness/src/miot_harness/runtime/tenancy.py` * `miot-harness/tests/agents/__init__.py` * `miot-harness/tests/agents/test_meta_agent.py` * `miot-harness/tests/conftest.py` * `miot-harness/tests/integrations/nexo/test_primitives.py` * `miot-harness/tests/observability/__init__.py` * `miot-harness/tests/observability/conftest.py` * `miot-harness/tests/observability/test_callbacks.py` * `miot-harness/tests/observability/test_deps.py` * `miot-harness/tests/observability/test_graph_wiring.py` * `miot-harness/tests/observability/test_langfuse_attribution.py` * `miot-harness/tests/observability/test_lifespan_init.py` * `miot-harness/tests/observability/test_mode_telemetry.py` * `miot-harness/tests/observability/test_otel.py` * `miot-harness/tests/observability/test_pricing.py` * `miot-harness/tests/observability/test_propagation.py` * `miot-harness/tests/observability/test_provenance.py` * `miot-harness/tests/observability/test_report.py` * `miot-harness/tests/observability/test_spans.py` * `miot-harness/tests/runtime/__init__.py` * `miot-harness/tests/runtime/test_agentic_graph.py` * `miot-harness/tests/runtime/test_conversation.py` * `miot-harness/tests/runtime/test_intent_router.py` * `miot-harness/tests/runtime/test_mode_resolver.py` * `miot-harness/tests/runtime/test_supervisor_phase_e.py` * `miot-harness/tests/runtime/test_tenancy_gate.py` * `miot-harness/tests/test_config.py` * `scripts/provenance-curate.py` </details> </details> <!-- This is an auto-generated comment by CodeRabbit for review status -->
…es gate Post-merge review of PR #462 found that `SELECT … FOR UPDATE / FOR SHARE / FOR KEY SHARE` (incl. `NOWAIT` / `SKIP LOCKED`) was ACCEPTED by the safety gate. These are read-shaped at the AST level (no Insert/Update/Delete) but acquire row-level locks — against `nexo.dx_*` snapshot tables they can block the snapshot-refresh job; against Citus they can escalate to a global lock. The DB role is read-only at runtime, but defense-in-depth says gate it at the AST level. Same class as the LATERAL function-in-FROM and side-effect-function bypasses the earlier E3 review closed: production-affecting side effects without a DML AST node. Fix: add `exp.Lock` to `_MUTATION_NODES` in `primitives.py`. sqlglot parses the `FOR UPDATE` clause into `args["locks"]` on the Select node, which the walker now catches. Test: `test_validator_rejects_row_locking` parametrically asserts all 5 PG locking variants raise `MutationRejected`. Test suite: 52 passed in `tests/integrations/nexo/` (was 47). Review ref: PR #462 review C1.
…volume Two Important items from the PR #462 review, neither behavior-changing on the wire but both improving operability: I1 — `agentic_graph.route_from_planner` was dead-code: both branches returned `"synthesizer"` and the conditional-edge dict had only one entry (`{"synthesizer": "synthesizer"}`). Anyone returning a different label from a future executor wire-up would have raised at edge time. Collapsed to a plain `graph.add_edge("planner", "synthesizer")` with a comment marking the spot for the F-phase wire-up to swap back to a conditional edge with an expanded mapping. I2 — `docker-compose.yml` declared persistent volumes for postgres / clickhouse / clickhouse-logs / minio but NOT for redis. Langfuse v3 uses Redis for the ingestion queue + cache — `docker compose restart` would lose in-flight events between the OTel Collector and langfuse-worker. Added `redis-data:/data` named volume; turned on `--appendonly yes` so the AOF actually persists. Updated the README's "Volumes used" list (was 4 entries, now 5). Review ref: PR #462 review I1 + I2.
PR #462 review I4: `_parse_decision` calls `json.loads(text)`, catches `JSONDecodeError`, then calls `payload.get(...)`. If the LLM returns valid JSON that isn't a dict (bare string `"hello"`, number `42`, bool `true`, null, array), `.get(...)` raises `AttributeError`, escapes the narrow `JSONDecodeError` except clause, and crashes the route call before the keyword fallback runs. Fix: `isinstance(payload, dict)` guard immediately after the `json.loads` — non-dict payloads now fall back to the keyword router like any other unparseable response. Test: `test_non_dict_json_falls_back_to_keyword` parametrically covers bare string / number / bool / null / array. Each case ships the message "Mintral status" so the keyword router lands on NEXO_QUERY (verifying the fallback runs end-to-end, not just that `_parse_decision` returned None). Review ref: PR #462 review I4.
There was a problem hiding this comment.
♻️ Duplicate comments (2)
infra/observability/docker-compose.yml (2)
102-105:⚠️ Potential issue | 🟠 Major | ⚡ Quick winHardcoded credentials remain a security posture concern despite local dev scope.
While the inline comments (lines 106-109, 115-119) clearly document these as LOCAL-DEV-ONLY values, committing deterministic credentials still normalizes insecure patterns and creates copy-paste risk into staging/production environments. Git history preserves these forever, and they establish a precedent for credential handling that can leak beyond local development.
🔐 Recommended approach
Even for local dev, prefer environment variable interpolation with a
.env.exampletemplate:environment: - DATABASE_URL: postgresql://postgres:postgres@postgres:5432/postgres + DATABASE_URL: ${MIOT_OBS_DATABASE_URL:-postgresql://postgres:postgres@postgres:5432/postgres} - NEXTAUTH_SECRET: please-change-this-in-prod + NEXTAUTH_SECRET: ${MIOT_OBS_NEXTAUTH_SECRET:?required} - LANGFUSE_INIT_PROJECT_PUBLIC_KEY: pk-lf-d57c7ed15cd59041736edbcfc679e48e + LANGFUSE_INIT_PROJECT_PUBLIC_KEY: ${MIOT_OBS_LANGFUSE_PUBLIC_KEY:?required}Then provide a
.env.examplewith placeholder documentation. This maintains reproducibility while keeping secrets out of version control.Also applies to: 28-28, 53-53, 72-72, 85-85, 109-109, 135-135, 139-139, 151-152, 156-156, 165-165, 169-169
Also applies to: 120-121, 124-124
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@infra/observability/docker-compose.yml` around lines 102 - 105, Replace the hardcoded local-dev credentials (DATABASE_URL, NEXTAUTH_URL, NEXTAUTH_SECRET, SALT) in docker-compose.yml with environment-variable interpolation and provide a .env.example template; specifically, change the service env entries to reference variables (e.g. DATABASE_URL: "${DATABASE_URL:-postgresql://postgres:postgres@postgres:5432/postgres}" or simply DATABASE_URL: "${DATABASE_URL}") and add a .env.example with placeholder values and documentation for the four keys so secrets are not committed and local defaults remain documented but externalized.
63-75:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winRedis healthcheck still fails due to missing authentication.
The healthcheck at line 64 uses
redis-cli pingwithout credentials, but line 72 configures Redis with--requirepass myredissecret. This causes NOAUTH errors, marks the container unhealthy, and breaks thedepends_on: service_healthycondition at line 97, preventinglangfuse-webfrom starting.🔐 Proposed fix
healthcheck: - test: ["CMD", "redis-cli", "ping"] + test: ["CMD", "redis-cli", "-a", "myredissecret", "ping"] interval: 5s timeout: 3s retries: 10🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@infra/observability/docker-compose.yml` around lines 63 - 75, The Redis healthcheck command is missing authentication and fails because Redis is started with --requirepass myredissecret; update the healthcheck test (the healthcheck "test" entry that currently runs redis-cli ping) to pass the password (e.g., use redis-cli -a myredissecret ping or the equivalent redis-cli --pass myredissecret ping) so the healthcheck authenticates successfully against the server started with --requirepass myredissecret.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@infra/observability/docker-compose.yml`:
- Around line 102-105: Replace the hardcoded local-dev credentials
(DATABASE_URL, NEXTAUTH_URL, NEXTAUTH_SECRET, SALT) in docker-compose.yml with
environment-variable interpolation and provide a .env.example template;
specifically, change the service env entries to reference variables (e.g.
DATABASE_URL:
"${DATABASE_URL:-postgresql://postgres:postgres@postgres:5432/postgres}" or
simply DATABASE_URL: "${DATABASE_URL}") and add a .env.example with placeholder
values and documentation for the four keys so secrets are not committed and
local defaults remain documented but externalized.
- Around line 63-75: The Redis healthcheck command is missing authentication and
fails because Redis is started with --requirepass myredissecret; update the
healthcheck test (the healthcheck "test" entry that currently runs redis-cli
ping) to pass the password (e.g., use redis-cli -a myredissecret ping or the
equivalent redis-cli --pass myredissecret ping) so the healthcheck authenticates
successfully against the server started with --requirepass myredissecret.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 600b9af6-6529-4f41-9537-3b6b8a282514
📒 Files selected for processing (7)
infra/observability/README.mdinfra/observability/docker-compose.ymlmiot-harness/src/miot_harness/integrations/nexo/primitives.pymiot-harness/src/miot_harness/runtime/agentic_graph.pymiot-harness/src/miot_harness/runtime/intent_router.pymiot-harness/tests/integrations/nexo/test_primitives.pymiot-harness/tests/runtime/test_intent_router.py
✅ Files skipped from review due to trivial changes (1)
- infra/observability/README.md
🚧 Files skipped from review as they are similar to previous changes (5)
- miot-harness/tests/runtime/test_intent_router.py
- miot-harness/tests/integrations/nexo/test_primitives.py
- miot-harness/src/miot_harness/runtime/agentic_graph.py
- miot-harness/src/miot_harness/runtime/intent_router.py
- miot-harness/src/miot_harness/integrations/nexo/primitives.py
…e ping Container starts with --requirepass but healthcheck called `redis-cli ping` unauthenticated. redis-cli prints `NOAUTH Authentication required.` and exits 0, so the healthcheck reported "healthy" without ever verifying Redis was actually responsive. Use `-a` with --no-auth-warning so the ping is authenticated and the exit status reflects real health. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three lint failures in miot_harness.integrations.nexo.primitives: - E501 (lines 6-11): docstring table exceeded the 100-char budget. Rewrote as a bulleted list — same content, fits the line length. - B033 (line 119): `to_char` appeared twice in `_SAFE_FUNCTIONS` (once under "string", once under "date/time"). Kept the first; the set semantics were unchanged but the duplicate is a lint failure. - I001 (line 24): import block already grouped correctly (future → stdlib → third-party); the failure was a stray double blank line between imports and the first top-level comment. Ruff: 0 errors. Tests: tests/integrations/nexo/ → 103 passed, 1 skipped. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`isinstance(row, dict)` is always False for asyncpg.Record (it supports mapping-style subscript access but is not a dict subclass), so every nexo_explain call against a real DB raised "EXPLAIN output missing QUERY PLAN" before reaching the cost-gate logic. The existing tests passed only because their fake_pool fixtures fed plain dicts — coverage masked the prod bug. - primitives.py: switch to `row["QUERY PLAN"]` subscript access guarded by try/except (KeyError, TypeError, IndexError). Works uniformly across asyncpg.Record and plain-dict fixtures. - test_primitives.py: add test_nexo_explain_handles_record_like_rows using a non-dict RecordLike fixture that locks the fix against a future `isinstance(row, dict)` regression. Also fix two pre-existing I001 nits (out-of-order import, stray blank line) flagged by ruff on the touched file. Ruff: 0 errors. Tests: tests/integrations/nexo/ → 104 passed (+1), 1 skipped. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three lint failures in test_langfuse_attribution.py: - I001 (lines 73-74 and 131-132): inline imports inside two test bodies ordered `langchain_core.outputs` before `langchain_core.messages`. Reordered so messages comes first. - F401 (line 15): `pytest` imported but never referenced. Dropped. Ruff: 0 errors. Tests: tests/observability/test_langfuse_attribution.py → 4 passed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The import block is already correctly grouped (future → stdlib → third-party → local) and alphabetized; the actual cause of the I001 failure was a stray double blank line between the imports and the first test definition. Collapsed to one. Ruff: 0 errors. Tests: tests/observability/test_pricing.py → 6 passed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two parametrize tuples at lines 39 and 43 were 102 chars (line-length=100). Wrapped each tuple across five indented lines (prompt / model / score / route) — same semantics, fits the budget. Ruff: 0 errors. Tests: tests/runtime/test_intent_router.py → 10 passed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e-E tests F401: HarnessRoute was imported at line 29 but never referenced in the file. Ruff: 0 errors. Tests: tests/runtime/test_supervisor_phase_e.py → 7 passed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Four files not previously touched in this branch still tripped `ruff check src tests`: - src/miot_harness/observability/report.py:124 (E501): inline header f-string was 102 chars. Extracted to a two-line concatenated literal. - src/miot_harness/runtime/agentic_graph.py:78 (E501): inline ainvoke argument was 103 chars. Pulled into a `prompt` local. - tests/observability/conftest.py:17 and tests/observability/test_deps.py:12 (I001 ×2): stray double blank line after the import block — same pattern as earlier fixes on this branch. Ruff: 0 errors over `src tests`. Tests: tests/observability/, tests/runtime/test_intent_router.py, tests/runtime/test_supervisor_phase_e.py, tests/integrations/nexo/ → 178 passed, 1 skipped. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@infra/observability/docker-compose.yml`:
- Around line 77-88: The MinIO service (minio / container_name miot-obs-minio)
lacks a Docker healthcheck which causes dependents to wait for service_started
instead of service_healthy; add a healthcheck block to the minio service using
an HTTP or mc-based check against :9000 (for example a curl GET
/minio/health/live or mc admin info) with sensible interval/retries/timeout and
start_period so the container reports healthy only when MinIO is ready, and then
update the langfuse-web service dependency to use depends_on: { minio: {
condition: service_healthy } } (or equivalent docker-compose v3 syntax) so
langfuse-web waits for MinIO to be healthy before starting.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: cb96ba78-cc9b-497d-87a5-7215364853d4
📒 Files selected for processing (11)
infra/observability/docker-compose.ymlmiot-harness/src/miot_harness/integrations/nexo/primitives.pymiot-harness/src/miot_harness/observability/report.pymiot-harness/src/miot_harness/runtime/agentic_graph.pymiot-harness/tests/integrations/nexo/test_primitives.pymiot-harness/tests/observability/conftest.pymiot-harness/tests/observability/test_deps.pymiot-harness/tests/observability/test_langfuse_attribution.pymiot-harness/tests/observability/test_pricing.pymiot-harness/tests/runtime/test_intent_router.pymiot-harness/tests/runtime/test_supervisor_phase_e.py
🚧 Files skipped from review as they are similar to previous changes (10)
- miot-harness/tests/observability/test_deps.py
- miot-harness/tests/runtime/test_supervisor_phase_e.py
- miot-harness/tests/observability/conftest.py
- miot-harness/src/miot_harness/runtime/agentic_graph.py
- miot-harness/src/miot_harness/observability/report.py
- miot-harness/tests/integrations/nexo/test_primitives.py
- miot-harness/tests/observability/test_pricing.py
- miot-harness/tests/observability/test_langfuse_attribution.py
- miot-harness/tests/runtime/test_intent_router.py
- miot-harness/src/miot_harness/integrations/nexo/primitives.py
The minio service had no healthcheck, so langfuse-web's depends_on fell back to `service_started` — the web container could start while MinIO was still booting and fail blob uploads on first request. - minio: probe `/minio/health/live` via `curl -fsS` (both curl and mc confirmed present in minio/minio:RELEASE.2025-08-13). Used `127.0.0.1` to match the clickhouse IPv6 workaround above. Timing mirrors the rest of the stack (5s/5s, 10 retries, 10s start_period). - langfuse-web depends_on minio: flip `service_started` → `service_healthy`. Validated: `docker compose config --quiet` passes; recreated minio container reports `healthy` in ~3s. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`Tracer.start_as_current_span(attributes=...)` expects `Mapping[str, AttributeValue]` where `AttributeValue` is `str | bool | int | float | Sequence[...]`. The local was annotated `dict[str, object]`, which mypy can't narrow. Every value placed in this dict at runtime is already a str (f-string, the str-typed kwargs past their None guard, and `json.dumps(...)`), so tightening the annotation to `dict[str, str]` matches reality and clears the error without any runtime change. Mypy: 0 errors over 65 files. Tests: tests/observability/test_langfuse_attribution.py → 4 passed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Distribution-Evals job ran `07-attestations-present.sh` without
GH_TOKEN, so `gh attestation verify` exited on the auth precondition
("set the GH_TOKEN environment variable") and the script surfaced it
as a misleading "no attestation verifies" FAIL.
Scoped GH_TOKEN at the step (not the job) — the other two scripts in
the same job (05-pulls-from-ghcr.sh, 06-pulls-from-dockerhub.sh) don't
use gh, so the token stays narrow. The job's existing
`permissions.attestations: read` was already correct; the only missing
piece was exposing github.token to the gh CLI.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`docker/build-push-action`'s `provenance: true / sbom: true` embeds
attestations into the OCI image manifest only — they do NOT register
with the GitHub attestation API. The deploy-eval script
`07-attestations-present.sh` uses `gh attestation verify --owner`,
which queries `/orgs/{owner}/attestations/{digest}` and returned 404
because no attestation had ever been registered there.
Added three steps after the build, gated to skip on fork PRs (same
condition as the existing build-push-action `push:` gate, since fork
PRs can't write to org packages):
1. `anchore/sbom-action@v0` writes an SPDX-JSON SBOM file (the embedded
SBOM from build-push-action is OCI-only; attest-sbom needs a file).
2. `actions/attest-build-provenance@v2` registers SLSA provenance against
the image digest with GitHub.
3. `actions/attest-sbom@v2` registers the SPDX SBOM against the same
digest with GitHub.
Both attest-* actions also push to the OCI registry (`push-to-registry:
true`), so the OCI manifest carries attestations from both
build-push-action AND the attest-* family — defense in depth.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
🧹 Nitpick comments (1)
infra/observability/docker-compose.yml (1)
108-110: ⚡ Quick winBind published ports to loopback for safer local defaults.
These mappings currently expose Langfuse and OTLP endpoints on all host interfaces. For local-dev defaults, binding to
127.0.0.1reduces accidental LAN exposure.🔧 Suggested patch
langfuse-web: @@ ports: - - "3000:3000" + - "127.0.0.1:3000:3000" @@ otel-collector: @@ ports: - - "4317:4317" # OTLP gRPC (harness -> collector) - - "4318:4318" # OTLP HTTP (debug) - - "8888:8888" # collector self-metrics + - "127.0.0.1:4317:4317" # OTLP gRPC (harness -> collector) + - "127.0.0.1:4318:4318" # OTLP HTTP (debug) + - "127.0.0.1:8888:8888" # collector self-metricsAlso applies to: 192-195
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@infra/observability/docker-compose.yml` around lines 108 - 110, The ports mapping currently publishes services to all interfaces (e.g., the ports entry with "- \"3000:3000\""); change published ports to bind to loopback by prefixing the host address (e.g., "- \"127.0.0.1:3000:3000\"") for the Langfuse and OTLP port mappings (and any other similar entries around the other ports referenced in the file) so they only listen on localhost for safer local-dev defaults.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@infra/observability/docker-compose.yml`:
- Around line 108-110: The ports mapping currently publishes services to all
interfaces (e.g., the ports entry with "- \"3000:3000\""); change published
ports to bind to loopback by prefixing the host address (e.g., "-
\"127.0.0.1:3000:3000\"") for the Langfuse and OTLP port mappings (and any other
similar entries around the other ports referenced in the file) so they only
listen on localhost for safer local-dev defaults.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 9fe21d1d-99db-4c61-af51-f7717e09c16e
📒 Files selected for processing (3)
.github/workflows/harness.yamlinfra/observability/docker-compose.ymlmiot-harness/src/miot_harness/observability/spans.py
`gh attestation verify` filters by `--predicate-type` and defaults to `https://slsa.dev/provenance/v1`. The single call we were making only ever returned the provenance predicate, so the script reported "no SBOM predicate" even though `actions/attest-sbom@v2` had successfully registered an SPDX attestation against the same digest. Now does two calls: - provenance with explicit `--predicate-type https://slsa.dev/provenance/v1` (no behavior change, self-documenting) - SBOM with `--predicate-type https://spdx.dev/Document`, falling back to `https://cyclonedx.org/bom` if the SBOM tooling ever flips formats Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #462 shipped four .ralph/ files (RALPH_PROMPT.md, blockers.md, log.md, state.md) seeded by commit 8107571. They're local ralph-loop iteration state — worklist, progress log, blockers, runtime prompt — that should never have been tracked. Non-sensitive, but noise for reviewers and grows with every loop iteration. - `git rm -r --cached .ralph/` removes them from the index only; files remain on disk so the active ralph session keeps working. - Add `.ralph/` to root `.gitignore` (under the existing Cursor / Aura local-only dev-tooling section) so any future ralph run anywhere in the monorepo is ignored by default. User chose the non-destructive path: no history rewrite, no force-push. Old commits still reference the files but they're non-sensitive iteration logs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* harness(phase-13): close NEXO_META observation-level tenant tag gap
E10 emitted `langfuse.tags = [tenant:<id>, mode:<m>, ...]` on the root
`nexo.run` trace span, but the inner LLM call in meta mode slipped
through observation-level tagging — `supervisor._run_nexo_meta` called
`meta_agent_node(model=self.meta_model, ...)` with the **raw**
BaseChatModel, so Traceloop's auto-instrumented `anthropic.chat`
observation carried no `modular.{agent,tenant_id,mode}` attrs. Result:
trace-level filters worked; observation-level per-tenant cost rollups
under-counted meta runs.
- `runtime/nexo_graph.py`: rename `_instrument` → `instrument_model`
so it's importable across modules (no behavior change).
- `runtime/supervisor.py`: import `instrument_model`; in
`_run_nexo_meta`, wrap `self.meta_model` via
`instrument_model(self.meta_model, "meta_agent", ctx)` before
invoking `meta_agent_node`. Per-agent `nexo.meta_agent` span now
carries the same `modular.{agent,tenant_id,mode}` +
`langfuse.{user.id,session.id,tags,environment}` attrs as canned /
agentic paths.
Tests:
- `tests/runtime/test_supervisor_phase_e.py`:
+`test_meta_mode_per_agent_span_carries_tenant_tag` — runs the meta
dispatch via the supervisor, asserts the emitted nexo.meta_agent
span has `modular.agent=meta_agent`, `modular.tenant_id=acme-corp`,
`modular.mode=meta`, `langfuse.user.id=alice`, and
`tenant:acme-corp` / `mode:meta` / `agent:meta_agent` all present
in the JSON-decoded `langfuse.tags`.
- Drop unused `HarnessRoute` import in same file.
Conftest restructure: the `memory_exporter` fixture lived in
`tests/observability/conftest.py`, but the new supervisor test needs
it from `tests/runtime/`. Pytest's conftest scoping only resolves
ancestors, not siblings — promoted the fixture (+ the
`_HarnessSpanExporter` class) to `tests/conftest.py` so any test file
in the suite can use it. `tests/observability/conftest.py` is now a
docstring stub.
Full suite: 310 passed, 1 skipped.
* harness(phase-13): C3 live Langfuse fetcher + MIOT_HARNESS_LANGFUSE_HOST
`observability/report.py` previously errored out on the live path
("live Langfuse fetch is not wired yet — pass --fixture"). Wiring it
to the real Langfuse v3 public REST API.
`config.py`:
- `langfuse_host: str = "http://localhost:3000"` — where the CLI
fetches traces from. Local docker-compose stack by default;
override for a deployed Langfuse via env var.
- `.env.example`: documented entry alongside the existing
`MIOT_HARNESS_LANGFUSE_*` keys block.
`observability/report.py`:
- `_project_langfuse_trace(trace) -> {"attributes": {...}}` — pure
projection from Langfuse's trace row shape (`tags`, `totalCost`,
`userId`, `sessionId`) into the `aggregate_cost`-friendly shape.
Splits the `tenant:` / `mode:` / `agent:` prefixes back into
`modular.{tenant_id, mode, agent}` so the existing aggregator
keys still work. Token totals project as 0 (Langfuse v3 keeps
them on observations, not on trace rows).
- `fetch_traces_window(host, public_key, secret_key, since, now=,
page_size=100, client=)` — paginated GET against
`/api/public/traces` with Basic Auth, returning the raw rows.
Caller-injectable `client` for tests via `httpx.MockTransport`.
Owns the client lifecycle when none is injected.
- `_live_traces(since)` — operator-facing wrapper reading the three
env vars (`MIOT_HARNESS_LANGFUSE_{HOST,PUBLIC_KEY,SECRET_KEY}`).
Raises a clear RuntimeError telling the operator to run
`./infra/observability/bootstrap.sh` when keys are missing.
- `main()`: replaces the `parser.error("not wired yet")` branch
with `_live_traces(since)` when no `--fixture` is passed; still
honours `--fixture` for dry-run / offline use.
- Drive-by: line-wrap `_render_text`'s long header line to satisfy
ruff E501 (was committed in 462's a2f5e8a).
Tests (+8 new):
- `test_project_langfuse_trace_extracts_tags_and_cost` —
`tenant:acme` / `mode:agentic` → `modular.{tenant_id, mode}`.
- `test_project_langfuse_trace_handles_missing_tags`.
- `test_project_langfuse_trace_handles_null_total_cost` (Langfuse
returns `totalCost: null` for traces with no LLM calls).
- `test_project_langfuse_trace_picks_first_agent_tag` —
multi-agent traces project the first.
- `test_fetch_traces_window_paginates_until_total_pages` — 3-page
MockTransport stream concatenates correctly.
- `test_fetch_traces_window_forwards_since_as_query_params` —
asserts `fromTimestamp` / `toTimestamp` are derived from the
injected `now` and the `since` timedelta.
- `test_fetch_traces_window_sends_basic_auth_header` — asserts
`Authorization: Basic base64(pk:sk)`.
- `test_fetch_traces_window_strips_trailing_slash_from_host` — no
`//api/public/traces` corruption when host ends in `/`.
- `tests/test_config.py`: +2 tests for `langfuse_host` default and
env-var override.
Usage:
MIOT_HARNESS_LANGFUSE_PUBLIC_KEY=pk-lf-… MIOT_HARNESS_LANGFUSE_SECRET_KEY=sk-lf-… \
uv run python -m miot_harness.observability.report --since 7d --by tenant
Full suite: 310 passed, 1 skipped.
* harness(phase-13): docs — Run Modes + per-tenant billing surface
Four documentation updates covering the per-tenant billing flow this
PR ships. No code changes here — all source changes landed in the
preceding two commits.
`miot-harness/README.md`:
- New `## Run Modes` section after `## Design Defaults` covering the
four `mode` values on `POST /runs`: auto / canned / meta / agentic.
- 5-column table (route / DB-touch / non-Mintral / LLM-calls / use-case)
for at-a-glance comparison.
- Per-mode subsections with concrete example questions:
- `### canned — answer FROM the data` (nexo_graph, curated fn_dx_*).
- `### meta — answer ABOUT the data` (meta_agent_node, no SQL,
structural guarantee via function signature).
- `### agentic — explore the data with composable primitives`
(agentic_graph, sqlglot safety gate).
- `### Cost separation in Langfuse` — how `mode:<m>` tag + per-tenant
filter enables tiered billing (unlimited meta + metered canned).
`infra/observability/README.md`:
- New `## Per-tenant cost rollups` section before "What this is *not*"
documenting the four ways to get per-tenant cost:
1. Langfuse UI Filters sidebar (Tags → tenant:<id> → Total Cost).
2. Cost-report CLI with sample text output.
3. ClickHouse JOIN query against traces + observations (correctly
joining the tag-bearing trace row to the cost-bearing observation
rows — Langfuse v3 only stores total_cost on observations).
4. Persistent dashboard (pointer to dashboards/per-tenant-cost.md).
- Privacy posture reminder: single-tenant self-host; co-mingled tenant
data inside one Langfuse project; plan-14 isolates per project.
`infra/observability/dashboards/per-tenant-cost.md` (NEW):
- 2-minute click-by-click UI build procedure for the Per-tenant cost
widget. Langfuse v3 doesn't yet expose a stable dashboard-import
endpoint so we ship the build procedure instead of brittle JSON.
- ClickHouse JOIN query as a power-user fallback.
- CLI alternative.
`infra/observability/PER_TENANT_BILLING_TEST.md` (NEW):
- 9-step manual test procedure for end-to-end verification of the
per-tenant billing surface (tunnel → uvicorn → 3-tenant curl
battery → UI filter → CLI → NEXO_META observation-level check →
ClickHouse query → dashboard build).
- All paths are `$REPO_ROOT`-relative (`git rev-parse --show-toplevel`)
so the doc is portable across developer machines and CI.
- `$DB_SCRIPTS_ROOT` placeholder for the *separate* db-scripts repo
that owns the SSH tunnel script.
- Langfuse keys read from `docker-compose.yml` via a `grep | awk`
pipeline rather than hardcoded literal values — clean for forks
that regenerate the keys.
- Pass/fail checklist at the end with a "which file to inspect when
each step fails" troubleshooting table.
* harness(phase-13): post-review report.py fixes — settings-based config + (multi) agent sentinel
Two Important items from PR #470's first-pass review:
I1 — `_live_traces` was reading `os.environ.get` directly with a
hardcoded `"http://localhost:3000"` default, even though
`HarnessSettings.langfuse_host` already exists with the same default
(and proper `MIOT_HARNESS_` prefix + `.env` resolution via
pydantic-settings). Two defaults to keep in sync = drift hazard. Now
reads through `get_settings()` — single source of truth, `.env`
resolution + validation apply, the `langfuse_host` field is no longer
dead. `os` import dropped since nothing else used it.
I3 — `_project_langfuse_trace` previously projected the **first**
`agent:<name>` tag and silently dropped the rest. For a canned-mode
trace (which carries 3+ agent tags: filter_expert / synthesizer /
critic / …), this attributed all of the trace's cost to the first
agent and undercounted the others. For `--by tenant` / `--by mode`
this was correct (single-valued tags); for `--by agent` it was wrong
on live data. Now: single-agent traces still project the real name;
multi-agent traces project `modular.agent = "(multi)"` so the
aggregator bucket is visibly distinct. Operators see a `(multi)`
row and know to drill down via the observations endpoint instead of
trusting silently-undercounted numbers.
Tests:
- Replaced `test_project_langfuse_trace_picks_first_agent_tag` with
`test_project_langfuse_trace_marks_multi_agent_with_sentinel`.
- Added `test_project_langfuse_trace_single_agent_uses_real_name` to
pin down the unambiguous case.
Full suite: 321 passed, 1 skipped.
Review ref: PR #470 review I1 + I3.
* harness(phase-13): tighten meta-attribution test docstring (PR #470 I2)
The test docstring promised the regression target was "the inner LLM
call must carry the same attrs" — but the test asserts on the
harness-emitted `nexo.meta_agent` span (from `NexoTelemetryCallback`),
not the Traceloop-auto-instrumented `anthropic.chat` child observation
that was the actual gap. `FakeListChatModel` doesn't trigger the
Anthropic SDK instrumentor, so the test verifies the harness-side
wrap, not the third-party-instrumented child observation.
Rewrote the docstring to make the scope explicit:
- What's covered: the per-agent span emitted when `_run_nexo_meta`
calls `instrument_model(self.meta_model, "meta_agent", ctx)`.
- What's verified separately: the Traceloop child-observation path,
confirmed live via the ClickHouse query in PR #470's description
(`SELECT name, tenant_id FROM observations WHERE name='nexo.meta_agent'`).
No code change; docstring-only.
Review ref: PR #470 review I2.
* chore: remove plan files
* chore: remove plan files
* chore: remove plan files
---------
Co-authored-by: Michel David <mdavid.cu@gmail.com>
Summary
nexo.runroot +nexo.<agent>per-LLM-call spans) wired into every LangGraph node; Traceloop auto-instruments Anthropic/OpenAI SDKs; FastAPIInstrumentor adds HTTP-request spans; versioned pricing table converts token usage togen_ai.usage.cost_usdat run time.infra/observability/self-hosted Langfuse v3 stack (postgres + clickhouse + redis + minio + langfuse-web + langfuse-worker + otel-collector) with deterministic init keys,bootstrap.sh, and a README covering bring-up, backup, backend-swap.modefield onRunRequest(auto/canned/meta/agentic); composable DB primitives (nexo_describe/nexo_select/nexo_grep/nexo_explain) with a multi-layer sqlglot safety gate (AST mutation rejection, LATERAL rejection, positive function allowlist, table allowlist^nexo\.dx_*$, EXPLAIN cost gate); meta-question agent (no SQL); conversational memory (InMemoryConversationStore+ConversationStoreProtocol); new agentic LangGraph; route-awaretenancy_gate; provenance JSONL log + curate CLI; cost-report CLI.HarnessSupervisor.run()resolves the route viaLLMIntentRouter(with keyword fallback below threshold), dispatches tonexo_graph/agentic_graph/meta_agent_node/ storytelling, and round-tripsconversation_idthrough the store. The FastAPI lifespan injects all Phase-E modules on Nexo boot.What's NOT in this PR (follow-up)
tenancy_gate / planner / synthesizer / critic / summarizer) is wired, butplannerandsynthesizerare stubs. Connecting them to the composable primitives + provenance log is a contained follow-up; tests and the safety gate already cover the primitive surface.Known limitations / caveats
--by agentis a trace-level approximation. Each trace can carry multipleagent:<name>tags (one per agent that fired during the run), but_project_langfuse_traceprojects the first one. For per-tenant and per-mode rollups (single-valued tags) this is correct; for per-agent, dollars get attributed to whichever agent tag came first in the array. True per-agent attribution requires fetching the observations endpoint — ships in the stacked follow-up PR harness(phase-13): per-tenant billing surface (stacked on #462) #470.nexo.runtrace carrieslangfuse.tags = [tenant:<id>, …](E10), but the inneranthropic.chatobservation Traceloop auto-emits inside_run_nexo_metadoes not —meta_modelis invoked raw. Trace-level filters work; observation-level rollups under-count meta runs. Fixed in PR harness(phase-13): per-tenant billing surface (stacked on #462) #470 by wrappingmeta_modelviainstrument_model(self.meta_model, "meta_agent", ctx).Verification artifacts
Test suite:
295 passed, 1 skipped, 0 failed(uv run pytestfrommiot-harness/). Plan 12's 151-test baseline preserved; 144 new tests acrosstests/{observability,runtime,agents,integrations}/.Live telemetry (Langfuse @ http://localhost:3000): verified via direct ClickHouse query after F2/F3 curl battery:
nexo.runroot spansnexo.synthesizer+ 6nexo.filter_expertper-agent spansanthropic.chat(Traceloop-auto-instrumented) child spansauto/ 3canned/ 2meta/ 1agenticmodular.{mode, run_id, tenant_id}in its metadataF2 — canned-mode (
mode=\"canned\"): plan 12's G6/G7/G8 still produce Markdown answers from the real Coordinador snapshot.F3 — mode bypass + auto-router:
mode=\"meta\"fortenant_id=mintral→ primer+catalog answer, no DB.mode=\"meta\"fortenant_id=non-mintral-customer→ allowed (meta is non-confidential).mode=\"agentic\"fortenant_id=non-mintral-customer→ModeAccessDeniedat request-validation; 200 response with explicit refusal answer.mode=\"agentic\"fortenant_id=mintral→ fires through the stub planner+synthesizer; emits the right spans.NEXO_QUERYat confidence 0.75 → answered from the canned graph.E3 composable-primitives safety gate (the high-risk surface, reviewed twice):
pg_terminate_backend/pg_sleep/set_config/lo_import/pg_reload_conf/pg_advisory_lock/pg_read_file/dblink, multi-statement, CTE with mutation, MERGE, COPY, non-allowlisted schemas).INSERT/UPDATE/DELETEAST nodes but cause writes, kills, or DoS.Test plan
uv run pytest -qfrommiot-harness/→ 295 passedcd infra/observability && docker compose up -d && ./bootstrap.sh→ 7 services healthy, .env paste-block printedMIOT_HARNESS_OTEL_ENABLED=true uv run uvicorn miot_harness.api.server:create_app --factory→/healthreturnsnexo.enabled=true+ curated tool listPOST /runswithmode=canned(×3 scenarios) → 200, real Coordinador data, spans in LangfusePOST /runswithmode=meta(×2 tenants) → 200, primer+catalog answer, no SQLPOST /runswithmode=agentic(Mintral + non-Mintral) → 200/refusal as appropriatePOST /runswith implicitmode=auto→ LLMIntentRouter classifies correctlymodular.mode/modular.run_id/modular.tenant_idfor groupingCode review checkpoints
Two
superpowers:requesting-code-reviewcycles run during the plan:.env.exampledrift, conftest filter scope, configure_tracing no-op-on-second-call, spans.py docstring, A6 propagation test docstring) — commit21037769.684fbd09.Plan ref
.cursor/plans/ai-first/13-post-nexo-roadmap.md— Phases A, B, C (partial), D (partial), E, F1-F3.🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Infrastructure
Configuration