Skip to content

replication: scheduler glue + inbound channel + NAT-traversal doc (partial #65)#75

Open
emooreatx wants to merge 2 commits into
mainfrom
replication-scheduler-glue
Open

replication: scheduler glue + inbound channel + NAT-traversal doc (partial #65)#75
emooreatx wants to merge 2 commits into
mainfrom
replication-scheduler-glue

Conversation

@emooreatx

Copy link
Copy Markdown
Contributor

Summary

Re-cut of closed PR #74 (base was the now-merged `replication-long-lived-session` branch). Rebased onto main with #73 squashed in.

What's in this PR

  1. Scheduler glue — `ReplicationScheduler` + per-coordinator
    `tokio::time::interval` tasks + inbound mpsc channel. Closes
    the documented "Scheduler" non-goal in coordinator.rs / mod.rs.
    Responder rounds stay inline in the application's listen loop
    (event-driven, no scheduler needed).

    • `ReplicationCoordinator::deliver_inbound(msg)` / `recv_inbound()`
    • `SchedulerConfig { cadence: 30s, round_timeout: 10s }` defaults
      match MISSION "near-realtime convergence" anchor
    • `run_until_cancelled(watch::Receiver)` — clean shutdown
    • `run_with_events(cancel, Option<event_sink>)` — per-round
      `RoundEvent { Completed(RoundReport), Refused, TimedOut, Error }`
      for metrics / τ_partial telemetry
  2. NAT-traversal documentation — explicit paragraph in
    replication/mod.rs explaining that edge does NOT implement
    STUN/TURN/ICE. Reticulum routes by cryptographic destination
    hash; transit nodes carry packets without decrypt capability;
    the federation topology (registry / lens / agent 2.9.6 / CEG
    0.15 community peers) supplies the public-side peer set by
    construction. HTTP transport stays the one exception (operator
    config, not edge code).

Tests

5 new (214 lib total, was 209 after #73):

  • `scheduler_drives_initiator_round_to_completion` — end-to-end
    across two coordinators + in-memory transport + listen-loop sim;
    asserts `StalenessSignal::InSync`
  • `scheduler_reports_timeout_when_peer_silent` — black-hole +
    `RoundEvent::TimedOut`
  • `scheduler_exits_on_cancel` — clean shutdown
  • Builder smoke + defaults

Real-time tokio (10ms cadence / 500ms timeout) — lib-test build
doesn't pull tokio's `test-util` feature; tests finish in ~200ms.

Test plan

  • `cargo check` clean across all 4 CI feature combos under -D warnings
  • `cargo clippy --all-targets` clean on both CI clippy combos
  • `cargo fmt --check` clean
  • 214 lib tests pass
  • CI green on the PR

Replication ladder

#69 protocol → #70 coordinator → #71 directory adapters → #72
wire-frame prefix → #73 long-lived Session (merged) → this PR
(scheduler glue + NAT doc). Layer (c-2) production wiring remains.

🤖 Generated with Claude Code

emooreatx and others added 2 commits June 9, 2026 16:41
Closes the documented `Scheduler` non-goal in coordinator.rs / mod.rs.
The application can now wire one ReplicationScheduler over a fleet of
Initiator-side coordinators; the scheduler fires per-coordinator
tokio::time::interval ticks and drives each round through the
SendThenWait → recv inbound → DriveStep loop to completion (or
timeout). Responder rounds stay inline in the application's listen
loop (no separate scheduling needed — they're event-driven by inbound
Summary arrivals).

## Coordinator additions

- `ReplicationCoordinator::deliver_inbound(msg)` — push an inbound
  ReplicationMessage into the coordinator's internal queue. Called by
  the application's Transport::listen loop after parse_inbound_bytes
  recognizes a CRPL-framed message. Bounded mpsc(8), back-pressure
  surfaces as `NoRoundInProgress` on the listen-loop call (typically
  logged + dropped — the next interval tick recovers).
- `ReplicationCoordinator::recv_inbound()` — internal-facing; the
  scheduler awaits on this between SendThenWait phases.
- `INBOUND_CHANNEL_CAPACITY = 8` — one round in flight has at most 3
  messages queued (Summary + Diff + Deliver); 8 absorbs a slightly-
  late deliver while the scheduler is mid-step.

## scheduler.rs (new module)

- `SchedulerConfig` — cadence + round_timeout, default 30s + 10s
  (matches MISSION "near-realtime convergence" anchor).
- `ReplicationScheduler::new(config)` / `add_initiator(coord)` —
  builder; debug_assert that role is Initiator (Responders run inline
  in the listen loop, no scheduler).
- `run_until_cancelled(watch::Receiver<bool>)` — spawn one tokio task
  per registered coordinator, each with its own interval; observe the
  watch signal on every iteration, exit cleanly when it flips true.
- `run_with_events(cancel, Option<mpsc::Sender<(peer_id, RoundEvent)>>)`
  — same, plus per-round event surface for metrics consumers /
  τ_partial telemetry. Sink-closed is tolerated (round loops continue).
- `RoundEvent { Completed(RoundReport), Refused, TimedOut, Error(String) }`
  — what one round produced. The scheduler logs each via tracing
  spans regardless of whether an event_sink is wired.

## Cancellation model

`tokio::sync::watch::Receiver<bool>` — no new dep, lives in tokio core.
The caller holds the Sender, calls `.send(true)` to ask the scheduler
to shut down. Shutdown latency bounded by the slowest in-flight
round's round_timeout.

## Round timeout posture

If a SendThenWait phase's recv_inbound() blocks past round_timeout, the
scheduler emits RoundEvent::TimedOut, the coordinator's session
auto-resets via the existing Complete-path (the next interval tick
starts fresh). Anti-entropy is eventually-consistent by design — a
stuck round is a missed cadence, not a fault.

## Tests (5 new — 214 lib total, was 209)

- `add_initiator_tracks_len` — builder smoke.
- `default_config_pins_30s_and_10s` — defaults stable.
- `scheduler_drives_initiator_round_to_completion` — end-to-end across
  two real coordinators + in-memory transport pair + listen-loop-
  simulation tasks. Driver task feeds the scheduler's Alice; the
  responder-side Bob runs inline via a drain task that mirrors what
  the application's Transport::listen integration looks like.
  Asserts the round Completes with admitted=2 and InSync staleness —
  the full long-lived-session + scheduler + inbound-channel + transport
  stack is wired correctly.
- `scheduler_reports_timeout_when_peer_silent` — black-hole the
  outbound; assert RoundEvent::TimedOut surfaces.
- `scheduler_exits_on_cancel` — empty scheduler shuts down cleanly.

Uses real-time tokio (no `start_paused`) because the lib-test build
doesn't pull tokio's test-util feature; 10ms cadence + 500ms timeout
finishes each test in ~200ms.

## Verification

- cargo check clean across all 4 CI feature combos under -D warnings.
- cargo clippy clean on both CI clippy combos.
- cargo fmt clean.
- 214 lib tests pass (was 209; +5 scheduler).

## Replication ladder status

PRs #69 protocol → #70 coordinator → #71 directory adapters → #72
wire-frame prefix → #73 long-lived Session → this PR (scheduler glue).
Layer (c-2) production wiring (FederationDirectory blanket impl +
PyO3 init path) remains. Stacks on the long-lived-Session branch
because the scheduler depends on Session being held across calls.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…andles it)

Closes the documentation gap from PR #74's discussion. Edge does NOT
implement STUN / TURN / ICE — and shouldn't:

- Reticulum routes by cryptographic destination hash, NOT endpoint
  address. Transit nodes carry packets without decrypt capability,
  so the relay can't read what it's carrying.
- The federation topology supplies the public-side peer set by
  construction: registry servers (substrate, public by definition);
  CIRISLens (public observability surface — opt-in is policy-layer
  consent, not transit-layer reachability; relays carry what they
  can't read); agent 2.9.6 CEG/RET-native community-server-opt-in;
  any other CEG 0.15 community peer with a public interface.
- HTTP transport stays the one exception — its accept-side needs
  operator-supplied port-forward / reverse-proxy. Documented as
  such; operators behind hard NATs should use Reticulum (MISSION
  §1.4 canonical anyway).

Documentation-only — no code change.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@chatgpt-codex-connector

Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

emooreatx added a commit that referenced this pull request Jun 10, 2026
…ime + PyReplicationHandle (closes #65 v1) (#79)

Closes the 8th and final rung of CIRISEdge#65's v1 ladder per
FSD/REPLICATION_WIRE_FORMAT_V1.md §3.7. After this PR, #65's v1
implementation is feature-complete: edge ships the orchestration
glue the operator wires into their application.

## What's in this PR

### `src/replication/registry.rs` (new module)

`ReplicationRegistry` — application-side dispatch table indexed by
`(peer_key_id, kind)`. Closes the 5th of the 5 design questions
raised in #65's pre-FSD discussion: how to dispatch CRPL-framed
inbound bytes to the matching coordinator.

- `register(peer_key_id, kind, coord)` / `deregister(...)` /
  `get(...)` / `len()` / `is_empty()` / `registered_keys()` —
  standard map ops, RwLock-backed (read-mostly).
- `route_inbound_bytes(peer_key_id, bytes)` — the hot-path call from
  the operator's Transport::listen loop. Trichotomy:
  - `Ok(Routed)` — bytes parsed as a CRPL+VER replication frame for a
    registered (peer, kind); coordinator's `deliver_inbound` accepted
  - `Ok(NotAReplicationFrame)` — no CRPL magic; caller routes to its
    non-replication dispatch
  - `Ok(NoCoordinatorRegistered { kind })` — CRPL present + parseable
    but no `(peer, kind)` registration; caller logs + drops
  - `Err(Protocol(_))` — malformed body or unknown wire version
  - `Err(BackPressure { … })` — coordinator's inbound channel full

9 new tests cover all paths.

### `src/replication/runtime.rs` (new module)

`ReplicationRuntime` — bundles bridge + registry + scheduler task +
shutdown handle into a single managed runtime. Closes the
orchestration concern from FSD §3.7.

- `ReplicationRuntime::start(directory, transport, peers, config)` —
  async. Constructs the FederationDirectoryReplicationBridge, builds
  one ReplicationCoordinator per (peer_key_id, kind) in Initiator
  role, registers them with a shared ReplicationRegistry, hands the
  Initiator set to a ReplicationScheduler, and spawns the scheduler's
  run loop on the current tokio runtime.
- `register_peer(peer_key_id, kind)` — hot-add. Defaults to Responder
  role; the scheduler's Initiator set is fixed at start in v1 (a
  v1.x patch will extend the scheduler with a dynamic-add API).
- `registry()` / `bridge()` — shared handles for operator
  integration + telemetry.
- `shutdown()` — flips the scheduler's cancel watch and awaits the
  run-loop task to completion. Idempotent.

4 new tests cover construction, hot-add, and shutdown.

### `src/ffi/pyo3.rs` — PyO3 surface

- `PyEdge::start_replication(peers, cadence_seconds=None)` — Python
  method on PyEdge. Pulls the federation_directory + first transport
  from the cohabitation Edge, parses the kind strings, dispatches to
  `ReplicationRuntime::start` via persist's executor capsule, and
  returns a `PyReplicationHandle`. Surfaces clear `ValueError`s on
  missing directory / missing transport / unknown kind string.

- `parse_envelope_kind(s)` — strict parser for the 10 wire-stable
  kind tokens (key / attestation / revocation / identity_occurrence
  / family / community / identity_occurrence_revocation /
  family_membership_revocation / community_membership_revocation /
  location_proof). Same `serde(rename_all = "snake_case")` tokens the
  wire frame uses; FSD §3.3 table.

- `PyReplicationHandle` — Python-facing handle wrapping the
  ReplicationRuntime. Methods: `registered_count` /
  `register_peer(peer_key_id, kind)` / `stop()`. `Arc<Mutex<Option<…>>>`
  inner pattern lets clones share state + makes `stop` idempotent
  (post-stop, the handle is dead but cloning works).

### Why no auto-routing into Transport::listen

Edge's Transport::listen already runs in the application's existing
dispatch loop. Wiring the registry's `route_inbound_bytes` INTO that
loop is operator code (a one-line addition to the application's
listen-handler), not edge's job. This keeps the v1 cut clean: the
runtime exposes the registry; the operator wires it. A v1.7
follow-up may add an opt-in `Edge::install_replication_routing(runtime)`
helper if operator demand surfaces.

## Verification

- 4 CI feature combos clean under RUSTFLAGS="-D warnings"
- cargo clippy clean on both CI clippy combos
- cargo fmt clean
- 244 lib tests green (was 231; +9 registry + +4 runtime)
- 5 proptest properties green

## #65 ladder status — v1 feature complete

All 8 rungs landed:
1. ✓ Protocol state machine (#69)
2. ✓ Coordinator (#70)
3. ✓ Directory adapters (#71)
4. ✓ Wire-frame prefix (#72)
5. ✓ Long-lived Session (#73)
6. ✓ Scheduler + wire-format lock (#75 / #76)
7. ✓ Layer (c-2) production bridge (#78 / v1.6.0)
8. ✓ **PyO3 init wiring + registry + runtime (this PR)**

#65 closes as v1 feature complete after this merges. v2 cycles (PyO3
listen-loop integration helper, ReadEngine bulk optimization, persist-
side lookup_*_by_content_hash, scheduler dynamic-add API) become
v1.7+ patch follow-ups.

## Path to CIRIS 2.0 (FSD §5) unchanged

- v1 wire LOCKED at 10 variants (WIRE_PROTOCOL_VERSION = 0x01)
- v2 cut gated on CIRISRegistry#58 Phase 2 (operational-data envelopes)
- v2 will be the natural moment to switch envelope_hash to a CEG-§0.9-
  conformant basis (per §3.2.1's deferred interop path)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
emooreatx added a commit that referenced this pull request Jun 10, 2026
…shipped

Pin-only bump for the release tag. The substantive work landed in
#79 (commit e5cca14) and is already on main; this commit moves the
version string from 1.6.2 → 1.6.3 so the v1.6.3 release tag points
at the post-#79 main with the right declared version.

## #65 ladder — all 8 rungs landed

| Rung | PR / cut         | Surface                                          |
|------|------------------|--------------------------------------------------|
| 1    | #69              | Protocol state machine                           |
| 2    | #70              | ReplicationCoordinator                           |
| 3    | #71              | Directory adapters                               |
| 4    | #72              | Wire-frame prefix (CRPL magic)                   |
| 5    | #73              | Long-lived Session in coordinator                |
| 6    | #75 / #76        | Scheduler + v1 wire-format lock                  |
| 7    | #78 (v1.6.0)     | Layer (c-2) FederationDirectoryReplicationBridge |
| 8    | #79 (this cut)   | ReplicationRegistry + Runtime + PyO3 init        |

After v1.6.3, #65 closes as v1 feature complete.

## What v1.6.3 ships (the #79 surfaces)

- `replication::registry::ReplicationRegistry` — application-side
  dispatch table indexed by (peer_key_id, kind). `route_inbound_bytes`
  hot-path call returns a RouteOutcome trichotomy (Routed /
  NotAReplicationFrame / NoCoordinatorRegistered) + BackPressure error
  for channel-full conditions.
- `replication::runtime::ReplicationRuntime::start(directory, transport,
  peers, config)` (async) — bundles bridge + registry + scheduler task
  + shutdown handle into one managed runtime.
- `PyEdge::start_replication(peers, cadence_seconds=None)` — Python
  surface that lifts the operator's peer list + cadence into a
  ReplicationRuntime via persist's executor capsule. Returns a
  `PyReplicationHandle` with `registered_count` / `register_peer` /
  `stop` methods.

## Substrate pins unchanged

- ciris-persist v4.15.0 (CEG 1.0 §0.9 JCS canonicalization flip)
- ciris-verify v5.0.0 (CEG 1.0 / Agent 3.0 substrate)
- v1 wire taxonomy locked at 10 variants (WIRE_PROTOCOL_VERSION = 0x01)

## v1.7+ follow-ups (deferred; non-blocking)

- `Edge::install_replication_routing(runtime)` opt-in helper that
  wires `registry.route_inbound_bytes` into the existing
  `Transport::listen` loop (currently operator-wired)
- ReadEngine bulk optimization for Key/Attestation/Revocation
- persist-side `lookup_*_by_content_hash` admit surfaces
- Scheduler dynamic-add API for hot-added Initiator coordinators

## Path to CIRIS 2.0 (FSD §5) — unchanged

- v2 cut gated on CIRISRegistry#58 Phase 2 (operational-data envelopes)
- v2 will be the natural moment to switch envelope_hash to a
  CEG-§0.9-conformant basis (per §3.2.1's deferred interop path)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
emooreatx added a commit that referenced this pull request Jun 16, 2026
…rd cut) (#130)

CEG 1.0-RC7 (CIRISRegistry 9535b2a) pinned the hybrid-required posture
at every federation-tier gate. The substrate co-bumped:

  - persist v7.1.0 → v7.2.0 — trace-tier hybrid hard cut (#225). V083
    migration adds signature_ml_dsa_65 + pubkey_ml_dsa_65 + pqc_key_id
    on `trace_events` (pg + sqlite parity, no TimescaleDB). VerifyMode::
    Full now rejects classical-only per-trace sigs at admission; a
    CRQC-era attacker who breaks Ed25519 cannot forge backdated traces
    under any historical key. Per-trace testimony is now post-quantum.

  - verify-family v5.6.0 → v5.7.0 — HybridPolicy::RequireHybrid default
    at three federation-tier gates (#75): threshold::verify_threshold_
    signatures, provenance::verify_provenance_chain, license gate. A
    stripped PQC half no longer counts; partnership envelope shape +
    infra/agency scope split (#76, #77) align with RC7 §8.1.12.7.1 /
    §5.6.8.10 / §1.3.

Edge impact: pure pin flip, no source change. Edge's verify surface
re-exports HybridPolicy from persist::prelude (src/verify.rs:48), and
that re-export tracks persist's flip transitively — no edge
signature touches the policy default. The seven-member partnership
envelope + scope-split verifier are producer/consumer at agent/server
tier; edge is byte-transport.

Aligns with v3.7.0's `KexAlgorithm::HybridRequired` for transit KEX —
the whole stack is now consistently HNDL-strict at the federation
boundary, with explicit AllowClassicalPending / Hybrid fallback for
local-tier paths only.

Gate sweep:
  - 4 build combos clean under RUSTFLAGS=-D warnings (core / reticulum /
    packet-radio / all-transports)
  - 30 targeted lib tests green (realtime_av: 14, federation_session:
    12, verify: 4)
  - Cargo↔pyproject skew check OK (persist v7.2.0 satisfies '>=7.0.0,<8')
  - Pre-push hook (`cargo clippy --all-targets -D warnings`) gates next.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant