replication: scheduler glue + inbound channel + NAT-traversal doc (partial #65)#75
Open
emooreatx wants to merge 2 commits into
Open
replication: scheduler glue + inbound channel + NAT-traversal doc (partial #65)#75emooreatx wants to merge 2 commits into
emooreatx wants to merge 2 commits into
Conversation
Closes the documented `Scheduler` non-goal in coordinator.rs / mod.rs.
The application can now wire one ReplicationScheduler over a fleet of
Initiator-side coordinators; the scheduler fires per-coordinator
tokio::time::interval ticks and drives each round through the
SendThenWait → recv inbound → DriveStep loop to completion (or
timeout). Responder rounds stay inline in the application's listen
loop (no separate scheduling needed — they're event-driven by inbound
Summary arrivals).
## Coordinator additions
- `ReplicationCoordinator::deliver_inbound(msg)` — push an inbound
ReplicationMessage into the coordinator's internal queue. Called by
the application's Transport::listen loop after parse_inbound_bytes
recognizes a CRPL-framed message. Bounded mpsc(8), back-pressure
surfaces as `NoRoundInProgress` on the listen-loop call (typically
logged + dropped — the next interval tick recovers).
- `ReplicationCoordinator::recv_inbound()` — internal-facing; the
scheduler awaits on this between SendThenWait phases.
- `INBOUND_CHANNEL_CAPACITY = 8` — one round in flight has at most 3
messages queued (Summary + Diff + Deliver); 8 absorbs a slightly-
late deliver while the scheduler is mid-step.
## scheduler.rs (new module)
- `SchedulerConfig` — cadence + round_timeout, default 30s + 10s
(matches MISSION "near-realtime convergence" anchor).
- `ReplicationScheduler::new(config)` / `add_initiator(coord)` —
builder; debug_assert that role is Initiator (Responders run inline
in the listen loop, no scheduler).
- `run_until_cancelled(watch::Receiver<bool>)` — spawn one tokio task
per registered coordinator, each with its own interval; observe the
watch signal on every iteration, exit cleanly when it flips true.
- `run_with_events(cancel, Option<mpsc::Sender<(peer_id, RoundEvent)>>)`
— same, plus per-round event surface for metrics consumers /
τ_partial telemetry. Sink-closed is tolerated (round loops continue).
- `RoundEvent { Completed(RoundReport), Refused, TimedOut, Error(String) }`
— what one round produced. The scheduler logs each via tracing
spans regardless of whether an event_sink is wired.
## Cancellation model
`tokio::sync::watch::Receiver<bool>` — no new dep, lives in tokio core.
The caller holds the Sender, calls `.send(true)` to ask the scheduler
to shut down. Shutdown latency bounded by the slowest in-flight
round's round_timeout.
## Round timeout posture
If a SendThenWait phase's recv_inbound() blocks past round_timeout, the
scheduler emits RoundEvent::TimedOut, the coordinator's session
auto-resets via the existing Complete-path (the next interval tick
starts fresh). Anti-entropy is eventually-consistent by design — a
stuck round is a missed cadence, not a fault.
## Tests (5 new — 214 lib total, was 209)
- `add_initiator_tracks_len` — builder smoke.
- `default_config_pins_30s_and_10s` — defaults stable.
- `scheduler_drives_initiator_round_to_completion` — end-to-end across
two real coordinators + in-memory transport pair + listen-loop-
simulation tasks. Driver task feeds the scheduler's Alice; the
responder-side Bob runs inline via a drain task that mirrors what
the application's Transport::listen integration looks like.
Asserts the round Completes with admitted=2 and InSync staleness —
the full long-lived-session + scheduler + inbound-channel + transport
stack is wired correctly.
- `scheduler_reports_timeout_when_peer_silent` — black-hole the
outbound; assert RoundEvent::TimedOut surfaces.
- `scheduler_exits_on_cancel` — empty scheduler shuts down cleanly.
Uses real-time tokio (no `start_paused`) because the lib-test build
doesn't pull tokio's test-util feature; 10ms cadence + 500ms timeout
finishes each test in ~200ms.
## Verification
- cargo check clean across all 4 CI feature combos under -D warnings.
- cargo clippy clean on both CI clippy combos.
- cargo fmt clean.
- 214 lib tests pass (was 209; +5 scheduler).
## Replication ladder status
PRs #69 protocol → #70 coordinator → #71 directory adapters → #72
wire-frame prefix → #73 long-lived Session → this PR (scheduler glue).
Layer (c-2) production wiring (FederationDirectory blanket impl +
PyO3 init path) remains. Stacks on the long-lived-Session branch
because the scheduler depends on Session being held across calls.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…andles it) Closes the documentation gap from PR #74's discussion. Edge does NOT implement STUN / TURN / ICE — and shouldn't: - Reticulum routes by cryptographic destination hash, NOT endpoint address. Transit nodes carry packets without decrypt capability, so the relay can't read what it's carrying. - The federation topology supplies the public-side peer set by construction: registry servers (substrate, public by definition); CIRISLens (public observability surface — opt-in is policy-layer consent, not transit-layer reachability; relays carry what they can't read); agent 2.9.6 CEG/RET-native community-server-opt-in; any other CEG 0.15 community peer with a public interface. - HTTP transport stays the one exception — its accept-side needs operator-supplied port-forward / reverse-proxy. Documented as such; operators behind hard NATs should use Reticulum (MISSION §1.4 canonical anyway). Documentation-only — no code change. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
This was referenced Jun 9, 2026
emooreatx
added a commit
that referenced
this pull request
Jun 10, 2026
…ime + PyReplicationHandle (closes #65 v1) (#79) Closes the 8th and final rung of CIRISEdge#65's v1 ladder per FSD/REPLICATION_WIRE_FORMAT_V1.md §3.7. After this PR, #65's v1 implementation is feature-complete: edge ships the orchestration glue the operator wires into their application. ## What's in this PR ### `src/replication/registry.rs` (new module) `ReplicationRegistry` — application-side dispatch table indexed by `(peer_key_id, kind)`. Closes the 5th of the 5 design questions raised in #65's pre-FSD discussion: how to dispatch CRPL-framed inbound bytes to the matching coordinator. - `register(peer_key_id, kind, coord)` / `deregister(...)` / `get(...)` / `len()` / `is_empty()` / `registered_keys()` — standard map ops, RwLock-backed (read-mostly). - `route_inbound_bytes(peer_key_id, bytes)` — the hot-path call from the operator's Transport::listen loop. Trichotomy: - `Ok(Routed)` — bytes parsed as a CRPL+VER replication frame for a registered (peer, kind); coordinator's `deliver_inbound` accepted - `Ok(NotAReplicationFrame)` — no CRPL magic; caller routes to its non-replication dispatch - `Ok(NoCoordinatorRegistered { kind })` — CRPL present + parseable but no `(peer, kind)` registration; caller logs + drops - `Err(Protocol(_))` — malformed body or unknown wire version - `Err(BackPressure { … })` — coordinator's inbound channel full 9 new tests cover all paths. ### `src/replication/runtime.rs` (new module) `ReplicationRuntime` — bundles bridge + registry + scheduler task + shutdown handle into a single managed runtime. Closes the orchestration concern from FSD §3.7. - `ReplicationRuntime::start(directory, transport, peers, config)` — async. Constructs the FederationDirectoryReplicationBridge, builds one ReplicationCoordinator per (peer_key_id, kind) in Initiator role, registers them with a shared ReplicationRegistry, hands the Initiator set to a ReplicationScheduler, and spawns the scheduler's run loop on the current tokio runtime. - `register_peer(peer_key_id, kind)` — hot-add. Defaults to Responder role; the scheduler's Initiator set is fixed at start in v1 (a v1.x patch will extend the scheduler with a dynamic-add API). - `registry()` / `bridge()` — shared handles for operator integration + telemetry. - `shutdown()` — flips the scheduler's cancel watch and awaits the run-loop task to completion. Idempotent. 4 new tests cover construction, hot-add, and shutdown. ### `src/ffi/pyo3.rs` — PyO3 surface - `PyEdge::start_replication(peers, cadence_seconds=None)` — Python method on PyEdge. Pulls the federation_directory + first transport from the cohabitation Edge, parses the kind strings, dispatches to `ReplicationRuntime::start` via persist's executor capsule, and returns a `PyReplicationHandle`. Surfaces clear `ValueError`s on missing directory / missing transport / unknown kind string. - `parse_envelope_kind(s)` — strict parser for the 10 wire-stable kind tokens (key / attestation / revocation / identity_occurrence / family / community / identity_occurrence_revocation / family_membership_revocation / community_membership_revocation / location_proof). Same `serde(rename_all = "snake_case")` tokens the wire frame uses; FSD §3.3 table. - `PyReplicationHandle` — Python-facing handle wrapping the ReplicationRuntime. Methods: `registered_count` / `register_peer(peer_key_id, kind)` / `stop()`. `Arc<Mutex<Option<…>>>` inner pattern lets clones share state + makes `stop` idempotent (post-stop, the handle is dead but cloning works). ### Why no auto-routing into Transport::listen Edge's Transport::listen already runs in the application's existing dispatch loop. Wiring the registry's `route_inbound_bytes` INTO that loop is operator code (a one-line addition to the application's listen-handler), not edge's job. This keeps the v1 cut clean: the runtime exposes the registry; the operator wires it. A v1.7 follow-up may add an opt-in `Edge::install_replication_routing(runtime)` helper if operator demand surfaces. ## Verification - 4 CI feature combos clean under RUSTFLAGS="-D warnings" - cargo clippy clean on both CI clippy combos - cargo fmt clean - 244 lib tests green (was 231; +9 registry + +4 runtime) - 5 proptest properties green ## #65 ladder status — v1 feature complete All 8 rungs landed: 1. ✓ Protocol state machine (#69) 2. ✓ Coordinator (#70) 3. ✓ Directory adapters (#71) 4. ✓ Wire-frame prefix (#72) 5. ✓ Long-lived Session (#73) 6. ✓ Scheduler + wire-format lock (#75 / #76) 7. ✓ Layer (c-2) production bridge (#78 / v1.6.0) 8. ✓ **PyO3 init wiring + registry + runtime (this PR)** #65 closes as v1 feature complete after this merges. v2 cycles (PyO3 listen-loop integration helper, ReadEngine bulk optimization, persist- side lookup_*_by_content_hash, scheduler dynamic-add API) become v1.7+ patch follow-ups. ## Path to CIRIS 2.0 (FSD §5) unchanged - v1 wire LOCKED at 10 variants (WIRE_PROTOCOL_VERSION = 0x01) - v2 cut gated on CIRISRegistry#58 Phase 2 (operational-data envelopes) - v2 will be the natural moment to switch envelope_hash to a CEG-§0.9- conformant basis (per §3.2.1's deferred interop path) Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
emooreatx
added a commit
that referenced
this pull request
Jun 10, 2026
…shipped Pin-only bump for the release tag. The substantive work landed in #79 (commit e5cca14) and is already on main; this commit moves the version string from 1.6.2 → 1.6.3 so the v1.6.3 release tag points at the post-#79 main with the right declared version. ## #65 ladder — all 8 rungs landed | Rung | PR / cut | Surface | |------|------------------|--------------------------------------------------| | 1 | #69 | Protocol state machine | | 2 | #70 | ReplicationCoordinator | | 3 | #71 | Directory adapters | | 4 | #72 | Wire-frame prefix (CRPL magic) | | 5 | #73 | Long-lived Session in coordinator | | 6 | #75 / #76 | Scheduler + v1 wire-format lock | | 7 | #78 (v1.6.0) | Layer (c-2) FederationDirectoryReplicationBridge | | 8 | #79 (this cut) | ReplicationRegistry + Runtime + PyO3 init | After v1.6.3, #65 closes as v1 feature complete. ## What v1.6.3 ships (the #79 surfaces) - `replication::registry::ReplicationRegistry` — application-side dispatch table indexed by (peer_key_id, kind). `route_inbound_bytes` hot-path call returns a RouteOutcome trichotomy (Routed / NotAReplicationFrame / NoCoordinatorRegistered) + BackPressure error for channel-full conditions. - `replication::runtime::ReplicationRuntime::start(directory, transport, peers, config)` (async) — bundles bridge + registry + scheduler task + shutdown handle into one managed runtime. - `PyEdge::start_replication(peers, cadence_seconds=None)` — Python surface that lifts the operator's peer list + cadence into a ReplicationRuntime via persist's executor capsule. Returns a `PyReplicationHandle` with `registered_count` / `register_peer` / `stop` methods. ## Substrate pins unchanged - ciris-persist v4.15.0 (CEG 1.0 §0.9 JCS canonicalization flip) - ciris-verify v5.0.0 (CEG 1.0 / Agent 3.0 substrate) - v1 wire taxonomy locked at 10 variants (WIRE_PROTOCOL_VERSION = 0x01) ## v1.7+ follow-ups (deferred; non-blocking) - `Edge::install_replication_routing(runtime)` opt-in helper that wires `registry.route_inbound_bytes` into the existing `Transport::listen` loop (currently operator-wired) - ReadEngine bulk optimization for Key/Attestation/Revocation - persist-side `lookup_*_by_content_hash` admit surfaces - Scheduler dynamic-add API for hot-added Initiator coordinators ## Path to CIRIS 2.0 (FSD §5) — unchanged - v2 cut gated on CIRISRegistry#58 Phase 2 (operational-data envelopes) - v2 will be the natural moment to switch envelope_hash to a CEG-§0.9-conformant basis (per §3.2.1's deferred interop path) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
emooreatx
added a commit
that referenced
this pull request
Jun 16, 2026
…rd cut) (#130) CEG 1.0-RC7 (CIRISRegistry 9535b2a) pinned the hybrid-required posture at every federation-tier gate. The substrate co-bumped: - persist v7.1.0 → v7.2.0 — trace-tier hybrid hard cut (#225). V083 migration adds signature_ml_dsa_65 + pubkey_ml_dsa_65 + pqc_key_id on `trace_events` (pg + sqlite parity, no TimescaleDB). VerifyMode:: Full now rejects classical-only per-trace sigs at admission; a CRQC-era attacker who breaks Ed25519 cannot forge backdated traces under any historical key. Per-trace testimony is now post-quantum. - verify-family v5.6.0 → v5.7.0 — HybridPolicy::RequireHybrid default at three federation-tier gates (#75): threshold::verify_threshold_ signatures, provenance::verify_provenance_chain, license gate. A stripped PQC half no longer counts; partnership envelope shape + infra/agency scope split (#76, #77) align with RC7 §8.1.12.7.1 / §5.6.8.10 / §1.3. Edge impact: pure pin flip, no source change. Edge's verify surface re-exports HybridPolicy from persist::prelude (src/verify.rs:48), and that re-export tracks persist's flip transitively — no edge signature touches the policy default. The seven-member partnership envelope + scope-split verifier are producer/consumer at agent/server tier; edge is byte-transport. Aligns with v3.7.0's `KexAlgorithm::HybridRequired` for transit KEX — the whole stack is now consistently HNDL-strict at the federation boundary, with explicit AllowClassicalPending / Hybrid fallback for local-tier paths only. Gate sweep: - 4 build combos clean under RUSTFLAGS=-D warnings (core / reticulum / packet-radio / all-transports) - 30 targeted lib tests green (realtime_av: 14, federation_session: 12, verify: 4) - Cargo↔pyproject skew check OK (persist v7.2.0 satisfies '>=7.0.0,<8') - Pre-push hook (`cargo clippy --all-targets -D warnings`) gates next.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Re-cut of closed PR #74 (base was the now-merged `replication-long-lived-session` branch). Rebased onto main with #73 squashed in.
What's in this PR
Scheduler glue — `ReplicationScheduler` + per-coordinator
`tokio::time::interval` tasks + inbound mpsc channel. Closes
the documented "Scheduler" non-goal in coordinator.rs / mod.rs.
Responder rounds stay inline in the application's listen loop
(event-driven, no scheduler needed).
match MISSION "near-realtime convergence" anchor
`RoundEvent { Completed(RoundReport), Refused, TimedOut, Error }`
for metrics / τ_partial telemetry
NAT-traversal documentation — explicit paragraph in
replication/mod.rs explaining that edge does NOT implement
STUN/TURN/ICE. Reticulum routes by cryptographic destination
hash; transit nodes carry packets without decrypt capability;
the federation topology (registry / lens / agent 2.9.6 / CEG
0.15 community peers) supplies the public-side peer set by
construction. HTTP transport stays the one exception (operator
config, not edge code).
Tests
5 new (214 lib total, was 209 after #73):
across two coordinators + in-memory transport + listen-loop sim;
asserts `StalenessSignal::InSync`
`RoundEvent::TimedOut`
Real-time tokio (10ms cadence / 500ms timeout) — lib-test build
doesn't pull tokio's `test-util` feature; tests finish in ~200ms.
Test plan
Replication ladder
#69 protocol → #70 coordinator → #71 directory adapters → #72
wire-frame prefix → #73 long-lived Session (merged) → this PR
(scheduler glue + NAT doc). Layer (c-2) production wiring remains.
🤖 Generated with Claude Code