feat: Implement full application flow (#66)
Note — Reviews paused: It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the review settings.
📝 Walkthrough

Replaces agent-forwarding with per-job VM orchestration: the backend now allocates IPs, builds/selects initramfs images, creates a VM per job, waits for the in-VM agent, executes the job via HTTP, collects results, and tears down the VM. Also adds initramfs rebuild metadata, VMM stop signaling, safer stdin handling, and new docs (QUICKSTART, README updates).
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client as Client
    participant Backend as Backend
    participant IpMgr as IpManager
    participant VM as VmHandle
    participant VMM as VMM
    participant Agent as Agent
    Client->>Backend: Submit job (code, language)
    Backend->>IpMgr: Allocate IP
    IpMgr-->>Backend: IP assigned
    Backend->>VM: Create VM(language, VmConfig)
    VM->>VMM: Start (kernel + initramfs)
    VMM-->>VM: VMM running
    VM->>Agent: Wait for agent / health
    Agent-->>VM: Agent ready
    Backend->>Agent: HTTP execute job
    Agent-->>Backend: Job result (stdout, stderr, exit_code)
    Backend->>VM: Destroy VM (set stop_handle)
    VM->>VMM: Stop
    VMM-->>VM: Stopped
    VM->>IpMgr: Release IP
    Backend-->>Client: Job complete
```
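The per-job flow in the diagram can be sketched as a single orchestration function. This is a minimal illustration with stubbed types — every name here (`allocate_ip`, `create_vm`, `run_job`, etc.) is a hypothetical stand-in, not this PR's actual API:

```rust
// Minimal sketch of the per-job flow in the diagram above. All types and
// function names are illustrative stand-ins, not this PR's actual API.

struct JobResult {
    stdout: String,
    stderr: String,
    exit_code: i32,
}

fn allocate_ip() -> Result<String, String> {
    Ok("10.0.0.2".to_string()) // stub: a real IpManager hands out pool addresses
}

fn create_vm(_language: &str, _ip: &str) -> Result<u32, String> {
    Ok(1) // stub: returns a VM id after kernel + initramfs boot
}

fn wait_for_agent(_vm: u32) -> Result<(), String> {
    Ok(()) // stub: polls the in-VM agent's health endpoint
}

fn execute_job(_vm: u32, code: &str) -> Result<JobResult, String> {
    Ok(JobResult {
        stdout: format!("ran {} bytes", code.len()),
        stderr: String::new(),
        exit_code: 0,
    })
}

fn destroy_vm(_vm: u32) {} // stub: sets the stop handle, joins the VM thread
fn release_ip(_ip: &str) {} // stub: returns the address to the pool

fn run_job(language: &str, code: &str) -> Result<JobResult, String> {
    let ip = allocate_ip()?;
    let vm = match create_vm(language, &ip) {
        Ok(vm) => vm,
        Err(e) => {
            release_ip(&ip); // never leak the IP if VM creation fails
            return Err(e);
        }
    };
    let result = wait_for_agent(vm).and_then(|()| execute_job(vm, code));
    destroy_vm(vm); // teardown happens whether the job succeeded or not
    release_ip(&ip);
    result
}

fn main() {
    let r = run_job("python", "print(1)").expect("job failed");
    println!("exit_code={} stdout={:?}", r.exit_code, r.stdout);
}
```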
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (warning)
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/src/vm_lifecycle.rs (1)
341-369: ⚠️ Potential issue | 🟡 Minor — Timeout calculation may exceed stated 30-second limit.

The comment states "Try for up to 30 seconds," and with 300 iterations × 100ms sleep, that adds up to 30 seconds of sleep time. However, each `client.get()` has a 2-second timeout (line 337). If the agent is unreachable (connection hangs), each attempt could take up to 2 seconds + 100ms sleep, potentially extending the total wait time well beyond 30 seconds. Consider either:
- Reducing the per-request timeout to a shorter value (e.g., 500ms).
- Tracking elapsed wall-clock time instead of iteration count.
- Updating the comment to reflect actual worst-case behavior.
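The wall-clock option above can be sketched as follows; this is a minimal illustration where `agent_is_ready` is a hypothetical stand-in for the real HTTP health probe:

```rust
use std::time::{Duration, Instant};

// Hypothetical stand-in for the real HTTP health probe.
fn agent_is_ready(attempt: u32) -> bool {
    attempt >= 3
}

/// Poll until ready, bounded by elapsed wall-clock time rather than a
/// fixed iteration count, so slow probes cannot extend the total wait.
fn wait_for_agent(max_wait: Duration) -> bool {
    let start = Instant::now();
    let mut attempt = 0;
    while start.elapsed() < max_wait {
        attempt += 1;
        if agent_is_ready(attempt) {
            return true;
        }
        std::thread::sleep(Duration::from_millis(10));
    }
    false
}

fn main() {
    println!("ready: {}", wait_for_agent(Duration::from_secs(1)));
}
```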
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/src/vm_lifecycle.rs` around lines 341 - 369, The loop that checks agent readiness (the for attempt in 1..=300 loop using client.get(&agent_health_url)) can exceed the documented "30 seconds" because each HTTP request uses a 2s timeout and you also sleep 100ms per iteration; change the loop to use elapsed wall-clock time (e.g., let start = Instant::now(); while start.elapsed() < Duration::from_secs(30) { ... }) instead of a fixed 300-iteration count, and either reduce the per-request timeout (configured where the reqwest Client is built) to a shorter value (e.g., 500ms) or explicitly document the actual worst-case timeout in the comment; ensure vm_running checks and the same response handling (Ok(resp) branches) remain unchanged.
🧹 Nitpick comments (2)
backend/src/vm_lifecycle.rs (2)
284-311: Blocking I/O in async function.
`std::fs::read_dir` and `std::fs::metadata` are synchronous blocking calls used inside an `async fn`. While this is unlikely to cause issues for a small directory scan, it blocks the Tokio runtime thread. Consider using `tokio::fs::read_dir` and `tokio::fs::metadata` for consistency with other async filesystem operations in this file (e.g., line 229).

♻️ Suggested async approach
```diff
-    let entries = std::fs::read_dir(&config.initramfs_dir).map_err(|e| {
+    let mut entries = tokio::fs::read_dir(&config.initramfs_dir).await.map_err(|e| {
         VmError::InitramfsBuild(format!(
             "Failed to read initramfs dir '{}': {}",
             config.initramfs_dir.display(),
             e
         ))
     })?;
-    for entry in entries.flatten() {
-        let path = entry.path();
-        if !path.is_file() {
+    while let Ok(Some(entry)) = entries.next_entry().await {
+        let path = entry.path();
+        let is_file = tokio::fs::metadata(&path)
+            .await
+            .map(|m| m.is_file())
+            .unwrap_or(false);
+        if !is_file {
             continue;
         }
         // ... rest of the logic with tokio::fs::metadata for size check
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/src/vm_lifecycle.rs` around lines 284 - 311, The loop currently uses blocking std::fs calls inside an async fn; replace std::fs::read_dir and std::fs::metadata with tokio::fs::read_dir and tokio::fs::metadata (awaiting where necessary) in the block that scans config.initramfs_dir, and adapt handling of the async ReadDir/DirEntry stream (e.g., while let Some(entry) = read_dir.next_entry().await?) to preserve the existing logic: skip non-files, convert entry.path().file_name().and_then(|s| s.to_str()), check name.starts_with(&prefix) && name.ends_with(".cpio.gz"), use tokio::fs::metadata(&path).await.map(|m| m.len() > 0).unwrap_or(false) to detect non-empty files, then set candidate = Some(path) and break; keep the VmError mapping for the initial read_dir failure consistent with the current error variant.
257-265: Fallback `vmm_stop` handle won't control the actual VMM.

When `vmm_stop_handle` is `None` (e.g., if VMM creation failed before sending the handle), the fallback `Arc::new(AtomicBool::new(false))` creates a disconnected flag that has no effect on the VMM. While this may be acceptable since the VMM likely isn't running in that scenario, consider logging a warning or explicitly handling this case to avoid silent failures during cleanup.

♻️ Suggested improvement
```diff
-    vmm_stop: vmm_stop_handle.unwrap_or_else(|| Arc::new(std::sync::atomic::AtomicBool::new(false))),
+    vmm_stop: vmm_stop_handle.unwrap_or_else(|| {
+        warn!(vm_id = %vm_id, "VMM stop handle not received; shutdown signaling may be ineffective");
+        Arc::new(std::sync::atomic::AtomicBool::new(false))
+    }),
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/src/vm_lifecycle.rs` around lines 257 - 265, The fallback for vmm_stop in the VmHandle construction silently creates a disconnected AtomicBool when vmm_stop_handle is None; update the code where VmHandle is constructed so that if vmm_stop_handle.is_none() you emit a warning (e.g., via logger.warn/error) about the missing VMM stop handle and then explicitly create the fallback Arc<AtomicBool> (vmm_stop) while documenting it as a disconnected sentinel to aid cleanup; reference the VmHandle struct and the vmm_stop and vmm_stop_handle symbols so reviewers can find and verify the added log and the explicit handling.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@agent/src/main.rs`:
- Around line 294-302: The rx.recv() branch stays enabled after it yields None,
causing a busy-spin; add a mutable flag (e.g., recv_closed) before the
tokio::select! loop and set it true when you observe None from rx.recv() (the
reader EOF case), then guard the select arm so it only polls rx.recv() when
recv_closed is false (e.g., use conditional pattern: if !recv_closed =>
Some(stream_result) = rx.recv() => { ... } ); keep handling
StreamResult::Exceeded via StreamResult::Exceeded and call child.kill().await as
before, but stop polling rx after you mark recv_closed to avoid the busy loop.
In `@backend/src/initramfs_manager.rs`:
- Around line 154-167: The should_rebuild function currently only compares
out_mtime against agent_binary and init_script; update it to also consider the
effective base image (and/or the languages.json entry that determines
base_image) as a build input so changes there force a rebuild: locate
should_rebuild in initramfs_manager.rs and add logic to resolve the effective
base_image (or read the relevant languages.json entry), stat that file or derive
a stable fingerprint (e.g., mtime or hash of its content or the resolved image
tag) and include it in the final Ok(...) condition so the function returns true
when the base_image/languages.json input is newer or different than out_path.
Ensure errors from stat/reading are mapped similarly to existing error handling.
In `@backend/src/main.rs`:
- Line 296: The current line normalizes user source by calling
payload.code.trim().to_string() which alters program text and can shift
diagnostics; instead, keep the original source bytes for execution by assigning
the raw string (e.g., let code = payload.code.clone() or
payload.code.to_string()) and only use payload.code.trim().is_empty() to reject
empty submissions before proceeding; update the check where you currently create
code (replace the trim call with a raw copy) and add an explicit emptiness check
using payload.code.trim().is_empty() to reject empty input.
- Around line 194-203: The VM initramfs dir from env (vm_initramfs_dir) is set
and passed into VmConfig but InitramfsLanguage::prepare_paths() still writes to
hardcoded "tmp/"; update prepare_paths to accept the configured initramfs
directory (or read it from VmConfig) and use that path when creating/writing
archives, then update all callers of InitramfsLanguage::prepare_paths() to pass
vm_initramfs_dir (or VmConfig) so generated initramfs files land in the same
directory created by tokio::fs::create_dir_all; ensure the signature change and
call sites use the unique symbols InitramfsLanguage::prepare_paths and
vm_initramfs_dir (or VmConfig.initramfs_dir).
- Around line 153-161: The current check only rejects ip_mask > 30 but allows 0
which leads to host_bits = 32 and an invalid left shift (1_u32 << host_bits);
update the validation that uses IP_MASK (the ip_mask variable and the broadcast
offset/host_bits calculation) to require 1..=30 and return an
std::io::Error::new(std::io::ErrorKind::InvalidInput, ...) when out of range;
specifically ensure the guard before computing host_bits checks ip_mask >= 1 &&
ip_mask <= 30 and include the offending ip_mask value in the error message to
make the failure explicit.
In `@QUICKSTART.md`:
- Around line 19-21: The docs currently cross-compile backend/cli/agent to
x86_64-unknown-linux-musl (via "cargo build -p backend -p cli -p agent --target
x86_64-unknown-linux-musl"), but subsequent steps expect host binaries under
./target/debug (e.g., launching ../target/debug/backend and using cargo run -p
cli); update the instructions so backend and cli are built for the host (remove
the --target for those crates or run a separate host build like "cargo build -p
backend -p cli") while still cross-compiling agent if needed, or alternatively
change later steps to reference the musl-built paths
(./target/x86_64-unknown-linux-musl/debug/backend and
./target/x86_64-unknown-linux-musl/debug/cli) so the commands and built artifact
locations (agent, backend, cli, target triple) are consistent.
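One of the inline items above concerns the `ip_mask` guard: allowing a mask of 0 makes `host_bits = 32`, and `1_u32 << 32` is an invalid shift. A checked version can be sketched as follows (the `host_bits` helper and its signature are hypothetical, not the PR's code):

```rust
use std::io::{Error, ErrorKind};

/// Validate an IPv4 prefix length for a usable host pool. Masks outside
/// 1..=30 are rejected: a mask of 0 would make `1u32 << host_bits` shift
/// by 32 (invalid for u32), and masks above 30 leave no usable hosts.
fn host_bits(ip_mask: u8) -> Result<u32, Error> {
    if !(1..=30).contains(&ip_mask) {
        return Err(Error::new(
            ErrorKind::InvalidInput,
            format!("ip_mask must be in 1..=30, got {}", ip_mask),
        ));
    }
    Ok(32 - u32::from(ip_mask))
}

fn main() {
    println!("{:?}", host_bits(24)); // a /24 network has 8 host bits
    println!("mask 0 rejected: {}", host_bits(0).is_err());
}
```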
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: a6c37b16-8b8e-46af-a14c-efb677ac44fa
📒 Files selected for processing (9)
- QUICKSTART.md
- agent/src/main.rs
- backend/README.md
- backend/src/initramfs_manager.rs
- backend/src/main.rs
- backend/src/vm_lifecycle.rs
- vmm/README.md
- vmm/src/devices/stdin.rs
- vmm/src/lib.rs
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/README.md`:
- Around line 72-73: The README lists VM_INITRAMFS_DIR as configurable but the
initramfs caching code in initramfs_manager.rs uses a hardcoded "tmp" path;
update the code that writes cached initramfs files (e.g., the function/method
that builds/writes the initramfs cache in initramfs_manager::*) to read the
VM_INITRAMFS_DIR configuration (env var or config struct) instead of "tmp",
construct and create the directory if missing, and use that path for both
initramfs files and any related IP_ALLOCATIONS_PATH logic so the runtime honors
the documented setting end-to-end; alternatively, if changing code is
undesirable, update README to explicitly state the current hardcoded limitation.
- Around line 24-26: The README shows inconsistent build vs run targets: it adds
the musl target and builds with --target x86_64-unknown-linux-musl (commands
rustup target add x86_64-unknown-linux-musl and cargo build -p backend -p agent
--target x86_64-unknown-linux-musl) but later runs ../target/debug/backend which
is the host debug path and may not exist; update the README so the run commands
point to the same target as the build (e.g.,
../target/x86_64-unknown-linux-musl/debug/backend and the copied agent path
./target/x86_64-unknown-linux-musl/debug/agent -> ./backend/cloude-agentd), or
change the build to omit --target so the existing run paths remain valid—ensure
the paths and the build flags for backend and agent match consistently.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/src/vm_lifecycle.rs (1)
207-232: ⚠️ Potential issue | 🔴 Critical — Require a real VMM stop handle before proceeding.

There is a race where tap creation can be detected after `try_recv()` but before the next iteration, leading to `vmm_stop_handle` staying `None`. The fallback sentinel is not connected to VMM state, so teardown may fail to stop the VM thread and block on join.

💡 Proposed fix
```diff
-    if tokio::fs::metadata(&tap_path).await.is_ok() {
-        info!(vm_id = %vm_id, tap = %tap_device, "Tap device created");
-        break;
-    }
+    if tokio::fs::metadata(&tap_path).await.is_ok() {
+        if vmm_stop_handle.is_none() {
+            if let Ok(handle) = stop_handle_rx.try_recv() {
+                vmm_stop_handle = Some(handle);
+            }
+        }
+        if vmm_stop_handle.is_some() {
+            info!(vm_id = %vm_id, tap = %tap_device, "Tap device created");
+            break;
+        }
+    }

-    let vmm_stop = if let Some(handle) = vmm_stop_handle {
-        handle
-    } else {
-        warn!(vm_id = %vm_id, "Missing VMM stop handle; using disconnected fallback sentinel");
-        Arc::new(std::sync::atomic::AtomicBool::new(false))
-    };
+    let vmm_stop = match vmm_stop_handle {
+        Some(handle) => handle,
+        None => {
+            let _ = Self::release_ip_internal(&vm_id, &ip_manager);
+            return Err(VmError::VmmCreation(
+                "VMM stop handle not received during startup".to_string(),
+            ));
+        }
+    };
```

Also applies to: 257-262
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/src/vm_lifecycle.rs` around lines 207 - 232, The code can break out once the tap appears while vmm_stop_handle is still None, leaving only the sentinel and preventing proper teardown; update the loop around the startup/tap check to require a real VMM stop handle before proceeding: after checking startup_rx and before breaking when tokio::fs::metadata(&tap_path).await.is_ok(), attempt to obtain a real handle from stop_handle_rx (e.g., loop on stop_handle_rx.try_recv until you get Ok(handle) or Treat Disconnected as an error), set vmm_stop_handle = Some(handle) and only then break; apply the same change to the analogous block referenced at the later location (lines ~257-262) so teardown always has a concrete stop handle and not the sentinel.
🧹 Nitpick comments (1)
backend/src/vm_lifecycle.rs (1)
288-327: Make initramfs selection deterministic when multiple candidates exist.

The first-match `read_dir` scan is filesystem-order dependent. If multiple `language-*.cpio.gz` files exist, the selected image becomes nondeterministic.

💡 Proposed refactor
```diff
-    let mut candidate: Option<PathBuf> = None;
+    let mut candidate: Option<(std::time::SystemTime, PathBuf)> = None;
 ...
-    if name.starts_with(&prefix) && name.ends_with(".cpio.gz") {
-        let is_non_empty = tokio::fs::metadata(&path)
+    if name.starts_with(&prefix) && name.ends_with(".cpio.gz") {
+        let meta = tokio::fs::metadata(&path)
             .await
-            .map(|m| m.len() > 0)
-            .unwrap_or(false);
-        if is_non_empty {
-            candidate = Some(path);
-            break;
+            .ok();
+        if let Some(m) = meta {
+            if m.len() == 0 {
+                continue;
+            }
+            let modified = m.modified().unwrap_or(std::time::SystemTime::UNIX_EPOCH);
+            match &candidate {
+                Some((best_ts, _)) if &modified <= best_ts => {}
+                _ => candidate = Some((modified, path)),
+            }
         }
     }
 }
-match candidate {
-    Some(path) => Ok(path),
+match candidate {
+    Some((_, path)) => Ok(path),
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/src/vm_lifecycle.rs` around lines 288 - 327, The current initramfs selection stops at the first filesystem-order match (using prefix and entries from config.initramfs_dir), which is nondeterministic; modify the logic in vm_lifecycle.rs around the prefix/candidate handling to collect all files matching name.starts_with(&prefix) && name.ends_with(".cpio.gz") and with non-zero size into a vector, then choose one deterministically (e.g., sort by file name or by a well-defined attribute like modified time and pick the first/last). Replace the early break that sets candidate with this deterministic selection and return that chosen PathBuf.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/src/initramfs_manager.rs`:
- Around line 80-82: The cleanup step removes same-prefix files including the
freshly written .cpio.gz.meta, causing read_build_metadata to miss on restart;
update cleanup_old_versions so it preserves metadata for the current artifact
(skip deleting files that end with ".meta" or specifically skip current_filename
+ ".meta") or alternatively move the call to write_build_metadata to after
cleanup; change the code around write_build_metadata and cleanup_old_versions so
the current metadata is not deleted (refer to functions write_build_metadata,
read_build_metadata, cleanup_old_versions and the variables
current_prefix/current_filename).
In `@backend/src/main.rs`:
- Around line 210-213: The arithmetic computing host_bits, broadcast_offset,
pool_start and pool_end can overflow and should use checked operations and
explicit error returns; replace the unchecked expressions that derive host_bits
from ip_mask and compute broadcast_offset, pool_start and pool_end from ip_range
with checked shifts/adds/subs (e.g. checked_sub, checked_shl/checked_shr,
checked_add, checked_sub) and return an appropriate
std::io::ErrorKind::InvalidInput (or convert to your function's error type) with
a clear message when any checked operation returns None; update the code paths
that use host_bits, broadcast_offset, pool_start and pool_end to rely on the
validated values.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 3df1601c-31fe-4673-86dc-1faaa62b70e6
📒 Files selected for processing (5)
- QUICKSTART.md
- agent/src/main.rs
- backend/src/initramfs_manager.rs
- backend/src/main.rs
- backend/src/vm_lifecycle.rs
🚧 Files skipped from review as they are similar to previous changes (1)
- agent/src/main.rs
🧹 Nitpick comments (2)
backend/src/main.rs (1)
421-459: Consider the interaction between client timeout and retry logic.

The HTTP client has a 300-second timeout (line 190). If the agent hangs without responding, each attempt could wait up to 5 minutes before timing out, making the total worst-case wait ~25 minutes across 5 retries.
Since VMs are ephemeral per-job, a shorter timeout (e.g., 60s) with the existing retry logic may provide better failure detection while still accommodating legitimate long-running code execution.
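The worst-case figure above is just the per-attempt budget multiplied by the retry count, which a tiny sketch makes concrete (`worst_case_wait` is an illustrative helper, not project code):

```rust
use std::time::Duration;

/// Worst-case wall time for a retry loop: each of `attempts` tries may
/// consume up to `per_attempt`, so the bound is their product.
fn worst_case_wait(attempts: u32, per_attempt: Duration) -> Duration {
    per_attempt * attempts
}

fn main() {
    // Current 300s client timeout across 5 retries: 1500s (~25 min).
    println!("{:?}", worst_case_wait(5, Duration::from_secs(300)));
    // Suggested 60s per-request timeout: 300s (5 min) worst case.
    println!("{:?}", worst_case_wait(5, Duration::from_secs(60)));
}
```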
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/src/main.rs` around lines 421 - 459, The current retry loop around state.client.post(&execute_url).json(&request_payload).send() can wait the client's global 300s timeout per attempt, causing up to ~25 minutes of blocking; fix by enforcing a much shorter per-request timeout (e.g., 60s) or reducing the client's default timeout so each attempt fails fast. Concretely, add a per-request timeout to the request builder (e.g., .timeout(std::time::Duration::from_secs(60)) on the post(...) chain before .send()), or adjust the client creation to use a 60s timeout, keeping the existing for attempt 1..=5 retry logic and preserving handling of execution_result and error logging.

backend/src/vm_lifecycle.rs (1)
347-368: Lexicographic sort may select the wrong version for multi-digit components.

The current approach sorts filenames lexicographically and picks the last entry. This works for single-digit versions but fails for semver: `python-1.10.cpio.gz` sorts before `python-1.9.cpio.gz`. If only one version per language is expected at runtime (after cleanup removes old versions), this is fine. Otherwise, consider parsing version strings properly.
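A minimal numeric comparison of dotted version components could look like the sketch below (the `semver` crate is the more robust option; `version_key` is a hypothetical helper, not the PR's code):

```rust
/// Parse the dotted version component of a filename like "python-1.10.cpio.gz"
/// into numeric parts, so "1.10" compares greater than "1.9".
fn version_key(name: &str, prefix: &str, suffix: &str) -> Option<Vec<u64>> {
    let ver = name.strip_prefix(prefix)?.strip_suffix(suffix)?;
    ver.split('.').map(|p| p.parse().ok()).collect()
}

fn main() {
    let files = ["python-1.9.cpio.gz", "python-1.10.cpio.gz"];
    // Pick the highest version numerically, not lexicographically.
    let best = files
        .iter()
        .max_by_key(|f| version_key(f, "python-", ".cpio.gz"))
        .unwrap();
    println!("{}", best); // python-1.10.cpio.gz
}
```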
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/src/vm_lifecycle.rs` around lines 347 - 368, The code currently sorts candidate paths lexicographically (candidates.sort_by(...)) and picks the last via candidates.into_iter().next_back(), which misorders multi-digit semantic versions (e.g., 1.10 < 1.9 lexicographically); change the selection to parse the version component from file_name() (matching prefix + "<version>.cpio.gz") and compare versions numerically (e.g., split by dots and compare each component as integers or use the semver crate) in the sort_by closure (or use max_by with the parsed-version key) so the highest semantic version is chosen; ensure errors still return the same VmError::InitramfsBuild message when no candidates remain.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 0389cb31-3bab-47d7-991e-7f57da38c197
📒 Files selected for processing (3)
- backend/src/initramfs_manager.rs
- backend/src/main.rs
- backend/src/vm_lifecycle.rs
vmm/README.md (outdated)

    - vCPU threads run while internal `running == true`.
    - `run()` executes the event loop and joins vCPU threads at shutdown.
    - SIGUSR1 is used internally to interrupt threads blocked in `KVM_RUN` during shutdown.
It's only a bunch of (seemingly random?) implementation details. Plz remove
vmm/README.md (outdated)

    ## Debugging notes

    - Guest serial output destination is chosen by the caller.
      - Backend can forward to host stdout (verbose)
      - or to sink (quiet mode)
Unnecessary details?
backend/src/vm_lifecycle.rs (outdated)

            return Err(vm_err);
        }
        Err(std::sync::mpsc::TryRecvError::Disconnected) => {
            if !vm_running.load(std::sync::atomic::Ordering::SeqCst) {
Are you sure vm_running is needed? Can't you use stop_handle?
backend/src/vm_lifecycle.rs (outdated)

    @@ -14,6 +15,7 @@ pub struct VmHandle {
        pub tap_device: String,
        vm_thread: Option<thread::JoinHandle<()>>,
        vm_running: Arc<std::sync::atomic::AtomicBool>,
Plz document vm_running if you want to keep it
@Proxyfil there are conflicts

562c671 to 7f1b4d4
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
backend/src/vm_lifecycle.rs (2)
250-269: ⚠️ Potential issue | 🔴 Critical — Startup failure paths can orphan a running VM thread.

On timeout/bridge-attach failure, the function returns after flipping `vm_running`, but may leave the VMM running because the internal stop flag is not signaled on all paths before returning.

Proposed fix
```diff
 if start.elapsed() >= max_wait {
     error!(vm_id = %vm_id, tap = %tap_device, "Tap device not created within timeout");
+    if let Some(stop) = &vmm_stop_handle {
+        stop.store(false, std::sync::atomic::Ordering::SeqCst);
+    }
     vm_running.store(false, std::sync::atomic::Ordering::SeqCst);
     let _ = Self::release_ip_internal(&vm_id, &ip_manager);
     return Err(VmError::NetworkSetup(format!(
         "Tap device {} not created within {:?}",
         tap_device, max_wait
     )));
 }
@@
 if let Err(e) = virt::network::setup_guest_iface(&tap_device, &config.bridge_name).await {
     error!(vm_id = %vm_id, "Failed to attach tap to bridge: {}", e);
+    if let Some(stop) = &vmm_stop_handle {
+        stop.store(false, std::sync::atomic::Ordering::SeqCst);
+    }
     vm_running.store(false, std::sync::atomic::Ordering::SeqCst);
     let _ = Self::release_ip_internal(&vm_id, &ip_manager);
     return Err(VmError::NetworkSetup(e.to_string()));
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/src/vm_lifecycle.rs` around lines 250 - 269, On both failure paths (the tap-creation timeout branch and the bridge attach error after virt::network::setup_guest_iface) ensure you signal the VM's internal stop flag/mechanism used by the running VMM thread in addition to flipping vm_running and calling Self::release_ip_internal; update the timeout branch and the Err(e) branch to invoke the same stop-signal (channel send or atomic/store used by the VM thread) before returning Err(VmError::NetworkSetup(...)) so the VMM thread is reliably asked to exit and cannot be orphaned.
434-437: ⚠️ Potential issue | 🟠 Major — Move `thread.join()` to `spawn_blocking` to prevent blocking the async executor.

`destroy()` is an async function that calls `thread.join()` without a timeout at line 436. Although the comment says "Don't wait forever", `JoinHandle::join()` actually blocks indefinitely until the thread completes. In a tokio runtime, this blocks a worker thread and reduces available capacity for other tasks. Use `tokio::task::spawn_blocking()` to move the join to a dedicated thread pool.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/src/vm_lifecycle.rs` around lines 434 - 437, In destroy(), avoid blocking the async executor by moving the blocking JoinHandle::join() call onto the blocking pool: when self.vm_thread.take() yields Some(thread), call tokio::task::spawn_blocking(...) to run thread.join() there (await the spawn_blocking handle), rather than calling thread.join() directly; update references around destroy(), vm_thread and the JoinHandle::join() invocation to use spawn_blocking so the async runtime isn’t stalled.
♻️ Duplicate comments (2)
backend/README.md (1)
41-43: ⚠️ Potential issue | 🟠 Major

Build/run target mismatch is reintroduced in setup instructions.

The docs build the backend for `x86_64-unknown-linux-musl` but later run `../target/debug/backend` (host path), so the documented run binary may not exist.

Proposed fix

 rustup target add x86_64-unknown-linux-musl
-cargo build -p backend -p agent --target x86_64-unknown-linux-musl
+cargo build -p backend
+cargo build -p agent --target x86_64-unknown-linux-musl
 cp ./target/x86_64-unknown-linux-musl/debug/agent ./backend/cloude-agentd
 chmod +x ./backend/cloude-agentd

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/README.md` around lines 41 - 43, The README shows building backend and agent for the musl target (commands using x86_64-unknown-linux-musl and producing ./target/x86_64-unknown-linux-musl/debug/agent) but later refers to running ../target/debug/backend which won’t exist for the musl build; update the instructions so the run step points to the same built binary path (e.g., ./target/x86_64-unknown-linux-musl/debug/backend) or change the build step to produce the host debug binary (remove the --target flag) so ../target/debug/backend matches; adjust references to agent/cloude-agentd consistently (agent vs cloude-agentd) to avoid path/name mismatch.

backend/src/initramfs_manager.rs (1)
149-152: ⚠️ Potential issue | 🟠 Major

Cleanup predicate is over-broad and may delete non-initramfs files.

Current logic removes any same-prefix file, not only initramfs artifacts. In a shared/custom `initramfs_dir`, this can delete unrelated files.

Proposed fix

 if let Some(fname) = path.file_name().and_then(|s| s.to_str()) {
+    let is_initramfs_artifact =
+        fname.ends_with(".cpio.gz") || fname.ends_with(".cpio.gz.meta");
     if fname.starts_with(current_prefix)
+        && is_initramfs_artifact
         && fname != current_filename
         && fname != current_metadata_filename
     {
         fs::remove_file(&path)
             .map_err(|e| Error::new(ErrorKind::Other, e.to_string()))?;
     }
 }

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/src/initramfs_manager.rs` around lines 149 - 152, The current cleanup predicate using fname.starts_with(current_prefix) && fname != current_filename && fname != current_metadata_filename is too broad and may remove unrelated files; change the predicate to only target initramfs artifacts by additionally checking that fname matches the repository's initramfs naming pattern (e.g., specific suffixes or exact constructed names) — use the existing symbols fname, current_prefix, current_filename, current_metadata_filename to implement a stricter test such as ensuring fname ends_with known initramfs suffix(es) or equals one of the canonical generated names derived from current_prefix before deleting; this ensures only real initramfs images/metadata are removed.
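The stricter predicate is easy to isolate as a pure function, which also makes the fix testable. This is a sketch using the names from the review comment; the suffix set (`.cpio.gz`, `.cpio.gz.meta`) is taken from the proposed fix above:

```rust
// A file is eligible for cleanup only if it carries a known initramfs
// artifact suffix *and* shares the prefix, and is not the current image
// or its metadata file. Plain same-prefix files are now protected.
fn is_stale_initramfs_artifact(
    fname: &str,
    current_prefix: &str,
    current_filename: &str,
    current_metadata_filename: &str,
) -> bool {
    let is_artifact = fname.ends_with(".cpio.gz") || fname.ends_with(".cpio.gz.meta");
    fname.starts_with(current_prefix)
        && is_artifact
        && fname != current_filename
        && fname != current_metadata_filename
}

fn main() {
    // Old image with the same prefix: removable.
    assert!(is_stale_initramfs_artifact(
        "python-abc.cpio.gz", "python-", "python-def.cpio.gz", "python-def.cpio.gz.meta"));
    // Unrelated same-prefix file: now protected.
    assert!(!is_stale_initramfs_artifact(
        "python-notes.txt", "python-", "python-def.cpio.gz", "python-def.cpio.gz.meta"));
    // The current image itself: never removed.
    assert!(!is_stale_initramfs_artifact(
        "python-def.cpio.gz", "python-", "python-def.cpio.gz", "python-def.cpio.gz.meta"));
}
```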
🧹 Nitpick comments (2)
backend/src/main.rs (2)
30-30: `IpManager` is double-synchronized; outer `Mutex` is redundant.

`IpManager` already serializes internally, so wrapping it in `Arc<Mutex<...>>` adds avoidable contention and lock layering. Prefer `Arc<IpManager>` and pass shared references directly.

Also applies to: 255-262
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/src/main.rs` at line 30, The ip_manager field is wrapped in Arc<Mutex<IpManager>> which double-synchronizes because IpManager already handles internal synchronization; change the type to Arc<IpManager> (replace uses of ip_manager.lock()/mutex guards with direct Arc-cloned references or direct method calls on IpManager) and remove the unnecessary Mutex wraps/imports; also update all other places that currently call ip_manager.lock() (e.g., the usages noted around the later block) to call IpManager's thread-safe methods or clone the Arc before invoking methods so no explicit locking remains.
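The recommended shape can be sketched with a minimal stand-in for `IpManager` (the real type's fields and methods are not shown in this review, so the `allocate`/`release` API below is assumed for illustration). The `Mutex` lives inside the type, so callers share a plain `Arc<IpManager>` with a single lock layer:

```rust
use std::collections::HashSet;
use std::net::Ipv4Addr;
use std::sync::{Arc, Mutex};
use std::thread;

// Internally synchronized manager: interior mutability via a private
// Mutex means callers never need Arc<Mutex<IpManager>>.
struct IpManager {
    allocated: Mutex<HashSet<Ipv4Addr>>,
}

impl IpManager {
    fn new() -> Self {
        Self { allocated: Mutex::new(HashSet::new()) }
    }

    // &self, not &mut self: the internal lock serializes access.
    fn allocate(&self, ip: Ipv4Addr) -> bool {
        self.allocated.lock().unwrap().insert(ip)
    }

    fn release(&self, ip: &Ipv4Addr) -> bool {
        self.allocated.lock().unwrap().remove(ip)
    }
}

fn main() {
    let mgr = Arc::new(IpManager::new());
    let ip = Ipv4Addr::new(172, 16, 0, 2); // example address, not from the PR

    // Shared across threads with no outer Mutex and no .lock() at call sites.
    let m2 = Arc::clone(&mgr);
    let h = thread::spawn(move || m2.allocate(ip));
    assert!(h.join().unwrap()); // first allocation succeeds
    assert!(!mgr.allocate(ip)); // duplicate allocation is rejected
    assert!(mgr.release(&ip));
}
```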
424-445: Retry loop stops on first non-2xx, including transient server failures.

For resilience, consider retrying transient statuses (e.g., 429/5xx) with backoff instead of breaking immediately on any non-success response.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/src/main.rs` around lines 424 - 445, The current retry loop breaks on any non-2xx response; update the loop around state.client.post(&execute_url).json(&request_payload).send().await so that for transient statuses (e.g., resp.status().as_u16() == 429 or resp.status().is_server_error()) you do not set execution_result and break immediately but instead perform an exponential backoff (use tokio::time::sleep with a duration derived from attempt, e.g., base_ms * 2_u64.pow(attempt-1)) and continue to the next iteration; only parse into AgentExecuteResponse and set execution_result on successful responses, and for non-retryable client errors set execution_result to an Err and break. Ensure you reference execute_url, state.client.post, AgentExecuteResponse, execution_result and the for attempt in 1..=5 loop when making the changes.
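The classification and delay logic the prompt describes can be factored into two small helpers, independent of the HTTP client. The 250 ms base delay is an assumption for illustration (the prompt leaves `base_ms` open):

```rust
use std::time::Duration;

// Retry 429 (rate limited) and 5xx (server error); other non-2xx
// statuses are treated as non-retryable client errors.
fn is_transient(status: u16) -> bool {
    status == 429 || (500..=599).contains(&status)
}

// Exponential backoff: base_ms * 2^(attempt - 1), matching the
// formula suggested in the review prompt.
fn backoff(attempt: u32) -> Duration {
    let base_ms: u64 = 250; // assumed base delay
    Duration::from_millis(base_ms * 2u64.pow(attempt.saturating_sub(1)))
}

fn main() {
    assert!(is_transient(429));
    assert!(is_transient(503));
    assert!(!is_transient(404)); // client error: fail fast

    // Delays double per attempt: 250 ms, 500 ms, 1 s, ...
    assert_eq!(backoff(1), Duration::from_millis(250));
    assert_eq!(backoff(2), Duration::from_millis(500));
    assert_eq!(backoff(3), Duration::from_millis(1000));
}
```

In the async loop, a transient status would then become `tokio::time::sleep(backoff(attempt)).await; continue;` while a non-retryable status sets the error result and breaks.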
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/src/vm_lifecycle.rs`:
- Around line 347-355: The current selection sorts candidates lexicographically
using candidates.sort_by with file_name().and_then(|s| s.to_str()).cmp(...) and
then picks next_back(), which misorders versions like "python-3.9" vs
"python-3.10"; change the comparison to parse a semantic/numeric version from
the filename (e.g., via a regex extracting major/minor/patch or use the semver
crate) and sort by that numeric tuple instead, falling back to the original
filename string comparison if parsing fails; update the candidates.sort_by call
to compare parsed (major, minor, patch) integers so match
candidates.into_iter().next_back() reliably returns the highest semantic
version.
---
Outside diff comments:
In `@backend/src/vm_lifecycle.rs`:
- Around line 250-269: On both failure paths (the tap-creation timeout branch
and the bridge attach error after virt::network::setup_guest_iface) ensure you
signal the VM's internal stop flag/mechanism used by the running VMM thread in
addition to flipping vm_running and calling Self::release_ip_internal; update
the timeout branch and the Err(e) branch to invoke the same stop-signal (channel
send or atomic/store used by the VM thread) before returning
Err(VmError::NetworkSetup(...)) so the VMM thread is reliably asked to exit and
cannot be orphaned.
- Around line 434-437: In destroy(), avoid blocking the async executor by moving
the blocking JoinHandle::join() call onto the blocking pool: when
self.vm_thread.take() yields Some(thread), call tokio::task::spawn_blocking(...)
to run thread.join() there (await the spawn_blocking handle), rather than
calling thread.join() directly; update references around destroy(), vm_thread
and the JoinHandle::join() invocation to use spawn_blocking so the async runtime
isn’t stalled.
---
Duplicate comments:
In `@backend/README.md`:
- Around line 41-43: The README shows building backend and agent for the musl
target (commands using x86_64-unknown-linux-musl and producing
./target/x86_64-unknown-linux-musl/debug/agent) but later refers to running
../target/debug/backend which won’t exist for the musl build; update the
instructions so the run step points to the same built binary path (e.g.,
./target/x86_64-unknown-linux-musl/debug/backend) or change the build step to
produce the host debug binary (remove the --target flag) so
../target/debug/backend matches; adjust references to agent/cloude-agentd
consistently (agent vs cloude-agentd) to avoid path/name mismatch.
In `@backend/src/initramfs_manager.rs`:
- Around line 149-152: The current cleanup predicate using
fname.starts_with(current_prefix) && fname != current_filename && fname !=
current_metadata_filename is too broad and may remove unrelated files; change
the predicate to only target initramfs artifacts by additionally checking that
fname matches the repository's initramfs naming pattern (e.g., specific suffixes
or exact constructed names) — use the existing symbols fname, current_prefix,
current_filename, current_metadata_filename to implement a stricter test such as
ensuring fname ends_with known initramfs suffix(es) or equals one of the
canonical generated names derived from current_prefix before deleting; this
ensures only real initramfs images/metadata are removed.
---
Nitpick comments:
In `@backend/src/main.rs`:
- Line 30: The ip_manager field is wrapped in Arc<Mutex<IpManager>> which
double-synchronizes because IpManager already handles internal synchronization;
change the type to Arc<IpManager> (replace uses of ip_manager.lock()/mutex
guards with direct Arc-cloned references or direct method calls on IpManager)
and remove the unnecessary Mutex wraps/imports; also update all other places
that currently call ip_manager.lock() (e.g., the usages noted around the later
block) to call IpManager's thread-safe methods or clone the Arc before invoking
methods so no explicit locking remains.
- Around line 424-445: The current retry loop breaks on any non-2xx response;
update the loop around
state.client.post(&execute_url).json(&request_payload).send().await so that for
transient statuses (e.g., resp.status().as_u16() == 429 or
resp.status().is_server_error()) you do not set execution_result and break
immediately but instead perform an exponential backoff (use tokio::time::sleep
with a duration derived from attempt, e.g., base_ms * 2_u64.pow(attempt-1)) and
continue to the next iteration; only parse into AgentExecuteResponse and set
execution_result on successful responses, and for non-retryable client errors
set execution_result to an Err and break. Ensure you reference execute_url,
state.client.post, AgentExecuteResponse, execution_result and the for attempt in
1..=5 loop when making the changes.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f8d1de6b-f151-4880-b636-3ea755e1c07a
📒 Files selected for processing (9)
- QUICKSTART.md
- agent/src/main.rs
- backend/README.md
- backend/src/initramfs_manager.rs
- backend/src/main.rs
- backend/src/vm_lifecycle.rs
- vmm/README.md
- vmm/src/devices/stdin.rs
- vmm/src/lib.rs
🚧 Files skipped from review as they are similar to previous changes (4)
- vmm/src/devices/stdin.rs
- agent/src/main.rs
- QUICKSTART.md
- vmm/src/lib.rs
 candidates.sort_by(|a, b| {
     a.file_name()
         .and_then(|s| s.to_str())
         .cmp(&b.file_name().and_then(|s| s.to_str()))
 });

-    Ok(initramfs_path)
+    match candidates.into_iter().next_back() {
+        Some(path) => Ok(path),
+        None => {
Initramfs selection is version-unsafe due to lexicographic sort.
Sorting by filename string can choose the wrong version (python-3.9 > python-3.10). This can boot an unexpected runtime.
Proposed fix
- candidates.sort_by(|a, b| {
- a.file_name()
- .and_then(|s| s.to_str())
- .cmp(&b.file_name().and_then(|s| s.to_str()))
- });
-
- match candidates.into_iter().next_back() {
+ let mut newest: Option<(PathBuf, std::time::SystemTime)> = None;
+ for path in candidates {
+ let modified = tokio::fs::metadata(&path)
+ .await
+ .ok()
+ .and_then(|m| m.modified().ok())
+ .unwrap_or(std::time::SystemTime::UNIX_EPOCH);
+ match &newest {
+ Some((_, t)) if &modified <= t => {}
+ _ => newest = Some((path, modified)),
+ }
+ }
+
+ match newest.map(|(p, _)| p) {
Some(path) => Ok(path),
None => {

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/src/vm_lifecycle.rs` around lines 347 - 355, The current selection
sorts candidates lexicographically using candidates.sort_by with
file_name().and_then(|s| s.to_str()).cmp(...) and then picks next_back(), which
misorders versions like "python-3.9" vs "python-3.10"; change the comparison to
parse a semantic/numeric version from the filename (e.g., via a regex extracting
major/minor/patch or use the semver crate) and sort by that numeric tuple
instead, falling back to the original filename string comparison if parsing
fails; update the candidates.sort_by call to compare parsed (major, minor,
patch) integers so match candidates.into_iter().next_back() reliably returns the
highest semantic version.
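The version-aware ordering suggested in the prompt can be sketched without the semver crate: extract a numeric (major, minor, patch) key from the filename and fall back to string order when parsing fails. The `runtime-X.Y.cpio.gz` filename shape is an assumption based on the examples in this review:

```rust
// Parse a version key from names like "python-3.10.cpio.gz".
fn version_key(fname: &str) -> Option<(u64, u64, u64)> {
    let ver = fname.rsplit('-').next()?.trim_end_matches(".cpio.gz");
    let mut parts = ver.split('.').map(|p| p.parse::<u64>());
    let major = parts.next()?.ok()?;
    let minor = parts.next().and_then(|p| p.ok()).unwrap_or(0);
    let patch = parts.next().and_then(|p| p.ok()).unwrap_or(0);
    Some((major, minor, patch))
}

// Sort by the numeric key when available, lexicographically otherwise,
// then take the last element as the newest candidate.
fn newest(mut candidates: Vec<&str>) -> Option<&str> {
    candidates.sort_by(|a, b| match (version_key(a), version_key(b)) {
        (Some(ka), Some(kb)) => ka.cmp(&kb),
        _ => a.cmp(b), // fallback: plain filename order
    });
    candidates.into_iter().next_back()
}

fn main() {
    let picked = newest(vec!["python-3.9.cpio.gz", "python-3.10.cpio.gz"]);
    // Lexicographic order would have picked 3.9; numeric order picks 3.10.
    assert_eq!(picked, Some("python-3.10.cpio.gz"));
}
```

Comparing `(3, 9, 0)` against `(3, 10, 0)` as integer tuples is what fixes the `python-3.9` vs `python-3.10` misordering the comment describes.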
PR Description
This PR stabilizes the full VM execution flow by improving VMM shutdown behavior, fixing agent output handling, and hardening backend VM/agent lifecycle steps (readiness checks, execute retries, initramfs refresh, and job finalization order).
These changes were made to prevent backend freezes, reduce transient VM-agent connectivity failures, and ensure jobs reliably resolve with usable output or clear errors.
How to test
Use the simplified full flow from QUICKSTART.md:
- nftables, docker, Rust musl target (prerequisites).
- backend/cloude-agentd (agent binary).
- VM_KERNEL_PATH, VM_INITRAMFS_DIR, VM_LOG_GUEST_CONSOLE=false, etc. (environment variables).
- `cargo run -p cli -- go --language python --file agent/examples/hello.py`
- `cargo run -p cli -- status <JOB_ID>`

Summary by CodeRabbit
New Features
Bug Fixes
Documentation