
feat: Implement full application flow #66

Merged
ThomasRubini merged 15 commits into main from feat/implement-full-flow on Mar 16, 2026

Conversation

Contributor

@Proxyfil Proxyfil commented Mar 15, 2026

PR Description

This PR stabilizes the full VM execution flow by improving VMM shutdown behavior, fixing agent output handling, and hardening backend VM/agent lifecycle steps (readiness checks, execute retries, initramfs refresh, and job finalization order).
These changes were made to prevent backend freezes, reduce transient VM-agent connectivity failures, and ensure jobs reliably resolve with usable output or clear errors.

How to test

Use the simplified full flow from QUICKSTART.md:

  1. Install one-time prerequisites (nftables, docker, Rust musl target).
  2. Build backend + CLI + musl agent, then copy the agent binary to backend/cloude-agentd.
  3. Start backend with the documented env vars (VM_KERNEL_PATH, VM_INITRAMFS_DIR, VM_LOG_GUEST_CONSOLE=false, etc.).
  4. Submit a job with CLI:
    • cargo run -p cli -- go --language python --file agent/examples/hello.py
  5. Poll status:
    • cargo run -p cli -- status <JOB_ID>
  6. Confirm the job completes and returns stdout/stderr, and backend remains responsive for repeated runs.

Summary by CodeRabbit

  • New Features

    • Quickstart guide for local development
    • VM-based job execution with per-job VM lifecycle, IP allocation, and optional guest console logging
    • Graceful VM stop signaling for cleaner shutdowns
  • Bug Fixes

    • Server now listens on all interfaces by default
    • Improved handling of command output EOFs and channel closures to avoid busy-waiting or panics
    • Retries and clearer failure states for job runs
  • Documentation

    • Expanded backend docs: env vars, musl build guidance, troubleshooting, and initramfs cache reset steps

@Proxyfil Proxyfil self-assigned this Mar 15, 2026

coderabbitai bot commented Mar 15, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Replaces agent-forwarding with per-job VM orchestration: backend now allocates IPs, builds/selects initramfs images, creates a VM per job, waits for the in-VM agent, executes job via HTTP, collects results, and tears down the VM; adds initramfs rebuild metadata, VMM stop signaling, safer stdin handling, and new docs (QUICKSTART, README updates).

Changes

  • Documentation (QUICKSTART.md, backend/README.md, vmm/README.md): Adds QUICKSTART; expands backend and vmm READMEs with musl build steps, new env vars (VM_KERNEL_PATH, VM_INITRAMFS_DIR, VM_LOG_GUEST_CONSOLE), troubleshooting, initramfs cache/reset guidance, and VMM API/behavior docs.
  • Agent (agent/src/main.rs): Default bind changed to 0.0.0.0:3001; the run_process output reader now tracks EOF with recv_closed, treats a closed channel as normal EOF, and handles StreamResult::Exceeded by killing the child process.
  • Backend, core / VM orchestration (backend/src/main.rs, backend/src/vm_lifecycle.rs, backend/src/ip_manager.rs*): Removes agent_url; AppState now holds vm_config and ip_manager; creates a VM per job, allocates/releases IPs, retries agent calls, captures outputs, ensures VM teardown; VmConfig/VmHandle APIs and signatures updated (language param, initramfs_dir, log_guest_console).
  • Backend, initramfs (backend/src/initramfs_manager.rs): Threads initramfs_dir through setup, changes the prepare_paths signature, adds should_rebuild, persists build metadata (read/write), excludes metadata from cleanup, and conditionally rebuilds the initramfs based on mtimes/base image.
  • VMM (vmm/src/lib.rs, vmm/src/devices/stdin.rs): Adds VMM::stop_handle() exposing the running flag; the stdin device now handles EOF and event add/remove errors gracefully (logs instead of panics).
  • Misc / examples (backend/README.md, QUICKSTART.md): Examples and env defaults updated to reference the musl-built agent, kernel/initramfs paths, initramfs caching/reset instructions, and new runtime env vars.
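The agent-side EOF handling noted above (treat a closed output channel as normal end-of-stream rather than erroring or spinning) boils down to this pattern. Below is a minimal synchronous sketch with a hypothetical StreamResult type; the real agent uses an async tokio channel, but the EOF logic is the same.

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical stand-in for the agent's stream messages.
enum StreamResult {
    Line(String),
    Exceeded, // output limit hit; the real agent kills the child process here
}

// Drain the reader channel until EOF. A closed channel (all senders dropped)
// is treated as normal end-of-stream, not an error, so the loop exits cleanly
// instead of spinning on a dead receiver.
fn drain(rx: mpsc::Receiver<StreamResult>) -> (Vec<String>, bool) {
    let mut lines = Vec::new();
    let mut exceeded = false;
    loop {
        match rx.recv() {
            Ok(StreamResult::Line(l)) => lines.push(l),
            Ok(StreamResult::Exceeded) => {
                exceeded = true;
                break;
            }
            Err(_) => break, // channel closed: normal EOF
        }
    }
    (lines, exceeded)
}

fn main() {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        tx.send(StreamResult::Line("hello".into())).unwrap();
        // tx is dropped here, closing the channel and signaling EOF.
    });
    let (lines, exceeded) = drain(rx);
    assert_eq!(lines, vec!["hello".to_string()]);
    assert!(!exceeded);
}
```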

Sequence Diagram(s)

sequenceDiagram
    participant Client as Client
    participant Backend as Backend
    participant IpMgr as IpManager
    participant VM as VmHandle
    participant VMM as VMM
    participant Agent as Agent

    Client->>Backend: Submit job (code, language)
    Backend->>IpMgr: Allocate IP
    IpMgr-->>Backend: IP assigned
    Backend->>VM: Create VM(language, VmConfig)
    VM->>VMM: Start (kernel + initramfs)
    VMM-->>VM: VMM running
    VM->>Agent: Wait for agent / health
    Agent-->>VM: Agent ready
    Backend->>Agent: HTTP execute job
    Agent-->>Backend: Job result (stdout, stderr, exit_code)
    Backend->>VM: Destroy VM (set stop_handle)
    VM->>VMM: Stop
    VMM-->>VM: Stopped
    VM->>IpMgr: Release IP
    Backend-->>Client: Job complete
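The graceful-stop step in the diagram ("Destroy VM (set stop_handle)" followed by the VMM stopping) relies on a shared atomic flag. A sketch of that pattern follows; apart from the stop_handle() name the PR introduces, the struct and loop here are illustrative, not the crate's actual API.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

// Illustrative VMM stand-in: the run loop polls a shared `running` flag,
// and stop_handle() clones the same Arc so another thread can request a stop.
struct Vmm {
    running: Arc<AtomicBool>,
}

impl Vmm {
    fn new() -> Self {
        Vmm { running: Arc::new(AtomicBool::new(true)) }
    }

    // Mirrors the PR's VMM::stop_handle(): exposes the running flag.
    fn stop_handle(&self) -> Arc<AtomicBool> {
        Arc::clone(&self.running)
    }

    // Returns how many loop iterations ran before the flag was cleared.
    fn run(&self) -> u32 {
        let mut iterations = 0;
        while self.running.load(Ordering::SeqCst) {
            iterations += 1;
            thread::sleep(Duration::from_millis(1));
        }
        iterations
    }
}

fn main() {
    let vmm = Vmm::new();
    let stop = vmm.stop_handle();
    let worker = thread::spawn(move || vmm.run());
    thread::sleep(Duration::from_millis(20));
    stop.store(false, Ordering::SeqCst); // graceful stop signal
    let iters = worker.join().unwrap();
    assert!(iters >= 1);
}
```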

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • ThomasRubini
  • jorisvilardell
  • axelfrache
  • surina-margarita

Poem

🐰 I hopped through bytes and fixed the flow,
Built tiny VMs where agents grow.
IPs assigned and initramfs snug,
Jobs run quick — then teardown’s a hug. ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 54.55%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Description Check: ✅ Passed. Check skipped because CodeRabbit’s high-level summary is enabled.
  • Title check: ✅ Passed. The title 'feat: Implement full application flow' accurately describes the main objective: stabilizing the full VM execution flow. This is clearly the primary change across all modified files, including VMM shutdown, agent output handling, backend VM/agent lifecycle, and job finalization.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/src/vm_lifecycle.rs (1)

341-369: ⚠️ Potential issue | 🟡 Minor

Timeout calculation may exceed stated 30-second limit.

The comment states "Try for up to 30 seconds," and with 300 iterations × 100ms sleep, that adds up to 30 seconds of sleep time. However, each client.get() has a 2-second timeout (line 337). If the agent is unreachable (connection hangs), each attempt could take up to 2 seconds + 100ms sleep, potentially extending total wait time well beyond 30 seconds.

Consider either:

  1. Reducing the per-request timeout to a shorter value (e.g., 500ms).
  2. Tracking elapsed wall-clock time instead of iteration count.
  3. Updating the comment to reflect actual worst-case behavior.
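The wall-clock approach in option 2 can be sketched synchronously as follows; wait_ready and its signature are illustrative, not the backend's actual helper. The key property is that the budget bounds total elapsed time, so a slow probe (e.g., a hanging HTTP connect) can overshoot by at most one probe's own duration.

```rust
use std::thread;
use std::time::{Duration, Instant};

// Wall-clock readiness wait: returns Ok once `probe` succeeds, Err after the
// elapsed-time budget is spent, regardless of how long each probe takes.
fn wait_ready(budget: Duration, poll: Duration, mut probe: impl FnMut() -> bool) -> Result<(), ()> {
    let start = Instant::now();
    while start.elapsed() < budget {
        if probe() {
            return Ok(());
        }
        thread::sleep(poll);
    }
    Err(())
}

fn main() {
    // Succeeds on the third probe, well inside the budget.
    let mut calls = 0;
    let ready = wait_ready(Duration::from_millis(500), Duration::from_millis(5), || {
        calls += 1;
        calls >= 3
    });
    assert!(ready.is_ok());

    // A probe that never succeeds fails once the budget is exhausted.
    assert!(wait_ready(Duration::from_millis(20), Duration::from_millis(5), || false).is_err());
}
```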
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/src/vm_lifecycle.rs` around lines 341 - 369, The loop that checks
agent readiness (the for attempt in 1..=300 loop using
client.get(&agent_health_url)) can exceed the documented "30 seconds" because
each HTTP request uses a 2s timeout and you also sleep 100ms per iteration;
change the loop to use elapsed wall-clock time (e.g., let start =
Instant::now(); while start.elapsed() < Duration::from_secs(30) { ... }) instead
of a fixed 300-iteration count, and either reduce the per-request timeout
(configured where the reqwest Client is built) to a shorter value (e.g., 500ms)
or explicitly document the actual worst-case timeout in the comment; ensure
vm_running checks and the same response handling (Ok(resp) branches) remain
unchanged.
🧹 Nitpick comments (2)
backend/src/vm_lifecycle.rs (2)

284-311: Blocking I/O in async function.

std::fs::read_dir and std::fs::metadata are synchronous blocking calls used inside an async fn. While this is unlikely to cause issues for a small directory scan, it blocks the Tokio runtime thread. Consider using tokio::fs::read_dir and tokio::fs::metadata for consistency with other async filesystem operations in this file (e.g., line 229).

♻️ Suggested async approach
-        let entries = std::fs::read_dir(&config.initramfs_dir).map_err(|e| {
+        let mut entries = tokio::fs::read_dir(&config.initramfs_dir).await.map_err(|e| {
             VmError::InitramfsBuild(format!(
                 "Failed to read initramfs dir '{}': {}",
                 config.initramfs_dir.display(),
                 e
             ))
         })?;

-        for entry in entries.flatten() {
-            let path = entry.path();
-            if !path.is_file() {
+        while let Ok(Some(entry)) = entries.next_entry().await {
+            let path = entry.path();
+            let is_file = tokio::fs::metadata(&path)
+                .await
+                .map(|m| m.is_file())
+                .unwrap_or(false);
+            if !is_file {
                 continue;
             }
             // ... rest of the logic with tokio::fs::metadata for size check
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/src/vm_lifecycle.rs` around lines 284 - 311, The loop currently uses
blocking std::fs calls inside an async fn; replace std::fs::read_dir and
std::fs::metadata with tokio::fs::read_dir and tokio::fs::metadata (awaiting
where necessary) in the block that scans config.initramfs_dir, and adapt
handling of the async ReadDir/DirEntry stream (e.g., while let Some(entry) =
read_dir.next_entry().await?) to preserve the existing logic: skip non-files,
convert entry.path().file_name().and_then(|s| s.to_str()), check
name.starts_with(&prefix) && name.ends_with(".cpio.gz"), use
tokio::fs::metadata(&path).await.map(|m| m.len() > 0).unwrap_or(false) to detect
non-empty files, then set candidate = Some(path) and break; keep the VmError
mapping for the initial read_dir failure consistent with the current error
variant.

257-265: Fallback vmm_stop handle won't control the actual VMM.

When vmm_stop_handle is None (e.g., if VMM creation failed before sending the handle), the fallback Arc::new(AtomicBool::new(false)) creates a disconnected flag that has no effect on the VMM. While this may be acceptable since the VMM likely isn't running in that scenario, consider logging a warning or explicitly handling this case to avoid silent failures during cleanup.

♻️ Suggested improvement
-            vmm_stop: vmm_stop_handle.unwrap_or_else(|| Arc::new(std::sync::atomic::AtomicBool::new(false))),
+            vmm_stop: vmm_stop_handle.unwrap_or_else(|| {
+                warn!(vm_id = %vm_id, "VMM stop handle not received; shutdown signaling may be ineffective");
+                Arc::new(std::sync::atomic::AtomicBool::new(false))
+            }),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/src/vm_lifecycle.rs` around lines 257 - 265, The fallback for
vmm_stop in the VmHandle construction silently creates a disconnected AtomicBool
when vmm_stop_handle is None; update the code where VmHandle is constructed so
that if vmm_stop_handle.is_none() you emit a warning (e.g., via
logger.warn/error) about the missing VMM stop handle and then explicitly create
the fallback Arc<AtomicBool> (vmm_stop) while documenting it as a disconnected
sentinel to aid cleanup; reference the VmHandle struct and the vmm_stop and
vmm_stop_handle symbols so reviewers can find and verify the added log and the
explicit handling.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@agent/src/main.rs`:
- Around line 294-302: The rx.recv() branch stays enabled after it yields None,
causing a busy-spin; add a mutable flag (e.g., recv_closed) before the
tokio::select! loop and set it true when you observe None from rx.recv() (the
reader EOF case), then guard the select arm so it only polls rx.recv() when
recv_closed is false (e.g., use conditional pattern: if !recv_closed =>
Some(stream_result) = rx.recv() => { ... } ); keep handling
StreamResult::Exceeded via StreamResult::Exceeded and call child.kill().await as
before, but stop polling rx after you mark recv_closed to avoid the busy loop.

In `@backend/src/initramfs_manager.rs`:
- Around line 154-167: The should_rebuild function currently only compares
out_mtime against agent_binary and init_script; update it to also consider the
effective base image (and/or the languages.json entry that determines
base_image) as a build input so changes there force a rebuild: locate
should_rebuild in initramfs_manager.rs and add logic to resolve the effective
base_image (or read the relevant languages.json entry), stat that file or derive
a stable fingerprint (e.g., mtime or hash of its content or the resolved image
tag) and include it in the final Ok(...) condition so the function returns true
when the base_image/languages.json input is newer or different than out_path.
Ensure errors from stat/reading are mapped similarly to existing error handling.

In `@backend/src/main.rs`:
- Line 296: The current line normalizes user source by calling
payload.code.trim().to_string() which alters program text and can shift
diagnostics; instead, keep the original source bytes for execution by assigning
the raw string (e.g., let code = payload.code.clone() or
payload.code.to_string()) and only use payload.code.trim().is_empty() to reject
empty submissions before proceeding; update the check where you currently create
code (replace the trim call with a raw copy) and add an explicit emptiness check
using payload.code.trim().is_empty() to reject empty input.
- Around line 194-203: The VM initramfs dir from env (vm_initramfs_dir) is set
and passed into VmConfig but InitramfsLanguage::prepare_paths() still writes to
hardcoded "tmp/"; update prepare_paths to accept the configured initramfs
directory (or read it from VmConfig) and use that path when creating/writing
archives, then update all callers of InitramfsLanguage::prepare_paths() to pass
vm_initramfs_dir (or VmConfig) so generated initramfs files land in the same
directory created by tokio::fs::create_dir_all; ensure the signature change and
call sites use the unique symbols InitramfsLanguage::prepare_paths and
vm_initramfs_dir (or VmConfig.initramfs_dir).
- Around line 153-161: The current check only rejects ip_mask > 30 but allows 0
which leads to host_bits = 32 and an invalid left shift (1_u32 << host_bits);
update the validation that uses IP_MASK (the ip_mask variable and the broadcast
offset/host_bits calculation) to require 1..=30 and return an
std::io::Error::new(std::io::ErrorKind::InvalidInput, ...) when out of range;
specifically ensure the guard before computing host_bits checks ip_mask >= 1 &&
ip_mask <= 30 and include the offending ip_mask value in the error message to
make the failure explicit.

In `@QUICKSTART.md`:
- Around line 19-21: The docs currently cross-compile backend/cli/agent to
x86_64-unknown-linux-musl (via "cargo build -p backend -p cli -p agent --target
x86_64-unknown-linux-musl"), but subsequent steps expect host binaries under
./target/debug (e.g., launching ../target/debug/backend and using cargo run -p
cli); update the instructions so backend and cli are built for the host (remove
the --target for those crates or run a separate host build like "cargo build -p
backend -p cli") while still cross-compiling agent if needed, or alternatively
change later steps to reference the musl-built paths
(./target/x86_64-unknown-linux-musl/debug/backend and
./target/x86_64-unknown-linux-musl/debug/cli) so the commands and built artifact
locations (agent, backend, cli, target triple) are consistent.
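The ip_mask validation requested for backend/src/main.rs above can be sketched like this; pool_size is a hypothetical helper, not the backend's actual function. The point is rejecting masks outside 1..=30 before deriving host bits, since mask 0 gives host_bits = 32 and an invalid `1u32 << 32`.

```rust
// Validate the mask, then use checked_shl so an out-of-range shift surfaces
// as an error instead of overflowing or panicking.
fn pool_size(ip_mask: u32) -> Result<u32, String> {
    if !(1..=30).contains(&ip_mask) {
        return Err(format!("ip_mask {} out of range 1..=30", ip_mask));
    }
    let host_bits = 32 - ip_mask;
    // checked_shl returns None instead of wrapping on an invalid shift amount.
    1u32.checked_shl(host_bits)
        .ok_or_else(|| format!("host_bits {} too large for u32 shift", host_bits))
}

fn main() {
    assert!(pool_size(0).is_err()); // would have been host_bits = 32
    assert_eq!(pool_size(24), Ok(256));
    assert_eq!(pool_size(30), Ok(4));
    assert!(pool_size(31).is_err());
}
```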


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a6c37b16-8b8e-46af-a14c-efb677ac44fa

📥 Commits

Reviewing files that changed from the base of the PR and between 35fc611 and bab1f8e.

📒 Files selected for processing (9)
  • QUICKSTART.md
  • agent/src/main.rs
  • backend/README.md
  • backend/src/initramfs_manager.rs
  • backend/src/main.rs
  • backend/src/vm_lifecycle.rs
  • vmm/README.md
  • vmm/src/devices/stdin.rs
  • vmm/src/lib.rs


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/README.md`:
- Around line 72-73: The README lists VM_INITRAMFS_DIR as configurable but the
initramfs caching code in initramfs_manager.rs uses a hardcoded "tmp" path;
update the code that writes cached initramfs files (e.g., the function/method
that builds/writes the initramfs cache in initramfs_manager::*) to read the
VM_INITRAMFS_DIR configuration (env var or config struct) instead of "tmp",
construct and create the directory if missing, and use that path for both
initramfs files and any related IP_ALLOCATIONS_PATH logic so the runtime honors
the documented setting end-to-end; alternatively, if changing code is
undesirable, update README to explicitly state the current hardcoded limitation.
- Around line 24-26: The README shows inconsistent build vs run targets: it adds
the musl target and builds with --target x86_64-unknown-linux-musl (commands
rustup target add x86_64-unknown-linux-musl and cargo build -p backend -p agent
--target x86_64-unknown-linux-musl) but later runs ../target/debug/backend which
is the host debug path and may not exist; update the README so the run commands
point to the same target as the build (e.g.,
../target/x86_64-unknown-linux-musl/debug/backend and the copied agent path
./target/x86_64-unknown-linux-musl/debug/agent -> ./backend/cloude-agentd), or
change the build to omit --target so the existing run paths remain valid—ensure
the paths and the build flags for backend and agent match consistently.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4e67b14c-f4f1-49e5-8a07-22724581eabe

📥 Commits

Reviewing files that changed from the base of the PR and between bab1f8e and 5edcb2b.

📒 Files selected for processing (1)
  • backend/README.md

Contributor

@jorisvilardell jorisvilardell left a comment


LGTM


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/src/vm_lifecycle.rs (1)

207-232: ⚠️ Potential issue | 🔴 Critical

Require a real VMM stop handle before proceeding.

There is a race where tap creation can be detected after try_recv() but before the next iteration, leading to vmm_stop_handle staying None. The fallback sentinel is not connected to VMM state, so teardown may fail to stop the VM thread and block on join.

💡 Proposed fix
-            if tokio::fs::metadata(&tap_path).await.is_ok() {
-                info!(vm_id = %vm_id, tap = %tap_device, "Tap device created");
-                break;
-            }
+            if tokio::fs::metadata(&tap_path).await.is_ok() {
+                if vmm_stop_handle.is_none() {
+                    if let Ok(handle) = stop_handle_rx.try_recv() {
+                        vmm_stop_handle = Some(handle);
+                    }
+                }
+                if vmm_stop_handle.is_some() {
+                    info!(vm_id = %vm_id, tap = %tap_device, "Tap device created");
+                    break;
+                }
+            }

-        let vmm_stop = if let Some(handle) = vmm_stop_handle {
-            handle
-        } else {
-            warn!(vm_id = %vm_id, "Missing VMM stop handle; using disconnected fallback sentinel");
-            Arc::new(std::sync::atomic::AtomicBool::new(false))
-        };
+        let vmm_stop = match vmm_stop_handle {
+            Some(handle) => handle,
+            None => {
+                let _ = Self::release_ip_internal(&vm_id, &ip_manager);
+                return Err(VmError::VmmCreation(
+                    "VMM stop handle not received during startup".to_string(),
+                ));
+            }
+        };

Also applies to: 257-262

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/src/vm_lifecycle.rs` around lines 207 - 232, The code can break out
once the tap appears while vmm_stop_handle is still None, leaving only the
sentinel and preventing proper teardown; update the loop around the startup/tap
check to require a real VMM stop handle before proceeding: after checking
startup_rx and before breaking when
tokio::fs::metadata(&tap_path).await.is_ok(), attempt to obtain a real handle
from stop_handle_rx (e.g., loop on stop_handle_rx.try_recv until you get
Ok(handle) or treat Disconnected as an error), set vmm_stop_handle =
Some(handle) and only then break; apply the same change to the analogous block
referenced at the later location (lines ~257-262) so teardown always has a
concrete stop handle and not the sentinel.
🧹 Nitpick comments (1)
backend/src/vm_lifecycle.rs (1)

288-327: Make initramfs selection deterministic when multiple candidates exist.

The first-match read_dir scan is filesystem-order dependent. If multiple language-*.cpio.gz files exist, selected image becomes nondeterministic.

💡 Proposed refactor
-        let mut candidate: Option<PathBuf> = None;
+        let mut candidate: Option<(std::time::SystemTime, PathBuf)> = None;
...
-            if name.starts_with(&prefix) && name.ends_with(".cpio.gz") {
-                let is_non_empty = tokio::fs::metadata(&path)
+            if name.starts_with(&prefix) && name.ends_with(".cpio.gz") {
+                let meta = tokio::fs::metadata(&path)
                     .await
-                    .map(|m| m.len() > 0)
-                    .unwrap_or(false);
-                if is_non_empty {
-                    candidate = Some(path);
-                    break;
+                    .ok();
+                if let Some(m) = meta {
+                    if m.len() == 0 {
+                        continue;
+                    }
+                    let modified = m.modified().unwrap_or(std::time::SystemTime::UNIX_EPOCH);
+                    match &candidate {
+                        Some((best_ts, _)) if &modified <= best_ts => {}
+                        _ => candidate = Some((modified, path)),
+                    }
                 }
             }
         }

-        match candidate {
-            Some(path) => Ok(path),
+        match candidate {
+            Some((_, path)) => Ok(path),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/src/vm_lifecycle.rs` around lines 288 - 327, The current initramfs
selection stops at the first filesystem-order match (using prefix and entries
from config.initramfs_dir), which is nondeterministic; modify the logic in
vm_lifecycle.rs around the prefix/candidate handling to collect all files
matching name.starts_with(&prefix) && name.ends_with(".cpio.gz") and with
non-zero size into a vector, then choose one deterministically (e.g., sort by
file name or by a well-defined attribute like modified time and pick the
first/last). Replace the early break that sets candidate with this deterministic
selection and return that chosen PathBuf.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/src/initramfs_manager.rs`:
- Around line 80-82: The cleanup step removes same-prefix files including the
freshly written .cpio.gz.meta, causing read_build_metadata to miss on restart;
update cleanup_old_versions so it preserves metadata for the current artifact
(skip deleting files that end with ".meta" or specifically skip current_filename
+ ".meta") or alternatively move the call to write_build_metadata to after
cleanup; change the code around write_build_metadata and cleanup_old_versions so
the current metadata is not deleted (refer to functions write_build_metadata,
read_build_metadata, cleanup_old_versions and the variables
current_prefix/current_filename).

In `@backend/src/main.rs`:
- Around line 210-213: The arithmetic computing host_bits, broadcast_offset,
pool_start and pool_end can overflow and should use checked operations and
explicit error returns; replace the unchecked expressions that derive host_bits
from ip_mask and compute broadcast_offset, pool_start and pool_end from ip_range
with checked shifts/adds/subs (e.g. checked_sub, checked_shl/checked_shr,
checked_add, checked_sub) and return an appropriate
std::io::ErrorKind::InvalidInput (or convert to your function's error type) with
a clear message when any checked operation returns None; update the code paths
that use host_bits, broadcast_offset, pool_start and pool_end to rely on the
validated values.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3df1601c-31fe-4673-86dc-1faaa62b70e6

📥 Commits

Reviewing files that changed from the base of the PR and between 5edcb2b and 6af74e1.

📒 Files selected for processing (5)
  • QUICKSTART.md
  • agent/src/main.rs
  • backend/src/initramfs_manager.rs
  • backend/src/main.rs
  • backend/src/vm_lifecycle.rs
🚧 Files skipped from review as they are similar to previous changes (1)
  • agent/src/main.rs


@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
backend/src/main.rs (1)

421-459: Consider the interaction between client timeout and retry logic.

The HTTP client has a 300-second timeout (line 190). If the agent hangs without responding, each attempt could wait up to 5 minutes before timing out, making the total worst-case wait ~25 minutes across 5 retries.

Since VMs are ephemeral per-job, a shorter timeout (e.g., 60s) with the existing retry logic may provide better failure detection while still accommodating legitimate long-running code execution.
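The ~25-minute figure follows from simple arithmetic over the retry loop; worst_case below is a hypothetical helper making the bound explicit, not code from the backend.

```rust
use std::time::Duration;

// Worst case for a retry loop: every attempt blocks for the full per-attempt
// timeout, plus any backoff between consecutive attempts.
fn worst_case(attempts: u32, per_attempt: Duration, backoff: Duration) -> Duration {
    per_attempt * attempts + backoff * attempts.saturating_sub(1)
}

fn main() {
    // 5 attempts against a 300 s client timeout: 25 minutes of blocking.
    assert_eq!(
        worst_case(5, Duration::from_secs(300), Duration::ZERO),
        Duration::from_secs(25 * 60)
    );
    // A 60 s per-request timeout bounds the same loop at 5 minutes.
    assert_eq!(
        worst_case(5, Duration::from_secs(60), Duration::ZERO),
        Duration::from_secs(5 * 60)
    );
}
```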

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/src/main.rs` around lines 421 - 459, The current retry loop around
state.client.post(&execute_url).json(&request_payload).send() can wait the
client's global 300s timeout per attempt, causing up to ~25 minutes of blocking;
fix by enforcing a much shorter per-request timeout (e.g., 60s) or reducing the
client's default timeout so each attempt fails fast. Concretely, add a
per-request timeout to the request builder (e.g.,
.timeout(std::time::Duration::from_secs(60)) on the post(...) chain before
.send()), or adjust the client creation to use a 60s timeout, keeping the
existing for attempt 1..=5 retry logic and preserving handling of
execution_result and error logging.
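The ~25-minute worst case quoted above is simple arithmetic: 5 attempts, each bounded only by the client-wide 300 s timeout. A minimal sketch of that bound (the helper name is hypothetical; in the real code the fix would be reqwest's per-request `RequestBuilder::timeout(...)` or a shorter client-wide timeout):

```rust
use std::time::Duration;

// Worst-case blocking time for a retry loop where every attempt
// runs into its timeout before failing.
fn worst_case_wait(attempts: u32, per_request_timeout: Duration) -> Duration {
    per_request_timeout * attempts
}

fn main() {
    // Current setup: 5 attempts against the 300 s client-wide timeout.
    assert_eq!(worst_case_wait(5, Duration::from_secs(300)), Duration::from_secs(1500)); // 25 min
    // Suggested: a 60 s per-request timeout caps the loop at 5 min.
    assert_eq!(worst_case_wait(5, Duration::from_secs(60)), Duration::from_secs(300));
}
```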
backend/src/vm_lifecycle.rs (1)

347-368: Lexicographic sort may select the wrong version for multi-digit components.

The current approach sorts filenames lexicographically and picks the last entry. This works for single-digit versions but fails for semver: python-1.10.cpio.gz sorts before python-1.9.cpio.gz.

If only one version per language is expected at runtime (after cleanup removes old versions), this is fine. Otherwise, consider parsing version strings properly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/src/vm_lifecycle.rs` around lines 347 - 368, The code currently sorts
candidate paths lexicographically (candidates.sort_by(...)) and picks the last
via candidates.into_iter().next_back(), which misorders multi-digit semantic
versions (e.g., 1.10 < 1.9 lexicographically); change the selection to parse the
version component from file_name() (matching prefix + "<version>.cpio.gz") and
compare versions numerically (e.g., split by dots and compare each component as
integers or use the semver crate) in the sort_by closure (or use max_by with the
parsed-version key) so the highest semantic version is chosen; ensure errors
still return the same VmError::InitramfsBuild message when no candidates remain.
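The numeric-tuple comparison suggested above can be sketched like this, assuming filenames of the form `<prefix><version>.cpio.gz`; `parse_version` is a hypothetical helper, not the actual backend code:

```rust
// Parse the dotted version component out of an initramfs filename into
// a Vec<u64>, whose derived ordering compares components numerically.
fn parse_version(name: &str, prefix: &str) -> Option<Vec<u64>> {
    let rest = name.strip_prefix(prefix)?.strip_suffix(".cpio.gz")?;
    rest.split('.').map(|c| c.parse::<u64>().ok()).collect()
}

fn main() {
    let a = parse_version("python-1.9.cpio.gz", "python-").unwrap();
    let b = parse_version("python-1.10.cpio.gz", "python-").unwrap();
    // Numeric comparison orders 1.10 after 1.9...
    assert!(b > a);
    // ...unlike the lexicographic trap on the raw filenames:
    assert!("python-1.10.cpio.gz" < "python-1.9.cpio.gz");
}
```

Sorting candidates with this parsed key (falling back to the filename comparison when parsing fails) keeps the selection deterministic while picking the highest version.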

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0389cb31-3bab-47d7-991e-7f57da38c197

📥 Commits

Reviewing files that changed from the base of the PR and between 6af74e1 and 562c671.

📒 Files selected for processing (3)
  • backend/src/initramfs_manager.rs
  • backend/src/main.rs
  • backend/src/vm_lifecycle.rs

vmm/README.md Outdated
Comment on lines +27 to +30
- vCPU threads run while internal `running == true`.
- `run()` executes the event loop and joins vCPU threads at shutdown.
- SIGUSR1 is used internally to interrupt threads blocked in `KVM_RUN` during
shutdown.
Contributor

@ThomasRubini ThomasRubini Mar 16, 2026


It's only a bunch of (seemingly random ?) implementation details. Plz remove

vmm/README.md Outdated
Comment on lines +32 to +36
## Debugging notes

- Guest serial output destination is chosen by the caller.
- Backend can forward to host stdout (verbose)
- or to sink (quiet mode)
Contributor

@ThomasRubini ThomasRubini Mar 16, 2026


Unnecessary details?

return Err(vm_err);
}
Err(std::sync::mpsc::TryRecvError::Disconnected) => {
if !vm_running.load(std::sync::atomic::Ordering::SeqCst) {
Contributor


Are you sure vm_running is needed? Can't you use stop_handle?

@@ -14,6 +15,7 @@ pub struct VmHandle {
pub tap_device: String,
vm_thread: Option<thread::JoinHandle<()>>,
vm_running: Arc<std::sync::atomic::AtomicBool>,
Contributor

@ThomasRubini ThomasRubini Mar 16, 2026


Plz document vm_running if you want to keep it

@ThomasRubini
Contributor

@Proxyfil there are conflicts

@Proxyfil Proxyfil force-pushed the feat/implement-full-flow branch from 562c671 to 7f1b4d4 Compare March 16, 2026 14:38

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
backend/src/vm_lifecycle.rs (2)

250-269: ⚠️ Potential issue | 🔴 Critical

Startup failure paths can orphan a running VM thread.

On timeout/bridge-attach failure, the function returns after flipping vm_running, but may leave the VMM running because the internal stop flag is not signaled on all paths before returning.

Proposed fix
             if start.elapsed() >= max_wait {
                 error!(vm_id = %vm_id, tap = %tap_device, "Tap device not created within timeout");
+                if let Some(stop) = &vmm_stop_handle {
+                    stop.store(false, std::sync::atomic::Ordering::SeqCst);
+                }
                 vm_running.store(false, std::sync::atomic::Ordering::SeqCst);
                 let _ = Self::release_ip_internal(&vm_id, &ip_manager);
                 return Err(VmError::NetworkSetup(format!(
                     "Tap device {} not created within {:?}",
                     tap_device, max_wait
                 )));
             }
@@
         if let Err(e) = virt::network::setup_guest_iface(&tap_device, &config.bridge_name).await {
             error!(vm_id = %vm_id, "Failed to attach tap to bridge: {}", e);
+            if let Some(stop) = &vmm_stop_handle {
+                stop.store(false, std::sync::atomic::Ordering::SeqCst);
+            }
             vm_running.store(false, std::sync::atomic::Ordering::SeqCst);
             let _ = Self::release_ip_internal(&vm_id, &ip_manager);
             return Err(VmError::NetworkSetup(e.to_string()));
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/src/vm_lifecycle.rs` around lines 250 - 269, On both failure paths
(the tap-creation timeout branch and the bridge attach error after
virt::network::setup_guest_iface) ensure you signal the VM's internal stop
flag/mechanism used by the running VMM thread in addition to flipping vm_running
and calling Self::release_ip_internal; update the timeout branch and the Err(e)
branch to invoke the same stop-signal (channel send or atomic/store used by the
VM thread) before returning Err(VmError::NetworkSetup(...)) so the VMM thread is
reliably asked to exit and cannot be orphaned.
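One way to make "signal the stop flag on every failure path" hard to forget is a small Drop guard that fires on any early return and is disarmed only on success. This is a hypothetical pure-std sketch of the pattern, not the actual `vm_lifecycle.rs` types; it mirrors the flag semantics in the diff above (`store(false)` asks the VMM thread to exit):

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};

// Guard that signals the VMM stop flag if startup bails out early.
struct StopOnError {
    stop: Arc<AtomicBool>,
    armed: bool,
}

impl StopOnError {
    fn new(stop: Arc<AtomicBool>) -> Self {
        Self { stop, armed: true }
    }
    fn disarm(mut self) {
        self.armed = false; // startup succeeded; leave the VMM running
    }
}

impl Drop for StopOnError {
    fn drop(&mut self) {
        if self.armed {
            // Any early return (tap timeout, bridge-attach failure) lands here.
            self.stop.store(false, Ordering::SeqCst);
        }
    }
}

fn start_vm(fail: bool, stop: Arc<AtomicBool>) -> Result<(), String> {
    let guard = StopOnError::new(stop);
    if fail {
        return Err("tap device not created within timeout".into()); // guard fires
    }
    guard.disarm();
    Ok(())
}

fn main() {
    let stop = Arc::new(AtomicBool::new(true)); // true = keep running
    let _ = start_vm(true, stop.clone());
    assert!(!stop.load(Ordering::SeqCst)); // failure path signaled stop
}
```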

434-437: ⚠️ Potential issue | 🟠 Major

Move thread.join() to spawn_blocking to prevent blocking the async executor.

destroy() is an async function that calls thread.join() without timeout at line 436. Although the comment says "Don't wait forever", JoinHandle::join() actually blocks indefinitely until the thread completes. In a tokio runtime, this blocks a worker thread and reduces available capacity for other tasks. Use tokio::task::spawn_blocking() to move the join to a dedicated thread pool.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/src/vm_lifecycle.rs` around lines 434 - 437, In destroy(), avoid
blocking the async executor by moving the blocking JoinHandle::join() call onto
the blocking pool: when self.vm_thread.take() yields Some(thread), call
tokio::task::spawn_blocking(...) to run thread.join() there (await the
spawn_blocking handle), rather than calling thread.join() directly; update
references around destroy(), vm_thread and the JoinHandle::join() invocation to
use spawn_blocking so the async runtime isn’t stalled.
♻️ Duplicate comments (2)
backend/README.md (1)

41-43: ⚠️ Potential issue | 🟠 Major

Build/run target mismatch is reintroduced in setup instructions.

The docs build backend for x86_64-unknown-linux-musl but later run ../target/debug/backend (host path), so the documented run binary may not exist.

Proposed fix
 rustup target add x86_64-unknown-linux-musl
-cargo build -p backend -p agent --target x86_64-unknown-linux-musl
+cargo build -p backend
+cargo build -p agent --target x86_64-unknown-linux-musl
 cp ./target/x86_64-unknown-linux-musl/debug/agent ./backend/cloude-agentd
 chmod +x ./backend/cloude-agentd
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/README.md` around lines 41 - 43, The README shows building backend
and agent for the musl target (commands using x86_64-unknown-linux-musl and
producing ./target/x86_64-unknown-linux-musl/debug/agent) but later refers to
running ../target/debug/backend which won’t exist for the musl build; update the
instructions so the run step points to the same built binary path (e.g.,
./target/x86_64-unknown-linux-musl/debug/backend) or change the build step to
produce the host debug binary (remove the --target flag) so
../target/debug/backend matches; adjust references to agent/cloude-agentd
consistently (agent vs cloude-agentd) to avoid path/name mismatch.
backend/src/initramfs_manager.rs (1)

149-152: ⚠️ Potential issue | 🟠 Major

Cleanup predicate is over-broad and may delete non-initramfs files.

Current logic removes any same-prefix file, not only initramfs artifacts. In a shared/custom initramfs_dir, this can delete unrelated files.

Proposed fix
                 if let Some(fname) = path.file_name().and_then(|s| s.to_str()) {
+                    let is_initramfs_artifact =
+                        fname.ends_with(".cpio.gz") || fname.ends_with(".cpio.gz.meta");
                     if fname.starts_with(current_prefix)
+                        && is_initramfs_artifact
                         && fname != current_filename
                         && fname != current_metadata_filename
                     {
                         fs::remove_file(&path)
                             .map_err(|e| Error::new(ErrorKind::Other, e.to_string()))?;
                     }
                 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/src/initramfs_manager.rs` around lines 149 - 152, The current cleanup
predicate using fname.starts_with(current_prefix) && fname != current_filename
&& fname != current_metadata_filename is too broad and may remove unrelated
files; change the predicate to only target initramfs artifacts by additionally
checking that fname matches the repository's initramfs naming pattern (e.g.,
specific suffixes or exact constructed names) — use the existing symbols fname,
current_prefix, current_filename, current_metadata_filename to implement a
stricter test such as ensuring fname ends_with known initramfs suffix(es) or
equals one of the canonical generated names derived from current_prefix before
deleting; this ensures only real initramfs images/metadata are removed.
🧹 Nitpick comments (2)
backend/src/main.rs (2)

30-30: IpManager is double-synchronized; outer Mutex is redundant.

IpManager already serializes internally, so wrapping it in Arc<Mutex<...>> adds avoidable contention and lock layering. Prefer Arc<IpManager> and pass shared references directly.

Also applies to: 255-262

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/src/main.rs` at line 30, The ip_manager field is wrapped in
Arc<Mutex<IpManager>> which double-synchronizes because IpManager already
handles internal synchronization; change the type to Arc<IpManager> (replace
uses of ip_manager.lock()/mutex guards with direct Arc-cloned references or
direct method calls on IpManager) and remove the unnecessary Mutex
wraps/imports; also update all other places that currently call
ip_manager.lock() (e.g., the usages noted around the later block) to call
IpManager's thread-safe methods or clone the Arc before invoking methods so no
explicit locking remains.

424-445: Retry loop stops on first non-2xx, including transient server failures.

For resilience, consider retrying transient statuses (e.g., 429/5xx) with backoff instead of breaking immediately on any non-success response.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/src/main.rs` around lines 424 - 445, The current retry loop breaks on
any non-2xx response; update the loop around
state.client.post(&execute_url).json(&request_payload).send().await so that for
transient statuses (e.g., resp.status().as_u16() == 429 or
resp.status().is_server_error()) you do not set execution_result and break
immediately but instead perform an exponential backoff (use tokio::time::sleep
with a duration derived from attempt, e.g., base_ms * 2_u64.pow(attempt-1)) and
continue to the next iteration; only parse into AgentExecuteResponse and set
execution_result on successful responses, and for non-retryable client errors
set execution_result to an Err and break. Ensure you reference execute_url,
state.client.post, AgentExecuteResponse, execution_result and the for attempt in
1..=5 loop when making the changes.
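The backoff schedule and transient-status test described above can be sketched in isolation; the helper names and the 100 ms base are hypothetical choices, not the backend's actual constants:

```rust
use std::time::Duration;

// Exponential backoff: base_ms * 2^(attempt - 1), attempt starting at 1.
fn backoff(attempt: u32, base_ms: u64) -> Duration {
    Duration::from_millis(base_ms * 2u64.pow(attempt - 1))
}

// Retry only statuses that are plausibly transient: 429 and 5xx.
fn is_transient(status: u16) -> bool {
    status == 429 || (500..600).contains(&status)
}

fn main() {
    let delays: Vec<u64> = (1..=5).map(|a| backoff(a, 100).as_millis() as u64).collect();
    assert_eq!(delays, vec![100, 200, 400, 800, 1600]);
    assert!(is_transient(503));
    assert!(!is_transient(404)); // client error: fail fast, no retry
}
```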

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f8d1de6b-f151-4880-b636-3ea755e1c07a

📥 Commits

Reviewing files that changed from the base of the PR and between 562c671 and 7f1b4d4.

📒 Files selected for processing (9)
  • QUICKSTART.md
  • agent/src/main.rs
  • backend/README.md
  • backend/src/initramfs_manager.rs
  • backend/src/main.rs
  • backend/src/vm_lifecycle.rs
  • vmm/README.md
  • vmm/src/devices/stdin.rs
  • vmm/src/lib.rs
🚧 Files skipped from review as they are similar to previous changes (4)
  • vmm/src/devices/stdin.rs
  • agent/src/main.rs
  • QUICKSTART.md
  • vmm/src/lib.rs

Comment on lines +347 to +355
candidates.sort_by(|a, b| {
a.file_name()
.and_then(|s| s.to_str())
.cmp(&b.file_name().and_then(|s| s.to_str()))
});

Ok(initramfs_path)
match candidates.into_iter().next_back() {
Some(path) => Ok(path),
None => {

⚠️ Potential issue | 🟠 Major

Initramfs selection is version-unsafe due to lexicographic sort.

Sorting by filename string can choose the wrong version (python-3.9 > python-3.10). This can boot an unexpected runtime.

Proposed fix
-        candidates.sort_by(|a, b| {
-            a.file_name()
-                .and_then(|s| s.to_str())
-                .cmp(&b.file_name().and_then(|s| s.to_str()))
-        });
-
-        match candidates.into_iter().next_back() {
+        let mut newest: Option<(PathBuf, std::time::SystemTime)> = None;
+        for path in candidates {
+            let modified = tokio::fs::metadata(&path)
+                .await
+                .ok()
+                .and_then(|m| m.modified().ok())
+                .unwrap_or(std::time::SystemTime::UNIX_EPOCH);
+            match &newest {
+                Some((_, t)) if &modified <= t => {}
+                _ => newest = Some((path, modified)),
+            }
+        }
+
+        match newest.map(|(p, _)| p) {
             Some(path) => Ok(path),
             None => {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/src/vm_lifecycle.rs` around lines 347 - 355, The current selection
sorts candidates lexicographically using candidates.sort_by with
file_name().and_then(|s| s.to_str()).cmp(...) and then picks next_back(), which
misorders versions like "python-3.9" vs "python-3.10"; change the comparison to
parse a semantic/numeric version from the filename (e.g., via a regex extracting
major/minor/patch or use the semver crate) and sort by that numeric tuple
instead, falling back to the original filename string comparison if parsing
fails; update the candidates.sort_by call to compare parsed (major, minor,
patch) integers so match candidates.into_iter().next_back() reliably returns the
highest semantic version.
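The mtime-based variant from the proposed fix above can be sketched in pure std (synchronous `std::fs` here instead of `tokio::fs`; the temp-dir layout and file names are illustrative):

```rust
use std::fs::File;
use std::path::PathBuf;
use std::time::SystemTime;

// Among candidate paths, keep the one with the newest modification time;
// files whose metadata is unreadable sort oldest (UNIX_EPOCH).
fn newest(candidates: Vec<PathBuf>) -> Option<PathBuf> {
    let mut best: Option<(PathBuf, SystemTime)> = None;
    for path in candidates {
        let modified = std::fs::metadata(&path)
            .ok()
            .and_then(|m| m.modified().ok())
            .unwrap_or(SystemTime::UNIX_EPOCH);
        match &best {
            Some((_, t)) if &modified <= t => {}
            _ => best = Some((path, modified)),
        }
    }
    best.map(|(p, _)| p)
}

fn main() {
    let dir = std::env::temp_dir().join("initramfs-mtime-demo");
    std::fs::create_dir_all(&dir).unwrap();
    let old = dir.join("python-3.9.cpio.gz");
    let newer = dir.join("python-3.10.cpio.gz");
    File::create(&old).unwrap();
    std::thread::sleep(std::time::Duration::from_millis(1100)); // distinct mtimes
    File::create(&newer).unwrap();
    // Lexicographically "3.10" < "3.9", but by mtime the 3.10 image wins.
    assert_eq!(newest(vec![old, newer.clone()]), Some(newer));
}
```

Note the trade-off: mtime tracks "most recently built" rather than "highest version", which matches the cache-refresh behavior but differs from the semver-parsing alternative suggested in the AI prompt.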

@ThomasRubini ThomasRubini merged commit 70a14e2 into main Mar 16, 2026
4 checks passed
@Proxyfil Proxyfil requested a review from ThomasRubini March 16, 2026 16:24
@ThomasRubini ThomasRubini deleted the feat/implement-full-flow branch March 16, 2026 16:24
