Skip to content

Commit 28159fc

Browse files
G4614Gamnaam Songclaude
authored
fix(guest): exec timeout watcher must SIGKILL, not SIGALRM by GHSA-xjhv-pp2r-6f82 (#543)
* fix(guest): exec timeout watcher must SIGKILL, not SIGALRM The timeout watcher in src/guest/src/service/exec/timeout.rs sent Signal::SIGALRM (signal 14, catchable) when an exec exceeded its deadline. Any workload that installed `signal.signal(SIGALRM, SIG_IGN)` in Python or `trap '' ALRM` in shell absorbed the signal and ran to completion past the configured timeout, returning exit_code=0 — bypassing the deadline entirely. The inline comment already said "Kill process with SIGKILL"; only the code passed the wrong Signal. Use Signal::SIGKILL (signal 9, uncatchable) so the watcher guarantees termination regardless of in-process signal handlers. An audit of catchable-signal usage across the codebase found no other instances of this bug class. All graceful-shutdown paths (container/lifecycle.rs, exec/registry.rs, vmm/controller/shim.rs, runtime/rt_impl.rs, cli/serve/mod.rs) already escalate SIGTERM → wait → SIGKILL correctly. Two regression tests guard the fix: - sdks/python/tests/test_exec_timeout_sigalrm.py — Python pytest port of the reporter PoC: workload installs SIG_IGN for SIGALRM, sleeps 8s with timeout=3s. Asserts result.exit_code != 0 AND elapsed < TIMEOUT_S + 2.0. - src/boxlite/tests/exec_options.rs::test_timeout_kills_sigalrm_ignoring_process — canonical Rust integration test (counterpart to security_enforcement.rs). Shell workload uses `trap '' ALRM; sleep 15` with timeout=2s; same assertion shape. Verified via controlled experiment (reverting timeout.rs to SIGALRM): both new tests FAIL in the bug state with rich diagnostics (`exit_code=0`, elapsed near WORKLOAD_S), both PASS with the fix. Three independent control tests (test_api_key_credential_get_token, test_oci_symlink_escape_blocked, test_timeout_kills_long_command) PASS in both states — confirming the fix is precisely scoped and the new tests aren't trivially-passing or trivially-failing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor(guest): two-stage SIGTERM/grace/SIGKILL on exec timeout Replaces the single-step SIGKILL with the SIGTERM → grace → SIGKILL pattern used elsewhere in the codebase (ExecRegistry::shutdown_all, Container::shutdown, vmm/controller/shim.rs, runtime/rt_impl.rs). Why two stages: - SIGKILL alone is correct but harsh — cooperative workloads have no chance to flush buffers, close files, or commit transactions. Sandbox consumers (AI agents running short-lived scripts) usually want both "let me clean up" AND "hard deadline still enforced". - Industry default (Docker --stop-timeout, K8s terminationGracePeriod, systemd TimeoutStopSec) is SIGTERM-first with SIGKILL fallback. Aligns with user expectations. - The watcher now logs whether grace was respected (info: exited cleanly) or escalation was required (warn: workload absorbed SIGTERM). The hard-deadline guarantee from the previous commit is preserved: any workload that ignores SIGTERM via SIG_IGN / signal trap is still killed by SIGKILL after the 2-second grace expires. Stage-3 regression test: sdks/python/tests/test_exec_timeout_sigalrm.py gains test_exec_timeout_sigkill_fallback_when_sigterm_ignored. The workload installs SIG_IGN for both SIGTERM and SIGALRM (every catchable signal the watcher might fire in stages 1+2), so only the uncatchable SIGKILL escalation after grace can stop it. Asserts exit_code != 0 AND elapsed < TIMEOUT_S + GRACE_S + 2.0. This complements the existing test_exec_timeout_kills_sigalrm_ignoring_process (which exercises stage-1 SIGTERM against a workload that only traps SIGALRM). Together the two tests cover all three stages of the watcher. Verified via controlled experiments (logs in ../log/): - 01-baseline-sigkill-*.log — single-step SIGKILL baseline, 5/5 PASS - 02-two-stage-*.log — after refactor, 5/5 PASS (no regression in existing tests) - 03-stage3-test-*.log — with new stage-3 test, 6/6 PASS - 04-control-sigterm-only-*.log — timeout.rs reverted to SIGTERM-only (no SIGKILL fallback); only the new stage-3 test FAILS with diagnostic "exec returned exit_code=0 after 8.08s", proving the test catches a genuine regression and is not always-pass Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Gamnaam Song <gamnaamsong@Airbus380-841-Main-Controller.local> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 7f8df26 commit 28159fc

3 files changed

Lines changed: 235 additions & 7 deletions

File tree

Lines changed: 147 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,147 @@
1+
"""Regression test for exec timeout bypass via SIGALRM.
2+
3+
``box.exec(timeout=N)`` must terminate any process whose runtime exceeds N,
4+
including one that installs ``signal.signal(SIGALRM, SIG_IGN)``. The guest's
5+
timeout watcher must use SIGKILL (signal 9, uncatchable). If it sends SIGALRM
6+
(signal 14, catchable/ignorable) the workload absorbs the signal, runs to
7+
completion, and ``ExecResult.exit_code`` comes back as 0 — the deadline is
8+
bypassed.
9+
10+
The Python SDK does NOT raise ``boxlite.TimeoutError`` on exec timeout; it
11+
returns an ``ExecResult`` whose ``exit_code`` reflects how the process died
12+
(non-zero / signal). The PoC at ``boxlite_poc/poc_sigalrm_bypass.py``
13+
confirms this: it inspects ``exit_code`` and elapsed wall-time, never an
14+
exception.
15+
16+
Requirements:
17+
- make dev:python (build the Python SDK with native extension)
18+
- VM runtime for integration tests (libkrun + Hypervisor.framework)
19+
"""
20+
21+
from __future__ import annotations
22+
23+
import time
24+
25+
import pytest
26+
27+
import boxlite
28+
29+
pytestmark = [pytest.mark.integration, pytest.mark.asyncio]
30+
31+
# Workload: install SIG_IGN for SIGALRM, then sleep WORKLOAD_S seconds.
32+
# If the timeout watcher fires SIGALRM, the workload absorbs it and the
33+
# loop completes (exit_code=0). If it fires SIGKILL, the process dies
34+
# mid-loop (exit_code != 0, elapsed ~ TIMEOUT_S).
35+
IGNORE_SIGALRM = """
36+
import sys, time, signal
37+
seconds = int(sys.argv[1]) if len(sys.argv) > 1 else 8
38+
signal.signal(signal.SIGALRM, signal.SIG_IGN)
39+
for _ in range(seconds):
40+
time.sleep(1)
41+
"""
42+
43+
# Stage-3 workload: ignore every catchable signal the watcher might send
44+
# in stages 1/2 (SIGTERM is the cooperative ask; SIGALRM is the historical
45+
# bug we already fixed). Only SIGKILL — which is uncatchable — can stop
46+
# this process. Used to verify the SIGKILL fallback after the grace window.
47+
IGNORE_TERM_AND_ALRM = """
48+
import sys, time, signal
49+
seconds = int(sys.argv[1]) if len(sys.argv) > 1 else 8
50+
signal.signal(signal.SIGTERM, signal.SIG_IGN)
51+
signal.signal(signal.SIGALRM, signal.SIG_IGN)
52+
for _ in range(seconds):
53+
time.sleep(1)
54+
"""
55+
56+
TIMEOUT_S = 3.0
57+
WORKLOAD_S = 8
58+
59+
# Mirrors TIMEOUT_GRACE in src/guest/src/service/exec/timeout.rs.
60+
# Keep in sync if the guest grace constant changes.
61+
GRACE_S = 2.0
62+
63+
64+
async def test_exec_timeout_kills_sigalrm_ignoring_process():
65+
"""A workload that ignores SIGALRM must still be killed at the timeout.
66+
67+
Two-pronged assertion — both must hold:
68+
69+
1) ``exit_code != 0`` — the workload must NOT have completed normally.
70+
With the SIGALRM bug, SIG_IGN absorbs the timeout signal, the loop
71+
finishes, and Python exits cleanly with 0.
72+
73+
2) ``elapsed < TIMEOUT_S + 2.0`` — termination must happen near the
74+
configured deadline, not at the workload's natural end. Even if
75+
``exit_code`` happens to be non-zero for some other reason, a long
76+
elapsed time means the watcher did not fire promptly.
77+
78+
Fix is in ``src/guest/src/service/exec/timeout.rs``: send ``SIGKILL``
79+
(uncatchable) rather than ``SIGALRM``.
80+
"""
81+
async with boxlite.SimpleBox(image="python:3-alpine") as box:
82+
t0 = time.time()
83+
result = await box.exec(
84+
"python3",
85+
"-c",
86+
IGNORE_SIGALRM,
87+
str(WORKLOAD_S),
88+
timeout=TIMEOUT_S,
89+
)
90+
elapsed = time.time() - t0
91+
92+
assert result.exit_code != 0, (
93+
f"exec returned exit_code=0 after {elapsed:.2f}s — the {WORKLOAD_S}s "
94+
f"workload completed normally despite timeout={TIMEOUT_S}s. The "
95+
f"guest's timeout watcher is sending a catchable signal that the "
96+
f"workload absorbs via SIG_IGN; the kill must use SIGKILL."
97+
)
98+
assert elapsed < TIMEOUT_S + 2.0, (
99+
f"exec returned after {elapsed:.2f}s with exit_code={result.exit_code} "
100+
f"— expected termination near {TIMEOUT_S}s. The timeout watcher "
101+
f"is not killing the process promptly."
102+
)
103+
104+
105+
async def test_exec_timeout_sigkill_fallback_when_sigterm_ignored():
106+
"""Stage-3 SIGKILL must terminate a workload that ignores SIGTERM.
107+
108+
Companion to ``test_exec_timeout_kills_sigalrm_ignoring_process``: that
109+
test exercises stage-1 (cooperative SIGTERM kills the workload because
110+
it only traps SIGALRM). This test traps both SIGTERM and SIGALRM, so
111+
the watcher's stages 1+2 are absorbed and only the uncatchable SIGKILL
112+
after the grace window can stop the process.
113+
114+
A regression that drops stage-3 — e.g., a single-stage watcher that
115+
sends only SIGTERM and never escalates — would let this workload run
116+
to natural completion (exit_code=0 after ~WORKLOAD_S).
117+
118+
Expected timing with the two-stage watcher:
119+
- t = TIMEOUT_S: SIGTERM sent, absorbed by SIG_IGN
120+
- t = TIMEOUT_S + GRACE_S: SIGKILL sent, process dies (exit_code=-9)
121+
"""
122+
async with boxlite.SimpleBox(image="python:3-alpine") as box:
123+
t0 = time.time()
124+
result = await box.exec(
125+
"python3",
126+
"-c",
127+
IGNORE_TERM_AND_ALRM,
128+
str(WORKLOAD_S),
129+
timeout=TIMEOUT_S,
130+
)
131+
elapsed = time.time() - t0
132+
133+
assert result.exit_code != 0, (
134+
f"stage-3 SIGKILL fallback broken: exec returned exit_code=0 "
135+
f"after {elapsed:.2f}s. The workload ignores SIGTERM AND was not "
136+
f"SIGKILL'd by the watcher — either grace expired without sending "
137+
f"SIGKILL, or stage-3 escalation was removed from "
138+
f"src/guest/src/service/exec/timeout.rs."
139+
)
140+
# Expected ~ TIMEOUT_S + GRACE_S = 5.0s. Allow +2.0s headroom for VM
141+
# / SDK overhead. If elapsed approaches WORKLOAD_S, the watcher is
142+
# not enforcing the deadline at all.
143+
assert elapsed < TIMEOUT_S + GRACE_S + 2.0, (
144+
f"stage-3 SIGKILL fallback fired too late: elapsed={elapsed:.2f}s "
145+
f"(expected ~{TIMEOUT_S + GRACE_S:.1f}s, workload was {WORKLOAD_S}s) "
146+
f"— grace period drift, or the watcher is not killing promptly."
147+
)

src/boxlite/tests/exec_options.rs

Lines changed: 45 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -93,6 +93,51 @@ async fn test_timeout_kills_long_command() {
9393
tb.teardown().await;
9494
}
9595

96+
/// Regression test for exec timeout bypass via SIGALRM.
97+
///
98+
/// Companion to the Python-SDK PoC at
99+
/// `sdks/python/tests/test_exec_timeout_sigalrm.py`. The guest's timeout
100+
/// watcher must use SIGKILL (uncatchable). If it sends SIGALRM (catchable),
101+
/// the workload below — a shell that installs `trap '' ALRM` and then
102+
/// sleeps for 15 seconds — absorbs the signal, the underlying `sleep`
103+
/// runs to its natural end, and exec returns `exit_code=0` after ~15s,
104+
/// bypassing the 2-second deadline.
105+
///
106+
/// The fix lives in `src/guest/src/service/exec/timeout.rs`.
107+
#[tokio::test]
108+
async fn test_timeout_kills_sigalrm_ignoring_process() {
109+
let tb = TestBox::new().await;
110+
111+
let start = std::time::Instant::now();
112+
let execution = tb
113+
.handle
114+
.exec(
115+
BoxCommand::new("sh")
116+
.args(["-c", "trap '' ALRM; sleep 15"])
117+
.timeout(Duration::from_secs(2)),
118+
)
119+
.await
120+
.expect("exec failed");
121+
122+
let result = execution.wait().await.expect("wait failed");
123+
let elapsed = start.elapsed();
124+
125+
assert_ne!(
126+
result.exit_code, 0,
127+
"timeout bypass: shell exited with exit_code=0 after {elapsed:?} \
128+
despite timeout=2s — the guest is sending a catchable signal that \
129+
the shell absorbs via `trap '' ALRM`; the kill must use SIGKILL"
130+
);
131+
assert!(
132+
elapsed < Duration::from_secs(8),
133+
"timeout did not curtail the workload: elapsed={elapsed:?} \
134+
(expected near 2s, workload was 15s) — the watcher is not killing \
135+
the process promptly"
136+
);
137+
138+
tb.teardown().await;
139+
}
140+
96141
/// Combine working_dir and user in a single command.
97142
#[tokio::test]
98143
async fn test_working_dir_with_user() {
Lines changed: 43 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -1,15 +1,28 @@
11
//! Timeout management.
22
//!
3-
//! Kills process if execution exceeds timeout duration.
3+
//! Two-stage termination when execution exceeds its deadline:
4+
//! SIGTERM first (cooperative cleanup), then SIGKILL after a grace
5+
//! period to enforce the hard deadline against workloads that ignore
6+
//! or trap SIGTERM.
47
58
use crate::service::exec::state::ExecutionState;
69
use std::time::Duration;
7-
use tracing::info;
10+
use tracing::{info, warn};
11+
12+
/// Grace period between SIGTERM and SIGKILL on exec timeout.
13+
///
14+
/// Short enough that sandboxed execs still see a near-deadline kill,
15+
/// long enough that cooperative workloads can flush buffers, close
16+
/// files, and exit cleanly. Mirrors the SIGTERM→wait→SIGKILL pattern
17+
/// used by `ExecRegistry::shutdown_all` and `Container::shutdown`.
18+
const TIMEOUT_GRACE: Duration = Duration::from_secs(2);
819

920
/// Start timeout watcher.
1021
///
11-
/// After duration elapses, marks execution as timed out and kills
12-
/// the process if it's still running.
22+
/// After `timeout` elapses, sends SIGTERM and waits up to `TIMEOUT_GRACE`
23+
/// for the process to exit, then escalates to SIGKILL. SIGKILL is
24+
/// uncatchable, so a workload that installs `SIG_IGN`/handlers for
25+
/// SIGTERM (or SIGALRM, etc.) cannot outlive its deadline.
1326
pub(super) fn start_timeout_watcher(
1427
exec_state: ExecutionState,
1528
exec_id: String,
@@ -18,10 +31,33 @@ pub(super) fn start_timeout_watcher(
1831
tokio::spawn(async move {
1932
tokio::time::sleep(timeout).await;
2033

21-
// Kill process with SIGKILL
2234
use nix::sys::signal::Signal;
23-
if exec_state.kill(Signal::SIGALRM).await {
24-
info!(execution_id = %exec_id, "killed on timeout");
35+
36+
// Stage 1: SIGTERM — polite termination request.
37+
if !exec_state.kill(Signal::SIGTERM).await {
38+
// Process already exited on its own; nothing more to do.
39+
return;
40+
}
41+
info!(
42+
execution_id = %exec_id,
43+
grace_ms = TIMEOUT_GRACE.as_millis() as u64,
44+
"SIGTERM on timeout; grace before SIGKILL"
45+
);
46+
47+
// Stage 2: wait for the grace window. `exec_state.kill` already
48+
// returns false once the process is reaped, so we do not need to
49+
// poll — we just sleep the full grace then ask for SIGKILL.
50+
tokio::time::sleep(TIMEOUT_GRACE).await;
51+
52+
// Stage 3: SIGKILL fallback. Returns false if SIGTERM was honored
53+
// during the grace window (clean exit, no escalation needed).
54+
if exec_state.kill(Signal::SIGKILL).await {
55+
warn!(
56+
execution_id = %exec_id,
57+
"SIGKILL after grace expired; workload did not exit on SIGTERM"
58+
);
59+
} else {
60+
info!(execution_id = %exec_id, "exited within grace after SIGTERM");
2561
}
2662
});
2763
}

0 commit comments

Comments
 (0)