rLLM's `@rllm.rollout` decorator makes it easy to turn any agent into an RL training loop. But when rollouts execute LLM-generated code (e.g., coding tasks, tool use, reward evaluation), each episode runs with full host access: filesystem, network, environment variables.
At scale this creates two problems:
- Safety: A single malicious or buggy rollout can corrupt the host, leak data, or interfere with other rollouts running in parallel.
- Reproducibility: Rollouts that depend on shared mutable state (temp files, network, system time) produce non-deterministic rewards, as the sketch below illustrates.
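To make the reproducibility problem concrete, here is a minimal sketch of a reward check that races on shared state (the `reward` function and scratch path are invented for illustration). Run two of these in parallel and the scores depend on scheduling, not on the episodes:

```python
import json
import time
from pathlib import Path

SCRATCH = Path("/tmp/rollout_scratch.json")  # shared by all parallel rollouts

def reward(episode_output: str) -> float:
    # Write-then-read on a shared file: another rollout may overwrite
    # SCRATCH between these two lines, flipping the reward at random.
    SCRATCH.write_text(json.dumps({"out": episode_output, "t": time.time()}))
    cached = json.loads(SCRATCH.read_text())
    return 1.0 if cached["out"] == episode_output else 0.0
```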
Container-based isolation (Docker, Firecracker) works but adds significant overhead per episode:
| Approach | Startup per episode | Overhead at 100K episodes |
|---|---|---|
| Docker | ~200ms | 5.6 hours |
| Firecracker (E2B) | ~150ms | 4.2 hours |
| sandlock | ~5ms | 8 minutes |
| sandlock COW fork | ~0.5ms | 50 seconds |
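(The overhead column is just per-episode startup cost times the episode count: e.g. ~200 ms × 100,000 ≈ 5.6 hours spent on sandbox startup alone.)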
sandlock provides kernel-level process isolation using Landlock + seccomp. No containers, no VMs, no root. Each rollout gets its own filesystem, network, memory, and process limits enforced by the kernel.
Example integration with a rollout function:
```python
from sandlock import Sandbox, Policy

policy = Policy(
    fs_readable=["/usr", "/lib", "/lib64", "/etc"],
    fs_writable=["/tmp/rollout"],
    net_allow_hosts=[],                 # no network
    max_memory="512M",
    max_processes=10,
    random_seed=episode_id,             # deterministic randomness
    time_start="2025-01-01T00:00:00",   # frozen time
)

result = Sandbox(policy).run(["python3", "-c", generated_code])
# result.stdout, result.exit_code
```
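For context, here is a rough sketch of how this could slot into an `@rllm.rollout` function. The exact decorator signature, the per-episode scratch directory, and the exit-code-based reward are assumptions for illustration, not rLLM's actual API:

```python
# Hypothetical wiring; rollout arguments and reward logic are placeholders.
import rllm
from sandlock import Sandbox, Policy

@rllm.rollout
def rollout(episode_id: int, generated_code: str) -> float:
    policy = Policy(
        fs_readable=["/usr", "/lib", "/lib64", "/etc"],
        fs_writable=[f"/tmp/rollout/{episode_id}"],  # assumed per-episode scratch dir
        net_allow_hosts=[],
        max_memory="512M",
        max_processes=10,
        random_seed=episode_id,                      # deterministic per episode
        time_start="2025-01-01T00:00:00",
    )
    result = Sandbox(policy).run(["python3", "-c", generated_code])
    # Placeholder reward: did the generated program exit cleanly?
    return 1.0 if result.exit_code == 0 else 0.0
```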
For batch rollouts, COW fork initializes once and clones cheaply:
```python
mapper = Sandbox(policy, init_fn=load_env, work_fn=run_episode)
clones = mapper.fork(1000)  # ~530ms for 1000 clones, shared memory pages
```
Each clone inherits kernel-enforced isolation, and `CLONE_ID` (0..N-1) is set automatically.
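A minimal sketch of the `init_fn`/`work_fn` pair, assuming `CLONE_ID` is delivered to each clone as an environment variable (the `EPISODES` list and prompt are illustrative):

```python
import os

EPISODES = [f"task-{i}" for i in range(1000)]  # loaded once, shared COW

def load_env():
    # Runs once before the fork; anything built here is shared
    # copy-on-write across all clones.
    global MODEL_PROMPT
    MODEL_PROMPT = "solve: "

def run_episode():
    # Assuming CLONE_ID arrives as an environment variable, each clone
    # selects its own episode from the shared list.
    idx = int(os.environ["CLONE_ID"])
    return f"{MODEL_PROMPT}{EPISODES[idx]}"
```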
Would the rLLM team be interested in a sandbox backend for rollout execution? Happy to help with integration or answer questions.