fix(supervisor): prevent file descriptor leaks in SSE streaming and IPC#267
Conversation
Fix 'too many open files' error caused by FD leaks: 1. SSE log streaming (logs.rs): Reuse file handle between iterations instead of reopening every 500ms. Prevents FD exhaustion from rapid file open/close cycles in tight loops. 2. IPC connection tasks (server.rs): Add proper cleanup on send failures to prevent task accumulation from dead connections. These changes should significantly reduce FD usage during long-running supervisor sessions with active web UI log streaming. Co-authored-by: Kimi K2.5 <kimi-k2.5@opencode.ai>
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the stability and resource management of the supervisor by addressing two key areas prone to file descriptor leaks. By optimizing SSE log streaming to reuse file handles and implementing robust cleanup for IPC connection tasks, the changes prevent 'too many open files' errors and ensure more efficient resource utilization during extended operation. Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request effectively addresses potential file descriptor leaks by ensuring spawned tasks in src/ipc/server.rs terminate on IPC channel breaks and by refactoring SSE log streaming in src/web/routes/logs.rs to reuse a single file handle. These changes improve resource management and stability. However, a medium-severity security issue was identified in the log streaming logic where unbounded memory allocation could lead to a denial of service if log files grow excessively fast, which requires attention.
Address security and code quality concerns from code review: 1. Security fix: Limit log reads to 1MB per iteration to prevent memory exhaustion from rapidly-growing log files. Uses bounded read_exact instead of unbounded read_to_end. 2. Code style: Replace unwrap() with Option::insert() for cleaner, more idiomatic Rust when managing the file handle. Both changes improve robustness of the SSE log streaming endpoint.
Greptile SummaryThis PR addresses
The core FD-leak fix is well-structured and directly addresses the reported Confidence Score: 4/5
Last reviewed commit: d19e09e |
Additional Comments (1)
The helper function logs write failures at if let Err(err) = send.write_all(&msg).await {
trace!("Failed to send message: {err:?}");
}
Ok(()) // ← error never propagatedAs a result, the if let Err(err) = Self::send(&mut send, msg).await {
warn!("Failed to send message: {err:?}");
break; // ← can only trigger on serialize() errors, not write failures
}The task will continue looping and attempting writes to a dead connection. To make the cleanup effective, propagate the write error: |
|
I don't know enough about file system SSE architecture to confidently continue with this whack-a-mole game but I would like some feedback @jdx if you are comfortable sharing? |
jdx
left a comment
There was a problem hiding this comment.
A few suggestions on the logs.rs changes:
1. last_size should initialize from file metadata
Initializing last_size = 0 means the first iteration will stream up to 1MB of existing log content on every new SSE connection. The original code initialized from the current file size so it only streamed new content. This is a behavior regression for large log files.
let mut last_size = std::fs::metadata(&log_path).map(|m| m.len()).unwrap_or(0);2. Manual read loop → take().read_to_end()
The manual read loop reimplements what Read::take already does:
let bytes_to_read = (current_size - ls).min(MAX_READ_SIZE);
let mut buffer = Vec::new();
match file.take(bytes_to_read).read_to_end(&mut buffer) {
Ok(0) | Err(_) => return (None, ls, Some(FileOpResult::ReadFailed), current_ino),
Ok(n) => {
ls += n as u64;
return (Some(file), ls, Some(FileOpResult::Data(buffer)), current_ino);
}
}Note: take() consumes the File by value, but you can work around this by calling it on &mut file (i.e. (&mut file).take(bytes_to_read).read_to_end(&mut buffer)), since Read is implemented for &File.
3. Use a struct instead of a 4-tuple for spawn_blocking result
The return type (Option<File>, u64, Option<FileOpResult>, Option<u64>) is hard to follow. A struct would make the code much more readable:
struct FileOpOutput {
file: Option<std::fs::File>,
size: u64,
result: Option<FileOpResult>,
inode: Option<u64>,
}This comment was generated by Claude Code.
Initialize SSE streams from the current EOF so the web log viewer only receives new data while still recovering cleanly from truncation and rotation. Add e2e coverage for SSE connect and rotation behavior; leave the pre-existing duplicated initial log rendering paths unchanged in this change.
Stabilize the SSE end-to-end tests by tolerating chunk delays, synchronizing rotation setup with the live stream, and discovering the actual web port chosen by the supervisor. Also skip the fresh-open inode check when reusing an existing file handle so steady-state polling avoids unnecessary metadata work.
Keep SSE web tests from leaking supervisors on panic and discover the actual port from supervisor startup logs when using dynamic binding. Reduce path-based inode polling during steady-state streams so the held-handle path does less redundant metadata work.
Join the SSE test stderr reader thread during supervisor cleanup so helper threads do not outlive the child process. Also skip the path-based inode stat immediately after a fresh open since the file metadata is already available from that open path.
Gate the rotation-specific SSE regression test to Unix where inode-based rotation detection is implemented, clarify the web-start readiness helper's return path, and declare Rust 1.87 so the SSE polling code's standard-library APIs are part of the repo's documented toolchain.
|
@jdx I think we are all done now? Please let me know if you disagree with anything I have addressed (after arguing with the ol AI). |
## 🤖 New release * `pitchfork-cli`: 2.0.0 -> 2.1.0 <details><summary><i><b>Changelog</b></i></summary><p> <blockquote> ## [2.1.0](v2.0.0...v2.1.0) - 2026-03-08 ### Added - add `settings.toml` ([#275](#275)) ### Fixed - correct json schema for DaemonId ([#277](#277)) - *(supervisor)* prevent file descriptor leaks in SSE streaming and IPC ([#267](#267)) - fixed scroll disabled when log <20 lines ([#268](#268)) ### Other - Support .config/pitchfork.toml ([#265](#265)) - *(README)* update broken link ([#270](#270)) </blockquote> </p></details> --- This PR was generated with [release-plz](https://github.com/release-plz/release-plz/). <!-- CURSOR_SUMMARY --> --- > [!NOTE] > **Low Risk** > Release bookkeeping only (version/changelog/lockfile) with no functional code changes in this PR. > > **Overview** > Bumps `pitchfork-cli` from `2.0.0` to `2.1.0` in `Cargo.toml` and `Cargo.lock`. > > Updates `CHANGELOG.md` with the new `2.1.0` release entry (noting the new `settings.toml`, several fixes, and minor documentation/config support updates). > > <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit c5f40ab. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup> <!-- /CURSOR_SUMMARY -->
Fix 'too many open files' error caused by FD leaks:
SSE log streaming (logs.rs): Reuse file handle between iterations instead of reopening every 500ms. Prevents FD exhaustion from rapid file open/close cycles in tight loops.
IPC connection tasks (server.rs): Add proper cleanup on send failures to prevent task accumulation from dead connections.
These changes should significantly reduce FD usage during long-running supervisor sessions with active web UI log streaming.