Skip to content

[nextest-runner] correctly dup the fd with combined output#2316

Merged
sunshowers merged 1 commit into
mainfrom
dup-fd
Apr 29, 2025
Merged

[nextest-runner] correctly dup the fd with combined output#2316
sunshowers merged 1 commit into
mainfrom
dup-fd

Conversation

@sunshowers

@sunshowers sunshowers commented Apr 29, 2025

Copy link
Copy Markdown
Member

The bug

Previously, for combined output, we were using the same file descriptor and storing it three times:

  1. In .stdout
  2. In .stderr
  3. In state.theirs.

Stdio consumes ownership of the fd, so this effectively meant that three OwnedFd instances were present, and fds would be dropped three times.

Now in a single-threaded program this is relatively harmless -- Rust just gets EBADF for the latter two calls. But in multithreaded programs this was completely broken, with a bit of a horrific race condition on Linux. Since Tokio uses pidfds for process event handling on Linux, this caused missed wakeups in this scenario:

  1. Let's say we allocated fd 20 as the theirs end provided to the child process Stdio.
  2. In thread 1, we dropped the child, closing fd 20 the first time.
  3. In thread 2, Tokio created a new pidfd and the kernel allocated fd 20 to this pidfd.
  4. In thread 1, we continued cleaning up the child, closing fd 20 two more times.
  5. The second fd 20 close caused epoll to no longer dispatch events to the pidfd.
  6. After the test process exited, nextest never got a notification about that happening.

The fix

To fix the bug:

  1. Don't store theirs in State -- we don't need it, and in fact had an expect(dead_code) on it.
  2. Call try_clone to ensure that stdout and stderr get two separate fd numbers that refer to the same pipe.

Notes

  • This wasn't originally visible because we used a single task for all threads (so there was no multithreaded race), but 5f010c7 caused each test to be in its own task. This change exposed the bug.
  • This is particularly visible with pidfd because it uses the same fd namespace as regular descriptors. With more traditional SIGCHLD-based handling, the bug wouldn't be as visible.

Fixes #2295.

@codecov

codecov Bot commented Apr 29, 2025

Copy link
Copy Markdown

Codecov Report

Attention: Patch coverage is 0% with 7 lines in your changes missing coverage. Please review.

Project coverage is 78.47%. Comparing base (a1333b1) to head (4ef0253).
Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
nextest-runner/src/test_command/unix.rs 0.00% 7 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2316      +/-   ##
==========================================
- Coverage   78.49%   78.47%   -0.02%     
==========================================
  Files         105      105              
  Lines       24480    24484       +4     
==========================================
- Hits        19216    19215       -1     
- Misses       5264     5269       +5     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Previously, for combined output, we were using the same file descriptor and
storing it three times:

1. In `.stdout`
2. In `.stderr`
3. In `state.theirs`.

`Stdio` consumes ownership of the fd, so this effectively meant that three `OwnedFd`
instances were present, and fds would be dropped three times.

Now in a single-threaded program this is relatively harmless -- Rust just gets
`EBADF` for the latter two calls. But in multithreaded programs this was
completely broken, with a fun failure mode on Linux. Since Tokio uses pidfds
for process event handling on Linux, this caused missed wakeups in this
scenario:

1. Let's say we allocated fd 20 as the `theirs` end provided to the child process `Stdio`.
2. In thread 1, we dropped the child, closing fd 20 the first time.
3. In thread 2, Tokio created a new pidfd and the kernel allocated fd 20 to this pidfd.
4. In thread 1, we continued cleaning up the child, closing fd 20 two more times.
5. The second fd 20 close caused epoll to no longer dispatch events to it.
6. Nextest then assumed the process was hung.

To fix this:

1. Don't store `theirs` in `State` -- we don't need to.
2. Call `try_clone` to ensure that stdout and stderr get two separate fd numbers that
   refer to the same pipe.
@sunshowers sunshowers merged commit f8dc435 into main Apr 29, 2025
@sunshowers sunshowers deleted the dup-fd branch April 29, 2025 04:45
sunshowers added a commit that referenced this pull request Apr 29, 2025
Some improvements with made #2316 easier to debug.

I'd like to also write more detailed logs into a log file at some point.
@trunk-io

trunk-io Bot commented Apr 29, 2025

Copy link
Copy Markdown

Static BadgeStatic BadgeStatic BadgeStatic Badge

View Full Report ↗︎Docs

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: nextest hangs when invoked programmatically with stderr: Some(MakePipe/Null) on linux

1 participant