[SPARK-54868][PYTHON][INFRA] Fail hanging tests and log the tracebacks #53528
zhengruifeng wants to merge 10 commits into apache:master
Conversation
python/run-tests.py
Outdated
```python
if timeout == 0:
    timeout = None
else:
    env["PYSPARK_TEST_TIMEOUT"] = str(timeout)
```
Is this used by anything else?
No, it is a new env variable, and it is only used in this PR.
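A minimal sketch (not the PR's exact code) of how a test worker process could consume the new `PYSPARK_TEST_TIMEOUT` env var: if it is set, arm `faulthandler` to dump the tracebacks of all threads shortly before the parent runner's timeout fires, so a hanging test leaves stacks in the log.

```python
import faulthandler
import os

# Hypothetical sketch: read the timeout set by the test runner. "0" (or
# an unset variable) means no timeout.
timeout = int(os.environ.get("PYSPARK_TEST_TIMEOUT", "0"))
if timeout > 0:
    # Schedule a traceback dump a bit before the parent kills us, so the
    # output is flushed; exit=False keeps the process alive afterwards.
    faulthandler.dump_traceback_later(max(timeout - 5, 1), exit=False)
```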
python/run-tests.py
Outdated
```python
LOGGER.exception("Got TimeoutExpired while running %s with %s", test_name, pyspark_python)
proc.send_signal(signal.SIGUSR1)
time.sleep(1)
proc.terminate()
```
If you catch SIGTERM above, you don't need to send SIGUSR1 and sleep here.
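A hypothetical sketch of this suggestion (POSIX only, not the merged code): if the child process registers a `faulthandler` for SIGTERM, a plain `proc.terminate()` from the parent already produces a traceback dump, with no separate SIGUSR1 round-trip. Note that `faulthandler` only dumps and returns, so the parent should still escalate to `proc.kill()` after a grace period.

```python
import faulthandler
import signal

# Hypothetical sketch: dump the tracebacks of all threads whenever this
# process receives SIGTERM. faulthandler.register is not available on
# Windows. The process keeps running after the dump.
faulthandler.register(signal.SIGTERM, all_threads=True)
```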
```yaml
SKIP_PACKAGING: true
METASPACE_SIZE: 1g
BRANCH: ${{ inputs.branch }}
PYSPARK_TEST_TIMEOUT: 300
```
A test file will be failed if its duration exceeds this timeout.
```python
env,
per_test_output
).run()
if timeout:
```
I cannot find a good way to support the checkpoint() feature and this PR together — do you have any ideas? @gaogaotiantian
If it is doable, we can merge the two code paths in follow-up PRs.
It's doable. I can make the follow-up PR. We need to do it as an async task.
```python
if timeout and proc:
    LOGGER.exception("Got TimeoutExpired while running %s with %s", test_name, pyspark_python)
    proc.terminate()
    proc.communicate(timeout=60)
```
In one of my local runs, `proc.communicate` itself got stuck, so I also added a timeout here.
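The shutdown pattern discussed here can be sketched as a small helper (a hypothetical illustration, not the PR's code): `communicate()` can block indefinitely if the child ignores SIGTERM, so bound it with a timeout and escalate to SIGKILL as a last resort.

```python
import subprocess


def stop_child(proc: subprocess.Popen, grace: float = 60.0) -> None:
    """Terminate a child process, escalating to kill if it hangs."""
    proc.terminate()
    try:
        # Bound the wait: communicate() alone can hang if the child
        # ignores SIGTERM or keeps its pipes open.
        proc.communicate(timeout=grace)
    except subprocess.TimeoutExpired:
        # Last resort: SIGKILL cannot be caught or ignored.
        proc.kill()
        proc.communicate()
```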
I've recently added some tricks to dump the stack traces of all child processes in viztracer. It's Linux-only and requires a third-party library, so I'm not sure if it's applicable here. However, 3.14 has some new mechanisms we can utilize — I think we can do some interesting stuff for debugging.
merged to master
Is it intentional that this utility is only enabled for connect tests?
it was introduced to debug flaky tests
Probably we should introduce a superclass for all PySpark tests and set up faulthandler in it; will take a look.
Working on a follow-up #53651 to set up faulthandler in the classic tests.
What changes were proposed in this pull request?

Fail hanging tests and log the tracebacks. The timeout is set by the env variable `PYSPARK_TEST_TIMEOUT`.

Why are the changes needed?

When a test gets stuck, there is no useful information.

Does this PR introduce any user-facing change?

No, dev-only.

How was this patch tested?

1. PR builder with https://github.com/zhengruifeng/spark/actions/runs/20522703690/job/58962106131
2. Manual check.

Was this patch authored or co-authored using generative AI tooling?

No.