
[SPARK-54868][PYTHON][INFRA] Fail hanging tests and log the tracebacks #53528

Closed

zhengruifeng wants to merge 10 commits into apache:master from zhengruifeng:py_test_timeout

Conversation

@zhengruifeng (Contributor) commented Dec 18, 2025

What changes were proposed in this pull request?

Fail hanging tests and log their tracebacks.
The timeout is set via the env var PYSPARK_TEST_TIMEOUT.
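The mechanism can be sketched as follows (a minimal, hypothetical stand-in for the run-tests.py logic, Unix only, not the actual implementation): the runner waits on the test subprocess with the configured timeout; on expiry it signals the child so the child can dump its tracebacks, then terminates it.

```python
import os
import signal
import subprocess
import sys

def run_with_timeout(cmd, timeout):
    """Run cmd; on timeout, signal the child (so it can dump its
    tracebacks) before terminating it. Returns the child's exit code."""
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.send_signal(signal.SIGUSR1)  # child may dump tracebacks here
        proc.terminate()
        return proc.wait()

# As in the PR, 0 (or an unset env var) means "no timeout".
timeout = int(os.environ.get("PYSPARK_TEST_TIMEOUT", "2")) or None
rc = run_with_timeout([sys.executable, "-c", "import time; time.sleep(60)"], timeout)
```

With a 2-second timeout the sleeping child is signalled and terminated, and `rc` is the negative signal number.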

Why are the changes needed?

When a test gets stuck, the logs currently give no useful information about where it is hanging.

Does this PR introduce any user-facing change?

No, dev-only.

How was this patch tested?

1. PR builder with

PYSPARK_TEST_TIMEOUT: 100

https://github.com/zhengruifeng/spark/actions/runs/20522703690/job/58962106131

2. Manual check:

(spark_dev_313) ➜  spark git:(py_test_timeout) PYSPARK_TEST_TIMEOUT=15 python/run-tests -k --python-executables python3 --testnames 'pyspark.ml.tests.connect.test_parity_clustering'
Running PySpark tests. Output is in /Users/ruifeng.zheng/spark/python/unit-tests.log
Will test against the following Python executables: ['python3']
Will test the following Python tests: ['pyspark.ml.tests.connect.test_parity_clustering']
python3 python_implementation is CPython
python3 version is: Python 3.13.5
Starting test(python3): pyspark.ml.tests.connect.test_parity_clustering (temp output: /Users/ruifeng.zheng/spark/python/target/c014880c-80d2-49db-8fb1-a26ab4e5246d/python3__pyspark.ml.tests.connect.test_parity_clustering__u8n7t6zc.log)
Got TimeoutExpired while running pyspark.ml.tests.connect.test_parity_clustering with python3
Traceback (most recent call last):
  File "/Users/ruifeng.zheng/spark/./python/run-tests.py", line 157, in run_individual_python_test
    retcode = proc.wait(timeout=timeout)
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/subprocess.py", line 1280, in wait
    return self._wait(timeout=timeout)
           ~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/subprocess.py", line 2058, in _wait
    raise TimeoutExpired(self.args, timeout)
subprocess.TimeoutExpired: Command '['/Users/ruifeng.zheng/spark/bin/pyspark', 'pyspark.ml.tests.connect.test_parity_clustering']' timed out after 15 seconds

Running tests...
----------------------------------------------------------------------
WARNING: Using incubator modules: jdk.incubator.vector
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
/Users/ruifeng.zheng/spark/python/pyspark/sql/connect/conf.py:64: UserWarning: Failed to set spark.connect.execute.reattachable.senderMaxStreamDuration to Some(1s) due to [CANNOT_MODIFY_STATIC_CONFIG] Cannot modify the value of the static Spark config: "spark.connect.execute.reattachable.senderMaxStreamDuration". SQLSTATE: 46110
  warnings.warn(warn)
/Users/ruifeng.zheng/spark/python/pyspark/sql/connect/conf.py:64: UserWarning: Failed to set spark.connect.execute.reattachable.senderMaxStreamSize to Some(123) due to [CANNOT_MODIFY_STATIC_CONFIG] Cannot modify the value of the static Spark config: "spark.connect.execute.reattachable.senderMaxStreamSize". SQLSTATE: 46110
  warnings.warn(warn)
/Users/ruifeng.zheng/spark/python/pyspark/sql/connect/conf.py:64: UserWarning: Failed to set spark.connect.authenticate.token to Some(deadbeef) due to [CANNOT_MODIFY_STATIC_CONFIG] Cannot modify the value of the static Spark config: "spark.connect.authenticate.token". SQLSTATE: 46110
  warnings.warn(warn)
  test_assert_remote_mode (pyspark.ml.tests.connect.test_parity_clustering.ClusteringParityTests.test_assert_remote_mode) ... ok (0.450s)
/Users/ruifeng.zheng/spark/python/pyspark/ml/clustering.py:1016: FutureWarning: Deprecated in 3.0.0. It will be removed in future versions. Use ClusteringEvaluator instead. You can also get the cost on the training dataset in the summary.
  warnings.warn(
ok (6.541s)
  test_distributed_lda (pyspark.ml.tests.connect.test_parity_clustering.ClusteringParityTests.test_distributed_lda) ... Thread 0x0000000173083000 (most recent call first):
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/site-packages/grpc/_channel.py", line 1727 in channel_spin
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/threading.py", line 994 in run
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/threading.py", line 1043 in _bootstrap_inner
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/threading.py", line 1014 in _bootstrap

Thread 0x000000017509b000 (most recent call first):
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/concurrent/futures/thread.py", line 90 in _worker
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/threading.py", line 994 in run
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/threading.py", line 1043 in _bootstrap_inner
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/threading.py", line 1014 in _bootstrap

Thread 0x000000017408f000 (most recent call first):
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/concurrent/futures/thread.py", line 90 in _worker
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/threading.py", line 994 in run
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/threading.py", line 1043 in _bootstrap_inner
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/threading.py", line 1014 in _bootstrap

Thread 0x00000001719e7000 (most recent call first):
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/selectors.py", line 398 in select
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/socketserver.py", line 235 in serve_forever
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/threading.py", line 994 in run
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/threading.py", line 1043 in _bootstrap_inner
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/threading.py", line 1014 in _bootstrap

Thread 0x00000001709db000 (most recent call first):
  File "/Users/ruifeng.zheng/spark/python/lib/py4j-0.10.9.9-src.zip/py4j/clientserver.py", line 58 in run
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/threading.py", line 1043 in _bootstrap_inner
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/threading.py", line 1014 in _bootstrap

Current thread 0x00000001f372e200 (most recent call first):
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/threading.py", line 363 in wait
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/site-packages/grpc/_common.py", line 114 in _wait_once
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/site-packages/grpc/_common.py", line 154 in wait
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/site-packages/grpc/_channel.py", line 953 in _next
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/site-packages/grpc/_channel.py", line 538 in __next__
  File "/Users/ruifeng.zheng/spark/python/pyspark/sql/connect/client/reattach.py", line 164 in <lambda>
  File "/Users/ruifeng.zheng/spark/python/pyspark/sql/connect/client/reattach.py", line 266 in _call_iter
  File "/Users/ruifeng.zheng/spark/python/pyspark/sql/connect/client/reattach.py", line 163 in _has_next
  File "/Users/ruifeng.zheng/spark/python/pyspark/sql/connect/client/reattach.py", line 139 in send
  File "<frozen _collections_abc>", line 360 in __next__
  File "/Users/ruifeng.zheng/spark/python/pyspark/sql/connect/client/core.py", line 1625 in _execute_and_fetch_as_iterator
  File "/Users/ruifeng.zheng/spark/python/pyspark/sql/connect/client/core.py", line 1664 in _execute_and_fetch
  File "/Users/ruifeng.zheng/spark/python/pyspark/sql/connect/client/core.py", line 1162 in execute_command
  File "/Users/ruifeng.zheng/spark/python/pyspark/ml/util.py", line 308 in remote_call
  File "/Users/ruifeng.zheng/spark/python/pyspark/ml/util.py", line 322 in wrapped
  File "/Users/ruifeng.zheng/spark/python/pyspark/ml/clustering.py", line 1548 in toLocal
  File "/Users/ruifeng.zheng/spark/python/pyspark/ml/tests/test_clustering.py", line 449 in test_distributed_lda
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/unittest/case.py", line 606 in _callTestMethod
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/unittest/case.py", line 651 in run
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/unittest/case.py", line 707 in __call__
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/unittest/suite.py", line 122 in run
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/unittest/suite.py", line 84 in __call__
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/unittest/suite.py", line 122 in run
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/unittest/suite.py", line 84 in __call__
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/site-packages/xmlrunner/runner.py", line 67 in run
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/unittest/main.py", line 270 in runTests
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/unittest/main.py", line 104 in __init__
  File "/Users/ruifeng.zheng/spark/python/pyspark/testing/__init__.py", line 30 in unittest_main
  File "/Users/ruifeng.zheng/spark/python/pyspark/ml/tests/connect/test_parity_clustering.py", line 37 in <module>
  File "<frozen runpy>", line 88 in _run_code
  File "<frozen runpy>", line 198 in _run_module_as_main

Had test failures in pyspark.ml.tests.connect.test_parity_clustering with python3; see logs.

Was this patch authored or co-authored using generative AI tooling?

no

if timeout == 0:
    timeout = None
else:
    env["PYSPARK_TEST_TIMEOUT"] = str(timeout)
Contributor

Is this used by anything else?

Contributor Author

No, it is a new env var, used only in this PR.

LOGGER.exception("Got TimeoutExpired while running %s with %s", test_name, pyspark_python)
proc.send_signal(signal.SIGUSR1)
time.sleep(1)
proc.terminate()
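This works because faulthandler can be registered for a user signal in the child process. A minimal sketch of the child side (Unix only; the exact wiring in the PR may differ):

```python
import faulthandler
import os
import signal

# Once registered, receiving SIGUSR1 makes this process print the traceback
# of every thread to stderr instead of dying with the signal's default action.
faulthandler.register(signal.SIGUSR1, all_threads=True)

# Demo: signal ourselves; the process survives and the dump goes to stderr.
os.kill(os.getpid(), signal.SIGUSR1)
print("still alive after SIGUSR1")
```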
Contributor

If you catch SIGTERM above, you don't need to send SIGUSR1 and sleep here.

@github-actions github-actions bot removed the CORE label Dec 26, 2025
@zhengruifeng zhengruifeng changed the title [WIP][PYTHON][INFRA] Fail hanging tests and log the tracebacks [SPARK-54868][PYTHON][INFRA] Fail hanging tests and log the tracebacks Dec 30, 2025
@zhengruifeng zhengruifeng marked this pull request as ready for review December 30, 2025 01:01
SKIP_PACKAGING: true
METASPACE_SIZE: 1g
BRANCH: ${{ inputs.branch }}
PYSPARK_TEST_TIMEOUT: 300
Contributor Author

A test file will be failed if its duration exceeds this env value.

env,
per_test_output
).run()
if timeout:
Contributor Author

I cannot find a good way to support the checkpoint() feature and this PR together. Do you have any ideas? @gaogaotiantian

If it is doable, we can merge the two code paths in follow-up PRs.

Contributor

It's doable. I can make the follow-up PR. We need to do it as an async task.

if timeout and proc:
    LOGGER.exception("Got TimeoutExpired while running %s with %s", test_name, pyspark_python)
    proc.terminate()
    proc.communicate(timeout=60)
Contributor Author

In one of my local runs, proc.communicate itself seemed to get stuck, so I also added a timeout here.
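A hardened cleanup path along these lines (hypothetical sketch): bound the reap with a timeout and escalate to SIGKILL if the child does not go away, for example when a grandchild holds the output pipes open.

```python
import signal
import subprocess
import sys

def reap(proc, timeout=60):
    """Terminate proc and reap it, escalating to SIGKILL if reaping hangs."""
    proc.terminate()
    try:
        proc.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()          # SIGKILL cannot be caught or ignored
        proc.communicate()   # returns once the child is truly gone

proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
reap(proc)
```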

@gaogaotiantian (Contributor)

I've recently added some tricks to dump the stack traces of all child processes in viztracer. It's Linux only and requires a third-party library, so I'm not sure if it's applicable here. However, 3.14 has some new mechanisms we can utilize; I think we can do some interesting stuff for debugging.

@zhengruifeng (Contributor Author)

Merged to master.

@zhengruifeng zhengruifeng deleted the py_test_timeout branch December 30, 2025 09:42
@gaogaotiantian (Contributor)

Is it intentional that this utility is only enabled for connect tests?

@zhengruifeng (Contributor Author)

Is it intentional that this utility is only enabled for connect tests?

It was introduced to debug the flaky pyspark.ml.tests.connect.test_parity_clustering, but yes, we should also enable it for classic tests (ReusedSQLTestCase).

@zhengruifeng (Contributor Author)

Probably we should introduce a superclass for all PySpark tests and set up faulthandler in it; I will take a look.
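A sketch of what that superclass could look like (hypothetical, assuming the class name PySparkTestCase; the actual follow-up may differ):

```python
import faulthandler
import signal
import unittest

class PySparkTestCase(unittest.TestCase):
    """Hypothetical shared base class: every PySpark suite inheriting from it
    gets the SIGUSR1 traceback dump, for both Connect and classic tests."""

    @classmethod
    def setUpClass(cls):
        super().setUpClass()
        # faulthandler.register is Unix-only, so guard for Windows.
        if hasattr(faulthandler, "register"):
            faulthandler.register(signal.SIGUSR1, all_threads=True)
```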

@zhengruifeng (Contributor Author)

Working on a follow-up, #53651, to set up faulthandler in classic tests.



3 participants