
[SPARK-54868][PYTHON][INFRA] Fail hanging tests and log the tracebacks #53528

Closed

zhengruifeng wants to merge 10 commits into apache:master from zhengruifeng:py_test_timeout

Conversation

@zhengruifeng (Contributor) commented Dec 18, 2025

What changes were proposed in this pull request?

Fail hanging tests and log their tracebacks.
The timeout is set via the env var PYSPARK_TEST_TIMEOUT.
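The mechanism can be sketched as follows (a minimal, hypothetical stand-in for the run-tests.py logic, Unix only, not the actual implementation): the runner waits on the test subprocess with the configured timeout; on expiry it signals the child so the child can dump its tracebacks, then terminates it.

```python
import os
import signal
import subprocess
import sys

def run_with_timeout(cmd, timeout):
    """Run cmd; on timeout, signal the child (so it can dump its
    tracebacks) before terminating it. Returns the child's exit code."""
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.send_signal(signal.SIGUSR1)  # child may dump tracebacks here
        proc.terminate()
        return proc.wait()

# As in the PR, 0 (or an unset env var) means "no timeout".
timeout = int(os.environ.get("PYSPARK_TEST_TIMEOUT", "2")) or None
rc = run_with_timeout([sys.executable, "-c", "import time; time.sleep(60)"], timeout)
```

With a 2-second timeout the sleeping child is signalled and terminated, and `rc` is the negative signal number.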

Why are the changes needed?

When a test gets stuck, the logs currently give no useful information about where it is hanging.

Does this PR introduce any user-facing change?

No, dev-only.

How was this patch tested?

1. PR builder with

PYSPARK_TEST_TIMEOUT: 100

https://github.com/zhengruifeng/spark/actions/runs/20522703690/job/58962106131

2. Manual check:

(spark_dev_313) ➜  spark git:(py_test_timeout) PYSPARK_TEST_TIMEOUT=15 python/run-tests -k --python-executables python3 --testnames 'pyspark.ml.tests.connect.test_parity_clustering'
Running PySpark tests. Output is in /Users/ruifeng.zheng/spark/python/unit-tests.log
Will test against the following Python executables: ['python3']
Will test the following Python tests: ['pyspark.ml.tests.connect.test_parity_clustering']
python3 python_implementation is CPython
python3 version is: Python 3.13.5
Starting test(python3): pyspark.ml.tests.connect.test_parity_clustering (temp output: /Users/ruifeng.zheng/spark/python/target/c014880c-80d2-49db-8fb1-a26ab4e5246d/python3__pyspark.ml.tests.connect.test_parity_clustering__u8n7t6zc.log)
Got TimeoutExpired while running pyspark.ml.tests.connect.test_parity_clustering with python3
Traceback (most recent call last):
  File "/Users/ruifeng.zheng/spark/./python/run-tests.py", line 157, in run_individual_python_test
    retcode = proc.wait(timeout=timeout)
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/subprocess.py", line 1280, in wait
    return self._wait(timeout=timeout)
           ~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/subprocess.py", line 2058, in _wait
    raise TimeoutExpired(self.args, timeout)
subprocess.TimeoutExpired: Command '['/Users/ruifeng.zheng/spark/bin/pyspark', 'pyspark.ml.tests.connect.test_parity_clustering']' timed out after 15 seconds

Running tests...
----------------------------------------------------------------------
WARNING: Using incubator modules: jdk.incubator.vector
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
/Users/ruifeng.zheng/spark/python/pyspark/sql/connect/conf.py:64: UserWarning: Failed to set spark.connect.execute.reattachable.senderMaxStreamDuration to Some(1s) due to [CANNOT_MODIFY_STATIC_CONFIG] Cannot modify the value of the static Spark config: "spark.connect.execute.reattachable.senderMaxStreamDuration". SQLSTATE: 46110
  warnings.warn(warn)
/Users/ruifeng.zheng/spark/python/pyspark/sql/connect/conf.py:64: UserWarning: Failed to set spark.connect.execute.reattachable.senderMaxStreamSize to Some(123) due to [CANNOT_MODIFY_STATIC_CONFIG] Cannot modify the value of the static Spark config: "spark.connect.execute.reattachable.senderMaxStreamSize". SQLSTATE: 46110
  warnings.warn(warn)
/Users/ruifeng.zheng/spark/python/pyspark/sql/connect/conf.py:64: UserWarning: Failed to set spark.connect.authenticate.token to Some(deadbeef) due to [CANNOT_MODIFY_STATIC_CONFIG] Cannot modify the value of the static Spark config: "spark.connect.authenticate.token". SQLSTATE: 46110
  warnings.warn(warn)
  test_assert_remote_mode (pyspark.ml.tests.connect.test_parity_clustering.ClusteringParityTests.test_assert_remote_mode) ... ok (0.450s)
/Users/ruifeng.zheng/spark/python/pyspark/ml/clustering.py:1016: FutureWarning: Deprecated in 3.0.0. It will be removed in future versions. Use ClusteringEvaluator instead. You can also get the cost on the training dataset in the summary.
  warnings.warn(
ok (6.541s)
  test_distributed_lda (pyspark.ml.tests.connect.test_parity_clustering.ClusteringParityTests.test_distributed_lda) ... Thread 0x0000000173083000 (most recent call first):
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/site-packages/grpc/_channel.py", line 1727 in channel_spin
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/threading.py", line 994 in run
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/threading.py", line 1043 in _bootstrap_inner
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/threading.py", line 1014 in _bootstrap

Thread 0x000000017509b000 (most recent call first):
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/concurrent/futures/thread.py", line 90 in _worker
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/threading.py", line 994 in run
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/threading.py", line 1043 in _bootstrap_inner
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/threading.py", line 1014 in _bootstrap

Thread 0x000000017408f000 (most recent call first):
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/concurrent/futures/thread.py", line 90 in _worker
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/threading.py", line 994 in run
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/threading.py", line 1043 in _bootstrap_inner
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/threading.py", line 1014 in _bootstrap

Thread 0x00000001719e7000 (most recent call first):
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/selectors.py", line 398 in select
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/socketserver.py", line 235 in serve_forever
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/threading.py", line 994 in run
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/threading.py", line 1043 in _bootstrap_inner
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/threading.py", line 1014 in _bootstrap

Thread 0x00000001709db000 (most recent call first):
  File "/Users/ruifeng.zheng/spark/python/lib/py4j-0.10.9.9-src.zip/py4j/clientserver.py", line 58 in run
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/threading.py", line 1043 in _bootstrap_inner
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/threading.py", line 1014 in _bootstrap

Current thread 0x00000001f372e200 (most recent call first):
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/threading.py", line 363 in wait
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/site-packages/grpc/_common.py", line 114 in _wait_once
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/site-packages/grpc/_common.py", line 154 in wait
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/site-packages/grpc/_channel.py", line 953 in _next
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/site-packages/grpc/_channel.py", line 538 in __next__
  File "/Users/ruifeng.zheng/spark/python/pyspark/sql/connect/client/reattach.py", line 164 in <lambda>
  File "/Users/ruifeng.zheng/spark/python/pyspark/sql/connect/client/reattach.py", line 266 in _call_iter
  File "/Users/ruifeng.zheng/spark/python/pyspark/sql/connect/client/reattach.py", line 163 in _has_next
  File "/Users/ruifeng.zheng/spark/python/pyspark/sql/connect/client/reattach.py", line 139 in send
  File "<frozen _collections_abc>", line 360 in __next__
  File "/Users/ruifeng.zheng/spark/python/pyspark/sql/connect/client/core.py", line 1625 in _execute_and_fetch_as_iterator
  File "/Users/ruifeng.zheng/spark/python/pyspark/sql/connect/client/core.py", line 1664 in _execute_and_fetch
  File "/Users/ruifeng.zheng/spark/python/pyspark/sql/connect/client/core.py", line 1162 in execute_command
  File "/Users/ruifeng.zheng/spark/python/pyspark/ml/util.py", line 308 in remote_call
  File "/Users/ruifeng.zheng/spark/python/pyspark/ml/util.py", line 322 in wrapped
  File "/Users/ruifeng.zheng/spark/python/pyspark/ml/clustering.py", line 1548 in toLocal
  File "/Users/ruifeng.zheng/spark/python/pyspark/ml/tests/test_clustering.py", line 449 in test_distributed_lda
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/unittest/case.py", line 606 in _callTestMethod
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/unittest/case.py", line 651 in run
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/unittest/case.py", line 707 in __call__
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/unittest/suite.py", line 122 in run
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/unittest/suite.py", line 84 in __call__
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/unittest/suite.py", line 122 in run
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/unittest/suite.py", line 84 in __call__
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/site-packages/xmlrunner/runner.py", line 67 in run
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/unittest/main.py", line 270 in runTests
  File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_313/lib/python3.13/unittest/main.py", line 104 in __init__
  File "/Users/ruifeng.zheng/spark/python/pyspark/testing/__init__.py", line 30 in unittest_main
  File "/Users/ruifeng.zheng/spark/python/pyspark/ml/tests/connect/test_parity_clustering.py", line 37 in <module>
  File "<frozen runpy>", line 88 in _run_code
  File "<frozen runpy>", line 198 in _run_module_as_main

Had test failures in pyspark.ml.tests.connect.test_parity_clustering with python3; see logs.

Was this patch authored or co-authored using generative AI tooling?

no

if timeout == 0:
    timeout = None
else:
    env["PYSPARK_TEST_TIMEOUT"] = str(timeout)
Contributor

Is this used by anything else?

Contributor Author

No, it is a new env var, used only in this PR.

LOGGER.exception("Got TimeoutExpired while running %s with %s", test_name, pyspark_python)
proc.send_signal(signal.SIGUSR1)
time.sleep(1)
proc.terminate()
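This works because faulthandler can be registered for a user signal in the child process. A minimal sketch of the child side (Unix only; the exact wiring in the PR may differ):

```python
import faulthandler
import os
import signal

# Once registered, receiving SIGUSR1 makes this process print the traceback
# of every thread to stderr instead of dying with the signal's default action.
faulthandler.register(signal.SIGUSR1, all_threads=True)

# Demo: signal ourselves; the process survives and the dump goes to stderr.
os.kill(os.getpid(), signal.SIGUSR1)
print("still alive after SIGUSR1")
```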
Contributor

If you catch SIGTERM above, you don't need to send SIGUSR1 and sleep here.

@github-actions github-actions bot removed the CORE label Dec 26, 2025
@zhengruifeng zhengruifeng changed the title [WIP][PYTHON][INFRA] Fail hanging tests and log the tracebacks [SPARK-54868][PYTHON][INFRA] Fail hanging tests and log the tracebacks Dec 30, 2025
@zhengruifeng zhengruifeng marked this pull request as ready for review December 30, 2025 01:01
SKIP_PACKAGING: true
METASPACE_SIZE: 1g
BRANCH: ${{ inputs.branch }}
PYSPARK_TEST_TIMEOUT: 300
Contributor Author

A test file will be failed if its duration exceeds this env value.

env,
per_test_output
).run()
if timeout:
Contributor Author

I cannot find a good way to support the checkpoint() feature and this PR together. Do you have any ideas? @gaogaotiantian

If it is doable, we can merge the two code paths in follow-up PRs.

Contributor

It's doable. I can make the follow-up PR. We need to do it as an async task.

if timeout and proc:
    LOGGER.exception("Got TimeoutExpired while running %s with %s", test_name, pyspark_python)
    proc.terminate()
    proc.communicate(timeout=60)
Contributor Author

In one of my local runs, proc.communicate itself seemed to get stuck, so I also added a timeout here.
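A hardened cleanup path along these lines (hypothetical sketch): bound the reap with a timeout and escalate to SIGKILL if the child does not go away, for example when a grandchild holds the output pipes open.

```python
import signal
import subprocess
import sys

def reap(proc, timeout=60):
    """Terminate proc and reap it, escalating to SIGKILL if reaping hangs."""
    proc.terminate()
    try:
        proc.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()          # SIGKILL cannot be caught or ignored
        proc.communicate()   # returns once the child is truly gone

proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
reap(proc)
```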

@gaogaotiantian (Contributor)

I've recently added some tricks to dump the stack traces of all child processes in viztracer. It's Linux only and requires a third-party library, so I'm not sure if it's applicable here. However, 3.14 has some new mechanisms we can utilize; I think we can do some interesting stuff for debugging.

@zhengruifeng (Contributor Author)

Merged to master.

@zhengruifeng zhengruifeng deleted the py_test_timeout branch December 30, 2025 09:42
@gaogaotiantian (Contributor)

Is it intentional that this utility is only enabled for connect tests?

@zhengruifeng (Contributor Author)

Is it intentional that this utility is only enabled for connect tests?

It was introduced to debug the flaky pyspark.ml.tests.connect.test_parity_clustering, but yes, we should also enable it for classic tests (ReusedSQLTestCase).

@zhengruifeng (Contributor Author)

Probably we should introduce a superclass for all PySpark tests and set up faulthandler in it; I will take a look.
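A sketch of what that superclass could look like (hypothetical, assuming the class name PySparkTestCase; the actual follow-up may differ):

```python
import faulthandler
import signal
import unittest

class PySparkTestCase(unittest.TestCase):
    """Hypothetical shared base class: every PySpark suite inheriting from it
    gets the SIGUSR1 traceback dump, for both Connect and classic tests."""

    @classmethod
    def setUpClass(cls):
        super().setUpClass()
        # faulthandler.register is Unix-only, so guard for Windows.
        if hasattr(faulthandler, "register"):
            faulthandler.register(signal.SIGUSR1, all_threads=True)
```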

@zhengruifeng (Contributor Author)

Working on a follow-up, #53651, to set up faulthandler in classic tests.



3 participants