Commit 1472ceb
[SPARK-51755][INFRA][PYTHON] Set up a scheduled builder for free-threaded Python 3.13
### What changes were proposed in this pull request?

Set up a scheduled (every 3 days) builder for free-threaded Python 3.13.

### Why are the changes needed?

https://docs.python.org/3/howto/free-threading-python.html

Since 3.13, Python supports free-threaded execution, in which the GIL is optional. Test against this latest feature.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

PR builder with

```
default: '{"PYSPARK_IMAGE_TO_TEST": "python-313-nogil", "PYTHON_TO_TEST": "python3.13t", "PYTHON_GIL": "0" }'
```

See https://github.com/zhengruifeng/spark/actions/runs/14355716572/job/40244827286

[PYTHON_GIL=0](https://docs.python.org/3/using/cmdline.html#envvar-PYTHON_GIL) forces the GIL off.

- For PySpark Classic, some streaming tests and ML tests are skipped;
- For PySpark Connect, it is not supported at all (one blocker is [gRPC](grpc/grpc#38762), which is not compatible), so **ALL** tests are skipped.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #50534 from zhengruifeng/infra_py313_nogil.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
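Not part of the patch itself — a minimal stdlib-only sketch of how to check locally whether an interpreter is a free-threaded build (such as `python3.13t`) and whether the GIL is currently enabled; `sys._is_gil_enabled` exists only on Python 3.13+, hence the `getattr` guard:

```python
import sys
import sysconfig

# Py_GIL_DISABLED is 1 on free-threaded ("t") builds such as python3.13t;
# it is 0 or None on regular builds and on pre-3.13 Pythons.
free_threaded = sysconfig.get_config_var("Py_GIL_DISABLED")

# sys._is_gil_enabled() is only available on Python 3.13+, so guard the call.
gil_enabled = getattr(sys, "_is_gil_enabled", lambda: True)()

print(f"free-threaded build: {bool(free_threaded)}, GIL enabled: {gil_enabled}")
```

Note that even on a free-threaded build the GIL can be re-enabled at runtime, which is why the builder additionally sets `PYTHON_GIL=0`.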
1 parent 99b7d2a commit 1472ceb

File tree

16 files changed: +227 −42

.github/workflows/build_infra_images_cache.yml

Lines changed: 14 additions & 0 deletions
```diff
@@ -38,6 +38,7 @@ on:
       - 'dev/spark-test-image/python-311/Dockerfile'
       - 'dev/spark-test-image/python-312/Dockerfile'
       - 'dev/spark-test-image/python-313/Dockerfile'
+      - 'dev/spark-test-image/python-313-nogil/Dockerfile'
       - 'dev/spark-test-image/numpy-213/Dockerfile'
       - '.github/workflows/build_infra_images_cache.yml'
   # Create infra image when cutting down branches/tags
@@ -216,6 +217,19 @@ jobs:
       - name: Image digest (PySpark with Python 3.13)
         if: hashFiles('dev/spark-test-image/python-313/Dockerfile') != ''
         run: echo ${{ steps.docker_build_pyspark_python_313.outputs.digest }}
+      - name: Build and push (PySpark with Python 3.13 no GIL)
+        if: hashFiles('dev/spark-test-image/python-313-nogil/Dockerfile') != ''
+        id: docker_build_pyspark_python_313_nogil
+        uses: docker/build-push-action@v6
+        with:
+          context: ./dev/spark-test-image/python-313-nogil/
+          push: true
+          tags: ghcr.io/apache/spark/apache-spark-github-action-image-pyspark-python-313-nogil-cache:${{ github.ref_name }}-static
+          cache-from: type=registry,ref=ghcr.io/apache/spark/apache-spark-github-action-image-pyspark-python-313-nogil-cache:${{ github.ref_name }}
+          cache-to: type=registry,ref=ghcr.io/apache/spark/apache-spark-github-action-image-pyspark-python-313-nogil-cache:${{ github.ref_name }},mode=max
+      - name: Image digest (PySpark with Python 3.13 no GIL)
+        if: hashFiles('dev/spark-test-image/python-313-nogil/Dockerfile') != ''
+        run: echo ${{ steps.docker_build_pyspark_python_313_nogil.outputs.digest }}
       - name: Build and push (PySpark with Numpy 2.1.3)
         if: hashFiles('dev/spark-test-image/numpy-213/Dockerfile') != ''
         id: docker_build_pyspark_numpy_213
```
Lines changed: 48 additions & 0 deletions (new file)

```yaml
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied.  See the License for the
# specific language governing permissions and limitations
# under the License.
#

name: "Build / Python-only (master, Python 3.13 no GIL)"

on:
  schedule:
    - cron: '0 19 */3 * *'
  workflow_dispatch:

jobs:
  run-build:
    permissions:
      packages: write
    name: Run
    uses: ./.github/workflows/build_and_test.yml
    if: github.repository == 'apache/spark'
    with:
      java: 17
      branch: master
      hadoop: hadoop3
      envs: >-
        {
          "PYSPARK_IMAGE_TO_TEST": "python-313-nogil",
          "PYTHON_TO_TEST": "python3.13t",
          "PYTHON_GIL": "0"
        }
      jobs: >-
        {
          "pyspark": "true",
          "pyspark-pandas": "true"
        }
```
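The `envs` and `jobs` values above are handed to the reusable `build_and_test.yml` workflow as JSON object strings. A quick sketch of parsing the same payload the way a workflow consumer would:

```python
import json

# The same JSON object string this scheduled builder passes via `envs`.
envs = '''
{
  "PYSPARK_IMAGE_TO_TEST": "python-313-nogil",
  "PYTHON_TO_TEST": "python3.13t",
  "PYTHON_GIL": "0"
}
'''

parsed = json.loads(envs)
print(parsed["PYTHON_TO_TEST"])  # python3.13t
```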
Lines changed: 80 additions & 0 deletions (new file)

```dockerfile
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# Image for building and testing Spark branches. Based on Ubuntu 22.04.
# See also in https://hub.docker.com/_/ubuntu
FROM ubuntu:jammy-20240911.1
LABEL org.opencontainers.image.authors="Apache Spark project <[email protected]>"
LABEL org.opencontainers.image.licenses="Apache-2.0"
LABEL org.opencontainers.image.ref.name="Apache Spark Infra Image For PySpark with Python 3.13 (no GIL)"
# Overwrite this label to avoid exposing the underlying Ubuntu OS version label
LABEL org.opencontainers.image.version=""

ENV FULL_REFRESH_DATE=20250407

ENV DEBIAN_FRONTEND=noninteractive
ENV DEBCONF_NONINTERACTIVE_SEEN=true

RUN apt-get update && apt-get install -y \
    build-essential \
    ca-certificates \
    curl \
    gfortran \
    git \
    gnupg \
    libcurl4-openssl-dev \
    libfontconfig1-dev \
    libfreetype6-dev \
    libfribidi-dev \
    libgit2-dev \
    libharfbuzz-dev \
    libjpeg-dev \
    liblapack-dev \
    libopenblas-dev \
    libpng-dev \
    libpython3-dev \
    libssl-dev \
    libtiff5-dev \
    libxml2-dev \
    openjdk-17-jdk-headless \
    pkg-config \
    qpdf \
    tzdata \
    software-properties-common \
    wget \
    zlib1g-dev

# Install Python 3.13 (no GIL)
RUN add-apt-repository ppa:deadsnakes/ppa
RUN apt-get update && apt-get install -y \
    python3.13-nogil \
    && apt-get autoremove --purge -y \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*


ARG BASIC_PIP_PKGS="numpy pyarrow>=19.0.0 six==1.16.0 pandas==2.2.3 scipy plotly<6.0.0 mlflow>=2.8.1 coverage matplotlib openpyxl memory-profiler>=0.61.0 scikit-learn>=1.3.2"
ARG CONNECT_PIP_PKGS="grpcio==1.67.0 grpcio-status==1.67.0 protobuf==5.29.1 googleapis-common-protos==1.65.0 graphviz==0.20.3"


# Install Python 3.13 packages
RUN curl -sS https://bootstrap.pypa.io/get-pip.py | python3.13t
# TODO: Add BASIC_PIP_PKGS and CONNECT_PIP_PKGS when it supports Python 3.13 free threaded
# TODO: Add lxml, grpcio, grpcio-status back when they support Python 3.13 free threaded
RUN python3.13t -m pip install --ignore-installed blinker>=1.6.2 # mlflow needs this
RUN python3.13t -m pip install numpy>=2.1 pyarrow>=19.0.0 six==1.16.0 pandas==2.2.3 scipy coverage matplotlib openpyxl jinja2 && \
    python3.13t -m pip cache purge
```
python/pyspark/errors/exceptions/connect.py

Lines changed: 9 additions & 7 deletions
```diff
@@ -14,7 +14,6 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 #
-import pyspark.sql.connect.proto as pb2
 import json
 from typing import Dict, List, Optional, TYPE_CHECKING
 
@@ -43,6 +42,7 @@
 )
 
 if TYPE_CHECKING:
+    import pyspark.sql.connect.proto as pb2
     from google.rpc.error_details_pb2 import ErrorInfo
 
 
@@ -55,7 +55,7 @@ class SparkConnectException(PySparkException):
 def convert_exception(
     info: "ErrorInfo",
     truncated_message: str,
-    resp: Optional[pb2.FetchErrorDetailsResponse],
+    resp: Optional["pb2.FetchErrorDetailsResponse"],
     display_server_stacktrace: bool = False,
 ) -> SparkConnectException:
     converted = _convert_exception(info, truncated_message, resp, display_server_stacktrace)
@@ -65,9 +65,11 @@ def convert_exception(
 def _convert_exception(
     info: "ErrorInfo",
     truncated_message: str,
-    resp: Optional[pb2.FetchErrorDetailsResponse],
+    resp: Optional["pb2.FetchErrorDetailsResponse"],
     display_server_stacktrace: bool = False,
 ) -> SparkConnectException:
+    import pyspark.sql.connect.proto as pb2
+
     raw_classes = info.metadata.get("classes")
     classes: List[str] = json.loads(raw_classes) if raw_classes else []
     sql_state = info.metadata.get("sqlState")
@@ -139,13 +141,13 @@ def _convert_exception(
     )
 
 
-def _extract_jvm_stacktrace(resp: pb2.FetchErrorDetailsResponse) -> str:
+def _extract_jvm_stacktrace(resp: "pb2.FetchErrorDetailsResponse") -> str:
     if len(resp.errors[resp.root_error_idx].stack_trace) == 0:
         return ""
 
     lines: List[str] = []
 
-    def format_stacktrace(error: pb2.FetchErrorDetailsResponse.Error) -> None:
+    def format_stacktrace(error: "pb2.FetchErrorDetailsResponse.Error") -> None:
         message = f"{error.error_type_hierarchy[0]}: {error.message}"
         if len(lines) == 0:
             lines.append(error.error_type_hierarchy[0])
@@ -404,7 +406,7 @@ class PickleException(SparkConnectGrpcException, BasePickleException):
 
 
 class SQLQueryContext(BaseQueryContext):
-    def __init__(self, q: pb2.FetchErrorDetailsResponse.QueryContext):
+    def __init__(self, q: "pb2.FetchErrorDetailsResponse.QueryContext"):
         self._q = q
 
     def contextType(self) -> QueryContextType:
@@ -441,7 +443,7 @@ def summary(self) -> str:
 
 
 class DataFrameQueryContext(BaseQueryContext):
-    def __init__(self, q: pb2.FetchErrorDetailsResponse.QueryContext):
+    def __init__(self, q: "pb2.FetchErrorDetailsResponse.QueryContext"):
         self._q = q
 
     def contextType(self) -> QueryContextType:
```
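The change above moves the `pb2` import behind `TYPE_CHECKING` and quotes the annotations, so the proto module (whose gRPC dependency is incompatible with free-threaded Python) is never imported just by loading this file. A generic sketch of the pattern, using stdlib `decimal` as a stand-in for the heavy module:

```python
from typing import TYPE_CHECKING, Optional

if TYPE_CHECKING:
    # Seen only by static type checkers; nothing is imported at runtime here.
    import decimal  # stand-in for a heavy/optional module like pb2


def render(value: Optional["decimal.Decimal"]) -> str:
    # Deferred runtime import: the module loads only when the function is
    # actually called, so merely importing this file never pulls it in.
    import decimal

    return str(value if value is not None else decimal.Decimal("0"))


print(render(None))  # 0
```

The quoted annotation (`"decimal.Decimal"`) keeps the signature type-checkable even though the name is not bound at runtime.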

python/pyspark/ml/connect/__init__.py

Lines changed: 0 additions & 4 deletions
```diff
@@ -16,10 +16,6 @@
 #
 
 """Spark Connect Python Client - ML module"""
-from pyspark.sql.connect.utils import check_dependencies
-
-check_dependencies(__name__)
-
 from pyspark.ml.connect.base import (
     Estimator,
     Transformer,
```

python/pyspark/ml/connect/base.py

Lines changed: 2 additions & 1 deletion
```diff
@@ -39,7 +39,6 @@
     HasFeaturesCol,
     HasPredictionCol,
 )
-from pyspark.ml.connect.util import transform_dataframe_column
 
 if TYPE_CHECKING:
     from pyspark.ml._typing import ParamMap
@@ -188,6 +187,8 @@ def transform(
         return self._transform(dataset)
 
     def _transform(self, dataset: Union[DataFrame, pd.DataFrame]) -> Union[DataFrame, pd.DataFrame]:
+        from pyspark.ml.connect.util import transform_dataframe_column
+
         input_cols = self._input_columns()
         transform_fn = self._get_transform_fn()
         output_cols = self._output_columns()
```

python/pyspark/ml/connect/evaluation.py

Lines changed: 2 additions & 1 deletion
```diff
@@ -24,7 +24,6 @@
 from pyspark.ml.param.shared import HasLabelCol, HasPredictionCol, HasProbabilityCol
 from pyspark.ml.connect.base import Evaluator
 from pyspark.ml.connect.io_utils import ParamsReadWrite
-from pyspark.ml.connect.util import aggregate_dataframe
 from pyspark.sql import DataFrame
 
 
@@ -56,6 +55,8 @@ def _get_metric_update_inputs(self, dataset: "pd.DataFrame") -> Tuple[Any, Any]:
         raise NotImplementedError()
 
     def _evaluate(self, dataset: Union["DataFrame", "pd.DataFrame"]) -> float:
+        from pyspark.ml.connect.util import aggregate_dataframe
+
         torch_metric = self._get_torch_metric()
 
         def local_agg_fn(pandas_df: "pd.DataFrame") -> "pd.DataFrame":
```

python/pyspark/ml/connect/feature.py

Lines changed: 4 additions & 1 deletion
```diff
@@ -35,7 +35,6 @@
 )
 from pyspark.ml.connect.base import Estimator, Model, Transformer
 from pyspark.ml.connect.io_utils import ParamsReadWrite, CoreModelReadWrite
-from pyspark.ml.connect.summarizer import summarize_dataframe
 
 
 class MaxAbsScaler(Estimator, HasInputCol, HasOutputCol, ParamsReadWrite):
@@ -81,6 +80,8 @@ def __init__(self, *, inputCol: Optional[str] = None, outputCol: Optional[str] =
         self._set(**kwargs)
 
     def _fit(self, dataset: Union["pd.DataFrame", "DataFrame"]) -> "MaxAbsScalerModel":
+        from pyspark.ml.connect.summarizer import summarize_dataframe
+
         input_col = self.getInputCol()
 
         stat_res = summarize_dataframe(dataset, input_col, ["min", "max", "count"])
@@ -197,6 +198,8 @@ def __init__(self, inputCol: Optional[str] = None, outputCol: Optional[str] = No
         self._set(**kwargs)
 
     def _fit(self, dataset: Union[DataFrame, pd.DataFrame]) -> "StandardScalerModel":
+        from pyspark.ml.connect.summarizer import summarize_dataframe
+
         input_col = self.getInputCol()
 
         stat_result = summarize_dataframe(dataset, input_col, ["mean", "std", "count"])
```

python/pyspark/ml/connect/functions.py

Lines changed: 10 additions & 1 deletion
```diff
@@ -19,20 +19,24 @@
 
 from pyspark.ml import functions as PyMLFunctions
 from pyspark.sql.column import Column
-from pyspark.sql.connect.functions.builtin import _invoke_function, _to_col, lit
+
 
 if TYPE_CHECKING:
     from pyspark.sql._typing import UserDefinedFunctionLike
 
 
 def vector_to_array(col: Column, dtype: str = "float64") -> Column:
+    from pyspark.sql.connect.functions.builtin import _invoke_function, _to_col, lit
+
     return _invoke_function("vector_to_array", _to_col(col), lit(dtype))
 
 
 vector_to_array.__doc__ = PyMLFunctions.vector_to_array.__doc__
 
 
 def array_to_vector(col: Column) -> Column:
+    from pyspark.sql.connect.functions.builtin import _invoke_function, _to_col
+
     return _invoke_function("array_to_vector", _to_col(col))
 
 
@@ -49,6 +53,11 @@ def predict_batch_udf(*args: Any, **kwargs: Any) -> "UserDefinedFunctionLike":
 def _test() -> None:
     import os
     import sys
+
+    if os.environ.get("PYTHON_GIL", "?") == "0":
+        print("Not supported in no-GIL mode", file=sys.stderr)
+        sys.exit(0)
+
     import doctest
     from pyspark.sql import SparkSession as PySparkSession
     import pyspark.ml.connect.functions
```
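The `_test()` guard above short-circuits the doctest suite when the builder exports `PYTHON_GIL=0`. A standalone sketch of the same env-var skip check (`maybe_skip_no_gil` is a hypothetical helper name, not a PySpark API):

```python
import os
import sys


def maybe_skip_no_gil() -> bool:
    # Returns True (after logging to stderr) when PYTHON_GIL=0,
    # i.e. the test entry point should exit without running anything.
    if os.environ.get("PYTHON_GIL", "?") == "0":
        print("Not supported in no-GIL mode", file=sys.stderr)
        return True
    return False


os.environ["PYTHON_GIL"] = "0"
print(maybe_skip_no_gil())  # True
```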

python/pyspark/ml/connect/proto.py

Lines changed: 4 additions & 0 deletions
```diff
@@ -14,6 +14,10 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 #
+from pyspark.sql.connect.utils import check_dependencies
+
+check_dependencies(__name__)
+
 from typing import Optional, TYPE_CHECKING, List
 
 import pyspark.sql.connect.proto as pb2
```
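The `check_dependencies` guard moves from the package `__init__` (removed above) into the one module that actually needs the Connect dependencies, so the rest of `pyspark.ml.connect` imports cleanly without them. A hypothetical minimal version of such a guard — the real `pyspark.sql.connect.utils.check_dependencies` checks Connect's own dependency list, not the one shown here:

```python
import importlib.util


def check_dependencies(module_name: str, required=("json", "typing")) -> None:
    # Fail fast at import time when any required module is absent.
    missing = [m for m in required if importlib.util.find_spec(m) is None]
    if missing:
        raise ImportError(f"{module_name} requires missing modules: {missing}")


check_dependencies("pyspark.ml.connect.proto")
print("dependencies OK")
```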
