
Commit a275356

gaogaotiantian authored and rpnkv committed
[SPARK-55367][PYTHON] Use venv for run-pip-tests
### What changes were proposed in this pull request?

Use `venv` instead of `conda` or `virtualenv` for `run-pip-tests`, and remove the `conda` dependency from our CI.

### Why are the changes needed?

`run-pip-tests` requires a virtual environment, which we used to create with `conda` or `virtualenv`. However, [`venv`](https://docs.python.org/3/library/venv.html) has been the recommended way to create a virtual environment since Python 3.5. It is part of the standard library, so it adds no new dependency; it only requires Python itself to work. This also lets us remove the conda setup, which was interfering with our CI when it installed the same Python version as our Docker image.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Tested locally, where it worked; waiting on CI results for confirmation.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#54154 from gaogaotiantian/redo-pip-test.

Authored-by: Tian Gao <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
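The venv lifecycle the script now relies on can be sketched in a few lines. The `PYTHON_TO_TEST` fallback mirrors the patched script; the directory name and everything else below are illustrative, not from the patch:

```shell
# Honor PYTHON_TO_TEST when set, else fall back to python3, as the new script does.
PYTHON_EXECUTABLE="${PYTHON_TO_TEST:-python3}"

# venv ships in the standard library since Python 3.5: no conda, no virtualenv.
VENV_DIR="$(mktemp -d)/pip-test-venv"      # throwaway location for the demo
"$PYTHON_EXECUTABLE" -m venv "$VENV_DIR"

# Activation just prepends the venv's bin/ to PATH...
source "$VENV_DIR/bin/activate"
python -c 'import sys; print(sys.prefix)'  # prints a path inside $VENV_DIR
# ...and deactivate restores the previous PATH.
deactivate
rm -rf "$(dirname "$VENV_DIR")"
```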
1 parent aeae649 · commit a275356

File tree

13 files changed: +83 -140 lines


.github/workflows/build_and_test.yml

Lines changed: 0 additions & 8 deletions

```diff
@@ -639,20 +639,12 @@ jobs:
           $py -m pip list
           echo ""
         done
-    - name: Install Conda for pip packaging test
-      if: contains(matrix.modules, 'pyspark-errors')
-      uses: conda-incubator/setup-miniconda@v3
-      with:
-        miniforge-version: latest
-        activate-environment: ""
-        auto-activate: false
     # Run the tests.
     - name: Run tests
       env: ${{ fromJSON(inputs.envs) }}
       shell: 'script -q -e -c "bash {0}"'
       run: |
         if [[ "$MODULES_TO_TEST" == *"pyspark-errors"* ]]; then
-          export PATH=$CONDA/bin:$PATH
           export SKIP_PACKAGING=false
           echo "Python Packaging Tests Enabled!"
         fi
```
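With the conda step gone, the only gate left in the `run` step is plain bash pattern matching. A standalone sketch, with a made-up module list:

```shell
# Packaging tests run only when pyspark-errors is among the selected modules.
MODULES_TO_TEST="pyspark-core,pyspark-errors"   # hypothetical value for the demo
if [[ "$MODULES_TO_TEST" == *"pyspark-errors"* ]]; then
  export SKIP_PACKAGING=false
  echo "Python Packaging Tests Enabled!"
fi
echo "SKIP_PACKAGING=${SKIP_PACKAGING:-true}"   # prints SKIP_PACKAGING=false here
```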

dev/run-pip-tests

Lines changed: 50 additions & 78 deletions

```diff
@@ -43,32 +43,22 @@ function delete_virtualenv() {
 }
 trap delete_virtualenv EXIT
 
-PYTHON_EXECS=()
-# Some systems don't have pip or virtualenv - in those cases our tests won't work.
-if hash virtualenv 2>/dev/null && [ ! -n "$USE_CONDA" ]; then
-  echo "virtualenv installed - using. Note if this is a conda virtual env you may wish to set USE_CONDA"
-  # test only against python3
-  if hash python3 2>/dev/null; then
-    PYTHON_EXECS=('python3')
-  else
-    echo "Python3 not installed on system, skipping pip installability tests"
-    exit 0
-  fi
-elif hash conda 2>/dev/null; then
-  echo "Using conda virtual environments"
-  PYTHON_EXECS=('3.10')
-  USE_CONDA=1
+
+if [ -z "${PYTHON_TO_TEST}" ]; then
+  PYTHON_EXECUTABLE="python3"
 else
-  echo "Missing virtualenv & conda, skipping pip installability tests"
-  exit 0
+  PYTHON_EXECUTABLE="${PYTHON_TO_TEST}"
 fi
-if ! hash pip 2>/dev/null; then
-  echo "Missing pip, skipping pip installability tests."
+
+if ! hash "$PYTHON_EXECUTABLE" 2>/dev/null; then
+  echo "Python executable $PYTHON_EXECUTABLE not installed on system, skipping pip installability tests"
   exit 0
 fi
 
+echo "Using Python executable: $PYTHON_EXECUTABLE"
+
 # Determine which version of PySpark we are building for archive name
-PYSPARK_VERSION=$(python3 -c "exec(open('python/pyspark/version.py').read());print(__version__)")
+PYSPARK_VERSION=$($PYTHON_EXECUTABLE -c "exec(open('python/pyspark/version.py').read());print(__version__)")
 PYSPARK_DIST="$FWDIR/python/dist/pyspark-$PYSPARK_VERSION.tar.gz"
 # The pip install options we use for all the pip commands
 PIP_OPTIONS="--upgrade --no-cache-dir --force-reinstall --use-pep517"
@@ -80,64 +70,46 @@ PIP_COMMANDS=("pip install $PIP_OPTIONS $PYSPARK_DIST"
 # In this test, explicitly exclude user sitepackages to prevent side effects
 export PYTHONNOUSERSITE=1
 
-for python in "${PYTHON_EXECS[@]}"; do
-  for install_command in "${PIP_COMMANDS[@]}"; do
-    echo "Testing pip installation with python $python"
-    # Create a temp directory for us to work in and save its name to a file for cleanup
-    echo "Using $VIRTUALENV_BASE for virtualenv"
-    VIRTUALENV_PATH="$VIRTUALENV_BASE"/$python
-    rm -rf "$VIRTUALENV_PATH"
-    if [ -n "$USE_CONDA" ]; then
-      conda create -y -p "$VIRTUALENV_PATH" python=$python numpy pandas pip setuptools
-      source activate "$VIRTUALENV_PATH" || conda activate "$VIRTUALENV_PATH"
-    else
-      mkdir -p "$VIRTUALENV_PATH"
-      virtualenv --python=$python "$VIRTUALENV_PATH"
-      source "$VIRTUALENV_PATH"/bin/activate
-    fi
-    # Upgrade pip & friends if using virtual env
-    if [ ! -n "$USE_CONDA" ]; then
-      pip install --upgrade pip wheel numpy
-    fi
-
-    echo "Creating pip installable source dist"
-    cd "$FWDIR"/python
-    # Delete the egg info file if it exists, this can cache the setup file.
-    rm -rf pyspark.egg-info || echo "No existing egg info file, skipping deletion"
-    python3 packaging/classic/setup.py sdist
-
-
-    echo "Installing dist into virtual env"
-    cd dist
-    # Verify that the dist directory only contains one thing to install
-    sdists=(*.tar.gz)
-    if [ ${#sdists[@]} -ne 1 ]; then
-      echo "Unexpected number of targets found in dist directory - please cleanup existing sdists first."
-      exit -1
-    fi
-    # Do the actual installation
-    cd "$FWDIR"
-    $install_command
-
-    cd /
-
-    echo "Run basic sanity check on pip installed version with spark-submit"
-    spark-submit "$FWDIR"/dev/pip-sanity-check.py
-    echo "Run basic sanity check with import based"
-    python3 "$FWDIR"/dev/pip-sanity-check.py
-    echo "Run the tests for context.py"
-    python3 "$FWDIR"/python/pyspark/core/context.py
-
-    cd "$FWDIR"
-
-    # conda / virtualenv environments need to be deactivated differently
-    if [ -n "$USE_CONDA" ]; then
-      source deactivate || conda deactivate
-    else
-      deactivate
-    fi
-
-  done
+for install_command in "${PIP_COMMANDS[@]}"; do
+  # Create a temp directory for us to work in and save its name to a file for cleanup
+  echo "Using $VIRTUALENV_BASE for virtualenv"
+  VIRTUALENV_PATH="$VIRTUALENV_BASE"/$python
+  rm -rf "$VIRTUALENV_PATH"
+  $PYTHON_EXECUTABLE -m venv "$VIRTUALENV_PATH"
+  source "$VIRTUALENV_PATH"/bin/activate
+  pip install --upgrade pip wheel numpy setuptools
+
+  echo "Creating pip installable source dist"
+  cd "$FWDIR"/python
+  # Delete the egg info file if it exists, this can cache the setup file.
+  rm -rf pyspark.egg-info || echo "No existing egg info file, skipping deletion"
+  python3 packaging/classic/setup.py sdist
+
+  echo "Installing dist into virtual env"
+  cd dist
+  # Verify that the dist directory only contains one thing to install
+  sdists=(*.tar.gz)
+  if [ ${#sdists[@]} -ne 1 ]; then
+    echo "Unexpected number of targets found in dist directory - please cleanup existing sdists first."
+    exit -1
+  fi
+  # Do the actual installation
+  cd "$FWDIR"
+  $install_command
+
+  cd /
+
+  echo "Run basic sanity check on pip installed version with spark-submit"
+  spark-submit "$FWDIR"/dev/pip-sanity-check.py
+  echo "Run basic sanity check with import based"
+  python3 "$FWDIR"/dev/pip-sanity-check.py
+  echo "Run the tests for context.py"
+  python3 "$FWDIR"/python/pyspark/core/context.py
+
+  cd "$FWDIR"
+
+  deactivate
+
 done
 
 exit 0
```
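The `PYSPARK_VERSION` lookup in the script just executes `version.py` and prints `__version__`. A self-contained sketch using a throwaway version file (the version string is invented for the demo):

```shell
# Write a stand-in version.py, then extract __version__ the same way the script does.
tmp="$(mktemp -d)"
echo '__version__ = "4.2.0.dev0"' > "$tmp/version.py"
PYSPARK_VERSION=$(python3 -c "exec(open('$tmp/version.py').read());print(__version__)")
echo "pyspark-$PYSPARK_VERSION.tar.gz"   # prints pyspark-4.2.0.dev0.tar.gz
rm -rf "$tmp"
```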

dev/spark-test-image/lint/Dockerfile

Lines changed: 3 additions & 3 deletions

```diff
@@ -24,7 +24,7 @@ LABEL org.opencontainers.image.ref.name="Apache Spark Infra Image for Linter"
 # Overwrite this label to avoid exposing the underlying Ubuntu OS version label
 LABEL org.opencontainers.image.version=""
 
-ENV FULL_REFRESH_DATE=20260208
+ENV FULL_REFRESH_DATE=20260210
 
 ENV DEBIAN_FRONTEND=noninteractive
 ENV DEBCONF_NONINTERACTIVE_SEEN=true
@@ -52,6 +52,7 @@ RUN apt-get update && apt-get install -y \
     npm \
    pkg-config \
    python3.12 \
+    python3.12-venv \
    qpdf \
    tzdata \
    r-base \
@@ -72,10 +73,9 @@ ENV R_LIBS_SITE="/usr/local/lib/R/site-library:${R_LIBS_SITE}:/usr/lib/R/library
 
 # Setup virtual environment
 ENV VIRTUAL_ENV=/opt/spark-venv
-RUN python3.12 -m venv --without-pip $VIRTUAL_ENV
+RUN python3.12 -m venv $VIRTUAL_ENV
 ENV PATH="$VIRTUAL_ENV/bin:$PATH"
 
-RUN curl -sS https://bootstrap.pypa.io/get-pip.py | python3.12
 RUN python3.12 -m pip install \
     'black==23.12.1' \
     'flake8==3.9.0' \
```
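Why the Dockerfiles now install `python3.12-venv` and drop `get-pip.py`: `venv` bootstraps pip through the stdlib `ensurepip` module, and on Debian/Ubuntu `ensurepip` is split out into the separate `pythonX.Y-venv` package. A quick capability check (a sketch, not from the patch):

```shell
# If ensurepip imports, `pythonX.Y -m venv DIR` (without --without-pip) can
# seed pip into the venv on its own, so no get-pip.py download is needed.
if python3 -c 'import ensurepip' 2>/dev/null; then
  echo "venv can bootstrap pip"
else
  echo "install the matching pythonX.Y-venv package first"
fi
```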

dev/spark-test-image/python-310/Dockerfile

Lines changed: 3 additions & 5 deletions

```diff
@@ -24,7 +24,7 @@ LABEL org.opencontainers.image.ref.name="Apache Spark Infra Image For PySpark wi
 # Overwrite this label to avoid exposing the underlying Ubuntu OS version label
 LABEL org.opencontainers.image.version=""
 
-ENV FULL_REFRESH_DATE=20260206
+ENV FULL_REFRESH_DATE=20260210
 
 ENV DEBIAN_FRONTEND=noninteractive
 ENV DEBCONF_NONINTERACTIVE_SEEN=true
@@ -53,18 +53,16 @@ RUN apt-get update && apt-get install -y \
 RUN add-apt-repository ppa:deadsnakes/ppa
 RUN apt-get update && apt-get install -y \
     python3.10 \
+    python3.10-venv \
     && apt-get autoremove --purge -y \
     && apt-get clean \
     && rm -rf /var/lib/apt/lists/*
 
 # Setup virtual environment
 ENV VIRTUAL_ENV=/opt/spark-venv
-RUN python3.10 -m venv --without-pip $VIRTUAL_ENV
+RUN python3.10 -m venv $VIRTUAL_ENV
 ENV PATH="$VIRTUAL_ENV/bin:$PATH"
 
-# Install Python 3.10 packages
-RUN curl -sS https://bootstrap.pypa.io/get-pip.py | python3.10
-
 ARG BASIC_PIP_PKGS="numpy pyarrow>=22.0.0 six==1.16.0 pandas==2.3.3 scipy plotly<6.0.0 mlflow>=2.8.1 coverage matplotlib openpyxl memory-profiler>=0.61.0 scikit-learn>=1.3.2 pystack>=1.6.0 psutil"
 ARG CONNECT_PIP_PKGS="grpcio==1.76.0 grpcio-status==1.76.0 protobuf==6.33.5 googleapis-common-protos==1.71.0 zstandard==0.25.0 graphviz==0.20.3"
```

dev/spark-test-image/python-311/Dockerfile

Lines changed: 3 additions & 5 deletions

```diff
@@ -24,7 +24,7 @@ LABEL org.opencontainers.image.ref.name="Apache Spark Infra Image For PySpark wi
 # Overwrite this label to avoid exposing the underlying Ubuntu OS version label
 LABEL org.opencontainers.image.version=""
 
-ENV FULL_REFRESH_DATE=20260206
+ENV FULL_REFRESH_DATE=20260210
 
 ENV DEBIAN_FRONTEND=noninteractive
 ENV DEBCONF_NONINTERACTIVE_SEEN=true
@@ -50,18 +50,16 @@ RUN apt-get update && apt-get install -y \
 RUN add-apt-repository ppa:deadsnakes/ppa
 RUN apt-get update && apt-get install -y \
     python3.11 \
+    python3.11-venv \
     && apt-get autoremove --purge -y \
     && apt-get clean \
     && rm -rf /var/lib/apt/lists/*
 
 # Setup virtual environment
 ENV VIRTUAL_ENV=/opt/spark-venv
-RUN python3.11 -m venv --without-pip $VIRTUAL_ENV
+RUN python3.11 -m venv $VIRTUAL_ENV
 ENV PATH="$VIRTUAL_ENV/bin:$PATH"
 
-# Install Python 3.11 packages
-RUN curl -sS https://bootstrap.pypa.io/get-pip.py | python3.11
-
 ARG BASIC_PIP_PKGS="numpy pyarrow>=22.0.0 six==1.16.0 pandas==2.3.3 scipy plotly<6.0.0 mlflow>=2.8.1 coverage matplotlib openpyxl memory-profiler>=0.61.0 scikit-learn>=1.3.2 pystack>=1.6.0 psutil"
 ARG CONNECT_PIP_PKGS="grpcio==1.76.0 grpcio-status==1.76.0 protobuf==6.33.5 googleapis-common-protos==1.71.0 zstandard==0.25.0 graphviz==0.20.3"
```

dev/spark-test-image/python-312-classic-only/Dockerfile

Lines changed: 3 additions & 5 deletions

```diff
@@ -24,7 +24,7 @@ LABEL org.opencontainers.image.ref.name="Apache Spark Infra Image For PySpark Cl
 # Overwrite this label to avoid exposing the underlying Ubuntu OS version label
 LABEL org.opencontainers.image.version=""
 
-ENV FULL_REFRESH_DATE=20260207
+ENV FULL_REFRESH_DATE=20260210
 
 ENV DEBIAN_FRONTEND=noninteractive
 ENV DEBCONF_NONINTERACTIVE_SEEN=true
@@ -42,6 +42,7 @@ RUN apt-get update && apt-get install -y \
     libssl-dev \
     openjdk-17-jdk-headless \
     python3.12 \
+    python3.12-venv \
     pkg-config \
     tzdata \
     software-properties-common \
@@ -52,12 +53,9 @@ RUN apt-get update && apt-get install -y \
 
 # Setup virtual environment
 ENV VIRTUAL_ENV=/opt/spark-venv
-RUN python3.12 -m venv --without-pip $VIRTUAL_ENV
+RUN python3.12 -m venv $VIRTUAL_ENV
 ENV PATH="$VIRTUAL_ENV/bin:$PATH"
 
-# Install Python 3.12 packages
-RUN curl -sS https://bootstrap.pypa.io/get-pip.py | python3.12
-
 ARG BASIC_PIP_PKGS="numpy pyarrow>=22.0.0 pandas==2.3.3 plotly<6.0.0 matplotlib openpyxl memory-profiler>=0.61.0 mlflow>=2.8.1 scipy scikit-learn>=1.3.2 pystack>=1.6.0 psutil"
 ARG TEST_PIP_PKGS="coverage unittest-xml-reporting"
```

dev/spark-test-image/python-312-pandas-3/Dockerfile

Lines changed: 3 additions & 6 deletions

```diff
@@ -27,7 +27,7 @@ LABEL org.opencontainers.image.ref.name="Apache Spark Infra Image For PySpark wi
 # Overwrite this label to avoid exposing the underlying Ubuntu OS version label
 LABEL org.opencontainers.image.version=""
 
-ENV FULL_REFRESH_DATE=20260207
+ENV FULL_REFRESH_DATE=20260210
 
 ENV DEBIAN_FRONTEND=noninteractive
 ENV DEBCONF_NONINTERACTIVE_SEEN=true
@@ -45,6 +45,7 @@ RUN apt-get update && apt-get install -y \
     libssl-dev \
     openjdk-17-jdk-headless \
     python3.12 \
+    python3.12-venv \
     pkg-config \
     tzdata \
     software-properties-common \
@@ -55,13 +56,9 @@ RUN apt-get update && apt-get install -y \
 
 # Setup virtual environment
 ENV VIRTUAL_ENV=/opt/spark-venv
-RUN python3.12 -m venv --without-pip $VIRTUAL_ENV
+RUN python3.12 -m venv $VIRTUAL_ENV
 ENV PATH="$VIRTUAL_ENV/bin:$PATH"
 
-# Install Python 3.12 packages
-# Note that mlflow is execluded since it requires pandas<3
-RUN curl -sS https://bootstrap.pypa.io/get-pip.py | python3.12
-
 ARG BASIC_PIP_PKGS="numpy pyarrow>=22.0.0 six==1.16.0 pandas>=3 scipy plotly<6.0.0 coverage matplotlib openpyxl memory-profiler>=0.61.0 scikit-learn>=1.3.2"
 ARG CONNECT_PIP_PKGS="grpcio==1.76.0 grpcio-status==1.76.0 protobuf==6.33.5 googleapis-common-protos==1.71.0 zstandard==0.25.0 graphviz==0.20.3"
```

dev/spark-test-image/python-312/Dockerfile

Lines changed: 3 additions & 5 deletions

```diff
@@ -24,7 +24,7 @@ LABEL org.opencontainers.image.ref.name="Apache Spark Infra Image For PySpark wi
 # Overwrite this label to avoid exposing the underlying Ubuntu OS version label
 LABEL org.opencontainers.image.version=""
 
-ENV FULL_REFRESH_DATE=20260204
+ENV FULL_REFRESH_DATE=20260210
 
 ENV DEBIAN_FRONTEND=noninteractive
 ENV DEBCONF_NONINTERACTIVE_SEEN=true
@@ -42,6 +42,7 @@ RUN apt-get update && apt-get install -y \
     libssl-dev \
     openjdk-17-jdk-headless \
     python3.12 \
+    python3.12-venv \
     pkg-config \
     tzdata \
     software-properties-common \
@@ -52,12 +53,9 @@ RUN apt-get update && apt-get install -y \
 
 # Setup virtual environment
 ENV VIRTUAL_ENV=/opt/spark-venv
-RUN python3.12 -m venv --without-pip $VIRTUAL_ENV
+RUN python3.12 -m venv $VIRTUAL_ENV
 ENV PATH="$VIRTUAL_ENV/bin:$PATH"
 
-# Install Python 3.12 packages
-RUN curl -sS https://bootstrap.pypa.io/get-pip.py | python3.12
-
 ARG BASIC_PIP_PKGS="numpy pyarrow>=22.0.0 six==1.16.0 pandas==2.3.3 scipy plotly<6.0.0 mlflow>=2.8.1 coverage matplotlib openpyxl memory-profiler>=0.61.0 scikit-learn>=1.3.2 pystack>=1.6.0 psutil"
 ARG CONNECT_PIP_PKGS="grpcio==1.76.0 grpcio-status==1.76.0 protobuf==6.33.5 googleapis-common-protos==1.71.0 zstandard==0.25.0 graphviz==0.20.3"
```

dev/spark-test-image/python-313/Dockerfile

Lines changed: 3 additions & 5 deletions

```diff
@@ -24,7 +24,7 @@ LABEL org.opencontainers.image.ref.name="Apache Spark Infra Image For PySpark wi
 # Overwrite this label to avoid exposing the underlying Ubuntu OS version label
 LABEL org.opencontainers.image.version=""
 
-ENV FULL_REFRESH_DATE=20260206
+ENV FULL_REFRESH_DATE=20260210
 
 ENV DEBIAN_FRONTEND=noninteractive
 ENV DEBCONF_NONINTERACTIVE_SEEN=true
@@ -50,18 +50,16 @@ RUN apt-get update && apt-get install -y \
 RUN add-apt-repository ppa:deadsnakes/ppa
 RUN apt-get update && apt-get install -y \
     python3.13 \
+    python3.13-venv \
     && apt-get autoremove --purge -y \
     && apt-get clean \
     && rm -rf /var/lib/apt/lists/*
 
 # Setup virtual environment
 ENV VIRTUAL_ENV=/opt/spark-venv
-RUN python3.13 -m venv --without-pip $VIRTUAL_ENV
+RUN python3.13 -m venv $VIRTUAL_ENV
 ENV PATH="$VIRTUAL_ENV/bin:$PATH"
 
-# Install Python 3.13 packages
-RUN curl -sS https://bootstrap.pypa.io/get-pip.py | python3.13
-
 ARG BASIC_PIP_PKGS="numpy pyarrow>=22.0.0 six==1.16.0 pandas==2.3.3 scipy plotly<6.0.0 mlflow>=2.8.1 coverage matplotlib openpyxl memory-profiler>=0.61.0 scikit-learn>=1.3.2 pystack>=1.6.0 psutil"
 ARG CONNECT_PIP_PKGS="grpcio==1.76.0 grpcio-status==1.76.0 protobuf==6.33.5 googleapis-common-protos==1.71.0 zstandard==0.25.0 graphviz==0.20.3"
```
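One design note common to all the Dockerfiles above: the images never source an activate script. Setting `ENV PATH="$VIRTUAL_ENV/bin:$PATH"` is equivalent, since activation only puts the venv's `bin/` first on `PATH`. A sketch (the path is the one used in the Dockerfiles, the rest is illustrative):

```shell
# Prepending the venv's bin/ to PATH is all that "activation" really does.
VIRTUAL_ENV=/opt/spark-venv
PATH="$VIRTUAL_ENV/bin:$PATH"
echo "$PATH" | cut -d: -f1   # prints /opt/spark-venv/bin
```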
