Commit 4966fe9
committed
[SPARK-53967][PYTHON] Avoid intermediate pandas dataframe creation in
### What changes were proposed in this pull request?
Avoid intermediate pandas dataframe creation in `df.toPandas`
before: batches -> table -> intermediate pdf -> result pdf (based on `pa.Table.to_pandas`)
after: batches -> table -> result pdf (based on `pa.ChunkedArray.to_pandas`)
### Why are the changes needed?
the intermediate pandas dataframe can be skipped
simple benchmark in my local
```
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
spark.conf.set("spark.sql.execution.arrow.pyspark.fallback.enabled", "false")
spark.conf.set("spark.sql.execution.arrow.pyspark.selfDestruct.enabled", "true")
import time
from pyspark.sql import functions as sf
df = spark.range(1000000).select(
(sf.col("id") % 2).alias("key"), sf.col("id").alias("v")
)
cols = {f"col_{i}": sf.lit(f"c{i}") for i in range(100)}
df = df.withColumns(cols)
df.cache()
df.count()
pdf = df.toPandas() # warm up
start_arrow = time.perf_counter()
for i in range(100):
pdf = df.toPandas()
time.perf_counter() - start_arrow
```
master: 304.49954012501985 secs
this PR: 285.2997682078276 secs
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
existing tests
### Was this patch authored or co-authored using generative AI tooling?
no
Closes #52680 from zhengruifeng/avoid_unnecessary_pdf_creation.
Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>df.toPandas
1 parent 9de0a27 commit 4966fe9
1 file changed
+43
-37
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
80 | 80 | | |
81 | 81 | | |
82 | 82 | | |
83 | | - | |
| 83 | + | |
| 84 | + | |
| 85 | + | |
84 | 86 | | |
85 | 87 | | |
86 | 88 | | |
| |||
112 | 114 | | |
113 | 115 | | |
114 | 116 | | |
115 | | - | |
116 | | - | |
117 | | - | |
118 | | - | |
119 | | - | |
120 | | - | |
121 | | - | |
122 | | - | |
123 | | - | |
124 | | - | |
125 | | - | |
126 | | - | |
127 | | - | |
128 | | - | |
129 | | - | |
130 | | - | |
131 | | - | |
132 | | - | |
133 | | - | |
134 | | - | |
135 | | - | |
136 | | - | |
137 | | - | |
138 | | - | |
139 | | - | |
140 | | - | |
141 | | - | |
142 | | - | |
143 | 117 | | |
144 | | - | |
145 | | - | |
| 118 | + | |
| 119 | + | |
| 120 | + | |
| 121 | + | |
146 | 122 | | |
147 | | - | |
| 123 | + | |
| 124 | + | |
| 125 | + | |
| 126 | + | |
| 127 | + | |
| 128 | + | |
| 129 | + | |
| 130 | + | |
| 131 | + | |
| 132 | + | |
| 133 | + | |
| 134 | + | |
| 135 | + | |
| 136 | + | |
| 137 | + | |
| 138 | + | |
| 139 | + | |
| 140 | + | |
| 141 | + | |
| 142 | + | |
| 143 | + | |
| 144 | + | |
| 145 | + | |
| 146 | + | |
| 147 | + | |
| 148 | + | |
148 | 149 | | |
149 | | - | |
| 150 | + | |
150 | 151 | | |
151 | 152 | | |
152 | 153 | | |
| |||
155 | 156 | | |
156 | 157 | | |
157 | 158 | | |
158 | | - | |
| 159 | + | |
159 | 160 | | |
160 | 161 | | |
161 | 162 | | |
162 | 163 | | |
163 | 164 | | |
164 | 165 | | |
165 | 166 | | |
166 | | - | |
167 | | - | |
| 167 | + | |
| 168 | + | |
168 | 169 | | |
169 | 170 | | |
170 | 171 | | |
171 | 172 | | |
172 | | - | |
| 173 | + | |
| 174 | + | |
| 175 | + | |
| 176 | + | |
| 177 | + | |
| 178 | + | |
173 | 179 | | |
174 | 180 | | |
175 | 181 | | |
| |||
0 commit comments