Commit 964e578

perf: avoid redundant columnar shuffle when both parent and child are non-Comet (#4010)

1 parent 5076f63

94 files changed: 8924 additions & 8941 deletions

common/src/main/scala/org/apache/comet/CometConf.scala

Lines changed: 14 additions & 0 deletions
```diff
@@ -427,6 +427,20 @@ object CometConf extends ShimCometConf {
         "The maximum number of columns to hash for round robin partitioning must be non-negative.")
       .createWithDefault(0)
 
+  val COMET_EXEC_SHUFFLE_REVERT_REDUNDANT_COLUMNAR_ENABLED: ConfigEntry[Boolean] =
+    conf(s"$COMET_EXEC_CONFIG_PREFIX.shuffle.revertRedundantColumnar.enabled")
+      .category(CATEGORY_SHUFFLE)
+      .doc(
+        "When enabled, Comet reverts a `CometShuffleExchangeExec` with `CometColumnarShuffle` " +
+          "back to Spark's `ShuffleExchangeExec` when both its parent and child are non-Comet " +
+          "hash aggregate operators. This avoids a redundant " +
+          "row -> Arrow -> shuffle -> Arrow -> row conversion when no Comet operator on either " +
+          "side can consume columnar output. Disable to keep Comet columnar shuffle even in " +
+          "that case, which preserves Comet's off-heap shuffle memory accounting at the cost of " +
+          "the extra conversion.")
+      .booleanConf
+      .createWithDefault(true)
+
   val COMET_EXEC_SHUFFLE_COMPRESSION_CODEC: ConfigEntry[String] =
     conf(s"$COMET_EXEC_CONFIG_PREFIX.shuffle.compression.codec")
       .category(CATEGORY_SHUFFLE)
```
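Like any other Comet `ConfigEntry`, the new option is read from the Spark session conf, so it can be toggled without code changes. A minimal sketch of overriding the default, assuming a live `SparkSession` (here named `spark`) with the Comet plugin enabled:

```scala
// Sketch: disable the revert for the current session; the key matches the
// ConfigEntry added above. The default is true (revert enabled).
spark.conf.set("spark.comet.exec.shuffle.revertRedundantColumnar.enabled", "false")

// Equivalently at submit time:
//   spark-submit --conf spark.comet.exec.shuffle.revertRedundantColumnar.enabled=false ...
```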

docs/source/user-guide/latest/tuning.md

Lines changed: 18 additions & 0 deletions
```diff
@@ -154,6 +154,24 @@ partitioning keys. Columns that are not partitioning keys may contain complex ty
 Comet Columnar shuffle is JVM-based and supports `HashPartitioning`, `RoundRobinPartitioning`, `RangePartitioning`, and
 `SinglePartitioning`. This shuffle implementation supports complex data types as partitioning keys.
 
+#### Automatic Revert to Spark Shuffle
+
+When a Comet columnar shuffle ends up between two non-Comet operators (for example, a partial/final hash aggregate
+pair that Comet could not convert), Comet reverts it to Spark's built-in shuffle. Keeping columnar shuffle between
+two row-based operators would add `row -> Arrow -> shuffle -> Arrow -> row` conversions with no Comet consumer on
+either side to benefit from columnar output.
+
+This shifts the affected shuffles from Comet's off-heap memory pool back to the JVM execution memory pool. Clusters
+tuned for a small JVM heap may see `ExternalSorter` spills on queries where this revert fires. Shuffle I/O may also
+grow marginally because Spark's row-based serializer generally compresses less well than Comet's Arrow IPC format.
+
+Each revert is logged at `INFO` level on the driver as `Reverting Comet columnar shuffle to Spark shuffle between
+<parent> and <child>`, which lets you correlate any unexpected behavior with this optimization.
+
+This optimization is enabled by default and can be disabled by setting
+`spark.comet.exec.shuffle.revertRedundantColumnar.enabled=false`, in which case Comet will keep the columnar shuffle
+even when both its parent and child are non-Comet operators.
+
 ### Shuffle Compression
 
 By default, Spark compresses shuffle files using LZ4 compression. Comet overrides this behavior with ZSTD compression.
```
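Beyond the driver log message, the revert is visible directly in the physical plan. A hedged sketch of checking it, assuming a `SparkSession` named `spark` with Comet enabled and `df` standing in for any query whose partial/final aggregates stayed on the JVM:

```scala
// With the revert enabled (default), the shuffle between the two JVM
// aggregates shows up as Spark's plain Exchange in the plan:
spark.conf.set("spark.comet.exec.shuffle.revertRedundantColumnar.enabled", "true")
df.explain()

// With it disabled, the same position shows CometColumnarExchange, bracketed
// by the row <-> Arrow conversions the optimization exists to remove:
spark.conf.set("spark.comet.exec.shuffle.revertRedundantColumnar.enabled", "false")
df.explain()
```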

spark/src/main/scala/org/apache/comet/rules/CometExecRule.scala

Lines changed: 88 additions & 12 deletions
```diff
@@ -89,6 +89,13 @@ object CometExecRule {
 
   val allExecs: Map[Class[_ <: SparkPlan], CometOperatorSerde[_]] = nativeExecs ++ sinks
 
+  /**
+   * Tag set on a `ShuffleExchangeExec` that should be left as a plain Spark shuffle rather than
+   * wrapped in `CometShuffleExchangeExec`. See `revertRedundantColumnarShuffle`.
+   */
+  val SKIP_COMET_SHUFFLE_TAG: org.apache.spark.sql.catalyst.trees.TreeNodeTag[Unit] =
+    org.apache.spark.sql.catalyst.trees.TreeNodeTag[Unit]("comet.skipCometShuffle")
+
 }
 
 /**
@@ -100,19 +107,78 @@ case class CometExecRule(session: SparkSession)
 
   private lazy val showTransformations = CometConf.COMET_EXPLAIN_TRANSFORMATIONS.get()
 
+  /**
+   * Revert any `CometShuffleExchangeExec` with `CometColumnarShuffle` whose parent and child are
+   * both non-Comet `HashAggregateExec` / `ObjectHashAggregateExec` operators back to the original
+   * Spark `ShuffleExchangeExec`. This is the partial-final-aggregate pattern where Comet couldn't
+   * convert either aggregate; keeping a columnar shuffle between them only adds
+   * row->arrow->shuffle->arrow->row conversion overhead with no Comet consumer on either side.
+   * See https://github.com/apache/datafusion-comet/issues/4004.
+   *
+   * The match is intentionally narrow (both sides must be row-based aggregates that remained JVM
+   * after the main transform pass). Running the revert post-transform means we only fire when the
+   * main conversion already decided to keep both aggregates JVM - we never create the dangerous
+   * mixed mode where a Comet partial feeds a JVM final (see issue #1389).
+   *
+   * Correctness depends on running as part of `preColumnarTransitions`: if the revert ran after
+   * Spark inserted `ColumnarToRowExec` between the aggregate and the columnar shuffle, the
+   * pattern would no longer match (the shuffle would be separated from the aggregate by the
+   * transition) and the unnecessary conversion could not be eliminated.
+   *
+   * The reverted shuffle is tagged with `SKIP_COMET_SHUFFLE_TAG` so both the AQE
+   * `QueryStagePrepRule` pass and the `ColumnarRule` `preColumnarTransitions` pass leave it alone
+   * on re-entry - AQE in particular re-runs the rule on each stage in isolation, where the outer
+   * aggregate context is no longer visible and the shuffle would otherwise be re-wrapped as a
+   * Comet columnar shuffle.
+   */
+  private def revertRedundantColumnarShuffle(plan: SparkPlan): SparkPlan = {
+    def isAggregate(p: SparkPlan): Boolean =
+      p.isInstanceOf[HashAggregateExec] || p.isInstanceOf[ObjectHashAggregateExec]
+
+    def isRedundantShuffle(child: SparkPlan): Boolean = child match {
+      case s: CometShuffleExchangeExec =>
+        s.shuffleType == CometColumnarShuffle && isAggregate(s.child)
+      case _ => false
+    }
+
+    plan.transform {
+      case op if isAggregate(op) && op.children.exists(isRedundantShuffle) =>
+        val newChildren = op.children.map {
+          case s: CometShuffleExchangeExec
+              if s.shuffleType == CometColumnarShuffle && isAggregate(s.child) =>
+            val reverted =
+              s.originalPlan.withNewChildren(Seq(s.child)).asInstanceOf[ShuffleExchangeExec]
+            reverted.setTagValue(CometExecRule.SKIP_COMET_SHUFFLE_TAG, ())
+            logInfo(
+              "Reverting Comet columnar shuffle to Spark shuffle between " +
+                s"${op.getClass.getSimpleName} and ${s.child.getClass.getSimpleName} " +
+                "(no Comet operator on either side to consume columnar output)")
+            reverted
+          case other => other
+        }
+        op.withNewChildren(newChildren)
+    }
+  }
+
+  private def shouldSkipCometShuffle(s: ShuffleExchangeExec): Boolean =
+    s.getTagValue(CometExecRule.SKIP_COMET_SHUFFLE_TAG).isDefined
+
   private def applyCometShuffle(plan: SparkPlan): SparkPlan = {
-    plan.transformUp { case s: ShuffleExchangeExec =>
-      CometShuffleExchangeExec.shuffleSupported(s) match {
-        case Some(CometNativeShuffle) =>
-          // Switch to use Decimal128 regardless of precision, since Arrow native execution
-          // doesn't support Decimal32 and Decimal64 yet.
-          conf.setConfString(CometConf.COMET_USE_DECIMAL_128.key, "true")
-          CometShuffleExchangeExec(s, shuffleType = CometNativeShuffle)
-        case Some(CometColumnarShuffle) =>
-          CometShuffleExchangeExec(s, shuffleType = CometColumnarShuffle)
-        case None =>
-          s
-      }
+    plan.transformUp {
+      case s: ShuffleExchangeExec if shouldSkipCometShuffle(s) =>
+        s
+      case s: ShuffleExchangeExec =>
+        CometShuffleExchangeExec.shuffleSupported(s) match {
+          case Some(CometNativeShuffle) =>
+            // Switch to use Decimal128 regardless of precision, since Arrow native execution
+            // doesn't support Decimal32 and Decimal64 yet.
+            conf.setConfString(CometConf.COMET_USE_DECIMAL_128.key, "true")
+            CometShuffleExchangeExec(s, shuffleType = CometNativeShuffle)
+          case Some(CometColumnarShuffle) =>
+            CometShuffleExchangeExec(s, shuffleType = CometColumnarShuffle)
+          case None =>
+            s
+        }
     }
   }
 
@@ -261,6 +327,9 @@ case class CometExecRule(session: SparkSession)
       case s @ ShuffleQueryStageExec(_, ReusedExchangeExec(_, _: CometShuffleExchangeExec), _) =>
         convertToComet(s, CometExchangeSink).getOrElse(s)
 
+      case s: ShuffleExchangeExec if shouldSkipCometShuffle(s) =>
+        s
+
       case s: ShuffleExchangeExec =>
         convertToComet(s, CometShuffleExchangeExec).getOrElse(s)
 
@@ -464,6 +533,13 @@ case class CometExecRule(session: SparkSession)
       case CometScanWrapper(_, s) => s
     }
 
+    // Revert CometColumnarShuffle to Spark's ShuffleExchangeExec when both its parent and child
+    // are non-Comet HashAggregate/ObjectHashAggregate operators that remained JVM after the main
+    // transform pass. See https://github.com/apache/datafusion-comet/issues/4004.
+    if (CometConf.COMET_EXEC_SHUFFLE_REVERT_REDUNDANT_COLUMNAR_ENABLED.get()) {
+      newPlan = revertRedundantColumnarShuffle(newPlan)
+    }
+
     // Set up logical links
     newPlan = newPlan.transform {
       case op: CometExec =>
```
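The re-entrancy guard above rests on Spark's `TreeNodeTag` mechanism: a tag is an arbitrary value attached to a specific plan node, so a decision made in one rule invocation survives into later (or AQE re-entrant) invocations that see the node again. A minimal sketch of that mechanism, assuming only Spark's public `TreeNodeTag` / `setTagValue` / `getTagValue` API; the tag name mirrors the one in the diff:

```scala
import org.apache.spark.sql.catalyst.trees.TreeNodeTag

// A Unit-valued tag: its presence is the signal, the value carries no data.
val skipTag = TreeNodeTag[Unit]("comet.skipCometShuffle")

// In the revert pass, after rebuilding the plain ShuffleExchangeExec:
//   reverted.setTagValue(skipTag, ())
//
// In any later pass that would otherwise re-wrap the shuffle:
//   if (shuffle.getTagValue(skipTag).isDefined) {
//     // leave it as a plain Spark shuffle
//   }
```

Using a tag rather than, say, a wrapper node keeps the plan shape identical to what Spark produced, so no other rule or planner invariant has to learn about the revert.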
Lines changed: 58 additions & 59 deletions
```diff
@@ -1,62 +1,61 @@
 TakeOrderedAndProject
 +- HashAggregate
-   +- CometNativeColumnarToRow
-      +- CometColumnarExchange
-         +- HashAggregate
-            +- Project
-               +- BroadcastHashJoin
-                  :- Project
-                  :  +- BroadcastHashJoin
-                  :     :- Project
-                  :     :  +- Filter
-                  :     :     +- BroadcastHashJoin
-                  :     :        :- BroadcastHashJoin [COMET: Unsupported join type ExistenceJoin(exists#1)]
-                  :     :        :  :- CometNativeColumnarToRow
-                  :     :        :  :  +- CometBroadcastHashJoin
-                  :     :        :  :     :- CometFilter
-                  :     :        :  :     :  +- CometNativeScan parquet spark_catalog.default.customer
-                  :     :        :  :     +- CometBroadcastExchange
-                  :     :        :  :        +- CometProject
-                  :     :        :  :           +- CometBroadcastHashJoin
-                  :     :        :  :              :- CometNativeScan parquet spark_catalog.default.store_sales
-                  :     :        :  :              :  +- CometSubqueryBroadcast
-                  :     :        :  :              :     +- CometBroadcastExchange
-                  :     :        :  :              :        +- CometProject
-                  :     :        :  :              :           +- CometFilter
-                  :     :        :  :              :              +- CometNativeScan parquet spark_catalog.default.date_dim
-                  :     :        :  :              +- CometBroadcastExchange
-                  :     :        :  :                 +- CometProject
-                  :     :        :  :                    +- CometFilter
-                  :     :        :  :                       +- CometNativeScan parquet spark_catalog.default.date_dim
-                  :     :        :  +- BroadcastExchange
-                  :     :        :     +- CometNativeColumnarToRow
-                  :     :        :        +- CometProject
-                  :     :        :           +- CometBroadcastHashJoin
-                  :     :        :              :- CometNativeScan parquet spark_catalog.default.web_sales
-                  :     :        :              :  +- ReusedSubquery
-                  :     :        :              +- CometBroadcastExchange
-                  :     :        :                 +- CometProject
-                  :     :        :                    +- CometFilter
-                  :     :        :                       +- CometNativeScan parquet spark_catalog.default.date_dim
-                  :     :        +- BroadcastExchange
-                  :     :           +- CometNativeColumnarToRow
-                  :     :              +- CometProject
-                  :     :                 +- CometBroadcastHashJoin
-                  :     :                    :- CometNativeScan parquet spark_catalog.default.catalog_sales
-                  :     :                    :  +- ReusedSubquery
-                  :     :                    +- CometBroadcastExchange
-                  :     :                       +- CometProject
-                  :     :                          +- CometFilter
-                  :     :                             +- CometNativeScan parquet spark_catalog.default.date_dim
-                  :     +- BroadcastExchange
-                  :        +- CometNativeColumnarToRow
-                  :           +- CometProject
-                  :              +- CometFilter
-                  :                 +- CometNativeScan parquet spark_catalog.default.customer_address
-                  +- BroadcastExchange
-                     +- CometNativeColumnarToRow
-                        +- CometProject
-                           +- CometFilter
-                              +- CometNativeScan parquet spark_catalog.default.customer_demographics
+   +- Exchange
+      +- HashAggregate
+         +- Project
+            +- BroadcastHashJoin
+               :- Project
+               :  +- BroadcastHashJoin
+               :     :- Project
+               :     :  +- Filter
+               :     :     +- BroadcastHashJoin
+               :     :        :- BroadcastHashJoin [COMET: Unsupported join type ExistenceJoin(exists#1)]
+               :     :        :  :- CometNativeColumnarToRow
+               :     :        :  :  +- CometBroadcastHashJoin
+               :     :        :  :     :- CometFilter
+               :     :        :  :     :  +- CometNativeScan parquet spark_catalog.default.customer
+               :     :        :  :     +- CometBroadcastExchange
+               :     :        :  :        +- CometProject
+               :     :        :  :           +- CometBroadcastHashJoin
+               :     :        :  :              :- CometNativeScan parquet spark_catalog.default.store_sales
+               :     :        :  :              :  +- CometSubqueryBroadcast
+               :     :        :  :              :     +- CometBroadcastExchange
+               :     :        :  :              :        +- CometProject
+               :     :        :  :              :           +- CometFilter
+               :     :        :  :              :              +- CometNativeScan parquet spark_catalog.default.date_dim
+               :     :        :  :              +- CometBroadcastExchange
+               :     :        :  :                 +- CometProject
+               :     :        :  :                    +- CometFilter
+               :     :        :  :                       +- CometNativeScan parquet spark_catalog.default.date_dim
+               :     :        :  +- BroadcastExchange
+               :     :        :     +- CometNativeColumnarToRow
+               :     :        :        +- CometProject
+               :     :        :           +- CometBroadcastHashJoin
+               :     :        :              :- CometNativeScan parquet spark_catalog.default.web_sales
+               :     :        :              :  +- ReusedSubquery
+               :     :        :              +- CometBroadcastExchange
+               :     :        :                 +- CometProject
+               :     :        :                    +- CometFilter
+               :     :        :                       +- CometNativeScan parquet spark_catalog.default.date_dim
+               :     :        +- BroadcastExchange
+               :     :           +- CometNativeColumnarToRow
+               :     :              +- CometProject
+               :     :                 +- CometBroadcastHashJoin
+               :     :                    :- CometNativeScan parquet spark_catalog.default.catalog_sales
+               :     :                    :  +- ReusedSubquery
+               :     :                    +- CometBroadcastExchange
+               :     :                       +- CometProject
+               :     :                          +- CometFilter
+               :     :                             +- CometNativeScan parquet spark_catalog.default.date_dim
+               :     +- BroadcastExchange
+               :        +- CometNativeColumnarToRow
+               :           +- CometProject
+               :              +- CometFilter
+               :                 +- CometNativeScan parquet spark_catalog.default.customer_address
+               +- BroadcastExchange
+                  +- CometNativeColumnarToRow
+                     +- CometProject
+                        +- CometFilter
+                           +- CometNativeScan parquet spark_catalog.default.customer_demographics
 
-Comet accelerated 36 out of 54 eligible operators (66%). Final plan contains 6 transitions between Spark and Comet.
+Comet accelerated 35 out of 54 eligible operators (64%). Final plan contains 5 transitions between Spark and Comet.
```
