Is your feature request related to a problem or challenge? Please describe what you are trying to do.
The current support for UNION vs UNION ALL is confusing to me, in the context of trying to map this to another query engine's support for these operators.
- The logical plan simply has the
Union operator and this is currently assumed to represent UNION ALL.
- The SQL planner has logic to wrap a union with an aggregate query and that seems like something we would want in the physical plan but not in the logical plan
- The DataFrame API does not have a function for performing a regular
UNION. It is possible to manually wrap a union in a distinct to achieve this but that might not be obvious to users. Also, the documentation for distinct is incorrect and says that it performs a union
DataFusion Logical Plan
❯ explain select * from foo union select * from foo;
+---------------+----------------------------------------------------------------------------------------------------------------+
| plan_type | plan |
+---------------+----------------------------------------------------------------------------------------------------------------+
| logical_plan | Projection: #a, #b |
| | Aggregate: groupBy=[[#a, #b]], aggr=[[]] |
| | Union |
| | Projection: #foo.a, #foo.b |
| | TableScan: foo projection=Some([0, 1]) |
| | Projection: #foo.a, #foo.b |
| | TableScan: foo projection=Some([0, 1]) |
+---------------+----------------------------------------------------------------------------------------------------------------+
Spark Logical Plan
'Distinct
+- 'Union false, false
:- 'Project ['a]
: +- 'UnresolvedRelation [t1], [], false
+- 'Project [unresolvedalias(1, None)]
+- 'UnresolvedRelation [t1], [], false
Spark Physical Plan
*(2) HashAggregate(keys=[a#4], functions=[], output=[a#4])
+- Exchange hashpartitioning(a#4, 200), ENSURE_REQUIREMENTS, [id=#26]
+- *(1) HashAggregate(keys=[a#4], functions=[], output=[a#4])
+- Union
:- LocalTableScan [a#4]
+- LocalTableScan [1#6]
Describe the solution you'd like
I think what we want is:
- Introduce a
Distinct operator in the logical plan
- Have the DataFrame
distinct function create the Distinct operator instead of the aggregate query
- The DataFrame API should have functions for both
union (existing method representing UNION ALL) and union_distinct
- The physical planner should translate
Distinct to an aggregate query
Describe alternatives you've considered
Leave things as they are and improve the documentation.
Additional context
For users of DataFusion for SQL query planning, it would be easier to map union/union all to other engines rather than trying to reverse engineer the aggregate query wrapping the union.
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
The current support for
UNIONvsUNION ALLis confusing to me, in the context of trying to map this to another query engine's support for these operators.Unionoperator and this is currently assumed to representUNION ALL.UNION. It is possible to manually wrap a union in a distinct to achieve this but that might not be obvious to users. Also, the documentation fordistinctis incorrect and says that it performs aunionDataFusion Logical Plan
Spark Logical Plan
Spark Physical Plan
Describe the solution you'd like
I think what we want is:
Distinctoperator in the logical plandistinctfunction create theDistinctoperator instead of the aggregate queryunion(existing method representingUNION ALL) andunion_distinctDistinctto an aggregate queryDescribe alternatives you've considered
Leave things as they are and improve the documentation.
Additional context
For users of DataFusion for SQL query planning, it would be easier to map union/union all to other engines rather than trying to reverse engineer the aggregate query wrapping the union.