Improve UX for `UNION` vs `UNION ALL` (introduce a LogicalPlan::Distinct)

**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
The current support for `UNION` vs `UNION ALL` is confusing to me, in the context of trying to map this to another query engine's support for these operators.

- The logical plan simply has the `Union` operator and this is currently assumed to represent `UNION ALL`. 
- The SQL planner has logic to wrap a union with an aggregate query and that seems like something we would want in the physical plan but not in the logical plan
- The DataFrame API does not have a function for performing a regular `UNION`. It is possible to manually wrap a union in a distinct to achieve this but that might not be obvious to users. Also, the documentation for `distinct` is incorrect and says that it performs a `union`

## DataFusion Logical Plan

```
❯ explain select * from foo union select * from foo;
+---------------+----------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                           |
+---------------+----------------------------------------------------------------------------------------------------------------+
| logical_plan  | Projection: #a, #b                                                                                             |
|               |   Aggregate: groupBy=[[#a, #b]], aggr=[[]]                                                                     |
|               |     Union                                                                                                      |
|               |       Projection: #foo.a, #foo.b                                                                               |
|               |         TableScan: foo projection=Some([0, 1])                                                                 |
|               |       Projection: #foo.a, #foo.b                                                                               |
|               |         TableScan: foo projection=Some([0, 1])                                                                 |
+---------------+----------------------------------------------------------------------------------------------------------------+
```

## Spark Logical Plan

```
'Distinct
+- 'Union false, false
   :- 'Project ['a]
   :  +- 'UnresolvedRelation [t1], [], false
   +- 'Project [unresolvedalias(1, None)]
      +- 'UnresolvedRelation [t1], [], false
```

## Spark Physical Plan

```
*(2) HashAggregate(keys=[a#4], functions=[], output=[a#4])
+- Exchange hashpartitioning(a#4, 200), ENSURE_REQUIREMENTS, [id=#26]
   +- *(1) HashAggregate(keys=[a#4], functions=[], output=[a#4])
      +- Union
         :- LocalTableScan [a#4]
         +- LocalTableScan [1#6]
```

**Describe the solution you'd like**
I think what we want is:
- Introduce a `Distinct` operator in the logical plan
- Have the DataFrame `distinct` function create the `Distinct` operator instead of the aggregate query
- The DataFrame API should have functions for both `union` (existing method representing `UNION ALL`) and `union_distinct`
- The physical planner should translate `Distinct` to an aggregate query

**Describe alternatives you've considered**
Leave things as they are and improve the documentation.

**Additional context**
For users of DataFusion for SQL query planning, it would be easier to map union/union all to other engines rather than trying to reverse engineer the aggregate query wrapping the union.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve UX for `UNION` vs `UNION ALL` (introduce a LogicalPlan::Distinct) #2573

DataFusion Logical Plan

Spark Logical Plan

Spark Physical Plan

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Improve UX for UNION vs UNION ALL (introduce a LogicalPlan::Distinct) #2573

Description

DataFusion Logical Plan

Spark Logical Plan

Spark Physical Plan

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

Improve UX for `UNION` vs `UNION ALL` (introduce a LogicalPlan::Distinct) #2573