Skip to content

Simplify concat_ws(null, ..) to null#3608

Merged
alamb merged 4 commits intoapache:masterfrom
HaoYang670:3607_simpl_null_separator
Sep 26, 2022
Merged

Simplify concat_ws(null, ..) to null#3608
alamb merged 4 commits intoapache:masterfrom
HaoYang670:3607_simpl_null_separator

Conversation

@HaoYang670
Copy link
Copy Markdown
Contributor

@HaoYang670 HaoYang670 commented Sep 25, 2022

Signed-off-by: remzi 13716567376yh@gmail.com

Which issue does this PR close?

Closes #3607.

Rationale for this change

Before the optimization

❯ create table t as select 'aa' as a, 'bb' as b, 'cc' as c;
0 rows in set. Query took 0.008 seconds.
❯ explain select concat_ws(null, a, b, c) from t;
+---------------+------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                   |
+---------------+------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Projection: concatwithseparator(NULL, #t.a, #t.b, #t.c)                                                                |
|               |   TableScan: t projection=[a, b, c]                                                                                    |
| physical_plan | ProjectionExec: expr=[concatwithseparator(CAST(NULL AS Utf8), a@0, b@1, c@2) as concatwithseparator(NULL,t.a,t.b,t.c)] |
|               |   MemoryExec: partitions=1, partition_sizes=[1]                                                                        |
|               |                                                                                                                        |
+---------------+------------------------------------------------------------------------------------------------------------------------+

After the optimization

❯ create table t as select 'aa' as a, 'bb' as b, 'cc' as c;
0 rows in set. Query took 0.009 seconds.
❯ explain select concat_ws(null, a, b, c) from t;
+---------------+----------------------------------------------------------------------+
| plan_type     | plan                                                                 |
+---------------+----------------------------------------------------------------------+
| logical_plan  | Projection: Utf8(NULL) AS concatwithseparator(NULL,t.a,t.b,t.c)      |
|               |   TableScan: t projection=[a]                                        |
| physical_plan | ProjectionExec: expr=[NULL as concatwithseparator(NULL,t.a,t.b,t.c)] |
|               |   MemoryExec: partitions=1, partition_sizes=[1]                      |
|               |                                                                      |
+---------------+----------------------------------------------------------------------+

What changes are included in this PR?

Are there any user-facing changes?

Signed-off-by: remzi <13716567376yh@gmail.com>
@github-actions github-actions bot added the optimizer Optimizer rules label Sep 25, 2022
Signed-off-by: remzi <13716567376yh@gmail.com>
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Sep 25, 2022

Codecov Report

Merging #3608 (d14b01f) into master (abcb39d) will increase coverage by 0.01%.
The diff coverage is 94.88%.

@@            Coverage Diff             @@
##           master    #3608      +/-   ##
==========================================
+ Coverage   86.06%   86.07%   +0.01%     
==========================================
  Files         300      300              
  Lines       56328    56337       +9     
==========================================
+ Hits        48479    48493      +14     
+ Misses       7849     7844       -5     
Impacted Files Coverage Δ
datafusion/core/src/physical_plan/planner.rs 78.03% <ø> (+0.56%) ⬆️
datafusion/core/tests/sql/decimal.rs 100.00% <ø> (ø)
datafusion/proto/src/lib.rs 94.35% <ø> (ø)
datafusion/core/src/execution/context.rs 78.71% <33.33%> (-0.63%) ⬇️
datafusion/optimizer/src/simplify_expressions.rs 82.67% <88.88%> (-0.13%) ⬇️
datafusion/optimizer/src/type_coercion.rs 96.01% <92.30%> (+3.61%) ⬆️
datafusion/physical-expr/src/expressions/binary.rs 97.42% <97.14%> (-0.21%) ⬇️
datafusion/core/src/physical_plan/sorts/sort.rs 94.10% <97.43%> (+0.24%) ⬆️
benchmarks/src/bin/tpch.rs 37.95% <100.00%> (-0.12%) ⬇️
datafusion/core/src/dataframe.rs 89.55% <100.00%> (-0.03%) ⬇️
... and 33 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

Copy link
Copy Markdown
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @HaoYang670 -- this change makes sense to me and I think it could be merged as is. However, I think it is worth considering handling nulls in positions other than the first as well

out_expr.rewrite(self)?
}

// concat_ws
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// concat_ws
// concat_ws
// If any of the arguments is NULL, the output will be null

Comment thread datafusion/optimizer/src/simplify_expressions.rs
@HaoYang670
Copy link
Copy Markdown
Contributor Author

HaoYang670 commented Sep 25, 2022

However, I think it is worth considering handling nulls in positions other than the first as well

@alamb, if the separator is null, the concat_ws expression will always have the value null,

If null appears in other positions, then the null value will be ignored:

concat_ws("|", col1, null, col2) --> concat_ws("|", col1, col2)

These is the behavior in Datafusion and Postgresql: https://pgpedia.info/c/concat_ws.html#:~:text=concat_ws%20%28%29%20is%20a%20system%20function%20for%20concatenating,%20%20val2%20any%20%5B%2C...%5D%20%5D%29%20%E2%86%92%20text

So we should only check null at the first position

@alamb
Copy link
Copy Markdown
Contributor

alamb commented Sep 25, 2022

So we should only check null at the first position

Perhaps we can add a test for that behavior explicitly (as in a test that a null at the second position does not result in a rewrite) to make it clear that is the expected behavior

@HaoYang670
Copy link
Copy Markdown
Contributor Author

Will add.

@alamb
Copy link
Copy Markdown
Contributor

alamb commented Sep 25, 2022

In case anyone else is interested in how concat_ws works, I tested it out with datafusion-cli

select concat_ws('--', 'foo', 'bar', 'baz');
+---------------------------------------------------------------------+
| concatwithseparator(Utf8("--"),Utf8("foo"),Utf8("bar"),Utf8("baz")) |
+---------------------------------------------------------------------+
| foo--bar--baz                                                       |
+---------------------------------------------------------------------+
1 row in set. Query took 0.000 seconds.
❯ select concat_ws('--', null, 'bar', 'baz');
+--------------------------------------------------------------+
| concatwithseparator(Utf8("--"),NULL,Utf8("bar"),Utf8("baz")) |
+--------------------------------------------------------------+
| bar--baz                                                     |
+--------------------------------------------------------------+
1 row in set. Query took 0.003 seconds.
❯ select concat_ws(null, 'foo', 'bar', 'baz');
+---------------------------------------------------------------+
| concatwithseparator(NULL,Utf8("foo"),Utf8("bar"),Utf8("baz")) |
+---------------------------------------------------------------+
|                                                               |
+---------------------------------------------------------------+
1 row in set. Query took 0.000 seconds.

Signed-off-by: remzi <13716567376yh@gmail.com>
Signed-off-by: remzi <13716567376yh@gmail.com>
Copy link
Copy Markdown
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM -- thanks @HaoYang670

@alamb alamb merged commit c8a9ea0 into apache:master Sep 26, 2022
@ursabot
Copy link
Copy Markdown

ursabot commented Sep 26, 2022

Benchmark runs are scheduled for baseline = ebb28f5 and contender = c8a9ea0. c8a9ea0 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@HaoYang670 HaoYang670 deleted the 3607_simpl_null_separator branch September 26, 2022 11:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

optimizer Optimizer rules

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Simplify CONCAT_WS(NULL, ..) to NULL

4 participants