perf: Optimize translate() UDF for scalar inputs (apache#20305)
## Which issue does this PR close?
- Closes apache#20302.
## Rationale for this change
`translate()` is commonly invoked with constant values for its second
and third arguments. We can take advantage of that to significantly
optimize its performance by precomputing the translation lookup table,
rather than recomputing it for every row. For ASCII-only inputs, we can
further replace the hashmap lookup table with a fixed-size array that
maps ASCII byte values directly.
For scalar ASCII inputs, this yields roughly a 10x performance
improvement. For scalar UTF8 inputs, the performance improvement is more
like 50%, although less so for long strings.
Along the way, add support for `translate()` on `LargeUtf8` input, along
with an SLT test, and improve the docs.
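As a rough illustration of the ASCII fast path described above, the sketch below builds a fixed-size table once and reuses it for every row. It is not the code from this PR: the names `Action`, `build_ascii_table`, and `translate_ascii` are hypothetical, and passing non-ASCII input characters through unchanged is an assumption made for the example.

```rust
/// What to do with one ASCII byte (hypothetical type, for illustration only).
#[derive(Clone, Copy)]
enum Action {
    Keep,
    Replace(u8),
    Delete,
}

/// Build the 128-entry table once per call, when `from` and `to` are constants.
/// Returns `None` when either argument is not pure ASCII, in which case a
/// caller would fall back to a slower, more general path.
fn build_ascii_table(from: &str, to: &str) -> Option<[Action; 128]> {
    if !from.is_ascii() || !to.is_ascii() {
        return None;
    }
    let to_bytes = to.as_bytes();
    let mut table = [Action::Keep; 128];
    let mut seen = [false; 128];
    for (i, &b) in from.as_bytes().iter().enumerate() {
        let idx = b as usize;
        if seen[idx] {
            continue; // the first occurrence in `from` determines the mapping
        }
        seen[idx] = true;
        table[idx] = match to_bytes.get(i) {
            Some(&r) => Action::Replace(r),
            None => Action::Delete, // no counterpart in `to`: remove
        };
    }
    Some(table)
}

/// Apply the precomputed table to a single row.
/// Assumption: non-ASCII characters in the input are copied through as-is.
fn translate_ascii(input: &str, table: &[Action; 128]) -> String {
    let mut out = String::with_capacity(input.len());
    for ch in input.chars() {
        if ch.is_ascii() {
            match table[ch as u32 as usize] {
                Action::Keep => out.push(ch),
                Action::Replace(r) => out.push(r as char),
                Action::Delete => {}
            }
        } else {
            out.push(ch);
        }
    }
    out
}
```

The point of the design is that the per-row work becomes a single array index per character, with no hashing and no rebuilding of the mapping.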
## What changes are included in this PR?
* Add a benchmark for scalar/constant input to translate
* Add a missing test case
* Improve translate() docs
* Support translate() on LargeUtf8 input
* Optimize translate() for scalar inputs by precomputing lookup hashmap
* Optimize translate() for ASCII inputs by precomputing ASCII byte-wise
lookup table
## Are these changes tested?
Yes. Added an extra test case and did a bunch of benchmarking.
## Are there any user-facing changes?
No.
---------
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
Co-authored-by: Jeffrey Vo <jeffrey.vo.australia@gmail.com>
-    argument(name = "chars", description = "Characters to translate."),
+    argument(name = "from", description = "The characters to be replaced."),
     argument(
-        name = "translation",
-        description = "Translation characters. Translation characters replace only characters at the same position in the **chars** string."
+        name = "to",
+        description = "The characters to replace them with. Each character in **from** that is found in **str** is replaced by the character at the same index in **to**. Any characters in **from** that don't have a corresponding character in **to** are removed. If a character appears more than once in **from**, the first occurrence determines the mapping."
     )
 )]
 #[derive(Debug, PartialEq, Eq, Hash)]
@@ -71,6 +71,7 @@ impl TranslateFunc {
                 vec![
                     Exact(vec![Utf8View, Utf8, Utf8]),
                     Exact(vec![Utf8, Utf8, Utf8]),
+                    Exact(vec![LargeUtf8, Utf8, Utf8]),
                 ],
                 Volatility::Immutable,
             ),
@@ -99,6 +100,61 @@ impl ScalarUDFImpl for TranslateFunc {
         &self,
         args: datafusion_expr::ScalarFunctionArgs,
     ) -> Result<ColumnarValue> {
+        // When from and to are scalars, pre-build the translation map once
+        if let (Some(from_str), Some(to_str)) = (
+            try_as_scalar_str(&args.args[1]),
+            try_as_scalar_str(&args.args[2]),
+        ) {
+            let to_graphemes: Vec<&str> = to_str.graphemes(true).collect();
docs/source/user-guide/sql/scalar_functions.md (4 additions, 4 deletions)
@@ -2068,17 +2068,17 @@ to_hex(int)
 
 ### `translate`
 
-Translates characters in a string to specified translation characters.
+Performs character-wise substitution based on a mapping.
 
 ```sql
-translate(str, chars, translation)
+translate(str, from, to)
 ```
 
 #### Arguments
 
 - **str**: String expression to operate on. Can be a constant, column, or function, and any combination of operators.
-- **chars**: Characters to translate.
-- **translation**: Translation characters. Translation characters replace only characters at the same position in the **chars** string.
+- **from**: The characters to be replaced.
+- **to**: The characters to replace them with. Each character in **from** that is found in **str** is replaced by the character at the same index in **to**. Any characters in **from** that don't have a corresponding character in **to** are removed. If a character appears more than once in **from**, the first occurrence determines the mapping.
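The argument semantics documented above (positional replacement, removal of unmatched **from** characters, first occurrence wins) can be sketched with a map that is built once and then reused for each row. This is a simplified illustration rather than the PR's implementation: it works on `char` values instead of the grapheme clusters that the real function iterates over, and `build_map` and `translate_with_map` are hypothetical names.

```rust
use std::collections::HashMap;

/// Precompute the mapping from `from` and `to`.
/// `Some(c)` means "replace with c"; `None` means "remove".
/// Assumption: characters are compared as `char`, not as grapheme clusters.
fn build_map(from: &str, to: &str) -> HashMap<char, Option<char>> {
    let mut to_chars = to.chars();
    let mut map = HashMap::new();
    for f in from.chars() {
        // Advance `to` for every `from` character so that indices stay aligned,
        // even when `f` is a duplicate that will be ignored.
        let replacement = to_chars.next();
        // `or_insert` keeps an existing entry, so the first occurrence wins.
        map.entry(f).or_insert(replacement);
    }
    map
}

/// Apply a precomputed map to one input string.
fn translate_with_map(input: &str, map: &HashMap<char, Option<char>>) -> String {
    input
        .chars()
        .filter_map(|c| match map.get(&c) {
            Some(Some(r)) => Some(*r), // mapped: replace
            Some(None) => None,        // in `from` with no counterpart in `to`: remove
            None => Some(c),           // not in `from`: keep unchanged
        })
        .collect()
}
```

With `build_map("wic", "her")`, translating `"twice"` under this sketch gives `"there"`; with `build_map("abc", "x")`, the characters `b` and `c` are removed.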