<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# SQL File Tests

`CometSqlFileTestSuite` is a test suite that automatically discovers `.sql` test files and
runs each query through both Spark and Comet, comparing results. This provides a lightweight
way to add expression and operator test coverage without writing Scala test code.

## Running the tests

```shell
mvn test -pl spark -Dsuites="org.apache.comet.CometSqlFileTestSuite" -Dtest=none
```

## Test file location

SQL test files live under:

```
spark/src/test/resources/sql-tests/expressions/
```

Files are organized into category subdirectories:

```
expressions/
  aggregate/     -- avg, sum, count, min_max, ...
  array/         -- array_contains, array_append, get_array_item, ...
  bitwise/
  cast/
  conditional/   -- case_when, coalesce, if_expr, ...
  datetime/      -- date_add, date_diff, unix_timestamp, ...
  decimal/
  hash/
  map/           -- get_map_value, map_keys, map_values, ...
  math/          -- abs, ceil, floor, round, sqrt, ...
  misc/          -- width_bucket, scalar_subquery, ...
  string/        -- concat, like, substring, lower, upper, ...
  struct/        -- create_named_struct, get_struct_field, ...
```

The test suite recursively discovers all `.sql` files in these directories. Each file becomes
one or more ScalaTest test cases.

## File format

A test file consists of SQL comments, directives, statements, and queries separated by blank
lines. Here is a minimal example:

```sql
-- ConfigMatrix: parquet.enable.dictionary=false,true

statement
CREATE TABLE test_abs(v double) USING parquet

statement
INSERT INTO test_abs VALUES (1.5), (-2.5), (0.0), (NULL)

query
SELECT abs(v) FROM test_abs
```

### Directives

Directives are SQL comments at the top of the file that configure how the test runs.

#### `Config`

Sets a Spark SQL config for all queries in the file.

```sql
-- Config: spark.sql.ansi.enabled=true
```

#### `ConfigMatrix`

Runs the entire file once per combination of values. Multiple `ConfigMatrix` lines produce a
cross product of all combinations.

```sql
-- ConfigMatrix: parquet.enable.dictionary=false,true
```

This generates two test cases:

```
sql-file: expressions/cast/cast.sql [parquet.enable.dictionary=false]
sql-file: expressions/cast/cast.sql [parquet.enable.dictionary=true]
```

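Because multiple `ConfigMatrix` lines combine as a cross product, declaring a second matrix
multiplies the number of runs. The following pairing is hypothetical, reusing the
`spark.sql.ansi.enabled` config shown under `Config` above:

```sql
-- ConfigMatrix: parquet.enable.dictionary=false,true
-- ConfigMatrix: spark.sql.ansi.enabled=false,true
```

This would yield four test cases (2 x 2), one per combination of the two configs (the exact
rendering of multiple configs in generated test names may differ from this sketch):

```
sql-file: expressions/cast/cast.sql [parquet.enable.dictionary=false, spark.sql.ansi.enabled=false]
sql-file: expressions/cast/cast.sql [parquet.enable.dictionary=false, spark.sql.ansi.enabled=true]
sql-file: expressions/cast/cast.sql [parquet.enable.dictionary=true, spark.sql.ansi.enabled=false]
sql-file: expressions/cast/cast.sql [parquet.enable.dictionary=true, spark.sql.ansi.enabled=true]
```
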
#### `MinSparkVersion`

Skips the file when running on a Spark version older than the specified version.

```sql
-- MinSparkVersion: 3.5
```

### Statements

A `statement` block executes DDL or DML and does not check results. Use this for `CREATE TABLE`
and `INSERT` commands. Table names are automatically extracted for cleanup after the test.

```sql
statement
CREATE TABLE my_table(x int, y double) USING parquet

statement
INSERT INTO my_table VALUES (1, 2.0), (3, 4.0), (NULL, NULL)
```

### Queries

A `query` block executes a SELECT and compares results between Spark and Comet. The query
mode controls how results are validated.

#### `query` (default mode)

Checks that the query runs natively on Comet (not falling back to Spark) and that results
match Spark exactly.

```sql
query
SELECT abs(v) FROM test_abs
```

#### `query spark_answer_only`

Only checks that Comet results match Spark. Does not assert that the query runs natively.
Use this for expressions that Comet may not fully support yet but should still produce
correct results.

```sql
query spark_answer_only
SELECT some_expression(v) FROM test_table
```

#### `query tolerance=<value>`

Checks results with a numeric tolerance. Useful for floating-point functions where small
differences are acceptable.

```sql
query tolerance=0.0001
SELECT cos(v) FROM test_trig
```

#### `query expect_fallback(<reason>)`

Asserts that the query falls back to Spark and verifies the fallback reason contains the
given string.

```sql
query expect_fallback(unsupported expression)
SELECT unsupported_func(v) FROM test_table
```

#### `query ignore(<reason>)`

Skips the query entirely. Use this for queries that hit known bugs. The reason should be a
link to the tracking GitHub issue.

```sql
-- Comet bug: space(-1) causes native crash
query ignore(https://github.com/apache/datafusion-comet/issues/3326)
SELECT space(n) FROM test_space WHERE n < 0
```

## Adding a new test

1. Create a `.sql` file under the appropriate subdirectory in
   `spark/src/test/resources/sql-tests/expressions/`. Create a new subdirectory if no
   existing category fits.

2. Add the Apache license header as a SQL comment.

3. Add a `ConfigMatrix` directive if the test should run with multiple Parquet configurations.
   Most expression tests use:

   ```sql
   -- ConfigMatrix: parquet.enable.dictionary=false,true
   ```

4. Create tables and insert test data using `statement` blocks. Include edge cases such as
   `NULL`, boundary values, and negative numbers.

5. Add `query` blocks for each expression or behavior to test. Use the default `query` mode
   when you expect Comet to run the expression natively. Use `query spark_answer_only` when
   native execution is not yet expected.

6. Run the tests to verify:

   ```shell
   mvn test -pl spark -Dsuites="org.apache.comet.CometSqlFileTestSuite" -Dtest=none
   ```

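Putting the steps together, a complete new test file might look like the following
hypothetical test for `sqrt` (the file contents are illustrative, not an existing test):

```sql
-- (Apache license header as a SQL comment, omitted here for brevity)

-- ConfigMatrix: parquet.enable.dictionary=false,true

statement
CREATE TABLE test_sqrt(v double) USING parquet

statement
INSERT INTO test_sqrt VALUES (4.0), (0.25), (0.0), (-1.0), (NULL)

-- column argument
query
SELECT sqrt(v) FROM test_sqrt

-- literal arguments
query
SELECT sqrt(4.0), sqrt(NULL)
```
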
### Tips for writing thorough tests

#### Cover all combinations of literal and column arguments

Comet often uses different code paths for literal values versus column references. Tests
should exercise both. For a function with multiple arguments, test every useful combination.

For a single-argument function, test both a column reference and a literal:

```sql
-- column argument (reads from Parquet, goes through columnar evaluation)
query
SELECT ascii(s) FROM test_ascii

-- literal arguments
query
SELECT ascii('A'), ascii(''), ascii(NULL)
```

For a multi-argument function like `concat_ws(sep, str1, str2, ...)`, test with the separator
as a column versus a literal, and similarly for the other arguments:

```sql
-- all columns
query
SELECT concat_ws(sep, a, b) FROM test_table

-- literal separator, column values
query
SELECT concat_ws(',', a, b) FROM test_table

-- all literals
query
SELECT concat_ws(',', 'hello', 'world')
```

**Note on constant folding:** Normally Spark constant-folds all-literal expressions during
planning, so Comet would never see them. However, `CometSqlFileTestSuite` automatically
disables constant folding (by excluding `ConstantFolding` from the optimizer rules), so
all-literal queries are evaluated by Comet's native engine. This means you can use the
default `query` mode for all-literal cases and they will be tested natively just like
column-based queries.

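No directive is needed for this in test files; the suite applies the exclusion itself. Purely
for illustration, the equivalent setting in a plain Spark SQL session would be:

```sql
-- illustration only: CometSqlFileTestSuite sets this up automatically
SET spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.ConstantFolding;
```
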
#### Cover edge cases

Include edge-case values in your test data (see the example after this list). The exact cases
depend on the function, but common ones include:

- **NULL values** -- every test should include NULLs
- **Empty strings** -- for string functions
- **Zero, negative, and very large numbers** -- for numeric functions
- **Boundary values** -- `INT_MIN`, `INT_MAX`, `NaN`, `Infinity`, `-Infinity` for numeric
  types
- **Special characters and multibyte UTF-8** -- for string functions (e.g. `'é'`, `'中文'`,
  `'\t'`)
- **Empty arrays/maps** -- for collection functions
- **Single-element and multi-element collections** -- for aggregate and collection functions

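For example, a hypothetical table for exercising an integer or string function could load
several of these cases at once:

```sql
statement
CREATE TABLE test_edge(n int, s string) USING parquet

statement
INSERT INTO test_edge VALUES
  (0, ''),
  (-1, 'é'),
  (2147483647, '中文'),
  (-2147483648, '\t'),
  (NULL, NULL)
```
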
#### One file per expression

Keep each `.sql` file focused on a single expression or a small group of closely related
expressions. This makes failures easy to locate and keeps files readable.

#### Use comments to label sections

Add SQL comments before query blocks to describe what aspect of the expression is being
tested. This helps reviewers and future maintainers understand the intent:

```sql
-- literal separator with NULL values
query
SELECT concat_ws(',', NULL, 'b', 'c')

-- empty separator
query
SELECT concat_ws('', a, b, c) FROM test_table
```

### Using agentic coding tools

Writing thorough SQL test files is a task well suited to agentic coding tools such as
Claude Code. You can point the tool at an existing test file as an example, describe the
expression you want to test, and ask it to generate a complete `.sql` file covering all
argument combinations and edge cases. This is significantly faster than writing the
combinatorial test cases by hand and helps ensure nothing is missed.

For example:

```
Read the test file spark/src/test/resources/sql-tests/expressions/string/ascii.sql
and the documentation in docs/source/contributor-guide/sql-file-tests.md.
Then write a similar test file for the `reverse` function, covering column arguments,
literal arguments, NULLs, empty strings, and multibyte characters.
```

## Handling test failures

When a query fails due to a known Comet bug:

1. File a GitHub issue describing the problem.
2. Change the query mode to `ignore(...)` with a link to the issue.
3. Optionally add a SQL comment above the query explaining the problem.

```sql
-- GetArrayItem returns incorrect results with dynamic index
query ignore(https://github.com/apache/datafusion-comet/issues/3332)
SELECT arr[idx] FROM test_get_array_item
```

When the bug is fixed, remove the `ignore(...)` and restore the original query mode.

## Architecture

The test infrastructure consists of two Scala files:

- **`SqlFileTestParser`** (`spark/src/test/scala/org/apache/comet/SqlFileTestParser.scala`) --
  Parses `.sql` files into a `SqlTestFile` data structure containing directives, statements,
  and queries.

- **`CometSqlFileTestSuite`** (`spark/src/test/scala/org/apache/comet/CometSqlFileTestSuite.scala`) --
  Discovers test files at suite initialization time, generates ScalaTest test cases for each
  file and config combination, and executes them using `CometTestBase` assertion methods.

Tables created in test files are automatically cleaned up after each test.