
Commit fe2de33

docs: Add contributor guide page for SQL file tests (#3333)
1 parent bfe4644 commit fe2de33

3 files changed

Lines changed: 348 additions & 0 deletions


docs/source/contributor-guide/development.md

Lines changed: 10 additions & 0 deletions
@@ -140,6 +140,16 @@ size limit to 16 MB.
 First make sure to install the Rust plugin in CLion or you can use the dedicated Rust IDE: RustRover.
 After that you can open the project in CLion. The IDE should automatically detect the project structure and import as a Cargo project.

+### SQL file tests (recommended for expressions)
+
+For testing expressions and operators, prefer using SQL file tests over writing Scala test
+code. SQL file tests are plain `.sql` files that are automatically discovered and executed --
+no Scala code to write, and no recompilation needed when tests change. This makes it easy to
+iterate quickly and to get good coverage of edge cases and argument combinations.
+
+See the [SQL File Tests](sql-file-tests) guide for the full documentation on how to write
+and run these tests.
+
 ### Running Tests in IDEA

 Like other Maven projects, you can run tests in IntelliJ IDEA by right-clicking on the test class or test method and selecting "Run" or "Debug".

docs/source/contributor-guide/index.md

Lines changed: 1 addition & 0 deletions
@@ -37,6 +37,7 @@ Adding a New Expression <adding_a_new_expression>
 Tracing <tracing>
 Profiling Native Code <profiling_native_code>
 Spark SQL Tests <spark-sql-tests.md>
+SQL File Tests <sql-file-tests.md>
 Roadmap <roadmap.md>
 Github and Issue Tracker <https://github.com/apache/datafusion-comet>
 ```
docs/source/contributor-guide/sql-file-tests.md

Lines changed: 337 additions & 0 deletions

@@ -0,0 +1,337 @@
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# SQL File Tests

`CometSqlFileTestSuite` is a test suite that automatically discovers `.sql` test files and
runs each query through both Spark and Comet, comparing results. This provides a lightweight
way to add expression and operator test coverage without writing Scala test code.

## Running the tests

```shell
mvn test -pl spark -Dsuites="org.apache.comet.CometSqlFileTestSuite" -Dtest=none
```

## Test file location

SQL test files live under:

```
spark/src/test/resources/sql-tests/expressions/
```

Files are organized into category subdirectories:

```
expressions/
  aggregate/    -- avg, sum, count, min_max, ...
  array/        -- array_contains, array_append, get_array_item, ...
  bitwise/
  cast/
  conditional/  -- case_when, coalesce, if_expr, ...
  datetime/     -- date_add, date_diff, unix_timestamp, ...
  decimal/
  hash/
  map/          -- get_map_value, map_keys, map_values, ...
  math/         -- abs, ceil, floor, round, sqrt, ...
  misc/         -- width_bucket, scalar_subquery, ...
  string/       -- concat, like, substring, lower, upper, ...
  struct/       -- create_named_struct, get_struct_field, ...
```

The test suite recursively discovers all `.sql` files in these directories. Each file becomes
one or more ScalaTest test cases.

## File format

A test file consists of SQL comments, directives, statements, and queries separated by blank
lines. Here is a minimal example:

```sql
-- ConfigMatrix: parquet.enable.dictionary=false,true

statement
CREATE TABLE test_abs(v double) USING parquet

statement
INSERT INTO test_abs VALUES (1.5), (-2.5), (0.0), (NULL)

query
SELECT abs(v) FROM test_abs
```

### Directives

Directives are SQL comments at the top of the file that configure how the test runs.

#### `Config`

Sets a Spark SQL config for all queries in the file.

```sql
-- Config: spark.sql.ansi.enabled=true
```

#### `ConfigMatrix`

Runs the entire file once per combination of values. Multiple `ConfigMatrix` lines produce a
cross product of all combinations.

```sql
-- ConfigMatrix: parquet.enable.dictionary=false,true
```

This generates two test cases:

```
sql-file: expressions/cast/cast.sql [parquet.enable.dictionary=false]
sql-file: expressions/cast/cast.sql [parquet.enable.dictionary=true]
```
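
To illustrate the cross product, here is a sketch (not taken from the Comet test suite itself)
that reuses the `spark.sql.ansi.enabled` config from the `Config` example above as a second
matrix dimension:

```sql
-- ConfigMatrix: parquet.enable.dictionary=false,true
-- ConfigMatrix: spark.sql.ansi.enabled=false,true
```

Together, these two lines would generate four test cases, one per combination of the two
settings.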

#### `MinSparkVersion`

Skips the file when running on a Spark version older than the specified version.

```sql
-- MinSparkVersion: 3.5
```
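
Directives can be stacked at the top of a file. As a hedged sketch, here is a hypothetical test
gated on a newer function (`test_prepend` is an illustrative table name, and `array_prepend` is
assumed here to require Spark 3.5):

```sql
-- MinSparkVersion: 3.5
-- ConfigMatrix: parquet.enable.dictionary=false,true

statement
CREATE TABLE test_prepend(arr array<int>) USING parquet

statement
INSERT INTO test_prepend VALUES (array(1, 2)), (array(1)), (NULL)

-- spark_answer_only, in case native support is not yet expected
query spark_answer_only
SELECT array_prepend(arr, 0) FROM test_prepend
```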

### Statements

A `statement` block executes DDL or DML and does not check results. Use this for `CREATE TABLE`
and `INSERT` commands. Table names are automatically extracted for cleanup after the test.

```sql
statement
CREATE TABLE my_table(x int, y double) USING parquet

statement
INSERT INTO my_table VALUES (1, 2.0), (3, 4.0), (NULL, NULL)
```

### Queries

A `query` block executes a SELECT and compares results between Spark and Comet. The query
mode controls how results are validated.

#### `query` (default mode)

Checks that the query runs natively on Comet (not falling back to Spark) and that results
match Spark exactly.

```sql
query
SELECT abs(v) FROM test_abs
```

#### `query spark_answer_only`

Only checks that Comet results match Spark. Does not assert that the query runs natively.
Use this for expressions that Comet may not fully support yet but should still produce
correct results.

```sql
query spark_answer_only
SELECT some_expression(v) FROM test_table
```

#### `query tolerance=<value>`

Checks results with a numeric tolerance. Useful for floating-point functions where small
differences are acceptable.

```sql
query tolerance=0.0001
SELECT cos(v) FROM test_trig
```

#### `query expect_fallback(<reason>)`

Asserts that the query falls back to Spark and verifies that the fallback reason contains the
given string.

```sql
query expect_fallback(unsupported expression)
SELECT unsupported_func(v) FROM test_table
```

#### `query ignore(<reason>)`

Skips the query entirely. Use this for queries that hit known bugs. The reason should be a
link to the tracking GitHub issue.

```sql
-- Comet bug: space(-1) causes native crash
query ignore(https://github.com/apache/datafusion-comet/issues/3326)
SELECT space(n) FROM test_space WHERE n < 0
```

## Adding a new test

1. Create a `.sql` file under the appropriate subdirectory in
   `spark/src/test/resources/sql-tests/expressions/`. Create a new subdirectory if no
   existing category fits.

2. Add the Apache license header as a SQL comment.

3. Add a `ConfigMatrix` directive if the test should run with multiple Parquet configurations.
   Most expression tests use:

   ```sql
   -- ConfigMatrix: parquet.enable.dictionary=false,true
   ```

4. Create tables and insert test data using `statement` blocks. Include edge cases such as
   `NULL`, boundary values, and negative numbers.

5. Add `query` blocks for each expression or behavior to test. Use the default `query` mode
   when you expect Comet to run the expression natively. Use `query spark_answer_only` when
   native execution is not yet expected.

6. Run the tests to verify:

   ```shell
   mvn test -pl spark -Dsuites="org.apache.comet.CometSqlFileTestSuite" -Dtest=none
   ```
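
Putting these steps together, here is a sketch of a complete test file for a simple string
function (`test_upper` is a hypothetical table name; the Apache license header is elided for
brevity):

```sql
-- ConfigMatrix: parquet.enable.dictionary=false,true

statement
CREATE TABLE test_upper(s string) USING parquet

statement
INSERT INTO test_upper VALUES ('hello'), (''), ('MiXeD'), ('é中文'), (NULL)

-- column argument
query
SELECT upper(s) FROM test_upper

-- literal arguments
query
SELECT upper('abc'), upper(''), upper(NULL)
```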

### Tips for writing thorough tests

#### Cover all combinations of literal and column arguments

Comet often uses different code paths for literal values versus column references. Tests
should exercise both. For a function with multiple arguments, test every useful combination.

For a single-argument function, test both a column reference and a literal:

```sql
-- column argument (reads from Parquet, goes through columnar evaluation)
query
SELECT ascii(s) FROM test_ascii

-- literal arguments
query
SELECT ascii('A'), ascii(''), ascii(NULL)
```

For a multi-argument function like `concat_ws(sep, str1, str2, ...)`, test with the separator
as a column versus a literal, and similarly for the other arguments:

```sql
-- all columns
query
SELECT concat_ws(sep, a, b) FROM test_table

-- literal separator, column values
query
SELECT concat_ws(',', a, b) FROM test_table

-- all literals
query
SELECT concat_ws(',', 'hello', 'world')
```

**Note on constant folding:** Normally Spark constant-folds all-literal expressions during
planning, so Comet would never see them. However, `CometSqlFileTestSuite` automatically
disables constant folding (by excluding `ConstantFolding` from the optimizer rules), so
all-literal queries are evaluated by Comet's native engine. This means you can use the
default `query` mode for all-literal cases and they will be tested natively just like
column-based queries.
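
For example, a sketch of an all-literal query that Spark would normally fold away during
planning but that this suite still exercises natively:

```sql
-- all literals: still evaluated by Comet because the suite disables ConstantFolding
query
SELECT abs(-2.5), abs(0.0), abs(CAST(NULL AS DOUBLE))
```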

#### Cover edge cases

Include edge-case values in your test data. The exact cases depend on the function, but
common ones include (a data-seeding sketch follows this list):

- **NULL values** -- every test should include NULLs
- **Empty strings** -- for string functions
- **Zero, negative, and very large numbers** -- for numeric functions
- **Boundary values** -- `INT_MIN`, `INT_MAX`, `NaN`, `Infinity`, `-Infinity` for numeric
  types
- **Special characters and multibyte UTF-8** -- for string functions (e.g. `'é'`, `'中文'`,
  `'\t'`)
- **Empty arrays/maps** -- for collection functions
- **Single-element and multi-element collections** -- for aggregate and collection functions
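
As a sketch, a pair of `statement` blocks seeding a double column with several of these edge
cases (`test_edge` is a hypothetical table name):

```sql
statement
CREATE TABLE test_edge(v double) USING parquet

statement
INSERT INTO test_edge VALUES
  (0.0), (-0.0), (1.5), (-1.5), (1e308), (-1e308),
  (CAST('NaN' AS DOUBLE)), (CAST('Infinity' AS DOUBLE)),
  (CAST('-Infinity' AS DOUBLE)), (NULL)
```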

#### One file per expression

Keep each `.sql` file focused on a single expression or a small group of closely related
expressions. This makes failures easy to locate and keeps files readable.

#### Use comments to label sections

Add SQL comments before query blocks to describe what aspect of the expression is being
tested. This helps reviewers and future maintainers understand the intent:

```sql
-- literal separator with NULL values
query
SELECT concat_ws(',', NULL, 'b', 'c')

-- empty separator
query
SELECT concat_ws('', a, b, c) FROM test_table
```

### Using agentic coding tools

Writing thorough SQL test files is a task well suited to agentic coding tools such as
Claude Code. You can point the tool at an existing test file as an example, describe the
expression you want to test, and ask it to generate a complete `.sql` file covering all
argument combinations and edge cases. This is significantly faster than writing the
combinatorial test cases by hand and helps ensure nothing is missed.

For example:

```
Read the test file spark/src/test/resources/sql-tests/expressions/string/ascii.sql
and the documentation in docs/source/contributor-guide/sql-file-tests.md.
Then write a similar test file for the `reverse` function, covering column arguments,
literal arguments, NULLs, empty strings, and multibyte characters.
```

## Handling test failures

When a query fails due to a known Comet bug:

1. File a GitHub issue describing the problem.
2. Change the query mode to `ignore(...)` with a link to the issue.
3. Optionally add a SQL comment above the query explaining the problem.

```sql
-- GetArrayItem returns incorrect results with dynamic index
query ignore(https://github.com/apache/datafusion-comet/issues/3332)
SELECT arr[idx] FROM test_get_array_item
```

When the bug is fixed, remove the `ignore(...)` and restore the original query mode.

## Architecture

The test infrastructure consists of two Scala files:

- **`SqlFileTestParser`** (`spark/src/test/scala/org/apache/comet/SqlFileTestParser.scala`) --
  Parses `.sql` files into a `SqlTestFile` data structure containing directives, statements,
  and queries.

- **`CometSqlFileTestSuite`** (`spark/src/test/scala/org/apache/comet/CometSqlFileTestSuite.scala`) --
  Discovers test files at suite initialization time, generates ScalaTest test cases for each
  file and config combination, and executes them using `CometTestBase` assertion methods.

Tables created in test files are automatically cleaned up after each test.
