## Which issue does this PR close?
Add sorted data benchmark.
- Closes [#18976](#18976)
## Rationale for this change
## What changes are included in this PR?
## Are these changes tested?
Yes. With the reverse parquet PR (#18817), the sorted-data query is about 30x faster than on the main branch (9.85 ms vs. 294.13 ms average for Q0). Result with the reverse parquet PR:
```text
Running `/Users/zhuqi/arrow-datafusion/target/release/dfbench clickbench --iterations 5 --path /Users/zhuqi/arrow-datafusion/benchmarks/data/hits_0_sorted.parquet --queries-path /Users/zhuqi/arrow-datafusion/benchmarks/queries/clickbench/queries/sorted_data --sorted-by EventTime --sort-order ASC -o /Users/zhuqi/arrow-datafusion/benchmarks/results/reverse_parquet/data_sorted_clickbench.json`
Running benchmarks with the following options: RunOpt { query: None, pushdown: false, common: CommonOpt { iterations: 5, partitions: None, batch_size: None, mem_pool_type: "fair", memory_limit: None, sort_spill_reservation_bytes: None, debug: false }, path: "/Users/zhuqi/arrow-datafusion/benchmarks/data/hits_0_sorted.parquet", queries_path: "/Users/zhuqi/arrow-datafusion/benchmarks/queries/clickbench/queries/sorted_data", output_path: Some("/Users/zhuqi/arrow-datafusion/benchmarks/results/reverse_parquet/data_sorted_clickbench.json"), sorted_by: Some("EventTime"), sort_order: "ASC" }
⚠️ Forcing target_partitions=1 to preserve sort order
⚠️ (Because we want to get the pure performance benefit of sorted data to compare)
📊 Session config target_partitions: 1
Registering table with sort order: EventTime ASC
Executing: CREATE EXTERNAL TABLE hits STORED AS PARQUET LOCATION '/Users/zhuqi/arrow-datafusion/benchmarks/data/hits_0_sorted.parquet' WITH ORDER ("EventTime" ASC)
Q0: -- Must set for ClickBench hits_partitioned dataset. See #16591
-- set datafusion.execution.parquet.binary_as_string = true
SELECT * FROM hits ORDER BY "EventTime" DESC limit 10;
Query 0 iteration 0 took 14.7 ms and returned 10 rows
Query 0 iteration 1 took 10.2 ms and returned 10 rows
Query 0 iteration 2 took 8.7 ms and returned 10 rows
Query 0 iteration 3 took 7.9 ms and returned 10 rows
Query 0 iteration 4 took 7.9 ms and returned 10 rows
Query 0 avg time: 9.85 ms
+ set +x
Done
```
And the main branch result:
```text
Running `/Users/zhuqi/arrow-datafusion/target/release/dfbench clickbench --iterations 5 --path /Users/zhuqi/arrow-datafusion/benchmarks/data/hits_0_sorted.parquet --queries-path /Users/zhuqi/arrow-datafusion/benchmarks/queries/clickbench/queries/sorted_data --sorted-by EventTime --sort-order ASC -o /Users/zhuqi/arrow-datafusion/benchmarks/results/issue_18976/data_sorted_clickbench.json`
Running benchmarks with the following options: RunOpt { query: None, pushdown: false, common: CommonOpt { iterations: 5, partitions: None, batch_size: None, mem_pool_type: "fair", memory_limit: None, sort_spill_reservation_bytes: None, debug: false }, path: "/Users/zhuqi/arrow-datafusion/benchmarks/data/hits_0_sorted.parquet", queries_path: "/Users/zhuqi/arrow-datafusion/benchmarks/queries/clickbench/queries/sorted_data", output_path: Some("/Users/zhuqi/arrow-datafusion/benchmarks/results/issue_18976/data_sorted_clickbench.json"), sorted_by: Some("EventTime"), sort_order: "ASC" }
⚠️ Forcing target_partitions=1 to preserve sort order
⚠️ (Because we want to get the pure performance benefit of sorted data to compare)
📊 Session config target_partitions: 1
Registering table with sort order: EventTime ASC
Executing: CREATE EXTERNAL TABLE hits STORED AS PARQUET LOCATION '/Users/zhuqi/arrow-datafusion/benchmarks/data/hits_0_sorted.parquet' WITH ORDER ("EventTime" ASC)
Q0: -- Must set for ClickBench hits_partitioned dataset. See #16591
-- set datafusion.execution.parquet.binary_as_string = true
SELECT * FROM hits ORDER BY "EventTime" DESC limit 10;
Query 0 iteration 0 took 331.1 ms and returned 10 rows
Query 0 iteration 1 took 286.0 ms and returned 10 rows
Query 0 iteration 2 took 283.3 ms and returned 10 rows
Query 0 iteration 3 took 283.8 ms and returned 10 rows
Query 0 iteration 4 took 286.5 ms and returned 10 rows
Query 0 avg time: 294.13 ms
+ set +x
Done
```
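The roughly 30x figure follows from the two logs above. As a quick check, this sketch recomputes the averages and the speedup from the per-iteration timings. The function names are illustrative, and because the logged timings are rounded to 0.1 ms, the recomputed averages can differ from the logged ones in the last digit.

```rust
/// Arithmetic mean of a slice of timings in milliseconds.
fn mean_ms(samples: &[f64]) -> f64 {
    samples.iter().sum::<f64>() / samples.len() as f64
}

/// Speedup of the candidate over the baseline (baseline time / candidate time).
fn speedup(baseline_ms: f64, candidate_ms: f64) -> f64 {
    baseline_ms / candidate_ms
}

fn main() {
    // Per-iteration timings copied from the two logs above (rounded to 0.1 ms).
    let reverse_parquet = [14.7, 10.2, 8.7, 7.9, 7.9];
    let main_branch = [331.1, 286.0, 283.3, 283.8, 286.5];

    let pr_avg = mean_ms(&reverse_parquet);
    let main_avg = mean_ms(&main_branch);

    println!("reverse parquet PR avg: {pr_avg:.2} ms");
    println!("main branch avg:        {main_avg:.2} ms");
    println!("speedup:                {:.1}x", speedup(main_avg, pr_avg));
}
```

Note that the first iteration of each run is the slowest, so dropping it as warm-up would shift both averages; the ratio stays in the same range either way.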
## Are there any user-facing changes?
---------
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
Co-authored-by: Yongting You <2010youy01@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
benchmarks/README.md: 38 additions & 0 deletions
@@ -832,3 +832,41 @@ Getting results...
cancelling thread
done dropping runtime in 83.531417ms
```
## Sorted Data Benchmarks

### Data Sorted ClickBench

Benchmark for queries on pre-sorted data to test sort order optimization.
This benchmark uses a subset of the ClickBench dataset (hits.parquet, ~14GB) that has been pre-sorted by the EventTime column. The queries are designed to test DataFusion's performance when the data is already sorted, as is common in time-series workloads.

The benchmark includes queries that:

- Scan pre-sorted data with ORDER BY clauses that match the sort order
- Test reverse scans on sorted data
- Measure the performance benefit of the existing sort order
#### Generating Sorted Data

The sorted dataset is automatically generated from the ClickBench partitioned dataset. You can configure the memory used during the sorting process with the `DATAFUSION_MEMORY_GB` environment variable. The default memory limit is 12GB.

```bash
./bench.sh data data_sorted_clickbench
```

To create the sorted dataset, for example with 16GB of memory, run:

```bash
DATAFUSION_MEMORY_GB=16 ./bench.sh data data_sorted_clickbench
```

This command will:
1. Download the ClickBench partitioned dataset if not present
2. Sort hits.parquet by EventTime in ascending order
3. Save the sorted file as hits_sorted.parquet
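
The default-with-override behaviour of `DATAFUSION_MEMORY_GB` described above can be sketched as follows. The real logic lives in `bench.sh`; `memory_limit_gb` is a hypothetical helper, and falling back to the default on an unparsable value is an assumption rather than documented behaviour.

```rust
use std::env;

/// Default memory limit in GB, as stated in the text above.
const DEFAULT_MEMORY_GB: u64 = 12;

/// Resolve the memory limit from an optional raw setting.
/// Assumption: missing or unparsable values fall back to the default.
fn memory_limit_gb(raw: Option<&str>) -> u64 {
    raw.and_then(|s| s.trim().parse::<u64>().ok())
        .unwrap_or(DEFAULT_MEMORY_GB)
}

fn main() {
    // Read the variable once and keep the parsing logic in a pure function.
    let raw = env::var("DATAFUSION_MEMORY_GB").ok();
    let limit = memory_limit_gb(raw.as_deref());
    println!("sorting with a {limit} GB memory limit");
}
```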
#### Running the Benchmark

```bash
./bench.sh run data_sorted_clickbench
```

This runs queries against the pre-sorted dataset with the `--sorted-by EventTime` flag, which informs DataFusion that the data is pre-sorted, allowing it to optimize away redundant sort operations.
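
To illustrate why a known sort order helps, the sketch below contrasts two ways of answering `ORDER BY "EventTime" DESC LIMIT 10` over data stored in ascending order: sorting everything, versus walking the already-sorted data backwards and stopping at the limit. It is a conceptual model over an in-memory slice, not DataFusion's implementation; `top_k_full_sort` and `top_k_reverse_scan` are hypothetical names.

```rust
/// Baseline: ignore the existing order, sort everything descending, keep `k`.
/// Cost grows with the whole input: O(n log n).
fn top_k_full_sort(event_times: &[i64], k: usize) -> Vec<i64> {
    let mut sorted = event_times.to_vec();
    sorted.sort_unstable_by(|a, b| b.cmp(a));
    sorted.truncate(k);
    sorted
}

/// With a declared ascending order, a descending top-k is just the last `k`
/// values read back to front. Cost: O(k).
fn top_k_reverse_scan(ascending_event_times: &[i64], k: usize) -> Vec<i64> {
    ascending_event_times.iter().rev().take(k).copied().collect()
}

fn main() {
    // Stand-in for a file pre-sorted by EventTime ASC.
    let event_times: Vec<i64> = (0..1_000_000).collect();

    let slow = top_k_full_sort(&event_times, 10);
    let fast = top_k_reverse_scan(&event_times, 10);

    // Both strategies return the same rows; only the amount of work differs.
    assert_eq!(slow, fast);
    println!("{fast:?}");
}
```

The reverse scan is only valid because the ordering is declared up front; without that guarantee, the full sort is the only correct choice.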