## Which issue does this PR close?
Add sorted data benchmark.
- Closes [apache#18976](apache#18976)
## Rationale for this change
<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->
## What changes are included in this PR?
<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->
## Are these changes tested?
Yes. Test results with the reverse parquet PR (apache#18817) show it is about 30X faster than the main branch on sorted data (9.85 ms vs. 294.13 ms average for Q0):
```text
Running `/Users/zhuqi/arrow-datafusion/target/release/dfbench clickbench --iterations 5 --path /Users/zhuqi/arrow-datafusion/benchmarks/data/hits_0_sorted.parquet --queries-path /Users/zhuqi/arrow-datafusion/benchmarks/queries/clickbench/queries/sorted_data --sorted-by EventTime --sort-order ASC -o /Users/zhuqi/arrow-datafusion/benchmarks/results/reverse_parquet/data_sorted_clickbench.json`
Running benchmarks with the following options: RunOpt { query: None, pushdown: false, common: CommonOpt { iterations: 5, partitions: None, batch_size: None, mem_pool_type: "fair", memory_limit: None, sort_spill_reservation_bytes: None, debug: false }, path: "/Users/zhuqi/arrow-datafusion/benchmarks/data/hits_0_sorted.parquet", queries_path: "/Users/zhuqi/arrow-datafusion/benchmarks/queries/clickbench/queries/sorted_data", output_path: Some("/Users/zhuqi/arrow-datafusion/benchmarks/results/reverse_parquet/data_sorted_clickbench.json"), sorted_by: Some("EventTime"), sort_order: "ASC" }
⚠️ Forcing target_partitions=1 to preserve sort order
⚠️ (Because we want to get the pure performance benefit of sorted data to compare)
📊 Session config target_partitions: 1
Registering table with sort order: EventTime ASC
Executing: CREATE EXTERNAL TABLE hits STORED AS PARQUET LOCATION '/Users/zhuqi/arrow-datafusion/benchmarks/data/hits_0_sorted.parquet' WITH ORDER ("EventTime" ASC)
Q0: -- Must set for ClickBench hits_partitioned dataset. See apache#16591
-- set datafusion.execution.parquet.binary_as_string = true
SELECT * FROM hits ORDER BY "EventTime" DESC limit 10;
Query 0 iteration 0 took 14.7 ms and returned 10 rows
Query 0 iteration 1 took 10.2 ms and returned 10 rows
Query 0 iteration 2 took 8.7 ms and returned 10 rows
Query 0 iteration 3 took 7.9 ms and returned 10 rows
Query 0 iteration 4 took 7.9 ms and returned 10 rows
Query 0 avg time: 9.85 ms
+ set +x
Done
```
And the main branch result:
```text
Running `/Users/zhuqi/arrow-datafusion/target/release/dfbench clickbench --iterations 5 --path /Users/zhuqi/arrow-datafusion/benchmarks/data/hits_0_sorted.parquet --queries-path /Users/zhuqi/arrow-datafusion/benchmarks/queries/clickbench/queries/sorted_data --sorted-by EventTime --sort-order ASC -o /Users/zhuqi/arrow-datafusion/benchmarks/results/issue_18976/data_sorted_clickbench.json`
Running benchmarks with the following options: RunOpt { query: None, pushdown: false, common: CommonOpt { iterations: 5, partitions: None, batch_size: None, mem_pool_type: "fair", memory_limit: None, sort_spill_reservation_bytes: None, debug: false }, path: "/Users/zhuqi/arrow-datafusion/benchmarks/data/hits_0_sorted.parquet", queries_path: "/Users/zhuqi/arrow-datafusion/benchmarks/queries/clickbench/queries/sorted_data", output_path: Some("/Users/zhuqi/arrow-datafusion/benchmarks/results/issue_18976/data_sorted_clickbench.json"), sorted_by: Some("EventTime"), sort_order: "ASC" }
⚠️ Forcing target_partitions=1 to preserve sort order
⚠️ (Because we want to get the pure performance benefit of sorted data to compare)
📊 Session config target_partitions: 1
Registering table with sort order: EventTime ASC
Executing: CREATE EXTERNAL TABLE hits STORED AS PARQUET LOCATION '/Users/zhuqi/arrow-datafusion/benchmarks/data/hits_0_sorted.parquet' WITH ORDER ("EventTime" ASC)
Q0: -- Must set for ClickBench hits_partitioned dataset. See apache#16591
-- set datafusion.execution.parquet.binary_as_string = true
SELECT * FROM hits ORDER BY "EventTime" DESC limit 10;
Query 0 iteration 0 took 331.1 ms and returned 10 rows
Query 0 iteration 1 took 286.0 ms and returned 10 rows
Query 0 iteration 2 took 283.3 ms and returned 10 rows
Query 0 iteration 3 took 283.8 ms and returned 10 rows
Query 0 iteration 4 took 286.5 ms and returned 10 rows
Query 0 avg time: 294.13 ms
+ set +x
Done
```
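The roughly 30X figure can be checked from the two Q0 averages reported in the logs above (294.13 ms on main, 9.85 ms with the reverse parquet change). A minimal sketch; the `speedup` helper is a hypothetical name introduced here for illustration and is not part of `bench.sh`:

```bash
# Hypothetical helper (not part of bench.sh):
# prints the ratio of a baseline time to an optimized time, to one decimal place.
speedup() {
  awk -v base="$1" -v opt="$2" 'BEGIN { printf "%.1f\n", base / opt }'
}

# Average Q0 times in milliseconds, copied from the two runs above.
speedup 294.13 9.85
```

Both runs force `target_partitions=1`, so the ratio reflects single-partition execution only.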
## Are there any user-facing changes?
<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->
<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->
---------
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
Co-authored-by: Yongting You <2010youy01@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
(cherry picked from commit cde6dfa)
Changed file: `benchmarks/README.md` (38 additions, 0 deletions)
## Sorted Data Benchmarks

### Data Sorted ClickBench

Benchmark for queries on pre-sorted data to test sort order optimization.
This benchmark uses a subset of the ClickBench dataset (hits.parquet, ~14GB) that has been pre-sorted by the EventTime column. The queries are designed to test DataFusion's performance when the data is already sorted, as is common in time-series workloads.

The benchmark includes queries that:
- Scan pre-sorted data with ORDER BY clauses that match the sort order
- Test reverse scans on sorted data
- Verify the performance result

#### Generating Sorted Data

The sorted dataset is automatically generated from the ClickBench partitioned dataset. You can configure the memory used during the sorting process with the `DATAFUSION_MEMORY_GB` environment variable. The default memory limit is 12GB.
```bash
./bench.sh data data_sorted_clickbench
```

To create the sorted dataset, for example with 16GB of memory, run:

```bash
DATAFUSION_MEMORY_GB=16 ./bench.sh data data_sorted_clickbench
```

This command will:
1. Download the ClickBench partitioned dataset if not present
2. Sort hits.parquet by EventTime in ascending order
3. Save the sorted file as hits_sorted.parquet

#### Running the Benchmark

```bash
./bench.sh run data_sorted_clickbench
```

This runs queries against the pre-sorted dataset with the `--sorted-by EventTime` flag, which informs DataFusion that the data is pre-sorted, allowing it to optimize away redundant sort operations.
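
For reference, the benchmark log in the PR description shows the `dfbench clickbench` command line that this step ends up running, including `--sorted-by` and `--sort-order`. The sketch below rebuilds that command line. It assumes a release build and the data and query paths seen in that log, written relative to the repository root (adjust them to your checkout). The function name `sorted_clickbench_cmd` is hypothetical.

```bash
# Hypothetical wrapper: prints the dfbench command line seen in the benchmark log.
# The default paths are assumptions taken from that log, relative to the repo root.
sorted_clickbench_cmd() {
  local data="${1:-benchmarks/data/hits_0_sorted.parquet}"
  local queries="${2:-benchmarks/queries/clickbench/queries/sorted_data}"
  echo "target/release/dfbench clickbench" \
       "--iterations 5" \
       "--path ${data}" \
       "--queries-path ${queries}" \
       "--sorted-by EventTime --sort-order ASC"
}

# Print the command for inspection; pipe the output to `sh` to actually run it.
sorted_clickbench_cmd
```

In the logged run, these flags lead to the table being registered with `WITH ORDER ("EventTime" ASC)` and to `target_partitions` being forced to 1 to preserve the sort order.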