
Commit d173103

docs: refresh README with current features and Spark version support

- Add supported Spark versions section linking to compatibility matrix
- Add 'What Comet Accelerates' features list (Parquet, Iceberg, shuffle, expressions, aggregations, joins, windows, metrics)
- Fix heading hierarchy (Benefits demoted to h2 with h3 subsections)
- Expand Getting Started with a concrete Spark config snippet
- Split community links into their own section
- Drop stale hard-coded speedup number; remove self-referential Acknowledgments section

1 parent b80a63d

1 file changed: README.md (60 additions, 30 deletions)
@@ -40,75 +40,110 @@ Apache DataFusion Comet is a high-performance accelerator for Apache Spark, buil
 performance of Apache Spark workloads while leveraging commodity hardware and seamlessly integrating with the
 Spark ecosystem without requiring any code changes.
 
-Comet also accelerates Apache Iceberg, when performing Parquet scans from Spark.
-
 [Apache DataFusion]: https://datafusion.apache.org
 
-# Benefits of Using Comet
+## Supported Spark Versions
+
+Comet supports Apache Spark 3.4 and 3.5, and provides experimental support for Spark 4.0. See the
+[installation guide](https://datafusion.apache.org/comet/user-guide/installation.html) for the detailed
+version, Java, and Scala compatibility matrix.
+
+## What Comet Accelerates
+
+Comet replaces Spark operators and expressions with native Rust implementations that run on Apache DataFusion.
+It uses Apache Arrow for zero-copy data transfer between the JVM and native code.
 
-## Run Spark Queries at DataFusion Speeds
+- **Parquet scans** — native Parquet reader integrated with Spark's query planner
+- **Apache Iceberg** — accelerated Parquet scans when reading Iceberg tables from Spark
+  (see the [Iceberg guide](https://datafusion.apache.org/comet/user-guide/iceberg.html))
+- **Shuffle** — native columnar shuffle with support for hash and range partitioning
+- **Expressions** — hundreds of supported Spark expressions across math, string, datetime, array,
+  map, JSON, hash, and predicate categories
+- **Aggregations** — hash aggregate with support for `FILTER (WHERE ...)` clauses
+- **Joins** — hash join, sort-merge join, and broadcast join
+- **Window functions** — including `LEAD`/`LAG` with `IGNORE NULLS`
+- **Metrics** — Comet metrics are exposed through Spark's external monitoring system
 
-Comet delivers a performance speedup for many queries, enabling faster data processing and shorter time-to-insights.
+For the authoritative lists, see the [supported expressions](https://datafusion.apache.org/comet/user-guide/expressions.html)
+and [supported operators](https://datafusion.apache.org/comet/user-guide/operators.html) pages.
+
+## Benefits of Using Comet
+
+### Run Spark Queries at DataFusion Speeds
+
+Comet delivers a significant performance speedup for many queries, enabling faster data processing and shorter
+time-to-insights.
 
 The following chart shows the time it takes to run the 22 TPC-H queries against 100 GB of data in Parquet format
 using a single executor with 8 cores. See the [Comet Benchmarking Guide](https://datafusion.apache.org/comet/contributor-guide/benchmarking.html)
 for details of the environment used for these benchmarks.
 
-When using Comet, the overall run time is reduced from 687 seconds to 302 seconds, a 2.2x speedup.
-
 ![](docs/source/_static/images/benchmark-results/0.11.0/tpch_allqueries.png)
 
 Here is a breakdown showing relative performance of Spark and Comet for each TPC-H query.
 
 ![](docs/source/_static/images/benchmark-results/0.11.0/tpch_queries_compare.png)
 
-The following charts shows how much Comet currently accelerates each query from the benchmark.
+The following charts show how much Comet currently accelerates each query from the benchmark.
 
-### Relative speedup
+#### Relative speedup
 
 ![](docs/source/_static/images/benchmark-results/0.11.0/tpch_queries_speedup_rel.png)
 
-### Absolute speedup
+#### Absolute speedup
 
 ![](docs/source/_static/images/benchmark-results/0.11.0/tpch_queries_speedup_abs.png)
 
+Results for our benchmark derived from TPC-DS are available in the
+[benchmarking guide](https://datafusion.apache.org/comet/contributor-guide/benchmark-results/tpc-ds.html).
+
 These benchmarks can be reproduced in any environment using the documentation in the
 [Comet Benchmarking Guide](https://datafusion.apache.org/comet/contributor-guide/benchmarking.html). We encourage
 you to run your own benchmarks.
 
-Results for our benchmark derived from TPC-DS are available in the [benchmarking guide](https://datafusion.apache.org/comet/contributor-guide/benchmark-results/tpc-ds.html).
-
-## Use Commodity Hardware
+### Use Commodity Hardware
 
 Comet leverages commodity hardware, eliminating the need for costly hardware upgrades or
-specialized hardware accelerators, such as GPUs or FPGA. By maximizing the utilization of commodity hardware, Comet
+specialized hardware accelerators, such as GPUs or FPGAs. By maximizing the utilization of commodity hardware, Comet
 ensures cost-effectiveness and scalability for your Spark deployments.
 
-## Spark Compatibility
+### Spark Compatibility
 
 Comet aims for 100% compatibility with all supported versions of Apache Spark, allowing you to integrate Comet into
 your existing Spark deployments and workflows seamlessly. With no code changes required, you can immediately harness
 the benefits of Comet's acceleration capabilities without disrupting your Spark applications.
 
-## Tight Integration with Apache DataFusion
+### Tight Integration with Apache DataFusion
 
 Comet tightly integrates with the core Apache DataFusion project, leveraging its powerful execution engine. With
 seamless interoperability between Comet and DataFusion, you can achieve optimal performance and efficiency in your
 Spark workloads.
 
-## Active Community
+## Getting Started
 
-Comet boasts a vibrant and active community of developers, contributors, and users dedicated to advancing the
-capabilities of Apache DataFusion and accelerating the performance of Apache Spark.
+Install Comet by adding the jar for your Spark and Scala version to the Spark classpath and enabling the plugin.
+A typical configuration looks like:
 
-## Getting Started
+```
+--jars /path/to/comet-spark-spark3.5_2.12-<version>.jar \
+--conf spark.plugins=org.apache.spark.CometPlugin \
+--conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
+--conf spark.comet.enabled=true \
+--conf spark.comet.exec.enabled=true \
+--conf spark.comet.exec.shuffle.enabled=true
+```
 
-To get started with Apache DataFusion Comet, follow the
-[installation instructions](https://datafusion.apache.org/comet/user-guide/installation.html). Join the
-[DataFusion Slack and Discord channels](https://datafusion.apache.org/contributor-guide/communication.html) to connect
-with other users, ask questions, and share your experiences with Comet.
+For full installation instructions, published jar downloads, and configuration reference, see the
+[installation guide](https://datafusion.apache.org/comet/user-guide/installation.html) and the
+[configuration reference](https://datafusion.apache.org/comet/user-guide/configs.html).
 
-Follow [Apache DataFusion Comet Overview](https://datafusion.apache.org/comet/about/index.html#comet-overview) to get more detailed information
+Follow the [Apache DataFusion Comet Overview](https://datafusion.apache.org/comet/about/index.html#comet-overview)
+for more detailed information.
+
+## Community
+
+Join the [DataFusion Slack and Discord channels](https://datafusion.apache.org/contributor-guide/communication.html)
+to connect with other users, ask questions, and share your experiences with Comet.
 
 ## Contributing
 
@@ -120,8 +155,3 @@ shaping the future of Comet. Check out our
 ## License
 
 Apache DataFusion Comet is licensed under the Apache License 2.0. See the [LICENSE.txt](LICENSE.txt) file for details.
-
-## Acknowledgments
-
-We would like to express our gratitude to the Apache DataFusion community for their support and contributions to
-Comet. Together, we're building a faster, more efficient future for big data processing with Apache Spark.
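
The Getting Started snippet this commit adds passes the settings as `spark-submit`/`spark-shell` flags. As a sketch of an alternative, the same settings can be kept in `conf/spark-defaults.conf` so every session picks them up; the jar path and `<version>` remain placeholders from the snippet, and `spark.jars` is the standard Spark property corresponding to the `--jars` flag.

```
# conf/spark-defaults.conf — equivalent to the --conf flags in the README snippet
# (jar path and version are placeholders; adjust for your Spark/Scala build)
spark.jars                        /path/to/comet-spark-spark3.5_2.12-<version>.jar
spark.plugins                     org.apache.spark.CometPlugin
spark.shuffle.manager             org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager
spark.comet.enabled               true
spark.comet.exec.enabled          true
spark.comet.exec.shuffle.enabled  true
```

With this in place, `spark-shell` or `spark-submit` can be launched without repeating the flags on the command line.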
