Feature Request / Improvement
Currently, UpdateStatistics (org.apache.iceberg.Transaction#updateStatistics) only allows adding statistics to an existing snapshot.
As a result, it is not possible to publish a snapshot with statistics already collected.
Collecting statistics for existing data is definitely an important use case (like Trino's ANALYZE),
but some query engines (like Trino) can collect stats on the fly while writing to a table (INSERT, CREATE TABLE AS ...).
It's not difficult to:
- publish a data change snapshot (adding new files)
- take note of the new snapshot ID
- add statistics for that snapshot

However, this has some drawbacks:
- new data is published without stats, so other queries can be planned sub-optimally, leading to e.g. improper use of cluster resources, or even unexpected query failures (if the data changed significantly)
- someone may run ANALYZE on the new snapshot (unknowingly or intentionally), so two different threads end up trying to add stats to it, which is wasted work
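
To make the gap concrete, here is a minimal sketch of the publish-then-add-stats sequence described above. It uses a hypothetical in-memory `ToyTable` rather than the Iceberg API; `commitAppend`, `setStatistics`, `statisticsForCurrentSnapshot` and the file names are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

class PublishThenAddStats {

    /** Hypothetical in-memory stand-in for table metadata; NOT the Iceberg API. */
    static final class ToyTable {
        private long currentSnapshotId = 0L;
        private final List<String> dataFiles = new ArrayList<>();
        private final Map<Long, String> statsBySnapshot = new HashMap<>();

        /** Publishes a data change; the new snapshot is visible to readers immediately. */
        long commitAppend(List<String> newDataFiles) {
            dataFiles.addAll(newDataFiles);
            currentSnapshotId += 1;
            return currentSnapshotId;
        }

        /** Attaches a statistics file to an already published snapshot. */
        void setStatistics(long snapshotId, String statsFile) {
            statsBySnapshot.put(snapshotId, statsFile);
        }

        /** What a concurrent reader (for example a query planner) would see. */
        Optional<String> statisticsForCurrentSnapshot() {
            return Optional.ofNullable(statsBySnapshot.get(currentSnapshotId));
        }
    }

    public static void main(String[] args) {
        ToyTable table = new ToyTable();

        // Steps 1 and 2: publish the data change and take note of the new snapshot ID.
        long snapshotId = table.commitAppend(List.of("data-00001.parquet"));

        // In this window the snapshot is readable but carries no stats.
        System.out.println("stats visible after publish: "
                + table.statisticsForCurrentSnapshot().isPresent());

        // Step 3: a second, separate commit adds the stats collected during the write.
        table.setStatistics(snapshotId, "stats-" + snapshotId + ".puffin");
        System.out.println("stats visible after second commit: "
                + table.statisticsForCurrentSnapshot().isPresent());
    }
}
```

Running `main` prints `false` for the first check and `true` for the second. Any reader that plans a query between the two commits sees the new data without stats, which is the window this request aims to close.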
We should make it possible to publish a data change together with its new stats.
This may require API changes.
It may also require spec changes if we want to use the "inherit snapshot ID" model.
(Maybe we don't have to, since stats are in metadata?)
Query engine
None