
parquet: Make page_index/pushdown metrics consistent with row_group metrics#12545

Merged
alamb merged 3 commits into apache:main from progval:page-index-metrics
Sep 22, 2024

@progval (Contributor) commented Sep 20, 2024

Which issue does this PR close?

Closes #12543.
Closes #12544.

What changes are included in this PR?

  1. Rename `{pushdown,page_index}_filtered` to `{pushdown,page_index}_pruned`
  2. Add `{pushdown,page_index}_matched`
  3. Add documentation for existing pushdown-related metrics

Rationale for this change

The new `_matched` metrics make it clearer in EXPLAIN ANALYZE when the Page Index is not checked at all because the corresponding row groups were already eliminated (by a Bloom filter or row group statistics).
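To illustrate the rationale, here is a minimal sketch of why a `pruned`/`matched` pair is more informative than a single `filtered` counter. All names (`PruneCounters`, `RowGroup`, `page_index_counters`) are hypothetical and are not DataFusion's types; the sketch also assumes the counters are measured in rows.

```rust
/// Hypothetical pair of counters mirroring the `_pruned` / `_matched` naming.
#[derive(Debug, Default, PartialEq)]
struct PruneCounters {
    pruned: usize,
    matched: usize,
}

impl PruneCounters {
    /// Record one check over `total` rows, of which `kept` survived.
    fn record(&mut self, total: usize, kept: usize) {
        assert!(kept <= total);
        self.pruned += total - kept;
        self.matched += kept;
    }
}

/// Hypothetical row group description used only for this illustration.
struct RowGroup {
    rows: usize,
    /// True if a Bloom filter or row group statistics already eliminated it.
    eliminated_earlier: bool,
    /// Rows the page index would keep if it were consulted.
    rows_kept_by_page_index: usize,
}

/// Tally page-index counters, skipping row groups eliminated earlier.
fn page_index_counters(groups: &[RowGroup]) -> PruneCounters {
    let mut counters = PruneCounters::default();
    for g in groups {
        if g.eliminated_earlier {
            // The page index is never consulted, so neither counter moves.
            continue;
        }
        counters.record(g.rows, g.rows_kept_by_page_index);
    }
    counters
}

fn main() {
    let never_checked = [RowGroup { rows: 100, eliminated_earlier: true, rows_kept_by_page_index: 0 }];
    let checked_all_kept = [RowGroup { rows: 100, eliminated_earlier: false, rows_kept_by_page_index: 100 }];
    // Both cases have pruned == 0; only `matched` tells them apart.
    println!("{:?}", page_index_counters(&never_checked));
    println!("{:?}", page_index_counters(&checked_all_kept));
}
```

With only a `pruned` counter, both cases would report 0; the `matched` counter shows whether the page index was consulted at all.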

Are these changes tested?

Yes.

Are there any user-facing changes?

New metrics in EXPLAIN ANALYZE, documented in docs/source/user-guide/explain-usage.md

…etrics

1. Rename `{pushdown,page_index}_filtered` to `{pushdown,page_index}_pruned`
2. Add `{pushdown,page_index}_matched`

The latter makes it clearer in EXPLAIN ANALYZE when the Page Index is
not checked because their row groups were already eliminated
(with a Bloom Filter or row group statistics).
@github-actions bot added the documentation (Improvements or additions to documentation) and core (Core DataFusion crate) labels Sep 20, 2024
@alamb added the api change (Changes the API exposed to users of the crate) label Sep 20, 2024
@alamb (Contributor) left a comment

Thank you @progval -- this looks like a very nice improvement to me. I left some small suggestions but I don't think they are required to merge this PR

Comment thread datafusion/core/src/datasource/physical_plan/parquet/page_filter.rs Outdated
}

/// returns the number of rows not skipped in the selection
/// TODO should this be upstreamed to RowSelection?

This looks the same as https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.RowSelection.html#method.row_count

It would be great to upstream this and rows_skipped to parquet -- any chance you are willing to file a ticket to do so?
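For readers following along, a rough sketch of what the two helpers under discussion compute. It uses a local stand-in type (`Selector`, hypothetical) rather than the parquet crate's own types, on the assumption that a selection is a sequence of runs that are each either selected or skipped.

```rust
/// Local stand-in for one run of consecutive rows (hypothetical, not the
/// parquet crate's type): either selected or skipped.
#[derive(Debug, Clone, Copy)]
struct Selector {
    row_count: usize,
    skip: bool,
}

/// Number of rows not skipped in the selection.
fn rows_selected(selection: &[Selector]) -> usize {
    selection
        .iter()
        .filter(|s| !s.skip)
        .map(|s| s.row_count)
        .sum()
}

/// Number of rows skipped in the selection.
fn rows_skipped(selection: &[Selector]) -> usize {
    selection
        .iter()
        .filter(|s| s.skip)
        .map(|s| s.row_count)
        .sum()
}

fn main() {
    // Select 10 rows, skip 25, then select 5 more.
    let selection = [
        Selector { row_count: 10, skip: false },
        Selector { row_count: 25, skip: true },
        Selector { row_count: 5, skip: false },
    ];
    println!(
        "selected={} skipped={}",
        rows_selected(&selection),
        rows_skipped(&selection)
    );
}
```

The two sums together cover every row in the selection, which is what makes them a natural source for a matched/pruned pair of metrics.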

- `SortPreservingMergeExec`
- `output_rows=5`, `elapsed_compute=2.375µs`: Produced the final 5 rows in 2.375µs (microseconds)

When predicate pushdown is enabled, `ParquetExec` gains the following metrics:

❤️

Comment thread docs/source/user-guide/explain-usage.md Outdated
progval and others added 2 commits September 20, 2024 16:54
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
@alamb (Contributor) commented Sep 22, 2024

Thanks again @progval

bgjackma pushed a commit to bgjackma/datafusion that referenced this pull request Sep 25, 2024
…etrics (apache#12545)

* parquet: Make page_index/pushdown metrics consistent with row_group metrics

1. Rename `{pushdown,page_index}_filtered` to `{pushdown,page_index}_pruned`
2. Add `{pushdown,page_index}_matched`

The latter makes it clearer in EXPLAIN ANALYZE when the Page Index is
not checked because their row groups were already eliminated
(with a Bloom Filter or row group statistics).

* Add missing metric definitions in the docs

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* s/pass/select/

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

Development

Successfully merging this pull request may close these issues.

parquet: Pushdown-related metrics are not documented
parquet: page_index/pushdown metrics names are inconsistent with row_group metrics
