
feat: pushdown OFFSET to parquet for RG-level skipping#21828

Open
zhuqi-lucas wants to merge 1 commit into apache:main from zhuqi-lucas:feat/offset-pushdown

Conversation

Contributor

@zhuqi-lucas commented Apr 24, 2026

Which issue does this PR close?

Closes #19654

Rationale for this change

SELECT * FROM table LIMIT 5 OFFSET 59000000 on a 60M-row parquet file takes 182ms because DataFusion reads 59M+ rows then discards them in GlobalLimitExec. The parquet reader has no knowledge of the offset.

What changes are included in this PR?

Push OFFSET from GlobalLimitExec down to the parquet reader, enabling RG-level skipping.

Architecture

                BEFORE                              AFTER (no WHERE)
                ======                              ================

  GlobalLimitExec(skip=59M, fetch=5)          DataSourceExec(limit=59M+5, offset=59M)
    │                                           │
    ▼                                           ▼ prune_by_offset
  DataSourceExec(limit=59M+5)                 Skip 590 RGs (zero I/O)
    │                                           │
    ▼                                           ▼ RowSelection
  Read 59M+ rows from parquet               Skip remaining offset rows
  (decode, decompress all)                      │
    │                                           ▼ effective_limit = 5
    ▼                                         Read 5 rows from 1 RG
  Discard 59M rows in GlobalLimitExec           │
    │                                           ▼
    ▼                                         Return 5 rows
  Return 5 rows
  
  Time: 182ms                                 Time: <1ms


                AFTER (with WHERE)
                ==================

  GlobalLimitExec(skip=59M, fetch=5)    ← kept for correctness
    │
    ▼
  DataSourceExec(limit=59M+5, offset=59M)
    │
    ▼ prune_by_offset
  Skip fully-matched RGs (WHERE stats
  prove all rows match → row count exact)
    │
    ▼
  Read remaining RGs (with WHERE filter)
    │
    ▼
  GlobalLimitExec handles remaining skip

Two modes based on filter presence

| Scenario | GlobalLimitExec | Optimization | Performance |
|---|---|---|---|
| No WHERE | Eliminated | RG prune + RowSelection + effective_limit | <1ms |
| WHERE + fully matched RGs | Kept | Skip fully-matched RGs | Partial improvement |
| WHERE + non-fully matched | Kept | No skip (unsafe) | No change |
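The mode selection above can be sketched as follows. This is an illustrative stand-in, not DataFusion's actual types: `Scan`, `offset_fully_handled`, and `keep_global_limit` here are free functions modeling the decision, while in the PR they are methods on `ExecutionPlan`/`FileScanConfig`.

```rust
// Hedged sketch of the two-mode decision; `Scan` is a hypothetical stand-in.
#[derive(Clone, Copy)]
struct Scan {
    supports_offset: bool, // parquet: true; CSV/JSON: false
    has_filter: bool,      // WHERE clause pushed into the scan
}

/// True when the scan can account for every skipped row itself
/// (parquet, no WHERE), so GlobalLimitExec may be eliminated.
fn offset_fully_handled(scan: Scan, offset: usize) -> bool {
    offset > 0 && scan.supports_offset && !scan.has_filter
}

/// GlobalLimitExec must stay whenever the scan handles the offset
/// only partially (filters present) or not at all (CSV/JSON).
fn keep_global_limit(scan: Scan, offset: usize) -> bool {
    !offset_fully_handled(scan, offset)
}

fn main() {
    let parquet_no_where = Scan { supports_offset: true, has_filter: false };
    let parquet_where = Scan { supports_offset: true, has_filter: true };
    let csv = Scan { supports_offset: false, has_filter: false };

    assert!(!keep_global_limit(parquet_no_where, 59_000_000)); // eliminated
    assert!(keep_global_limit(parquet_where, 59_000_000));     // kept
    assert!(keep_global_limit(csv, 59_000_000));               // kept
}
```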

Implementation details

  1. FileSource::supports_offset() — trait method, parquet returns true, CSV/JSON return false
  2. ExecutionPlan::offset_fully_handled() — returns true when offset is fully handled (parquet + no WHERE), optimizer uses this to decide whether to eliminate GlobalLimitExec
  3. FileScanConfig.offset — new field, set via with_offset() (only accepted when supports_offset())
  4. LimitPushdown optimizer — pushes offset to DataSourceExec. If offset_fully_handled() → eliminate GlobalLimitExec. Otherwise keep it.
  5. prune_by_offset() — skips leading fully-matched RGs whose cumulative rows fall within offset
  6. RowSelection — for partial RG skip (remaining offset within first surviving RG, no-filter case only)
  7. effective_limit — decoder reads only fetch rows (limit - offset, no-filter case only)
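The cumulative-row-count logic behind steps 5–6 can be sketched as below. The signature and return shape are illustrative, not the PR's actual API (the real `prune_by_offset` operates on row-group metadata and records metrics):

```rust
/// Hypothetical sketch: given each row group's row count in file order and
/// the query OFFSET, return how many leading row groups can be skipped
/// entirely (zero I/O) and how many offset rows remain to be skipped
/// inside the first surviving group (via RowSelection).
fn prune_by_offset(row_group_rows: &[u64], offset: u64) -> (usize, u64) {
    let mut remaining = offset;
    let mut skipped = 0;
    for &rows in row_group_rows {
        if remaining >= rows {
            // The whole row group falls inside the offset: never read it.
            remaining -= rows;
            skipped += 1;
        } else {
            break; // partial skip is handled by a RowSelection
        }
    }
    (skipped, remaining)
}

fn main() {
    // 6 row groups of 100k rows; OFFSET 250k skips 2 full groups and
    // leaves 50k rows for RowSelection. With LIMIT 5 the decoder then
    // reads only effective_limit = 5 rows from the surviving group.
    let rg = [100_000u64; 6];
    assert_eq!(prune_by_offset(&rg, 250_000), (2, 50_000));
    assert_eq!(prune_by_offset(&rg, 0), (0, 0));
}
```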

Benchmark (60M rows, 1.5GB parquet, single partition)

| OFFSET | Before | After | Speedup |
|---|---|---|---|
| 0 | 3ms | 2ms | |
| 1M | 4ms | 1ms | 4x |
| 30M | 98ms | 1ms | 98x |
| 59M | 182ms | <1ms | >182x |

Are these changes tested?

  • 6 unit tests for prune_by_offset (boundary, partial, non-fully-matched, zero, exceeds total, exact)
  • SLT Test N (10 sub-tests): EXPLAIN plan verification, correctness across RG boundaries, WHERE interaction, multi-partition, boundary cases
  • Updated explain_analyze, push_down_filter_parquet, limit_pruning SLTs for new offset_pruned_row_groups metric
  • All 463 SLT files pass (except pre-existing encrypted_parquet upstream bug)
  • All limit.slt tests pass (CSV/JSON offset is still handled by GlobalLimitExec, since their supports_offset() returns false)

Are there any user-facing changes?

Performance only — faster OFFSET queries on parquet. No API changes visible to SQL users.

@github-actions bot added the optimizer, sqllogictest, datasource, and physical-plan labels on Apr 24, 2026
Push OFFSET from GlobalLimitExec down to DataSourceExec/ParquetOpener.
Skips entire row groups + uses RowSelection for partial RG skip.

Two modes based on filter presence:
- No WHERE: offset fully handled, GlobalLimitExec eliminated
  (RG prune + RowSelection + effective_limit adjustment)
- With WHERE: offset as hint, GlobalLimitExec kept for correctness
  (only fully-matched RGs skipped, remaining offset by GlobalLimitExec)

Implementation:
- FileSource::supports_offset() trait method (parquet=true)
- ExecutionPlan::offset_fully_handled() for optimizer decision
- FileScanConfig: offset field + with_offset
- LimitPushdown: push offset, eliminate GlobalLimitExec when fully handled
- prune_by_offset: skip leading fully-matched RGs by cumulative row count
- RowSelection for remaining offset (no-filter case only)
- effective_limit = limit - offset (no-filter case only)

Benchmark (60M row, 1.5GB parquet, single partition):
  OFFSET 0:    3ms → 2ms
  OFFSET 1M:   4ms → 1ms (4x)
  OFFSET 30M:  98ms → 1ms (98x)
  OFFSET 59M:  182ms → <1ms (>182x)
@zhuqi-lucas force-pushed the feat/offset-pushdown branch from 39621f4 to 60c508d on April 24, 2026 08:40
@zhuqi-lucas marked this pull request as ready for review on April 24, 2026 12:36

Copilot AI left a comment


Pull request overview

This PR improves performance for LIMIT .. OFFSET .. queries on Parquet by pushing OFFSET down into the Parquet scan so it can skip entire row groups (and partially skip within a row group via RowSelection), avoiding reading and discarding large numbers of rows in GlobalLimitExec.

Changes:

  • Introduces offset pushdown plumbing (with_offset, offset, offset_fully_handled) across ExecutionPlan, DataSource, FileSource, and FileScanConfig.
  • Implements Parquet-side offset handling (row-group pruning + optional RowSelection) and adds an offset_pruned_row_groups metric.
  • Adds/updates SLT coverage and expected EXPLAIN/metrics output to reflect offset pushdown behavior.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 4 comments.

| File | Description |
|---|---|
| datafusion/sqllogictest/test_files/sort_pushdown.slt | Adds SLT "Test N" covering OFFSET pushdown behavior and correctness checks. |
| datafusion/sqllogictest/test_files/push_down_filter_parquet.slt | Updates expected EXPLAIN ANALYZE metrics to include offset_pruned_row_groups. |
| datafusion/sqllogictest/test_files/limit_pruning.slt | Updates expected metrics to include offset_pruned_row_groups. |
| datafusion/sqllogictest/test_files/explain_analyze.slt | Updates expected metrics to include offset_pruned_row_groups. |
| datafusion/sqllogictest/test_files/dynamic_filter_pushdown_config.slt | Updates expected metrics to include offset_pruned_row_groups. |
| datafusion/physical-plan/src/execution_plan.rs | Adds offset-related extension points to ExecutionPlan. |
| datafusion/physical-optimizer/src/limit_pushdown.rs | Extends limit pushdown rule to attempt offset pushdown and (sometimes) remove GlobalLimitExec. |
| datafusion/datasource/src/source.rs | Wires offset methods through DataSource and DataSourceExec. |
| datafusion/datasource/src/file_scan_config/mod.rs | Adds offset to FileScanConfig + builder and implements offset pushdown behavior. |
| datafusion/datasource/src/file.rs | Adds supports_offset() to FileSource. |
| datafusion/datasource-parquet/src/source.rs | Marks Parquet as supporting offset pushdown and propagates offset into morselizer config. |
| datafusion/datasource-parquet/src/row_group_filter.rs | Adds prune_by_offset and unit tests for row-group pruning by offset. |
| datafusion/datasource-parquet/src/opener.rs | Applies offset pruning and row selection during Parquet open; adjusts effective limit when offset is fully handled. |
| datafusion/datasource-parquet/src/metrics.rs | Adds offset_pruned_row_groups metric. |


Comment on lines 270 to +285
```diff
 if global_skip > 0 {
-    add_global_limit(
-        plan_with_preserve_order,
-        global_skip,
-        Some(global_fetch),
-    )
+    // Push offset to the plan. If the plan fully handles
+    // offset (e.g. parquet without WHERE), eliminate
+    // GlobalLimitExec. Otherwise keep it for remaining skip.
+    if let Some(plan_with_offset) =
+        plan_with_preserve_order.with_offset(global_skip)
+    {
+        if plan_with_offset.offset_fully_handled() {
+            plan_with_offset
+        } else {
+            add_global_limit(
+                plan_with_offset,
+                global_skip,
+                Some(global_fetch),
+            )
+        }
```

Copilot AI Apr 24, 2026


When with_offset(global_skip) returns Some but offset_fully_handled() is false, this code still keeps GlobalLimitExec with skip=global_skip while also pushing the same offset into the child plan. If the child actually skips any rows (e.g. parquet skipping fully-matched row groups), the offset will be applied twice and results will be incorrect. Consider only calling/using with_offset when the plan can fully handle the offset (and otherwise leave the child unchanged), or alternatively adjust the GlobalLimitExec skip to reflect only the remaining offset actually not handled by the child (requires a way to compute/report it).

Comment on lines +889 to +893
```rust
// Offset is fully handled when set AND no filter —
// raw row counts are accurate for offset calculation.
// With filters, only fully-matched RGs can be skipped,
// GlobalLimitExec handles the rest.
self.offset.is_some() && self.file_source.filter().is_none()
```

Copilot AI Apr 24, 2026


offset_fully_handled() currently returns true whenever offset is set and there is no filter, but it does not account for multi-partition scans. FileStream applies limit per output partition, so removing GlobalLimitExec when file_groups.len() > 1 can yield more than fetch rows (and make OFFSET/LIMIT semantics depend on partitioning). Consider requiring a single output partition (e.g. self.file_groups.len() == 1 / output_partitioning().partition_count() == 1) before reporting the offset as fully handled.

Suggested change
```diff
-// Offset is fully handled when set AND no filter —
-// raw row counts are accurate for offset calculation.
-// With filters, only fully-matched RGs can be skipped,
-// GlobalLimitExec handles the rest.
-self.offset.is_some() && self.file_source.filter().is_none()
+// Offset is fully handled only when set, no filter is present,
+// and the scan has a single output partition.
+//
+// With filters, only fully-matched RGs can be skipped and
+// GlobalLimitExec handles the rest. Likewise, multi-partition
+// scans must retain global limit enforcement because FileStream
+// applies limits per output partition.
+self.offset.is_some()
+    && self.file_source.filter().is_none()
+    && self.file_groups.len() == 1
```

Comment on lines +1108 to +1122
```rust
// Prune by offset: skip leading fully-matched row groups that fall
// entirely within the offset, so they are never read from disk.
let remaining_offset = if let Some(offset) = prepared.offset {
    if offset > 0 {
        Some(row_groups.prune_by_offset(
            offset,
            prepared.predicate.is_some(),
            rg_metadata,
            &prepared.file_metrics,
        ))
    } else {
        None
    }
} else {
    None
```

Copilot AI Apr 24, 2026


Offset pruning is applied even when prepared.predicate.is_some() (via prune_by_offset(..., prepared.predicate.is_some(), ...)). Since the optimizer keeps GlobalLimitExec when predicates exist (offset not fully handled), skipping any fully-matched leading row groups here will change the stream output and can cause the global skip to be applied twice (wrong results when the offset spans ≥1 fully-matched row group). Consider only performing offset-based row group pruning / selection when offset is fully handled within parquet (currently the predicate.is_none() case), or otherwise ensure the remaining global skip is reduced accordingly.

Suggested change
```diff
-// Prune by offset: skip leading fully-matched row groups that fall
-// entirely within the offset, so they are never read from disk.
-let remaining_offset = if let Some(offset) = prepared.offset {
-    if offset > 0 {
-        Some(row_groups.prune_by_offset(
-            offset,
-            prepared.predicate.is_some(),
-            rg_metadata,
-            &prepared.file_metrics,
-        ))
-    } else {
-        None
-    }
-} else {
-    None
+// Prune by offset only when parquet can fully handle it. When a
+// predicate exists, the optimizer keeps a global skip/offset above
+// parquet, so pruning fully-matched leading row groups here would
+// change the stream seen by that operator and can apply the offset
+// twice.
+let remaining_offset = match (prepared.offset, prepared.predicate.is_none()) {
+    (Some(offset), true) if offset > 0 => Some(row_groups.prune_by_offset(
+        offset,
+        false,
+        rg_metadata,
+        &prepared.file_metrics,
+    )),
+    _ => None,
```

```
8 80
9 90
10 100
```


Copilot AI Apr 24, 2026


The WHERE-clause coverage here uses OFFSET 2, which does not exercise the case where the offset spans at least one fully-matched row group under a predicate (the scenario that can break if offset is both pushed into parquet and also applied by GlobalLimitExec). Consider adding a variant like WHERE value > 50 LIMIT 3 OFFSET 7 (with 5-row row groups) to ensure correctness when the offset crosses a fully-matched row group boundary under filtering.

Suggested change
```
# Test N.9b: OFFSET with WHERE clause crossing a fully-matched row-group boundary
# Ensures correctness when OFFSET may be pushed into parquet and also applied by GlobalLimitExec
query II
SELECT * FROM tn_offset WHERE value > 50 LIMIT 3 OFFSET 7;
----
13 130
14 140
15 150
```



Development

Successfully merging this pull request may close these issues.

Poor performance of parquet pushdown with offset
