Skip to content

planner: Allow Optimizer to Use Prefix Index for ORDER BY / LIMIT to Avoid Full Scan #65813

@elsa0520

Description

@elsa0520

Optimizer: Support Partial Order TopN Optimization with Prefix Index

Background

Current Limitation

Currently, TiDB's optimizer cannot effectively utilize prefix indexes for ORDER BY ... LIMIT queries. When a query uses ORDER BY on a column with a prefix index, the optimizer either:

  1. Performs a full table scan - Reading all rows, sorting them, and then applying the limit
  2. Ignores the prefix index entirely - Missing the opportunity to leverage the index's partial ordering property

This leads to significant performance issues, especially for large tables with high-cardinality columns like email addresses, URLs, or text descriptions.

Concrete Example

Table Schema:

CREATE TABLE users (
    id INT PRIMARY KEY,
    email VARCHAR(255),
    name VARCHAR(100),
    created_at TIMESTAMP,
    INDEX idx_email_prefix (email(10))  -- Prefix index on first 10 bytes
);

-- Sample data: 10 million rows
INSERT INTO users VALUES
    (1, 'alice@example.com', 'Alice', '2024-01-01'),
    (2, 'alice@company.com', 'Alice Smith', '2024-01-02'),
    (3, 'bob@example.com', 'Bob', '2024-01-03'),
    (4, 'bob@another.com', 'Bob Jones', '2024-01-04'),
    ...
    (10000000, 'zoe@test.com', 'Zoe', '2024-12-31');

Problematic Query:

SELECT * FROM users ORDER BY email LIMIT 10;

Current Execution Plan (Inefficient):

TopN_7                    | 10.00  | root      | offset:0, count:10
└─TableReader_15          | 10.00  | root      | data:TopN_14
  └─TopN_14               | 10.00  | cop[tikv] | offset:0, count:10
    └─TableFullScan_13    | 10000000.00 | cop[tikv] | table:users | keep order:false

Performance Issues:

  • Scans all 10 million rows from storage
  • Sorts all rows in memory (CPU and memory intensive)
  • Network transfer overhead for large dataset
  • Query time: 10+ seconds (depending on data volume)

Why Can't Current Optimizer Use the Prefix Index?

The prefix index idx_email_prefix (email(10)) provides partial ordering:

  • Rows with the same 10-byte prefix (e.g., alice@exam, alice@comp) are grouped together
  • Within each prefix group, rows are unordered
  • The optimizer doesn't recognize this partial ordering property

Real-World Impact

This limitation affects common use cases:

  1. Pagination queries:

    SELECT * FROM users ORDER BY email LIMIT 100 OFFSET 0;
    SELECT * FROM users ORDER BY email LIMIT 100 OFFSET 100;
  2. API responses with sorting:

    SELECT id, email, name FROM products ORDER BY description LIMIT 20;
  3. Admin dashboards showing sorted lists:

    SELECT * FROM logs ORDER BY url LIMIT 50;

All these queries suffer from full table scans when prefix indexes are used.


Proposed Solution: Partial Order TopN Optimization

Key Insight

When a prefix index exists on the ORDER BY column, we can leverage its partial ordering property:

  1. The prefix index provides global ordering by prefix
  2. For TopN queries, we only need to read N + X rows, where:
    • N = desired result count
    • X = extra rows with the same prefix as the Nth row

This avoids reading the entire table while still guaranteeing correct results.

Example of Partial Ordering

With idx_email_prefix (email(10)):

Prefix (10 bytes) | Full Email            | Logical Group
------------------|-----------------------|---------------
alice@exam        | alice@example.com     | Group A (2 rows)
alice@exam        | alice@example.org     | 
alice@comp        | alice@company.com     | Group B (1 row)
bob@exampl        | bob@example.com       | Group C (3 rows)
bob@exampl        | bob@example.org       |
bob@exampl        | bob@example.net       |
carol@test        | carol@test.com        | Group D (1 row)
...

For LIMIT 3:

  • Without optimization: Read all rows (millions)
  • With optimization: Read Group A + Group B3 + X rows (X = group size variance)

Optimized Execution Plan

Query:

SELECT * FROM users ORDER BY email LIMIT 10;

New Execution Plan (Two-Layer Mode):

TopN_7                                     | 10.00  | root      | order by email, offset:0, count:10, partial_order_limit:10, prefix_col_id:2, prefix_len:10
└─IndexLookUp_14                           | 10.00  | root      | 
  ├─Limit_13(Build)                        | 10.00  | cop[tikv] | count:10, partial_order_limit:10, prefix_col_id:2, prefix_len:10
  │ └─IndexRangeScan_11                    | 10.00  | cop[tikv] | table:users, index:idx_email_prefix(email), range:[NULL,+inf], keep order:true
  └─TableRowIDScan_12(Probe)               | 10.00  | cop[tikv] | table:users, keep order:false

Key Improvements:

  1. IndexRangeScan with KeepOrder - Reads index in sorted order
  2. Partial Ordered Limit on TiKV - Stops reading after N+X rows based on prefix matching
  3. Partial Ordered TopN on TiDB - Final sorting and short-circuiting based on prefix

Performance Gains:

  • Scans ~10-20 rows instead of 10 million rows
  • Query time: < 10ms (1000x improvement)
  • Reduced network transfer
  • Reduced memory usage

Algorithm Details

Short-Circuit Logic:

Read rows in index order:
  For each row R:
    If result set size < N:
      Add R to result set
    Else if R.prefix == result[N-1].prefix:
      Add R to result set (extra row X)
    Else:
      STOP reading (short-circuit)
      
Sort result set by full column value
Return top N rows

Example Execution:

SELECT * FROM users ORDER BY email LIMIT 3;
  1. Read from idx_email_prefix in order:

    Row 1: alice@example.com   (prefix: alice@exam) → Add to result [1]
    Row 2: alice@example.org   (prefix: alice@exam) → Add to result [2]
    Row 3: alice@company.com   (prefix: alice@comp) → Add to result [3]
    Row 4: bob@example.com     (prefix: bob@exampl) → STOP (different prefix, N=3 reached)
    
  2. Sort 3 rows by full email:

    alice@company.com
    alice@example.com
    alice@example.org
    
  3. Return top 3 results

Statistics-Based Estimation:

The optimizer estimates X using:

  • NDV (number of distinct values) in index statistics
  • TopN statistics (most frequent prefixes)
  • Safety factor: cap at 10% of total rows (configurable)

Two-Layer vs One-Layer Plans

Two-Layer Mode (when pushdown is possible):

  • TiKV Layer: Partial ordered Limit (with prefix-based short-circuit)
  • TiDB Layer: Partial ordered TopN (final sorting + short-circuit)
  • Condition: No Selection filters on table scan

One-Layer Mode (when pushdown is not possible):

  • TiDB Layer: Only partial ordered TopN
  • Condition: Selection filters exist on table scan (may reorder rows)

Benefits

1. Performance Improvement

For queries with prefix indexes on ORDER BY columns:

  • 100-1000x faster for large tables (millions of rows)
  • Reduced I/O from O(N) table scan to O(LIMIT + X)
  • Lower memory usage (sorting 10s of rows instead of millions)

2. Better Resource Utilization

  • Network bandwidth: Transfer ~10 rows instead of millions
  • CPU: Sort ~10 rows instead of millions
  • Memory: Buffer ~10 rows instead of millions

3. Backward Compatible

  • Controlled by system variable tidb_opt_partial_ordered_index_for_topn (default OFF)
  • No impact on existing queries when disabled
  • Gradually enable for specific workloads

4. Cost-Based Decision

  • Optimizer estimates X using statistics
  • Falls back to normal plan if estimated cost is higher
  • Configurable via tidb_opt_partial_ordered_max_limit_percent

Use Cases

1. Pagination APIs

-- First page
SELECT * FROM products ORDER BY name LIMIT 20 OFFSET 0;
-- Second page
SELECT * FROM products ORDER BY name LIMIT 20 OFFSET 20;

2. Admin Dashboards

-- Show recent users by email
SELECT id, email, created_at FROM users ORDER BY email LIMIT 100;

-- Show logs sorted by URL
SELECT * FROM access_logs ORDER BY url LIMIT 50;

3. Prefix Index Use Cases

Prefix indexes are commonly used for:

  • Long VARCHAR columns (email, URL, description)
  • Reducing index size while maintaining sort capability
  • Balancing storage vs query performance

This optimization makes prefix indexes much more practical for ORDER BY queries.

Metadata

Metadata

Assignees

Labels

sig/plannerSIG: Plannertype/enhancementThe issue or PR belongs to an enhancement.

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions