perf: optimize V1 planInputPartitions() with batch API and eliminate S3 traversal #62

Open
Thor-ChenBiao wants to merge 1 commit into zilliztech:main from Thor-ChenBiao:cbMain
@Thor-ChenBiao
Collaborator

Problem

planInputPartitions() runs single-threaded on the Spark driver. For N segments, it performs:

  • N HTTP calls to get segment info (one per segment)
  • 2N S3 listStatus calls to traverse binlog directories

Solution

1. Batch Segment Info API

  • Add getSegmentsInfoBatch() in MilvusClient.scala
  • Single HTTP call fetches all segment insertLogIDs at once
  • N HTTP calls → 1 HTTP call
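
As a rough illustration of the batch-fetch idea above, here is a minimal sketch. The `SegmentInfo` shape, the `SegmentApi` trait, the in-memory stand-in, and the signature given to `getSegmentsInfoBatch` are all assumptions made for the example; the real method in MilvusClient.scala may look different. The sketch counts round trips and fails fast if the batch response is missing a segment.

```scala
// Hypothetical shape of one segment's metadata; field names are assumptions.
final case class SegmentInfo(segmentID: Long, insertLogIDs: Seq[String])

// Hypothetical client abstraction; the signature is assumed, not taken from the PR.
trait SegmentApi {
  def getSegmentsInfoBatch(segmentIDs: Seq[Long]): Map[Long, SegmentInfo]
}

// In-memory stand-in for the HTTP client that counts round trips.
final class InMemoryApi(data: Map[Long, Seq[String]]) extends SegmentApi {
  var httpCalls: Int = 0

  def getSegmentsInfoBatch(segmentIDs: Seq[Long]): Map[Long, SegmentInfo] = {
    httpCalls += 1 // one round trip regardless of how many segments are requested
    segmentIDs.flatMap(id => data.get(id).map(logs => id -> SegmentInfo(id, logs))).toMap
  }
}

object Planner {
  // Fetch every segment's info in one call and fail fast if any segment is missing,
  // rather than silently falling back to per-segment requests.
  def fetchAll(api: SegmentApi, segmentIDs: Seq[Long]): Map[Long, SegmentInfo] = {
    val infos = api.getSegmentsInfoBatch(segmentIDs)
    val missing = segmentIDs.filterNot(id => infos.contains(id))
    if (missing.nonEmpty)
      throw new IllegalStateException(s"Batch segment info missing for: ${missing.mkString(", ")}")
    infos
  }
}
```

With this shape, planning N segments costs one call to `getSegmentsInfoBatch` rather than N per-segment requests.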

2. Eliminate S3 Traversal

Before: although the API already returned insertLogIDs, the code still traversed S3 to "verify" them:
fs.listStatus(segmentPath)     // 1st S3 call: list field directories
  → fs.listStatus(fieldPath)   // 2nd S3 call: list binlog files per field
    → filter by insertLogIDs   // then filter results

After: trust the API response and build paths directly via string concatenation:
insertLogIDs: ["100/123456", "101/789012"]
                    ↓ split("/")
fieldID: "100", logID: "123456"
                    ↓ concat
fullPath: s"${rootPath}/${fieldID}/${logID}"

Result: 2N S3 calls → 0 S3 calls
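
The path construction described above could look roughly like the sketch below. `BinlogPaths` and `buildFieldMap` are hypothetical names for illustration (the PR's actual helper is `buildFieldMapFromLogInfo()`, whose signature is not shown here), and rejecting malformed entries is an assumption in keeping with the fail-fast approach.

```scala
object BinlogPaths {
  // Build fieldID -> full binlog paths purely from the API's insertLogIDs,
  // without listing anything on S3. Each entry is expected to be "fieldID/logID".
  def buildFieldMap(rootPath: String, insertLogIDs: Seq[String]): Map[String, Seq[String]] = {
    val root = rootPath.stripSuffix("/") // avoid a double slash if rootPath ends with "/"
    val pairs = insertLogIDs.map { entry =>
      entry.split("/") match {
        case Array(fieldID, logID) if fieldID.nonEmpty && logID.nonEmpty =>
          fieldID -> s"$root/$fieldID/$logID"
        case _ =>
          // Assumption: malformed entries are an error rather than silently skipped.
          throw new IllegalArgumentException(s"Malformed insertLogID: '$entry'")
      }
    }
    // groupBy + map is used instead of groupMap so the sketch also compiles on Scala 2.12.
    pairs.groupBy(_._1).map { case (fieldID, group) => fieldID -> group.map(_._2) }
  }
}
```

For the example IDs above, field "100" maps to a single path ending in `/100/123456`, and no `fs.listStatus` call is involved.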

3. Parallel Segment Processing

  • Use .par.foreach with a ConcurrentHashMap for thread-safe parallel processing
  • Note: after the batch API optimization, this has minimal impact, since the remaining operations are pure in-memory/CPU work (no I/O)
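
For reference, the thread-safe pattern of concurrent workers writing into a shared ConcurrentHashMap can be sketched as follows. Note the substitution: this sketch uses plain `Future`s instead of `.par.foreach`, so it runs without the separate parallel-collections module that Scala 2.13 requires. `ParallelSegments`, `processAll`, and the `work` function are hypothetical names, not code from this PR.

```scala
import java.util.concurrent.ConcurrentHashMap
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ParallelSegments {
  // Run `work` for every segment ID concurrently and collect the results
  // in a ConcurrentHashMap, which is safe to write from multiple threads.
  def processAll[A](segmentIDs: Seq[Long])(work: Long => A): ConcurrentHashMap[Long, A] = {
    val results = new ConcurrentHashMap[Long, A]()
    val tasks = segmentIDs.map { id =>
      Future { results.put(id, work(id)); () }
    }
    // Block until every task has finished; a failed task rethrows here (fail-fast).
    Await.result(Future.sequence(tasks), 30.seconds)
    results
  }
}
```

Because the per-segment work is now cheap in-memory path building, the gain from running it in parallel is expected to be small, as the note above says.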

Changes

  • MilvusClient.scala: Add getSegmentsInfoBatch() method
  • MilvusDataSource.scala:
    • Batch fetch at start of planInputPartitions()
    • New buildFieldMapFromLogInfo() builds paths without S3
    • Remove SegmentInfoCache (no longer needed with batch API)
    • Fail-fast on batch API failure (no silent fallback)

@sre-ci-robot
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Thor-ChenBiao
To complete the pull request process, please assign xiaofan-luan after the PR has been reviewed.
You can assign the PR to them by writing /assign @xiaofan-luan in a comment when ready.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sre-ci-robot
Collaborator

Welcome @Thor-ChenBiao! It looks like this is your first PR to zilliztech/spark-milvus 🎉
