perf: optimize V1 planInputPartitions() with batch API and eliminate S3 traversal #62
Open
Thor-ChenBiao wants to merge 1 commit into zilliztech:main
Conversation
## Problem
planInputPartitions() runs single-threaded on the Spark driver. For N segments it issues:
- N HTTP calls to fetch segment info (one per segment)
- 2N S3 listStatus calls to traverse binlog directories
## Solution
### 1. Batch Segment Info API
- Add `getSegmentsInfoBatch()` in MilvusClient.scala
- Single HTTP call fetches all segment insertLogIDs at once
- N HTTP calls → 1 HTTP call
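A minimal sketch of the batching idea: instead of sending one request per segment ID, collapse all IDs into a single request body. The object and method names here are hypothetical stand-ins (the real method is `getSegmentsInfoBatch()` in MilvusClient.scala; its request/response shapes are not shown in this PR description):

```scala
// Hypothetical sketch: one batched request replaces N single-segment requests.
object BatchSegmentInfoSketch {
  // Stand-in response shape: segmentID -> insertLogIDs.
  type SegmentInfoMap = Map[Long, Seq[String]]

  // Build a single request body carrying every segment ID at once.
  // The JSON field names are assumptions, not the actual Milvus API schema.
  def buildBatchRequest(collection: String, segmentIDs: Seq[Long]): String =
    s"""{"collectionName":"$collection","segmentIDs":[${segmentIDs.mkString(",")}]}"""
}
```

Whatever the exact wire format, the driver-side cost model is the point: one round trip amortized over all N segments instead of N sequential round trips.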
### 2. Eliminate S3 Traversal
Before: even though the API already returned insertLogIDs, the code still traversed S3 to "verify" them:

```
fs.listStatus(segmentPath)   // 1st S3 call: list field directories
  → fs.listStatus(fieldPath) // 2nd S3 call: list binlog files per field
  → filter by insertLogIDs   // then filter results
```
After: trust the API response and build paths directly via string concatenation:

```
insertLogIDs: ["100/123456", "101/789012"]
  ↓ split("/")
fieldID: "100", logID: "123456"
  ↓ concat
fullPath: s"${rootPath}/${fieldID}/${logID}"
```
Result: 2N S3 calls → 0 S3 calls
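The split-then-concat step above can be sketched as a pure function (the object name is a hypothetical stand-in; the PR's actual helper is `buildFieldMapFromLogInfo()`):

```scala
// Sketch of building a binlog path from an insertLogID, with no S3 calls.
object PathFromLogIDSketch {
  // insertLogID entries look like "fieldID/logID", e.g. "100/123456".
  def binlogPath(rootPath: String, insertLogID: String): String = {
    val Array(fieldID, logID) = insertLogID.split("/")
    s"$rootPath/$fieldID/$logID"
  }
}
```

Because the API is the source of truth for which binlogs exist, the listing round trips add no information: the path is fully determined by `rootPath` and the returned ID pair.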
### 3. Parallel Segment Processing
- Use .par.foreach with a ConcurrentHashMap for thread-safe parallel processing
- Note: after the batch API optimization this has minimal impact, since the remaining work is pure memory/CPU (no I/O)
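The PR itself uses `.par.foreach`; the sketch below shows the same pattern (concurrent workers writing into a shared ConcurrentHashMap) with the standard `Future` API instead, since `.par` requires the separate scala-parallel-collections module on Scala 2.13+. All names and the per-segment work are illustrative:

```scala
import java.util.concurrent.ConcurrentHashMap
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global
import scala.jdk.CollectionConverters._

object ParallelSegmentSketch {
  // Process segments concurrently; ConcurrentHashMap makes the shared
  // write side thread-safe without explicit locking.
  def processSegments(segments: Seq[(String, Seq[String])]): Map[String, Int] = {
    val results = new ConcurrentHashMap[String, Int]()
    val work = Future.traverse(segments) { case (segmentID, logIDs) =>
      Future {
        // Stand-in for the real per-segment work (here: count binlogs).
        results.put(segmentID, logIDs.size)
      }
    }
    Await.result(work, 30.seconds)
    results.asScala.toMap
  }
}
```

As the note says, once the HTTP and S3 round trips are gone this parallelism mostly overlaps CPU-bound map-building, so the win is small compared to the batch API change.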
## Changes
- MilvusClient.scala: Add getSegmentsInfoBatch() method
- MilvusDataSource.scala:
  - Batch fetch at start of planInputPartitions()
  - New buildFieldMapFromLogInfo() builds paths without S3
  - Remove SegmentInfoCache (no longer needed with batch API)
  - Fail-fast on batch API failure (no silent fallback)
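The fail-fast point deserves a sketch: if the batch call fails, the error should surface immediately rather than silently falling back to the old N-calls-plus-S3 path. The shape below is hypothetical (the PR's actual error handling is not shown):

```scala
// Hypothetical fail-fast shape: propagate batch-API errors, never fall back.
object FailFastSketch {
  type SegmentInfoMap = Map[Long, Seq[String]]

  def segmentsInfoOrFail(fetch: () => Either[String, SegmentInfoMap]): SegmentInfoMap =
    fetch() match {
      case Right(info) => info
      case Left(err)   =>
        // Surfacing the failure keeps behavior predictable; a silent fallback
        // would hide regressions and reintroduce the 2N S3 calls.
        throw new IllegalStateException(s"batch segment info failed: $err")
    }
}
```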