importinto: scanning a large number of compressed files is slow #64770

@joechenrh

Description

Enhancement

In some cases, each file of the import data is small, so a very large number of files must be scanned; reading about 1 million files can take more than an hour. In the log excerpt below, generating subtasks alone took about 40 minutes.

[2025/11/29 05:37:51.623 +00:00] [INFO] [scheduler.go:309] ["on next subtasks batch"] [keyspaceName=SYSTEM] [task-id=1] [task-key=xxx] [curr-step=init] [next-step=encode] [node-count=18] [table-id=15]
[2025/11/29 06:14:22.097 +00:00] [INFO] [table_import.go:420] ["populate chunks start"] [keyspaceName=SYSTEM] [task-id=1] [task-key=xxx] [curr-step=init] [next-step=encode] [node-count=18] [table-id=15]

There are two things we can improve here:

  • Skip reading the files before submitting the task when global sort is used, since the DXF scheduler will read them again, and the total file size is only used to calculate DXF node resources.
  • Sample a subset of the files to estimate an overall compression ratio, instead of opening every file.
