importinto: scanning a large number of compressed files is slow #64770

@joechenrh

Description

Enhancement

In some cases, each file of the import data is small, so a very large number of files must be scanned; reading about 1 million files can take more than an hour. In the log excerpt below, generating subtasks alone took about 40 minutes.

[2025/11/29 05:37:51.623 +00:00] [INFO] [scheduler.go:309] ["on next subtasks batch"] [keyspaceName=SYSTEM] [task-id=1] [task-key=xxx] [curr-step=init] [next-step=encode] [node-count=18] [table-id=15]
[2025/11/29 06:14:22.097 +00:00] [INFO] [table_import.go:420] ["populate chunks start"] [keyspaceName=SYSTEM] [task-id=1] [task-key=xxx] [curr-step=init] [next-step=encode] [node-count=18] [table-id=15]

There are two things we can improve here:

  • Skip reading the files before submitting the task when global sort is used, since the DXF scheduler will read them again, and the total file size is only used to calculate DXF node resources.
  • Sample a subset of the files to estimate an overall compression ratio, instead of opening every file.
