DELTA_FILE_PATTERN regex is incorrectly matching tmp commit files #2201

@echai58

Environment

Delta-rs version: 0.15.3

Binding: python


Bug

What happened:

The regex at https://github.com/delta-io/delta-rs/blob/main/crates/core/src/kernel/snapshot/log_segment.rs#L33 (used here) is not anchored to match the entire string, so a tmp commit file can match it if its name contains (some numbers).json.tmp.
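A minimal sketch of the failure mode, written in Python rather than the Rust used by delta-rs. The pattern `\d+\.json` is an assumption about what DELTA_FILE_PATTERN looks like on the linked line. What matters is that it has no anchors and is applied as a substring search. `looks_like_commit` is a hypothetical helper, not a delta-rs function.

```python
import re

# Assumed shape of the pattern: no ^ or $ anchors, so matching is a
# substring search rather than a whole-name match.
UNANCHORED = re.compile(r"\d+\.json")


def looks_like_commit(file_name: str) -> bool:
    """Hypothetical stand-in for an unanchored commit-file check."""
    return UNANCHORED.search(file_name) is not None


real_commit = "00000000000000000012.json"
# Tmp file name taken from this issue; the random id ends in digits.
tmp_commit = "_commit_2132c4fe-4077-476c-b8f5-e77fea04f170.json.tmp"

print(looks_like_commit(real_commit))  # a real commit file matches
print(looks_like_commit(tmp_commit))   # the tmp file matches too, via "170.json"
```

Under this assumption the tmp file is only misclassified when the id ends in a digit, which would explain why the problem shows up intermittently.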

As a result, history() lists a tmp commit, which is incorrect. This happened when I was doing concurrent merges and it errored out (see: #2084 (comment)).

get_add_actions seems to be unaffected by this regex, though I'm not sure what checks it uses to exclude tmp commit files.

What you expected to happen:

history and anything else that relies on is_commit_file should not list tmp commit files.
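One way to get the expected behaviour is to require the whole file name to match. The sketch below is in Python and only illustrates the idea. The function name mirrors `is_commit_file` from this issue but is a hypothetical stand-in, and the pattern is an assumption, not the actual delta-rs fix.

```python
import re

# Assumed pattern: digits followed by ".json", applied to the whole name.
COMMIT_NAME = re.compile(r"[0-9]+\.json")


def is_commit_file(path: str) -> bool:
    """Hypothetical check: True only if the final path component is
    entirely '<digits>.json'."""
    file_name = path.rsplit("/", 1)[-1]
    # fullmatch requires the pattern to cover the entire string.
    return COMMIT_NAME.fullmatch(file_name) is not None


print(is_commit_file("_delta_log/00000000000000000003.json"))
print(is_commit_file("_delta_log/_commit_2132c4fe-4077-476c-b8f5-e77fea04f170.json.tmp"))
```

With a whole-name match, the trailing `.tmp` and the `_commit_` prefix both disqualify the tmp file, regardless of which characters the random id ends in.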

How to reproduce it:
If you run concurrent merges, or anything else that leaves tmp commit files behind, you will sometimes see files like _delta_log/_commit_2132c4fe-4077-476c-b8f5-e77fea04f170.json.tmp (when the randomly generated id ends in a digit). Such a file then gets listed in a history call.

Labels: bug
