Explicitly compare local/remote stage out file checksum in GFAL2 #12347

@amaltaro

Impact of the bug
WMAgent

Describe the bug
As just discussed with Hasan, Panos and Stephan, hundreds of files currently have checksum issues and cannot be transferred between storage endpoints.

One or two years ago, we identified cases where gfal-copy reported status code 0 even though the data transfer had run into problems, leaving corrupted files in the system, as tracked in #11556.

We suspect that at least a fraction of these recent failures still come from that same scenario, especially because the SL7 GFAL2 library hasn't been updated in the apptainer images used by SI for production workloads.

How to reproduce it
Do not know

Expected behavior
The initial solution that we discussed in the meeting involves 3 steps:

  1. we execute the gfal-copy data transfer
  2. if it is successful, we trigger a gfal-sum (or whichever command is correct for calculating the remote file checksum), regardless of whether gfal-copy was executed with the -K checksum option or not
  3. then in WMAgent, we compare the remote checksum with the local checksum. If they are equal, the whole stage out step returns exit code 0; otherwise we try to remove any corrupted/broken files with gfal-rm and follow the stage out retry logic that is already in place.
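
A rough sketch of the three steps above, just to make the proposed flow concrete. This is not the WMCore stage out implementation: the function names, the injectable `run` callable and the return code 1 on mismatch are all hypothetical, and it assumes that `gfal-sum <url> <algorithm>` prints the URL followed by the checksum in hexadecimal, which should be verified against the GFAL2 version in the images.

```python
import subprocess
from typing import Callable, List, Tuple

# A runner takes an argv list and returns (exit_code, stdout).
# It is injectable so the flow can be exercised without GFAL2 installed.
Runner = Callable[[List[str]], Tuple[int, str]]


def run_command(argv: List[str]) -> Tuple[int, str]:
    """Default runner: execute the command and capture its stdout."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    return proc.returncode, proc.stdout


def parse_gfal_sum(stdout: str) -> str:
    """ASSUMPTION: gfal-sum prints '<url> <checksum>'; take the last field."""
    fields = stdout.split()
    if not fields:
        raise ValueError("empty gfal-sum output")
    return fields[-1]


def stage_out_with_check(source_url: str, remote_url: str, local_checksum: str,
                         algorithm: str = "ADLER32",
                         run: Runner = run_command) -> int:
    """Hypothetical helper: copy, fetch remote checksum, compare, clean up."""
    # Step 1: run the transfer; a non-zero exit goes to the usual retry logic.
    code, _ = run(["gfal-copy", source_url, remote_url])
    if code != 0:
        return code

    # Step 2: always ask for the remote checksum, whether or not -K was used.
    code, out = run(["gfal-sum", remote_url, algorithm])
    if code == 0:
        try:
            # Compare as integers so zero-padding and letter case do not matter.
            if int(parse_gfal_sum(out), 16) == int(local_checksum, 16):
                return 0
        except ValueError:
            pass  # unparsable output is treated like a mismatch

    # Step 3: mismatch (or no usable checksum): remove the remote file and
    # report failure so the existing stage out retry logic takes over.
    run(["gfal-rm", remote_url])
    return 1
```

Comparing the hexadecimal values as integers is a deliberate choice here, since the local and remote tools may differ in zero-padding or case.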

NOTE that we need to investigate which algorithm is used for the local checksum calculation (is it computed by CMSSW, or by WMAgent/WMRuntime?) and make sure that the same algorithm is used for the remote calculation.
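
For illustration only, here is how a local checksum could be computed if the algorithm turns out to be Adler-32. That is an assumption, since which algorithm is actually used locally is exactly what needs to be investigated. The function name `adler32_of_file` is hypothetical; it only uses `zlib.adler32` from the Python standard library.

```python
import zlib


def adler32_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the Adler-32 of a file as 8 lowercase hex digits.

    ASSUMPTION: Adler-32 is the algorithm in use; swap in another one
    if the investigation shows otherwise.
    """
    value = 1  # Adler-32 is defined to start from 1
    with open(path, "rb") as handle:
        # Read in chunks so large output files are not loaded into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            value = zlib.adler32(chunk, value)
    return f"{value & 0xFFFFFFFF:08x}"
```

Whatever algorithm is used locally, the same name would have to be passed to the remote checksum command so that both sides are comparable.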

In addition, it would be a bonus if we could add some flexibility to this check, such that we can enable/disable it. However, I suspect this is not simple to implement, so we might just skip it.
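
If we did want the switch, one low-effort option might be an environment variable read at stage out time. The variable name `WMAGENT_STAGEOUT_VERIFY_CHECKSUM` and the helper below are purely hypothetical, and whether the job environment is the right place for such a setting is an open question.

```python
import os
from typing import Mapping, Optional

# HYPOTHETICAL variable name, not an existing WMAgent setting.
TOGGLE_NAME = "WMAGENT_STAGEOUT_VERIFY_CHECKSUM"


def checksum_check_enabled(environ: Optional[Mapping[str, str]] = None,
                           default: bool = True) -> bool:
    """Return True unless the toggle is explicitly set to a 'false' value."""
    env = os.environ if environ is None else environ
    raw = env.get(TOGGLE_NAME)
    if raw is None:
        return default  # check stays on when nothing is configured
    return raw.strip().lower() not in {"", "0", "false", "no", "off"}
```

Defaulting to enabled means the verification only goes away when someone turns it off on purpose.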

Additional context and error message
None
