Impact of the bug
WMAgent
Describe the bug
As just discussed with Hasan, Panos and Stephan, hundreds of files have checksum issues and cannot be transferred between storage endpoints.
One or two years ago, we identified cases where gfal-copy reported status code 0 even though the data transfer had run into problems, leaving corrupted files in the system, as tracked in #11556.
We suspect that at least a fraction of these recent failures still come from that same scenario, especially because the SL7 GFAL2 library has not been updated in the apptainer images used by SI for production workload.
How to reproduce it
Do not know
Expected behavior
The initial solution that we discussed in the meeting involves 3 steps:
- we execute the gfal-copy data transfer
- if it is successful, we trigger a gfal-sum (or the correct command to calculate the remote file checksum), regardless of whether gfal-copy was executed with the -K checksum option or not
- then in WMAgent, we compare the remote checksum with the local checksum. If they are equal, the whole stage-out step returns exit code 0; otherwise we try to remove any corrupted/broken files with gfal-rm and follow the stage-out retry logic that is already in place.
NOTE that we need to investigate which algorithm is used for the local checksum calculation (is it done by CMSSW, or is WMAgent/WMRuntime calculating it?) and make sure that the same algorithm is used for the remote calculation.
In addition, it would be a bonus if we could add some flexibility to this check, such that we can enable/disable it. However, I think it is not simple enough and we might just skip it.
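To make the three steps above concrete, here is a minimal sketch of how they could fit together. It is not the WMCore implementation. It assumes the checksum algorithm is ADLER32 (still to be confirmed, per the NOTE above), that gfal-sum prints the URL followed by the checksum, and that the local file is passed to gfal-copy as a file:// URL. The names local_adler32, run_command and stage_out_with_verification are hypothetical, and the command runner is injectable so the logic can be exercised without GFAL installed.

```python
import os
import subprocess
import zlib


def local_adler32(path, chunk_size=1024 * 1024):
    """Return the Adler-32 checksum of a local file as 8 lowercase hex digits."""
    value = 1  # Adler-32 initial value
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            value = zlib.adler32(chunk, value)
    return format(value & 0xFFFFFFFF, "08x")


def run_command(args):
    """Run a command and return (exit_code, stdout)."""
    proc = subprocess.run(args, capture_output=True, text=True)
    return proc.returncode, proc.stdout


def stage_out_with_verification(local_path, remote_url, run=run_command):
    """Hypothetical helper: copy, verify the remote checksum, clean up on mismatch.

    Returns 0 on success and a non-zero code otherwise, so the caller can
    apply the existing stage-out retry logic.
    """
    # Step 1: the data transfer itself.
    source_url = "file://" + os.path.abspath(local_path)
    code, _ = run(["gfal-copy", source_url, remote_url])
    if code != 0:
        return code

    # Step 2: always ask for the remote checksum, whether or not -K was used.
    # Assumption: the output looks like "<url> <checksum>".
    code, out = run(["gfal-sum", remote_url, "ADLER32"])
    fields = out.split()
    remote_sum = fields[-1].lower().zfill(8) if code == 0 and fields else None

    # Step 3: compare; on mismatch, remove the suspect file and report failure.
    if remote_sum is not None and remote_sum == local_adler32(local_path):
        return 0
    run(["gfal-rm", remote_url])
    return 1
```

Returning a non-zero code after the gfal-rm call leaves the retry decision to the caller, which matches the idea of reusing the retry logic that already exists rather than adding a second one.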
Additional context and error message
None