Hi
A user at our site has reported that the following operation on certain 64-bit offset files created by their model has seen an increase in runtime by a factor of over 100 between netcdf/4.7.3 and netcdf/4.7.4.
```shell
nccopy -d 5 -s <infile> <outfile>
```
We run CentOS 8.1, and have reproduced this issue with both GCC 8.3 and Intel 2019.5 compilers. In all cases, NetCDF has been built against HDF5 1.10.5, HDF4 4.2.14 and szip 2.1. We're using autotools with the following configure options:
```
--enable-hdf4 --enable-mmap --disable-dap-remote-tests
```
I've narrowed this down to the case where the input is a 64-bit offset file over approx. 4GB in size. Files under this size, or in other formats, don't appear to be affected. I've been using the following steps to reproduce:
```shell
wget http://dapds00.nci.org.au/thredds/fileServer/fx3/gbr4/gbr4_simple_2016-06.nc
ncks --64bit_offset gbr4_simple_2016-06.nc gbr4_simple_2016-06.nc.64
nccopy -d 5 -s gbr4_simple_2016-06.nc.64 gbr4_simple_2016-06.nc.474
```
This downloads a 4.5GB compressed classic-model file and converts it to a 13.8GB uncompressed 64-bit offset file. When run using the nccopy provided by NetCDF 4.7.3, the operation takes around 260 seconds on our system. When run using the nccopy provided by NetCDF 4.7.4, the same operation takes around 32,000 seconds. I've tried setting identical chunk parameters between the two versions as described in #1757, but this did not improve performance. The regression also appears at other deflation levels (including 0, though that case only takes twice as long), though I've been focusing on the `-d 5` case.
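Since the regression seems gated on both the file format and the ~4GB size, here's a small Python helper I've been using to check whether a given file matches the trigger conditions (this is my own sketch, not part of the netCDF tools; 64-bit offset CDF-2 files begin with the magic bytes `CDF\x02`, and the 4GB cutoff is approximate):

```python
import os

def is_large_cdf2(path, threshold=4 * 1024**3):
    """Return True if path looks like a 64-bit offset (CDF-2) netCDF
    file larger than the given size threshold (~4GB by default)."""
    with open(path, "rb") as f:
        magic = f.read(4)
    # Classic files start with b'CDF\x01', 64-bit offset with b'CDF\x02'
    return magic == b"CDF\x02" and os.path.getsize(path) > threshold
```

e.g. `is_large_cdf2("gbr4_simple_2016-06.nc.64")` returns True for the reproduction file above.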
I've gathered some profiling data and strace output, and from what I can tell, it's taking a very different pathway inside HDF5. Both cases appear to follow the same pathway to `H5D__chunk_write`. Then, in the 4.7.3 case, the majority of the writes occur inside the `H5D__chunk_cache_prune` function, but in the 4.7.4 case, the writes occur in `H5D__chunk_unlock`. Furthermore, nccopy 4.7.4 is reading large amounts of data from the output file (during `H5D__chunk_lock`), then writing significantly more data than nccopy 4.7.3. Below is a snippet from `strace -epwrite64,pread64 nccopy -d 5 -s gbr4_simple_2016-06.nc.64 gbr4_simple_2016-06.nc.474`:
```
pread64(5, "x^\354\335uXU\353\2727~:\247\204\200( \335\215H\227\nH\2034H7HJ\203Hw"..., 686159840, 603883237) = 686159840
pwrite64(5, "x^\354\335uXU\353\2727~:\247\204\200( \335\215H\227\nH\2034H7HJ\203Hw"..., 692154506, 40796584) = 692154506
pread64(5, "x^\354\335uXT\373\3727~)\1\31ZiP:\245\273K\220n\245\33\244SB\244\273\21\1"..., 909870723, 1290043077) = 909870723
pwrite64(5, "x^\354\335uXT\373\3727~)\1\31ZiP:\245\273K\220n\245\33\244SB\244\273\21\1"..., 917804449, 732951090) = 917804449
pread64(5, "x^\354\335\5TU\353\342.|\220\20p\221\22\" \235\2\"% H#]\22\322\r\322-%"..., 909308379, 3146777046) = 909308379
pwrite64(5, "x^\354\335\5TU\353\342.|\220\20p\221\22\" \235\2\"% H#]\22\322\r\322-%"..., 917255929, 2840938583) = 917255929
pread64(5, "x^\354\275\5t\34Y\262\347-\306\224,fff\262\230\321B\213\331bK\262\230\31,\266\230\231"..., 23715679, 2199913800) = 23715679
pwrite64(5, "x^\354\275\5t\34Y\262\347-\306\224,fff\262\230\321B\213\331bK\262\230\31,\266\230\231"..., 23934582, 1650755539) = 23934582
pread64(5, "x^\354\275\5t\335X\226\367k\246(fffff\266c\306\230\231\231\0353\333\261\35C\2141"..., 24817697, 2223629479) = 24817697
pwrite64(5, "x^\354\275\5t\335X\226\367k\246(fffff\266c\306\230\231\231\0353\333\261\35C\2141"..., 25043550, 1674690121) = 25043550
```
It appears to be reading and writing large amounts of overlapping data in the output file (the offsets and write sizes do overlap over time). Running the same strace on nccopy 4.7.3 shows that the writes are much smaller, and `pread64` is never called. The majority of the increased runtime then appears in `H5Z_pipeline`, presumably compressing the data it has read along with whatever data it was supposed to write. I'm afraid I'm not knowledgeable enough about the interaction between HDF5 and NetCDF to provide any suggestions, but please let me know if you need any further information or profiling data.
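In case it helps triage, this is roughly how I've been quantifying the overlap from the strace output (a throwaway Python sketch of mine, not anything official; it parses `pread64`/`pwrite64` lines and reports how many bytes of each call's extent overlap a previously written extent):

```python
import re

# Matches strace lines like:
#   pwrite64(5, "..."..., 692154506, 40796584) = 692154506
# Greedy .* skips the data argument; the last two numbers before the
# closing paren are the requested length and the file offset.
CALL = re.compile(r'(pread64|pwrite64)\(\d+, .*, (\d+), (\d+)\)\s*=\s*\d+')

def overlap_report(lines):
    """Yield (syscall, offset, length, bytes_overlapping_prior_writes)."""
    written = []  # (start, end) extents already written to the file
    for line in lines:
        m = CALL.search(line)
        if not m:
            continue
        call, length, offset = m.group(1), int(m.group(2)), int(m.group(3))
        start, end = offset, offset + length
        overlap = sum(max(0, min(end, e) - max(start, s)) for s, e in written)
        if call == "pwrite64":
            written.append((start, end))
        yield call, offset, length, overlap
```

Feeding it the snippet above shows each `pread64` from the output file landing on regions that earlier `pwrite64` calls already covered.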