Debugging netCDF I/O problems on HPC systems can be very challenging.
Using a debugger is usually hard or impossible, and with code running on hundreds or thousands of processors, it's pretty hard to know what is going on.
One thing that helps a lot is logging, especially in PIO. There I adjusted the logging code, which I originally wrote for netcdf-4, to work better on multiple processors: each processor gets its own log file containing only that processor's log output. I have opened an issue in netcdf-c to get this improvement into the netCDF-4 logging as well (#1762).
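To illustrate the one-log-file-per-processor idea, here is a minimal sketch of building a per-rank log file name. The helper name and the file name pattern are hypothetical and are not taken from PIO or netcdf-c. In a real MPI program the rank would come from MPI_Comm_rank(); here it is passed in as a plain integer so the sketch has no MPI dependency.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not the PIO or netcdf-c implementation): build a
 * per-rank log file name such as "sketch_log_7.log".
 * Returns 0 on success, -1 if buf is too small to hold the name. */
int sketch_log_filename(char *buf, size_t len, int rank)
{
    int n = snprintf(buf, len, "sketch_log_%d.log", rank);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Usage example with self-checks; returns 0 when every check passes. */
int sketch_filename_demo(void)
{
    char name[64];
    char tiny[4];

    /* Rank 7 gets its own file name. */
    assert(sketch_log_filename(name, sizeof(name), 7) == 0);
    assert(strcmp(name, "sketch_log_7.log") == 0);

    /* A buffer that is too small is reported, not silently accepted. */
    assert(sketch_log_filename(tiny, sizeof(tiny), 7) == -1);
    return 0;
}
```

Each process would then open its own file with fopen() and write only its own messages there, so output from different processors never interleaves.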
As we know, logging is only enabled if netcdf-c is built with --enable-logging. I did this out of an excess of caution: I didn't want logging to slow performance, and this option guarantees that it does not. However, there is a serious cost. When an HPC system experiences problems, installing a rebuilt netcdf-c with logging turned on is a significant difficulty. Partly this is because these systems are complex, and partly because only a few people have permission to install software, and those people are always backed up with a long list of things to do.
What would be really useful is for logging to always be available. This will not affect performance because, even when logging is available, it does nothing when the log level is not set. A LOG(()) call in netcdf-c code will check the log level and return immediately, which should not matter to performance, especially if we are careful to avoid LOG(()) statements inside deep loops. We mostly do that already, because such log statements produce more output than is useful.
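The "check the log level and return" behavior can be sketched as follows. This is a simplified illustration, not the netcdf-c source: every name prefixed with sketch_ or SKETCH_ is hypothetical, and the convention that -1 means "logging off" is an assumption made for this example.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical names throughout; this is not the netcdf-c source. */
static int sketch_log_level = -1;     /* assumed: -1 means logging is off */
static int sketch_lines_written = 0;  /* counts messages actually emitted */

/* Set the runtime log level; returns the previous level. */
int sketch_set_log_level(int new_level)
{
    int old = sketch_log_level;
    sketch_log_level = new_level;
    return old;
}

/* Emit a message only if its severity is within the current level. */
void sketch_log(int severity, const char *fmt, ...)
{
    va_list args;

    /* Cheap early return: one integer comparison when logging is off. */
    if (severity > sketch_log_level)
        return;

    va_start(args, fmt);
    vfprintf(stderr, fmt, args);
    va_end(args);
    fputc('\n', stderr);
    sketch_lines_written++;
}

/* Double parentheses pass a whole argument list through a one-argument macro. */
#define SKETCH_LOG(e) sketch_log e

/* Usage example; returns the number of messages actually written. */
int sketch_demo(void)
{
    sketch_lines_written = 0;

    SKETCH_LOG((2, "opening file %s", "example.nc")); /* level unset: skipped */
    assert(sketch_lines_written == 0);

    sketch_set_log_level(3);
    SKETCH_LOG((2, "opening file %s", "example.nc")); /* 2 <= 3: written */
    SKETCH_LOG((5, "very detailed message"));         /* 5 > 3: skipped */

    sketch_set_log_level(-1);                         /* turn logging off again */
    return sketch_lines_written;
}
```

With the level unset, each call costs a function call and one comparison. Note that in C the arguments are still evaluated before the call, so expensive expressions should be kept out of the argument list.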
With this change, HPC users could start debugging just by inserting an nc_set_log_level() call (or nf_set_log_level() for Fortran). There would be no need for a separate netcdf-c install, so system admins would not have to be involved.
This must be carefully tested to ensure there is no performance impact, but I don't think there will be. Most logging is done in the metadata code, which is not performance-critical. The data read/write code does contain LOG(()) statements, but we can ensure they do not degrade performance.