Allow lazy-read of netCDF-4/HDF5 files #857

@edhartnett

At this point, this represents more an aspiration than a plan, but there has been some discussion (see PR #849) of how to enable lazy reads of netCDF-4 file metadata.

Files with a very large amount of metadata take a long time to load because netCDF reads all metadata at file open. For classic files, this doesn't seem to bother people much. But for netCDF-4/HDF5 files, it does. Perhaps this can be explained by the use of netCDF-4/HDF5 for some really complex and large datasets, which end up with tens of thousands of attributes, variables, dimensions, and/or groups. Or perhaps the classic formats, having all their metadata in a block at the beginning of the file, just load faster.

This has already cost us satellite users - the NPP uses netCDF-4, but the follow-on JPSS spacecraft switched to HDF5 without netCDF, due to the slow load times. I was told a similar story about an ESA satellite system by a very active netCDF user in the Netherlands. (Satellite L2 data files generally contain a very large number of attributes, some of which may be reasonably large arrays.)

One idea I suggested is to read each group only as needed. This would be pretty easy to implement, I think. It would help where there are lots of groups. @DennisHeimbigner points out that this will not help with files that contain lots of vars. He indicates a known use case with a very large number of vars, all in the root group.

Well, that's another good idea all shot to hell. ;-)
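
To make the per-group idea concrete, here is a minimal sketch of reading a group's metadata only on first use. It simulates the file with an in-memory table, so nothing here is the libsrc4 code or the HDF5 API: `disk_grp`, `lazy_grp`, `lazy_file`, `lazy_open()` and `lazy_inq_grp_nvars()` are all hypothetical names made up for illustration.

```c
#include <stddef.h>

/* Simulated on-disk group record (hypothetical, stands in for HDF5). */
typedef struct disk_grp {
    const char *name;
    int nvars;
} disk_grp;

/* In-memory group: metadata is filled in only when first requested. */
typedef struct lazy_grp {
    const disk_grp *src; /* where the metadata would be read from */
    int loaded;          /* 0 until this group's metadata is read */
    int nvars;           /* valid only when loaded != 0 */
} lazy_grp;

/* Per-file state; metadata_reads counts the simulated expensive reads. */
typedef struct lazy_file {
    lazy_grp *grps;
    size_t ngrps;
    int metadata_reads;
} lazy_file;

/* "Open" the file: record where each group lives, but read nothing. */
void lazy_open(lazy_file *f, lazy_grp *grps, const disk_grp *disk, size_t n)
{
    f->grps = grps;
    f->ngrps = n;
    f->metadata_reads = 0;
    for (size_t i = 0; i < n; i++) {
        grps[i].src = &disk[i];
        grps[i].loaded = 0;
        grps[i].nvars = 0;
    }
}

/* Return the variable count of one group, reading its metadata on
 * first use. Returns -1 for an out-of-range group index. */
int lazy_inq_grp_nvars(lazy_file *f, size_t idx)
{
    if (idx >= f->ngrps)
        return -1;
    lazy_grp *g = &f->grps[idx];
    if (!g->loaded) {
        g->nvars = g->src->nvars; /* stand-in for the real group read */
        g->loaded = 1;
        f->metadata_reads++;
    }
    return g->nvars;
}
```

With this shape, opening a file with thousands of groups costs nothing up front, and only the groups a user actually asks about are read. As noted above, it does nothing for a file with all of its vars in the root group, since the first touch of that group still reads everything.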

In order to do lazy reads as Dennis suggests I think much of the libsrc4 code would have to be rewritten. (The good news is that with #849 soon to merge, and #856 to follow, the libsrc4 code will be a fair bit smaller than it is now.)

For example, if we open a file and read nothing, and then the user does an nc_inq(), we need to find out how many variables there are. In the current code, we count our list, because we have already read them. In the lazy-read code, we would need to see if there's a way we can get the numbers we need without reading every variable's metadata. That is probably possible in HDF5, but not how the code is currently written.
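
As a rough illustration of the counting problem, the sketch below answers "how many variables?" from a cheap list of link names and kinds, and defers the expensive per-variable read until someone asks for a specific variable. It is a simulation under assumptions: `link_entry`, `lazy_count_vars()` and `lazy_load_var()` are hypothetical, and it assumes the kind of each link can be learned cheaply. In a real HDF5 file, a dataset may be a dimension scale without being a netCDF variable, so classifying links could need more work than this shows.

```c
#include <stddef.h>

/* Kind of object a link in a group points at (simplified). */
typedef enum { LINK_GROUP, LINK_DATASET } link_kind;

/* One entry in a group's link list. Only the name and kind are known
 * after open; meta_loaded is set when the full metadata is read. */
typedef struct link_entry {
    const char *name;
    link_kind kind;
    int meta_loaded;
} link_entry;

/* Count variables from the link list alone. No per-variable metadata
 * is read, so the cost does not depend on attributes or dimensions. */
int lazy_count_vars(const link_entry *links, size_t nlinks)
{
    int n = 0;
    for (size_t i = 0; i < nlinks; i++)
        if (links[i].kind == LINK_DATASET)
            n++;
    return n;
}

/* Simulated expensive path: read one variable's metadata on demand.
 * Returns 0 on success, -1 if idx is out of range or not a dataset. */
int lazy_load_var(link_entry *links, size_t nlinks, size_t idx)
{
    if (idx >= nlinks || links[idx].kind != LINK_DATASET)
        return -1;
    links[idx].meta_loaded = 1; /* stand-in for the real metadata read */
    return 0;
}
```

The point of the split is that an nc_inq()-style count never touches `lazy_load_var()`, so a file with tens of thousands of vars pays only for the ones that are actually inquired about.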

Handling dimensions in a lazy read is going to be particularly tricky. They may be in different groups from a variable. So if the user opens a file and does an nc_inq_var() on a var deep in the group structure, we will have to have code smart enough to find all the dimensions in whatever group they are in. All this information is in the HDF5 file, but the code to read it and use it properly remains to be written.
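
One possible shape for the dimension lookup is sketched below. It assumes a dimension used by a variable lives in the variable's own group or in one of its ancestors, so the search walks from the variable's group up to the root and loads dimension lists only for the groups it visits. That scoping is an assumption of the sketch, and files that reference dimensions elsewhere would need a wider search. `grp_node`, `dim_rec` and `find_dim()` are hypothetical names, and the "disk" is simulated with in-memory arrays.

```c
#include <stddef.h>
#include <string.h>

/* A dimension as it would be stored in the file (simulated). */
typedef struct dim_rec {
    const char *name;
    size_t len;
} dim_rec;

/* A group node. dims_loaded stays 0 until this group's dimension
 * list has been read by a lookup that passed through it. */
typedef struct grp_node {
    struct grp_node *parent;  /* NULL for the root group */
    const dim_rec *disk_dims; /* simulated on-disk dimension list */
    size_t ndisk_dims;
    int dims_loaded;
} grp_node;

/* Find a dimension by name, starting in the variable's group and
 * walking up through its ancestors. Groups off that path are never
 * touched. Returns NULL if no group on the path defines the name. */
const dim_rec *find_dim(grp_node *start, const char *name)
{
    for (grp_node *g = start; g != NULL; g = g->parent) {
        if (!g->dims_loaded)
            g->dims_loaded = 1; /* stand-in for reading this group's dims */
        for (size_t i = 0; i < g->ndisk_dims; i++)
            if (strcmp(g->disk_dims[i].name, name) == 0)
                return &g->disk_dims[i];
    }
    return NULL;
}
```

For a var deep in the group structure, this reads at most one dimension list per level of nesting, rather than every group in the file.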
