From: John Caron
Background:
In the classic model, data stored with the "byte" data type are interpreted as signed when converted to other numeric types. However, the byte data type is sometimes used for unsigned data. Unidata introduced the "_Unsigned" attribute so that the writer can flag this case, but not all libraries look for this attribute.
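To make the signed/unsigned ambiguity concrete, here is a minimal sketch in plain Python of what a reader has to do when a byte variable is flagged as unsigned. The helper name `as_unsigned_byte` and the sample values are hypothetical and only for illustration; this is not how any particular netCDF library implements it.

```python
def as_unsigned_byte(value: int) -> int:
    """Reinterpret a signed 8-bit value (-128..127) as unsigned (0..255)."""
    if not -128 <= value <= 127:
        raise ValueError(f"not a signed byte: {value}")
    # Masking with 0xFF keeps the same 8 bits but reads them as unsigned.
    return value & 0xFF


# Values as a library that ignores "_Unsigned" would report them (signed).
stored = [0, 127, -128, -1]

# Values the writer actually meant if the variable holds unsigned data.
intended = [as_unsigned_byte(v) for v in stored]
print(intended)
```

A reader that does not check the attribute silently reports the first list, which is the problem the proposal tries to remove.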
Sometimes the "char" data type is intended to mean unsigned byte data. More typically it is used for encoding text data, but the character encoding is undefined. "Printable ASCII" is probably a reasonable assumption. Char data can only be stored as fixed-length arrays, and one must specify the length using a global, shared dimension, which is unneeded and clutters the dimension namespace.
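The fixed-length constraint can be sketched as follows. This is only an illustration in plain Python: the dimension name `STRLEN`, the helper names, the sample station names, and the choice of NUL padding are all assumptions, not something the classic model mandates.

```python
STRLEN = 8  # hypothetical shared dimension holding the maximum text length


def to_fixed_char(text: str, length: int) -> bytes:
    """Encode ASCII text into a fixed-length, NUL-padded char array."""
    data = text.encode("ascii")
    if len(data) > length:
        raise ValueError(f"{text!r} does not fit in {length} chars")
    return data.ljust(length, b"\x00")


def from_fixed_char(raw: bytes) -> str:
    """Strip the padding and decode, assuming the bytes are ASCII."""
    return raw.rstrip(b"\x00").decode("ascii")


names = ["KDEN", "KBOU", "KSFO"]
rows = [to_fixed_char(n, STRLEN) for n in names]

# Every row has the same width, regardless of the length of the text.
print([len(r) for r in rows])
print([from_fixed_char(r) for r in rows])
```

With a variable-length String type, neither the extra dimension nor the padding and stripping steps are needed.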
The NetCDF-4 enhanced model adds Strings and unsigned integer types, so we have the opportunity to clarify. A lot of work on character encodings has been done in the last 20 years with Unicode, and we should leverage that. UTF-8 is a variable-length encoding of Unicode that has ASCII as a subset, allows any language to be encoded, and has become the dominant encoding on the web. NetCDF libraries assume Strings are UTF-8 encoded. If your text is ASCII, you are using UTF-8 already.
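The two UTF-8 properties mentioned here (ASCII is a subset, and the encoding is variable length) can be checked with a few lines of plain Python. The sample strings are arbitrary and chosen only for illustration.

```python
ascii_text = "temperature"

# ASCII text produces exactly the same bytes under either encoding.
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print(same_bytes)

# Non-ASCII characters take more than one byte each in UTF-8.
samples = ["A", "\u00e9", "\u6e29", "\U0001F600"]  # A, e-acute, a CJK character, an emoji
lengths = [len(ch.encode("utf-8")) for ch in samples]
print(lengths)
```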
Also see:
CDL Data Types
Developing Conventions for NetCDF-4 : Use of Strings
Proposal:
- Use the unsigned or signed integer data types when your data is unsigned or signed, respectively.
- Do not use the _Unsigned attribute.
- Use the String data type for text data, encoded in UTF-8. Any language (aka character set) is allowable.
- The char data type is deprecated. If you must use it, use it only for ASCII text data.