This repository was archived by the owner on Apr 24, 2020. It is now read-only.

String, char, unsigned integers, and character encodings. #3

@JohnLCaron

From: John Caron

Background:

In the classic model, data using the "byte" data type are interpreted as signed when converting. However, the byte data type is sometimes used for unsigned data. Unidata introduced the "_Unsigned" attribute to allow the user to specify this. Not all libraries look for this attribute.
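The signed/unsigned ambiguity can be illustrated with a minimal Python sketch. The helper `read_bytes` and its `unsigned` flag are hypothetical names for this example only; they mimic what a reader honoring an `_Unsigned`-style hint might do and are not taken from any NetCDF library.

```python
import struct

def read_bytes(raw: bytes, unsigned: bool = False) -> list:
    """Interpret each stored byte as signed (the classic default) or unsigned.

    `unsigned` is a hypothetical stand-in for a reader that honors an
    `_Unsigned`-style attribute.
    """
    code = "B" if unsigned else "b"  # struct codes: b = int8, B = uint8
    return list(struct.unpack(code * len(raw), raw))

# The same five stored bytes, read under both interpretations.
raw = bytes([0, 127, 128, 200, 255])
signed_values = read_bytes(raw)                   # [0, 127, -128, -56, -1]
unsigned_values = read_bytes(raw, unsigned=True)  # [0, 127, 128, 200, 255]
```

Values above 127 are the ones that differ, so a library that ignores the attribute silently returns negative numbers for them.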

Sometimes the "char" data type is intended to mean unsigned byte data. More typically it is used for encoding text data, but the character encoding is undefined; "printable ASCII" is probably a reasonable assumption. Char data can only be stored as fixed-length arrays, and the length must be specified with a global, shared dimension, which is otherwise unneeded and clutters the dimension namespace.
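As a rough sketch of what the fixed-length constraint means in practice, the following Python packs text into equal-width byte rows and reads it back. The helper names and the choice of NUL padding are assumptions for illustration; some writers pad with spaces instead, and none of this is a NetCDF API.

```python
def to_char_array(names, strlen):
    """Pack ASCII strings into rows of exactly `strlen` bytes (NUL padded).

    `strlen` plays the role of the shared string-length dimension.
    """
    rows = []
    for name in names:
        data = name.encode("ascii")
        if len(data) > strlen:
            raise ValueError(f"{name!r} does not fit in {strlen} chars")
        rows.append(data.ljust(strlen, b"\x00"))
    return rows

def from_char_array(rows):
    """Strip the NUL padding and decode each row back to text."""
    return [row.rstrip(b"\x00").decode("ascii") for row in rows]

rows = to_char_array(["KDEN", "KBOU", "KSFO"], strlen=8)
names = from_char_array(rows)
```

Every row occupies `strlen` bytes regardless of the text it holds, and the width must be chosen before any data is written.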

The NetCDF-4 enhanced model adds Strings and unsigned integer types, so we have the opportunity to clarify. A lot of work on character encodings has been done in the last 20 years with Unicode, and we should leverage that. UTF-8 is a variable-length encoding of Unicode that has ASCII as a subset, allows any language to be encoded, and has become the dominant encoding on the web. NetCDF libraries assume Strings are UTF-8 encoded. If your text is ASCII, you are using UTF-8 already.
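The two UTF-8 properties mentioned above (ASCII is a subset, and the encoding is variable length) can be checked with a few lines of standard Python. The sample strings are arbitrary choices for this sketch.

```python
# ASCII text produces identical bytes under both encodings.
ascii_text = "air_temperature"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# Non-ASCII characters take two to four bytes each in UTF-8.
samples = {
    "latin_k": "K",            # U+004B, plain ASCII
    "e_acute": "\u00e9",       # U+00E9
    "cjk": "\u6e29",           # U+6E29
    "emoji": "\U0001F600",     # U+1F600
}
byte_lengths = {key: len(ch.encode("utf-8")) for key, ch in samples.items()}
# byte_lengths == {"latin_k": 1, "e_acute": 2, "cjk": 3, "emoji": 4}
```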

Also see:

CDL Data Types

Developing Conventions for NetCDF-4 : Use of Strings

Proposal:

  1. Use the unsigned or signed integer data types when your data is unsigned or signed, respectively.
  2. Do not use the _Unsigned attribute.
  3. Use the String data type for text data, encoded in UTF-8. Any language (aka character set) is allowable.
  4. The char data type is deprecated. If you must use it, use it only for ASCII text data.
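
One way to picture the proposal is as a small lint check over variable descriptions. The sketch below is hypothetical: `check_variable`, its arguments, and the type names `"byte"`, `"char"`, and `"string"` are assumptions for illustration, not part of any NetCDF library or tool.

```python
def check_variable(dtype, attributes, sample_text=None):
    """Return a list of proposal violations for one variable description.

    dtype:       type name as a plain string (hypothetical convention)
    attributes:  dict of attribute name -> value
    sample_text: optional text content, checked only for char variables
    """
    problems = []
    if "_Unsigned" in attributes:
        problems.append("use an unsigned integer type instead of _Unsigned")
    if dtype == "char":
        problems.append("char is deprecated; prefer the String type")
        if sample_text is not None and not sample_text.isascii():
            problems.append("char data must be ASCII text")
    return problems

legacy = check_variable("byte", {"_Unsigned": "true"})
char_var = check_variable("char", {}, sample_text="caf\u00e9")
clean = check_variable("string", {}, sample_text="caf\u00e9")
```

Here `legacy` reports one problem, `char_var` reports two, and `clean` reports none, since any Unicode text is allowed in a String.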
