
Localized metadata in NetCDF files #244

@turnbullerin


Hi Everyone!

So I work for the Government of Canada and I am working on defining the required metadata fields for us to publish data in NetCDF format. We'll be moving a lot of data into this format, so we are trying to make sure we get the format right the first time. The CF conventions are our starting point for metadata attributes.

As the data will be officially published by the Government of Canada eventually, we will have to make sure the metadata is available in both English and French. If the data contains English or French text (not from a controlled list), it needs to be translated too. I haven't found any efforts towards creating a convention for bilingual (or multilingual) metadata and data in NetCDF formats, so I wanted to reach out here to see if anyone has been working on this so we could collaborate on it.

My initial thought is that the metadata should be included in such a way as to make it easy to programmatically extract each language separately. This would allow applications that use NetCDF files (or tools that draw on the CF conventions like ERDDAP) to display the available language options and let the user select which one they would like to see without additional clutter. It should also be included in a way that does not impact existing applications to ensure compatibility.

Of note though is that some data comes from controlled lists where the values have meaning beyond the English meaning. This data probably shouldn't be translated as it would lose its meaning. For many controlled lists, applications can use their own lookup tables to translate the display if they want, and bigger vocabulary lists (like GCMD keywords) can have translations available on the web.

ISO 19115 handles this by defining "locales" (a mandatory ISO 639 language code, an optional ISO 3166 country code, and an optional IANA character set) and using PT_FreeText to define one value per locale for each text field. I like this approach, and I think it translates fairly cleanly to NetCDF attributes. To align with ISO 19115, I would propose two global attributes: locale_default and locale_others. (I put the word 'locale' at the front rather than at the end as in ISO 19115, since this groups similar attributes together and seems to be what CF has usually done.) locale_others could use a prefix system (like the one keywords_vocabulary uses) to separate the different values. For the locale string itself, I would propose the form language-COUNTRY;encoding, loosely following how HTTP headers write language tags and character sets. Maybe the encoding and country are not necessary; I'm not sure, I just know ISO included them.

I would then propose using the prefixes from locale_others as suffixes on existing attribute names to represent the value of that attribute in another locale.

For example, this would give us the following global attributes if we wanted to include English (Canada), French (Canada), and Spanish (Mexico) in our locales and translate the title:

  :locale_default = 'en-CA;utf-8';
  :locale_others = 'fra:fr-CA;utf-8 esp:es-MX;utf-8';
  :title = 'English Title';
  :title_fra = 'Titre français';
  :title_esp = 'Título en español';
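To check that this layout is easy to extract programmatically, here is a minimal Python sketch of how an application might read it. It works on a plain dict of global attributes rather than a real NetCDF file, and the function names (parse_locale, parse_locale_others, localized) are hypothetical, not part of any existing library or convention.

```python
def parse_locale(spec):
    """Split 'fr-CA;utf-8' into (language, country, encoding).

    Country and encoding are optional in the proposal, so missing parts are None.
    """
    tag, _, encoding = spec.partition(";")
    language, _, country = tag.partition("-")
    return language, country or None, encoding or None


def parse_locale_others(value):
    """Map each suffix in a locale_others string to its parsed locale."""
    locales = {}
    for entry in value.split():
        suffix, _, spec = entry.partition(":")
        locales[suffix] = parse_locale(spec)
    return locales


def localized(attrs, name, suffix=None):
    """Return the attribute for the given suffix, falling back to the default value."""
    if suffix:
        return attrs.get(f"{name}_{suffix}", attrs.get(name))
    return attrs.get(name)


# Global attributes from the example above, as a plain dict (assumption:
# a real application would read these from the file instead).
attrs = {
    "locale_default": "en-CA;utf-8",
    "locale_others": "fra:fr-CA;utf-8 esp:es-MX;utf-8",
    "title": "English Title",
    "title_fra": "Titre français",
    "title_esp": "Título en español",
}

others = parse_locale_others(attrs["locale_others"])
french_title = localized(attrs, "title", "fra")
fallback_title = localized(attrs, "title", "deu")  # no German value, so the default is used
```

The fallback to the unsuffixed attribute is a design choice in this sketch: an application that asks for a locale the file does not provide still gets the default value, which matches the goal of not breaking existing tools.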

I was torn on whether the default locale should define a prefix too. If it did, one could use the non-suffixed attribute name for a combination of languages as the default (for applications that don't support localization). For example:

  :locale_default = 'eng:en-CA;utf-8';
  :locale_others = 'fra:fr-CA;utf-8 esp:es-MX;utf-8';
  :title = 'English Title | Titre français';
  :title_eng = 'English Title';
  :title_fra = 'Titre français';
  :title_esp = 'Título en español';

But then this seems like an inaccurate use of locale_default, since the default is actually a combination of languages. Maybe English should be added to locale_others in this case, and locale_default changed to something like und;utf-8, or even just a pattern like [eng] | [fra] that shows the format of the combined value.
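For the variant where the default locale also declares a prefix, a rough Python sketch of splitting the global attributes into one set per locale could look like the following. It again uses a plain dict, and the helper names (declared_suffixes, split_by_locale) are hypothetical. It assumes a suffix is only recognized if it is declared in locale_default or locale_others, so ordinary names containing underscores are left alone.

```python
def declared_suffixes(attrs):
    """Collect the suffixes declared in locale_default and locale_others."""
    suffixes = []
    for key in ("locale_default", "locale_others"):
        for entry in attrs.get(key, "").split():
            suffix, sep, _ = entry.partition(":")
            if sep and suffix:  # entries without a 'suffix:' part declare no suffix
                suffixes.append(suffix)
    return suffixes


def split_by_locale(attrs):
    """Group attributes by declared suffix; unsuffixed ones go under the key None."""
    result = {suffix: {} for suffix in declared_suffixes(attrs)}
    result[None] = {}
    for name, value in attrs.items():
        if name in ("locale_default", "locale_others"):
            continue
        base, sep, suffix = name.rpartition("_")
        if sep and base and suffix in result:
            result[suffix][base] = value
        else:
            result[None][name] = value
    return result


# Global attributes from the prefixed-default example above, as a plain dict.
attrs = {
    "locale_default": "eng:en-CA;utf-8",
    "locale_others": "fra:fr-CA;utf-8 esp:es-MX;utf-8",
    "title": "English Title | Titre français",
    "title_eng": "English Title",
    "title_fra": "Titre français",
    "title_esp": "Título en español",
}

by_locale = split_by_locale(attrs)
```

One caveat this makes visible: an existing attribute whose name happens to end in an underscore followed by a declared suffix would be misread as a translation, so the suffixes would need to be chosen to avoid such collisions.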

I haven't run into a data variable that needs translating yet, but if one comes up, my thought was to define an attribute on the data variable that would allow an application to identify all the related localized variables (i.e. same data, different locale) and which variable goes with which locale. Something like:

  var_name_en:locale = ':var_name';      # locale identified in locale_default
  var_name_fr:locale = 'fra:var_name';   # locale identified in locale_others
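As a sketch of how an application might use such an attribute, the Python below groups variables that share the same base name and records which locale each one carries. The input is a plain dict mapping variable names to their locale attribute values, and group_localized_variables is a hypothetical helper. An empty prefix is taken to mean the locale named in locale_default, as in the comments above.

```python
from collections import defaultdict


def group_localized_variables(var_locale_attrs):
    """Group variables by the shared name in their 'locale' attribute.

    Each attribute value has the form 'suffix:shared_name'; an empty suffix
    is stored under the key 'default'.
    """
    groups = defaultdict(dict)
    for var_name, locale_attr in var_locale_attrs.items():
        suffix, _, shared_name = locale_attr.partition(":")
        groups[shared_name][suffix or "default"] = var_name
    return dict(groups)


# 'locale' attribute values from the example above, keyed by variable name.
variables = {
    "var_name_en": ":var_name",
    "var_name_fr": "fra:var_name",
}

groups = group_localized_variables(variables)
```

With this grouping, a viewer could show one logical variable (var_name) and pick var_name_en or var_name_fr depending on the language the user selected.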

Thoughts, feedback, any other suggestions are very welcome!
