Commit 1bafca1

Adding UTF-8 guide docs to doxygen and Sphinx docs
(Internal change: 2317598)
1 parent b226a1c commit 1bafca1

4 files changed

Lines changed: 154 additions & 0 deletions

docs/usdfaq.rst

Lines changed: 14 additions & 0 deletions
@@ -114,6 +114,20 @@ What character encoding does :filename:`.usda` support?
114114

115115
The :filename:`.usda` file format encodes text as UTF-8.
116116

117+
As of the 24.03 release, USD extends UTF-8 support to path and metadata
118+
identifiers.
119+
120+
USD does not enforce or apply Unicode normalization. As an example, the
121+
second letter in München can be represented in UTF-8 by a single code point
122+
(the code point for ü, U+00FC) or two code points (u followed by the combining diaeresis U+0308) -- to USD,
123+
these two representations of München are distinct. While USD does not enforce a
124+
normalization form, Unicode "Normalization Form C" (NFC) is preferred when
125+
creating new tokens and paths.
126+
127+
See `Unicode in USD
128+
<api/_usd__page__u_t_f_8.html>`__ for more details on best practices when working
129+
with UTF-8 encoded content in :filename:`.usda` files.
130+
117131
How can I convert USD files between binary and text?
118132
####################################################
119133

pxr/usd/usd/CMakeLists.txt

Lines changed: 1 addition & 0 deletions
@@ -183,6 +183,7 @@ pxr_library(usd
183183
doxygen/multiThreading.dox
184184
doxygen/objectModel.dox
185185
doxygen/propertiesOfSceneDescription.dox
186+
doxygen/utf8Overview.dox
186187
doxygen/valueClips.dox
187188
doxygen/images/instancing/Instancing_Example.png
188189
doxygen/images/instancing/Nested_Instancing_Example.png

pxr/usd/usd/doxygen/front.dox

Lines changed: 9 additions & 0 deletions
@@ -54,6 +54,15 @@ import/export.
5454
<li> \ref Usd_Array_Datatypes </li>
5555
<li> \ref Usd_Dictionary_Type </li>
5656
</ol>
57+
<li> \subpage Usd_Page_UTF_8 </li>
58+
<ol type="i">
59+
<li> \ref Usd_UTF_8_Overview </li>
60+
<li> \ref Usd_UTF_8_Encoding </li>
61+
<li> \ref Usd_UTF_8_Language_Support </li>
62+
<li> \ref Usd_UTF_8_Identifiers </li>
63+
<li> \ref Usd_UTF_8_Operation_Reference </li>
64+
<li> \ref Usd_UTF_8_Encoding_Reference </li>
65+
</ol>
5766
<li> \subpage Usd_Page_PropertiesOfSceneDescription </li>
5867
<ol type="i">
5968
<li> \ref Usd_Ordering </li>
pxr/usd/usd/doxygen/utf8Overview.dox

Lines changed: 130 additions & 0 deletions
@@ -0,0 +1,130 @@
1+
/*! \page Usd_Page_UTF_8 Unicode in USD
2+
3+
\section Usd_UTF_8_Overview Overview
4+
5+
Text, unless otherwise noted, should be assumed to be UTF-8 encoded. It's
6+
erroneous to describe USDA as an "ASCII" file format, as strings, tokens, and
7+
asset-valued fields have been expected to support UTF-8 for several releases.
8+
USD 24.03 extends UTF-8 support to path and metadata identifiers.
9+
10+
This document aims to help users and developers reason about how to best build
11+
and validate UTF-8 content and tooling for USD.
12+
13+
\section Usd_UTF_8_Encoding UTF-8 Encoding
14+
15+
UTF-8 is a variable length encoding that is backwards compatible with ASCII.
16+
Every ASCII character and string is byte-equivalent to its UTF-8 encoded
17+
character and string. Users should think of UTF-8 strings as bytes representing
18+
"code points" in the Unicode code charts. A single code point may be represented
19+
by a sequence of 1, 2, 3, or 4 bytes.
20+
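As a rough illustration of the variable-length encoding, the sketch below
encodes one code point from each length class and checks that ASCII text is
byte-identical in UTF-8. It uses only Python's built-in codecs and is not a USD
API; the sample characters and the example path are arbitrary choices.

```python
# Illustrative only: Python's built-in codecs, not any USD API.
# Sample code points: U+0041 (A), U+00FC, U+20AC, U+1F600.
samples = ["A", "\u00fc", "\u20ac", "\U0001F600"]

# Number of bytes each code point occupies once UTF-8 encoded.
byte_lengths = [len(ch.encode("utf-8")) for ch in samples]
print(byte_lengths)  # [1, 2, 3, 4]

# ASCII text is byte-for-byte identical when encoded as UTF-8.
ascii_text = "/World/Geom"  # arbitrary example path
ascii_matches_utf8 = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print(ascii_matches_utf8)  # True
```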
21+
\subsection Usd_UTF_8_Encoding_Replacement Replacement Code Point
22+
23+
Not every 1, 2, 3, or 4 byte sequence represents a valid UTF-8 code point. When
24+
a byte sequence is invalid and cannot be decoded, USD replaces the sequence with
25+
� (U+FFFD, the Unicode replacement character). Note that USD does not have to decode most strings that pass through it and
26+
should not be relied on for validation of content.
27+
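Because USD should not be relied on for validation, an application may want to
check bytes before handing them to USD. The sketch below shows the replacement
behavior using Python's own decoder (an assumption for illustration; it is not
USD's decoder) together with a hypothetical `is_valid_utf8` helper.

```python
# Illustrative only: Python's decoder performs the replacement here, not USD's.
invalid = b"M\xffnchen"  # the byte 0xFF never appears in well-formed UTF-8

# errors="replace" substitutes U+FFFD for the undecodable byte.
decoded = invalid.decode("utf-8", errors="replace")
print(decoded == "M\ufffdnchen")  # True


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

A strict content pipeline could call such a check on incoming byte strings and
reject or repair them before they reach USD.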
28+
\subsection Usd_UTF_8_Encoding_Normalization Normalization
29+
30+
UTF-8 encoded strings may have sequences of code points that describe equivalent
31+
text rendered to the user. As an example, the second letter in München can be
32+
represented in UTF-8 by a single code point (U+00FC) or two code points (u
33+
followed by the combining diaeresis U+0308). USD does not enforce or apply normalization forms internally. To
34+
USD, these two representations of München are distinct.
35+
36+
While USD does not enforce a normalization form, Unicode "Normalization Form C"
37+
(NFC) is preferred when creating new tokens and paths. The Python library
38+
unicodedata can be used to normalize strings. Strict validators may choose to
39+
warn users about strings (including tokens and paths) that are not NFC
40+
normalized. In the above example, the two-code-point version of München would
41+
be flagged by a validator checking for NFC normalization.
42+
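The NFC check described above can be sketched with Python's `unicodedata`
module. The `is_nfc` helper is hypothetical and only shows what a strict
validator might test; it is not part of USD.

```python
import unicodedata

composed = "M\u00fcnchen"     # U+00FC as a single code point
decomposed = "Mu\u0308nchen"  # u followed by combining diaeresis U+0308

# The two spellings render alike but compare unequal, as they do in USD.
assert composed != decomposed


def is_nfc(text: str) -> bool:
    """Hypothetical validator check: is text already NFC normalized?"""
    return unicodedata.normalize("NFC", text) == text


print(is_nfc(composed), is_nfc(decomposed))  # True False

# Normalizing the two-code-point spelling yields the single-code-point one.
normalized = unicodedata.normalize("NFC", decomposed)
```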
43+
EdwardVII (where VII is three capital ASCII letters) and EdwardⅦ (where Ⅶ is a
44+
single Unicode code point) are distinct string values even under NFC
45+
normalization. When a user-facing interface involves fuzzy matching of strings,
46+
the Unicode documentation recommends Unicode "Normalization Form KC" (NFKC)
47+
normalization so a user does not have to be aware of specific encoding
48+
semantics. Strict validators may choose to warn users about siblings that have
49+
colliding NFKC normalization representations.
50+
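A sibling collision check of the kind mentioned above might look like the
following sketch. The function name `find_nfkc_collisions` and the sample
sibling names are assumptions made for illustration; USD does not provide this
check.

```python
import unicodedata
from collections import defaultdict


def find_nfkc_collisions(sibling_names):
    """Group sibling names that share one NFKC form (hypothetical helper)."""
    groups = defaultdict(list)
    for name in sibling_names:
        groups[unicodedata.normalize("NFKC", name)].append(name)
    # Keep only NFKC forms reached by more than one distinct name.
    return {key: names for key, names in groups.items() if len(set(names)) > 1}


# U+2166 is the single code point ROMAN NUMERAL SEVEN.
siblings = ["EdwardVII", "Edward\u2166", "Victoria"]
collisions = find_nfkc_collisions(siblings)
print(collisions == {"EdwardVII": ["EdwardVII", "Edward\u2166"]})  # True
```

NFC leaves `Edward` followed by U+2166 unchanged, which is why only the NFKC
pass reports the pair.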
51+
\section Usd_UTF_8_Language_Support Language Support
52+
53+
\subsection Usd_UTF_8_Language_Support_CPP C++
54+
55+
USD assumes that all C++ string types (including tokens, scene paths, and
56+
asset paths) are UTF-8 encoded unless otherwise specified. Applications must
57+
ensure content is properly UTF-8 encoded before using USD APIs. The C++ standard
58+
library does not provide a Unicode library, but many string operations designed
59+
for single byte ASCII character strings in both the C++ standard library and Tf
60+
will work without modification. Developers should verify by reading the
61+
documentation and including UTF-8 content in test cases. Tf provides a minimal
62+
set of Unicode utilities primarily for its own internal usage and does not aim
63+
to be a fully featured Unicode support library.
64+
65+
\subsection Usd_UTF_8_Language_Support_Python Python
66+
67+
Strings as of Python 3.0 are natively Unicode (though not UTF-8 encoded). Python
68+
provides string operations like `casefold` for case insensitive comparison and a
69+
library unicodedata for some transformations and queries. Utilities in Boost
70+
Python and Tf handle string conversion to and from UTF-8 at the USD
71+
C++/Python language boundary.
72+
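The following sketch uses only built-in Python string methods to show the
distinction between a Python string and its UTF-8 bytes, and how `casefold`
differs from `lower`. The sample word is arbitrary, and no USD call is involved.

```python
# A Python str holds code points; UTF-8 appears only when encoding to bytes.
name = "Stra\u00dfe"  # contains U+00DF (sharp s)
utf8_bytes = name.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")  # lossless round trip

# casefold() is more aggressive than lower() for caseless comparison.
lowered = name.lower()    # sharp s is kept as-is
folded = name.casefold()  # sharp s becomes "ss"
matches_upper = "STRASSE".casefold() == folded
print(matches_upper)  # True
```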
73+
\section Usd_UTF_8_Identifiers Identifiers
74+
75+
Identifiers are used to name prims, properties, and metadata fields. The Unicode
76+
specification provides two classes of code points, XID_Start and XID_Continue
77+
to validate identifiers. USD extends the XID_Start class with `_` to define its
78+
default identifier set.
79+
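For quick experiments outside of USD, Python's `str.isidentifier` follows a
similar rule (XID_Start or `_`, then XID_Continue). The sketch below is an
approximation only: the Unicode tables bundled with a given Python version may
differ from those USD uses, and the helper name is hypothetical.
SdfPath::IsValidIdentifier remains the authoritative check.

```python
def looks_like_default_identifier(name: str) -> bool:
    """Approximate the XID_Start-plus-underscore / XID_Continue rule.

    Hypothetical helper: relies on str.isidentifier(), whose Unicode
    tables may differ in version from the ones USD uses.
    """
    return name.isidentifier()


candidates = ["M\u00fcnchen", "_private", "2fast", "has space", ""]
results = [looks_like_default_identifier(c) for c in candidates]
print(results)  # [True, True, False, False, False]
```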
80+
USD path identifiers should be validated with SdfPath::IsValidIdentifier and
81+
SdfPath::IsValidNamespacedIdentifier. TfIsValidIdentifier and
82+
TfMakeValidIdentifier should generally not be used to validate and produce prim
83+
or path identifiers.
84+
85+
\section Usd_UTF_8_Operation_Reference Operation Quick Reference
86+
87+
This table lists common string operations and how to reason about them within
88+
USD's UTF-8 support.
89+
90+
Operation | Recommendation
91+
---------------------- | --------------
92+
Equivalence (==) | Strings (and tokens, paths, and assets) are considered equivalent by USD if their byte (and therefore code point) representations are equivalent.
93+
Deterministic ordering | Ordering a valid UTF-8 string by bytes should be equivalent to ordering by code point without decoding (if each byte is interpreted as an unsigned char).
94+
Backwards Compatible Deterministic Ordering | USD has a legacy sorting algorithm (TfDictionaryLessThan) which orders alphanumeric characters case independently. Case independent ordering cannot be trivially extended to the full set of UTF-8 code points, so only non-ASCII code points are ordered by code point value.
95+
Collating | USD does not provide advanced string ordering operations often known as collating.
96+
Casefolding | There is no support for general casefolding of UTF-8 strings. Use TfStringToLowerAscii to fold all ASCII characters in a UTF-8 string. TfStringToLower, TfStringToUpper, and TfStringCapitalize should not be used on UTF-8 strings.
97+
Regular expressions | TfPatternMatcher does not currently offer case insensitive matching of UTF-8 strings.
98+
Tokenizing | Splitting UTF-8 strings around common ASCII symbols like `/` or `.` does not generally require any special consideration. Use TfUtf8CodePointIterator if trying to find and split around a multi-byte code point.
99+
Concatenation | Concatenation of two valid UTF-8 strings is still a valid UTF-8 string, though normalization may not be preserved.
100+
Length | In C++, a string's length is its number of bytes, not the number of code points. The number of code points can be computed by taking the distance between a TfUtf8CodePointView's begin and end. In Python, `len` will count code points.
101+
Path identifier validation | Do not use TfIsValidIdentifier as it will reject UTF-8 characters. Use SdfPath::IsValidIdentifier, SdfPath::IsValidNamespacedIdentifier, SdfSchemaBase::IsValidVariantIdentifier, and SdfSchemaBase::IsValidVariantSelection.
102+
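Several rows of the table can be checked with a short Python sketch. It uses
only built-in string and bytes operations, and `to_lower_ascii` is a
hypothetical stand-in written for illustration, not a binding of
TfStringToLowerAscii.

```python
# Illustrative only; nothing here calls into USD.
names = ["Zebra", "apple", "\u00e9clair", "M\u00fcnchen", "\U0001F600", "\uff21"]

# Deterministic ordering: sorting by UTF-8 bytes matches sorting by code point.
by_bytes = sorted(names, key=lambda s: s.encode("utf-8"))
by_code_point = sorted(names)  # Python compares str values by code point

# Length: code points versus UTF-8 bytes.
city = "M\u00fcnchen"
code_point_count = len(city)            # 7
byte_count = len(city.encode("utf-8"))  # 8

# Tokenizing: an ASCII delimiter byte never occurs inside a multi-byte sequence.
path_bytes = "/World/M\u00fcnchen".encode("utf-8")
parts = [p.decode("utf-8") for p in path_bytes.split(b"/")]


def to_lower_ascii(text: str) -> str:
    """Lowercase only A-Z, leaving every non-ASCII code point untouched."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in text)


folded = to_lower_ascii("M\u00dcNCHEN")  # U+00DC stays uppercase
```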
103+
\section Usd_UTF_8_Encoding_Reference Encoding Quick Reference
104+
105+
This table records the encoding representations and rules for USD content.
106+
Strict validators can use the best practices to warn users about
107+
non-conforming content.
108+
109+
| Type or Context | Encoding and Restrictions | Best Practices |
110+
| ------------------------ | ------------------------- | -------------- |
111+
| string (sdf value type) | UTF-8 | |
112+
| token (sdf value type) | UTF-8 | Prefer NFC normalized |
113+
| asset (sdf value type) | UTF-8 (Protocols determine lookup equivalence) | See URI and IRI Specifications |
114+
| prim identifier | UTF-8 (Xid character class + leading `_`) | Prefer NFC normalized |
115+
| property identifier | UTF-8 (Xid character class + leading `_`). Property identifiers may be namespaced with medial `:`. | Prefer NFC normalized |
116+
| variant set identifier | UTF-8 (Xid character class + leading `_`) | Prefer NFC normalized |
117+
| variant selection identifier | UTF-8 (Xid character class, with leading "continue" code points including `_` and digits) | Prefer NFC normalized |
118+
| metadata field identifier | UTF-8 (Xid character class + leading `_`) | Prefer NFC normalized |
119+
| schema type name | ASCII C++ Identifier (alphanumeric + `_` with no leading digits) | |
120+
| schema property name | ASCII C++ Identifier (alphanumeric + `_` with no leading digits) | |
121+
| file format extension (Sdf) | UTF-8. Only ASCII characters are casefolded for equivalence / dispatch. | Prefer casefolded |
122+
| resolver scheme (Ar) | URI specification. Starts with a single ASCII letter, followed by any ASCII alphanumeric, `-`, `+`, and `.`. Casefolded for equivalence and dispatch. | Prefer casefolded |
123+
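As one example of turning a table row into a validator rule, the resolver
scheme restriction can be expressed as a regular expression. This is a sketch
under the wording of the row above; the names `is_valid_scheme` and
`schemes_equivalent` are hypothetical and do not correspond to Ar functions.

```python
import re

# An ASCII letter, then any ASCII alphanumerics, '-', '+', or '.'.
_SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*")


def is_valid_scheme(scheme: str) -> bool:
    """Hypothetical check for the resolver scheme row of the table."""
    return _SCHEME_RE.fullmatch(scheme) is not None


def schemes_equivalent(a: str, b: str) -> bool:
    """Compare two valid (ASCII-only) schemes after casefolding."""
    return a.lower() == b.lower()


print(is_valid_scheme("my-resolver+v1.0"))  # True
print(is_valid_scheme("1abc"))              # False
```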
124+
\section Usd_UTF_8_Additional_Resources Additional Resources
125+
126+
- [Unicode Identifiers in USD proposal](https://github.com/PixarAnimationStudios/OpenUSD-proposals/tree/main/proposals/tf_utf8_identifiers)
127+
- [Unicode Standard v15.0 (PDF)](https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf)
128+
- [Unicode Identifiers and Syntax](https://www.unicode.org/reports/tr31/)
129+
130+
*/
