This table lists common string operations and how to reason about them within
USD's UTF-8 support.

Operation | Recommendation
---------------------- | --------------
Equivalence (==) | Strings (and tokens, paths, and assets) are considered equivalent by USD if their byte (and therefore code point) representations are equivalent.
Deterministic ordering | Ordering valid UTF-8 strings by bytes is equivalent to ordering them by code point, with no decoding required, provided each byte is interpreted as an unsigned char.
Backwards Compatible Deterministic Ordering | USD has a legacy sorting algorithm (TfDictionaryLessThan) which orders alphanumeric characters case independently. Case independent ordering cannot be trivially extended to the full set of UTF-8 code points, so non-ASCII code points are ordered by code point value only.
Collating | USD does not provide advanced string ordering operations often known as collating.
Casefolding | There is no support for general casefolding of UTF-8 strings. Use TfStringToLowerAscii to fold all ASCII characters in a UTF-8 string. TfStringToLower, TfStringToUpper, and TfStringCapitalize should not be used on UTF-8 strings.
Regular expressions | TfPatternMatcher does not currently offer case-insensitive matching of UTF-8 strings.
Tokenizing | Splitting UTF-8 strings around common ASCII symbols like `/` or `.` does not generally require any special consideration. Use TfUtf8CodePointIterator if trying to find and split around a multi-byte code point.
Concatenation | Concatenating two valid UTF-8 strings yields a valid UTF-8 string, though normalization may not be preserved.
Length | In C++, a string's length is its number of bytes, not the number of code points. The number of code points can be computed by taking the distance between a TfUtf8CodePointView's begin and end. In Python, `len` will count code points.
Path identifier validation | Do not use TfIsValidIdentifier, as it rejects non-ASCII characters. Use SdfPath::IsValidIdentifier, SdfPath::IsValidNamespacedIdentifier, SdfSchemaBase::IsValidVariantIdentifier, and SdfSchemaBase::IsValidVariantSelection instead.

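
Several rows of the table above (deterministic ordering, length, and concatenation) describe general properties of UTF-8 that can be checked without USD. The following is a minimal sketch using only the Python standard library; it does not call any USD API, and the helper names `utf8_byte_length` and `sort_by_utf8_bytes` are hypothetical names chosen for illustration. It assumes Python 3.8 or later for `unicodedata.is_normalized`.

```python
import unicodedata


def utf8_byte_length(text: str) -> int:
    """Number of bytes in the UTF-8 encoding of text (hypothetical helper).

    This is the quantity a C++ std::string reports as its size.
    """
    return len(text.encode("utf-8"))


def sort_by_utf8_bytes(strings):
    """Sort strings by their raw UTF-8 bytes (hypothetical helper).

    Python compares bytes objects as sequences of unsigned values.
    """
    return sorted(strings, key=lambda s: s.encode("utf-8"))


# Deterministic ordering: Python compares str objects by code point, so the
# two orderings can be compared directly.
names = ["zebra", "\u00c4pfel", "apple", "\u65e5\u672c", "\u00e9clair", "\U0001F600"]
same_order = sort_by_utf8_bytes(names) == sorted(names)

# Length: code points versus bytes. U+00EF takes two bytes in UTF-8.
word = "na\u00efve"
code_points = len(word)
byte_count = utf8_byte_length(word)

# Concatenation: "e" and a lone combining acute accent are each NFC
# normalized, but their concatenation is not.
base, accent = "e", "\u0301"
joined = base + accent
joined_is_nfc = unicodedata.is_normalized("NFC", joined)
```

Here `same_order` is true, `code_points` and `byte_count` differ for the same string, and `joined_is_nfc` is false even though both inputs were normalized, which is the situation the Concatenation row warns about.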
| asset (sdf value type) | UTF-8 (Protocols determine lookup equivalence) | See URI and IRI Specifications |
| prim identifier | UTF-8 (Xid character class + leading `_`) | Prefer NFC normalized |
| property identifier | UTF-8 (Xid character class + leading `_`). Property identifiers may be namespaced with medial `:`. | Prefer NFC normalized |
| variant set identifier | UTF-8 (Xid character class + leading `_`) | Prefer NFC normalized |
| variant selection identifier | UTF-8 (Xid character class, with leading "continue" code points including `_` and digits) | Prefer NFC normalized |
| metadata field identifier | UTF-8 (Xid character class + leading `_`) | Prefer NFC normalized |
| schema type name | ASCII C++ Identifier (alphanumeric + `_` with no leading digits) | |
| schema property name | ASCII C++ Identifier (alphanumeric + `_` with no leading digits) | |
| file format extension (Sdf) | UTF-8. Only ASCII characters are casefolded for equivalence / dispatch. | Prefer casefolded |
| resolver scheme (Ar) | URI specification. Starts with a single ASCII letter, followed by any ASCII alphanumeric, `-`, `+`, and `.`. Casefolded for equivalence and dispatch. | Prefer casefolded |
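
The identifier and scheme rules in the rows above can be approximated outside of USD for quick experiments. The sketch below is an illustration only and is not USD's implementation: it assumes that Python's `str.isidentifier()`, which follows the Unicode XID_Start / XID_Continue rules and also accepts a leading `_`, is a close enough stand-in for the "Xid character class + leading `_`" rule, and the function names are hypothetical. In USD itself, `SdfPath::IsValidIdentifier` and the related functions remain the authority.

```python
import re
import unicodedata

# Scheme grammar as described in the table: one ASCII letter followed by
# ASCII alphanumerics, '-', '+', or '.'.
_SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.\-]*")


def approx_is_xid_identifier(name: str) -> bool:
    """Approximate the 'Xid character class + leading _' rule.

    Assumption: str.isidentifier() is used as a stand-in; it is not
    guaranteed to agree with USD's own validation in every case.
    """
    return name.isidentifier()


def is_nfc(name: str) -> bool:
    """Report whether name is already NFC normalized (Python 3.8+)."""
    return unicodedata.is_normalized("NFC", name)


def normalize_scheme(scheme: str) -> str:
    """Validate a resolver scheme and return its ASCII-casefolded form."""
    if not _SCHEME_RE.fullmatch(scheme):
        raise ValueError(f"not a valid URI scheme: {scheme!r}")
    # The pattern admits ASCII only, so lower() is an ASCII casefold here.
    return scheme.lower()
```

Note that an identifier can pass the character class check without being NFC normalized: `"cafe\u0301"` (an `e` followed by a combining accent) is accepted by `approx_is_xid_identifier` while `is_nfc` returns false for it, which is why the table lists normalization as a separate recommendation.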