Numeric escape sequence with surrogate pairs (Turtle) #323
My initial reaction is that adding support for surrogates is bad. I'm not clear on what use cases it serves, but it adds complexity. I do think adding the negative tests (and more test coverage in this area, depending on the decisions in w3c/rdf-turtle#131) is a good idea, though.
The i18n request is at: Feel free to ask for more background.
Yeah, I've been following the discussion in that issue. I just don't see any convincing use-cases. I'm not convinced by "some programming languages do this" because those seem like bad upstream choices (possibly influenced by implementation details like the use of UTF-16). |
If the WG wants to define the correct (and only) outcome, some systems have to change. Putting in "don't output surrogates" would be a start. |
This PR is part of the discussion w3c/rdf-turtle#131.
The tests are for allowing a pair of numeric escapes \uHHHH to be a well-formed surrogate pair that is interpreted as the supplemental character codepoint represented by that surrogate pair.
The surrogates are not part of the lexical form in the RDF data model; the supplemental character represented by that surrogate pair is, and the RDF graph is the same as if written using \U.
There are positive syntax tests for valid surrogate pairs written with \uHHHH\uHHHH (high-low surrogate) and negative tests for malformed surrogate pairs (low-high, low-low, high-high) and for lone surrogates; the latter is also in the RDF 1.1 Turtle test suite but the same coverage is repeated for completeness.
There are evaluation tests for a valid surrogate pair with the same graph output in two forms: a supplemental character as UTF-8, and the same character written using \U (they parse to the same graph).
This PR is marked draft because the WG has not yet agreed a resolution of the i18n issue.
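For readers who want to see the behaviour the tests exercise, here is a minimal Python sketch of the proposed interpretation. It is only an illustration, not code from this PR or from any Turtle parser: the function name `unescape_numeric` is hypothetical, it handles only \uHHHH and \UHHHHHHHH (other escapes such as \\ are ignored), and rejecting surrogates written with \U is an assumption of the sketch.

```python
import re

HIGH = range(0xD800, 0xDC00)  # high (leading) surrogates
LOW = range(0xDC00, 0xE000)   # low (trailing) surrogates

# \uHHHH (group 1) or \UHHHHHHHH (group 2)
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")
_SHORT = re.compile(r"\\u([0-9A-Fa-f]{4})")


def unescape_numeric(text: str) -> str:
    """Hypothetical helper: expand numeric escapes, combining a
    \\uHHHH\\uHHHH high-low surrogate pair into one code point."""
    out = []
    pos = 0
    while pos < len(text):
        m = _ESCAPE.match(text, pos)
        if m is None:
            # Not a numeric escape: copy the character through unchanged.
            out.append(text[pos])
            pos += 1
            continue
        is_short = m.group(1) is not None
        cp = int(m.group(1) or m.group(2), 16)
        pos = m.end()
        if cp in LOW:
            # Covers low-high, low-low and lone low surrogates.
            raise ValueError("low surrogate without a preceding high surrogate")
        if cp in HIGH:
            if not is_short:
                # Assumption of this sketch: pairs are only formed from \u.
                raise ValueError("surrogate written with \\U")
            nxt = _SHORT.match(text, pos)
            low = int(nxt.group(1), 16) if nxt else None
            if low is None or low not in LOW:
                # Covers high-high and lone high surrogates.
                raise ValueError("high surrogate not followed by a low surrogate")
            cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00)
            pos = nxt.end()
        # chr() raises ValueError for values above U+10FFFF.
        out.append(chr(cp))
    return "".join(out)
```

Under these assumptions, the input \uD83D\uDE00 and the input \U0001F600 both yield the single character U+1F600, which mirrors the evaluation tests' expectation that the two forms parse to the same graph; the malformed and lone-surrogate cases raise an error.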