Numeric escape sequence with surrogate pairs (Turtle) #323

Draft
afs wants to merge 1 commit into main from surrogates

@afs
Contributor

afs commented Apr 16, 2026

This PR is part of the discussion w3c/rdf-turtle#131.

The tests cover allowing a pair of numeric escapes \uHHHH\uHHHH to form a well-formed surrogate pair, which is interpreted as the supplementary code point that the pair represents.
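
For illustration only (this is not part of the PR or of any test harness), the mapping from a well-formed surrogate pair to its code point can be sketched in Python. The function name `combine_surrogates` is hypothetical; the arithmetic is the standard UTF-16 pairing formula.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Return the supplementary code point for a high/low surrogate pair.

    Hypothetical helper for illustration; raises ValueError if either
    value is outside its surrogate range.
    """
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"not a high surrogate: U+{high:04X}")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"not a low surrogate: U+{low:04X}")
    # Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

Under this sketch, the pair \uD83D\uDE00 maps to U+1F600, the same character that \U0001F600 denotes.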

The surrogates are not part of the lexical form in the RDF data model; the supplementary character represented by the surrogate pair is. The resulting RDF graph is the same as if the character had been written using \U.

There are positive syntax tests for valid surrogate pairs written as \uHHHH\uHHHH (high-low), and negative tests for malformed surrogate pairs (low-high, low-low, high-high) and for lone surrogates. Lone surrogates are already covered by the RDF 1.1 Turtle test suite, but the same coverage is repeated here for completeness.
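
As a rough sketch of the accept/reject behaviour these tests describe, the Python below decodes numeric escapes and pairs \uHHHH surrogates. Everything here is an assumption for illustration: `decode_uchars` is a hypothetical name, the input is assumed to contain no escaped backslashes, and rejecting surrogate values written with \U is a choice made in this sketch. The behaviour itself is what the PR proposes, not something the WG has agreed.

```python
import re

# Matches \uHHHH or \UHHHHHHHH numeric escapes.
# Simplification: assumes the text contains no escaped backslashes.
_UCHAR = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def decode_uchars(text: str) -> str:
    """Replace numeric escapes, pairing \\uHHHH surrogates (hypothetical helper)."""
    out = []
    pos = 0
    pending_high = None  # high surrogate waiting for its low partner
    for m in _UCHAR.finditer(text):
        gap = text[pos:m.start()]
        if gap and pending_high is not None:
            raise ValueError("lone high surrogate")
        out.append(gap)
        pos = m.end()
        cp = int(m.group(1) or m.group(2), 16)
        is_short = m.group(1) is not None  # True for the \uHHHH form
        if pending_high is not None:
            if not (is_short and 0xDC00 <= cp <= 0xDFFF):
                raise ValueError("high surrogate not followed by low surrogate")
            joined = 0x10000 + ((pending_high - 0xD800) << 10) + (cp - 0xDC00)
            out.append(chr(joined))
            pending_high = None
        elif 0xD800 <= cp <= 0xDBFF:
            if not is_short:
                raise ValueError("surrogate value in \\U escape")  # assumption
            pending_high = cp
        elif 0xDC00 <= cp <= 0xDFFF:
            raise ValueError("low surrogate without preceding high surrogate")
        else:
            out.append(chr(cp))  # chr() raises ValueError above U+10FFFF
    if pending_high is not None:
        raise ValueError("lone high surrogate")
    out.append(text[pos:])
    return "".join(out)
```

With this sketch, `decode_uchars(r"\uD83D\uDE00")` yields the single character U+1F600, while the low-high, low-low, high-high, and lone-surrogate inputs all raise `ValueError`.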

There are evaluation tests for a valid surrogate pair. The expected graph is given in two forms: with the supplementary character written directly as UTF-8, and with it written using \U. Both forms parse to the same graph.
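
The equivalence of the three spellings can be checked outside any RDF toolkit. The sketch below relies only on Python's standard codecs and the `surrogatepass` error handler; `pair_via_utf16` is a hypothetical helper name, and this is an illustration rather than how the test suite compares graphs.

```python
def pair_via_utf16(high: int, low: int) -> str:
    """Join two surrogate code units by round-tripping through UTF-16."""
    units = chr(high) + chr(low)  # two lone surrogates in a Python str
    # 'surrogatepass' lets the encoder emit the raw code units; the strict
    # decode then pairs them, or raises UnicodeDecodeError if ill-formed.
    return units.encode("utf-16-le", "surrogatepass").decode("utf-16-le")


from_pair = pair_via_utf16(0xD83D, 0xDE00)       # as if written \uD83D\uDE00
from_big_u = chr(0x1F600)                        # as if written \U0001F600
from_utf8 = b"\xf0\x9f\x98\x80".decode("utf-8")  # the character as raw UTF-8

same = from_pair == from_big_u == from_utf8
```

Here `same` is `True`: all three spellings produce one and the same one-character string, which is what makes the two expected-output forms interchangeable.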

This PR is marked draft because the WG has not yet agreed a resolution of the i18n issue.

afs marked this pull request as draft on April 16, 2026, 16:49
@kasei
Contributor

kasei commented Apr 16, 2026

My initial reaction is that adding support for surrogates is bad. I'm not clear on what use cases it serves, but it adds complexity.

I do think adding the negative tests (and more test coverage in this area, depending on the decisions in w3c/rdf-turtle#131) is a good idea, though.

@afs
Contributor Author

afs commented Apr 16, 2026

> I'm not clear on what use cases it serves, but it adds complexity.

The i18n request is at:

w3c/rdf-turtle#131 (comment)

Feel free to ask for more background.

@kasei
Contributor

kasei commented Apr 16, 2026

Yeah, I've been following the discussion in that issue. I just don't see any convincing use cases. I'm not convinced by "some programming languages do this", because those seem like bad upstream choices (possibly influenced by implementation details such as the use of UTF-16).

@afs
Contributor Author

afs commented Apr 16, 2026

If the WG wants to define the correct (and only) outcome, some systems have to change.

Putting in "don't output surrogates" would be a start.
