Bug report: Unescape Unicode Characters only accepts exactly 4 hex digits for U+ #2242

@williballenthin

Describe the bug
The Unescape Unicode Characters operation fails to decode valid code points with more than 4 hex digits when using the U+ prefix, breaking support for astral plane characters like emoji.

src/core/operations/UnescapeUnicodeCharacters.mjs, run() method, line 55

run(input, args) {
    const prefix = prefixToRegex[args[0]],
        regex = new RegExp(prefix+"([a-f\\d]{4})", "ig");
    // ...
}

The regex is hardcoded to exactly 4 hex digits for all prefixes, so notation like U+1F600 (😀) and U+000041 (zero-padded A) cannot be decoded correctly. It also breaks round-trips: Escape Unicode Characters can emit 6-digit output like U+000041 when configured with Padding: 6, but Unescape cannot decode it.
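
To illustrate how the fixed-width quantifier treats longer input, here is a minimal standalone sketch. It assumes the U+ prefix maps to the pattern U\+ (U_PLUS_PREFIX is a hypothetical stand-in for the prefixToRegex entry, which is not quoted in this issue), and capturedDigits is a helper invented for this example. Under that assumption the pattern captures only the first four digits, and any remaining digits are left behind as literal text:

```javascript
// Hypothetical stand-in for prefixToRegex["U+"]; the real table is not quoted in this issue.
const U_PLUS_PREFIX = "U\\+";

// Same pattern shape as the current run(): exactly four hex digits.
const fixedWidth = new RegExp(U_PLUS_PREFIX + "([a-f\\d]{4})", "ig");

// Illustrative helper: returns the hex digits captured by each match.
function capturedDigits(regex, input) {
    return Array.from(input.matchAll(regex), (m) => m[1]);
}

console.log(capturedDigits(fixedWidth, "U+0041"));   // [ '0041' ]
console.log(capturedDigits(fixedWidth, "U+1F600"));  // [ '1F60' ], trailing "0" is not consumed
console.log(capturedDigits(fixedWidth, "U+000041")); // [ '0000' ], trailing "41" is not consumed
```

In neither of the longer cases can the intended code point (0x1F600 or 0x41) be recovered from the capture group.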

To Reproduce
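Add Unescape Unicode Characters with prefix U+ and input U+1F600. Expected: 😀. Actual: 😀 is not produced. With the pattern above, only the first four hex digits (1F60) can be captured, so the trailing 0 is left undecoded.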

Additional context
Proposed fix widens the quantifier for U+ only:

run(input, args) {
    const prefix = prefixToRegex[args[0]],
        regex = args[0] === "U+"
            ? new RegExp(prefix+"([a-f\\d]{4,6})", "ig")
            : new RegExp(prefix+"([a-f\\d]{4})", "ig");
    // ...
}

Standard U+ notation allows variable-length hex sequences from 4 to 6 digits. The \u and %u forms are legacy and expect exactly 4 digits (or surrogate pairs), so they retain the fixed-length requirement.
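
The widened quantifier only helps if the decoding step can also handle values above 0xFFFF. Since the body of run() is elided here, the following is a standalone sketch rather than the operation's actual code: unescapeUPlus, U_PLUS_PREFIX, and MAX_CODE_POINT are hypothetical names, and String.fromCodePoint is used as one possible way to build astral plane characters. It also guards against six-digit values beyond U+10FFFF, which {4,6} would otherwise accept.

```javascript
// Hypothetical stand-in for prefixToRegex["U+"]; the real table is not quoted in this issue.
const U_PLUS_PREFIX = "U\\+";
const MAX_CODE_POINT = 0x10FFFF;

// Illustrative decoder for the U+ prefix only, using the proposed {4,6} quantifier.
function unescapeUPlus(input) {
    const regex = new RegExp(U_PLUS_PREFIX + "([a-f\\d]{4,6})", "ig");
    return input.replace(regex, (match, hex) => {
        const codePoint = parseInt(hex, 16);
        // Six hex digits can exceed the Unicode range; leave such text unchanged
        // rather than letting String.fromCodePoint throw a RangeError.
        return codePoint <= MAX_CODE_POINT ? String.fromCodePoint(codePoint) : match;
    });
}

console.log(unescapeUPlus("U+1F600"));  // 😀
console.log(unescapeUPlus("U+000041")); // A
console.log(unescapeUPlus("U+0041"));   // A
console.log(unescapeUPlus("U+FFFFFF")); // U+FFFFFF (out of range, left unchanged)
```

One trade-off to be aware of: the quantifier is greedy, so hex-like text that directly follows an escape is absorbed into it. For example, U+0041BC is read as the single code point U+41BC rather than A followed by BC.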
