CommonMark CJK-friendly Amendments Specification

CommonMark issue: commonmark/commonmark-spec#650

The following chapters are written as amendments to the original CommonMark specification. Chapters, sections, and definitions not listed here are unchanged from the original specification.

2. Preliminaries

2.1 Characters and lines

A CJK character is a character (Unicode code point) that meets at least one of the following criteria:

An Ideographic Variation Selector is a character in the Variation Selectors Supplement Block (U+E0100–U+E01EF).

A Non-emoji General-use Variation Selector is a character in the Variation Selectors Block (U+FE00–U+FE0F) other than Emoji Presentation Selector U+FE0F.
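The two variation selector definitions above are plain code point range checks. A minimal sketch follows; the function names are illustrative and not part of the specification.

```python
def is_ideographic_variation_selector(ch: str) -> bool:
    """True for the Variation Selectors Supplement block (U+E0100-U+E01EF)."""
    return 0xE0100 <= ord(ch) <= 0xE01EF


def is_non_emoji_general_use_variation_selector(ch: str) -> bool:
    """True for U+FE00-U+FE0E, i.e. the Variation Selectors block
    without the Emoji Presentation Selector U+FE0F."""
    return 0xFE00 <= ord(ch) <= 0xFE0E
```

Both functions expect a single code point (a Python `str` of length 1).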

A CJK sequence is a CJK character, or a sequence of 2 characters where the first is a CJK character and the second is a Non-emoji General-use Variation Selector.

A CJK punctuation character is a Unicode punctuation character that is also a CJK character.

A non-CJK punctuation character is a Unicode punctuation character that is not a CJK punctuation character.

A Unicode punctuation sequence is a Unicode punctuation character, or a sequence of 2 characters where the first is a Unicode punctuation character and the second is a Non-emoji General-use Variation Selector.
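The CJK sequence and Unicode punctuation sequence definitions share one shape: a base character, optionally followed by a Non-emoji General-use Variation Selector. The sketch below expresses that shape once and takes the base-character test as a parameter. The helper names are hypothetical, and `is_unicode_punctuation` assumes the CommonMark 0.31 definition (general categories P and S); older CommonMark versions define it differently.

```python
import unicodedata
from typing import Callable


def is_unicode_punctuation(ch: str) -> bool:
    # Assumption: CommonMark 0.31 definition (general categories P* and S*).
    # The result also depends on the Unicode version bundled with Python.
    return unicodedata.category(ch)[0] in ("P", "S")


def is_non_emoji_general_use_variation_selector(ch: str) -> bool:
    # U+FE00-U+FE0E: the Variation Selectors block without U+FE0F.
    return 0xFE00 <= ord(ch) <= 0xFE0E


def matches_sequence(text: str, is_base: Callable[[str], bool]) -> bool:
    """True if `text` is one base character, or a base character followed
    by a Non-emoji General-use Variation Selector."""
    if len(text) == 1:
        return is_base(text)
    if len(text) == 2:
        return is_base(text[0]) and is_non_emoji_general_use_variation_selector(text[1])
    return False
```

For example, `matches_sequence("!\uFE00", is_unicode_punctuation)` is true, while `matches_sequence("!\uFE0F", is_unicode_punctuation)` is false because U+FE0F is excluded. Passing a CJK character predicate instead would give the CJK sequence test.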

A CJK ambiguous punctuation sequence is a Standardized Variation Sequence that meets all of the following: its description in StandardizedVariants.txt (the latest version) contains the words "fullwidth form"; its first character is a Unicode punctuation character; and the UAX #11 East Asian Width category of its first character is A (Ambiguous).
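One way to derive these sequences is to scan StandardizedVariants.txt and apply the three conditions to each entry. The sketch below is illustrative only and rests on several assumptions: the function name and the `SAMPLE` lines are ours (the sample is meant to mirror the file's `base selector; description; # name` layout), "non-fullwidth form" is assumed not to count as containing the words "fullwidth form", Unicode punctuation follows the CommonMark 0.31 definition (categories P and S), and Python's `unicodedata` reflects the Unicode version bundled with the interpreter rather than the latest one.

```python
import re
import unicodedata

# Sample input in the layout of StandardizedVariants.txt (assumed, for illustration).
SAMPLE = """\
0030 FE00; short diagonal stroke form; # DIGIT ZERO
2018 FE00; non-fullwidth form; # LEFT SINGLE QUOTATION MARK
2018 FE01; right-justified fullwidth form; # LEFT SINGLE QUOTATION MARK
"""

# Assumption: "fullwidth form" must not be the tail of "non-fullwidth form".
_FULLWIDTH_FORM = re.compile(r"(?<![\w-])fullwidth form\b")


def is_unicode_punctuation(ch: str) -> bool:
    # Assumption: CommonMark 0.31 definition (general categories P* and S*).
    return unicodedata.category(ch)[0] in ("P", "S")


def cjk_ambiguous_punctuation_sequences(text: str) -> set[tuple[int, int]]:
    """Return (base, selector) code point pairs that satisfy all three conditions."""
    result = set()
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        code_points = fields[0].split()
        if len(fields) < 2 or len(code_points) != 2:
            continue
        base, selector = (int(cp, 16) for cp in code_points)
        if not _FULLWIDTH_FORM.search(fields[1]):
            continue
        ch = chr(base)
        if is_unicode_punctuation(ch) and unicodedata.east_asian_width(ch) == "A":
            result.add((base, selector))
    return result
```

A production implementation would more likely precompute this set from the data files for a pinned Unicode version than evaluate it at run time.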

Note

A CJK punctuation sequence is a CJK punctuation character, a CJK ambiguous punctuation sequence, or a sequence of 2 characters where the first is a CJK punctuation character and the second is a Non-emoji General-use Variation Selector.

A non-CJK punctuation sequence is a non-CJK punctuation character, or a sequence of 2 characters where the first is a non-CJK punctuation character and the second is a Non-emoji General-use Variation Selector.

Note

For the concrete ranges of each definition, see ranges.md.

6. Inlines

6.2 Emphasis and strong emphasis

Note

Bold italic text marks the parts modified from the original specification.

A left-flanking delimiter run is a delimiter run that is (1) not followed by Unicode whitespace, and either (2a) not followed by a non-CJK punctuation character or (2b) followed by a non-CJK punctuation character and preceded by (2bα) Unicode whitespace, (2bβ) a non-CJK punctuation sequence, (2bγ) a CJK sequence, or (2bδ) an Ideographic Variation Selector. For purposes of this definition, the beginning and the end of the line count as Unicode whitespace.

A right-flanking delimiter run is a delimiter run that is (1) not preceded by Unicode whitespace, and either (2a) not preceded by a non-CJK punctuation sequence, or (2b) preceded by a non-CJK punctuation sequence and followed by (2bα) Unicode whitespace, (2bβ) a non-CJK punctuation character, or (2bγ) a CJK character. For purposes of this definition, the beginning and the end of the line count as Unicode whitespace.
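The two flanking definitions above reduce to boolean logic once the neighbours of a delimiter run have been classified. The sketch below assumes that classification has already been done elsewhere; the `RunContext` type and its field names are hypothetical. Line boundaries should be reported as whitespace, as the definitions require.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunContext:
    """Pre-classified neighbours of a delimiter run (hypothetical helper type)."""
    prev_is_whitespace: bool = False
    prev_is_non_cjk_punctuation_sequence: bool = False
    prev_is_cjk_sequence: bool = False
    prev_is_ideographic_variation_selector: bool = False
    next_is_whitespace: bool = False
    next_is_non_cjk_punctuation_character: bool = False
    next_is_cjk_character: bool = False


def is_left_flanking(c: RunContext) -> bool:
    if c.next_is_whitespace:                         # (1)
        return False
    if not c.next_is_non_cjk_punctuation_character:  # (2a)
        return True
    return (                                         # (2b)
        c.prev_is_whitespace
        or c.prev_is_non_cjk_punctuation_sequence
        or c.prev_is_cjk_sequence
        or c.prev_is_ideographic_variation_selector
    )


def is_right_flanking(c: RunContext) -> bool:
    if c.prev_is_whitespace:                         # (1)
        return False
    if not c.prev_is_non_cjk_punctuation_sequence:   # (2a)
        return True
    return (                                         # (2b)
        c.next_is_whitespace
        or c.next_is_non_cjk_punctuation_character
        or c.next_is_cjk_character
    )
```

As a usage example, take the closing `**` in `**(a)**日本語`: it is preceded by `)` and followed by a CJK character, so `is_right_flanking(RunContext(prev_is_non_cjk_punctuation_sequence=True, next_is_cjk_character=True))` is true via clause (2bγ).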

Note

If the delimiter run (1) adjoins a Code Unit that is not part of an Encoded Character/Assigned Character (including Ill-Formed Code Unit Subsequences, e.g. isolated Surrogate Code Points/Units) or (2) is preceded by a Standard Variation Selector that is itself preceded by (2a) Unicode whitespace or (2b) an Ideographic Variation Selector, then whether the delimiter run is left-flanking and whether it is right-flanking are both unspecified.

2. A single _ character can open emphasis iff it is part of a left-flanking delimiter run and either (a) not part of a right-flanking delimiter run or (b) part of a right-flanking delimiter run preceded by a Unicode punctuation sequence.

6. A double __ can open strong emphasis iff it is part of a left-flanking delimiter run and either (a) not part of a right-flanking delimiter run or (b) part of a right-flanking delimiter run preceded by a Unicode punctuation sequence.
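Rules 2 and 6 share the same condition, so a single predicate can serve both `_` and `__`. A minimal sketch, assuming the flanking results and the "preceded by a Unicode punctuation sequence" check are computed elsewhere (the function and parameter names are illustrative):

```python
def underscore_can_open(
    left_flanking: bool,
    right_flanking: bool,
    preceded_by_unicode_punctuation_sequence: bool,
) -> bool:
    """Condition shared by rule 2 (`_`) and rule 6 (`__`)."""
    return left_flanking and (
        not right_flanking                           # (a)
        or preceded_by_unicode_punctuation_sequence  # (b)
    )
```

The third argument only matters when the run is both left-flanking and right-flanking.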

Tips for Implementers

See implementers-tips.md.

Unicode data list

| Data name | Latest | Unicode 17 |
| --- | --- | --- |
| East Asian Width | https://www.unicode.org/Public/UCD/latest/ucd/EastAsianWidth.txt | https://www.unicode.org/Public/17.0.0/ucd/EastAsianWidth.txt |
| Script | https://www.unicode.org/Public/UCD/latest/ucd/Scripts.txt | https://www.unicode.org/Public/17.0.0/ucd/Scripts.txt |
| Block | https://www.unicode.org/Public/UCD/latest/ucd/Blocks.txt | https://www.unicode.org/Public/17.0.0/ucd/Blocks.txt |
| Characters followed by Non-emoji General-use Variation Selector | https://www.unicode.org/Public/UCD/latest/ucd/StandardizedVariants.txt | https://www.unicode.org/Public/17.0.0/ucd/StandardizedVariants.txt |
| Default emoji presentation characters | https://www.unicode.org/Public/UCD/latest/ucd/emoji/emoji-data.txt | https://www.unicode.org/Public/17.0.0/ucd/emoji/emoji-data.txt |
| Characters followed by U+FE0E/U+FE0F | https://unicode.org/Public/UCD/latest/ucd/emoji/emoji-variation-sequences.txt | https://unicode.org/Public/17.0.0/ucd/emoji/emoji-variation-sequences.txt |
| Fully-qualified Emojis (without ZWJ) | https://unicode.org/Public/emoji/latest/emoji-sequences.txt | https://unicode.org/Public/17.0.0/emoji/emoji-sequences.txt |
| Emoji qualification test | https://unicode.org/Public/emoji/latest/emoji-test.txt | https://unicode.org/Public/17.0.0/emoji/emoji-test.txt |