Clarification of `LineColumn::column` encoding #762

saecki · 2026-03-06T14:23:42Z

The docs of LineColumn::column say:

The 1-based column number of the character.

This made me assume it was the character-based offset + 1 (essentially as if encoded in UTF-32).
But if a Sourcepos spans characters whose UTF-8 encoding is longer than 1 byte, such as ö or 介, this doesn't hold true.

So what encoding is LineColumn::column using? If this isn't already documented somewhere, it would be great if the documentation mentioned it.

kivikakk · 2026-03-07T00:08:32Z

Maintainer

Indeed, it's UTF-8, which is a bit of a Yikes™, but compatible with the closest thing we have to an upstream:

$ echo '好' | ./cmark --sourcepos
<p data-sourcepos="1:1-1:3">好</p>

PR to correct the docs happily accepted.

3 replies

saecki Mar 9, 2026
Author

I honestly think using UTF-8 is fine. But the output is a bit of a head-scratcher (at least to me). I would've expected all column numbers to be offset by 1. Since the 好 character is 3 bytes long, I would have expected 1:1-1:4, or 1:1-1:1 if the range is end-inclusive. This just seems like a bug where 1 is naively subtracted from the column of Sourcepos::end, assuming it's an ASCII character.

Some Rust code:

fn main() {
    dbg!("好".len());
    dbg!('好'.len_utf8());
}

Output:

[src/main.rs:2:5] "好".len() = 3
[src/main.rs:3:5] '好'.len_utf8() = 3

kivikakk Mar 9, 2026
Maintainer

It's 1:1-1:3 because the range is end-inclusive: it covers (1-based) bytes 1 through 3. Does that make sense? In the same way that just `a` has sourcepos 1:1-1:1 (bytes 1 through 1, 1-based), 好 has sourcepos 1:1-1:3 (bytes 1 through 3, 1-based).
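
To make the end-inclusive convention concrete, here is a minimal sketch that computes 1-based, end-inclusive byte columns for a substring of a single line. The `byte_columns` helper is hypothetical and written only for illustration; it is not part of the library's API.

```rust
/// Hypothetical helper (not a library function): returns the 1-based,
/// end-inclusive byte columns of the first occurrence of `needle` in `line`.
fn byte_columns(line: &str, needle: &str) -> Option<(usize, usize)> {
    // An empty needle covers no bytes, so an inclusive range cannot describe it.
    if needle.is_empty() {
        return None;
    }
    // `find` returns a 0-based byte offset.
    let start = line.find(needle)?;
    // First byte: 0-based offset + 1. Last byte: 0-based offset + length in bytes.
    Some((start + 1, start + needle.len()))
}

fn main() {
    // `a` is 1 byte long: it covers bytes 1 through 1.
    assert_eq!(byte_columns("a", "a"), Some((1, 1)));
    // 好 is 3 bytes long: it covers bytes 1 through 3.
    assert_eq!(byte_columns("好", "好"), Some((1, 3)));
    // After `a` (1 byte) and ö (2 bytes), 介 covers bytes 4 through 6.
    assert_eq!(byte_columns("aö介", "介"), Some((4, 6)));
}
```

The end column is the column of the last byte of the range, not of the byte after it, which is why a 3-byte character starting at column 1 ends at column 3.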

saecki Mar 9, 2026
Author

Oh, I see. So it is indeed always subtracting 1 (byte) from the end column offset, which essentially makes it a 0-based end-exclusive offset.
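
A small sketch of that observation, assuming the columns are 1-based, end-inclusive byte positions as described above: the start column minus 1 and the unchanged end column form a 0-based, end-exclusive byte range that can be used to slice the line. The `to_byte_range` helper is hypothetical, not a library function.

```rust
use std::ops::Range;

/// Hypothetical helper: converts 1-based, end-inclusive byte columns into a
/// 0-based, end-exclusive byte range.
fn to_byte_range(start_col: usize, end_col: usize) -> Option<Range<usize>> {
    // Column 0 does not exist in a 1-based scheme.
    let start = start_col.checked_sub(1)?;
    // A 1-based inclusive end has the same value as a 0-based exclusive end.
    Some(start..end_col)
}

fn main() {
    let line = "a好b";
    // 好 occupies bytes 2 through 4 (1-based, inclusive).
    let range = to_byte_range(2, 4).unwrap();
    assert_eq!(range, 1..4);
    // `str::get` returns None instead of panicking on invalid boundaries.
    assert_eq!(line.get(range), Some("好"));
    // A range ending in the middle of 好 is not a valid string slice.
    assert_eq!(line.get(to_byte_range(2, 3).unwrap()), None);
}
```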

Martin005 · 2026-03-09T09:35:06Z

@saecki @kivikakk PR correcting the docs: #764 🙂

1 reply

kivikakk Mar 9, 2026
Maintainer

Thanks so much! :D

Martin005 · 2026-03-24T10:11:55Z

@saecki I created issue #777 to add a new parsing option that will transform the UTF-8-based columns into Unicode character-based columns, and I've already created an implementation for it in #779. Feel free to take a look 🙂
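
For readers who want the same mapping by hand, here is a rough sketch of converting a 1-based byte column into a 1-based character column. It assumes "character" means a Unicode scalar value (a Rust `char`), not a grapheme cluster, and the `char_column` helper is hypothetical; the option implemented in #779 may behave differently.

```rust
/// Hypothetical helper: maps a 1-based byte column to the 1-based column of
/// the character (Unicode scalar value) that contains that byte.
fn char_column(line: &str, byte_col: usize) -> Option<usize> {
    // Convert to a 0-based byte offset; column 0 is invalid.
    let offset = byte_col.checked_sub(1)?;
    line.char_indices()
        // Find the character whose byte span contains the offset.
        .position(|(start, ch)| (start..start + ch.len_utf8()).contains(&offset))
        // `position` is 0-based; columns are 1-based.
        .map(|index| index + 1)
}

fn main() {
    let line = "a好b";
    // Byte columns 2 through 4 all belong to 好, the second character.
    assert_eq!(char_column(line, 2), Some(2));
    assert_eq!(char_column(line, 4), Some(2));
    // `b` sits at byte column 5 but character column 3.
    assert_eq!(char_column(line, 5), Some(3));
    // Columns outside the line have no character.
    assert_eq!(char_column(line, 6), None);
}
```

Under this mapping, the byte-based range 1:2-1:4 for 好 in the line `a好b` would become the character-based range 1:2-1:2.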

1 reply

kivikakk Mar 29, 2026
Maintainer

This has now been merged and will be part of the next release :)
