Summary
Arturo uses this fork of linenoise in its REPL, and I ran into a serious issue with UTF-8 input.
Current Behavior
When entering multi-byte characters (e.g. Chinese or Korean), the input gets corrupted during editing.
Example:
Input: a你b
Output: a`b
The middle character (你) is not just displayed incorrectly — it is actually lost and replaced, which suggests the UTF-8 sequence is being broken.
Analysis
This appears to happen because input is processed byte-by-byte instead of as UTF-8 codepoints.
For example, the character 你 (U+4F60) is encoded in UTF-8 as the three bytes E4 BD A0.
But the implementation seems to treat each byte as an individual character, causing:
- partial reads of multi-byte sequences
- invalid character reconstruction
- fallback to incorrect ASCII characters (like '`')
So this is not just a rendering issue — it is data corruption during input handling.
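To make the byte-versus-codepoint distinction concrete, here is a minimal C sketch. The helpers `utf8_seq_len` and `utf8_decode` are hypothetical names, not part of linenoise; they only show how the three bytes E4 BD A0 combine into the single codepoint U+4F60. One further observation, offered as a hypothesis rather than a diagnosis: the low 8 bits of U+4F60 are 0x60, the ASCII backtick, so the symptom would also be consistent with a decoded codepoint being narrowed to a single `char` somewhere in the input path.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: length of a UTF-8 sequence given its lead byte,
 * or 0 if the byte cannot start a sequence. */
static size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80) return 1;            /* 0xxxxxxx: ASCII */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;                             /* continuation or invalid byte */
}

/* Hypothetical helper: decode one codepoint from s (n bytes available).
 * Returns the number of bytes consumed, or 0 if the sequence is
 * truncated or malformed. Simplified: overlong forms and surrogates
 * are not rejected. */
static size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    if (n == 0) return 0;
    size_t len = utf8_seq_len(s[0]);
    if (len == 0 || len > n) return 0;
    if (len == 1) { *cp = s[0]; return 1; }
    uint32_t v = s[0] & (0xFFu >> (len + 1));  /* payload bits of lead byte */
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;   /* not a continuation byte */
        v = (v << 6) | (s[i] & 0x3Fu);
    }
    *cp = v;
    return len;
}
```

Given the bytes E4 BD A0, `utf8_decode` consumes 3 bytes and yields 0x4F60. Casting that value to `unsigned char` leaves 0x60, which matches the backtick in the corrupted output.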
Expected Behavior
Proper UTF-8 handling should:
- read full codepoints (not individual bytes)
- handle cursor movement based on characters, not bytes
- avoid breaking multi-byte sequences
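The cursor-movement and "do not split sequences" requirements above can be sketched in C as follows. This is an illustration under assumptions, not linenoise's actual code: `utf8_prev`, `utf8_next`, and `utf8_delete_before` are hypothetical helpers, and the buffer is assumed to hold valid, NUL-terminated UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* True for UTF-8 continuation bytes (10xxxxxx). */
static int is_cont(unsigned char b) { return (b & 0xC0) == 0x80; }

/* Hypothetical: byte offset of the character boundary before pos. */
static size_t utf8_prev(const char *buf, size_t pos)
{
    if (pos == 0) return 0;
    pos--;
    while (pos > 0 && is_cont((unsigned char)buf[pos])) pos--;
    return pos;
}

/* Hypothetical: byte offset of the character boundary after pos. */
static size_t utf8_next(const char *buf, size_t len, size_t pos)
{
    if (pos >= len) return len;
    pos++;
    while (pos < len && is_cont((unsigned char)buf[pos])) pos++;
    return pos;
}

/* Hypothetical backspace: delete the whole character before *pos.
 * buf holds *len bytes followed by a terminating NUL. */
static void utf8_delete_before(char *buf, size_t *len, size_t *pos)
{
    size_t start = utf8_prev(buf, *pos);
    size_t removed = *pos - start;
    /* Shift the tail left, including the NUL terminator. */
    memmove(buf + start, buf + *pos, *len - *pos + 1);
    *len -= removed;
    *pos = start;
}
```

With the buffer "a你b" (5 bytes) and the cursor at byte offset 4, `utf8_delete_before` removes all three bytes of 你 and leaves "ab" with the cursor at offset 1. Display width is a separate concern that this sketch does not cover: CJK characters usually occupy two terminal columns.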
Steps To Reproduce
1. Start the REPL.
2. Type "a你b".
3. The line becomes "a`b".
OS
all
Version
all
Anything else?
At the moment, this makes the library unusable in UTF-8 environments (which are standard today).
It might be worth:
- clearly documenting that UTF-8 input is not fully supported
- considering switching to / referencing a more complete implementation
I discussed this with an AI assistant, which concluded that the issue most likely lies in the linenoise library itself: it appears to split characters into individual bytes. It recommended looking at alternative C line-editing libraries as replacements.
Thanks!