Skip to content

[Merged by Bors] - feat(Tactic/Linter): lint unwanted unicode#36773

Closed
adomasbaliuka wants to merge 30 commits intoleanprover-community:masterfrom
adomasbaliuka:unicode_linter
Closed

[Merged by Bors] - feat(Tactic/Linter): lint unwanted unicode#36773
adomasbaliuka wants to merge 30 commits intoleanprover-community:masterfrom
adomasbaliuka:unicode_linter

Conversation

@adomasbaliuka
Copy link
Copy Markdown
Collaborator

@adomasbaliuka adomasbaliuka commented Mar 17, 2026

Extends the text-based style linter that checks all unicode characters. Provides automatic replacements for some disallowed characters.

Unicode is very versatile and useful for Lean and Mathlib.
However, it is also very complex and few people have a thorough understanding of all its pitfalls (I don't claim to be one of them).
In order to avoid unpleasant surprises going forward, both accidental and malicious, we should keep track of which Unicode characters are allowed in Mathlib.

In programming and cybersecurity, there are many known issues and attacks concerning unicode.
Many open source repositories have been hit by such attacks, which are becoming ever more frequent due to the use of automation using e.g. large language models.
Some notable ones:

  • homograph attacks: confusion caused by use of distinct characters which look the same (we probably don't want to fully address this and this PR does not attempt to)
  • Trojan Source: abuse of bidirectional characters. Characters used by languages with right-to-left reading direction can cause code to be displayed differently than it is parsed. (This PR tries to address this)
  • Exploits involving Private Use Area characters, e.g. GlassWorm. (This PR tries to address this)

See also unicode code source handling and Programming with Unicode for further details and guidelines.

Co-authored-by: Michael Rothgang @grunweg
Co-authored-by: Jon Eugster @joneugster


Continues work from #16215 (due to PRs now being made from forks).

Discussed at Zulip

Note: the script was added due to reviewer comment in #16215. Perhaps that is overkill here.

Open in Gitpod

@github-actions
Copy link
Copy Markdown

github-actions bot commented Mar 17, 2026

PR summary 10fc8c0fdd

Import changes for modified files

No significant changes to the import graph

Import changes for all files
Files Import difference

Declarations diff

+ ASCII.allowed
+ ASCII.allowedWhitespace
+ ASCII.printable
+ Char.allowedNonAscii
+ Char.isAscii
+ Char.manuallyExcluded
+ allLinterDefinedCharacterLists
+ isPrivateUseAreaChar
+ main
+ othersInMathlib
+ reprCharsChunked
+ withVSCodeAbbrev

You can run this locally as follows
## summary with just the declaration names:
./scripts/pr_summary/declarations_diff.sh <optional_commit>

## more verbose report:
./scripts/pr_summary/declarations_diff.sh long <optional_commit>

The doc-module for scripts/pr_summary/declarations_diff.sh contains some details about this script.


No changes to technical debt.

You can run this locally as

./scripts/reporting/technical-debt-metrics.sh pr_summary
  • The relative value is the weighted sum of the differences with weight given by the inverse of the current value of the statistic.
  • The absolute value is the relative value divided by the total sum of the inverses of the current values (i.e. the weighted average of the differences).

⚠️ Scripts folder reminder

This PR adds files under scripts/.
Please consider whether each added script belongs in this repository or in leanprover-community/mathlib-ci.

A script belongs in mathlib-ci if it is a CI automation script that interacts with GitHub (e.g. managing labels, posting comments, triggering bots), runs from a trusted external checkout in CI, or requires access to secrets.

A script belongs in this repository (scripts/) if it is a developer or maintainer tool to be run locally, a code maintenance or analysis utility, a style linting tool, or a data file used by the library's own linters.

See the mathlib-ci README for more details.

Added scripts files:

  • scripts/extract-unique-nonascii.lean

@adomasbaliuka
Copy link
Copy Markdown
Collaborator Author

adomasbaliuka commented Mar 17, 2026

The remaining error is

error: Mathlib/RingTheory/Spectrum/Prime/Polynomial.lean:28: This line contains a bad unicode character '̲' (U+0332).

where I'm not sure what to do.
It is used in a comment to underline a variable name, indicating that it refers to multiple variables of a polynomial.
Either we allow that character, or find another notation for that place which also makes sense.

Another question @joneugster is whether we want to enable text-variant-selectors for some characters, such as the errors. I put a TODO in the code at that point.

@adomasbaliuka adomasbaliuka added the t-linter Linter label Mar 17, 2026
puts functions unused in the file itself (only imported from here) at the bottom
Co-authored-by: Violeta Hernández Palacios <vi.hdz.p@gmail.com>
@adomasbaliuka adomasbaliuka added blocked-by-other-PR This PR depends on another PR (this label is automatically managed by a bot) and removed blocked-by-other-PR This PR depends on another PR (this label is automatically managed by a bot) labels Mar 18, 2026
@mathlib-dependent-issues mathlib-dependent-issues bot added blocked-by-other-PR This PR depends on another PR (this label is automatically managed by a bot) and removed blocked-by-other-PR This PR depends on another PR (this label is automatically managed by a bot) labels Mar 18, 2026
@mathlib-dependent-issues
Copy link
Copy Markdown

This PR/issue depends on:

@mathlib-merge-conflicts
Copy link
Copy Markdown

This pull request has conflicts, please merge master and resolve them.

@mathlib-merge-conflicts mathlib-merge-conflicts bot added the merge-conflict The PR has a merge conflict with master, and needs manual merging. (this label is managed by a bot) label Mar 21, 2026
I propose to disallow this character. I rewrote the module docstring while presenting the relevant type more explicitly than the underline which was used to signify a multivariate polynomial. This notation does not seem to have been standard in Mathlib, given that this was the only occurrance of the character.
@github-actions github-actions bot removed the merge-conflict The PR has a merge conflict with master, and needs manual merging. (this label is managed by a bot) label Mar 22, 2026
@adomasbaliuka
Copy link
Copy Markdown
Collaborator Author

adomasbaliuka commented Mar 22, 2026

Just as an example of what this PR is supposed to help avoid:

While updating the proof for disallowed_of_replaceable, I came across a misprinting of a goal involving U+202B:

tmp

Copy link
Copy Markdown
Contributor

@joneugster joneugster left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

Overall, with the script to autogenerate the list, a check to remove redundant symbols from the hand-curated list (see comment), and some information to the user how to add new symbols (see comment) I think this is a very reasonable addition!

I've validated that the script works and outputs the list included in this PR!

@joneugster joneugster added the awaiting-author A reviewer has asked the author a question or requested changes. label Mar 27, 2026
The character was used in a comment, seemingly by accident. Also removes it from the allow-list for the linter.
@adomasbaliuka
Copy link
Copy Markdown
Collaborator Author

I don't understand the CI failure...

@bryangingechen
Copy link
Copy Markdown
Contributor

I think you just have to merge master; we just merged a PR adding a new CI job and a fix to a broken check in previous master.

@grunweg
Copy link
Copy Markdown
Contributor

grunweg commented Apr 11, 2026

The webpage PR looks good to me, as does the proposed sequence of events.

@adomasbaliuka
Copy link
Copy Markdown
Collaborator Author

adomasbaliuka commented Apr 12, 2026

This is on the "maintainer merge" queue with some more changes having happened after it was put on the queue.

In my view, these additional changes are done and this PR can now be merged.

Copy link
Copy Markdown
Contributor

@Vierkantor Vierkantor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bors d=@grunweg

@mathlib-bors
Copy link
Copy Markdown
Contributor

mathlib-bors bot commented Apr 13, 2026

✌️ grunweg can now approve this pull request. To approve and merge a pull request, simply reply with bors r+. More detailed instructions are available here.

@mathlib-triage mathlib-triage bot added delegated This pull request has been delegated to the PR author (or occasionally another non-maintainer). and removed maintainer-merge A reviewer has approved the changed; awaiting maintainer approval. labels Apr 13, 2026
adomasbaliuka and others added 4 commits April 13, 2026 14:04
Co-authored-by: Anne Baanen <Vierkantor@users.noreply.github.com>
Earlier commit changed the error message for the unicode linter. The offending character is extracted by a hard-coded word position index (yes this is not elegant but there is no obvious alternative). This index had not been changed when changing the error message which caused tests to fail. That change is done in this commit which should make the tests pass. Also, some comments and redundant options are cleaned up in the tests file.
Copy link
Copy Markdown
Contributor

@grunweg grunweg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your perseverance - let's get this merged!
bors merge

@mathlib-triage mathlib-triage bot added the ready-to-merge This PR has been sent to bors. label Apr 13, 2026
mathlib-bors bot pushed a commit that referenced this pull request Apr 13, 2026
Extends the text-based style linter that checks all unicode characters. Provides automatic replacements for some disallowed characters.

Unicode is very versatile and useful for Lean and Mathlib.
However, it is also very complex and few people have a thorough understanding of all its pitfalls (I don't claim to be one of them).
In order to avoid unpleasant surprises going forward, both accidental and malicious, we should keep track of which Unicode characters are allowed in Mathlib.

In programming and cybersecurity, there are many known [issues and attacks concerning unicode](https://en.wikipedia.org/wiki/Unicode#Security_issues).
Many open source repositories have been hit by such attacks, which are becoming ever more frequent due to the use of automation using e.g. large language models.
Some notable ones:

- [homograph attacks](https://en.wikipedia.org/wiki/IDN_homograph_attack): confusion caused by use of distinct characters which look the same (we probably don't want to fully address this and this PR does not attempt to)
- [Trojan Source](https://en.wikipedia.org/wiki/Trojan_Source):  abuse of bidirectional characters. Characters used by languages with right-to-left reading direction can cause code to be displayed differently than it is parsed. (This PR tries to address this)
-  Exploits involving Private Use Area characters, e.g. [GlassWorm](https://abit.ee/en/cybersecurity/viruses-trojans-and-other-malware/glassworm-github-supply-chain-attack-unicode-solana-malware-npm-vs-code-2026-en). (This PR tries to address this)

See also [unicode code source handling](https://www.unicode.org/reports/tr55/) and [Programming with Unicode](https://unicodebook.readthedocs.io/nightmare.html) for further details and guidelines.

Co-authored-by: Michael Rothgang @grunweg 
Co-authored-by: Jon Eugster @joneugster
@mathlib-bors
Copy link
Copy Markdown
Contributor

mathlib-bors bot commented Apr 13, 2026

Build failed (retrying...):

mathlib-bors bot pushed a commit that referenced this pull request Apr 13, 2026
Extends the text-based style linter that checks all unicode characters. Provides automatic replacements for some disallowed characters.

Unicode is very versatile and useful for Lean and Mathlib.
However, it is also very complex and few people have a thorough understanding of all its pitfalls (I don't claim to be one of them).
In order to avoid unpleasant surprises going forward, both accidental and malicious, we should keep track of which Unicode characters are allowed in Mathlib.

In programming and cybersecurity, there are many known [issues and attacks concerning unicode](https://en.wikipedia.org/wiki/Unicode#Security_issues).
Many open source repositories have been hit by such attacks, which are becoming ever more frequent due to the use of automation using e.g. large language models.
Some notable ones:

- [homograph attacks](https://en.wikipedia.org/wiki/IDN_homograph_attack): confusion caused by use of distinct characters which look the same (we probably don't want to fully address this and this PR does not attempt to)
- [Trojan Source](https://en.wikipedia.org/wiki/Trojan_Source):  abuse of bidirectional characters. Characters used by languages with right-to-left reading direction can cause code to be displayed differently than it is parsed. (This PR tries to address this)
-  Exploits involving Private Use Area characters, e.g. [GlassWorm](https://abit.ee/en/cybersecurity/viruses-trojans-and-other-malware/glassworm-github-supply-chain-attack-unicode-solana-malware-npm-vs-code-2026-en). (This PR tries to address this)

See also [unicode code source handling](https://www.unicode.org/reports/tr55/) and [Programming with Unicode](https://unicodebook.readthedocs.io/nightmare.html) for further details and guidelines.

Co-authored-by: Michael Rothgang @grunweg 
Co-authored-by: Jon Eugster @joneugster
@mathlib-bors
Copy link
Copy Markdown
Contributor

mathlib-bors bot commented Apr 13, 2026

Build failed:

Fix if necessary, and then someone with permission can run bors r+ or bors retry.

@grunweg
Copy link
Copy Markdown
Contributor

grunweg commented Apr 13, 2026

Once more with feeling:
bors merge

mathlib-bors bot pushed a commit that referenced this pull request Apr 13, 2026
Extends the text-based style linter that checks all unicode characters. Provides automatic replacements for some disallowed characters.

Unicode is very versatile and useful for Lean and Mathlib.
However, it is also very complex and few people have a thorough understanding of all its pitfalls (I don't claim to be one of them).
In order to avoid unpleasant surprises going forward, both accidental and malicious, we should keep track of which Unicode characters are allowed in Mathlib.

In programming and cybersecurity, there are many known [issues and attacks concerning unicode](https://en.wikipedia.org/wiki/Unicode#Security_issues).
Many open source repositories have been hit by such attacks, which are becoming ever more frequent due to the use of automation using e.g. large language models.
Some notable ones:

- [homograph attacks](https://en.wikipedia.org/wiki/IDN_homograph_attack): confusion caused by use of distinct characters which look the same (we probably don't want to fully address this and this PR does not attempt to)
- [Trojan Source](https://en.wikipedia.org/wiki/Trojan_Source):  abuse of bidirectional characters. Characters used by languages with right-to-left reading direction can cause code to be displayed differently than it is parsed. (This PR tries to address this)
-  Exploits involving Private Use Area characters, e.g. [GlassWorm](https://abit.ee/en/cybersecurity/viruses-trojans-and-other-malware/glassworm-github-supply-chain-attack-unicode-solana-malware-npm-vs-code-2026-en). (This PR tries to address this)

See also [unicode code source handling](https://www.unicode.org/reports/tr55/) and [Programming with Unicode](https://unicodebook.readthedocs.io/nightmare.html) for further details and guidelines.

Co-authored-by: Michael Rothgang @grunweg 
Co-authored-by: Jon Eugster @joneugster
@mathlib-bors
Copy link
Copy Markdown
Contributor

mathlib-bors bot commented Apr 13, 2026

Pull request successfully merged into master.

Build succeeded:

@mathlib-bors mathlib-bors bot changed the title feat(Tactic/Linter): lint unwanted unicode [Merged by Bors] - feat(Tactic/Linter): lint unwanted unicode Apr 13, 2026
@mathlib-bors mathlib-bors bot closed this Apr 13, 2026
@adomasbaliuka adomasbaliuka deleted the unicode_linter branch April 13, 2026 18:10
@grunweg
Copy link
Copy Markdown
Contributor

grunweg commented Apr 13, 2026

I just merged the webpage PR: I'll happily merge a follow-up PR tomorrow linking to the style guide.

mathlib-bors bot pushed a commit that referenced this pull request Apr 14, 2026
As planned in #36773, we add a link to the style guide, which was updated at leanprover-community/leanprover-community.github.io#820.

Further changes:
- some more documentation updates (some obsolete comments were talking about a "blocklist" which no longer exists)
-  change order of checks in `isAllowedCharacter`. The new order is more logical and may slightly improve performance (which probably doesn't matter) since Mathlib has more characters in `otherInMathlib` than `emojis`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

delegated This pull request has been delegated to the PR author (or occasionally another non-maintainer). ready-to-merge This PR has been sent to bors. t-linter Linter

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants