[Merged by Bors] - feat(Tactic/Linter): lint unwanted unicode by adomasbaliuka · Pull Request #36773 · leanprover-community/mathlib4

adomasbaliuka · 2026-03-17T15:57:31Z

Extends the text-based style linter so that it checks all unicode characters against an allowlist. Provides automatic replacements for some disallowed characters.
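As a rough illustration of the idea (allowlist check plus automatic replacement), here is a minimal Python sketch. It is not the Lean implementation from this PR: the allowlist, the replacement table, and the function names are all hypothetical and chosen only to keep the example short.

```python
# Illustrative sketch only: these tables are hypothetical and are NOT
# the lists used by the Mathlib linter.
ALLOWED_NON_ASCII = {"→", "∀", "λ", "ℕ"}        # hypothetical allowlist
REPLACEMENTS = {"\u00a0": " ", "\u2013": "-"}    # hypothetical auto-fixes


def is_allowed(ch):
    """A character passes if it is a newline, printable ASCII, or allowlisted."""
    return ch == "\n" or " " <= ch <= "~" or ch in ALLOWED_NON_ASCII


def lint_line(line):
    """Apply the replacements, then report every character still not allowed."""
    fixed = "".join(REPLACEMENTS.get(ch, ch) for ch in line)
    errors = [
        f"bad unicode character {ch!r} (U+{ord(ch):04X})"
        for ch in fixed
        if not is_allowed(ch)
    ]
    return fixed, errors


# A no-break space (replaceable) and U+202B (not replaceable, not allowed).
fixed, errors = lint_line("∀ n : ℕ,\u00a0n = n \u202b")
print(repr(fixed))
print(errors)
```

In this sketch the no-break space is silently replaced, while U+202B has no replacement and is reported with its code point.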

Unicode is very versatile and useful for Lean and Mathlib.
However, it is also very complex and few people have a thorough understanding of all its pitfalls (I don't claim to be one of them).
In order to avoid unpleasant surprises going forward, both accidental and malicious, we should keep track of which Unicode characters are allowed in Mathlib.

In programming and cybersecurity, there are many known [issues and attacks concerning unicode](https://en.wikipedia.org/wiki/Unicode#Security_issues).
Many open source repositories have been hit by such attacks, which are becoming ever more frequent due to automation, e.g. with large language models.
Some notable ones:

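- [Homograph attacks](https://en.wikipedia.org/wiki/IDN_homograph_attack): confusion caused by distinct characters that look the same. (We probably don't want to fully address this, and this PR does not attempt to.)
- [Trojan Source](https://en.wikipedia.org/wiki/Trojan_Source): abuse of bidirectional characters. Characters used by languages with right-to-left reading direction can cause code to be displayed differently than it is parsed. (This PR tries to address this.)
- Exploits involving Private Use Area characters, e.g. [GlassWorm](https://abit.ee/en/cybersecurity/viruses-trojans-and-other-malware/glassworm-github-supply-chain-attack-unicode-solana-malware-npm-vs-code-2026-en). (This PR tries to address this.)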
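
The last two categories can be detected mechanically. A minimal Python sketch, assuming that "bidirectional characters" means the explicit embedding, override, and isolate controls and that "Private Use Area" means general category `Co`; the function names are hypothetical and this is not the Lean code from the PR:

```python
import unicodedata

# Bidi classes of the explicit embedding, override and isolate controls
# (for example U+202B, U+202E, U+2066, U+2069).
BIDI_CONTROL_CLASSES = {"LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}


def classify(ch):
    """Return a short label for a suspicious character, or None."""
    if unicodedata.bidirectional(ch) in BIDI_CONTROL_CLASSES:
        return "bidi control"
    if unicodedata.category(ch) == "Co":  # Private Use Area code points
        return "private use"
    return None


def find_suspicious(text):
    """List (index, code point, label) for every suspicious character."""
    hits = []
    for index, ch in enumerate(text):
        kind = classify(ch)
        if kind is not None:
            hits.append((index, f"U+{ord(ch):04X}", kind))
    return hits


# A line in the style of the Trojan Source examples: the controls are
# invisible in most editors but are present in the source text.
line = 'if access_level != "user\u202e \u2066// admin only\u2069 \u2066" {'
for hit in find_suspicious(line):
    print(hit)
```

Marks such as U+200F are not controls of this kind and are not flagged by the sketch; an allowlist covers them without needing a separate rule.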

See also [Unicode source code handling](https://www.unicode.org/reports/tr55/) and [Programming with Unicode](https://unicodebook.readthedocs.io/nightmare.html) for further details and guidelines.

Co-authored-by: Michael Rothgang @grunweg
Co-authored-by: Jon Eugster @joneugster

Continues work from #16215 (due to PRs now being made from forks).

Discussed at Zulip

Note: the script was added in response to a reviewer comment in #16215. Perhaps that is overkill here.

depends on: [Merged by Bors] - feat(Tactic/Linter): Add more unicode replacements #36811

github-actions · 2026-03-17T16:00:27Z

PR summary 10fc8c0fdd

Import changes for modified files

No significant changes to the import graph

Import changes for all files

Files	Import difference

Declarations diff

+ ASCII.allowed
+ ASCII.allowedWhitespace
+ ASCII.printable
+ Char.allowedNonAscii
+ Char.isAscii
+ Char.manuallyExcluded
+ allLinterDefinedCharacterLists
+ isPrivateUseAreaChar
+ main
+ othersInMathlib
+ reprCharsChunked
+ withVSCodeAbbrev

You can run this locally as follows

## summary with just the declaration names:
./scripts/pr_summary/declarations_diff.sh <optional_commit>

## more verbose report:
./scripts/pr_summary/declarations_diff.sh long <optional_commit>

The doc-module for scripts/pr_summary/declarations_diff.sh contains some details about this script.

No changes to technical debt.

You can run this locally as

./scripts/reporting/technical-debt-metrics.sh pr_summary

The relative value is the weighted sum of the differences with weight given by the inverse of the current value of the statistic.
The absolute value is the relative value divided by the total sum of the inverses of the current values (i.e. the weighted average of the differences).
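
The two definitions above can be read as a small computation. A minimal Python sketch with made-up numbers; the function name and the sample statistics are hypothetical and not taken from `technical-debt-metrics.sh`, and all current values are assumed to be nonzero:

```python
from fractions import Fraction


def weighted_change(current, diffs):
    """Relative and absolute change as described above.

    `current` holds the current value of each statistic, `diffs` the
    change in each statistic introduced by the PR.
    """
    weights = [Fraction(1, c) for c in current]            # inverse of current value
    relative = sum(w * d for w, d in zip(weights, diffs))  # weighted sum of differences
    absolute = relative / sum(weights)                     # weighted average of differences
    return relative, absolute


# Two hypothetical statistics with current values 4 and 2, changed by +2 and +1.
relative, absolute = weighted_change([4, 2], [2, 1])
print(relative, absolute)
```

With these numbers the relative value is 2/4 + 1/2 = 1 and the absolute value is 1 / (1/4 + 1/2) = 4/3.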

⚠️ Scripts folder reminder

This PR adds files under scripts/.
Please consider whether each added script belongs in this repository or in leanprover-community/mathlib-ci.

A script belongs in mathlib-ci if it is a CI automation script that interacts with GitHub (e.g. managing labels, posting comments, triggering bots), runs from a trusted external checkout in CI, or requires access to secrets.

A script belongs in this repository (scripts/) if it is a developer or maintainer tool to be run locally, a code maintenance or analysis utility, a style linting tool, or a data file used by the library's own linters.

See the mathlib-ci README for more details.

Added scripts files:

scripts/extract-unique-nonascii.lean

adomasbaliuka · 2026-03-17T16:10:27Z

The remaining error is

error: Mathlib/RingTheory/Spectrum/Prime/Polynomial.lean:28: This line contains a bad unicode character '̲' (U+0332).

where I'm not sure what to do.
It is used in a comment to underline a variable name, indicating that it refers to multiple variables of a polynomial.
Either we allow that character, or find another notation for that place which also makes sense.

Another question @joneugster is whether we want to enable text-variant-selectors for some characters, such as the errors. I put a TODO in the code at that point.

mathlib-dependent-issues · 2026-03-21T07:53:31Z

This PR/issue depends on:

~~[Merged by Bors] - feat(Tactic/Linter): Add more unicode replacements #36811~~
By Dependent Issues (🤖). Happy coding!

mathlib-merge-conflicts · 2026-03-21T08:38:08Z

This pull request has conflicts, please merge master and resolve them.

I propose to disallow this character. I rewrote the module docstring to present the relevant type more explicitly, instead of the underline that was used to signify a multivariate polynomial. This notation does not seem to have been standard in Mathlib, given that this was the only occurrence of the character.

adomasbaliuka · 2026-03-22T22:00:26Z

Just as an example of what this PR is supposed to help avoid:

While updating the proof for disallowed_of_replaceable, I came across a misprinting of a goal involving U+202B:

joneugster

Thank you!

Overall, with the script to autogenerate the list, a check to remove redundant symbols from the hand-curated list (see comment), and some information for the user on how to add new symbols (see comment), I think this is a very reasonable addition!

I've validated that the script works and outputs the list included in this PR!

adomasbaliuka · 2026-04-10T16:07:04Z

I don't understand the CI failure...

bryangingechen · 2026-04-10T16:13:44Z

I think you just have to merge master; we just merged a PR adding a new CI job and a fix to a broken check in previous master.

grunweg · 2026-04-11T16:07:04Z

The webpage PR looks good to me, as does the proposed sequence of events.

adomasbaliuka · 2026-04-12T19:40:08Z

This is on the "maintainer merge" queue with some more changes having happened after it was put on the queue.

In my view, these additional changes are done and this PR can now be merged.

Vierkantor

bors d=@grunweg

mathlib-bors · 2026-04-13T11:32:36Z

✌️ grunweg can now approve this pull request. To approve and merge a pull request, simply reply with bors r+. More detailed instructions are available here.

An earlier commit changed the error message for the unicode linter. The tests extract the offending character from that message by a hard-coded word position index (this is not elegant, but there is no obvious alternative). The index had not been updated when the error message changed, which caused tests to fail. This commit updates the index, which should make the tests pass. Also, some comments and redundant options are cleaned up in the tests file.
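
To illustrate why a hard-coded word index breaks when a message is reworded, here is a minimal Python sketch. It is not the Lean test code from this PR: the function names and the reworded message are hypothetical, and the first message follows the error quoted earlier in this thread, with the combining character written as an escape. Matching the `(U+XXXX)` suffix is shown only as one possible alternative.

```python
import re

# Message in the style quoted earlier in this thread.
MESSAGE = "This line contains a bad unicode character '\u0332' (U+0332)."
# Hypothetical rewording of the same message.
REWORDED = "Unicode character '\u0332' (U+0332) is not on the allowlist."


def by_word_index(message, index=7):
    """Take the word at a fixed position and drop the surrounding quotes."""
    return message.split()[index].strip("'")


CODE_POINT = re.compile(r"\(U\+([0-9A-F]{4,6})\)")


def by_code_point(message):
    """Recover the character from the (U+XXXX) suffix instead."""
    match = CODE_POINT.search(message)
    return chr(int(match.group(1), 16)) if match else None


print(by_word_index(MESSAGE) == "\u0332")   # index 7 fits the first wording
print(by_word_index(REWORDED) == "\u0332")  # same index, different wording
print(by_code_point(REWORDED) == "\u0332")  # the suffix is still there
```

The fixed index only works as long as the wording in front of the character keeps the same number of words, which is the failure described above.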

grunweg

Thanks for your perseverance - let's get this merged!
bors merge

@grunweg

Extends the text-based style linter that checks all unicode characters. Provides automatic replacements for some disallowed characters.

Unicode is very versatile and useful for Lean and Mathlib. However, it is also very complex and few people have a thorough understanding of all its pitfalls (I don't claim to be one of them). In order to avoid unpleasant surprises going forward, both accidental and malicious, we should keep track of which Unicode characters are allowed in Mathlib.

In programming and cybersecurity, there are many known [issues and attacks concerning unicode](https://en.wikipedia.org/wiki/Unicode#Security_issues). Many open source repositories have been hit by such attacks, which are becoming ever more frequent due to the use of automation using e.g. large language models. Some notable ones:

- [homograph attacks](https://en.wikipedia.org/wiki/IDN_homograph_attack): confusion caused by use of distinct characters which look the same (we probably don't want to fully address this and this PR does not attempt to)
- [Trojan Source](https://en.wikipedia.org/wiki/Trojan_Source): abuse of bidirectional characters. Characters used by languages with right-to-left reading direction can cause code to be displayed differently than it is parsed. (This PR tries to address this)
- Exploits involving Private Use Area characters, e.g. [GlassWorm](https://abit.ee/en/cybersecurity/viruses-trojans-and-other-malware/glassworm-github-supply-chain-attack-unicode-solana-malware-npm-vs-code-2026-en). (This PR tries to address this)

See also [unicode code source handling](https://www.unicode.org/reports/tr55/) and [Programming with Unicode](https://unicodebook.readthedocs.io/nightmare.html) for further details and guidelines.

Co-authored-by: Michael Rothgang @grunweg
Co-authored-by: Jon Eugster @joneugster

mathlib-bors · 2026-04-13T16:04:10Z

Build failed (retrying...):

ci (staging) / Lint style

mathlib-bors · 2026-04-13T16:05:36Z

Build failed:

ci (staging) / Post-Build Step

Fix if necessary, and then someone with permission can run bors r+ or bors retry.

grunweg · 2026-04-13T16:52:50Z

Once more with feeling:
bors merge

mathlib-bors · 2026-04-13T17:57:43Z

Pull request successfully merged into master.

Build succeeded:

grunweg · 2026-04-13T21:51:50Z

I just merged the webpage PR: I'll happily merge a follow-up PR tomorrow linking to the style guide.

As planned in #36773, we add a link to the style guide, which was updated at leanprover-community/leanprover-community.github.io#820.

Further changes:
- some more documentation updates (some obsolete comments were talking about a "blocklist" which no longer exists)
- change order of checks in `isAllowedCharacter`. The new order is more logical and may slightly improve performance (which probably doesn't matter) since Mathlib has more characters in `otherInMathlib` than `emojis`.
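
The point about check order is that the checks short-circuit: the first list that contains the character ends the lookup, so the list that matches most often is best tried early. A minimal Python sketch of that idea; the two sets are tiny hypothetical stand-ins, the ASCII check is simplified, and none of this is the Lean definition of `isAllowedCharacter`:

```python
# Hypothetical stand-ins; the linter's real lists are much longer.
OTHERS = frozenset("→∀∃λℕℝ≤∈")
EMOJIS = frozenset("✅❌")

# Checks are tried in order and the first one that succeeds wins.
CHECKS = [
    ("ascii", str.isascii),
    ("others", OTHERS.__contains__),
    ("emojis", EMOJIS.__contains__),
]


def accepted_by(ch):
    """Return the name of the first check that accepts `ch`, or None."""
    for name, check in CHECKS:
        if check(ch):
            return name
    return None


for ch in ["x", "→", "✅", "\u202b"]:
    print(f"U+{ord(ch):04X}", accepted_by(ch))
```

A character that no check accepts falls through all of them, so the order affects only how quickly allowed characters are accepted, not the result.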

adomasbaliuka added 3 commits March 17, 2026 16:32

Implements full allowlist for unicode characters

03ae62b

some auto-fixes in whitespace

2d91f82

Updates Authors list

9b54fab

adomasbaliuka mentioned this pull request Mar 17, 2026

feat(Tactic/Linter): lint unwanted unicode #16215

Closed

4 tasks

adomasbaliuka added the t-linter label Mar 17, 2026

reorder declarations

a7e31f7

puts functions unused in the file itself (only imported from here) at the bottom

vihdzp reviewed Mar 18, 2026

scripts/extract-unique-nonascii.lean (outdated)

joneugster reviewed Mar 18, 2026

Mathlib/Tactic/Linter/TextBased/UnicodeLinter.lean

fix typo in comment

d508c77

Co-authored-by: Violeta Hernández Palacios <vi.hdz.p@gmail.com>

adomasbaliuka mentioned this pull request Mar 18, 2026

[Merged by Bors] - feat(Tactic/Linter): Add more unicode replacements #36811

Closed

adomasbaliuka added and removed the blocked-by-other-PR label (this PR depends on another PR; the label is automatically managed by a bot) Mar 18, 2026

mathlib-dependent-issues bot added and removed the blocked-by-other-PR label Mar 18, 2026

mathlib-merge-conflicts bot added the merge-conflict label (the PR has a merge conflict with master and needs manual merging; the label is managed by a bot) Mar 21, 2026

adomasbaliuka added 2 commits March 22, 2026 21:35

merge master

95073b5

github-actions bot removed the merge-conflict label Mar 22, 2026

Fixes unit test

4c9ded4

mathlib-triage bot assigned joneugster Mar 23, 2026

joneugster reviewed Mar 27, 2026

joneugster added the awaiting-author label (a reviewer has asked the author a question or requested changes) Mar 27, 2026

adomasbaliuka added 3 commits March 29, 2026 13:24

fix indentation

cfd7a08

scripts/extract-unique-nonascii: print excluded

6f28b44

Adjust linter error message

61d5ac2

removes single occurrence of grave accent diacritic

73cdb9c

The character was used in a comment, seemingly by accident. Also removes it from the allow-list for the linter.

Merge branch 'master' into unicode_linter

6d4ab8c

Vierkantor approved these changes Apr 13, 2026

Mathlib/Tactic/Linter/TextBased.lean (outdated)

scripts/extract-unique-nonascii.lean (outdated)

mathlib-triage bot added the delegated label (this pull request has been delegated to the PR author, or occasionally another non-maintainer) and removed the maintainer-merge label (a reviewer has approved the changes; awaiting maintainer approval) Apr 13, 2026

adomasbaliuka and others added 4 commits April 13, 2026 14:04

updates linter error message: blocklist -> allowlist

7f7257f

Co-authored-by: Anne Baanen <Vierkantor@users.noreply.github.com>

fix: long line

3be069c

moves isAscii to Mathlib.Data.String.Defs

3d24c48

grunweg reviewed Apr 13, 2026

mathlib-triage bot added the ready-to-merge label (this PR has been sent to bors) Apr 13, 2026

mathlib-bors bot changed the title ~~feat(Tactic/Linter): lint unwanted unicode~~ [Merged by Bors] - feat(Tactic/Linter): lint unwanted unicode Apr 13, 2026

mathlib-bors bot closed this Apr 13, 2026

adomasbaliuka deleted the unicode_linter branch April 13, 2026 18:10

This was referenced Apr 13, 2026

[Merged by Bors] - chore(Tactic/Linter): unicode linter documentation improvements #38019

Closed

RFC: Remove character '﷼' from abbreviations leanprover/vscode-lean4#522

Closed
