[Merged by Bors] - feat(Tactic/Linter): lint unwanted unicode #36773
adomasbaliuka wants to merge 30 commits into leanprover-community:master
PR summary 10fc8c0fdd
Import changes for modified files: No significant changes to the import graph
The remaining error is where I'm not sure what to do. Another question @joneugster is whether we want to enable text-variant-selectors for some characters, such as the errors. I put a TODO in the code at that point.
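For context on the text-variant-selector question: a variation selector is a separate code point that follows a base character, with U+FE0E requesting text presentation and U+FE0F requesting emoji presentation. The following is a minimal Python sketch of how such pairs could be located in a line. It is not the Lean linter code, and the function name `find_selector_pairs` is hypothetical.

```python
TEXT_SELECTOR = "\uFE0E"   # VARIATION SELECTOR-15: request text presentation
EMOJI_SELECTOR = "\uFE0F"  # VARIATION SELECTOR-16: request emoji presentation


def find_selector_pairs(line):
    """Return (index, base character, kind) for each variation selector in `line`.

    `kind` is "text" or "emoji". A selector at index 0 has no base character
    and is reported with an empty base.
    """
    pairs = []
    for i, ch in enumerate(line):
        if ch in (TEXT_SELECTOR, EMOJI_SELECTOR):
            base = line[i - 1] if i > 0 else ""
            kind = "text" if ch == TEXT_SELECTOR else "emoji"
            pairs.append((i, base, kind))
    return pairs
```

A linter that allows selectors only after specific characters could compare the reported base character against a per-selector allow-list.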
puts functions unused in the file itself (only imported from here) at the bottom
Co-authored-by: Violeta Hernández Palacios <vi.hdz.p@gmail.com>
This PR/issue depends on:
This pull request has conflicts, please merge
I propose to disallow this character. I rewrote the module docstring, presenting the relevant type more explicitly instead of using the underline, which had been used to signify a multivariate polynomial. This notation does not seem to have been standard in Mathlib, given that this was the only occurrence of the character.
joneugster left a comment
Thank you!
Overall, with the script to autogenerate the list, a check to remove redundant symbols from the hand-curated list (see comment), and some information for the user on how to add new symbols (see comment), I think this is a very reasonable addition!
I've validated that the script works and outputs the list included in this PR!
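The redundancy check mentioned above can be pictured as a set intersection: any symbol that the generated list already covers does not need to appear in the hand-curated list. This is a Python sketch under that assumption only. The function names and the sample symbols are hypothetical, and the actual check in the PR is written in Lean.

```python
def redundant_symbols(generated, curated):
    """Symbols in the hand-curated list that the generated list already covers."""
    return sorted(set(generated) & set(curated))


def merged_allow_list(generated, curated):
    """Union of both lists, which is what the linter would effectively accept."""
    return set(generated) | set(curated)


# Hypothetical sample data, not the lists from the PR.
generated = {"\u03b1", "\u03b2", "\u2192"}   # alpha, beta, rightwards arrow
curated = {"\u2192", "\u22a2"}               # rightwards arrow, right tack
```

Here `redundant_symbols(generated, curated)` reports the arrow, which could then be dropped from the curated list without changing the merged allow-list.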
The character was used in a comment, seemingly by accident. Also removes it from the allow-list for the linter.
I don't understand the CI failure...
I think you just have to merge
The webpage PR looks good to me, as does the proposed sequence of events.
This is on the "maintainer merge" queue with some more changes having happened after it was put on the queue. In my view, these additional changes are done and this PR can now be merged.
✌️ grunweg can now approve this pull request. To approve and merge a pull request, simply reply with
Co-authored-by: Anne Baanen <Vierkantor@users.noreply.github.com>
An earlier commit changed the error message for the unicode linter. The offending character is extracted by a hard-coded word position index (this is not elegant, but there is no obvious alternative). This index was not updated when the error message changed, which caused tests to fail. This commit updates the index, which should make the tests pass. It also cleans up some comments and redundant options in the tests file.
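To illustrate why a hard-coded word position is fragile, here is a small Python sketch. The two message strings are hypothetical stand-ins, since the real wording of the linter message is not shown in this thread, and `offending_char` is an illustrative helper rather than the function used in Mathlib.

```python
# Hypothetical message wordings, before and after a rewording.
OLD_MESSAGE = "character '\u200b' is not allowed"
NEW_MESSAGE = "unicode character '\u200b' is not allowed"


def offending_char(message, word_index):
    """Extract the quoted character found at a fixed word position.

    Splits on ASCII spaces only, so that an offending character which is
    itself Unicode whitespace does not shift the word positions.
    """
    word = message.split(" ")[word_index]
    return word.strip("'")
```

With the old wording, index 1 yields the zero-width space. After the rewording, index 1 yields the word `character` instead, and only index 2 recovers the offending character, which is the kind of mismatch the commit above fixes.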
grunweg left a comment
Thanks for your perseverance - let's get this merged!
bors merge
Extends the text-based style linter that checks all unicode characters. Provides automatic replacements for some disallowed characters.

Unicode is very versatile and useful for Lean and Mathlib. However, it is also very complex and few people have a thorough understanding of all its pitfalls (I don't claim to be one of them). In order to avoid unpleasant surprises going forward, both accidental and malicious, we should keep track of which Unicode characters are allowed in Mathlib.

In programming and cybersecurity, there are many known [issues and attacks concerning unicode](https://en.wikipedia.org/wiki/Unicode#Security_issues). Many open source repositories have been hit by such attacks, which are becoming ever more frequent due to the use of automation using e.g. large language models. Some notable ones:

- [homograph attacks](https://en.wikipedia.org/wiki/IDN_homograph_attack): confusion caused by use of distinct characters which look the same (we probably don't want to fully address this and this PR does not attempt to)
- [Trojan Source](https://en.wikipedia.org/wiki/Trojan_Source): abuse of bidirectional characters. Characters used by languages with right-to-left reading direction can cause code to be displayed differently than it is parsed. (This PR tries to address this)
- Exploits involving Private Use Area characters, e.g. [GlassWorm](https://abit.ee/en/cybersecurity/viruses-trojans-and-other-malware/glassworm-github-supply-chain-attack-unicode-solana-malware-npm-vs-code-2026-en). (This PR tries to address this)

See also [unicode code source handling](https://www.unicode.org/reports/tr55/) and [Programming with Unicode](https://unicodebook.readthedocs.io/nightmare.html) for further details and guidelines.

Co-authored-by: Michael Rothgang @grunweg
Co-authored-by: Jon Eugster @joneugster
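As a rough illustration of the two attack classes this PR tries to address, the following Python sketch flags explicit bidirectional formatting characters and Private Use Area characters in a piece of source text. It is not the Lean linter from this PR. The set `BIDI_CONTROLS` and the functions `classify` and `scan` are assumptions made for this example, and the real linter works from an allow-list rather than from a list of known-bad characters.

```python
import unicodedata
from typing import Optional

# Explicit bidirectional formatting characters: embeddings, overrides, isolates.
# Assumption: this example only covers these nine code points.
BIDI_CONTROLS = {
    "\u202A", "\u202B", "\u202C", "\u202D", "\u202E",
    "\u2066", "\u2067", "\u2068", "\u2069",
}


def classify(ch: str) -> Optional[str]:
    """Return a reason string if `ch` should be flagged, else None."""
    if ch in BIDI_CONTROLS:
        return "bidi-control"
    if unicodedata.category(ch) == "Co":  # general category "Co" = private use
        return "private-use"
    return None


def scan(text: str):
    """Return (line number, column, code point, reason) for each flagged character."""
    findings = []
    # Split on "\n" only, so that other Unicode line separators stay visible to the scan.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            reason = classify(ch)
            if reason is not None:
                findings.append((line_no, col, f"U+{ord(ch):04X}", reason))
    return findings
```

Reporting the code point in `U+XXXX` form matters here because the flagged characters are typically invisible, or are rendered differently from how they are parsed.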
Build failed (retrying...):
Build failed: Fix if necessary, and then someone with permission can run
Once more with feeling:
Pull request successfully merged into master. Build succeeded:
I just merged the webpage PR: I'll happily merge a follow-up PR tomorrow linking to the style guide. |
As planned in #36773, we add a link to the style guide, which was updated at leanprover-community/leanprover-community.github.io#820.

Further changes:

- some more documentation updates (some obsolete comments were talking about a "blocklist" which no longer exists)
- change order of checks in `isAllowedCharacter`. The new order is more logical and may slightly improve performance (which probably doesn't matter) since Mathlib has more characters in `otherInMathlib` than `emojis`.
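The reordering can be pictured with a Python analogue of an allow-list check. This is only a sketch: the two sets below are small placeholders for the real `otherInMathlib` and `emojis` lists, the blanket ASCII shortcut is an assumption about what the Lean function does, and `is_allowed_character` merely mirrors the name `isAllowedCharacter`.

```python
# Placeholder sets; the real lists live in the Lean source.
OTHER_IN_MATHLIB = frozenset("∀∃→←↔≤≥≠∈∉⊆∑∏ℕℤℚℝℂαβγ")
EMOJIS = frozenset("✅❌💥")


def is_allowed_character(ch: str) -> bool:
    """Allow-list check that tries the more frequently hit set first."""
    if ord(ch) < 128:            # assumption: ASCII is accepted outright
        return True
    if ch in OTHER_IN_MATHLIB:   # larger, more frequently hit set
        return True
    return ch in EMOJIS          # smaller set, checked last
```

The result does not depend on the order of the two set checks; the order only decides which lookup is tried first.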

Continues work from #16215 (due to PRs now being made from forks).
Discussed at Zulip
Note: the script was added due to reviewer comment in #16215. Perhaps that is overkill here.