Exclude punctuation from character count for Japanese texts#22050

Merged
agnieszkaszuba merged 9 commits into trunk from
exclude-punctuation-from-character-count-for-sentence-length-for-japanese
Feb 26, 2025
Conversation

@marinakoleva
Contributor

@marinakoleva marinakoleva commented Feb 17, 2025

Context

  • Unlike for all other languages, for Japanese we measure the length of a sentence, paragraph, or text by counting characters rather than words.
  • Because punctuation is not taken into account when counting words, it made sense to exclude it from the character count as well.
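
A minimal sketch of the idea described above, not the actual contents of countCharacters.js. The punctuation list and the function name `countCharactersSketch` are assumptions made for illustration only.

```javascript
// Hypothetical, non-exhaustive punctuation list (an assumption, not the list used in yoastseo).
const PUNCTUATION = "、。・「」『』()()!?!?::…~";

/**
 * Counts the characters in a text, skipping whitespace and the listed punctuation.
 *
 * @param {string} text The text to count.
 * @returns {number} The number of counted characters.
 */
function countCharactersSketch( text ) {
	// Spreading a string iterates by code point, so each character is counted once.
	return [ ...text ].filter(
		( char ) => ! /\s/u.test( char ) && ! PUNCTUATION.includes( char )
	).length;
}

console.log( countCharactersSketch( "こんにちは、世界。" ) ); // 7
```

The comma and the full stop are skipped, so only the seven kana and kanji characters are counted.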

Summary

This PR can be summarized in the following changelog entry:

  • Improves the accuracy of assessments measuring character count for Japanese texts by removing common punctuation from the count.
  • [yoastseo] Removes common punctuation from the character count for Japanese in countCharacters.js.
  • [shopify-seo] Improves the accuracy of assessments measuring character count for Japanese texts by removing common punctuation from the count.

Relevant technical choices:

  • For Japanese, we use 2 helpers to help 😉 us with the character count: countCharacters.js and wordsCharacterCount.js. We use the latter in Keyphrase assessments, which filter out punctuation after the string has been segmented. We also use the helper for the Reading time feature.
  • This PR focuses on the helper countCharacters.js, which is used for the following assessments: Sentence length, Paragraph length, Subheading distribution, Text length.
  • I didn't use the general removePunctuation helper in countCharacters.js because it doesn't work for Japanese unless the punctuation is at the beginning or end of a string. Punctuation elsewhere in the string is only recognized when there is a space before or after it, and Japanese mostly doesn't use spaces.
  • The punctuation characters that were excluded represent the most commonly used punctuation in Japanese. I was the one making the choice of what is "most common" :)
  • Resources consulted:
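
The limitation of a whitespace-oriented helper can be illustrated with a simplified stand-in. The function `stripEdgePunctuation` below is hypothetical: it only mimics the behaviour attributed to removePunctuation above (punctuation is recognized at the start or end of a string, or next to whitespace) and is not that helper's real implementation. The punctuation class is also an assumption.

```javascript
// Hypothetical punctuation class, for illustration only.
const PUNCT = "[、。・「」!?]";

// Stand-in for a whitespace-oriented helper: strips punctuation only at the
// start or end of the string, or directly next to whitespace.
function stripEdgePunctuation( text ) {
	const leading = new RegExp( `(^|\\s)${ PUNCT }+`, "gu" );
	const trailing = new RegExp( `${ PUNCT }+(\\s|$)`, "gu" );
	return text.replace( leading, "$1" ).replace( trailing, "$1" );
}

// Space-independent alternative: strips the listed punctuation wherever it occurs.
function stripAllPunctuation( text ) {
	return text.replace( new RegExp( PUNCT, "gu" ), "" );
}

const sentence = "「黒猫」は、短編小説。";
console.log( stripEdgePunctuation( sentence ) ); // 黒猫」は、短編小説
console.log( stripAllPunctuation( sentence ) ); // 黒猫は短編小説
```

Because the sentence contains no spaces, the edge-based version leaves the closing bracket and the comma in place, which would inflate a character count.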

Test instructions

Test instructions for the acceptance test before the PR gets merged

This PR can be acceptance tested by following these steps:

Testing on a Post in WordPress

  • Make sure the Free plugin is active.
  • Set your site language to Japanese.
  • Open a new post
  • Test the Sentence length assessment (文の長さ) within the Readability analysis (可読性解析)
    • Paste this sentence: 「黒猫」(くろねこ、Black Cat)は、1843年に発表されたエドガー・アラン・ポーの短編小説。
    • The sentence has 40 characters when all punctuation marks are excluded (the maximum recommended length).
    • Confirm the assessment returns a 🍏 green traffic light, with the feedback 文の長さ: いい感じです !
    • Confirm that adding a random letter or number to the sentence switches the feedback to a red traffic light 🍎 . Remove that letter.
  • Test the Paragraph length assessment (段落の長さ)
    • Add this paragraph to the text:
      現在、全国の約130住宅が参加しており、そのほとんどが個人所有の民家です。日本の文化財建造物は、そのほとんどが『木造建築』で地震、台風、洪水など自然災害や火災の多い中で築後何百年と云う長い歴史を生き残って来たものです。更に戦争や社会構造などの変化で消えてしまった建造物も数多くあったことでしょう。昭和52年(1977)に当『全国重文民家の集い』が誕生して早や半世紀近く経とうとしています。 その間、国指定の重要文化財民家(略称: 重文民家)の所有者が手探りで学んで来た経験を互いに情報交換し、更に地域社会との更なる交流、行政や学識経験者との協力を深めて来ました。 構造物としての家屋の保存だけでなく、地域社会の文化やその家に伝わる伝統・住文化の継承に貢献。
    • The paragraph is 328 characters altogether, but when spaces and punctuation are excluded, it's 300 characters (the maximum recommended length)
    • Confirm the assessment returns a 🍏 green traffic light, with the feedback 段落の長さ: 長過ぎる段落はありません。Good Job!
    • Confirm that adding a random letter or number to the paragraph switches the feedback to an orange traffic light 🍊 . Remove that letter.
  • Test the Subheading distribution assessment (小見出し分布)
    • Add the following text to the post:
      又、近年では英国のH.H.A.(Historic Houses Association ―歴史住宅協会―)との交流を深め、英国を初め欧州の文化財情報や所有者の高齢化に伴う次世代への継承問題についての情報交換を行っている。 こんばんは~!お昼のブログもたくさん見ていただきありがとうございました:Dうーーー、今から90年代に戻れるなら「絶対に抜いたらあかんで!」って言いに行きたい💦 でも上に貼ってるブログ見たら、しみじみ眉毛で顔ってぜんぜん印象違うな…と思う!(NARSのチークでもおすすめ~!)美容つながりもうひとつ・・・。生理前はホルモンバランスが崩れて口周りにニキビができちゃう。
    • The text in the post now becomes 671 characters altogether, but when spaces and punctuation are excluded, it's 600 characters (the maximum recommended length you can have without having a subheading).
    • Confirm the assessment returns a 🍏 green traffic light, with the feedback 小見出し分布: 小見出しは使用していませんが、テキストは十分に短く、おそらく必要ありません。
    • Confirm that adding a random letter or number to the text switches the feedback to a red traffic light 🍎 . Remove that letter.
  • Test the Text length assessment (テキストの長さ) within the SEO analysis (SEO 解析)
    • Confirm the assessment returns a 🍏 green traffic light, with the feedback テキストの長さ: テキストは600 文字です。いいですね !
    • Remove 1 character from the text (one that is not a punctuation mark).
    • Confirm the assessment now returns an 🍊 orange traffic light, with the feedback テキストの長さ: テキストは 599 文字です。これは推奨下限値 600 文字を少し下回ります。文章をもう少し加えましょう.
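
The sentence-length step above can be checked by hand. The sketch below assumes a hypothetical punctuation list and a hypothetical helper `countWithoutPunctuation`. The limit of 40 is taken from the test instructions, and the real assessment may be implemented differently.

```javascript
// Hypothetical punctuation list, chosen to cover the marks in the test sentence.
const IGNORED = new Set( [ ..."「」()()、。・" ] );
// Maximum recommended sentence length, as stated in the steps above.
const MAX_SENTENCE_LENGTH = 40;

function countWithoutPunctuation( text ) {
	// Skip whitespace (trim() of a whitespace character is empty) and listed punctuation.
	return [ ...text ].filter( ( ch ) => ch.trim() !== "" && ! IGNORED.has( ch ) ).length;
}

const sentence = "「黒猫」(くろねこ、Black Cat)は、1843年に発表されたエドガー・アラン・ポーの短編小説。";

console.log( countWithoutPunctuation( sentence ) ); // 40
console.log( countWithoutPunctuation( sentence ) <= MAX_SENTENCE_LENGTH ); // true
console.log( countWithoutPunctuation( sentence + "あ" ) <= MAX_SENTENCE_LENGTH ); // false
```

Note that the prolonged sound mark ー is counted as a regular character here, which is needed to arrive at 40.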

Testing on a Product page in Shopify

Note: Some of the assessments we want to test here have different criteria for regular posts and for product pages, this is why the testing instructions are different.

  • Set your site language to Japanese.
  • Open a new product page.
  • Test the Sentence length assessment (文の長さ) within the Readability analysis (可読性解析)
    • Paste this sentence: 「黒猫」(くろねこ、Black Cat)は、1843年に発表されたエドガー・アラン・ポーの短編小説。
    • The sentence has 40 characters when all punctuation marks are excluded (the maximum recommended length).
    • Confirm the assessment returns a 🍏 green traffic light, with the feedback 文の長さ: いい感じです !
    • Confirm that adding a random letter or number to the sentence switches the feedback to a red traffic light 🍎 . Remove that letter.
  • Test the Paragraph length assessment (段落の長さ)
    • Add this paragraph to the text:
      現在には、全国の約130住宅が参加しており、そのほとんどが個人所有の民家です。日本の文化財建造物は、そのほとんどが『木造建築』で地震、台風、洪水など自然災害や火災の多い中で築後何百年と云う長い歴史を生き残って来たものです。更に戦争や社会構造などの変化で消えてしまった建造物も数多くあったことでしょう。
    • The paragraph is 150 characters altogether, but when spaces and punctuation are excluded, it's 140 characters (the maximum recommended length)
    • Confirm the assessment returns a 🍏 green traffic light, with the feedback 段落の長さ: 長過ぎる段落はありません。Good Job!
    • Confirm that adding a random letter or number to the paragraph switches the feedback to an orange traffic light 🍊 . Remove that letter.
  • Test the Subheading distribution assessment (小見出し分布)
    • Add the following text to the post:
      昭和52年(1977)に当『全国重文民家の集い』が誕生して早や半世紀近く経とうとしています。 その間、国指定の重要文化財民家(略称: 重文民家)の所有者が手探りで学んで来た経験を互いに情報交換し、更に地域社会との更なる交流、行政や学識経験者との協力を深めて来ました。 構造物としての家屋の保存だけでなく、地域社会の文化やその家に伝わる伝統・住文化の継承に貢献。又、近年では英国のH.H.A.(Historic Houses Association ―歴史住宅協会―)との交流を深め、英国を初め欧州の文化財情報や所有者の高齢化に伴う次世代への継承問題についての情報交換を行っている。 こんばんは~!お昼のブログもたくさん見ていただきありがとうございました:Dうーーー、今から90年代に戻れるなら「絶対に抜いたらあかんで!」って言いに行きたい💦 でも上に貼ってるブログ見たら、しみじみ眉毛で顔ってぜんぜん印象違うな…と思う!(NARSのチークでもおすすめ~!)美容つながりもうひとつ・・・。生理前はホルモンバランスが崩れて口周りにニキビができちゃう。
    • The text in the post is now 672 characters altogether, but when spaces and punctuation are excluded, it's 602 characters. That is 2 characters above 600, the maximum recommended length you can have without a subheading.
    • Confirm the assessment returns a 🍎 red traffic light, with the feedback 小見出し分布: テキストが比較的長いにも関わらず小見出しが使われていません。小見出しをいくつか追加してください。
    • Remove the last three characters in the text (ゃう。). They count as 2 characters, because the last one is a full stop. This makes the text 600 characters.
    • Confirm the assessment disappears from the Readability analysis (you can use Cmd+F to search for the name of the assessment, 小見出し分布). This is because the assessment applies only to texts longer than 600 characters.
  • Test the Text length assessment (テキストの長さ) within the SEO analysis (SEO 解析)
    • Confirm the assessment returns a 🍏 green traffic light, with the feedback テキストの長さ: テキストは600 文字です。いいですね !
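
The arithmetic behind removing the last three characters (ゃう。) can be sketched as follows. The punctuation list and the helper `countable` are hypothetical. The figure of 602 is the one stated in the steps above.

```javascript
// Hypothetical punctuation list, for illustration only.
const PUNCTUATION_MARKS = "、。・「」『』()";

function countable( text ) {
	return [ ...text ].filter(
		( ch ) => ! /\s/u.test( ch ) && ! PUNCTUATION_MARKS.includes( ch )
	).length;
}

const removedTail = "ゃう。";
const countBefore = 602; // Count stated in the test instructions.

console.log( [ ...removedTail ].length ); // 3 characters are deleted
console.log( countable( removedTail ) ); // 2, because the full stop is not counted
console.log( countBefore - countable( removedTail ) ); // 600
```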

Relevant test scenarios

  • Changes should be tested with the browser console open
  • Changes should be tested on different posts/pages/taxonomies/custom post types/custom taxonomies
  • Changes should be tested on different editors (Default Block/Gutenberg/Classic/Elementor/other)
  • Changes should be tested on different browsers
  • Changes should be tested on multisite

Test instructions for QA when the code is in the RC

  • QA should use the same steps as above.

QA can test this PR by following these steps:

Impact check

This PR affects the following parts of the plugin, which may require extra testing:

  • N/a

UI changes

  • This PR changes the UI in the plugin. I have added the 'UI change' label to this PR.

Other environments

  • This PR also affects Shopify. I have added a changelog entry starting with [shopify-seo], added test instructions for Shopify and attached the Shopify label to this PR.

Documentation

  • I have written documentation for this change. For example, comments in the Relevant technical choices, comments in the code, documentation on Confluence / shared Google Drive / Yoast developer portal, or other.

Quality assurance

  • I have tested this code to the best of my abilities.
  • During testing, I had activated all plugins that Yoast SEO provides integrations for.
  • I have added unit tests to verify the code works as intended.
  • If any part of the code is behind a feature flag, my test instructions also cover cases where the feature flag is switched off.
  • I have written this PR in accordance with my team's definition of done.
  • I have checked that the base branch is correctly set.

Innovation

  • No innovation project is applicable for this PR.
  • This PR falls under an innovation project. I have attached the innovation label.
  • I have added my hours to the WBSO document.

Fixes ##523

@marinakoleva marinakoleva added the changelog: enhancement Needs to be included in the 'Enhancements' category in the changelog label Feb 17, 2025
@coveralls

coveralls commented Feb 17, 2025

Pull Request Test Coverage Report for Build 052891733da39674c8a11e1e4765a00724685f38

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 1 of 1 (100.0%) changed or added relevant line in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.001%) to 57.875%

Totals Coverage Status
Change from base Build f709a8bfb5ba868df873f15a28afbb0c35197f8e: 0.001%
Covered Lines: 13693
Relevant Lines: 23294

💛 - Coveralls

} );
it( "should return a good result for taxonomy pages in Japanese when the text is 60 characters or more", function() {
const paper = new Paper( "欧米では、かつては不吉の象徴とする迷信があり、魔女狩りなどによって黒猫が殺されることがあった。たとえばベルギー・ウェス。" );
const paper = new Paper( "欧米では、かつては不吉の象徴とする迷信があり、魔女狩りなどによって黒猫が殺されることがあった。その傾向は現在も続いており、" +
Contributor Author

Since punctuation was removed from the count, the sentence length became less than 60 characters, so I added some more text, in order to trigger the same feedback from the assessment.
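
The effect on the original test string can be illustrated with a short sketch. The punctuation pattern below is an assumption that covers only the marks appearing in this string. It is not the list used in countCharacters.js.

```javascript
// Hypothetical pattern covering the punctuation marks in this particular string.
const MARKS = /[、。・]/gu;

const oldText = "欧米では、かつては不吉の象徴とする迷信があり、魔女狩りなどによって黒猫が殺されることがあった。たとえばベルギー・ウェス。";

const rawLength = [ ...oldText ].length;
const withoutPunctuation = [ ...oldText.replace( MARKS, "" ) ].length;

console.log( rawLength ); // 60
console.log( withoutPunctuation ); // 55
```

With the five punctuation marks excluded, the old string falls below the 60 characters named in the test title, which is why more text was needed.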

expect( sentences[ 1 ].sentenceLength ).toBe( 7 );
expect( sentences[ 2 ].sentenceLength ).toBe( 5 );
} );
it( "returns sentences with exclamation mark", function() {
Contributor Author

This test was combined with the one above.

@marinakoleva marinakoleva added the Shopify This PR impacts Shopify. label Feb 18, 2025
@github-actions

A merge conflict has been detected for the proposed code changes in this PR. Please resolve the conflict by either rebasing the PR or merging in changes from the base branch.

@agnieszkaszuba
Contributor

agnieszkaszuba commented Feb 26, 2025

CR and testing in WP: 👍
Marina and Jordi also confirmed it works as expected in Shopify

@agnieszkaszuba agnieszkaszuba merged commit 56f1adc into trunk Feb 26, 2025
@agnieszkaszuba agnieszkaszuba deleted the exclude-punctuation-from-character-count-for-sentence-length-for-japanese branch February 26, 2025 14:10
@marinakoleva marinakoleva added this to the 24.6 milestone Feb 27, 2025
@hardikgohil7988

Tested all the assessments given in the test instruction in Japanese language. No issue found.

Labels

changelog: enhancement Needs to be included in the 'Enhancements' category in the changelog Shopify This PR impacts Shopify.

4 participants