LR-10 Improves sentence recognition for German with ordinal numbers #18560

Merged
mhkuu merged 17 commits into trunk from
LR-10-de-sentence-detection-is-incorrect-in-sentences-containing-ordinal-numbers
Jun 15, 2022

Contributor

@hdvos hdvos commented Jun 13, 2022

Context

  • Previously, a German ordinal number (for example "12.") caused the sentenceTokenizer to split the sentence, because the ordinal ends in a full stop and the next word often begins with a capital letter (in German, all nouns start with a capital letter). This PR adds exceptions to the existing sentenceTokenizer that catch this case, so a sentence is no longer split at a German ordinal.

Summary

This PR can be summarized in the following changelog entry:

  • Improves sentence recognition for German by disregarding ordinal numbers as potential sentence boundaries.
  • [yoastseo] Improves sentence recognition for German by disregarding ordinal numbers as potential sentence boundaries.
  • [shopify-seo] Improves sentence recognition for German by disregarding ordinal numbers as potential sentence boundaries.

Relevant technical choices:

  • I created a new sentenceTokenizer class for German that extends the generic sentenceTokenizer.
  • I added an 'empty' method (endsWithOrdinalDot) to the generic sentenceTokenizer to prevent an error. In the generic tokenizer this method always returns false.
  • We decided to support ordinals of up to 3 digits. This prevents years (e.g. 2022) from breaking, and higher ordinals are quite rare anyway.
  • The tokenizer fails to recognize a sentence boundary if a sentence ends with a number, because the final full stop is then read as an ordinal dot. This is a known edge case.
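
To illustrate the choices above, here is a minimal sketch of the idea. It is not the code from this PR: the class names, the splitSentences method and the regular expressions are assumptions made for the example; only the method name endsWithOrdinalDot and the 3-digit limit come from the description.

```javascript
// Hypothetical stand-in for the generic tokenizer (not the real yoastseo class).
class GenericSentenceTokenizer {
    // Generic languages have no ordinal-dot rule, so this always returns false.
    endsWithOrdinalDot( text ) {
        return false;
    }

    // Simplified splitting: a full stop followed by whitespace and a capital
    // letter is a boundary, unless the full stop belongs to an ordinal.
    splitSentences( text ) {
        const sentences = [];
        const boundary = /\.\s+(?=\p{Lu})/gu;
        let start = 0;
        let match;
        while ( ( match = boundary.exec( text ) ) !== null ) {
            const upToDot = text.slice( start, match.index + 1 );
            if ( this.endsWithOrdinalDot( upToDot ) ) {
                continue; // Treat the full stop as part of the ordinal.
            }
            sentences.push( upToDot.trim() );
            start = match.index + match[ 0 ].length;
        }
        const rest = text.slice( start ).trim();
        if ( rest ) {
            sentences.push( rest );
        }
        return sentences;
    }
}

// Hypothetical German tokenizer that overrides only the ordinal check.
class GermanSentenceTokenizer extends GenericSentenceTokenizer {
    // Assumed pattern: one to three digits, preceded by whitespace or the
    // start of the text, directly followed by the final full stop.
    endsWithOrdinalDot( text ) {
        return /(^|\s)\d{1,3}\.$/.test( text );
    }
}

const text = "In den 12. Club der Stadt wird nachts getanzt. Das ist gut.";
console.log( new GenericSentenceTokenizer().splitSentences( text ).length );
console.log( new GermanSentenceTokenizer().splitSentences( text ).length );
```

In this sketch the generic tokenizer splits after "12." while the German one keeps the first sentence whole. The known edge case shows up as well: a text such as "Ich wohne in Zimmer 12. Das ist gut." is kept as one sentence by the German tokenizer.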

Test instructions

Test instructions for the acceptance test before the PR gets merged

This PR can be acceptance tested by following these steps:

For WordPress

  1. Set the site language to German (Settings -> site language -> save)

  2. Create a new post (Beiträge --> Erstellen) and give it a title.

  3. Paste this text in the content.

  4. Add "12. Club" (including the double quotes) as the focus keyphrase.

  5. Go to SEO-Analyse and toggle the text highlighting for Keyphrasendichte. Make sure that 12. Club is highlighted like in the screenshot.
    [screenshot: wp_keyword]

  6. In the Readability Analysis (Lesbarkeits-Analyse), toggle the mark text button for passive sentences. Make sure that the following sentences are marked:
      • In den 1. Club der Stadt wird nachts getanzt.
      • In den 12. Club der Stadt wird nachts getanzt.
      • In den 123. Club der Stadt wird nachts getanzt.
    And make sure that the following sentence is only partly marked (which is undesirable, but a caveat that we accept):
      • In den 1234. Club der Stadt wird nachts getanzt.
    [screenshot: wp_pv]
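
Why the 1234 sentence is only partly marked can be sketched with a small check. The regular expression below is an assumption for illustration (one to three digits before the full stop), not necessarily the pattern used in the PR.

```javascript
// Assumed pattern: one to three digits, preceded by whitespace or the start
// of the text, directly followed by the final full stop.
const ORDINAL_DOT = /(^|\s)\d{1,3}\.$/;

// The test sentences, cut off right after the number's full stop.
const fragments = [ "In den 1.", "In den 12.", "In den 123.", "In den 1234." ];

// true means the full stop is treated as an ordinal dot (no sentence split).
const results = fragments.map( ( fragment ) => ORDINAL_DOT.test( fragment ) );
console.log( results ); // [ true, true, true, false ]
```

Under this assumption the four-digit number is not treated as an ordinal, so the sentence is split after "1234." and only part of it is marked.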

For Shopify

  1. Set the store language to German (see this page on how to do that).
  2. Create a new product and give it a name. Paste the same text as in step 3 for WordPress.
  3. Go to the Yoast app and repeat steps 4-6 from the WordPress test. Make sure to copy and paste the text when optimizing, not in the Shopify editor.
    [screenshot: shopify_keyword]
    [screenshot: shopify_pv]

Test instructions for QA when the code is in the RC

QA can test this PR by following these steps:

  • QA should use the same steps as above.

Impact check

This PR affects the following parts of the plugin, which may require extra testing:

UI changes

  • This PR changes the UI in the plugin. I have added the 'UI change' label to this PR.

Other environments

  • This PR also affects Shopify. I have added a changelog entry starting with [shopify-seo], added test instructions for Shopify and attached the Shopify label to this PR.

Documentation

  • I have written documentation for this change.

Quality assurance

  • I have tested this code to the best of my abilities
  • I have added unittests to verify the code works as intended
  • If any part of the code is behind a feature flag, my test instructions also cover cases where the feature flag is switched off.
  • I have written this PR in accordance with my team's definition of done.

Fixes https://yoast.atlassian.net/browse/LR-10
Fixes Yoast/YoastSEO.js#745

@hdvos hdvos added the labels "changelog: enhancement" (Needs to be included in the 'Enhancements' category in the changelog) and "Shopify" (This PR impacts Shopify.) Jun 13, 2022
@hdvos hdvos marked this pull request as ready for review June 14, 2022 12:03
Contributor

@agnieszkaszuba agnieszkaszuba left a comment

CR: I have one small comment, and I made a commit with minor fixes (such as spelling). But other than that, looks good! 👍

I have not acceptance tested yet (will leave it for another day/person).

Comment thread packages/yoastseo/src/languageProcessing/researches/getPassiveVoiceResult.js Outdated
Contributor Author

hdvos commented Jun 15, 2022

Thank you for the feedback. I'm pretty annoyed with myself for those sloppy typos. I really should do better.

@mhkuu mhkuu changed the title LR-10-de-sentence-detection-is-incorrect-in-sentences-containing-ordinal-numbers LR-10 Improves sentence recognition for German by disregarding ordinal numbers Jun 15, 2022
@mhkuu mhkuu changed the title LR-10 Improves sentence recognition for German by disregarding ordinal numbers LR-10 Improves sentence recognition for German with ordinal numbers Jun 15, 2022
Contributor

mhkuu commented Jun 15, 2022

Acceptance test:

  • works neatly in WordPress 🥇
  • for Shopify, the issue with marking can be circumvented by pasting the text only when optimizing in the Yoast SEO app, not into the Shopify editor. This is related to this resolved issue, which returns here because there we only fixed it specifically for the paragraph length assessment (so I'll create a new issue). So, I've removed "Note that in step 5: the keywords are correctly counted but not marked in the text. This is related to a known issue." from the test instructions.

Let's merge! 👍

@mhkuu mhkuu added this to the 19.3 milestone Jun 15, 2022
@mhkuu mhkuu merged commit 6f796fb into trunk Jun 15, 2022
@mhkuu mhkuu deleted the LR-10-de-sentence-detection-is-incorrect-in-sentences-containing-ordinal-numbers branch June 15, 2022 14:19
herregroen pushed a commit that referenced this pull request Jun 28, 2022

Development

Successfully merging this pull request may close these issues.

Incorrect sentence detection in case of ordinal numbers in German

3 participants