
feat(search): support searchNormalize across non-Latin characters and intra-word punctuation#466

Open
gnbm wants to merge 29 commits into master from gm/new-languages-support

Conversation

gnbm (Collaborator) commented Apr 24, 2026


Issue number: resolves #279


What is the current behavior?

  • The normalizeString utility used the regex /[^\w]/g to strip non-word characters after NFD decomposition.
  • The \w character class in JavaScript matches only [A-Za-z0-9_], so characters from every non-Latin script (Greek, Cyrillic, Chinese, Japanese, Korean, Arabic, Thai, …) were treated as non-word and stripped entirely during normalization. This left searchNormalize: true broken for those scripts: labels became empty strings, so nothing could be matched.
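The failure mode described above can be reproduced in a few lines. This is a minimal sketch of the old pipeline as the description states it (NFD, then /[^\w]/g); `legacyNormalize` and `LEGACY_NON_WORD_REGEX` are hypothetical names used only for illustration, not the library's actual identifiers.

```javascript
// Sketch of the pre-fix behavior described above; names are hypothetical.
const LEGACY_NON_WORD_REGEX = /[^\w]/g;

function legacyNormalize(text) {
  // NFD splits "è" into "e" + U+0300; the regex then drops everything
  // outside [A-Za-z0-9_], including the combining mark.
  return text.normalize('NFD').replace(LEGACY_NON_WORD_REGEX, '');
}

console.log(legacyNormalize('Crème')); // "Creme" (Latin base letters survive)
console.log(legacyNormalize('Ένα'));   // ""      (Greek letters are outside \w)
console.log(legacyNormalize('北京'));   // ""      (so are CJK ideographs)
```

Because every non-Latin label collapses to the empty string, there is nothing left for a query to match against.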

What is the new behavior?

normalizeString now performs a Unicode-aware two-pass strip after NFD decomposition:

  1. /\p{M}/gu — strips Unicode combining marks (category M). Removes diacritics across all scripts while preserving the underlying letters/ideographs.
  2. /[^\p{L}\p{N}_]/gu — strips characters that are not Letters, Numbers, or underscore. Restores the punctuation- and whitespace-insensitive behavior the original /[^\w]/g provided for ASCII content (e.g. co-op still matches coop), in a script-aware way.

Both regexes are defined at module scope so they are compiled once instead of per call.
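As a rough illustration of the two-pass strip, the pipeline could look like the sketch below. The constant names come from this PR; the function body is an assumption reconstructed from the description, and the real Utils.normalizeString may differ (for example in where case folding happens).

```javascript
// Module-scope regexes, as described above, so they are compiled once.
const COMBINING_MARKS_REGEX = /\p{M}/gu;
const NON_WORD_CHARS_REGEX = /[^\p{L}\p{N}_]/gu;

// Assumed shape of the normalization step; not copied from the source file.
function normalizeString(text) {
  return text
    .normalize('NFD')                    // split base letters from diacritics
    .replace(COMBINING_MARKS_REGEX, '')  // pass 1: drop combining marks
    .replace(NON_WORD_CHARS_REGEX, '');  // pass 2: drop punctuation/whitespace
}

console.log(normalizeString('Ένα'));      // "Ενα"
console.log(normalizeString('co-op'));    // "coop"
console.log(normalizeString('Việt Nam')); // "VietNam"
```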

Language coverage

searchNormalize: true now works correctly for a single dropdown containing options across many writing systems:

| Script | Example | Search input | Status |
| --- | --- | --- | --- |
| Latin (French/Spanish) | Crème brûlée, Niño | creme, nino | ✅ matches |
| German (ä, ö, ü) | München, Mädchen, Köln | Munchen, Madchen, Koln | ✅ matches |
| German ß | Größe | Grosse | ⚠️ does not match; ß is an atomic letter (no NFD decomposition) |
| Norwegian å | Ålesund | Alesund | ✅ matches |
| Norwegian ø, æ | Bjørn, Tromsø | Bjorn, Tromso | ⚠️ does not match; ø and æ are atomic letters |
| Swedish (å, ä, ö) | Göteborg, Malmö | Goteborg, Malmo | ✅ matches |
| Finnish (ä, ö) | Jyväskylä, Hämeenlinna | Jyvaskyla, Hameenlinna | ✅ matches |
| Greek | Ένα | Ενα | ✅ matches |
| Cyrillic | Ёжик, Йогурт | Ежик, Иогурт | ✅ matches |
| Vietnamese | Việt Nam, Hà Nội | Viet Nam, Ha Noi | ✅ matches |
| Arabic (tashkeel) | مُرَحَّباً | مرحبا | ✅ matches |
| Korean (Hangul) | 서울, 한국어 | exact text | ✅ matches (NFD-symmetric on both sides) |
| Chinese | 北京, 你好 | exact text | ✅ matches (no combining marks; previously broken) |
| Japanese kanji & katakana | 東京, カタカナ | exact text | ✅ matches (previously broken) |
| Intra-word punctuation | co-op, e-mail | coop, email | ✅ matches |
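A handful of the rows above can be checked with a small table-driven sketch. Two things here are assumptions rather than facts from the PR: that matching is substring containment on normalized values, and that case folding happens before comparison (modeled with toLowerCase). `normalizeForSearch`, `matches`, and `rows` are hypothetical names.

```javascript
const COMBINING_MARKS_REGEX = /\p{M}/gu;
const NON_WORD_CHARS_REGEX = /[^\p{L}\p{N}_]/gu;

function normalizeForSearch(text) {
  return text
    .toLowerCase() // assumption: case folding happens somewhere before comparison
    .normalize('NFD')
    .replace(COMBINING_MARKS_REGEX, '')
    .replace(NON_WORD_CHARS_REGEX, '');
}

// Assumed matching rule: substring containment, same pipeline on both sides.
function matches(label, query) {
  return normalizeForSearch(label).includes(normalizeForSearch(query));
}

// [label, query, status documented in the table]
const rows = [
  ['Crème brûlée', 'creme', true],
  ['München', 'Munchen', true],
  ['Größe', 'Grosse', false], // ß has no NFD decomposition
  ['Ålesund', 'Alesund', true],
  ['Bjørn', 'Bjorn', false],  // ø has no NFD decomposition
  ['Ёжик', 'Ежик', true],
  ['Việt Nam', 'Viet Nam', true],
  ['서울', '서울', true],
  ['北京', '北京', true],
  ['co-op', 'coop', true],
];

for (const [label, query, documented] of rows) {
  const verdict = matches(label, query) === documented ? 'as documented' : 'MISMATCH';
  console.log(`${query} vs ${label}: ${verdict}`);
}
```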

Scripts that rely on combining marks (Thai vowel signs, Devanagari matras, Japanese hiragana voicing marks such as dakuten/handakuten) are NOT fully preserved because every Unicode combining mark is stripped (e.g. สวัสดี → สวสด). This enables fuzzy matching but loses some semantic precision. Use searchNormalize: false if exact-match behavior is required for those scripts.
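The effect on mark-dependent scripts can be seen with a short sketch. `stripMarks` is a hypothetical helper that isolates the combining-mark pass; the strings are written as escapes so the code points are unambiguous.

```javascript
// Hypothetical helper: only the NFD + combining-mark pass, nothing else.
const stripMarks = (text) => text.normalize('NFD').replace(/\p{M}/gu, '');

// Thai "สวัสดี": U+0E31 and U+0E35 are vowel signs in category Mn.
const thai = '\u0E2A\u0E27\u0E31\u0E2A\u0E14\u0E35';
console.log(stripMarks(thai)); // "สวสด" (both vowel signs removed)

// Hiragana "が" (U+304C) decomposes to "か" (U+304B) + dakuten U+3099.
console.log(stripMarks('\u304C')); // "か"
```

Under this pipeline, labels that differ only by such marks normalize to the same string, which is the loss of precision the note above warns about.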

Atomic letters (ø, æ, ß, etc.) are not decomposable under NFD and are therefore preserved literally — typing the ASCII fallback (Bjorn, Grosse) will not match the original (Bjørn, Größe). This is a Unicode-level limitation, not a regex limitation.
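To see which letters NFD can and cannot split, a quick check like the following may help. `decomposesUnderNfd` is a hypothetical helper, not part of the library.

```javascript
// Returns true when NFD rewrites the character into base letter + mark(s).
function decomposesUnderNfd(ch) {
  return ch.normalize('NFD') !== ch;
}

// å (U+00E5) and ö (U+00F6) decompose; ø (U+00F8), æ (U+00E6) and ß (U+00DF) do not.
for (const ch of ['\u00E5', '\u00F6', '\u00F8', '\u00E6', '\u00DF']) {
  console.log(ch, decomposesUnderNfd(ch));
}
```

Letters that report false pass through normalization unchanged, so only a query containing the same letter can match them.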

Punctuation/whitespace are now stripped from normalized values (matching the original ASCII behavior of /[^\w]/g). Search remains symmetric — both labels and the search query go through the same pipeline — so multi-word labels like Việt Nam still match Viet Nam, and labels with intra-word punctuation like co-op now correctly match coop again.

Performance

  • Both COMBINING_MARKS_REGEX and NON_WORD_CHARS_REGEX are defined at module scope instead of being re-created inside normalizeString() on every call.
  • Build toolchains (e.g. Babel targeting ES5/ES2015) transpile /\p{M}/gu into a ~2 KB character-class regex. Re-compiling that pattern on each keystroke during search was unnecessarily expensive.
  • Hoisting to module scope means each regex is compiled once. Benchmarked against 10,000 calls with the actual transpiled pattern: ~29% faster vs. the original in-function regex.
  • The added second .replace() pass is O(n) on already-short strings and adds no measurable overhead in real workloads.
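The hoisting comparison can be approximated with a sketch like the one below. It is not the benchmark used for the figure above: it runs the native pattern rather than the transpiled one, so the timings will differ and depend on the engine. `withHoisted`, `withPerCall`, and `timeIt` are hypothetical names.

```javascript
const HOISTED_MARKS_REGEX = /\p{M}/gu; // created once at module scope

function withHoisted(text) {
  return text.normalize('NFD').replace(HOISTED_MARKS_REGEX, '');
}

function withPerCall(text) {
  // A new RegExp object is constructed on every call.
  return text.normalize('NFD').replace(new RegExp('\\p{M}', 'gu'), '');
}

// Crude timer; `performance` is a global in browsers and recent Node versions.
function timeIt(fn, iterations) {
  const start = performance.now();
  for (let i = 0; i < iterations; i += 1) fn('Crème brûlée');
  return performance.now() - start;
}

console.log('hoisted  ms:', timeIt(withHoisted, 10000).toFixed(2));
console.log('per-call ms:', timeIt(withPerCall, 10000).toFixed(2));
```

Both functions return the same strings; only the cost of constructing the regex differs.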

Documentation and examples

  • Added a unified Multi-language search normalize section that demonstrates a single dropdown spanning Latin (French/Spanish/German/Norwegian/Swedish/Finnish), Greek, Cyrillic, Vietnamese, Chinese, Japanese, Korean, Arabic, and Thai — both with searchNormalize: true and searchNormalize: false variants for direct comparison.
  • Added tags variant (showValueAsTags) and popup variant (popupDropboxBreakpoint) sub-sections under the same multi-language data set, each with both searchNormalize: true and false dropdowns.
  • Each language entry in the live demo dropdown now has 5–10 representative examples for manual testing.
  • Added intra-word punctuation entries (co-op, e-mail) so the regression behavior is visible in the live demo.
  • Updated docs/examples.md, docs/assets/script.js, and the table of contents.
  • Expanded JSDoc on Utils.normalizeString to call out the limitation that combining-mark-dependent scripts (Thai, Devanagari, hiragana voicing) are NOT fully preserved by the normalization pipeline.

Tests

Added Cypress describe blocks against the unified multi-language dropdowns:

  • Multi-language search with searchNormalize: true (~25 specs) covers all listed scripts including positive cases (e.g. Munchen → München, Goteborg → Göteborg, Jyvaskyla → Jyväskylä, Ежик → Ёжик, Viet Nam → Việt Nam, مرحبا → مُرَحَّباً) and the documented atomic-letter limitations (Grosse does NOT match Größe; Bjorn does NOT match Bjørn).
  • Multi-language search with searchNormalize: false verifies exact matches succeed (Greek/Cyrillic/Chinese/Japanese/Korean exact text) and that accent-stripped queries correctly find no options across all scripts.
  • Intra-word punctuation regression coverage: positive specs verify coop → co-op and email → e-mail under searchNormalize: true; a negative spec verifies coop finds nothing under searchNormalize: false. These guard the second .replace() pass against silent regressions.
  • Multi-language tags variant with searchNormalize: true / false — exercise multi-select with showValueAsTags, including diacritic-insensitive search, tag rendering, and tag removal.
  • Multi-language popup variant with searchNormalize: true / false — exercise popup mode (popupDropboxBreakpoint) with the same multi-language data.
  • Latin diacritics regression suite (brulee → brûlée, cafe → café, nino → niño) preserved.

Does this introduce a breaking change?

  • Yes
  • No

Behavior for ASCII inputs is preserved vs. the original /[^\w]/g implementation (punctuation- and whitespace-insensitive search). The fix expands correctness to non-Latin scripts; it does not narrow any previously working case.

Validations

Ran regression scenarios in the documentation using the branch - ✅
Ran automated tests - ✅


gnbm added 6 commits April 24, 2026 17:17
Replace the previous NON_WORD_REGEX with a COMBINING_MARKS_REGEX (\u0300-\u036f) so normalizeString only removes Unicode combining diacritical marks after NFD normalization. This preserves valid characters (letters, digits, punctuation) instead of stripping all non-word characters.
Replace the normalization regex to strip Unicode combining marks (\u0300-\u036f) so searchNormalize correctly handles Greek and Cyrillic diacritics (e.g. Ένα, ё, й). Update the minified build accordingly. Add example initializations for Greek and Cyrillic selects in docs/assets/script.js and add Cypress E2E tests (cypress/e2e/examples.cy.ts) that verify search behavior with searchNormalize true/false and a regression check for Latin diacritics. Also update docs/examples.md to reflect the new examples.
Regenerate distribution artifacts: update dist/virtual-select.js, dist/virtual-select.min.js, and dist-archive/virtual-select-1.1.5.min.js. This updates the built/minified output to include recent changes from the source (no source code logic changes in this commit).
Bump multiple dev dependencies (Babel toolchain, babel-loader, css-loader, autoprefixer, cypress, cypress-real-events, sass, sass-loader, stylelint, webpack, webpack-cli, filemanager-webpack-plugin, postcss-loader, ts-api-utils/TypeScript, etc.). package-lock.json regenerated to lock the updated versions.
@gnbm gnbm added the enhancement New feature or request label Apr 24, 2026
@gnbm gnbm requested review from Copilot and sa-si-dev April 24, 2026 23:29
Copilot AI (Contributor) left a comment

Pull request overview

This PR updates the search normalization logic to support Greek and Cyrillic text when searchNormalize: true, and adds docs/examples + Cypress coverage to validate the behavior.

Changes:

  • Update normalizeString() to strip Unicode combining marks after NFD normalization (instead of stripping non-ASCII “non-word” chars).
  • Add documentation examples for Greek/Cyrillic normalization and wire them into the docs demo script.
  • Add Cypress E2E coverage for Greek/Cyrillic normalization and Latin-diacritics regression checks.

Reviewed changes

Copilot reviewed 8 out of 13 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| src/utils/utils.js | Updates the normalization regex used by searchNormalize. |
| package.json | Updates devDependencies (build/test tooling) and adds ts-api-utils. |
| docs/examples.md | Adds a new Greek/Cyrillic “searchNormalize” example section. |
| docs/assets/virtual-select.js | Updates built docs asset to reflect new normalization logic. |
| docs/assets/script.js | Initializes new Greek/Cyrillic example selects in the docs demo page. |
| dist/virtual-select.js | Updates distributed (unminified) build with new normalization logic. |
| dist/virtual-select.min.js | Updates distributed minified build with new normalization logic. |
| cypress/e2e/examples.cy.ts | Adds E2E tests for Greek/Cyrillic searchNormalize and Latin regression. |
| .github/PULL_REQUEST_TEMPLATE.md | Adds a PR template for future contributions. |
| .claude/settings.local.json | Adds Claude tooling permissions config. |


Comment thread package.json
Comment thread package.json
Comment thread .claude/settings.local.json Outdated
Comment thread src/utils/utils.js Outdated
Comment thread docs/examples.md
@gnbm gnbm marked this pull request as draft April 25, 2026 00:29
gnbm added 3 commits April 25, 2026 16:14
Add two examples to docs/examples.md demonstrating VirtualSelect configured with searchNormalize: false for Greek and Cyrillic option sets. These examples show search enabled with option descriptions while preserving original character forms, complementing the existing normalized-search examples.
Replace the explicit range /[\u0300-\u036f]/g with the Unicode property escape /\p{M}/gu to strip combining marks. This broadens matching to all Unicode combining marks (not just U+0300–U+036F) while preserving NFD normalization. Note: requires RegExp Unicode property escape support (ES2018+).
Regenerate built/minified bundles for Virtual Select. Updated dist/virtual-select.js, dist/virtual-select.min.js, dist-archive/virtual-select-1.1.5.min.js and the corresponding docs/assets copies so the committed distribution and documentation assets are in sync with the latest build.
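The difference between the explicit range /[\u0300-\u036f]/g and /\p{M}/gu mentioned in the commits above shows up with marks outside that block, such as Arabic tashkeel (U+064B to U+0651). A minimal sketch, with hypothetical names `RANGE_ONLY_REGEX`, `ALL_MARKS_REGEX`, and `strip`:

```javascript
const RANGE_ONLY_REGEX = /[\u0300-\u036f]/g; // Combining Diacritical Marks block only
const ALL_MARKS_REGEX = /\p{M}/gu;           // every character in category M

const strip = (text, regex) => text.normalize('NFD').replace(regex, '');

// Arabic word with tashkeel, written as escapes to keep code points explicit:
// meem, damma, reh, fatha, hah, fatha, shadda, beh, alef, fathatan.
const withTashkeel = '\u0645\u064F\u0631\u064E\u062D\u064E\u0651\u0628\u0627\u064B';
const bareLetters = '\u0645\u0631\u062D\u0628\u0627';

console.log(strip(withTashkeel, RANGE_ONLY_REGEX) === bareLetters); // false
console.log(strip(withTashkeel, ALL_MARKS_REGEX) === bareLetters);  // true

// Greek tonos decomposes to U+0301, which both patterns cover.
console.log(strip('Ένα', RANGE_ONLY_REGEX) === strip('Ένα', ALL_MARKS_REGEX)); // true
```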
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 7 out of 12 changed files in this pull request and generated 4 comments.



Comment thread src/utils/utils.js Outdated
Comment thread src/utils/utils.js Outdated
Comment thread cypress/e2e/examples.cy.ts Outdated
Comment thread package.json
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 7 out of 12 changed files in this pull request and generated 2 comments.



Comment thread src/utils/utils.js Outdated
Comment thread package.json
Move COMBINING_MARKS_REGEX out of Utils.normalizeString and declare it at the top of src/utils/utils.js so the regex isn't recreated on each call. No functional change; normalizeString now uses the shared, precompiled constant for better clarity and minor performance improvement.
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 7 out of 12 changed files in this pull request and generated 5 comments.



Comment thread cypress/e2e/examples.cy.ts
Comment thread cypress/e2e/examples.cy.ts Outdated
Comment thread cypress/e2e/examples.cy.ts Outdated
Comment thread package.json
Comment thread docs/assets/virtual-select.js Outdated
gnbm added 2 commits April 25, 2026 22:59
Replace focused Greek/Cyrillic test suite with a broader multi-language search normalize suite in cypress/e2e/examples.cy.ts (renamed IDs/sections and added many language cases and negative tests). Rebuild/minify output and documentation assets were updated accordingly: dist/, dist-archive/ and docs/assets/* and docs/examples.md reflect the changes. These updates expand coverage for search normalization behavior and sync compiled artifacts and docs with the new test/content changes.
Add multi-language search demos and end-to-end tests to cover search normalization behavior across different scripts. Changes include:

- cypress/e2e/examples.cy.ts: Add Cypress tests for multi-language variants (tags and popup) with searchNormalize true/false, validating matches for diacritics, Cyrillic, and CJK inputs and tag behavior.
- docs/assets/script.js: Initialize new VirtualSelect instances for the added demo elements (#multi-language-tags-search-select, #multi-language-tags-search-no-normalize-select, #multi-language-popup-search-select, #multi-language-popup-search-no-normalize-select).
- docs/examples.md: Add documentation and example initialization snippets for the new multi-language tags and popup demos, and move/update the note about Thai and Japanese combining marks.

These additions ensure consistent behavior is demonstrated and tested for diacritic-insensitive vs exact matching across multiple scripts.
@gnbm gnbm changed the title feature(search): support searchNormalize for Greek and Cyrillic characters feat(search): support searchNormalize across non-Latin scripts and intra-word punctuation May 2, 2026
Modify src/utils/utils.js (utility functions updated) and regenerate distribution and documentation bundles. Updated files include dist/virtual-select.js, dist/virtual-select.min.js, dist-archive/virtual-select-1.1.5.min.js, docs/assets/virtual-select.js, and docs/assets/virtual-select.min.js to incorporate the utils changes and produce updated minified/non-minified artifacts.
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 8 out of 13 changed files in this pull request and generated 2 comments.



Comment thread docs/examples.md Outdated
Comment thread src/utils/utils.js Outdated
gnbm and others added 3 commits May 2, 2026 14:54
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 8 out of 13 changed files in this pull request and generated 3 comments.



Comment thread src/utils/utils.js Outdated
Comment thread .github/PULL_REQUEST_TEMPLATE.md
Comment thread src/utils/utils.js Outdated
gnbm added 2 commits May 2, 2026 15:12
- Add NON_WORD_CHARS_REGEX (/[^\p{L}\p{N}_]/gu) at module scope and chain
  it after combining-mark stripping in Utils.normalizeString. Restores the
  punctuation-/whitespace-insensitive matching the original /[^\w]/g
  provided for ASCII (e.g. "co-op" matches "coop") in a Unicode-aware way
  while keeping the non-Latin script fix.

- Expand JSDoc to document the full contract, including the limitation
  that scripts relying on combining marks (Thai vowel signs, Devanagari
  matras, hiragana/katakana voicing) are not fully preserved.

- Add Cypress regression specs covering co-op/coop and e-mail/email
  matching under searchNormalize: true, plus a negative spec under
  searchNormalize: false to lock in the symmetric behavior.

- Add co-op and e-mail entries to the multi-language demo dataset so the
  behavior is exercised in the live docs as well.
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 8 out of 13 changed files in this pull request and generated 4 comments.



Comment thread package.json
Comment on lines +27 to +52
"@babel/core": "^7.29.0",
"@babel/preset-env": "^7.29.2",
"autoprefixer": "^10.5.0",
"babel-loader": "^10.1.1",
"css-loader": "^7.1.4",
"cypress": "^15.14.1",
"cypress-real-events": "^1.15.0",
"docsify-cli": "^4.4.4",
"eslint": "^8.57.1",
"eslint-config-airbnb-base": "^15.0.0",
"eslint-import-resolver-webpack": "^0.13.10",
"eslint-plugin-import": "^2.32.0",
"eslint-plugin-sonarjs": "^3.0.4",
"filemanager-webpack-plugin": "^9.0.1",
"filemanager-webpack-plugin": "^10.0.1",
"mini-css-extract-plugin": "^2.9.4",
"popover-plugin": "^1.0.13",
"postcss-loader": "^8.1.1",
"sass": "^1.90.0",
"sass-loader": "^16.0.5",
"stylelint": "^16.23.1",
"postcss-loader": "^8.2.1",
"sass": "^1.99.0",
"sass-loader": "^16.0.7",
"stylelint": "^16.26.1",
"stylelint-config-sass-guidelines": "^12.1.0",
"typescript": "^5.9.2",
"ts-api-utils": "^2.5.0",
"typescript": "^5.9.3",
"unminified-webpack-plugin": "^3.0.0",
"webpack": "^5.101.3",
"webpack-cli": "^6.0.1"
"webpack": "^5.106.2",
"webpack-cli": "^7.0.2"
Comment thread docs/assets/virtual-select.js Outdated
Comment thread docs/assets/external/vue.css Outdated
Comment thread .github/PULL_REQUEST_TEMPLATE.md
@gnbm gnbm marked this pull request as ready for review May 2, 2026 14:38
@gnbm gnbm changed the title feat(search): support searchNormalize across non-Latin scripts and intra-word punctuation feat(search): support searchNormalize across non-Latin characters and intra-word punctuation May 8, 2026
gnbm added 6 commits May 9, 2026 09:30
Restores docs/assets/external/vue.css to the master version. The
single-line blockquote font-weight delta (600 -> 400) was unrelated to
the search-normalization work and was flagged in PR review.
The Multi-language search normalize section in docs/examples.md only
described diacritic stripping. Updated to call out that punctuation
and whitespace are also stripped under searchNormalize: true, with
explicit examples (co-op -> coop, Foo Bar -> FooBar) and a note that
users who need exact word-boundary or punctuation matching should
keep searchNormalize: false.
NON_WORD_CHARS_REGEX (/[^\p{L}\p{N}_]/gu) already removes everything
that COMBINING_MARKS_REGEX (/\p{M}/gu) matched, since combining marks
are not letters, numbers, or underscore. Collapses the two .replace()
passes into one and updates the JSDoc to describe the actual behavior.
No functional change to search results.
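The claim that the two passes collapse into one can be spot-checked with a sketch like this. Because combining marks are neither letters, numbers, nor underscore, the second regex already removes them. `twoPass`, `onePass`, and `samples` are hypothetical names for illustration.

```javascript
const COMBINING_MARKS_REGEX = /\p{M}/gu;
const NON_WORD_CHARS_REGEX = /[^\p{L}\p{N}_]/gu;

// Earlier shape: marks first, then everything else that is not a word character.
const twoPass = (text) => text
  .normalize('NFD')
  .replace(COMBINING_MARKS_REGEX, '')
  .replace(NON_WORD_CHARS_REGEX, '');

// Simplified shape: the negated class alone also matches combining marks.
const onePass = (text) => text
  .normalize('NFD')
  .replace(NON_WORD_CHARS_REGEX, '');

const samples = ['Crème brûlée', 'Ένα', 'Ёжик', 'Việt Nam', 'co-op', 'Mars-2024', '서울', '!@#'];
console.log(samples.every((sample) => twoPass(sample) === onePass(sample))); // true
```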
Without an explicit target, @babel/preset-env transpiled \p{L}, \p{N},
and \p{M} regex property escapes into multi-kilobyte expanded codepoint
ranges, bloating the production bundle. Adding a browserslist that
excludes IE11 and dead browsers tells preset-env to keep these
constructs native, since every targeted browser supports them.
The lockfile is committed (and must be, for npm ci and reproducible
installs), so keeping it listed in .gitignore was contradictory and
hid lockfile drift from git status. Removing the entry; the file
remains tracked.
Picks up the simplified normalizeString and the new browserslist
target. Net effect on the production bundle:

- dist/virtual-select.min.js: 112 KB -> 82 KB (-26%)
- dist/virtual-select.js:     193 KB -> 142 KB (-26%)

Both reductions come from preset-env keeping \p{L}/\p{N} regex
property escapes native instead of expanding them to long
codepoint ranges.
Adds Cypress coverage for behaviors that were documented but not
asserted:

- Whitespace folding: "FooBar" matches "Foo Bar"; "VietNam" matches
  "Việt Nam".
- Symmetric punctuation: search containing punctuation matches a label
  without it ("walk-through" finds "walkthrough").
- Numbers preserved (\p{N}): "Mars2024" and "Mars 2024" both find
  "Mars-2024".
- Leading/trailing whitespace in the search input still resolves to
  the right option ("  creme  " finds "Crème brûlée").
- Pure-punctuation search ("!@#") normalizes to "" and currently
  matches every label; documenting this so any future fix is
  intentional.

Adds the necessary test data entries to multiLanguageOptions in
docs/assets/script.js.
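The pure-punctuation case noted in the commit above follows from the empty string being a substring of every string. The sketch below assumes substring matching on normalized values; `search` and `guardedSearch` are hypothetical, and the guard is only one possible shape for a future fix, not something this PR implements.

```javascript
const NON_WORD_CHARS_REGEX = /[^\p{L}\p{N}_]/gu;
const normalize = (text) => text.normalize('NFD').replace(NON_WORD_CHARS_REGEX, '');

const labels = ['Crème brûlée', 'co-op', 'Mars-2024'];

// Assumed matching rule: substring containment on normalized values.
function search(rawQuery) {
  const query = normalize(rawQuery);
  return labels.filter((label) => normalize(label).includes(query));
}

console.log(normalize('!@#') === '');                // true
console.log(search('!@#').length === labels.length); // true: "" is in every label
console.log(search('Mars 2024'));                    // [ 'Mars-2024' ]

// Hypothetical guard: a non-empty query that normalizes to "" matches nothing.
function guardedSearch(rawQuery) {
  if (rawQuery !== '' && normalize(rawQuery) === '') return [];
  return search(rawQuery);
}

console.log(guardedSearch('!@#').length); // 0
```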
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 8 out of 16 changed files in this pull request and generated 1 comment.

Comment thread package.json
"not ie 11",
"not op_mini all",
"not dead"
],

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Issue with Greek characters & searchNormalize

2 participants