Next - Unicode 11 support #24

Open
JLHwung wants to merge 16 commits into orling:master from JLHwung:next

@JLHwung JLHwung commented Dec 8, 2018

Contributor

This should be treated as a breaking change because:

  • An ES module is now exported instead of a CommonJS module. Users have to change
const GraphemeSplitter = require("grapheme-splitter")

to

import GraphemeSplitter from "grapheme-splitter"

or, if they are using a legacy environment,

const GraphemeSplitter = require("grapheme-splitter").default

Other than that, the API is stable.

  • The new implementation now conforms to Unicode 11.
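
For code that has to work against both the old CommonJS export and the new default export described in the first bullet, a small interop helper is one option. This is only a sketch: `resolveDefaultExport` is a hypothetical name, not part of this PR, and it assumes the new build exposes the class on a `default` property as shown above.

```javascript
// Hypothetical helper (not part of grapheme-splitter): returns the class
// whether the module object is the class itself (old CommonJS export) or an
// object carrying it on `default` (the new export, per the PR description).
function resolveDefaultExport(mod) {
  return mod && mod.default ? mod.default : mod;
}

// Intended usage, commented out so the sketch runs without the package:
// const GraphemeSplitter = resolveDefaultExport(require("grapheme-splitter"));
```

This keeps a single `require` line valid across both versions, at the cost of hiding which module shape is actually installed.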

Dev Infrastructure Changes:

  • Added scripts to convert GraphemeBreakProperty.txt into a JavaScript snippet.
  • Added scripts to convert emoji-data.txt into a JavaScript snippet.
  • Documented the usage of these maintenance scripts.
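
As a rough illustration of what such a conversion script has to do, the sketch below parses lines in the `GraphemeBreakProperty.txt` format (`START[..END] ; Property # comment`) into sorted ranges. It is not the script from this PR; `parseBreakProperties` and the sample input are assumptions made for illustration.

```javascript
// Hypothetical parser for the Unicode data file format; not the PR's script.
function parseBreakProperties(text) {
  const ranges = [];
  for (const rawLine of text.split(/\r?\n/)) {
    // Drop trailing comments and skip blank or comment-only lines.
    const line = rawLine.split("#")[0].trim();
    if (line === "") continue;
    const [codePoints, property] = line.split(";").map((s) => s.trim());
    // A single code point such as "000D" is treated as a one-element range.
    const [first, last = first] = codePoints.split("..");
    ranges.push({
      start: parseInt(first, 16),
      end: parseInt(last, 16),
      property,
    });
  }
  return ranges.sort((a, b) => a.start - b.start);
}

// Sample lines written in the data file's format, for illustration only.
const sample = [
  "# GraphemeBreakProperty sample",
  "0600..0605    ; Prepend # Cf   [6] ARABIC NUMBER SIGN..ARABIC NUMBER MARK ABOVE",
  "000D          ; CR # Cc       <control-000D>",
].join("\n");

const parsedSample = parseBreakProperties(sample);
```

A real script would then serialize the result (for example with `JSON.stringify`) into the generated JavaScript file.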

@orling Could you enable Travis CI for this repository so that I can set up CI? It would be good to prove that the software works as expected.

@JLHwung JLHwung mentioned this pull request Dec 8, 2018
4 tasks
@JLHwung JLHwung changed the title from "WIP: Next" to "Next - Unicode 11 support" Dec 8, 2018
Comment thread package.json
}
],
"main": "index.js",
"files": [

@JLHwung JLHwung Dec 8, 2018

Contributor Author

Because the files field is constrained here, only the distributed library files and LICENSE/README are published. The src and scripts directories and dev infrastructure such as babel.config.js are not included in the npm package, which keeps the node_modules footprint minimal.

Here is the result of npm pack:

npm notice
npm notice 📦  grapheme-splitter@1.0.4
npm notice === Tarball Contents ===
npm notice 1.1kB   package.json
npm notice 839B    index.d.ts
npm notice 145.1kB index.js
npm notice 1.1kB   LICENSE
npm notice 5.2kB   README.md
npm notice === Tarball Details ===
npm notice name:          grapheme-splitter
npm notice version:       1.0.4
npm notice filename:      grapheme-splitter-1.0.4.tgz
npm notice package size:  32.7 kB
npm notice unpacked size: 153.3 kB
npm notice shasum:        c3455c5317b8c40b340d7c7035b78edf11c561e7
npm notice integrity:     sha512-p+i2AbQ0PNq/T[...]3C7vTVdWa8gLQ==
npm notice total files:   5

@orling

orling commented Dec 11, 2018 via email

Owner

@JLHwung

JLHwung commented Dec 11, 2018 via email

Contributor Author

@rebirthtobi

Hi @orling,

Hoping this can be approved soon.

@orling

orling commented Oct 24, 2019

Owner

Better not break APIs, even for minor things.

Also, this pull request contains several unrelated changes, which makes it very time-consuming to review and risky to merge.

@rebornix

rebornix commented Oct 29, 2019


@orling @JLHwung thanks for the good work done by both of you. I'm trying to improve Unicode segmentation for VS Code/Monaco Editor and found that this project already does most of the work. I made some changes in my own fork https://github.com/rebornix/grapheme-splitter/tree/perf, including

My changes are very unlikely to be merged upstream, but I'd still like to share them here in case you are interested.

@JLHwung

JLHwung commented Oct 29, 2019

Contributor Author

@orling Oh, this is code I wrote almost a year ago. I can split it into separate PRs.

@jasonsbarr

@rebornix oh nice, you refactored those incredibly long if conditions. Any issues you've found with your version, or is it as accurate as the original?

@rebornix

@jasonsbarr I didn't run into any weird issues in my use cases (and it passed the test suites).

@mattpauldavies

@orling amazing work on creating this library... it's been extremely useful for us.

I was also super impressed with @JLHwung's work (and I personally agree with the recommendation to move to TypeScript).

I urgently needed Unicode 13 support, so I forked this pull request and have created Graphemer.

It includes the following:

  • new documentation to make the library easier to maintain
  • updated to include Unicode 13
  • refactored in TypeScript

If you'd like to discuss consolidating those efforts into this library, that would work for me. Or, if life is getting in the way and grapheme-splitter is a bit much to maintain, I'd really appreciate support on the Graphemer project.

@xorgy

xorgy commented Feb 12, 2021


@mattpauldavies I stupidly went ahead and factored grapheme-splitter into a module without classes, not noticing this PR at all, nor your project. I'm going to look into factoring your Unicode 13+ work into my module.

Given that these classes have no actual state and are completely pointless, I feel like these should just be functions.

@mattpauldavies

Not stupid at all @xorgy. I agree the classes don't provide any real benefit. I kept them as I wanted a direct swap and I was already using grapheme-splitter.

How would you feel about refactoring Graphemer to use functions? We could then either release a v2.0 (with breaking changes) or we could map the functions to a sort of proxy class that would provide backwards compatibility.

I want to think about it a bit, but if the functions were split into separate files it would make maintenance easier, especially when updating to new Unicode versions.

If that doesn't vibe with you, feel free to take the Unicode 13 work and crack on!
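
One way to picture the proxy-class idea above is a thin class whose methods delegate to plain functions. This is a sketch only: the method names follow the existing `GraphemeSplitter` API (`splitGraphemes`, `iterateGraphemes`, `countGraphemes`), but the function bodies are stand-ins that split by code point, which is NOT correct grapheme segmentation.

```javascript
// Stand-in functional core. Iterating a string yields code points, not
// grapheme clusters; a real module would apply the Unicode break rules here.
function* iterateGraphemes(str) {
  for (const codePoint of str) yield codePoint;
}

function splitGraphemes(str) {
  return [...iterateGraphemes(str)];
}

function countGraphemes(str) {
  let count = 0;
  for (const _ of iterateGraphemes(str)) count++;
  return count;
}

// Stateless proxy class that keeps `new GraphemeSplitter()` callers working.
class GraphemeSplitter {
  splitGraphemes(str) {
    return splitGraphemes(str);
  }
  iterateGraphemes(str) {
    return iterateGraphemes(str);
  }
  countGraphemes(str) {
    return countGraphemes(str);
  }
}
```

Existing callers keep using the class, while new code can import the functions directly and let a bundler drop whatever is unused.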

@xorgy

xorgy commented Feb 13, 2021


@mattpauldavies For my immediate use case, I ended up writing https://github.com/xorgy/grapheme-iterator from scratch instead. I don't suggest anyone use it yet, though: I'm not 100% confident in the correctness of my state machine, and the generator is a bit of a mess since I wrote the state machine directly from reading the spec.

I think the approach is pretty good, though. My classify function is much faster than the equivalent in grapheme-splitter, so overall grapheme-iterator is about twice as fast as grapheme-splitter, even when the iterator is used only for counting (throwing away the values) and compared against the hand-written countGraphemes loop in grapheme-splitter.

classify uses a table computed directly from the Unicode 13.0 GraphemeBreakProperty.txt file, and when 13.1 comes out it should Just Work™, as far as I understand that standards process.
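
A table-driven classifier of this kind can be sketched as a binary search over sorted, non-overlapping code point ranges. This is not the grapheme-iterator implementation, whose table layout is not shown in this thread. `TABLE` below is a tiny hand-picked excerpt, and a real table would cover every range in the data file.

```javascript
// Illustrative excerpt only: [start, end, Grapheme_Cluster_Break property].
// Ranges must be sorted by start and must not overlap.
const TABLE = [
  [0x000a, 0x000a, "LF"],
  [0x000d, 0x000d, "CR"],
  [0x0300, 0x036f, "Extend"],
  [0x200d, 0x200d, "ZWJ"],
  [0x1f1e6, 0x1f1ff, "Regional_Indicator"],
];

// Binary search for the range containing codePoint. Code points that are
// not listed fall back to "Other", the data file's default value.
function classify(codePoint) {
  let lo = 0;
  let hi = TABLE.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >>> 1;
    const [start, end, property] = TABLE[mid];
    if (codePoint < start) hi = mid - 1;
    else if (codePoint > end) lo = mid + 1;
    else return property;
  }
  return "Other";
}
```

Lookup cost grows with the logarithm of the number of ranges, and updating to a new Unicode version only means regenerating `TABLE`.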

The other benefit is that none of the symbols in grapheme-iterator need to be preserved when minifying. Overall it ends up about 3500 bytes gzipped with the table, even with no name mangling.

@xorgy

xorgy commented Feb 13, 2021


@mattpauldavies I think maybe I could make a CommonJS version of it (might need this myself anyway, if I want to use it from a CommonJS-based node app), and write a GraphemeSplitter interface emulator on top of that; then new GraphemeSplitter could just depend on the cleaner module.

Or somebody else could do that, it's only a couple hundred lines of code, and you'd only really need to touch about a dozen of them.

@xorgy

xorgy commented Feb 16, 2021


Also now instead of being just 2x faster than GraphemeSplitter, grapheme-iterator is about 22x faster.

@JLHwung

JLHwung commented Feb 22, 2021

Contributor Author

Note that Intl.Segmenter is a stage 3 ES proposal and has been implemented by Chrome.

The GraphemeSplitter interface

const splitter = new GraphemeSplitter();
splitter.splitGraphemes("abcd"); // returns ["a", "b", "c", "d"]

can be replaced by

const segmenter = new Intl.Segmenter(undefined, {granularity: "grapheme"});
[...segmenter.segment("abcd")] // returns [{segment: "a", index: 0, input: "abcd"} , ... , {segment: "d", index: 3, input: "abcd"}]
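
Since `segment()` yields objects rather than strings, a caller who wants the same return shape as `splitGraphemes` has to map each entry to its `segment` field. A minimal sketch, assuming a runtime that ships `Intl.Segmenter`; the name `splitWithSegmenter` is hypothetical. Note that the constructor takes the locales argument first and the options object second.

```javascript
// Hypothetical wrapper: returns an array of strings like splitGraphemes does.
// `locale` is optional; undefined selects the runtime's default locale.
function splitWithSegmenter(str, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  // Each entry has the shape { segment, index, input }; keep only the text.
  return Array.from(segmenter.segment(str), (entry) => entry.segment);
}
```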

@orling Consider leaving a note in the README suggesting a transition to Intl.Segmenter.

@ljharb

ljharb commented Feb 22, 2021


There'd still need to be a polyfill, otherwise most websites won't be able to rely on it for about 5-10 years.

@xorgy

xorgy commented Feb 26, 2021


It also seems that Intl.Segmenter involves the rest of TR29 as well, not just grapheme breaking, and the proposed API selects which of these segmenters you use through a "granularity" property in an object, which means that it is not trivial to just polyfill the bit that you want. If you wanted to have a polyfill mechanism, you'd want it the other way around: start with a grapheme splitter and try to use Intl.Segmenter.
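
The "other way around" arrangement could look roughly like the sketch below: the library keeps its own splitting function as the public entry point and delegates to `Intl.Segmenter` only when the runtime provides it. `createSplitGraphemes` and its `fallbackSplit` parameter are hypothetical names, and the fallback is whatever pure-JavaScript splitter the library already ships.

```javascript
// Hypothetical factory: prefers the native segmenter when it is available,
// otherwise returns the library's own implementation unchanged.
function createSplitGraphemes(fallbackSplit) {
  if (typeof Intl !== "undefined" && typeof Intl.Segmenter === "function") {
    const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
    return (str) => Array.from(segmenter.segment(str), (entry) => entry.segment);
  }
  return fallbackSplit;
}

// Usage with a stand-in fallback that splits by code point. That is not
// correct grapheme segmentation; it only keeps the sketch self-contained.
const splitGraphemes = createSplitGraphemes((str) => Array.from(str));
```

One caveat with this design is that the native and fallback paths may target different Unicode versions, so results can differ between runtimes for newer characters.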

P.S. I think grapheme-iterator is working correctly now. I need to find a better test suite than the one from the Unicode Consortium, which doesn't even include examples for each GraphemeBreakProperty (!), and when I find such a test suite I'll put it up as 1.0.

@AlexRMU

AlexRMU commented Nov 12, 2024


🤔

9 participants