This repository contains a set of scripts to build a ready-to-use Juman++ model for Jumandic.

Requirements:
- Unix environment (on Windows use WSL or MSYS2/MinGW64)
- Juman++ build environment
- Python 3.6+
- Ruby
- Perl
- SSH authorization configured for GitHub (several repositories are cloned via SSH)
- 32 GB of RAM
- Original texts from Mainichi Shinbun (year 1995) for the Kyoto Corpus (see the page for more information). Without them, the Juman++ model will be trained only on the Leads corpus and will be of poor quality.
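The requirements above can be checked before starting a build. The following is a minimal sketch, not part of this repository: the function names are hypothetical, `make`, `git` and `ssh` are assumed to be needed because the build uses make targets and clones repositories over SSH, and the RAM check relies on `os.sysconf`, which is not available on every platform.

```python
import os
import shutil
import sys
from typing import List, Optional

# Tools from the requirements list. `make`, `git` and `ssh` are assumptions:
# the build is driven by make targets and clones repositories over SSH.
REQUIRED_TOOLS = ["ruby", "perl", "make", "git", "ssh"]
MIN_PYTHON = (3, 6)
# Machines sold as "32 GB" usually report slightly less, so allow some slack.
MIN_RAM_GIB = 32 * 0.9


def missing_tools(tools: List[str]) -> List[str]:
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def total_ram_gib() -> Optional[float]:
    """Return physical RAM in GiB, or None if the platform does not report it."""
    try:
        pages = os.sysconf("SC_PHYS_PAGES")
        page_size = os.sysconf("SC_PAGE_SIZE")
    except (AttributeError, ValueError, OSError):
        return None
    if pages <= 0 or page_size <= 0:
        return None
    return pages * page_size / 1024 ** 3


def preflight() -> List[str]:
    """Collect readable problems; an empty list means none were detected."""
    problems = ["missing tool: " + tool for tool in missing_tools(REQUIRED_TOOLS)]
    if sys.version_info < MIN_PYTHON:
        problems.append("Python 3.6+ is required")
    ram = total_ram_gib()
    if ram is not None and ram < MIN_RAM_GIB:
        problems.append("only %.1f GiB of RAM detected" % ram)
    return problems
```

Calling `preflight()` and printing the returned list gives a quick overview. It does not verify the Juman++ build environment or the SSH authorization itself.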
Run the configuration script: `python3 configure.py`.
It will prompt for the location of the Mainichi Shinbun texts.
After that, run `make nornn` to train a model without the RNN component.
`make rnn` produces a model with the RNN component.
The models will be placed in the `bld/model` folder.
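For unattended or repeated builds, the two steps above could be wrapped in a small driver. This is only a sketch: `build_commands` and `run_build` are hypothetical helpers, not scripts shipped with this repository, and the configuration step stays interactive because it prompts for the corpus location.

```python
import subprocess
from typing import List


def build_commands(with_rnn: bool) -> List[List[str]]:
    """Return the commands described above, in the order they should run."""
    target = "rnn" if with_rnn else "nornn"
    return [
        ["python3", "configure.py"],  # prompts for the Mainichi Shinbun texts
        ["make", target],
    ]


def run_build(with_rnn: bool = False, repo_dir: str = ".") -> None:
    """Run each command inside repo_dir and stop at the first failure."""
    for command in build_commands(with_rnn):
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(command, cwd=repo_dir, check=True)
```

For example, `run_build(with_rnn=True)` run from the repository root would configure the build and then invoke `make rnn`.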
It is possible to add your words to the model. To do it:
- Perform the configuration as described above: `python3 configure.py`
- Fetch the repositories: `make repo`.
- Go into `bld/repos/jumandic` folder, it is a local clone of JumanDIC repository.
- Create a new file with the `.dic` extension in the `userdic` folder of the `bld/repos/jumandic` folder.
- Put your words into that file, in JUMAN dictionary format (refer to other files for example).
- Execute `make clean-dic` if you have already built a Juman++ model.
- Build your model as shown above.
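The path handling and the rebuild order in the steps above can be sketched as follows. The helper names are hypothetical, and the sketch deliberately does not generate dictionary entries, since the JUMAN dictionary format is best copied from the existing files.

```python
from pathlib import Path
from typing import List

# Location of the user dictionary folder, relative to the repository root.
USERDIC_DIR = Path("bld") / "repos" / "jumandic" / "userdic"


def user_dic_path(name: str, repo_root: str = ".") -> Path:
    """Return the path of a user dictionary file, enforcing the .dic extension."""
    if not name or "/" in name or "\\" in name:
        raise ValueError("name must be a plain file name")
    if not name.endswith(".dic"):
        name += ".dic"
    return Path(repo_root) / USERDIC_DIR / name


def rebuild_commands(model_already_built: bool, target: str = "nornn") -> List[List[str]]:
    """Return the make invocations needed after editing a user dictionary."""
    commands = []
    if model_already_built:
        # Only needed when a Juman++ model has been built before.
        commands.append(["make", "clean-dic"])
    commands.append(["make", target])
    return commands
```

For example, `user_dic_path("mywords")` points at `bld/repos/jumandic/userdic/mywords.dic`, which only exists after `make repo` has fetched the repositories.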
If the built model does not contain your words, ensure that the binary dictionary was rebuilt after adding new words.
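One rough way to check this is to compare modification times. The sketch below is an assumption-laden heuristic: `is_outdated` is a hypothetical helper, and the location of the binary dictionary is not stated here, so it must be passed in by the caller.

```python
import os
from typing import Iterable


def is_outdated(built_file: str, source_files: Iterable[str]) -> bool:
    """Return True if built_file is missing or older than any source file."""
    if not os.path.exists(built_file):
        return True
    built_mtime = os.path.getmtime(built_file)
    # Any source modified after the build means the build is stale.
    return any(os.path.getmtime(src) > built_mtime for src in source_files)
```

If the function returns True for your `.dic` file, running `make clean-dic` and rebuilding the model is the likely fix.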