# int-huggingface-tagger

Use models from Hugging Face Transformers for historical Dutch PoS tagging and lemmatisation.

Caution: This is a prerelease.

### GaLAHaD-related Repositories
- [galahad](https://github.com/INL/galahad)
- [galahad-train-battery](https://github.com/INL/galahad-train-battery)
- [galahad-taggers-dockerized](https://github.com/INL/galahad-taggers-dockerized)
- [galahad-corpus-data](https://github.com/INL/galahad-corpus-data/)
- [int-pie](https://github.com/INL/int-pie)
- [int-huggingface-tagger](https://github.com/INL/huggingface-tagger) [you are here]
- [galahad-huggingface-models](https://github.com/INL/galahad-huggingface-models)

## Synopsis
This repository contains code to:
* run and train a simple Hugging Face token-classifier-based PoS tagger
* train and run a lemmatizer combining:
  * the INT historical lexicon
  * for out-of-vocabulary tokens, a [ByT5 model](https://huggingface.co/docs/transformers/model_doc/byt5)

To use:
* clone this repository
* create a virtual environment and activate it (Python 3.10 or later)
* run `bash requirements.sh`
* [install Git Large File Storage](https://docs.github.com/en/repositories/working-with-files/managing-large-files/installing-git-large-file-storage)
* clone the model repository

Assuming you have installed Git LFS and work on Linux:
```bash
git clone https://github.com/INL/int-huggingface-tagger
cd int-huggingface-tagger
python3.10 -m venv venv
source ./venv/bin/activate
bash requirements.sh
cd ..
git clone https://github.com/INL/galahad-huggingface-models
cd int-huggingface-tagger
ln -s ../galahad-huggingface-models/models/ .
python example-usage.py config/galahad/tagger-lemmatizer/ALL.tdn.config example-data/eline.txt /tmp/output.txt
```

You need to enable Git LFS for the trained models to be included in the clone.

## Tagger-lemmatizer

### Run

* On plain text: `python example-usage.py <configuration file> <input text> <output tsv>`, e.g.
  `python example-usage.py config/galahad/tagger-lemmatizer/ALL.tdn.config example-data/eline.txt /tmp/output.txt`

* On (tokenized!) TEI: `python example-usage-TEI.py config/galahad/tagger-lemmatizer/ALL.tdn.config example-data/example.tei /tmp/example.tagged.tei`

Your TEI needs to be tokenized, with word IDs in `xml:id`. For example, the following input:

```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader/>
<text>
  <body>
    <w xml:id="w1">Dit</w>
    <w xml:id="w2">is</w>
    <w xml:id="w3">een</w>
    <w xml:id="w4">corte</w>
    <w xml:id="w5">Zin</w>
    <pc>.</pc>
  </body>
</text>
</TEI>
```

produces output like:

```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader/>
<text>
  <body>
    <w xml:id="w1" pos="PD(type=d-p,position=free)" lemma="dit" resp="#lexicon.molex">Dit</w>
    <w xml:id="w2" pos="VRB(finiteness=fin,tense=pres)" lemma="zijn" resp="#lexicon.molex">is</w>
    <w xml:id="w3" pos="PD(type=indef,subtype=art,position=prenom)" lemma="een" resp="#lexicon.molex">een</w>
    <w xml:id="w4" pos="AA(degree=pos,position=prenom)" lemma="kort" resp="#lexicon.hilex">corte</w>
    <w xml:id="w5" pos="NOU-C(number=sg)" lemma="zin" resp="#lexicon.molex">Zin</w>
    <pc pos="PC">.</pc>
  </body>
</text>
</TEI>
```

The config file specifies the tagging model, the lemmatization model and tokenizer, and the (pickled) lexicon filename, e.g.:
```json
{
    "tagging_model" : "../data/tagging/tagging_models/pos_tagging_model_combined_gysbert/",
    "lem_tokenizer" : "../data/byt5-lem-hilex-19",
    "lem_model" : "../data/byt5-lem-hilex-19/checkpoint-53500/",
    "lexicon_path" : "../data/lexicon/lexicon.pickle"
}
```
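
For orientation, here is a minimal sketch of how such a config can be consumed. The function name `load_components` and the loading details are illustrative assumptions, not the repository's actual API:

```python
import json
import pickle

from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          T5ForConditionalGeneration)


def load_components(config_path):
    """Illustrative loader: read the config and instantiate each component."""
    with open(config_path) as f:
        config = json.load(f)

    # Token-classification model and tokenizer for PoS tagging.
    tagger_tokenizer = AutoTokenizer.from_pretrained(config["tagging_model"])
    tagger = AutoModelForTokenClassification.from_pretrained(config["tagging_model"])

    # ByT5 seq2seq model for out-of-vocabulary lemmatisation.
    lem_tokenizer = AutoTokenizer.from_pretrained(config["lem_tokenizer"])
    lem_model = T5ForConditionalGeneration.from_pretrained(config["lem_model"])

    # Pickled INT historical lexicon.
    with open(config["lexicon_path"], "rb") as f:
        lexicon = pickle.load(f)

    return tagger_tokenizer, tagger, lem_tokenizer, lem_model, lexicon
```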

## Training the token classifier for PoS tagging

### From configuration file
```bash
python train.py <config file>
```

A typical configuration file:
```json
{
    "epochs": 10,
    "data_format" : "tsv",
    "fields" : {"tokens" : 0, "tags" : 1, "lemmata": 2},
    "max_chunk_size" : 50,
    "base_dir" : "/vol1/data/tagger/galahad-corpus-data/training-data/couranten/",
    "training_data" : ["couranten.train.tsv.gz"],
    "test_data" : "couranten.test.tsv.gz",
    "dev_data" : "couranten.dev.tsv.gz",
    "model_output_path" : "../data/tagging/tagging_models/couranten/",
    "base_model" : "emanjavacas/GysBERT"
}
```

### From Python: see, for instance, `tagging/nineteen.py`

Finetuning a pretrained language model (adapting a general language model to a specific domain, using unlabeled text):
```python
from datasets import load_dataset

import transfer  # repository-local helper module


def finetune_dbnl_19_lm(base_bert='GroNLP/bert-base-dutch-cased',
                        output_dir='../data/tagging/language_models/dbnl_19_lm'):
    training_data = '../data/tagging/unannotated/dbnl_stukje_19.json'
    test_data = '../data/tagging/unannotated/multatuli_ideen.json'
    dataset = load_dataset('json',
                           data_files={'train': [training_data], 'test': test_data},
                           download_mode='force_redownload')
    # Language-model finetuning needs raw text only, so drop the tag column.
    dataset = dataset.remove_columns('tags').rename_column('tokens', 'text')
    transfer.finetune_language_model(base_bert, dataset, output_dir)
```

Tasktuning a model on a training dataset (adding a classification layer to a language model, using PoS-labeled data):
```python
import transfer  # repository-local helper module


def train_pos_dataset_gysbert():  # GysBERT (from the MacBERTh project) gives the best results so far
    training_data = '../data/nederval/json/19_thomas.train.json'
    test_data = '../data/nederval/json/19_thomas.test.json'
    task_output_dir = '../data/tagging/tagging_models/pos_tagging_model_19_gysbert'

    base_model = 'emanjavacas/GysBERT'
    dataset = transfer.create_pos_dataset(training_data, test_data)
    # epochs() is defined elsewhere in the source file.
    transfer.tasktune_token_classification_model(base_model, dataset, label_column_name='label',
                                                 output_dir=task_output_dir, num_train_epochs=epochs())
```

### Data format for training files
TSV (cf. [galahad-corpus-data](https://github.com/INL/galahad-corpus-data/)) or Hugging Face dataset JSON, with one object per line, each representing a tagged sentence:

```json
{"id":"168","tokens":["Ik","lees","voor","me","pleizier",",","meneer",",","als","ik","lees","."],"tags":["PD(type=pers,position=free)","VRB(finiteness=fin,tense=pres)","ADP(type=pre)","PD(type=poss,position=prenom)","NOU-C(number=sg)","LET","NOU-C(number=sg)","LET","CONJ(type=sub)","PD(type=pers,position=free)","VRB(finiteness=fin,tense=pres)","LET"]}
{"id":"102","tokens":["O","verbasterde","nazaet","!"],"tags":["INT","AA(degree=pos,position=prenom)","NOU-C(number=sg)","LET"]}
```
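
If your data is in the TSV format, a small conversion along the following lines can produce the JSON format. This sketch assumes blank lines separate sentences in the TSV (the usual CoNLL-style convention); check galahad-corpus-data for the exact layout:

```python
import gzip
import json


def tsv_to_json(tsv_path, json_path, token_col=0, tag_col=1):
    """Convert a (possibly gzipped) token/tag TSV into one JSON object
    per line, as in the example above."""
    sentences, tokens, tags = [], [], []
    opener = gzip.open if tsv_path.endswith(".gz") else open
    with opener(tsv_path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:  # blank line = sentence boundary (assumed)
                if tokens:
                    sentences.append((tokens, tags))
                    tokens, tags = [], []
                continue
            fields = line.split("\t")
            tokens.append(fields[token_col])
            tags.append(fields[tag_col])
    if tokens:  # flush a trailing sentence without a final blank line
        sentences.append((tokens, tags))
    with open(json_path, "w", encoding="utf-8") as out:
        for i, (toks, tgs) in enumerate(sentences):
            out.write(json.dumps({"id": str(i), "tokens": toks, "tags": tgs},
                                 ensure_ascii=False) + "\n")
```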

## Training the ByT5 model for unknown-word lemmatisation

Cf. `lemmatizer/train_byt5_lemmatizer.py`.

Input is a tab-separated file, exported from the historical lexicon or from the training corpus data, containing at least word, PoS, and lemma columns.
Training takes a long time (as does running): ByT5 is slow.
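
The script's exact interface may differ, but the core of such training is standard seq2seq finetuning with Hugging Face Transformers. A sketch under stated assumptions: the base model size, the column and file names, and the `word [POS]` input format are all illustrative:

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, DataCollatorForSeq2Seq,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments,
                          T5ForConditionalGeneration)

# Assumption: a small ByT5 base; the repository may use another size.
BASE_MODEL = "google/byt5-small"
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = T5ForConditionalGeneration.from_pretrained(BASE_MODEL)

# Assumed column names and file name; adjust to the actual lexicon export.
dataset = load_dataset("csv", sep="\t", column_names=["word", "pos", "lemma"],
                       data_files={"train": "lexicon_train.tsv"})


def preprocess(batch):
    # Condition the lemma prediction on the word form plus its PoS tag.
    inputs = [f"{w} [{p}]" for w, p in zip(batch["word"], batch["pos"])]
    enc = tokenizer(inputs, truncation=True, max_length=64)
    enc["labels"] = tokenizer(text_target=batch["lemma"],
                              truncation=True, max_length=64)["input_ids"]
    return enc


tokenized = dataset.map(preprocess, batched=True,
                        remove_columns=["word", "pos", "lemma"])

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="byt5-lemmatizer",
                                  per_device_train_batch_size=32,
                                  num_train_epochs=3),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```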

Besides the ByT5 model, the lemmatizer uses data from the INT historical lexicon (`data/lexicon/lexicon.pickle`).
The ByT5 model is used as a fallback for unknown words.
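
Conceptually, lemmatisation then looks like the following sketch. The lexicon's key structure and the `word [POS]` input format are assumptions, mirroring the training sketch above:

```python
import torch


def lemmatize(word, pos, lexicon, tokenizer, model):
    """Look the word up in the historical lexicon first;
    generate with ByT5 only for words the lexicon does not know."""
    lemma = lexicon.get((word.lower(), pos))  # assumed key structure
    if lemma is not None:
        return lemma
    inputs = tokenizer(f"{word} [{pos}]", return_tensors="pt")
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=32)
    return tokenizer.decode(output[0], skip_special_tokens=True)
```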