Commit 2c4426b

Clean history: keep only the latest commit

0 parents, 54 files changed: 8251 additions and 0 deletions

Some file names and content are hidden by default in this large-commit view.

README.md

Lines changed: 176 additions & 0 deletions
# int-huggingface-tagger

Use models from Hugging Face Transformers for historical Dutch PoS tagging and lemmatisation.

Caution: This is a prerelease.

### GaLAHaD-related Repositories

- [galahad](https://github.com/INL/galahad)
- [galahad-train-battery](https://github.com/INL/galahad-train-battery)
- [galahad-taggers-dockerized](https://github.com/INL/galahad-taggers-dockerized)
- [galahad-corpus-data](https://github.com/INL/galahad-corpus-data/)
- [int-pie](https://github.com/INL/int-pie)
- [int-huggingface-tagger](https://github.com/INL/huggingface-tagger) [you are here]
- [galahad-huggingface-models](https://github.com/INL/galahad-huggingface-models)
## Synopsis

This repository contains code to:

* run and train a simple Hugging Face token-classifier-based PoS tagger
* train and run a lemmatizer combining:
  * the INT historical lexicon
  * for out-of-vocabulary tokens, a ByT5 model (https://huggingface.co/docs/transformers/model_doc/byt5)
To use:

* clone this repository
* create a virtual environment and activate it (Python 3.10 or later)
* run `bash requirements.sh`
* [install Git Large File Storage](https://docs.github.com/en/repositories/working-with-files/managing-large-files/installing-git-large-file-storage)
* clone the model repository
Assuming you have installed Git LFS and you work on Linux:

```
git clone https://github.com/INL/int-huggingface-tagger
cd int-huggingface-tagger
python3.10 -m venv venv
source ./venv/bin/activate
bash requirements.sh
cd ..
git clone https://github.com/INL/galahad-huggingface-models
cd int-huggingface-tagger
ln -s ../galahad-huggingface-models/models/ .
python example-usage.py config/galahad/tagger-lemmatizer/ALL.tdn.config example-data/eline.txt /tmp/output.txt
```

You need to enable Git LFS to include the trained models in a clone.
Tagger-lemmatizer
=================

Run:
----

* On plain text: `python example-usage.py <configuration file> <input text> <output tsv>`, e.g.
  `python example-usage.py config/galahad/tagger-lemmatizer/ALL.tdn.config example-data/eline.txt /tmp/output.txt`
* On (tokenized!) TEI: `python example-usage-TEI.py config/galahad/tagger-lemmatizer/ALL.tdn.config example-data/example.tei /tmp/example.tagged.tei`

Your TEI needs to be tokenized, with word IDs in `xml:id`. For example, this input:
```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0">
    <teiHeader/>
    <text>
        <body>
            <w xml:id="w1">Dit</w>
            <w xml:id="w2">is</w>
            <w xml:id="w3">een</w>
            <w xml:id="w4">corte</w>
            <w xml:id="w5">Zin</w>
            <pc>.</pc>
        </body>
    </text>
</TEI>
```
is tagged as:

```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0">
    <teiHeader/>
    <text>
        <body>
            <w xml:id="w1" pos="PD(type=d-p,position=free)" lemma="dit" resp="#lexicon.molex">Dit</w>
            <w xml:id="w2" pos="VRB(finiteness=fin,tense=pres)" lemma="zijn" resp="#lexicon.molex">is</w>
            <w xml:id="w3" pos="PD(type=indef,subtype=art,position=prenom)" lemma="een" resp="#lexicon.molex">een</w>
            <w xml:id="w4" pos="AA(degree=pos,position=prenom)" lemma="kort" resp="#lexicon.hilex">corte</w>
            <w xml:id="w5" pos="NOU-C(number=sg)" lemma="zin" resp="#lexicon.molex">Zin</w>
            <pc pos="PC">.</pc>
        </body>
    </text>
</TEI>
```
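For illustration, tokenized TEI input of the shape shown above can be generated with `lxml`; this sketch is not part of the repository:

```python
# Illustrative only: build minimal tokenized TEI input (element names per the
# example above; everything else is an assumption, not repository code).
from lxml import etree

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"

tei = etree.Element(f"{{{TEI_NS}}}TEI", nsmap={None: TEI_NS})
etree.SubElement(tei, f"{{{TEI_NS}}}teiHeader")
text = etree.SubElement(tei, f"{{{TEI_NS}}}text")
body = etree.SubElement(text, f"{{{TEI_NS}}}body")
for i, token in enumerate(["Dit", "is", "een", "corte", "Zin"], start=1):
    w = etree.SubElement(body, f"{{{TEI_NS}}}w")
    w.set(f"{{{XML_NS}}}id", f"w{i}")  # word IDs go in xml:id
    w.text = token
etree.SubElement(body, f"{{{TEI_NS}}}pc").text = "."
print(etree.tostring(tei, pretty_print=True, encoding="unicode"))
```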
The config file specifies the tagging model, the lemmatization models, and the (pickled) lexicon filename, e.g.:

```json
{
    "tagging_model": "../data/tagging/tagging_models/pos_tagging_model_combined_gysbert/",
    "lem_tokenizer": "../data/byt5-lem-hilex-19",
    "lem_model": "../data/byt5-lem-hilex-19/checkpoint-53500/",
    "lexicon_path": "../data/lexicon/lexicon.pickle"
}
```
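The config is plain JSON, so reading it is straightforward. A minimal sketch (the filename is just the example used earlier in this README):

```python
import json

# Load a tagger-lemmatizer configuration of the shape shown above.
with open("config/galahad/tagger-lemmatizer/ALL.tdn.config") as f:
    config = json.load(f)

print(config["tagging_model"])  # directory of the token-classification model
print(config["lexicon_path"])   # pickled INT historical lexicon
```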
Training the token classifier for PoS tagging
---------------------------------------------

### From configuration file

```
python train.py <config file>
```
A typical configuration file:

```json
{
    "epochs": 10,
    "data_format": "tsv",
    "fields": {"tokens": 0, "tags": 1, "lemmata": 2},
    "max_chunk_size": 50,
    "base_dir": "/vol1/data/tagger/galahad-corpus-data/training-data/couranten/",
    "training_data": ["couranten.train.tsv.gz"],
    "test_data": "couranten.test.tsv.gz",
    "dev_data": "couranten.dev.tsv.gz",
    "model_output_path": "../data/tagging/tagging_models/couranten/",
    "base_model": "emanjavacas/GysBERT"
}
```
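To make the fields concrete, here is a hypothetical sketch of how such a config could drive model setup with `transformers`. This is not the repository's `train.py`; the config filename and the placeholder label list are assumptions:

```python
import json
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Load a training configuration of the shape shown above (filename assumed).
with open("config/couranten.config") as f:
    cfg = json.load(f)

# In practice the label set is collected from the training TSV;
# a tiny placeholder list keeps this sketch self-contained.
label_list = ["NOU-C(number=sg)", "VRB(finiteness=fin,tense=pres)", "LET"]

tokenizer = AutoTokenizer.from_pretrained(cfg["base_model"])
model = AutoModelForTokenClassification.from_pretrained(
    cfg["base_model"], num_labels=len(label_list))
```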
### From Python: see for instance `tagging/nineteen.py`

Finetuning a pretrained language model (adapting a general language model to a specific domain, using unlabeled text):

```python
# `load_dataset` comes from the Hugging Face `datasets` library;
# `transfer` is this repository's training-helper module (import path assumed).
from datasets import load_dataset

import transfer


def finetune_dbnl_19_lm(base_bert='GroNLP/bert-base-dutch-cased',
                        output_dir='../data/tagging/language_models/dbnl_19_lm'):
    training_data = '../data/tagging/unannotated/dbnl_stukje_19.json'
    test_data = '../data/tagging/unannotated/multatuli_ideen.json'
    dataset = load_dataset('json',
                           data_files={'train': [training_data], 'test': test_data},
                           download_mode='force_redownload')
    # The LM only needs raw text: drop the tag column and rename
    # 'tokens' to 'text'.
    dataset = dataset.remove_columns('tags').rename_column('tokens', 'text')
    transfer.finetune_language_model(base_bert, dataset, output_dir)
```
Tasktuning a model to a training dataset (adding a classification layer to a language model, using PoS-labeled data):

```python
def train_pos_dataset_gysbert():
    # Use the GysBERT model from MacBERTh; best results so far.
    training_data = '../data/nederval/json/19_thomas.train.json'
    test_data = '../data/nederval/json/19_thomas.test.json'
    task_output_dir = '../data/tagging/tagging_models/pos_tagging_model_19_gysbert'

    base_model = 'emanjavacas/GysBERT'
    # create_pos_dataset, tasktune_token_classification_model and epochs()
    # are defined elsewhere in the surrounding module.
    dataset = transfer.create_pos_dataset(training_data, test_data)
    transfer.tasktune_token_classification_model(base_model, dataset, label_column_name='label',
                                                 output_dir=task_output_dir, num_train_epochs=epochs())
```
### Data format for training files

TSV (cf. [galahad-corpus-data](https://github.com/INL/galahad-corpus-data/)) or Hugging Face dataset JSON, with one object per line, each representing a tagged sentence:

```json
{"id":"168","tokens":["Ik","lees","voor","me","pleizier",",","meneer",",","als","ik","lees","."],"tags":["PD(type=pers,position=free)","VRB(finiteness=fin,tense=pres)","ADP(type=pre)","PD(type=poss,position=prenom)","NOU-C(number=sg)","LET","NOU-C(number=sg)","LET","CONJ(type=sub)","PD(type=pers,position=free)","VRB(finiteness=fin,tense=pres)","LET"]}
{"id":"102","tokens":["O","verbasterde","nazaet","!"],"tags":["INT","AA(degree=pos,position=prenom)","NOU-C(number=sg)","LET"]}
```
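Such JSON-lines files load directly with the Hugging Face `datasets` library; a minimal sketch (file names taken from the training example earlier in this README):

```python
from datasets import load_dataset

# Each line of the JSON files is one tagged sentence, as shown above.
dataset = load_dataset(
    "json",
    data_files={"train": "19_thomas.train.json", "test": "19_thomas.test.json"},
)
print(dataset["train"][0]["tokens"])  # e.g. ["Ik", "lees", ...]
print(dataset["train"][0]["tags"])    # aligned PoS tags
```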
Training the byt5 model for unknown word lemmatisation
------------------------------------------------------

Cf. `lemmatizer/train_byt5_lemmatizer.py`.

Input is a tab-separated file, exported from the historical lexicon or the training corpus data, containing at least word, PoS, and lemma columns.
Training takes a long time (as does running); ByT5 is slow.

Besides the ByT5 model, the lemmatizer uses data from the INT historical lexicon (`data/lexicon/lexicon.pickle`).
The ByT5 model is used as a fallback for unknown words, as sketched below.
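A minimal sketch of this lexicon-first, ByT5-fallback lookup. The lexicon key layout, the model input format, and the use of the base `google/byt5-small` checkpoint are assumptions for illustration, not taken from the repository:

```python
import pickle

from transformers import AutoTokenizer, T5ForConditionalGeneration

# Assumed: the pickled lexicon maps (wordform, pos) pairs to lemmata.
with open("data/lexicon/lexicon.pickle", "rb") as f:
    lexicon = pickle.load(f)

# A fine-tuned lemmatization checkpoint would be used in practice; the
# base model stands in here to keep the sketch self-contained.
tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
model = T5ForConditionalGeneration.from_pretrained("google/byt5-small")

def lemmatize(word: str, pos: str) -> str:
    # Lexicon first: known (word, PoS) pairs are looked up directly.
    lemma = lexicon.get((word.lower(), pos))
    if lemma is not None:
        return lemma
    # Out of vocabulary: let ByT5 generate the lemma byte by byte.
    inputs = tokenizer(f"{word}\t{pos}", return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=32)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```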

__init__.py

Whitespace-only changes.

cleanRepo.sh

Lines changed: 21 additions & 0 deletions
```
# Clone your repo
#git clone https://github.com/username/repo.git
#cd repo

# Create a new orphan branch (no history)
git checkout --orphan latest-commit

# Add all files
git add -A

# Commit
git commit -m "Clean history: keep only the latest commit"

# Delete the old branch (e.g., 'main' or 'master')
git branch -D master  # or: git branch -D main

# Rename the new branch to 'master'
git branch -m master

# Force push to GitHub
#git push --force origin master
```
Lines changed: 7 additions & 0 deletions
```json
{
    "tagging_model": "models/galahad/tagger/pos_model_tdn_1400-1600/",
    "lem_tokenizer": "models/galahad/lemmatizer/tdn_byt5_model/",
    "lem_model": "models/galahad/lemmatizer/tdn_byt5_model/",
    "lexicon_path": "models/galahad/lexicon/lexicon.pickle",
    "chop_pos_to_main": false
}
```
Lines changed: 7 additions & 0 deletions
```json
{
    "tagging_model": "models/galahad/tagger/pos_model_tdn_1600-1900/",
    "lem_tokenizer": "models/galahad/lemmatizer/tdn_byt5_model/",
    "lem_model": "models/galahad/lemmatizer/tdn_byt5_model/",
    "lexicon_path": "models/galahad/lexicon/lexicon.pickle",
    "chop_pos_to_main": false
}
```
Lines changed: 7 additions & 0 deletions
```json
{
    "tagging_model": "models/galahad/tagger/pos_model_tdn_ALL/",
    "lem_tokenizer": "models/galahad/lemmatizer/tdn_byt5_model/",
    "lem_model": "models/galahad/lemmatizer/tdn_byt5_model/",
    "lexicon_path": "models/galahad/lexicon/lexicon.pickle",
    "chop_pos_to_main": false
}
```
Lines changed: 7 additions & 0 deletions
```json
{
    "tagging_model": "models/galahad/tagger/pos_model_tdn_all_enhanced/",
    "lem_tokenizer": "models/galahad/lemmatizer/tdn_byt5_model/",
    "lem_model": "models/galahad/lemmatizer/tdn_byt5_model/",
    "lexicon_path": "models/galahad/lexicon/lexicon.pickle",
    "chop_pos_to_main": false
}
```
Lines changed: 7 additions & 0 deletions
```json
{
    "tagging_model": "models/galahad/tagger/pos_model_tdn_BAB/",
    "lem_tokenizer": "models/galahad/lemmatizer/tdn_byt5_model/",
    "lem_model": "models/galahad/lemmatizer/tdn_byt5_model/",
    "lexicon_path": "models/galahad/lexicon/lexicon.pickle",
    "chop_pos_to_main": false
}
```
Lines changed: 7 additions & 0 deletions
```json
{
    "tagging_model": "models/galahad/tagger/pos_model_tdn_CLVN/",
    "lem_tokenizer": "models/galahad/lemmatizer/tdn_byt5_model/",
    "lem_model": "models/galahad/lemmatizer/tdn_byt5_model/",
    "lexicon_path": "models/galahad/lexicon/lexicon.pickle",
    "chop_pos_to_main": false
}
```
Lines changed: 7 additions & 0 deletions
```json
{
    "tagging_model": "models/galahad/tagger/pos_model_tdn_COUR/",
    "lem_tokenizer": "models/galahad/lemmatizer/tdn_byt5_model/",
    "lem_model": "models/galahad/lemmatizer/tdn_byt5_model/",
    "lexicon_path": "models/galahad/lexicon/lexicon.pickle",
    "chop_pos_to_main": false
}
```
