Experimental machine translation example #864

akurniawan wants to merge 35 commits into pytorch:main
Conversation
Merging from upstream
Codecov Report

```
@@           Coverage Diff           @@
##           master     #864   +/-   ##
=======================================
  Coverage   76.99%   76.99%
=======================================
  Files          44       44
  Lines        3052     3052
=======================================
  Hits         2350     2350
  Misses        702      702
```

Continue to review full report at Codecov.
```python
self.dropout = dropout

self.embedding = embedding
self.rnn = nn.GRU(emb_dim, enc_hid_dim, bidirectional=True)
```
Just for the record: I will add or switch to a transformer model in 2020 H2, as many OSS users have asked for an example showing how to use the transformer decoder.
https://discuss.pytorch.org/t/nn-transformer-explaination/53175/14
https://discuss.pytorch.org/t/how-to-use-nn-transformerdecoder-at-inference-time/49484
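For context, a minimal sketch of what such an example might look like: greedy decoding with `nn.TransformerDecoder` at inference time. The model sizes, the random `memory` standing in for encoder output, and the `bos_idx`/`eos_idx` values are assumptions for illustration, not code from this PR:

```python
import torch
import torch.nn as nn

d_model, vocab_size, bos_idx, eos_idx = 512, 10000, 1, 2  # assumed sizes/indices
embed = nn.Embedding(vocab_size, d_model)
decoder = nn.TransformerDecoder(nn.TransformerDecoderLayer(d_model, nhead=8), num_layers=6)
generator = nn.Linear(d_model, vocab_size)

def subsequent_mask(sz):
    # -inf above the diagonal blocks attention to future positions
    return torch.triu(torch.full((sz, sz), float("-inf")), diagonal=1)

memory = torch.rand(20, 1, d_model)  # stand-in for encoder output: (src_len, batch, d_model)
ys = torch.tensor([[bos_idx]])       # tokens generated so far: (tgt_len, batch)
for _ in range(50):
    out = decoder(embed(ys), memory, tgt_mask=subsequent_mask(ys.size(0)))
    next_token = generator(out[-1]).argmax(dim=-1)  # greedy pick for the last position
    ys = torch.cat([ys, next_token.unsqueeze(0)], dim=0)
    if next_token.item() == eos_idx:
        break
```

The key points at inference are the causal `tgt_mask` and feeding the growing target sequence back in at each step.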
```python
epoch_mins, epoch_secs = epoch_time(start_time, end_time)

print(f"Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s")
```
Should we use bleu_score as the metric?
text/torchtext/data/metrics.py
Line 35 in bcb9104
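For reference, a quick usage sketch of that metric; the corpora below are toy data, not output from this example:

```python
from torchtext.data.metrics import bleu_score

# candidates: tokenized hypotheses; references: one list of tokenized references per hypothesis
candidate_corpus = [['My', 'full', 'pytorch', 'test'], ['Another', 'Sentence']]
references_corpus = [[['My', 'full', 'pytorch', 'test'], ['Completely', 'Different']],
                     [['No', 'Match']]]
print(bleu_score(candidate_corpus, references_corpus))  # ~0.84 on this toy data
```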
```python
# This is an example to create a machine translation dataset and train a translation model.
```
Should we include a metric for the test/valid datasets with the trained model? See my comments about bleu_score below.
That would be better. However, I may not be able to run full-blown training, as my resources are quite limited. Do you have any suggestions?
Never mind, I will find time to work on it this half and update this then. The main thing is to make sure the model/training is set up correctly by checking the learning curve.
Got it. Sorry for the trouble 🙏
Borrowed a machine to run 10 epochs and already put the results in the README. wdyt?
It might be a little bit too long. Should we just include the final test result?
Instead of removing all the training metrics from the docs, I trimmed them so that only the first and last training outputs are included, to give users some idea of the loss values while running the example.
```python
from utils import collate_fn, count_parameters, epoch_time, seed_everything

# Ensure reproducibility
seed_everything(42)
```
We should allow users to set up the seed, like this
Merged with the other hyperparams using argparse.
```python
# enc_dropout = 0.5
# dec_dropout = 0.5

enc_emb_dim = 300
```
We use an argument parser to set up those hyperparameters, like this
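Roughly along these lines; the flag names and defaults below are illustrative, not the exact ones that ended up in this PR:

```python
import argparse

parser = argparse.ArgumentParser(description="Machine translation example")
parser.add_argument("--seed", type=int, default=42, help="random seed")
parser.add_argument("--enc-emb-dim", type=int, default=300, help="encoder embedding size")
parser.add_argument("--enc-dropout", type=float, default=0.5, help="encoder dropout")
parser.add_argument("--dec-dropout", type=float, default=0.5, help="decoder dropout")
args = parser.parse_args()

# e.g. seed_everything(args.seed), using the helper from this PR's utils.py
```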
```python
import itertools
```
Are there any generic building blocks here that we could add to the library to support other pipelines?
pad_chars and pad_words would certainly be useful for other pipelines. For pad_chars, though, I'm not sure how often people actually use character-level padding compared to subword/word level. The rest (epoch_time, count_parameters, and seed_everything) are typically available in training frameworks such as pytorch-lightning, ignite, etc., so I don't think we need to re-implement them in torchtext. What do you think?
```python
return pad_sequence(txt, True, pad_idx)


def collate_fn(batch):
```
In general, we put collate_fn together with DataLoader in the same file.
moved to train.py
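A minimal sketch of that pairing, with a toy dataset and an assumed `pad_idx` (not the exact train.py code):

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

pad_idx = 0  # assumed padding index

def collate_fn(batch):
    # batch is a list of (src, tgt) tensor pairs of varying lengths
    src, tgt = zip(*batch)
    return (pad_sequence(src, batch_first=True, padding_value=pad_idx),
            pad_sequence(tgt, batch_first=True, padding_value=pad_idx))

data = [(torch.tensor([1, 2, 3]), torch.tensor([4, 5])),
        (torch.tensor([6]), torch.tensor([7, 8, 9]))]
loader = DataLoader(data, batch_size=2, collate_fn=collate_fn)
src_batch, tgt_batch = next(iter(loader))  # each padded to the longest sequence in the batch
```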
@zhangguanheng66 just to let you know, this is ready for review. Thanks!
```
python train_char.py
```

For character-level training, and
What's the difference between "character"-level and "word"-level training? It would be better to be clearer, with more docs here.
added more explanation
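Roughly, the distinction is the unit the model predicts over: whole words versus individual characters. A toy illustration (not the tokenizer code from this PR):

```python
sentence = "I love torchtext"
word_tokens = sentence.split()                # ['I', 'love', 'torchtext']
char_tokens = [list(w) for w in word_tokens]  # [['I'], ['l', 'o', 'v', 'e'], ...]
```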
```python
return build_vocab_from_iterator(tok_list)


def char_vocab_func(vocab):
```
Is this something like a subword method? Do you think the sentencepiece method would be easier to implement?
By subword method, do you mean tokenizing a sentence into a list of subwords? Then no. This function has similar functionality to vocab_func, converting a list of strings to their indices, except that instead of converting words we convert lists of characters to their indices. To give you an example:

```
sentence: I love torchtext
character representation: [['I'], ['l', 'o', 'v', 'e'], ['t', 'o', 'r', 'c', 'h', 't', 'e', 'x', 't']]
char_vocab_func(character representation): [[1], [2, 3, 4], [5, 6, 7, 8, 9, 10, 11, 12, 13]]
```
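In other words, something like this minimal sketch, assuming a Vocab-like token-to-index mapping (not the exact implementation in this PR):

```python
def char_vocab_func(vocab):
    def func(tok_iter):
        # tok_iter: a list of words, each word a list of characters
        return [[vocab[char] for char in word] for word in tok_iter]
    return func
```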
```python
from torch.utils.data import DataLoader

from torchtext.data.utils import get_tokenizer
from torchtext.experimental.datasets.translation import (DATASETS,
```
You should be able to load the raw datasets from torchtext.experimental.datasets.raw directly.
So just to be clear, it's not recommended/intended to use the raw datasets from torchtext.experimental.datasets?
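For reference, a hedged sketch of what loading a raw dataset directly could look like; the dataset name and the exact experimental module path are assumptions and may differ across torchtext versions:

```python
from torchtext.experimental.datasets.raw import Multi30k

# each split yields raw (src, tgt) sentence pairs, before any tokenization or vocab
train_iter, valid_iter, test_iter = Multi30k()
```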
@zhangguanheng66 I got
@akurniawan Just checking in to see if you're having any problems.
@zhangguanheng66 yes, I have been trying to follow the code you gave above, the one created by @cpuhrsch. It throws an error
Could you point out where
It's not being used in the code right now, as it throws the error I mentioned earlier. But I was following this implementation, at lines 80 and 84.
Yes, correct.
The Vocab in the main folder doesn't have
Sorry for not being clear. So basically this is what I did after you made the comment. In the example code, we do it this way:

```python
train, _, _ = DATASETS[dataset_name]()
src_char_vocab = build_char_vocab(train, src_char_transform, index=0)
src_char_vocab.insert_tokens([init_word_token, eos_word_token, init_sent_token, eos_sent_token], 0)

train, _, _ = DATASETS[dataset_name]()
tgt_char_vocab = build_char_vocab(train, tgt_char_transform, index=1)
tgt_char_vocab.insert_tokens([init_word_token, eos_word_token, init_sent_token, eos_sent_token], 0)

train, _, _ = DATASETS[dataset_name]()
tgt_word_vocab = build_vocab_from_iterator(iter(map(lambda x: tgt_word_transform(x[0]), train)))
tgt_word_vocab.insert(eos_word_token, 0)
tgt_word_vocab.insert(init_word_token, 0)
```

We have:

```python
def build_word_vocab(data, transforms, index, init_token="<w>", eos_token="</w>"):
    tok_list = [[init_token], [eos_token]]
    return build_vocab_from_iterator(tok_list + list(map(lambda x: transforms(x[index]), data)))


def build_char_vocab(
    data, transforms, index, init_word_token="<w>", eos_word_token="</w>", init_sent_token="<s>", eos_sent_token="</s>",
):
    tok_list = [
        [init_word_token],
        [eos_word_token],
        [init_sent_token],
        [eos_sent_token],
    ]
    for line in data:
        tokens = list(itertools.chain.from_iterable(transforms(line[index])))
        tok_list.append(tokens)
    return build_vocab_from_iterator(tok_list)
```

Where I don't use
Feel free to use whatever works for you.
Cool, it's ready for review then.
This is a PR for the new torchtext API in a machine translation use case. This includes:

@zhangguanheng66 let me know what you think of this. Thanks!