
Experimental machine translation example #864

Open
akurniawan wants to merge 35 commits into pytorch:main from akurniawan:translation-example

Conversation

@akurniawan (Contributor)

This is a PR for the new torchtext API in a machine translation use case. It includes:

@zhangguanheng66 let me know what you think of this. Thanks!

@codecov

codecov Bot commented Jul 2, 2020

Codecov Report

Merging #864 into master will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master     #864   +/-   ##
=======================================
  Coverage   76.99%   76.99%           
=======================================
  Files          44       44           
  Lines        3052     3052           
=======================================
  Hits         2350     2350           
  Misses        702      702           

Continue to review full report at Codecov.

Powered by Codecov. Last update a93cad5...0886a68.

self.dropout = dropout

self.embedding = embedding
# Bidirectional GRU encoder over the embedded source sequence.
self.rnn = nn.GRU(emb_dim, enc_hid_dim, bidirectional=True)
Contributor

Just for the record: I will add or switch to a transformer model in 2020 H2, as many OSS users have asked for an example showing how to use the transformer decoder.
https://discuss.pytorch.org/t/nn-transformer-explaination/53175/14
https://discuss.pytorch.org/t/how-to-use-nn-transformerdecoder-at-inference-time/49484

Comment thread: examples/machine_translation/train.py (Outdated)

epoch_mins, epoch_secs = epoch_time(start_time, end_time)

print(f"Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s")
Contributor

Should we use bleu_score as the metric?

def bleu_score(candidate_corpus, references_corpus, max_n=4, weights=[0.25] * 4):
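
For reference, a minimal invocation (both corpora are pre-tokenized; each candidate is paired with one or more references, and the sentences below are just illustrative):

```python
from torchtext.data.metrics import bleu_score

# One candidate translation and its reference set, all pre-tokenized.
candidate_corpus = [['the', 'cat', 'sat', 'on', 'the', 'mat']]
references_corpus = [[['the', 'cat', 'sat', 'on', 'the', 'mat'],
                      ['a', 'cat', 'was', 'on', 'the', 'mat']]]

print(bleu_score(candidate_corpus, references_corpus))  # 1.0 for an exact match
```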

Contributor Author

done

@@ -0,0 +1,9 @@
# This is an example to create a machine translation dataset and train a translation model.

Contributor

Should we include a metric for the test/valid datasets with the trained model? See my comments about bleu_score below.

Contributor Author

That would be better. However, I may not be able to run full-blown training, as my resources are quite limited. Do you have any suggestions?

Contributor

Never mind. I will find time to work on it this half and then update this. Just make sure that you have set up the model/training correctly by checking the learning curve.

Contributor Author

Got it. Sorry for the trouble 🙏

Contributor Author

Borrowed a machine to run 10 epochs and put the results in the README. wdyt?

Contributor

It might be a little bit too long. Should we just include the final test result?

Contributor Author

Instead of removing all the training metrics from the docs, I trimmed them to include only the first and last training outputs, to give users some idea of the loss values while running the example.

Comment thread: examples/machine_translation/train.py (Outdated)
from utils import collate_fn, count_parameters, epoch_time, seed_everything

# Ensure reproducibility
seed_everything(42)
Contributor

We should allow users to set up the seed, like this

Contributor Author

Merged with the other hyperparameters via argparse.

Comment thread: examples/machine_translation/train.py (Outdated)
# enc_dropout = 0.5
# dec_dropout = 0.5

enc_emb_dim = 300
Contributor

We use an argument parser to set up those hyperparameters, like this

Contributor Author

added argparse
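
Roughly this shape, as a minimal sketch (the flag names and defaults are illustrative, not necessarily the ones this PR ends up with):

```python
import argparse

from utils import seed_everything  # helper from the example's utils

parser = argparse.ArgumentParser(description="Machine translation example")
parser.add_argument("--seed", type=int, default=42, help="random seed")
parser.add_argument("--enc-emb-dim", type=int, default=300, help="encoder embedding size")
parser.add_argument("--enc-hid-dim", type=int, default=512, help="encoder hidden size")
parser.add_argument("--enc-dropout", type=float, default=0.5, help="encoder dropout")
parser.add_argument("--dec-dropout", type=float, default=0.5, help="decoder dropout")
args = parser.parse_args()

seed_everything(args.seed)  # ensure reproducibility from a user-chosen seed
```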

@@ -0,0 +1,62 @@
import itertools
Contributor

Are there any generic building blocks that we could add to the library to support other pipelines?

Contributor Author

pad_chars and pad_words would certainly be useful for other pipelines, though for pad_chars I'm not sure how often people actually use character-level compared to subword/word-level. The rest (epoch_time, count_parameters, and seed_everything) are typically available in training frameworks such as pytorch-lightning, ignite, etc., so I don't think we need to re-implement them in torchtext. What do you think?

Comment thread: examples/machine_translation/utils.py (Outdated)
return pad_sequence(txt, True, pad_idx)  # batch_first=True, pad with pad_idx


def collate_fn(batch):
Contributor

In general, we put collate_fn together with the DataLoader in the same file.
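
For instance, a minimal sketch (the pad index and batch size are illustrative; in the example they come from the vocab and the hyperparameters):

```python
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

PAD_IDX = 1  # illustrative; the real value comes from the vocab

def collate_fn(batch):
    # batch is a list of (src_tensor, tgt_tensor) pairs; pad each side
    # to the longest sequence in the batch.
    src, tgt = zip(*batch)
    return (pad_sequence(list(src), batch_first=True, padding_value=PAD_IDX),
            pad_sequence(list(tgt), batch_first=True, padding_value=PAD_IDX))

# train_data stands in for the translation dataset built earlier:
# loader = DataLoader(train_data, batch_size=128, shuffle=True, collate_fn=collate_fn)
```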

Contributor Author

moved to train.py

@akurniawan (Contributor Author)

@zhangguanheng66 just to let you know, this is ready for review. Thanks!

Comment thread: examples/machine_translation/README.md (Outdated)
python train_char.py
```

for character-level training, and
@zhangguanheng66 (Contributor) Jul 9, 2020

What's the difference between "character"-level and "word"-level training? Better to be clearer, with more documentation here.

Contributor Author

added more explanation

return build_vocab_from_iterator(tok_list)


def char_vocab_func(vocab):
@zhangguanheng66 (Contributor) Jul 9, 2020

Is this something like the subword method? Do you think the sentencepiece method would be easier to implement?

@akurniawan (Contributor Author) Jul 10, 2020

By subword method, do you mean tokenizing a sentence into a list of subwords? Then no. This function has similar functionality to vocab_func, converting a list of strings to their indices, except that instead of converting words we convert lists of characters to their indices. To give you an example:

sentence: I love torchtext
character representation: [['I'], ['l', 'o', 'v', 'e'], ['t', 'o', 'r', 'c', 'h', 't', 'e', 'x', 't']]
char_vocab_func(character representation): [[1], [2, 3, 4], [5, 6, 7, 8, 9, 10, 11, 12, 13]]
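
As a sketch, mirroring the vocab_func pattern (the actual indices depend on the built vocab):

```python
def char_vocab_func(vocab):
    # Map a sentence, given as a list of character lists, to the
    # corresponding lists of vocab indices.
    def func(tok_iter):
        return [[vocab[char] for char in word] for word in tok_iter]
    return func

# >>> to_ids = char_vocab_func(vocab)
# >>> to_ids([['I'], ['l', 'o', 'v', 'e']])
# [[1], [2, 3, 4]]
```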

from torch.utils.data import DataLoader

from torchtext.data.utils import get_tokenizer
from torchtext.experimental.datasets.translation import (DATASETS,
Contributor

You should be able to load the raw datasets from torchtext.experimental.datasets.raw directly.
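
For example, a sketch assuming the Multi30k raw dataset (the raw variants yield untokenized (src, tgt) string pairs):

```python
from torchtext.experimental.datasets import raw

# Raw datasets skip tokenization and vocab building entirely.
train, valid, test = raw.Multi30k()
for src, tgt in train:
    print(src, tgt)  # plain source/target sentence strings
    break
```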

Contributor Author

So just to be clear, it's not recommended/intended to use the raw datasets from torchtext.experimental.datasets?

@zhangguanheng66 (Contributor) left a comment

@cpuhrsch created an example to show how to use map and partial to consume vocab object. See here or a diff file here
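
The general shape of that pattern, as a sketch (not the actual example code; the vocab and tokenized lines below are tiny stand-ins for the objects built elsewhere):

```python
from functools import partial

def vocab_func(vocab, tok_iter):
    # Look up each token's index in the given vocab.
    return [vocab[tok] for tok in tok_iter]

# Stand-ins; in the example these come from build_vocab_from_iterator
# and the tokenizer.
vocab = {"i": 0, "love": 1, "torchtext": 2}
tokenized_lines = [["i", "love", "torchtext"]]

# Bind the vocab once, then map the resulting transform over the corpus.
to_ids = partial(vocab_func, vocab)
print(list(map(to_ids, tokenized_lines)))  # [[0, 1, 2]]
```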

@akurniawan (Contributor Author)

@zhangguanheng66 I got AttributeError: 'Vocab' object has no attribute 'insert_tokens' when following the script. I have tried to pull the latest changes, but there still seems to be no insert_tokens on the vocab object.

@zhangguanheng66 (Contributor)

@akurniawan Just checking in to see if you are having any problems.

@akurniawan (Contributor Author)

@zhangguanheng66 yes, I have been trying to follow the code you gave above, the one created by @cpuhrsch. It throws AttributeError: 'Vocab' object has no attribute 'insert_tokens'. I took a look at both vocab objects (one in legacy and one in experimental) and neither has an insert_tokens or insert method. I'm not sure whether that was only pseudocode for inserting tokens after building the vocabulary, or real methods that haven't been implemented yet. So for now I'm keeping the old way of building the vocab.

@zhangguanheng66 (Contributor)

> yes, I have been trying to follow the code you gave above, the one created by @cpuhrsch. It throws AttributeError: 'Vocab' object has no attribute 'insert_tokens'. […]

Could you point out where insert_tokens is used in your code? And you are using the vocab class in the main folder, not the experimental folder, right?

@akurniawan (Contributor Author)

> Could you point out where insert_tokens is used in your code?

It's not being used in the code right now, as it throws the error I mentioned earlier. But I was following this implementation, on lines 80 and 84.

> And you are using the vocab class in the main folder, not the experimental folder, right?

yes correct

@zhangguanheng66 (Contributor) commented Jul 29, 2020

> > Could you point out where insert_tokens is used in your code?
>
> It's not being used in the code right now, as it throws the error I mentioned earlier. But I was following this implementation, on lines 80 and 84.
>
> > And you are using the vocab class in the main folder, not the experimental folder, right?
>
> yes correct

The Vocab in the main folder doesn't have an insert_token method; however, the one in the experimental folder does. Overall, I'm not sure why you got that error, because you never call the insert_token method. Could you show me the full error message chain?

@akurniawan (Contributor Author)

> The Vocab in the main folder doesn't have an insert_token method; however, the one in the experimental folder does. Overall, I'm not sure why you got that error, because you never call the insert_token method. Could you show me the full error message chain?

Sorry for not being clear. Basically, this is what I did: after you made the comment about using map and partial to tokenize the character-level representation, I tried to make changes locally by following the example code exactly (the one that contains the insert_tokens and insert methods for inserting vocabulary). It throws an error that both insert_tokens and insert were not found. Because of that error, I reverted the implementation. The difference between the example you gave and the current implementation is in the way we insert the special tokens.

In the example code, we do it this way:

    train, _, _ = DATASETS[dataset_name]()
    src_char_vocab = build_char_vocab(train, src_char_transform, index=0)
    src_char_vocab.insert_tokens([init_word_token, eos_word_token, init_sent_token, eos_sent_token], 0)

    train, _, _ = DATASETS[dataset_name]()
    tgt_char_vocab = build_char_vocab(train, tgt_char_transform, index=1)
    tgt_char_vocab.insert_tokens([init_word_token, eos_word_token, init_sent_token, eos_sent_token], 0)

    train, _, _ = DATASETS[dataset_name]()
    tgt_word_vocab = build_vocab_from_iterator(iter(map(lambda x: tgt_word_transform(x[0]), train)))
    tgt_word_vocab.insert(eos_word_token, 0)
    tgt_word_vocab.insert(init_word_token, 0)

We have insert_tokens and insert to add the special tokens at both the char and word level. When I run this, it throws errors because neither method is available. Therefore, I reverted and still use the following approach:

def build_word_vocab(data, transforms, index, init_token="<w>", eos_token="</w>"):
    # Seed the iterator with the special tokens so they are guaranteed
    # a slot in the vocab, then append the tokenized text.
    tok_list = [[init_token], [eos_token]]
    return build_vocab_from_iterator(tok_list + list(map(lambda x: transforms(x[index]), data)))


def build_char_vocab(
    data, transforms, index, init_word_token="<w>", eos_word_token="</w>", init_sent_token="<s>", eos_sent_token="</s>",
):
    # Same idea at the character level: special tokens first, then the
    # flattened character tokens of every line.
    tok_list = [
        [init_word_token],
        [eos_word_token],
        [init_sent_token],
        [eos_sent_token],
    ]
    for line in data:
        tokens = list(itertools.chain.from_iterable(transforms(line[index])))
        tok_list.append(tokens)
    return build_vocab_from_iterator(tok_list)

Here I don't use the insert_tokens and insert methods to add the special tokens.

@zhangguanheng66 (Contributor)

> Sorry for not being clear. […] Here I don't use the insert_tokens and insert methods to add the special tokens.

Feel free to use whatever you find works.

@akurniawan (Contributor Author)

> Feel free to use whatever you find works.

Cool, it's ready for review then
