
Experimental machine translation example #864

Open
akurniawan wants to merge 35 commits into pytorch:main from akurniawan:translation-example

Conversation

@akurniawan (Contributor)

This is a PR for the new torchtext API in a machine translation use case. It includes:

@zhangguanheng66 let me know what you think of this. Thanks!

@codecov

codecov Bot commented Jul 2, 2020

Codecov Report

Merging #864 into master will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master     #864   +/-   ##
=======================================
  Coverage   76.99%   76.99%           
=======================================
  Files          44       44           
  Lines        3052     3052           
=======================================
  Hits         2350     2350           
  Misses        702      702           

Continue to review full report at Codecov.

Powered by Codecov. Last update a93cad5...0886a68.

self.dropout = dropout

self.embedding = embedding
# Bidirectional GRU encoder over the embedded source sequence.
self.rnn = nn.GRU(emb_dim, enc_hid_dim, bidirectional=True)
Contributor

Just for the record: I will add or switch to a transformer model in 2020 H2, as many OSS users have asked for an example showing how to use the transformer decoder.
https://discuss.pytorch.org/t/nn-transformer-explaination/53175/14
https://discuss.pytorch.org/t/how-to-use-nn-transformerdecoder-at-inference-time/49484

Comment thread: examples/machine_translation/train.py (Outdated)

epoch_mins, epoch_secs = epoch_time(start_time, end_time)

print(f"Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s")
Contributor

Should we use bleu_score as the metric?

def bleu_score(candidate_corpus, references_corpus, max_n=4, weights=[0.25] * 4):
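
For reference, a minimal invocation (both corpora are pre-tokenized; each candidate is paired with one or more references, and the sentences below are just illustrative):

```python
from torchtext.data.metrics import bleu_score

# One candidate translation and its reference set, all pre-tokenized.
candidate_corpus = [['the', 'cat', 'sat', 'on', 'the', 'mat']]
references_corpus = [[['the', 'cat', 'sat', 'on', 'the', 'mat'],
                      ['a', 'cat', 'was', 'on', 'the', 'mat']]]

print(bleu_score(candidate_corpus, references_corpus))  # 1.0 for an exact match
```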

Contributor Author

done

@@ -0,0 +1,9 @@
# This is an example to create a machine translation dataset and train a translation model.

Contributor

Should we include a metric for the test/valid datasets with the trained model? See my comments about bleu_score below.

Contributor Author

That would be better. However, I may not be able to run full-blown training, as my resources are quite limited. Do you have any suggestions?

Contributor

Never mind. I will find time to work on it this half and then update this. Just make sure that you have set up the model/training correctly by checking the learning curve.

Contributor Author

Got it. Sorry for the trouble 🙏

Contributor Author

Borrowed a machine to run 10 epochs and put the results in the README. wdyt?

Contributor

It might be a little bit too long. Should we just include the final test result?

Contributor Author

Instead of removing all the training metrics from the docs, I trimmed them to include only the first and last training outputs, to give users some idea of the loss values while running the example.

Comment thread: examples/machine_translation/train.py (Outdated)
from utils import collate_fn, count_parameters, epoch_time, seed_everything

# Ensure reproducibility
seed_everything(42)
Contributor

We should allow users to set up the seed, like this

Contributor Author

Merged with the other hyperparameters via argparse.

Comment thread: examples/machine_translation/train.py (Outdated)
# enc_dropout = 0.5
# dec_dropout = 0.5

enc_emb_dim = 300
Contributor

We use an argument parser to set up those hyperparameters, like this

Contributor Author

added argparse
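
Roughly this shape, as a minimal sketch (the flag names and defaults are illustrative, not necessarily the ones this PR ends up with):

```python
import argparse

from utils import seed_everything  # helper from the example's utils

parser = argparse.ArgumentParser(description="Machine translation example")
parser.add_argument("--seed", type=int, default=42, help="random seed")
parser.add_argument("--enc-emb-dim", type=int, default=300, help="encoder embedding size")
parser.add_argument("--enc-hid-dim", type=int, default=512, help="encoder hidden size")
parser.add_argument("--enc-dropout", type=float, default=0.5, help="encoder dropout")
parser.add_argument("--dec-dropout", type=float, default=0.5, help="decoder dropout")
args = parser.parse_args()

seed_everything(args.seed)  # ensure reproducibility from a user-chosen seed
```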

@@ -0,0 +1,62 @@
import itertools
Contributor

Are there any generic building blocks that we could add to the library to support other pipelines?

Contributor Author

pad_chars and pad_words would certainly be useful for other pipelines, though for pad_chars I'm not sure how often people actually use character-level compared to subword/word-level. The rest (epoch_time, count_parameters, and seed_everything) are typically available in training frameworks such as pytorch-lightning, ignite, etc., so I don't think we need to re-implement them in torchtext. What do you think?

Comment thread: examples/machine_translation/utils.py (Outdated)
return pad_sequence(txt, True, pad_idx)  # batch_first=True, pad with pad_idx


def collate_fn(batch):
Contributor

In general, we put collate_fn together with the DataLoader in the same file.
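
For instance, a minimal sketch (the pad index and batch size are illustrative; in the example they come from the vocab and the hyperparameters):

```python
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

PAD_IDX = 1  # illustrative; the real value comes from the vocab

def collate_fn(batch):
    # batch is a list of (src_tensor, tgt_tensor) pairs; pad each side
    # to the longest sequence in the batch.
    src, tgt = zip(*batch)
    return (pad_sequence(list(src), batch_first=True, padding_value=PAD_IDX),
            pad_sequence(list(tgt), batch_first=True, padding_value=PAD_IDX))

# train_data stands in for the translation dataset built earlier:
# loader = DataLoader(train_data, batch_size=128, shuffle=True, collate_fn=collate_fn)
```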

Contributor Author

moved to train.py

@akurniawan (Contributor Author)

@zhangguanheng66 just to let you know, this is ready for review. Thanks!

Comment thread: examples/machine_translation/README.md (Outdated)
python train_char.py
```

for character-level training, and
@zhangguanheng66 (Contributor) Jul 9, 2020

What's the difference between "character"-level and "word"-level training? Better to be clearer, with more documentation here.

Contributor Author

added more explanation

return build_vocab_from_iterator(tok_list)


def char_vocab_func(vocab):
@zhangguanheng66 (Contributor) Jul 9, 2020

Is this something like the subword method? Do you think the sentencepiece method would be easier to implement?

@akurniawan (Contributor Author) Jul 10, 2020

By subword method, do you mean tokenizing a sentence into a list of subwords? Then no. This function has similar functionality to vocab_func, converting a list of strings to their indices, except that instead of converting words we convert lists of characters to their indices. To give you an example:

sentence: I love torchtext
character representation: [['I'], ['l', 'o', 'v', 'e'], ['t', 'o', 'r', 'c', 'h', 't', 'e', 'x', 't']]
char_vocab_func(character representation): [[1], [2, 3, 4], [5, 6, 7, 8, 9, 10, 11, 12, 13]]
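
As a sketch, mirroring the vocab_func pattern (the actual indices depend on the built vocab):

```python
def char_vocab_func(vocab):
    # Map a sentence, given as a list of character lists, to the
    # corresponding lists of vocab indices.
    def func(tok_iter):
        return [[vocab[char] for char in word] for word in tok_iter]
    return func

# >>> to_ids = char_vocab_func(vocab)
# >>> to_ids([['I'], ['l', 'o', 'v', 'e']])
# [[1], [2, 3, 4]]
```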

from torch.utils.data import DataLoader

from torchtext.data.utils import get_tokenizer
from torchtext.experimental.datasets.translation import (DATASETS,
Contributor

You should be able to load the raw datasets from torchtext.experimental.datasets.raw directly.
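
For example, a sketch assuming the Multi30k raw dataset (the raw variants yield untokenized (src, tgt) string pairs):

```python
from torchtext.experimental.datasets import raw

# Raw datasets skip tokenization and vocab building entirely.
train, valid, test = raw.Multi30k()
for src, tgt in train:
    print(src, tgt)  # plain source/target sentence strings
    break
```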

Contributor Author

So just to be clear, it's not recommended/intended to use the raw datasets from torchtext.experimental.datasets?

@zhangguanheng66 (Contributor) left a comment

@cpuhrsch created an example to show how to use map and partial to consume vocab object. See here or a diff file here
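
The general shape of that pattern, as a sketch (not the actual example code; the vocab and tokenized lines below are tiny stand-ins for the objects built elsewhere):

```python
from functools import partial

def vocab_func(vocab, tok_iter):
    # Look up each token's index in the given vocab.
    return [vocab[tok] for tok in tok_iter]

# Stand-ins; in the example these come from build_vocab_from_iterator
# and the tokenizer.
vocab = {"i": 0, "love": 1, "torchtext": 2}
tokenized_lines = [["i", "love", "torchtext"]]

# Bind the vocab once, then map the resulting transform over the corpus.
to_ids = partial(vocab_func, vocab)
print(list(map(to_ids, tokenized_lines)))  # [[0, 1, 2]]
```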

@akurniawan (Contributor Author)

@zhangguanheng66 I got AttributeError: 'Vocab' object has no attribute 'insert_tokens' when following the script. I have tried to pull the latest changes, but there still seems to be no insert_tokens on the vocab object.

@zhangguanheng66 (Contributor)

@akurniawan Just checking in to see if you are having any problems.

@akurniawan (Contributor Author)

@zhangguanheng66 yes, I have been trying to follow the code you gave above, the one created by @cpuhrsch. It throws AttributeError: 'Vocab' object has no attribute 'insert_tokens'. I took a look at both vocab objects (one in legacy and one in experimental) and neither has an insert_tokens or insert method. I'm not sure whether that was only pseudocode for inserting tokens after building the vocabulary, or real methods that haven't been implemented yet. So for now I'm keeping the old way of building the vocab.

@zhangguanheng66 (Contributor)

> yes, I have been trying to follow the code you gave above, the one created by @cpuhrsch. It throws AttributeError: 'Vocab' object has no attribute 'insert_tokens'. […]

Could you point out where insert_tokens is used in your code? And you are using the vocab class in the main folder, not the experimental folder, right?

@akurniawan (Contributor Author)

> Could you point out where insert_tokens is used in your code?

It's not being used in the code right now, as it throws the error I mentioned earlier. But I was following this implementation, on lines 80 and 84.

> And you are using the vocab class in the main folder, not the experimental folder, right?

yes correct

@zhangguanheng66 (Contributor) commented Jul 29, 2020

> > Could you point out where insert_tokens is used in your code?
>
> It's not being used in the code right now, as it throws the error I mentioned earlier. But I was following this implementation, on lines 80 and 84.
>
> > And you are using the vocab class in the main folder, not the experimental folder, right?
>
> yes correct

The Vocab in the main folder doesn't have an insert_token method; however, the one in the experimental folder does. Overall, I'm not sure why you got that error, because you never call the insert_token method. Could you show me the full error message chain?

@akurniawan (Contributor Author)

> The Vocab in the main folder doesn't have an insert_token method; however, the one in the experimental folder does. Overall, I'm not sure why you got that error, because you never call the insert_token method. Could you show me the full error message chain?

Sorry for not being clear. Basically, this is what I did: after you made the comment about using map and partial to tokenize the character-level representation, I tried to make changes locally by following the example code exactly (the one that contains the insert_tokens and insert methods for inserting vocabulary). It throws an error that both insert_tokens and insert were not found. Because of that error, I reverted the implementation. The difference between the example you gave and the current implementation is in the way we insert the special tokens.

In the example code, we do it this way:

    train, _, _ = DATASETS[dataset_name]()
    src_char_vocab = build_char_vocab(train, src_char_transform, index=0)
    src_char_vocab.insert_tokens([init_word_token, eos_word_token, init_sent_token, eos_sent_token], 0)

    train, _, _ = DATASETS[dataset_name]()
    tgt_char_vocab = build_char_vocab(train, tgt_char_transform, index=1)
    tgt_char_vocab.insert_tokens([init_word_token, eos_word_token, init_sent_token, eos_sent_token], 0)

    train, _, _ = DATASETS[dataset_name]()
    tgt_word_vocab = build_vocab_from_iterator(iter(map(lambda x: tgt_word_transform(x[0]), train)))
    tgt_word_vocab.insert(eos_word_token, 0)
    tgt_word_vocab.insert(init_word_token, 0)

We have insert_tokens and insert to add the special tokens at both the char and word level. When I run this, it throws errors because neither method is available. Therefore, I reverted and still use the following approach:

def build_word_vocab(data, transforms, index, init_token="<w>", eos_token="</w>"):
    # Seed the iterator with the special tokens so they are guaranteed
    # a slot in the vocab, then append the tokenized text.
    tok_list = [[init_token], [eos_token]]
    return build_vocab_from_iterator(tok_list + list(map(lambda x: transforms(x[index]), data)))


def build_char_vocab(
    data, transforms, index, init_word_token="<w>", eos_word_token="</w>", init_sent_token="<s>", eos_sent_token="</s>",
):
    # Same idea at the character level: special tokens first, then the
    # flattened character tokens of every line.
    tok_list = [
        [init_word_token],
        [eos_word_token],
        [init_sent_token],
        [eos_sent_token],
    ]
    for line in data:
        tokens = list(itertools.chain.from_iterable(transforms(line[index])))
        tok_list.append(tokens)
    return build_vocab_from_iterator(tok_list)

Here I don't use the insert_tokens and insert methods to add the special tokens.

@zhangguanheng66 (Contributor)

> Sorry for not being clear. […] Here I don't use the insert_tokens and insert methods to add the special tokens.

Feel free to use whatever you find works.

@akurniawan (Contributor Author)

> Feel free to use whatever you find works.

Cool, it's ready for review then
