Experimental machine translation example #864
Open: akurniawan wants to merge 35 commits into `pytorch:main` (base branch) from `akurniawan:translation-example`.
The changes shown below are from 19 of the 35 commits.

Commits (all by akurniawan):
```
3ab5c1e Merge pull request #1 from pytorch/master
db1557f Merge branch 'master' of https://github.com/pytorch/text
d2bac2b Merge branch 'master' of https://github.com/pytorch/text
0a39944 Merge branch 'master' of https://github.com/pytorch/text
9259228 Merge branch 'master' of https://github.com/pytorch/text
4791ccf first commit for machine translation example
9dbb558 adding word version for target
8ba7975 add word vocab to the training dataset
780d837 wrapping up training and evaluation code
6af29b1 add README
55523e2 add bleu score
24b4534 add device to inputs
d35b12f run full training
5fa01e5 add tqdm for training and evaluation bar visualization
90e3200 add seed to ensure reproducibility
1d5a0ce add remove extra whitespace preprocessing
1e6cb1f add param on testing data
ea94065 fix printing format in test
e2a2386 add result for the machine translation example
7e89051 add argparse and move collate fn
17da84b rename train.py to train_char to differentiate between character leve…
fd81ecb add train_word for word level training in machine translation
78ef553 add more complete todo message
66428da add case to handle whitespaces
88d6332 fix wrong calculation by removing first index
1e14007 fix wrong learning rate
b72f530 add saving functionality
b4f9851 Merge branch 'master' of https://github.com/pytorch/text into transla…
55eec60 change wrong index in testing data
27dbf43 add complete experiments output for both char and word version
2724231 Merge branch 'master' of https://github.com/pytorch/text into transla…
0aa87f4 add more explanations on char vs word example
ae52c0d Merge branch 'master' of https://github.com/pytorch/text into transla…
5eba34e char_transform with partial and map
0886a68 remove unused imports
```
README (new file, +47 lines):

# Machine Translation Example

This example builds a machine translation dataset from the raw Multi30k files and trains a translation model with the character composition method, where word representations are composed from character embeddings.

To try the example, simply run the following command:

```bash
python train.py
```

## Experiment Result

The following is example output from running `train.py`:

```
Epoch: 01 | Time: 2m 10s
Train Loss: 5.277 | Train PPL: 195.798 | Train BLEU: 0.001
Val. Loss: 4.088 | Val. PPL: 59.598 | Val. BLEU: 0.006
Epoch: 02 | Time: 2m 29s
Train Loss: 3.711 | Train PPL: 40.877 | Train BLEU: 0.022
Val. Loss: 2.964 | Val. PPL: 19.369 | Val. BLEU: 0.048
Epoch: 03 | Time: 2m 32s
Train Loss: 2.901 | Train PPL: 18.189 | Train BLEU: 0.055
Val. Loss: 2.172 | Val. PPL: 8.774 | Val. BLEU: 0.111
Epoch: 04 | Time: 2m 46s
Train Loss: 2.391 | Train PPL: 10.927 | Train BLEU: 0.092
Val. Loss: 1.766 | Val. PPL: 5.849 | Val. BLEU: 0.164
Epoch: 05 | Time: 2m 40s
Train Loss: 2.085 | Train PPL: 8.042 | Train BLEU: 0.118
Val. Loss: 1.503 | Val. PPL: 4.494 | Val. BLEU: 0.196
Epoch: 06 | Time: 2m 39s
Train Loss: 1.856 | Train PPL: 6.398 | Train BLEU: 0.140
Val. Loss: 1.302 | Val. PPL: 3.678 | Val. BLEU: 0.229
Epoch: 07 | Time: 2m 40s
Train Loss: 1.683 | Train PPL: 5.383 | Train BLEU: 0.157
Val. Loss: 1.164 | Val. PPL: 3.202 | Val. BLEU: 0.250
Epoch: 08 | Time: 2m 44s
Train Loss: 1.554 | Train PPL: 4.730 | Train BLEU: 0.168
Val. Loss: 1.075 | Val. PPL: 2.930 | Val. BLEU: 0.263
Epoch: 09 | Time: 2m 38s
Train Loss: 1.455 | Train PPL: 4.283 | Train BLEU: 0.178
Val. Loss: 1.016 | Val. PPL: 2.763 | Val. BLEU: 0.271
Epoch: 10 | Time: 2m 46s
Train Loss: 1.373 | Train PPL: 3.948 | Train BLEU: 0.187
Val. Loss: 0.972 | Val. PPL: 2.644 | Val. BLEU: 0.280
| Test Loss: 1.011 | Test PPL: 2.748 | Test BLEU: 0.273
```
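To make the character composition scheme concrete, here is a small self-contained sketch (not part of the PR) of the tokenization that the dataset code below produces; the boundary tokens `<w>`, `</w>`, `<s>`, `</s>` match the defaults in the PR's transforms.

```python
# Sketch of the character-composition tokenization used in this example.
words = "a man rides a bike".split()

# Each word becomes a character sequence wrapped in word-boundary tokens,
# and the sentence is wrapped in sentence-boundary pseudo-words.
char_tokens = [["<w>", "<s>", "</w>"]]
char_tokens += [["<w>"] + list(word) + ["</w>"] for word in words]
char_tokens += [["<w>", "</s>", "</w>"]]

print(char_tokens[1])  # ['<w>', 'a', '</w>']
print(char_tokens[2])  # ['<w>', 'm', 'a', 'n', '</w>']
```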
Dataset construction (new file, +120 lines):

```python
import itertools
import re

from torchtext.data.utils import get_tokenizer
from torchtext.experimental.datasets.translation import DATASETS, TranslationDataset
from torchtext.experimental.functional import sequential_transforms, vocab_func
from torchtext.vocab import build_vocab_from_iterator


def build_word_vocab(data, transforms, index, init_token="<w>", eos_token="</w>"):
    tok_list = [[init_token], [eos_token]]
    for line in data:
        tok_list.append(transforms(line[index]))
    return build_vocab_from_iterator(tok_list)


def build_char_vocab(
    data, transforms, index, init_word_token="<w>", eos_word_token="</w>", init_sent_token="<s>", eos_sent_token="</s>",
):
    tok_list = [
        [init_word_token],
        [eos_word_token],
        [init_sent_token],
        [eos_sent_token],
    ]
    for line in data:
        tokens = list(itertools.chain.from_iterable(transforms(line[index])))
        tok_list.append(tokens)
    return build_vocab_from_iterator(tok_list)


def char_vocab_func(vocab):
    def func(tok_iter):
        return [[vocab[char] for char in word] for word in tok_iter]

    return func


def special_char_tokens_func(
    init_word_token="<w>", eos_word_token="</w>", init_sent_token="<s>", eos_sent_token="</s>",
):
    def func(tok_iter):
        result = [[init_word_token, init_sent_token, eos_word_token]]
        result += [[init_word_token] + word + [eos_word_token] for word in tok_iter]
        result += [[init_word_token, eos_sent_token, eos_word_token]]
        return result

    return func


def special_word_token_func(init_word_token="<w>", eos_word_token="</w>"):
    def func(tok_iter):
        return [init_word_token] + tok_iter + [eos_word_token]

    return func


def parallel_transforms(*transforms):
    def func(txt_input):
        result = []
        for transform in transforms:
            result.append(transform(txt_input))
        return tuple(result)

    return func


def get_dataset():
    # Get the raw dataset first. This gives us the text version
    # of the dataset.
    train, test, val = DATASETS["Multi30k"]()
    # Cache the raw data for vocabulary construction
    train_data = [line for line in train]
    val_data = [line for line in val]
    test_data = [line for line in test]
    # Set up the word tokenizers
    src_tokenizer = get_tokenizer("spacy", language="de_core_news_sm")
    tgt_tokenizer = get_tokenizer("spacy", language="en_core_web_sm")

    # Set up the char tokenizer
    def char_tokenizer(words):
        return [list(word) for word in words]

    def remove_extra_whitespace(line):
        return re.sub(" {2,}", " ", line)

    src_char_transform = sequential_transforms(remove_extra_whitespace, src_tokenizer, char_tokenizer)
    tgt_char_transform = sequential_transforms(remove_extra_whitespace, tgt_tokenizer, char_tokenizer)
    tgt_word_transform = sequential_transforms(remove_extra_whitespace, tgt_tokenizer)

    # Set up the vocabularies (both words and chars)
    src_char_vocab = build_char_vocab(train_data, src_char_transform, index=0)
    tgt_char_vocab = build_char_vocab(train_data, tgt_char_transform, index=1)
    tgt_word_vocab = build_word_vocab(train_data, tgt_word_transform, index=1)

    # Build the datasets with character-level tokenization
    src_char_transform = sequential_transforms(
        src_char_transform, special_char_tokens_func(), char_vocab_func(src_char_vocab)
    )
    tgt_char_transform = sequential_transforms(
        tgt_char_transform, special_char_tokens_func(), char_vocab_func(tgt_char_vocab)
    )
    tgt_word_transform = sequential_transforms(
        tgt_word_transform, special_word_token_func(), vocab_func(tgt_word_vocab)
    )
    tgt_transform = parallel_transforms(tgt_char_transform, tgt_word_transform)
    train_dataset = TranslationDataset(
        train_data, (src_char_vocab, tgt_char_vocab, tgt_word_vocab), (src_char_transform, tgt_transform)
    )
    val_dataset = TranslationDataset(
        val_data, (src_char_vocab, tgt_char_vocab, tgt_word_vocab), (src_char_transform, tgt_transform)
    )
    test_dataset = TranslationDataset(
        test_data, (src_char_vocab, tgt_char_vocab, tgt_word_vocab), (src_char_transform, tgt_transform)
    )

    return train_dataset, val_dataset, test_dataset
```
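The PR keeps the actual batching logic in the training scripts (see the "add argparse and move collate fn" commit), so the following is only a sketch of how the returned datasets could be fed to a `DataLoader`. The module name `dataset`, the `pad_char_batch` helper, the pad index, and the batch size are illustrative assumptions, not the PR's code.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

from dataset import get_dataset  # hypothetical module name for the file above


def pad_char_batch(batch, pad_id):
    # batch: list of sentences; each sentence is a list of words;
    # each word is a list of character ids.
    max_words = max(len(sent) for sent in batch)
    max_chars = max(len(word) for sent in batch for word in sent)
    out = torch.full((len(batch), max_words, max_chars), pad_id, dtype=torch.long)
    for i, sent in enumerate(batch):
        for j, word in enumerate(sent):
            out[i, j, : len(word)] = torch.tensor(word)
    # [batch, seq_len, char_len] -> [seq_len, batch, char_len], matching
    # the layout the model's forward() documents.
    return out.permute(1, 0, 2)


def collate_fn(batch):
    pad_id = 1  # assumption: '<pad>' sits at index 1 in the built vocabularies
    src = pad_char_batch([src for src, _ in batch], pad_id)
    tgt_char = pad_char_batch([tgt[0] for _, tgt in batch], pad_id)
    tgt_word = pad_sequence(
        [torch.tensor(tgt[1]) for _, tgt in batch], padding_value=pad_id
    )  # [seq_len, batch]
    return src, tgt_char, tgt_word


train_dataset, val_dataset, test_dataset = get_dataset()
train_loader = DataLoader(train_dataset, batch_size=32, collate_fn=collate_fn)
```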
Char-CNN word embedding model (new file, +85 lines):

```python
import torch.nn as nn


class WordCharCNNEmbedding(nn.Module):
    """Character-composed word embedding built upon a CNN and a pooling
    layer, with dropout applied before the convolution and after the
    pooling.
    """

    def __init__(
        self,
        ntokens: int,
        char_embedding_dim: int = 30,
        char_padding_idx: int = 1,
        dropout: float = 0.5,
        kernel_size: int = 3,
        out_channels: int = 30,
        target_emb: int = 300,
        use_highway: bool = False,
    ):
        super(WordCharCNNEmbedding, self).__init__()
        self._use_highway = use_highway

        if self._use_highway and out_channels != target_emb:
            raise ValueError("out_channels and target_emb must be equal in highway setting")

        self.char_embedding = nn.Embedding(ntokens, char_embedding_dim, char_padding_idx)
        self.conv_embedding = nn.Sequential(
            nn.Dropout(p=dropout),
            nn.Conv1d(
                in_channels=char_embedding_dim,
                out_channels=out_channels,
                kernel_size=kernel_size,
                padding=kernel_size - 1,
            ),
            nn.AdaptiveMaxPool1d(1),
        )
        self.proj_layer = nn.Linear(out_channels, target_emb)
        self.out_dropout = nn.Dropout(p=dropout)
        self._char_padding_idx = char_padding_idx

        self.init_weights()

    def init_weights(self):
        """Initialize the character embedding weights uniformly in
        [-0.1, 0.1] and reinitialize the padding vector to zero.
        """
        self.char_embedding.weight.data.uniform_(-0.1, 0.1)
        # Reinitialize the vector at padding_idx to have 0 value
        self.char_embedding.weight.data[self._char_padding_idx].uniform_(0, 0)

    def forward(self, chars):
        """Run the forward calculation of the char-CNN embedding model.

        Args:
            chars (torch.Tensor): An integer tensor with the size of
                [seq_len x batch x char_size]

        Returns:
            proj_char_embedding_vec (torch.Tensor): An embedding tensor
                with the size of [seq_len x batch x target_emb]
        """
        char_embedding_vec = self.char_embedding(chars)
        # Reshape the character embedding to the size of
        # [batch * seq_len, char_len, char_dim]
        char_embedding_vec = char_embedding_vec.view(
            -1, char_embedding_vec.size(2), char_embedding_vec.size(3)
        ).contiguous()
        # Transpose the embedding into [batch * seq_len, char_dim, char_len]
        char_embedding_vec = char_embedding_vec.transpose(1, 2).contiguous()
        # Apply dropout, convolution, and adaptive max pooling, giving
        # [batch * seq_len, out_channels, 1]
        char_embedding_vec = self.conv_embedding(char_embedding_vec)
        char_embedding_vec = char_embedding_vec.squeeze(-1)
        # Revert the size back to [seq_len, batch, out_channels]
        char_embedding_vec = char_embedding_vec.view(chars.size(0), chars.size(1), -1).contiguous()
        char_embedding_vec = self.out_dropout(char_embedding_vec)
        proj_char_embedding_vec = self.proj_layer(char_embedding_vec)
        # Apply the highway (residual) connection between the projection
        # output and the pooled features
        if self._use_highway:
            proj_char_embedding_vec += char_embedding_vec

        return proj_char_embedding_vec
```
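A quick shape check (not part of the PR) for the module above; the module path `model` is an assumption:

```python
import torch

from model import WordCharCNNEmbedding  # hypothetical module name for the file above

emb = WordCharCNNEmbedding(ntokens=100)
# 7 words per sentence, batch of 4 sentences, 10 characters per word
chars = torch.randint(0, 100, (7, 4, 10))
out = emb(chars)
print(out.shape)  # torch.Size([7, 4, 300]) -> [seq_len, batch, target_emb]
```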
Review discussion:

**Reviewer:** Should we include a metric for the test/valid datasets with the trained model? See my comments about `bleu_score` below.
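(For reference, torchtext ships a corpus-level BLEU helper; the snippet below is not part of the PR, but shows the usage the "add bleu score" commit presumably builds on.)

```python
# Not part of the PR: torchtext's built-in corpus-level BLEU metric.
from torchtext.data.metrics import bleu_score

candidate = [["a", "man", "is", "riding", "a", "bicycle"]]
references = [[["a", "man", "is", "riding", "a", "bike"]]]
print(bleu_score(candidate, references))  # ~0.76 (4-gram BLEU by default)
```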
**akurniawan:** That would be better. However, I may not be able to run a full-blown training, as my resources are quite limited. Do you have any suggestions?
**Reviewer:** Never mind, I will find time to work on it this half and then update this. It's just to make sure that you set up the model/training correctly, by checking the learning curve.
**akurniawan:** Got it. Sorry for the trouble 🙏
**akurniawan:** I borrowed a resource to run 10 epochs and already put the results in the README. wdyt?
**Reviewer:** It might be a little bit too long. Should we just include the final test result?
**akurniawan:** Instead of removing the training metrics from the docs entirely, I trimmed them so that only the first and the last training outputs are included, to give users some idea of the loss values while running the example.