Changes from 19 commits
Commits (35)
3ab5c1e
Merge pull request #1 from pytorch/master
akurniawan Feb 13, 2019
db1557f
Merge branch 'master' of https://github.com/pytorch/text
Jun 21, 2020
d2bac2b
Merge branch 'master' of https://github.com/pytorch/text
Jun 23, 2020
0a39944
Merge branch 'master' of https://github.com/pytorch/text
Jun 25, 2020
9259228
Merge branch 'master' of https://github.com/pytorch/text
akurniawan Jun 29, 2020
4791ccf
first commit for machine translation example
akurniawan Jul 2, 2020
9dbb558
adding word version for target
akurniawan Jul 2, 2020
8ba7975
add word vocab to the training dataset
akurniawan Jul 2, 2020
780d837
wrapping up training and evaluation code
akurniawan Jul 2, 2020
6af29b1
add README
akurniawan Jul 2, 2020
55523e2
add bleu score
akurniawan Jul 6, 2020
24b4534
add device to inputs
akurniawan Jul 6, 2020
d35b12f
run full training
akurniawan Jul 6, 2020
5fa01e5
add tqdm for training and evaluation bar visualization
akurniawan Jul 6, 2020
90e3200
add seed to ensure reproducibility
akurniawan Jul 6, 2020
1d5a0ce
add remove extra whitespace preprocessing
akurniawan Jul 6, 2020
1e6cb1f
add param on testing data
akurniawan Jul 6, 2020
ea94065
fix printing format in test
akurniawan Jul 6, 2020
e2a2386
add result for the machine translation example
akurniawan Jul 6, 2020
7e89051
add argparse and move collate fn
akurniawan Jul 7, 2020
17da84b
rename train.py to train_char to differentiate between character leve…
akurniawan Jul 7, 2020
fd81ecb
add train_word for word level training in machine translation
akurniawan Jul 7, 2020
78ef553
add more complete todo message
akurniawan Jul 7, 2020
66428da
add case to handle whitespaces
akurniawan Jul 8, 2020
88d6332
fix wrong calculation by removing first index
akurniawan Jul 8, 2020
1e14007
fix wrong learning rate
akurniawan Jul 8, 2020
b72f530
add saving functionality
akurniawan Jul 8, 2020
b4f9851
Merge branch 'master' of https://github.com/pytorch/text into transla…
akurniawan Jul 8, 2020
55eec60
change wrong index in testing data
akurniawan Jul 8, 2020
27dbf43
add complete experiments output for both char and word version
akurniawan Jul 8, 2020
2724231
Merge branch 'master' of https://github.com/pytorch/text into transla…
akurniawan Jul 9, 2020
0aa87f4
add more explanations on char vs word example
akurniawan Jul 10, 2020
ae52c0d
Merge branch 'master' of https://github.com/pytorch/text into transla…
akurniawan Jul 29, 2020
5eba34e
char_transform with partial and map
akurniawan Jul 29, 2020
0886a68
remove unused imports
akurniawan Jul 31, 2020
47 changes: 47 additions & 0 deletions examples/machine_translation/README.md
@@ -0,0 +1,47 @@
# An example of creating a machine translation dataset and training a translation model

Contributor:

Should we include a metric for the test/valid datasets with the trained model? See my comments about bleu_score below.
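
For reference, torchtext exposes a corpus-level BLEU metric as `torchtext.data.metrics.bleu_score`, which takes tokenized candidates paired with lists of tokenized references. A minimal sketch (standalone, not part of this PR's code):

```python
from torchtext.data.metrics import bleu_score

# Each candidate is a token list; each entry in references is a list of
# acceptable reference token lists for the matching candidate.
candidate_corpus = [["My", "full", "pytorch", "test"], ["Another", "Sentence"]]
references_corpus = [
    [["My", "full", "pytorch", "test"], ["Completely", "Different"]],
    [["No", "Match"]],
]
print(bleu_score(candidate_corpus, references_corpus))  # ~0.8409 (4-gram BLEU by default)
```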

Contributor Author:

That would be better. However, I may not be able to run a full-blown training, as my resources are quite limited. Do you have any suggestions?

Contributor:

Never mind. I will find time to work on it this half, and then I can update this. In the meantime, just make sure you set up the model/training correctly by checking the learning curve.

Contributor Author:

Got it. Sorry for the trouble 🙏

Contributor Author:

I borrowed a resource to run 10 epochs and have already put the results in the README. wdyt?

Contributor:

It might be a little bit too long. Should we just include the final test result?

Contributor Author:

Instead of removing the training metrics from the docs entirely, I trimmed them to include only the first and last training outputs, to give users some idea of the loss values while running the example.

This example uses the raw training data from the Multi30k dataset to train a machine translation model with the character composition method.
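
For intuition, "character composition" here means each word token is further decomposed into its characters, mirroring `char_tokenizer` in `dataset.py`. A minimal sketch with a made-up sentence (the real pipeline tokenizes with spaCy first):

```python
# Hypothetical input; whitespace splitting stands in for the spaCy tokenizer.
words = "Zwei Hunde spielen im Schnee .".split()
chars = [list(word) for word in words]
print(chars[0])  # ['Z', 'w', 'e', 'i']
```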

To try the example, simply run the following command:

```bash
python train.py
```

## Experiment Result

The following is example output from running `train.py`:

```
Epoch: 01 | Time: 2m 10s
Train Loss: 5.277 | Train PPL: 195.798 | Train BLEU: 0.001
Val. Loss: 4.088 | Val. PPL: 59.598 | Val. BLEU: 0.006
Epoch: 02 | Time: 2m 29s
Train Loss: 3.711 | Train PPL: 40.877 | Train BLEU: 0.022
Val. Loss: 2.964 | Val. PPL: 19.369 | Val. BLEU: 0.048
Epoch: 03 | Time: 2m 32s
Train Loss: 2.901 | Train PPL: 18.189 | Train BLEU: 0.055
Val. Loss: 2.172 | Val. PPL: 8.774 | Val. BLEU: 0.111
Epoch: 04 | Time: 2m 46s
Train Loss: 2.391 | Train PPL: 10.927 | Train BLEU: 0.092
Val. Loss: 1.766 | Val. PPL: 5.849 | Val. BLEU: 0.164
Epoch: 05 | Time: 2m 40s
Train Loss: 2.085 | Train PPL: 8.042 | Train BLEU: 0.118
Val. Loss: 1.503 | Val. PPL: 4.494 | Val. BLEU: 0.196
Epoch: 06 | Time: 2m 39s
Train Loss: 1.856 | Train PPL: 6.398 | Train BLEU: 0.140
Val. Loss: 1.302 | Val. PPL: 3.678 | Val. BLEU: 0.229
Epoch: 07 | Time: 2m 40s
Train Loss: 1.683 | Train PPL: 5.383 | Train BLEU: 0.157
Val. Loss: 1.164 | Val. PPL: 3.202 | Val. BLEU: 0.250
Epoch: 08 | Time: 2m 44s
Train Loss: 1.554 | Train PPL: 4.730 | Train BLEU: 0.168
Val. Loss: 1.075 | Val. PPL: 2.930 | Val. BLEU: 0.263
Epoch: 09 | Time: 2m 38s
Train Loss: 1.455 | Train PPL: 4.283 | Train BLEU: 0.178
Val. Loss: 1.016 | Val. PPL: 2.763 | Val. BLEU: 0.271
Epoch: 10 | Time: 2m 46s
Train Loss: 1.373 | Train PPL: 3.948 | Train BLEU: 0.187
Val. Loss: 0.972 | Val. PPL: 2.644 | Val. BLEU: 0.280
| Test Loss: 1.011 | Test PPL: 2.748 | Test BLEU: 0.273
```
120 changes: 120 additions & 0 deletions examples/machine_translation/dataset.py
@@ -0,0 +1,120 @@
import itertools
import re

import torch
from torch.utils.data import DataLoader

from torchtext.data.utils import get_tokenizer
from torchtext.experimental.datasets.translation import DATASETS, TranslationDataset
from torchtext.experimental.functional import sequential_transforms, vocab_func
from torchtext.vocab import build_vocab_from_iterator


def build_word_vocab(data, transforms, index, init_token="<w>", eos_token="</w>"):
    tok_list = [[init_token], [eos_token]]
    for line in data:
        tok_list.append(transforms(line[index]))
    return build_vocab_from_iterator(tok_list)


def build_char_vocab(
    data, transforms, index, init_word_token="<w>", eos_word_token="</w>", init_sent_token="<s>", eos_sent_token="</s>",
):
    tok_list = [
        [init_word_token],
        [eos_word_token],
        [init_sent_token],
        [eos_sent_token],
    ]
    for line in data:
        tokens = list(itertools.chain.from_iterable(transforms(line[index])))
        tok_list.append(tokens)
    return build_vocab_from_iterator(tok_list)


def char_vocab_func(vocab):
    def func(tok_iter):
        return [[vocab[char] for char in word] for word in tok_iter]

    return func


def special_char_tokens_func(
    init_word_token="<w>", eos_word_token="</w>", init_sent_token="<s>", eos_sent_token="</s>",
):
    def func(tok_iter):
        result = [[init_word_token, init_sent_token, eos_word_token]]
        result += [[init_word_token] + word + [eos_word_token] for word in tok_iter]
        result += [[init_word_token, eos_sent_token, eos_word_token]]
        return result

    return func


def special_word_token_func(init_word_token="<w>", eos_word_token="</w>"):
    def func(tok_iter):
        return [init_word_token] + tok_iter + [eos_word_token]

    return func


def parallel_transforms(*transforms):
    def func(txt_input):
        result = []
        for transform in transforms:
            result.append(transform(txt_input))
        return tuple(result)

    return func


def get_dataset():
    # Get the raw dataset first. This will give us the text
    # version of the dataset.
    train, test, val = DATASETS["Multi30k"]()
    # Cache the data splits for vocabulary construction and dataset building
    train_data = [line for line in train]
    val_data = [line for line in val]
    test_data = [line for line in test]
    # Set up the word tokenizers
    src_tokenizer = get_tokenizer("spacy", language="de_core_news_sm")
    tgt_tokenizer = get_tokenizer("spacy", language="en_core_web_sm")

    # Set up the char tokenizer
    def char_tokenizer(words):
        return [list(word) for word in words]

    def remove_extra_whitespace(line):
        return re.sub(" {2,}", " ", line)

    src_char_transform = sequential_transforms(remove_extra_whitespace, src_tokenizer, char_tokenizer)
    tgt_char_transform = sequential_transforms(remove_extra_whitespace, tgt_tokenizer, char_tokenizer)
    tgt_word_transform = sequential_transforms(remove_extra_whitespace, tgt_tokenizer)

    # Set up the vocabularies (both words and chars)
    src_char_vocab = build_char_vocab(train_data, src_char_transform, index=0)
    tgt_char_vocab = build_char_vocab(train_data, tgt_char_transform, index=1)
    tgt_word_vocab = build_word_vocab(train_data, tgt_word_transform, index=1)

    # Build the datasets with character-level tokenization
    src_char_transform = sequential_transforms(
        src_char_transform, special_char_tokens_func(), char_vocab_func(src_char_vocab)
    )
    tgt_char_transform = sequential_transforms(
        tgt_char_transform, special_char_tokens_func(), char_vocab_func(tgt_char_vocab)
    )
    tgt_word_transform = sequential_transforms(
        tgt_word_transform, special_word_token_func(), vocab_func(tgt_word_vocab)
    )
    tgt_transform = parallel_transforms(tgt_char_transform, tgt_word_transform)
    train_dataset = TranslationDataset(
        train_data, (src_char_vocab, tgt_char_vocab, tgt_word_vocab), (src_char_transform, tgt_transform)
    )
    val_dataset = TranslationDataset(
        val_data, (src_char_vocab, tgt_char_vocab, tgt_word_vocab), (src_char_transform, tgt_transform)
    )
    test_dataset = TranslationDataset(
        test_data, (src_char_vocab, tgt_char_vocab, tgt_word_vocab), (src_char_transform, tgt_transform)
    )

    return train_dataset, val_dataset, test_dataset
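
The transforms above yield nested, variable-length char indices per sentence, so batching requires padding at both the word and the character level. A minimal sketch of such a collate helper (hypothetical, not the example's actual `collate_fn`; assumes padding index 1):

```python
import torch

def pad_char_batch(batch, pad_idx=1):
    # batch: list of sentences; each sentence is a list of words;
    # each word is a list of char indices (the output of char_vocab_func).
    max_sent = max(len(sent) for sent in batch)
    max_word = max(len(word) for sent in batch for word in sent)
    out = torch.full((max_sent, len(batch), max_word), pad_idx, dtype=torch.long)
    for b, sent in enumerate(batch):
        for s, word in enumerate(sent):
            out[s, b, : len(word)] = torch.tensor(word, dtype=torch.long)
    return out  # [seq_len, batch, char_len], the layout WordCharCNNEmbedding expects
```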
85 changes: 85 additions & 0 deletions examples/machine_translation/embedding.py
@@ -0,0 +1,85 @@
from typing import NamedTuple

import torch.nn as nn


class WordCharCNNEmbedding(nn.Module):
    """The character embedding is built upon a CNN and a pooling layer,
    with dropout applied before the convolution and after the pooling.
    """

    def __init__(
        self,
        ntokens: int,
        char_embedding_dim: int = 30,
        char_padding_idx: int = 1,
        dropout: float = 0.5,
        kernel_size: int = 3,
        out_channels: int = 30,
        target_emb: int = 300,
        use_highway: bool = False,
    ):
        super(WordCharCNNEmbedding, self).__init__()
        self._use_highway = use_highway

        if self._use_highway and out_channels != target_emb:
            raise ValueError("out_channels and target_emb must be equal in highway setting")

        self.char_embedding = nn.Embedding(ntokens, char_embedding_dim, char_padding_idx)
        self.conv_embedding = nn.Sequential(
            nn.Dropout(p=dropout),
            nn.Conv1d(
                in_channels=char_embedding_dim,
                out_channels=out_channels,
                kernel_size=kernel_size,
                padding=kernel_size - 1,
            ),
            nn.AdaptiveMaxPool1d(1),
        )
        self.proj_layer = nn.Linear(out_channels, target_emb)
        self.out_dropout = nn.Dropout(p=dropout)
        self._char_padding_idx = char_padding_idx

        self.init_weights()

    def init_weights(self):
        """Initialize the character embedding weights uniformly and
        reinitialize the padding vector to zero.
        """
        self.char_embedding.weight.data.uniform_(-0.1, 0.1)
        # Reinitialize the vector at padding_idx to all zeros
        self.char_embedding.weight.data[self._char_padding_idx].uniform_(0, 0)

    def forward(self, chars):
        """Run the forward pass of the char-CNN embedding model.

        Args:
            chars (torch.Tensor): An integer tensor with the size of
                [seq_len x batch x char_len]
        Returns:
            proj_char_embedding_vec (torch.Tensor): An embedding tensor with
                the size of [seq_len x batch x target_emb]
        """
        char_embedding_vec = self.char_embedding(chars)
        # Reshape the character embedding to the size of
        # [batch * seq_len, char_len, char_dim]
        char_embedding_vec = char_embedding_vec.view(
            -1, char_embedding_vec.size(2), char_embedding_vec.size(3)
        ).contiguous()
        # Transpose the embedding into [batch * seq_len, char_dim, char_len]
        char_embedding_vec = char_embedding_vec.transpose(1, 2).contiguous()
        # Apply the dropout, convolution, and adaptive-pooling layers, so the
        # dim now will be [batch * seq_len, out_channels, 1]
        char_embedding_vec = self.conv_embedding(char_embedding_vec)
        char_embedding_vec = char_embedding_vec.squeeze(-1)
        # Revert the size back to [seq_len, batch, out_channels]
        char_embedding_vec = char_embedding_vec.view(chars.size(0), chars.size(1), -1).contiguous()
        char_embedding_vec = self.out_dropout(char_embedding_vec)
        proj_char_embedding_vec = self.proj_layer(char_embedding_vec)
        # Apply a highway (residual) connection between the projection
        # layer's input and output
        if self._use_highway:
            proj_char_embedding_vec += char_embedding_vec

        return proj_char_embedding_vec
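
A quick smoke test of the module under assumed sizes (hypothetical values; note that `out_channels` must equal `target_emb` when `use_highway=True`):

```python
import torch

emb = WordCharCNNEmbedding(ntokens=100, out_channels=300, target_emb=300, use_highway=True)
chars = torch.randint(0, 100, (12, 8, 10))  # [seq_len=12, batch=8, char_len=10]
print(emb(chars).shape)  # torch.Size([12, 8, 300])
```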