Hi, Wang-Cheng,
Thanks for your awesome work on SASRec! We are planning to build a standardized sequential recommendation benchmark repo, and SASRec is naturally the first model we will implement. However, we found some potential issues in the official code, mainly related to the LayerNorm (LN) implementation.
As the original paper describes, SASRec uses the Pre-LN arrangement, i.e.,
$$
g'(x) = x + \text{Dropout}(g(\text{LN}(x)))
$$
However, in the official code, only the query (Q) is normalized before the attention layer, while the key and value (K, V) are not (cf. lines 79-83 in model.py):
```python
Q = self.attention_layernorms[i](seqs)
mha_outputs, _ = self.attention_layers[i](Q, seqs, seqs,
                                          attn_mask=attention_mask)
seqs = Q + mha_outputs  # Q is LN(seqs)
```
In the standard Pre-LN formulation, K and V should also be normalized, i.e., Q = K = V = LN(seqs). We did find the earlier discussion in issue #33; nonetheless, we believe that at least V should be normalized.
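For reference, below is a minimal sketch of what the attention sub-layer would look like under standard Pre-LN, reusing the loop variables from model.py (this is illustrative, not a drop-in patch):

```python
# Standard Pre-LN: normalize once, feed the same normalized tensor to Q, K, and V,
# and keep the residual connection on the un-normalized input.
x = self.attention_layernorms[i](seqs)                # x = LN(seqs)
mha_outputs, _ = self.attention_layers[i](x, x, x,
                                          attn_mask=attention_mask)
seqs = seqs + mha_outputs                             # residual on the original seqs
```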
We conducted some small experiments to compare the original normalization (Original), the standard Pre-LN (Pre Norm), and the Post-LN (Post Norm) implementations, where Post-LN follows the original Transformer and BERT. The results can be reproduced with the same hyperparameters as the original SASRec paper, on two processed datasets, Amazon-Beauty and Movielens-1M.
**Num_Block = 2**

|           | Beauty NDCG@10 | Beauty HR@10 | Movielens-1M NDCG@10 | Movielens-1M HR@10 |
|-----------|----------------|--------------|----------------------|--------------------|
| Original  | 0.05244        | 0.09345      | 0.16997              | 0.29685            |
| Pre Norm  | 0.05138        | 0.09265      | 0.16629              | 0.29122            |
| Post Norm | 0.05275        | 0.09363      | 0.17199              | 0.29719            |
**Num_Block = 3**

|           | Beauty NDCG@10 | Beauty HR@10 | Movielens-1M NDCG@10 | Movielens-1M HR@10 |
|-----------|----------------|--------------|----------------------|--------------------|
| Original  | 0.05095        | 0.09193      | 0.17113              | 0.30017            |
| Pre Norm  | 0.05314        | 0.09475      | 0.16916              | 0.30248            |
| Post Norm | 0.05124        | 0.09215      | 0.17257              | 0.3050             |
The experiments show that the Post-LN implementation outperforms the original one, while the performance of standard Pre-LN is unstable. We suspect this is due to the small number of Transformer blocks (<= 3) in SASRec: with so few blocks, training with Post-LN may not suffer from the usual instability, while the expressive power of Pre-LN is limited (cf. this blog).
Given the above results, we suggest applying standard Post-LN to SASRec (or at least providing an option in the code, or a clarification in the README about the LN implementation). The code modifications are simple:
```python
# In the point-wise feed-forward module (forward_layers[i]):
# outputs += inputs  # Remove this line; the residual is added in the block loop below

# In the block loop:
mha_outputs, _ = self.attention_layers[i](seqs, seqs, seqs,
                                          attn_mask=attention_mask)
seqs = seqs + mha_outputs
seqs = self.attention_layernorms[i](seqs)  # Apply Post-LN after the residual
seqs = torch.transpose(seqs, 0, 1)
seqs = seqs + self.forward_layers[i](seqs)
seqs = self.forward_layernorms[i](seqs)    # Apply Post-LN after the residual
```
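The pattern above is just the standard x = LN(x + Sublayer(x)) block from the original Transformer. For clarity, a self-contained sketch of that structure is given below (the class name `PostLNBlock` and its layer sizes are ours, purely for illustration, not code from the repo):

```python
import torch

class PostLNBlock(torch.nn.Module):
    """Illustrative Post-LN Transformer block: x = LN(x + Sublayer(x))."""
    def __init__(self, hidden_units, num_heads, dropout_rate):
        super().__init__()
        self.attn = torch.nn.MultiheadAttention(hidden_units, num_heads,
                                                dropout=dropout_rate)
        self.attn_ln = torch.nn.LayerNorm(hidden_units, eps=1e-8)
        self.ffn = torch.nn.Sequential(
            torch.nn.Linear(hidden_units, hidden_units),
            torch.nn.ReLU(),
            torch.nn.Dropout(dropout_rate),
            torch.nn.Linear(hidden_units, hidden_units),
            torch.nn.Dropout(dropout_rate),
        )
        self.ffn_ln = torch.nn.LayerNorm(hidden_units, eps=1e-8)

    def forward(self, seqs, attention_mask=None):
        # seqs: (seq_len, batch, hidden), as expected by nn.MultiheadAttention
        mha_outputs, _ = self.attn(seqs, seqs, seqs, attn_mask=attention_mask)
        seqs = self.attn_ln(seqs + mha_outputs)    # residual, then LayerNorm
        seqs = self.ffn_ln(seqs + self.ffn(seqs))  # residual, then LayerNorm
        return seqs

# Example usage with made-up sizes: 50-dim hidden units, 1 head, dropout 0.2
# block = PostLNBlock(50, 1, 0.2)
# out = block(torch.randn(200, 128, 50))  # (seq_len=200, batch=128, hidden=50)
```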
Thank you for your time and consideration.