Hi, Wang-Cheng,
Thanks for your awesome work on SASRec! We are planning to build a standardized sequential recommendation benchmark repo, and SASRec is naturally the first model we will implement. However, we found some potential issues in the official code, mainly related to the LayerNorm (LN) implementation.
As the original paper describes, SASRec uses the Pre-LN arrangement, i.e.,
$$
g'(x) = x + \text{Dropout}(g(\text{LN}(x)))
$$
However, in the official code, only the query (Q) is normalized before the attention layer, while the key and value (K, V) are not (cf. lines 79-83 in model.py):
```python
Q = self.attention_layernorms[i](seqs)
mha_outputs, _ = self.attention_layers[i](Q, seqs, seqs,
                                          attn_mask=attention_mask)
seqs = Q + mha_outputs  # Q is LN(seqs)
```
In the standard Pre-LN formulation, K and V should also be normalized, i.e., Q = K = V = LN(seqs). We did find the earlier discussion in issue #33; nonetheless, we believe that at least V should be normalized.
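For reference, below is a minimal sketch of what the attention sub-layer would look like under standard Pre-LN, reusing the loop variables from model.py (this is illustrative, not a drop-in patch):

```python
# Standard Pre-LN: normalize once, feed the same normalized tensor to Q, K, and V,
# and keep the residual connection on the un-normalized input.
x = self.attention_layernorms[i](seqs)                # x = LN(seqs)
mha_outputs, _ = self.attention_layers[i](x, x, x,
                                          attn_mask=attention_mask)
seqs = seqs + mha_outputs                             # residual on the original seqs
```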
We conducted some small experiments to compare the original normalization (Original), the standard Pre-LN (Pre Norm), and the Post-LN (Post Norm) implementations, where Post-LN follows the original Transformer and BERT. The results can be reproduced with the same hyperparameters as the original SASRec paper, on two processed datasets, Amazon-Beauty and Movielens-1M.
**Num_Block = 2**

|           | Beauty NDCG@10 | Beauty HR@10 | Movielens-1M NDCG@10 | Movielens-1M HR@10 |
|-----------|----------------|--------------|----------------------|--------------------|
| Original  | 0.05244        | 0.09345      | 0.16997              | 0.29685            |
| Pre Norm  | 0.05138        | 0.09265      | 0.16629              | 0.29122            |
| Post Norm | 0.05275        | 0.09363      | 0.17199              | 0.29719            |
**Num_Block = 3**

|           | Beauty NDCG@10 | Beauty HR@10 | Movielens-1M NDCG@10 | Movielens-1M HR@10 |
|-----------|----------------|--------------|----------------------|--------------------|
| Original  | 0.05095        | 0.09193      | 0.17113              | 0.30017            |
| Pre Norm  | 0.05314        | 0.09475      | 0.16916              | 0.30248            |
| Post Norm | 0.05124        | 0.09215      | 0.17257              | 0.3050             |
The experiments show that the Post-LN implementation outperforms the original one, while the performance of standard Pre-LN is unstable. We suspect this is due to the small number of Transformer blocks (<= 3) in SASRec: with so few blocks, training with Post-LN may not suffer from the usual instability, while the expressive power of Pre-LN is limited (cf. this blog).
Given the above results, we suggest applying standard Post-LN to SASRec (or at least providing an option in the code, or a clarification in the README about the LN implementation). The code modifications are simple:
```python
# In the point-wise feed-forward module (forward_layers[i]):
# outputs += inputs  # Remove this line; the residual is added in the block loop below

# In the block loop:
mha_outputs, _ = self.attention_layers[i](seqs, seqs, seqs,
                                          attn_mask=attention_mask)
seqs = seqs + mha_outputs
seqs = self.attention_layernorms[i](seqs)  # Apply Post-LN after the residual
seqs = torch.transpose(seqs, 0, 1)
seqs = seqs + self.forward_layers[i](seqs)
seqs = self.forward_layernorms[i](seqs)    # Apply Post-LN after the residual
```
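The pattern above is just the standard x = LN(x + Sublayer(x)) block from the original Transformer. For clarity, a self-contained sketch of that structure is given below (the class name `PostLNBlock` and its layer sizes are ours, purely for illustration, not code from the repo):

```python
import torch

class PostLNBlock(torch.nn.Module):
    """Illustrative Post-LN Transformer block: x = LN(x + Sublayer(x))."""
    def __init__(self, hidden_units, num_heads, dropout_rate):
        super().__init__()
        self.attn = torch.nn.MultiheadAttention(hidden_units, num_heads,
                                                dropout=dropout_rate)
        self.attn_ln = torch.nn.LayerNorm(hidden_units, eps=1e-8)
        self.ffn = torch.nn.Sequential(
            torch.nn.Linear(hidden_units, hidden_units),
            torch.nn.ReLU(),
            torch.nn.Dropout(dropout_rate),
            torch.nn.Linear(hidden_units, hidden_units),
            torch.nn.Dropout(dropout_rate),
        )
        self.ffn_ln = torch.nn.LayerNorm(hidden_units, eps=1e-8)

    def forward(self, seqs, attention_mask=None):
        # seqs: (seq_len, batch, hidden), as expected by nn.MultiheadAttention
        mha_outputs, _ = self.attn(seqs, seqs, seqs, attn_mask=attention_mask)
        seqs = self.attn_ln(seqs + mha_outputs)    # residual, then LayerNorm
        seqs = self.ffn_ln(seqs + self.ffn(seqs))  # residual, then LayerNorm
        return seqs

# Example usage with made-up sizes: 50-dim hidden units, 1 head, dropout 0.2
# block = PostLNBlock(50, 1, 0.2)
# out = block(torch.randn(200, 128, 50))  # (seq_len=200, batch=128, hidden=50)
```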
Thank you for your time and consideration.