Thanks for your great work!
I have read your paper, but I am a bit confused about two things.
(1) The instantiation of Multi-head PA. How should Multi-head PA (r = 30) be instantiated so that it has the same number of tuned parameters as PA (attn, r = 30), as reported in Table 4 of the main paper? My initial thought was that Multi-head PA's tuned parameter count would be N_h times that of PA. (I sketch the rough count I had in mind below, after question (2).)
(2) The design choice of the MAM adapter. As I understand it, MH PA (attn, r = 30) is slightly better than prefix tuning (l = 30) according to the results in Table 4 (35.3 > 35.2), and previous papers such as LoRA report that prefix tuning is not stable to optimize. Nevertheless, the MAM adapter adopts prefix tuning. Is there a specific reason for this choice?
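To make question (1) concrete, here is the rough count I had in mind. The numbers d = 1024 and N_h = 16 (e.g., BART/mBART-large) and the two possible readings of "multi-head" are my own assumptions, not taken from the paper; please correct me if the actual instantiation is different:

```python
# Rough parameter count for question (1). Assumptions are mine, not from the
# paper: hidden size d = 1024, N_h = 16 heads (BART/mBART-large), biases ignored.
d, n_heads, r = 1024, 16, 30
d_head = d // n_heads  # 64

# PA (attn, r = 30): one bottleneck adapter on the full hidden state,
# W_down: d x r, W_up: r x d.
pa = d * r + r * d                                   # 61,440

# Reading (a) of Multi-head PA: each head gets its own adapter over the full
# hidden dimension d -> N_h times the parameters of PA (my initial thought).
mh_pa_full = n_heads * (d * r + r * d)               # 983,040

# Reading (b): each head's adapter acts only on its own slice of size d / N_h
# (W_down: d_head x r, W_up: r x d_head) -> same total as PA.
mh_pa_split = n_heads * (d_head * r + r * d_head)    # 61,440

print(pa, mh_pa_full, mh_pa_split)
```

Is reading (b) the instantiation used for Table 4, or is the parameter count kept matched in some other way?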
Would you mind giving me some pointers on these two questions?