❓ Questions and Help
Overall goal: I am trying to extract the visual attention maps from ViLBERT to explore where ViLBERT is looking in the image.
My question
Question 1:
I know ViLBERT has three kinds of attention: image attention, text attention, and co-attention. I am not sure whether I should use the image attention or the co-attention; currently I am using the image attention.
Question 2:
I know that the image attention outputs 6 tensors, each of size (1, 8, 100, 100). I would like to know: (1) what the 8, 100, and 100 represent; (2) which of the 6 tensors I should select; and (3) how I can visualize an attention map from these image attention weights.
My understanding for Question 2:
According to https://github.com/facebookresearch/mmf/blob/3947693aafcc9cc2a16d7c1c5e1479bf0f88ed4b/mmf/configs/models/vilbert/defaults.yaml, it seems that 8 represents the number of attention heads. My guess is that 1 is the batch size (I changed the batch size to 1) and that 100 is the image width and height.
If that is correct, then Question 2 becomes: how do I deal with multiple attention heads?
Possible solution for Question 2:
I know how to visualize an attention map when the attention weights are a 1D or 2D array. For a 4D tensor, I am not sure whether it makes sense to directly use squeeze() to reduce it to 2D for visualization, or whether I should average the attention over the heads to get 2D attention weights.
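For what it's worth, squeeze() alone can't get you to 2D here: on a (1, 8, 100, 100) tensor it only removes the size-1 batch dimension, leaving (8, 100, 100). A common convention in attention-visualization code is to average over the heads. A minimal sketch, assuming the layout really is (batch, heads, queries, keys) and using random dummy data in place of the real ViLBERT weights:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt

# Dummy stand-in for one layer's attention tensor:
# shape (batch, heads, queries, keys) = (1, 8, 100, 100)
attn = np.random.rand(1, 8, 100, 100)
attn = attn / attn.sum(axis=-1, keepdims=True)  # rows sum to 1, like softmax

# squeeze(0) drops only the batch dim -> (8, 100, 100);
# averaging over axis 0 (the heads) then gives a single (100, 100) map.
attn_2d = attn.squeeze(0).mean(axis=0)

plt.imshow(attn_2d, cmap="viridis")
plt.xlabel("key position")
plt.ylabel("query position")
plt.colorbar(label="attention weight")
plt.savefig("attention_map.png")
```

Averaging keeps the rows normalized (each query's weights still sum to 1), which head-wise squeezing or picking a single head does not guarantee to be representative; inspecting individual heads separately is the other common option.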
Other questions
(1) I am worried that the way they represent the image inside the transformer makes it impossible to visualize an image attention map for ViLBERT:
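On this worry: if the 100 positions are detected region features rather than a pixel grid (which is an assumption on my part, not something I've confirmed in the MMF code), visualization is still possible by painting each region's attention weight back into its bounding box. A sketch with hypothetical boxes and weights:

```python
import numpy as np

# Assumption: the image is encoded as N region features (e.g. detector
# proposals), each with a bounding box, rather than as a pixel grid.
H, W, N = 480, 640, 100
boxes = np.random.rand(N, 4) * [W, H, W, H]       # hypothetical (x1, y1, x2, y2)
boxes = np.sort(boxes.reshape(N, 2, 2), axis=1).reshape(N, 4)  # ensure x1<=x2, y1<=y2
region_attn = np.random.dirichlet(np.ones(N))     # one query row of a (100, 100) map

# Accumulate each region's weight over the pixels its box covers.
heatmap = np.zeros((H, W))
for (x1, y1, x2, y2), w in zip(boxes, region_attn):
    heatmap[int(y1):int(y2), int(x1):int(x2)] += w

heatmap /= heatmap.max() + 1e-8  # normalize to [0, 1] for overlaying on the image
```

The resulting heatmap can then be alpha-blended over the original image with any plotting library.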

(2) I got two sets of image attention weights from Pythia; which one should I use for visualization?
Thank you in advance!