❓ Questions and Help
Overall goal: I am trying to extract the visual attention maps from ViLBERT to explore where ViLBERT is looking in the image.
My question
Question 1:
I know ViLBERT has three kinds of attention: image attention, text attention, and co-attention. I am not sure whether I should use the image attention or the co-attention; currently I am using the image attention.
Question 2:
I know that the image attention outputs 6 tensors, each of size (1, 8, 100, 100). I would like to know: (1) what the 8, 100, and 100 represent; (2) which of the 6 tensors I should select; and (3) how I can visualize an attention map from these image attention weights.
My understanding for Question 2:
According to https://github.com/facebookresearch/mmf/blob/3947693aafcc9cc2a16d7c1c5e1479bf0f88ed4b/mmf/configs/models/vilbert/defaults.yaml, it seems that 8 represents the number of attention heads. My guess is that 1 is the batch size (I changed the batch size to 1) and that 100 is the image width and height.
If that is correct, then Question 2 becomes: how do I deal with multiple attention heads?
Possible solution for Question 2:
I know how to visualize an attention map when the attention weights are a 1D or 2D array. For a 4D tensor, I am not sure whether it makes sense to directly use squeeze() to reduce it to 2D for visualization, or whether I should average the attention over the heads to get 2D attention weights.
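For what it's worth, squeeze() alone can't get you to 2D here: on a (1, 8, 100, 100) tensor it only removes the size-1 batch dimension, leaving (8, 100, 100). A common convention in attention-visualization code is to average over the heads. A minimal sketch, assuming the layout really is (batch, heads, queries, keys) and using random dummy data in place of the real ViLBERT weights:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt

# Dummy stand-in for one layer's attention tensor:
# shape (batch, heads, queries, keys) = (1, 8, 100, 100)
attn = np.random.rand(1, 8, 100, 100)
attn = attn / attn.sum(axis=-1, keepdims=True)  # rows sum to 1, like softmax

# squeeze(0) drops only the batch dim -> (8, 100, 100);
# averaging over axis 0 (the heads) then gives a single (100, 100) map.
attn_2d = attn.squeeze(0).mean(axis=0)

plt.imshow(attn_2d, cmap="viridis")
plt.xlabel("key position")
plt.ylabel("query position")
plt.colorbar(label="attention weight")
plt.savefig("attention_map.png")
```

Averaging keeps the rows normalized (each query's weights still sum to 1), which head-wise squeezing or picking a single head does not guarantee to be representative; inspecting individual heads separately is the other common option.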
Other questions
(1) I am worried that the way they represent the image inside the transformer makes it impossible to visualize an image attention map for ViLBERT:
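On this worry: if the 100 positions are detected region features rather than a pixel grid (which is an assumption on my part, not something I've confirmed in the MMF code), visualization is still possible by painting each region's attention weight back into its bounding box. A sketch with hypothetical boxes and weights:

```python
import numpy as np

# Assumption: the image is encoded as N region features (e.g. detector
# proposals), each with a bounding box, rather than as a pixel grid.
H, W, N = 480, 640, 100
boxes = np.random.rand(N, 4) * [W, H, W, H]       # hypothetical (x1, y1, x2, y2)
boxes = np.sort(boxes.reshape(N, 2, 2), axis=1).reshape(N, 4)  # ensure x1<=x2, y1<=y2
region_attn = np.random.dirichlet(np.ones(N))     # one query row of a (100, 100) map

# Accumulate each region's weight over the pixels its box covers.
heatmap = np.zeros((H, W))
for (x1, y1, x2, y2), w in zip(boxes, region_attn):
    heatmap[int(y1):int(y2), int(x1):int(x2)] += w

heatmap /= heatmap.max() + 1e-8  # normalize to [0, 1] for overlaying on the image
```

The resulting heatmap can then be alpha-blended over the original image with any plotting library.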

(2) I got two sets of image attention weights from Pythia; which one should I use for visualization?
Thank you in advance!