🚀 Feature
Multimodal Alignment or Sentence Image Prediction loss for ViLBERT/VisualBERT
Motivation
The ViLBERT model is pretrained with two losses. The current MMF implementation includes the masked multimodal modeling loss but not the multimodal alignment loss. VisualBERT uses a similar objective, which its authors call the sentence-image prediction loss. The task is to add this multimodal alignment loss to the ViLBERT and VisualBERT models.
Pitch
The multimodal alignment loss is needed to reproduce the ViLBERT/VisualBERT results. It will also be important for extending the models to downstream retrieval tasks.
Additional context
The task involves adding this loss to the models, making any required dataset-side changes, and testing that the implementation works as expected.
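To make the scope concrete, here is a rough sketch of the two pieces this would likely touch. It is not MMF code: `make_alignment_pairs` and `alignment_loss` are hypothetical names, the 50% mismatch rate is an assumption, and the loss is written in plain Python only to show the arithmetic (in the models it would presumably be a binary cross-entropy on a classification head over the joint image-text representation).

```python
import math
import random
from typing import List, Optional, Sequence, Tuple


def make_alignment_pairs(
    captions: Sequence[str],
    mismatch_prob: float = 0.5,  # assumed rate, not taken from MMF
    rng: Optional[random.Random] = None,
) -> List[Tuple[int, str, int]]:
    """Dataset side: pair image i with its own caption (label 1) or with
    a caption drawn from a different image (label 0)."""
    rng = rng or random.Random()
    n = len(captions)
    pairs = []
    for i in range(n):
        if n > 1 and rng.random() < mismatch_prob:
            # Sample an index j != i uniformly from the other images.
            j = rng.randrange(n - 1)
            if j >= i:
                j += 1
            pairs.append((i, captions[j], 0))
        else:
            pairs.append((i, captions[i], 1))
    return pairs


def alignment_loss(logits: Sequence[float], labels: Sequence[int]) -> float:
    """Model side: mean binary cross-entropy on raw alignment logits,
    using the numerically stable form max(z, 0) - y*z + log(1 + exp(-|z|))."""
    total = 0.0
    for z, y in zip(logits, labels):
        total += max(z, 0.0) - y * z + math.log1p(math.exp(-abs(z)))
    return total / len(labels)


if __name__ == "__main__":
    batch = make_alignment_pairs(
        ["a dog on a beach", "two people cooking", "a red bus"],
        rng=random.Random(0),
    )
    labels = [label for _, _, label in batch]
    # Placeholder logits; a real model would produce these from its pooled output.
    print(batch, alignment_loss([0.0] * len(labels), labels))
```

A test for the implementation could then check that mismatched pairs never reuse the image's own caption index and that the loss decreases when logits agree with the labels.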