Multimodal alignment loss for ViLBERT/VisualBERT #466

@vedanuj

🚀 Feature

Multimodal Alignment or Sentence Image Prediction loss for ViLBERT/VisualBERT

Motivation

ViLBERT uses two pretraining losses. The current MMF implementation includes the masked multimodal modeling loss but not the multimodal alignment loss. VisualBERT uses a similar objective, which its authors call the sentence-image prediction loss. The task is to add this multimodal alignment loss to the ViLBERT and VisualBERT models.
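
For reference, the alignment objective can be written as a binary cross-entropy over an image-text match score. The ViLBERT paper describes scoring a pair from the element-wise product of the holistic image and text representations followed by a linear layer. The symbols `w`, `b`, `s`, and `y` below are notation chosen for this sketch, not names from MMF.

```latex
s = w^\top \left( h_{\mathrm{IMG}} \odot h_{\mathrm{CLS}} \right) + b,
\qquad
\mathcal{L}_{\mathrm{align}}
  = -\left[\, y \log \sigma(s) + (1 - y) \log\bigl(1 - \sigma(s)\bigr) \right]
```

Here `y = 1` for a matching image-caption pair and `y = 0` for a mismatched one, and `\sigma` is the logistic sigmoid. For the single-stream VisualBERT, a pooled output would presumably take the place of the element-wise product; that substitution is an assumption, not something stated in this issue.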

Pitch

To reproduce the ViLBERT/VisualBERT results, the multimodal alignment loss should be added. The loss will also be important for extending the models to downstream retrieval tasks.

Additional context

The task involves adding this loss to the models, making any required dataset-side changes, and testing that the implementation works as expected.
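
The dataset-side change is probably the sampling of mismatched image-caption pairs to serve as negatives. Below is a minimal, framework-agnostic sketch of that step. The function name `make_alignment_pairs`, its parameters, and the 50% default negative rate are all hypothetical and not part of MMF. It also assumes one caption per image, so a caption taken from a different index counts as a mismatch.

```python
import random


def make_alignment_pairs(captions, image_ids, negative_prob=0.5, rng=None):
    """Build (image_id, caption, label) triples for an alignment loss.

    label is 1 when the caption belongs to the image and 0 when it was
    swapped for a caption drawn from a different example.
    Hypothetical helper: assumes captions[i] describes image_ids[i] and
    that each image has exactly one caption.
    """
    if len(captions) != len(image_ids):
        raise ValueError("captions and image_ids must have the same length")
    if rng is None:
        rng = random.Random()

    n = len(captions)
    pairs = []
    for i, image_id in enumerate(image_ids):
        # A negative needs at least one other example to draw from.
        if n > 1 and rng.random() < negative_prob:
            # Draw uniformly from the n - 1 indices other than i.
            j = rng.randrange(n - 1)
            if j >= i:
                j += 1
            pairs.append((image_id, captions[j], 0))
        else:
            pairs.append((image_id, captions[i], 1))
    return pairs


if __name__ == "__main__":
    ids = ["img0", "img1", "img2", "img3"]
    caps = ["a dog", "a cat", "a car", "a boat"]
    for triple in make_alignment_pairs(caps, ids, rng=random.Random(0)):
        print(triple)
```

Passing a seeded `random.Random` makes the sampling reproducible, which helps when testing that the labels and the loss behave as expected. A dataset with several captions per image would also need to check that the sampled caption comes from a different image.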
