Multimodal Alignment loss for ViLBERT/VisualBERT #466

## 🚀 Feature
Add the multimodal alignment (sentence-image prediction) loss to ViLBERT and VisualBERT.

## Motivation

The [ViLBERT](https://arxiv.org/abs/1908.02265) model uses two pretraining losses. The current MMF implementation uses the `masked multimodal modeling` loss but not the `multimodal alignment` loss. [VisualBERT](https://arxiv.org/abs/1908.03557) uses a similar loss, which its authors call the `sentence-image prediction` loss. The task is to add this multimodal alignment loss to the ViLBERT and VisualBERT models.

## Pitch

To reproduce the ViLBERT/VisualBERT results, the `multimodal alignment` loss should be added. This loss will also be important for extending the models to downstream retrieval tasks.


## Additional context

The task involves adding this loss to the models, making any required dataset-side changes, and testing that the implementation works as expected.
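
As a rough illustration of the dataset-side change and the loss itself, here is a minimal sketch in plain Python. It assumes negatives are built by swapping a caption for one that belongs to a different image, and that the model emits a single alignment logit per image-text pair. The names `make_alignment_example` and `alignment_loss` are hypothetical and are not part of MMF; the real implementation would operate on batched tensors.

```python
import math
import random


def make_alignment_example(samples, index, rng, mismatch_prob=0.5):
    """Return (image_id, caption, is_aligned) for the sample at `index`.

    `samples` is a list of (image_id, caption) pairs. With probability
    `mismatch_prob` the caption is swapped for one drawn from a different
    image and the label becomes 0 (not aligned). Hypothetical helper,
    not an MMF API.
    """
    image_id, caption = samples[index]
    if rng.random() < mismatch_prob:
        # Only captions from other images count as negatives.
        candidates = [c for (img, c) in samples if img != image_id]
        if candidates:
            return image_id, rng.choice(candidates), 0
    return image_id, caption, 1


def alignment_loss(logit, is_aligned):
    """Binary cross-entropy on one alignment logit, in a numerically stable form."""
    # Equivalent to -[y*log(sigmoid(x)) + (1-y)*log(1-sigmoid(x))].
    return max(logit, 0.0) - logit * is_aligned + math.log1p(math.exp(-abs(logit)))
```

A test for the real implementation could follow the same pattern: check that mismatched pairs receive label 0, and that a confident correct prediction yields a lower loss than a confident wrong one.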

