
# Multimodal

## Multimodal Emotion Recognition

### IEMOCAP

The IEMOCAP database (Busso et al., 2008) contains acted, two-way conversations between 10 speakers, segmented into utterances. The conversations in all the videos are in English. The database provides the following categorical labels: anger, happiness, sadness, neutral, excitement, frustration, fear, surprise, and other.

Monologue:

| Model | Accuracy | Paper / Source |
| --- | --- | --- |
| CHFusion (Poria et al., 2017) | 76.5% | Multimodal Sentiment Analysis using Hierarchical Fusion with Context Modeling |
| bc-LSTM (Poria et al., 2017) | 74.10% | Context-Dependent Sentiment Analysis in User-Generated Videos |

Conversational: The conversational setting enables models to capture the emotions expressed by speakers over the course of a conversation, taking inter-speaker dependencies into account (a toy sketch of per-speaker histories follows the table below).

| Model | Weighted Accuracy (WAA) | Paper / Source |
| --- | --- | --- |
| CMN (Hazarika et al., 2018) | 77.62% | Conversational Memory Network for Emotion Recognition in Dyadic Dialogue Videos |
| Memn2n | 75.08% | Conversational Memory Network for Emotion Recognition in Dyadic Dialogue Videos |
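To make the conversational setting concrete, here is a minimal sketch (hypothetical field and function names, not the CMN or bc-LSTM implementation) of collecting a dyadic dialogue's past utterances into per-speaker histories that a classifier could condition on:

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical utterance record for a dyadic (two-speaker) conversation.
@dataclass
class Utterance:
    speaker: str   # e.g. "A" or "B"
    text: str
    label: str     # one of the IEMOCAP categorical labels

def speaker_histories(dialogue: List[Utterance], t: int) -> Dict[str, List[str]]:
    """Collect each speaker's utterances that precede position t.

    Memory-network-style models keep separate per-speaker context so that
    inter-speaker dependencies can be modelled; this toy function only
    illustrates how such histories could be assembled.
    """
    histories: Dict[str, List[str]] = defaultdict(list)
    for utt in dialogue[:t]:
        histories[utt.speaker].append(utt.text)
    return dict(histories)

dialogue = [
    Utterance("A", "I got the job!", "excitement"),
    Utterance("B", "That's wonderful news.", "happiness"),
    Utterance("A", "I start on Monday.", "happiness"),
]
print(speaker_histories(dialogue, t=2))
# {'A': ['I got the job!'], 'B': ["That's wonderful news."]}
```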

## Multimodal Metaphor Recognition

Mohammad et al. (2016) created a dataset of verb-noun pairs from WordNet verbs that have multiple senses, and annotated these pairs for metaphoricity (metaphor or not a metaphor). The dataset is in English.

| Model | F1 Score | Paper / Source | Code |
| --- | --- | --- | --- |
| 5-layer convolutional network (Krizhevsky et al., 2012) + Word2Vec | 0.75 | Shutova et al., 2016 | Unavailable |

Tsvetkov et al. (2014) created a dataset of adjective-noun pairs that they then annotated for metaphoricity. The dataset is in English.

| Model | F1 Score | Paper / Source | Code |
| --- | --- | --- | --- |
| 5-layer convolutional network (Krizhevsky et al., 2012) + Word2Vec | 0.79 | Shutova et al., 2016 | Unavailable |
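As a toy illustration of the pair-classification setup behind both tables above (not the multimodal method of Shutova et al., 2016, whose visual features are omitted), one could concatenate pretrained word vectors for the two words and train a simple classifier; the random stand-in vectors and the logistic-regression choice are assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in word vectors; in practice these would come from a pretrained
# embedding model such as word2vec.
rng = np.random.default_rng(0)
vocab = ["drown", "sorrow", "boil", "water", "break", "promise", "cup"]
vectors = {w: rng.normal(size=50) for w in vocab}

def pair_features(word1: str, word2: str) -> np.ndarray:
    """Concatenate the two word vectors into one feature vector."""
    return np.concatenate([vectors[word1], vectors[word2]])

# Tiny labelled sample: 1 = metaphorical, 0 = literal.
pairs = [("drown", "sorrow", 1), ("boil", "water", 0),
         ("break", "promise", 1), ("break", "cup", 0)]
X = np.stack([pair_features(a, b) for a, b, _ in pairs])
y = [label for _, _, label in pairs]

clf = LogisticRegression().fit(X, y)
print(clf.predict(pair_features("drown", "sorrow").reshape(1, -1)))  # 1 = metaphorical
```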

## Multimodal Sentiment Analysis

### MOSI

The MOSI dataset (Zadeh et al., 2016) is a dataset rich in sentiment expressions in which 93 people review topics in English. The videos are segmented into utterances, and each segment's sentiment is scored between -3 (strongly negative) and +3 (strongly positive) by 5 annotators.

| Model | Accuracy | Paper / Source |
| --- | --- | --- |
| bc-LSTM (Poria et al., 2017) | 80.3% | Context-Dependent Sentiment Analysis in User-Generated Videos |
| MARN (Zadeh et al., 2018) | 77.1% | Multi-attention Recurrent Network for Human Communication Comprehension |
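To ground the annotation scheme described above, here is a minimal sketch of averaging a segment's five annotator scores and mapping the result to coarse positive/negative labels; the zero threshold and the binary evaluation are assumptions, not details taken from the table:

```python
from statistics import mean
from typing import List

def segment_sentiment(annotator_scores: List[float]) -> float:
    """Average the per-annotator scores, each in [-3, +3]."""
    if not all(-3 <= s <= 3 for s in annotator_scores):
        raise ValueError("MOSI scores lie between -3 and +3")
    return mean(annotator_scores)

def to_binary(score: float) -> str:
    """Map an averaged score to a 2-class label (thresholding at 0 is an
    assumption; papers differ in how neutral segments are handled)."""
    return "positive" if score > 0 else "negative"

scores = [2, 3, 1, 2, 2]          # five annotators for one segment
avg = segment_sentiment(scores)   # 2.0
print(avg, to_binary(avg))        # 2 positive
```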

## Visual Question Answering

### VQAv2

Given an image and a natural language question about the image, the task is to provide an accurate natural language answer.

| Model | Accuracy | Paper / Source | Code |
| --- | --- | --- | --- |
| UNITER (Chen et al., 2019) | 73.4 | UNITER: Learning Universal Image-Text Representations | Link |
| LXMERT (Tan et al., 2019) | 72.54 | LXMERT: Learning Cross-Modality Encoder Representations from Transformers | Link |
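The accuracy figures above are computed with the VQA consensus metric, in which each question has multiple human answers and a prediction is credited in proportion to how many annotators gave it, capped at 1. Below is a minimal sketch of the commonly cited simplified form; the official evaluation additionally normalises answer strings and averages over annotator subsets, which is omitted here:

```python
from typing import List

def vqa_accuracy(predicted: str, human_answers: List[str]) -> float:
    """Simplified VQA accuracy: min(#matching human answers / 3, 1)."""
    matches = sum(1 for ans in human_answers if ans == predicted)
    return min(matches / 3.0, 1.0)

# Ten human answers collected for one question.
human = ["2", "2", "two", "2", "2", "3", "2", "2", "2", "2"]
print(vqa_accuracy("2", human))   # 1.0   (at least 3 annotators agree)
print(vqa_accuracy("3", human))   # 0.33  (only 1 annotator agrees)
```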

### GQA - Visual Reasoning in the Real World

GQA focuses on real-world compositional reasoning.

| Model | Accuracy | Paper / Source | Code |
| --- | --- | --- | --- |
| KaKao Brain | 73.24 | GQA Challenge | Unavailable |
| LXMERT (Tan et al., 2019) | 60.3 | LXMERT: Learning Cross-Modality Encoder Representations from Transformers | Link |

### TextVQA

TextVQA requires models to read and reason about text in an image in order to answer questions about it.

| Model | Accuracy | Paper / Source | Code |
| --- | --- | --- | --- |
| M4C (Hu et al., 2020) | 40.46 | Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA | Link |

### VizWiz dataset

This task focuses on answering visual questions that originate from a real use case: blind people submitted images with recorded spoken questions in order to learn about their physical surroundings.

| Model | Accuracy | Paper / Source | Code |
| --- | --- | --- | --- |
| Pythia | 54.22 | FB's Pythia repository | Link |
| BUTD Vizwiz (Gurari et al., 2018) | 46.9 | VizWiz Grand Challenge: Answering Visual Questions from Blind People | Unavailable |

## Other multimodal resources

Go back to the README