Proceedings of the Third Workshop on Multimodal Artificial Intelligence

Amir Zadeh, Louis-Philippe Morency, Paul Pu Liang, Candace Ross, Ruslan Salakhutdinov, Soujanya Poria, Erik Cambria, Kelly Shi (Editors)

Mexico City, Mexico
NAACL | maiworkshop
Association for Computational Linguistics

Multi Task Learning based Framework for Multimodal Classification
Danting Zeng

Large-scale multimodal classification aims to distinguish between different kinds of multimodal data, and it has drawn considerable attention over the last decade. In this paper, we propose a multi-task learning-based framework for the multimodal classification task, which consists of two branches: a multimodal autoencoder branch and an attention-based multimodal modeling branch. The multimodal autoencoder receives multimodal features and extracts their interactive information, called the multimodal encoder feature, which it uses to reconstruct all the input data. In addition, the multimodal encoder feature can be used to enrich the raw dataset and improve the performance of downstream tasks such as classification. In the attention-based multimodal modeling branch, we first employ an attention mechanism to focus the model on important features, and then use the multimodal encoder feature to enrich the input information and achieve better performance. We conduct extensive experiments on different datasets, and the results demonstrate the effectiveness of the proposed framework.
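
The two-branch training objective described in this abstract can be illustrated with a small sketch. The abstract does not say how the reconstruction and classification objectives are combined, so the weighted sum below, the `recon_weight` parameter, and all function names are assumptions made for illustration rather than the author's formulation.

```python
import math

def mse(reconstruction, target):
    """Mean squared error between one reconstructed modality and its input."""
    return sum((r - t) ** 2 for r, t in zip(reconstruction, target)) / len(target)

def cross_entropy(logits, label):
    """Softmax cross-entropy for a single example (log-sum-exp for stability)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

def multitask_loss(reconstructions, inputs, logits, label, recon_weight=0.5):
    """Classification loss plus a weighted reconstruction loss.

    reconstructions / inputs: one feature vector per modality.
    recon_weight is a hypothetical hyperparameter, not taken from the paper.
    """
    recon = sum(mse(r, x) for r, x in zip(reconstructions, inputs)) / len(inputs)
    return cross_entropy(logits, label) + recon_weight * recon
```

With a perfect reconstruction the total reduces to the classification loss alone, so the autoencoder branch only contributes while it still fails to reproduce its inputs.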

A Package for Learning on Tabular and Text Data with Transformers
Ken Gu | Akshay Budhkar

Recent progress in natural language processing has led to Transformer architectures becoming the predominant model used for natural language tasks. However, many real-world datasets include additional modalities that the Transformer does not directly leverage. We present Multimodal-Toolkit, an open-source Python package to incorporate text and tabular (categorical and numerical) data with Transformers for downstream applications. Our toolkit integrates well with Hugging Face’s existing API, such as tokenization and the model hub, which allows easy download of different pre-trained models.

Learning to Select Question-Relevant Relations for Visual Question Answering
Jaewoong Lee | Heejoon Lee | Hwanhee Lee | Kyomin Jung

Existing visual question answering (VQA) systems commonly use graph neural networks (GNNs) to extract visual relationships such as semantic relations or spatial relations. However, studies that use GNNs typically ignore the importance of each relation and simply concatenate outputs from multiple relation encoders. In this paper, we propose a novel layer architecture that fuses multiple visual relations through an attention mechanism to address this issue. Specifically, we develop a model that uses the question embedding and the joint embedding of the encoders to obtain dynamic attention weights that depend on the type of question. Using the learnable attention weights, the proposed model can efficiently use the necessary visual relation features for a given question. Experimental results on the VQA 2.0 dataset demonstrate that the proposed model outperforms existing graph attention network-based architectures. Additionally, we visualize the attention weights and show that the proposed model assigns a higher weight to relations that are more relevant to the question.
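
The contrast between concatenating relation encoder outputs and fusing them with question-dependent weights can be sketched as follows. This is a simplified illustration, not the layer proposed in the paper: it scores each relation with a scaled dot product against the question embedding and has no learned parameters, and the function names and the example vectors are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fuse_relations(question, relation_outputs):
    """Fuse relation encoder outputs with question-dependent attention weights.

    question: question embedding (list of floats).
    relation_outputs: dict mapping a relation name to a vector of the same size.
    Returns the weighted sum of the outputs and the weight given to each relation.
    """
    names = list(relation_outputs)
    scale = math.sqrt(len(question))
    # Assumption: relevance is a scaled dot product; the paper learns its weights.
    scores = [dot(question, relation_outputs[n]) / scale for n in names]
    weights = softmax(scores)
    fused = [
        sum(w * relation_outputs[n][i] for w, n in zip(weights, names))
        for i in range(len(question))
    ]
    return fused, dict(zip(names, weights))

# Hypothetical embeddings: the question aligns with the semantic relation.
question = [1.0, 0.0, 0.5, 0.0]
relations = {
    "semantic": [0.9, 0.1, 0.4, 0.0],
    "spatial": [0.0, 0.8, 0.0, 0.3],
}
fused, weights = fuse_relations(question, relations)
```

In this example the semantic relation receives the larger weight because its output is better aligned with the question embedding, which mirrors the behavior the abstract reports from its attention visualization.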