Proceedings of the First International Workshop on Natural Language Processing Beyond Text

Giuseppe Castellucci, Simone Filice, Soujanya Poria, Erik Cambria, Lucia Specia (Editors)

EMNLP | nlpbt
Association for Computational Linguistics

MAST: Multimodal Abstractive Summarization with Trimodal Hierarchical Attention
Aman Khullar | Udit Arora

This paper presents MAST, a new model for Multimodal Abstractive Text Summarization that utilizes information from all three modalities (text, audio, and video) in a multimodal video. Prior work on multimodal abstractive text summarization only utilized information from the text and video modalities. We examine the usefulness and challenges of deriving information from the audio modality and present a sequence-to-sequence trimodal hierarchical attention-based model that overcomes these challenges by letting the model pay more attention to the text modality. MAST outperforms the current state-of-the-art model (video-text) by 2.51 points in terms of Content F1 score and 1.00 points in terms of ROUGE-L score on the How2 dataset for multimodal language understanding.
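
As a rough illustration of what hierarchical attention over several modalities can look like, the sketch below first attends within each modality and then attends over the resulting per-modality context vectors. It is a minimal toy under stated assumptions, not the MAST implementation: the dot-product scoring, the helper names (`attend`, `hierarchical_attend`), and the made-up 2-dimensional encoder states are all hypothetical, and the sketch does not model how MAST steers attention toward text.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, states):
    """Dot-product attention (an assumed scoring rule): returns (context, weights)."""
    weights = softmax([dot(query, s) for s in states])
    dim = len(states[0])
    context = [sum(w * s[i] for w, s in zip(weights, states)) for i in range(dim)]
    return context, weights


def hierarchical_attend(query, modalities):
    """Two-level attention over a dict mapping modality name -> list of states."""
    names = list(modalities)
    # Level 1: one context vector per modality.
    contexts = [attend(query, modalities[name])[0] for name in names]
    # Level 2: attend over the per-modality context vectors.
    fused, modality_weights = attend(query, contexts)
    return fused, dict(zip(names, modality_weights))


# Toy encoder states (made-up numbers, for illustration only).
modalities = {
    "text": [[1.0, 0.0], [0.5, 0.5]],
    "audio": [[0.0, 1.0]],
    "video": [[0.2, 0.2], [0.1, 0.9]],
}
fused, modality_weights = hierarchical_attend([1.0, 0.0], modalities)
```

Here `modality_weights` shows how much each modality contributes to the fused context for the given decoder query; in a trained model the scoring functions would be learned rather than plain dot products.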

Reasoning Over History: Context Aware Visual Dialog
Muhammad Shah | Shikib Mehri | Tejas Srinivasan

While neural models have been shown to exhibit strong performance on single-turn visual question answering (VQA) tasks, extending VQA to a multi-turn, conversational setting remains a challenge. One way to address this challenge is to augment existing strong neural VQA models with mechanisms that allow them to retain information from previous dialog turns. One strong VQA model is the MAC network, which decomposes a task into a series of attention-based reasoning steps. However, since the MAC network is designed for single-turn question answering, it is not capable of referring to past dialog turns. More specifically, it struggles with tasks that require reasoning over the dialog history, particularly coreference resolution. We extend the MAC network architecture with Context-aware Attention and Memory (CAM), which attends over control states in past dialog turns to determine the necessary reasoning operations for the current question. MAC nets with CAM achieve up to 98.25% accuracy on the CLEVR-Dialog dataset, beating the existing state of the art by 30% (absolute). Our error analysis indicates that with CAM, the model’s performance improved particularly on questions that required coreference resolution.
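
The idea of attending over control states from past dialog turns can be pictured with the toy sketch below. It is only an illustration under stated assumptions, not the CAM formulation from the paper: the function name `attend_over_history`, the dot-product scoring, and the simple averaging of the current control state with the history summary are all hypothetical choices.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_over_history(current_control, past_controls):
    """Return (context-aware control state, attention weights over past turns)."""
    if not past_controls:
        # First turn: there is no history to attend over.
        return list(current_control), []
    # Score each stored control state against the current one (assumed: dot product).
    scores = [
        sum(c * p for c, p in zip(current_control, past))
        for past in past_controls
    ]
    weights = softmax(scores)
    dim = len(current_control)
    summary = [
        sum(w * past[i] for w, past in zip(weights, past_controls))
        for i in range(dim)
    ]
    # Assumed combination rule: average the current state with the history summary.
    combined = [(c + s) / 2 for c, s in zip(current_control, summary)]
    return combined, weights


# Made-up control states from two earlier dialog turns.
history = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
control, weights = attend_over_history([0.9, 0.1, 0.0], history)
```

In this toy case the first stored turn receives the larger weight because its control state is closest to the current one, which is the kind of behavior that could help resolve a reference back to an earlier question.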