Proceedings of the 1st Workshop on Multimodal Semantic Representations (MMSR)

Lucia Donatelli, Nikhil Krishnaswamy, Kenneth Lai, James Pustejovsky (Editors)


Anthology ID:
2021.mmsr-1
Month:
June
Year:
2021
Address:
Groningen, Netherlands (Online)
Venues:
IWCS | MMSR
SIG:
SIGSEM
Publisher:
Association for Computational Linguistics
URL:
https://aclanthology.org/2021.mmsr-1
DOI:
PDF:
https://aclanthology.org/2021.mmsr-1.pdf

Proceedings of the 1st Workshop on Multimodal Semantic Representations (MMSR)
Lucia Donatelli | Nikhil Krishnaswamy | Kenneth Lai | James Pustejovsky

What is Multimodality?
Letitia Parcalabescu | Nils Trost | Anette Frank

Recent years have seen rapid developments in the field of multimodal machine learning, which combines, e.g., vision, text, and speech. In this position paper we explain how the field uses outdated definitions of multimodality that prove unfit for the machine learning era. We propose a new task-relative definition of (multi)modality in the context of multimodal machine learning that focuses on the representations and information that are relevant for a given machine learning task. With our new definition of multimodality we aim to provide a missing foundation for multimodal research, an important component of language grounding, and a crucial milestone towards natural language understanding (NLU).

Seeing past words: Testing the cross-modal capabilities of pretrained V&L models on counting tasks
Letitia Parcalabescu | Albert Gatt | Anette Frank | Iacer Calixto

We investigate the reasoning ability of pretrained vision and language (V&L) models in two tasks that require multimodal integration: (1) discriminating a correct image-sentence pair from an incorrect one, and (2) counting entities in an image. We evaluate three pretrained V&L models on these tasks: ViLBERT, ViLBERT 12-in-1 and LXMERT, in zero-shot and finetuned settings. Our results show that models solve task (1) very well, as expected, since all models are pretrained on task (1). However, none of the pretrained V&L models is able to adequately solve task (2), our counting probe, and they cannot generalise to out-of-distribution quantities. We propose a number of explanations for these findings: LXMERT (and to some extent ViLBERT 12-in-1) show some evidence of catastrophic forgetting on task (1). Concerning our results on the counting probe, we find evidence that all models are impacted by dataset bias, and also fail to individuate entities in the visual input. While a selling point of pretrained V&L models is their ability to solve complex tasks, our findings suggest that understanding their reasoning and grounding capabilities requires more targeted investigations of specific phenomena.
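
As a rough illustration of the counting probe described above, the sketch below computes in-distribution and out-of-distribution accuracy for a model's predicted counts. The `predict_count` wrapper, the example fields, and the choice of out-of-distribution quantities are hypothetical placeholders, not the authors' setup.

```python
# Minimal sketch of a counting-probe evaluation, assuming a model wrapper
# exposing predict_count(image, sentence) -> int. Everything named here
# (wrapper, fields, OOD quantities) is an assumption for illustration.
from collections import defaultdict

def evaluate_counting_probe(model, examples, ood_quantities=frozenset({5, 6, 7})):
    """examples: iterable of dicts with 'image', 'sentence', 'count' keys."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        pred = model.predict_count(ex["image"], ex["sentence"])
        split = "ood" if ex["count"] in ood_quantities else "in_dist"
        total[split] += 1
        correct[split] += int(pred == ex["count"])
    # Report accuracy separately for in-distribution and OOD quantities.
    return {split: correct[split] / total[split] for split in total}
```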

How Vision Affects Language: Comparing Masked Self-Attention in Uni-Modal and Multi-Modal Transformer
Nikolai Ilinykh | Simon Dobnik

The interpretation of knowledge learned by multi-head self-attention in transformers has been one of the central questions in NLP. However, most work has focused on models trained for uni-modal tasks, e.g. machine translation. In this paper, we examine masked self-attention in a multi-modal transformer trained for the task of image captioning. In particular, we test whether the multi-modality of the task objective affects the learned attention patterns. Our visualisations of masked self-attention demonstrate that (i) it can learn general linguistic knowledge of the textual input, and (ii) its attention patterns incorporate artefacts from the visual modality even though it never accesses that modality directly. We compare our transformer’s attention patterns with masked attention in distilgpt-2 tested on uni-modal text generation of image captions. Based on the maps of extracted attention weights, we argue that masked self-attention in the image captioning transformer appears to be enhanced with semantic knowledge from images, exemplifying joint language-and-vision information in its attention patterns.
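
For the uni-modal side of such a comparison, the sketch below shows one way to extract masked self-attention maps from distilgpt-2 with the Hugging Face transformers library; the caption string and the head-averaging step are our own choices, and the multi-modal captioning transformer itself is not reproduced here.

```python
# Minimal sketch: extract masked self-attention maps from distilgpt-2 for a
# caption, as one might do when comparing against a multi-modal transformer.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")
model.eval()

caption = "a man riding a horse on a beach"   # example caption (our choice)
inputs = tokenizer(caption, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# outputs.attentions: one (batch, heads, seq_len, seq_len) tensor per layer;
# the causal mask ensures each token attends only to preceding tokens.
attn = torch.stack(outputs.attentions)   # (layers, batch, heads, seq, seq)
mean_per_layer = attn.mean(dim=2)        # average over heads for visualisation
print(mean_per_layer.shape)
```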

EMISSOR: A platform for capturing multimodal interactions as Episodic Memories and Interpretations with Situated Scenario-based Ontological References
Selene Baez Santamaria | Thomas Baier | Taewoon Kim | Lea Krause | Jaap Kruijt | Piek Vossen

We present EMISSOR: a platform to capture multimodal interactions as recordings of episodic experiences with explicit referential interpretations that also yield an episodic Knowledge Graph (eKG). The platform stores streams of multiple modalities as parallel signals. Each signal is segmented and annotated independently with an interpretation. Annotations are eventually mapped to explicit identities and relations in the eKG. As we ground signal segments from different modalities to the same instance representations, we also ground the different modalities to each other. Unique to our eKG is that it accepts different interpretations across modalities, sources and experiences, and supports reasoning over conflicting information and uncertainties that may result from multimodal experiences. EMISSOR can record and annotate experiments in virtual and real-world settings, combine data, and evaluate system behavior and performance against preset goals, but can also model the accumulation of knowledge and interpretations in the Knowledge Graph as a result of these episodic experiences.
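
To make the described representation concrete, the following is a hypothetical sketch of such a data model: parallel modality signals, independently segmented and annotated, with annotations optionally grounded to eKG identities. All class and field names are illustrative and do not reproduce EMISSOR's actual schema.

```python
# Illustrative (hypothetical) data model for the representation the abstract
# describes; not EMISSOR's real schema or API.
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Annotation:
    annotation_type: str                 # e.g. "token", "face", "emotion"
    value: str                           # the interpretation of the segment
    ekg_identity: Optional[str] = None   # IRI of the grounded eKG instance, if any

@dataclass
class Segment:
    ruler: Tuple[int, int]               # span within the signal (offsets or frames)
    annotations: List[Annotation] = field(default_factory=list)

@dataclass
class Signal:
    modality: str                        # "text", "image", "audio", ...
    file: str                            # path to the recorded stream
    segments: List[Segment] = field(default_factory=list)

@dataclass
class Scenario:
    scenario_id: str
    signals: List[Signal] = field(default_factory=list)
```

Grounding two segments from different modalities to the same `ekg_identity` is what links the modalities to each other in this picture.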

Incremental Unit Networks for Multimodal, Fine-grained Information State Representation
Casey Kennington | David Schlangen

We offer a fine-grained information state annotation scheme that follows directly from the Incremental Unit abstract model of dialogue processing when used within a multimodal, co-located, interactive setting. We explain the Incremental Unit model and give an example application using the Localized Narratives dataset, then offer avenues for future research.
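
For readers unfamiliar with the Incremental Unit model the scheme builds on, the sketch below captures its commonly described core ingredients: units with added/revoked/committed states and grounded-in links to the units they derive from. The class and field names are illustrative; this is not the annotation scheme proposed in the paper.

```python
# Minimal, illustrative sketch of the Incremental Unit (IU) abstraction.
from dataclasses import dataclass, field
from typing import List

ADDED, REVOKED, COMMITTED = "added", "revoked", "committed"

@dataclass
class IncrementalUnit:
    iu_id: str
    producer: str                      # module that produced the IU (e.g. ASR, vision)
    payload: object                    # e.g. a word hypothesis or an object detection
    grounded_in: List["IncrementalUnit"] = field(default_factory=list)
    state: str = ADDED

    def revoke(self) -> None:
        # A later hypothesis retracts this unit.
        self.state = REVOKED

    def commit(self) -> None:
        # The unit is final and will not be revoked.
        self.state = COMMITTED
```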

Teaching Arm and Head Gestures to a Humanoid Robot through Interactive Demonstration and Spoken Instruction
Michael Brady | Han Du

We describe work in progress on training a humanoid robot to produce iconic arm and head gestures as part of task-oriented dialogue interaction. This involves the development and use of a multimodal dialogue manager that allows non-experts to quickly ‘program’ the robot through speech and vision. Using this dialogue manager, videos of gesture demonstrations are collected. Motor positions are extracted from these videos to specify motor trajectories, and collections of motor trajectories are used to produce robot gestures following a Gaussian mixture approach. A concluding discussion considers how the learned representations may be used for gesture recognition by the robot, and how the framework may mature into a system that addresses language grounding and semantic representation.
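
As an illustration of the trajectory-modelling step, the sketch below fits a Gaussian mixture over pooled (time, motor position) points from several demonstrations using scikit-learn; the data layout and component count are assumptions, and this is not the authors' implementation.

```python
# Minimal sketch: model demonstrated gesture trajectories with a Gaussian
# mixture over (normalised time, motor positions) points.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gesture_model(demonstrations, n_components=5):
    """demonstrations: list of (T_i, n_motors) arrays of per-frame motor positions."""
    points = []
    for demo in demonstrations:
        t = np.linspace(0.0, 1.0, len(demo))[:, None]   # normalised time axis
        points.append(np.hstack([t, demo]))             # (time, motor_1..motor_n)
    data = np.vstack(points)
    gmm = GaussianMixture(n_components=n_components, covariance_type="full")
    gmm.fit(data)
    return gmm

# Example: three noisy demonstrations of a 2-motor gesture.
rng = np.random.default_rng(0)
demos = [np.cumsum(rng.normal(size=(50, 2)), axis=0) for _ in range(3)]
model = fit_gesture_model(demos)
print(model.means_.shape)   # (n_components, 1 + n_motors)
```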