Proceedings of the Beyond Vision and LANguage: inTEgrating Real-world kNowledge (LANTERN)

Aditya Mogadala, Dietrich Klakow, Sandro Pezzelle, Marie-Francine Moens (Editors)


Anthology ID:
D19-64
Month:
November
Year:
2019
Address:
Hong Kong, China
Venues:
EMNLP | WS
SIG:
Publisher:
Association for Computational Linguistics
URL:
https://aclanthology.org/D19-64
DOI:
Bib Export formats:
BibTeX MODS XML EndNote
PDF:
https://aclanthology.org/D19-64.pdf

pdf bib
Proceedings of the Beyond Vision and LANguage: inTEgrating Real-world kNowledge (LANTERN)
Aditya Mogadala | Dietrich Klakow | Sandro Pezzelle | Marie-Francine Moens

pdf bib
Big Generalizations with Small Data : Exploring the Role of Training Samples in Learning Adjectives of Size
Sandro Pezzelle | Raquel Fern√°ndez

In this paper, we experiment with a recently proposed visual reasoning task dealing with quantities modeling the multimodal, contextually-dependent meaning of size adjectives (‘big’, ‘small’) and explore the impact of varying the training data on the learning behavior of a state-of-art system. In previous work, models have been shown to fail in generalizing to unseen adjective-noun combinations. Here, we investigate whether, and to what extent, seeing some of these cases during training helps a model understand the rule subtending the task, i.e., that being big implies being not small, and vice versa. We show that relatively few examples are enough to understand this relationship, and that developing a specific, mutually exclusive representation of size adjectives is beneficial to the task.

pdf bib
On the Role of Scene Graphs in Image Captioning
Dalin Wang | Daniel Beck | Trevor Cohn

Scene graphs represent semantic information in images, which can help image captioning system to produce more descriptive outputs versus using only the image as context. Recent captioning approaches rely on ad-hoc approaches to obtain graphs for images. However, those graphs introduce noise and it is unclear the effect of parser errors on captioning accuracy. In this work, we investigate to what extent scene graphs can help image captioning. Our results show that a state-of-the-art scene graph parser can boost performance almost as much as the ground truth graphs, showing that the bottleneck currently resides more on the captioning models than on the performance of the scene graph parser.

pdf bib
Understanding the Effect of Textual Adversaries in Multimodal Machine Translation
Koel Dutta Chowdhury | Desmond Elliott

It is assumed that multimodal machine translation systems are better than text-only systems at translating phrases that have a direct correspondence in the image. This assumption has been challenged in experiments demonstrating that state-of-the-art multimodal systems perform equally well in the presence of randomly selected images, but, more recently, it has been shown that masking entities from the source language sentence during training can help to overcome this problem. In this paper, we conduct experiments with both visual and textual adversaries in order to understand the role of incorrect textual inputs to such systems. Our results show that when the source language sentence contains mistakes, multimodal translation systems do not leverage the additional visual signal to produce the correct translation. We also find that the degradation of translation performance caused by textual adversaries is significantly higher than by visual adversaries.

pdf bib
Learning to request guidance in emergent language
Benjamin Kolb | Leon Lang | Henning Bartsch | Arwin Gansekoele | Raymond Koopmanschap | Leonardo Romor | David Speck | Mathijs Mul | Elia Bruni

Previous research into agent communication has shown that a pre-trained guide can speed up the learning process of an imitation learning agent. The guide achieves this by providing the agent with discrete messages in an emerged language about how to solve the task. We extend this one-directional communication by a one-bit communication channel from the learner back to the guide : It is able to ask the guide for help, and we limit the guidance by penalizing the learner for these requests. During training, the agent learns to control this gate based on its current observation. We find that the amount of requested guidance decreases over time and guidance is requested in situations of high uncertainty. We investigate the agent’s performance in cases of open and closed gates and discuss potential motives for the observed gating behavior.

pdf bib
Seeded self-play for language learning
Abhinav Gupta | Ryan Lowe | Jakob Foerster | Douwe Kiela | Joelle Pineau

How can we teach artificial agents to use human language flexibly to solve problems in real-world environments? We have an example of this in nature : human babies eventually learn to use human language to solve problems, and they are taught with an adult human-in-the-loop. Unfortunately, current machine learning methods (e.g. from deep reinforcement learning) are too data inefficient to learn language in this way. An outstanding goal is finding an algorithm with a suitable ‘language learning prior’ that allows it to learn human language, while minimizing the number of on-policy human interactions. In this paper, we propose to learn such a prior in simulation using an approach we call, Learning to Learn to Communicate (L2C). Specifically, in L2C we train a meta-learning agent in simulation to interact with populations of pre-trained agents, each with their own distinct communication protocol. Once the meta-learning agent is able to quickly adapt to each population of agents, it can be deployed in new populations, including populations speaking human language. Our key insight is that such populations can be obtained via self-play, after pre-training agents with imitation learning on a small amount of off-policy human language data. We call this latter technique Seeded Self-Play (S2P). Our preliminary experiments show that agents trained with L2C and S2P need fewer on-policy samples to learn a compositional language in a Lewis signaling game.