Sina Zarrieß


2021

Decoupling Pragmatics: Discriminative Decoding for Referring Expression Generation
Simeon Schüz | Sina Zarrieß
Proceedings of the Reasoning and Interaction Conference (ReInAct 2021)

The shift to neural models in Referring Expression Generation (REG) has enabled more natural set-ups, but at the cost of interpretability. We argue that integrating pragmatic reasoning into the inference of context-agnostic generation models could reconcile traits of traditional and neural REG, as this offers a separation between context-independent, literal information and pragmatic adaptation to context. With this in mind, we apply existing decoding strategies from discriminative image captioning to REG and evaluate them in terms of pragmatic informativity, likelihood of the ground-truth annotations, and linguistic diversity. Our results show general effectiveness, but a relatively small gain in informativity, raising important questions for REG in general.
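
A minimal sketch of the kind of discriminative decoding applied in this paper, in the emitter-suppressor style known from discriminative image captioning: candidates are scored by how much better they fit the target region than the distractor regions. The `log_prob` interface, the re-ranking setup and the weight `lam` are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical emitter-suppressor style re-scoring for REG.
# `log_prob(expression, region)` is an assumed interface to a
# context-agnostic generation model.

def discriminative_score(expression, target, distractors, log_prob, lam=0.5):
    """Reward fit to the target, penalize fit to the distractors."""
    target_lp = log_prob(expression, target)
    # An expression that is also likely for a distractor is uninformative.
    distractor_lp = max(log_prob(expression, d) for d in distractors)
    return target_lp - lam * distractor_lp

def rerank(candidates, target, distractors, log_prob, lam=0.5):
    """Pick the candidate expression that best discriminates the target."""
    return max(candidates,
               key=lambda e: discriminative_score(e, target, distractors,
                                                  log_prob, lam))
```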

Proceedings of the 14th International Conference on Computational Semantics (IWCS)
Sina Zarrieß | Johan Bos | Rik van Noord | Lasha Abzianidze

What Did This Castle Look like before? Exploring Referential Relations in Naturally Occurring Multimodal Texts
Ronja Utescher | Sina Zarrieß
Proceedings of the Third Workshop on Beyond Vision and LANguage: inTEgrating Real-world kNowledge (LANTERN)

Multi-modal texts are abundant and diverse in structure, yet Language & Vision research on these naturally occurring texts has mostly focused on genres that are comparatively light on text, such as tweets. In this paper, we discuss the challenges and potential benefits of an L&V framework that explicitly models referential relations, taking Wikipedia articles about buildings as an example. We briefly survey existing related tasks in L&V and propose multi-modal information extraction as a general direction for future research.

Challenges in Designing Natural Language Interfaces for Complex Visual Models
Henrik Voigt | Monique Meuschke | Kai Lawonn | Sina Zarrieß
Proceedings of the First Workshop on Bridging Human–Computer Interaction and Natural Language Processing

Intuitive interaction with visual models is becoming an increasingly important task in the field of Visualization (VIS), and verbal interaction represents a significant aspect of it. Vice versa, modeling verbal interaction in visual environments is a major trend in ongoing NLP research. To date, however, research on Language & Vision mostly happens at the intersection of NLP and Computer Vision (CV), and much less at the intersection of NLP and Visualization, which is an important area of Human-Computer Interaction (HCI). This paper presents a brief survey of recent work on interactive tasks and set-ups in NLP and Visualization. We discuss the respective methods, point out interesting gaps, and conclude by suggesting neural, visually grounded dialogue modeling as a promising direction for natural language interfaces to visual models.

2020

From Before to After: Generating Natural Language Instructions from Image Pairs in a Simple Visual Domain
Robin Rojowiec | Jana Götze | Philipp Sadler | Henrik Voigt | Sina Zarrieß | David Schlangen
Proceedings of the 13th International Conference on Natural Language Generation

While certain types of instructions can be compactly expressed via images, there are situations where one might want to verbalise them, for example when directing someone. We investigate the task of Instruction Generation from Before/After Image Pairs, which is to derive from images an instruction for effecting the implied change. For this, we make use of prior work on instruction following in a visual environment. We take an existing dataset, the BLOCKS data collected by Bisk et al. (2016), and investigate whether it is suitable for training an instruction generator as well. We find that it is, and investigate several simple baselines, taking these from the related task of image captioning. Through a series of experiments that simplify the task (by making image processing easier or completely side-stepping it, and by creating template-based targeted instructions), we investigate areas for improvement. We find that captioning models get some way towards solving the task, but have some difficulty with it, and future improvements must lie in the way the change is detected in the instruction.

2019

Tell Me More: A Dataset of Visual Scene Description Sequences
Nikolai Ilinykh | Sina Zarrieß | David Schlangen
Proceedings of the 12th International Conference on Natural Language Generation

We present a dataset consisting of what we call image description sequences, which are multi-sentence descriptions of the contents of an image. These descriptions were collected in a pseudo-interactive setting, where the describer was told to describe the given image to a listener who needs to identify the image within a set of images, and who successively asks for more information. As we show, this setup produced nicely structured data that, we think, will be useful for learning models capable of planning and realising such description discourses.

2018

The Task Matters: Comparing Image Captioning and Task-Based Dialogical Image Description
Nikolai Ilinykh | Sina Zarrieß | David Schlangen
Proceedings of the 11th International Conference on Natural Language Generation

Image captioning models are typically trained on data that is collected from people who are asked to describe an image, without being given any further task context. As we argue here, this context independence is likely to cause problems for transferring to task settings in which image description is bound by task demands. We demonstrate that careful design of data collection is required to obtain image descriptions which are contextually bounded to a particular meta-level task. As a task, we use MeetUp!, a text-based communication game where two players have the goal of finding each other in a visual environment. To reach this goal, the players need to describe images representing their current location. We analyse a dataset from this domain and show that the nature of image descriptions found in MeetUp! is diverse, dynamic and rich with phenomena that are not present in descriptions obtained through a simple image captioning task, which we ran for comparison.

Decoding Strategies for Neural Referring Expression Generation
Sina Zarrieß | David Schlangen
Proceedings of the 11th International Conference on Natural Language Generation

RNN-based sequence generation is now widely used in NLP and NLG (natural language generation). Most work focusses on how to train RNNs, even though decoding, too, is not necessarily straightforward: previous work on neural MT found that seq2seq models radically prefer short candidates, and proposed a number of beam search heuristics to deal with this. In this work, we assess decoding strategies for referring expression generation with neural models. Here, expression length is crucial: output should contain neither too much nor too little information in order to be pragmatically adequate. We find that most beam search heuristics developed for MT do not generalize well to referring expression generation (REG), and do not generally outperform greedy decoding. We observe that beam search heuristics for termination seem to override the model's knowledge of what a good stopping point is. Therefore, we also explore a recent approach called trainable decoding, which uses a small network to modify the RNN's hidden state for better decoding results. We find this approach to consistently outperform greedy decoding for REG.
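
For illustration, below is a minimal beam-search sketch with the length-normalization heuristic common in neural MT (a candidate's log-probability divided by length^alpha). With alpha > 0, longer candidates are penalized less, which counteracts the short-output preference in MT but, as described above, can override a REG model's own sense of a good stopping point. The `step` interface is an assumption for the sketch, not the paper's models.

```python
# Sketch: beam search with MT-style length normalization.
# `step(seq)` is an assumed interface yielding (token, logprob) pairs.

def beam_search(step, bos, eos, beam_size=5, max_len=20, alpha=0.0):
    def norm(logprob, length):
        return logprob / (length ** alpha)  # alpha = 0 recovers plain log-prob

    beams, finished = [([bos], 0.0)], []
    for _ in range(max_len):
        # Expand every live beam by every continuation.
        candidates = [(seq + [tok], lp + tok_lp)
                      for seq, lp in beams
                      for tok, tok_lp in step(seq)]
        candidates.sort(key=lambda c: norm(c[1], len(c[0])), reverse=True)
        beams = []
        for seq, lp in candidates[:beam_size]:
            # Sequences ending in EOS are terminated; others stay live.
            (finished if seq[-1] == eos else beams).append((seq, lp))
        if not beams:
            break
    pool = finished or beams
    return max(pool, key=lambda c: norm(c[1], len(c[0])))
```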

Being data-driven is not enough: Revisiting interactive instruction giving as a challenge for NLG
Sina Zarrieß | David Schlangen
Proceedings of the Workshop on NLG for Human–Robot Interaction

Modeling traditional NLG tasks with data-driven techniques has been a major focus of research in NLG in the past decade. We argue that existing modeling techniques are mostly tailored to textual data and are not sufficient to make NLG technology meet the requirements of agents which target fluid interaction and collaboration in the real world. We revisit interactive instruction giving as a challenge for data-driven NLG and, based on insights from previous GIVE challenges, propose that instruction giving should be addressed in a setting that involves visual grounding and spoken language. These basic design decisions will require NLG frameworks that are capable of monitoring their environment as well as timing and revising their verbal output. We believe that these are core capabilities for making NLG technology transferable to interactive systems.

2017

Obtaining referential word meanings from visual and distributional information: Experiments on object naming
Sina Zarrieß | David Schlangen
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We investigate object naming, which is an important sub-task of referring expression generation on real-world images. As opposed to mutually exclusive labels used in object recognition, object names are more flexible, subject to communicative preferences and semantically related to each other. Therefore, we investigate models of referential word meaning that link visual to lexical information which we assume to be given through distributional word embeddings. We present a model that learns individual predictors for object names that link visual and distributional aspects of word meaning during training. We show that this is particularly beneficial for zero-shot learning, as compared to projecting visual objects directly into the distributional space. In a standard object naming task, we find that different ways of combining lexical and visual information achieve very similar performance, though experiments on model combination suggest that they capture complementary aspects of referential meaning.
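
As a rough illustration of the cross-modal projection baseline mentioned in the abstract, one can fit a linear (ridge-regression) map from visual object features into a distributional embedding space and name objects by nearest-neighbor search. The shapes, variable names and naming procedure below are assumptions for the sketch, not the paper's exact setup.

```python
import numpy as np

# Sketch: project visual features into word-embedding space and name by
# nearest neighbor. Relevant to zero-shot naming, since the vocabulary may
# contain words never paired with visual training data.

def fit_projection(V, E, reg=1.0):
    """Ridge regression from visual features V (n, dv) to embeddings E (n, de)."""
    dv = V.shape[1]
    return np.linalg.solve(V.T @ V + reg * np.eye(dv), V.T @ E)  # (dv, de)

def name_object(v, W, vocab_vecs, vocab):
    """Project one visual feature vector and return the closest word."""
    e = v @ W
    sims = (vocab_vecs @ e) / (np.linalg.norm(vocab_vecs, axis=1)
                               * np.linalg.norm(e) + 1e-9)
    return vocab[int(np.argmax(sims))]
```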

Refer-iTTS: A System for Referring in Spoken Installments to Objects in Real-World Images
Sina Zarrieß | M. Soledad López Gambino | David Schlangen
Proceedings of the 10th International Conference on Natural Language Generation

Current referring expression generation systems mostly deliver their output as one-shot, written expressions. We present ongoing work on incremental generation of spoken expressions referring to objects in real-world images. This approach extends previous work using the words-as-classifiers model for generation. We implement this generator in an incremental dialogue processing framework such that we can exploit an existing interface to incremental text-to-speech synthesis. Our system generates and synthesizes referring expressions while continuously observing non-verbal user reactions.

Beyond On-hold Messages: Conversational Time-buying in Task-oriented Dialogue
Soledad López Gambino | Sina Zarrieß | David Schlangen
Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue

A common convention in graphical user interfaces is to indicate a wait state, for example while a program is preparing a response, through a changed cursor state or a progress bar. What should the analogue be in a spoken conversational system? To address this question, we set up an experiment in which a human information provider (IP) was given their information only in a delayed and incremental manner, which systematically created situations where the IP had the turn but could not provide task-related information. Our data analysis shows that 1) IPs bridge the gap until they can provide information by re-purposing a whole variety of task- and grounding-related communicative actions (e.g. echoing the user's request, signaling understanding, asserting partially relevant information), rather than being silent or explicitly asking for time (e.g. "please wait"), and that 2) IPs combine these actions productively to ensure an ongoing conversation. These results, we argue, indicate that natural conversational interfaces should also be able to manage their time flexibly using a variety of conversational resources.

Deriving continuous grounded meaning representations from referentially structured multimodal contexts
Sina Zarrieß | David Schlangen
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Corpora of referring expressions paired with their visual referents are a good source for learning word meanings directly grounded in visual representations. Here, we explore additional ways of extracting from them word representations linked to multi-modal context: through expressions that refer to the same object, and through expressions that refer to different objects in the same scene. We show that continuous meaning representations derived from these contexts capture complementary aspects of similarity, even if they do not outperform textual embeddings trained on very large amounts of raw text when tested on standard similarity benchmarks. We propose a new task for evaluating grounded meaning representations, the detection of potentially co-referential phrases, and show that it requires precise denotational representations of attribute meanings, which our method provides.
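
One simple way to picture such grounded representations, as a sketch under assumptions (the paper additionally exploits same-object and same-scene expression pairs, which this sketch omits): average the visual features of all referents a word was used for.

```python
import numpy as np

# Sketch: derive a grounded vector per word by averaging the visual
# features of its referents. `refexps` pairs each referring expression
# (a list of words) with the feature vector of its referent region.

def grounded_vectors(refexps):
    sums, counts = {}, {}
    for words, feats in refexps:
        feats = np.asarray(feats, dtype=float)
        for w in words:
            sums[w] = sums.get(w, 0.0) + feats
            counts[w] = counts.get(w, 0) + 1
    # Mean visual feature vector per word.
    return {w: sums[w] / counts[w] for w in sums}
```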

Is this a Child, a Girl or a Car? Exploring the Contribution of Distributional Similarity to Learning Referential Word Meanings
Sina Zarrieß | David Schlangen
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

There has recently been a lot of work trying to use images of referents of words for improving vector space meaning representations derived from text. We investigate the opposite direction, as it were, trying to improve visual word predictors that identify objects in images, by exploiting distributional similarity information during training. We show that for certain words (such as entry-level nouns or hypernyms), we can indeed learn better referential word meanings by taking into account their semantic similarity to other words. For other words, there is no or even a detrimental effect, compared to a learning setup that presents even semantically related objects as negative instances.
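
A minimal sketch of the kind of similarity-aware training discussed above: when collecting negative instances for a per-word object classifier, filter out objects whose names are distributionally similar to the target word. The embedding lookup, data layout and threshold are assumptions for illustration, not the paper's exact training setup.

```python
import numpy as np

# Sketch: similarity-aware negative sampling for per-word classifiers.
# `embedding` maps words to distributional vectors; `instances` is a list
# of (visual_features, object_name) pairs.

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def select_negatives(word, instances, embedding, threshold=0.6):
    """Keep as negatives only objects whose names are dissimilar to `word`,
    so that e.g. 'girl' objects are not negative evidence for 'child'."""
    w = embedding[word]
    return [(feats, name) for feats, name in instances
            if name != word and cosine(w, embedding[name]) < threshold]
```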