Piek Vossen


2021

Variation in framing as a function of temporal reporting distance
Levi Remijnse | Marten Postma | Piek Vossen
Proceedings of the 14th International Conference on Computational Semantics (IWCS)

In this paper, we measure variation in framing as a function of foregrounding and backgrounding in a co-referential corpus with a range of temporal distance. In one type of experiment, frame-annotated corpora grouped under event types were contrasted, resulting in a ranking of frames with typicality rates. In contrasting between publication dates, a different ranking of frames emerged for documents that are close to or far from the event instance. In the second type of analysis, we trained a diagnostic classifier with frame occurrences in order to let it differentiate documents based on their temporal distance class (close to or far from the event instance). The classifier performs above chance and outperforms models with words.

Batavia asked for advice. Pretrained language models for Named Entity Recognition in historical texts.
Sophie I. Arnoult | Lodewijk Petram | Piek Vossen
Proceedings of the 5th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

Pretrained language models like BERT have advanced the state of the art for many NLP tasks. For resource-rich languages, one has the choice between a number of language-specific models, while multilingual models are also worth considering. These models are well known for their crosslingual performance, but have also shown competitive in-language performance on some tasks. We consider monolingual and multilingual models from the perspective of historical texts, and in particular for texts enriched with editorial notes: how do language models deal with the historical and editorial content in these texts? We present a new Named Entity Recognition dataset for Dutch based on 17th and 18th century United East India Company (VOC) reports extended with modern editorial notes. Our experiments with multilingual and Dutch pretrained language models confirm the crosslingual abilities of multilingual models while showing that all language models can leverage mixed-variant data. In particular, language models successfully incorporate notes for the prediction of entities in historical texts. We also find that multilingual models outperform monolingual models on our data, but that this superiority is linked to the task at hand: multilingual models lose their advantage when confronted with more semantical tasks.

EMISSOR: A platform for capturing multimodal interactions as Episodic Memories and Interpretations with Situated Scenario-based Ontological References
Selene Baez Santamaria | Thomas Baier | Taewoon Kim | Lea Krause | Jaap Kruijt | Piek Vossen
Proceedings of the 1st Workshop on Multimodal Semantic Representations (MMSR)

We present EMISSOR: a platform to capture multimodal interactions as recordings of episodic experiences with explicit referential interpretations that also yield an episodic Knowledge Graph (eKG). The platform stores streams of multiple modalities as parallel signals. Each signal is segmented and annotated independently with interpretation. Annotations are eventually mapped to explicit identities and relations in the eKG. As we ground signal segments from different modalities to the same instance representations, we also ground different modalities across each other. Unique to our eKG is that it accepts different interpretations across modalities, sources and experiences and supports reasoning over conflicting information and uncertainties that may result from multimodal experiences. EMISSOR can record and annotate experiments in virtual and real-world, combine data, evaluate system behavior and their performance for preset goals but also model the accumulation of knowledge and interpretations in the Knowledge Graph as a result of these episodic experiences.

Proceedings of the 11th Global Wordnet Conference
Piek Vossen | Christiane Fellbaum
Proceedings of the 11th Global Wordnet Conference

2020

Annotating Perspectives on Vaccination
Roser Morante | Chantal van Son | Isa Maks | Piek Vossen
Proceedings of the 12th Language Resources and Evaluation Conference

In this paper we present the Vaccination Corpus, a corpus of texts related to the online vaccination debate that has been annotated with three layers of information about perspectives: attribution, claims and opinions. Additionally, events related to the vaccination debate are also annotated. The corpus contains 294 documents from the Internet which reflect different views on vaccinations. It has been compiled to study the language of online debates, with the final goal of experimenting with methodologies to extract and contrast perspectives in the framework of the vaccination debate.

2019

Proceedings of the 10th Global Wordnet Conference
Piek Vossen | Christiane Fellbaum
Proceedings of the 10th Global Wordnet Conference

Towards interpretable, data-derived distributional meaning representations for reasoning: A dataset of properties and concepts
Pia Sommerauer | Antske Fokkens | Piek Vossen
Proceedings of the 10th Global Wordnet Conference

This paper proposes a framework for investigating which types of semantic properties are represented by distributional data. The core of our framework consists of relations between concepts and properties. We provide hypotheses on which properties are reflected in distributional data or not based on the type of relation. We outline strategies for creating a dataset of positive and negative examples for various semantic properties, which cannot easily be separated on the basis of general similarity (e.g. fly: seagull, penguin). This way, a distributional model can only distinguish between positive and negative examples through evidence for a target property. Once completed, this dataset can be used to test our hypotheses and work towards data-derived interpretable representations.

2018

Systematic Study of Long Tail Phenomena in Entity Linking
Filip Ilievski | Piek Vossen | Stefan Schlobach
Proceedings of the 27th International Conference on Computational Linguistics

State-of-the-art entity linkers achieve high accuracy scores with probabilistic methods. However, these scores should be considered in relation to the properties of the datasets they are evaluated on. Until now, there has not been a systematic investigation of the properties of entity linking datasets and their impact on system performance. In this paper we report on a series of hypotheses regarding the long tail phenomena in entity linking datasets, their interaction, and their impact on system performance. Our systematic study of these hypotheses shows that evaluation datasets mainly capture head entities and only incidentally cover data from the tail, thus encouraging systems to overfit to popular/frequent and non-ambiguous cases. We find the most difficult cases of entity linking among the infrequent candidates of ambiguous forms. With our findings, we hope to inspire future designs of both entity linking systems and evaluation datasets. To support this goal, we provide a list of recommended actions for better inclusion of tail cases.

Measuring the Diversity of Automatic Image Descriptions
Emiel van Miltenburg | Desmond Elliott | Piek Vossen
Proceedings of the 27th International Conference on Computational Linguistics

Automatic image description systems typically produce generic sentences that only make use of a small subset of the vocabulary available to them. In this paper, we consider the production of generic descriptions as a lack of diversity in the output, which we quantify using established metrics and two new metrics that frame image description as a word recall task. This framing allows us to evaluate system performance on the head of the vocabulary, as well as on the long tail, where system performance degrades. We use these metrics to examine the diversity of the sentences generated by nine state-of-the-art systems on the MS COCO data set. We find that the systems trained with maximum likelihood objectives produce less diverse output than those trained with additional adversarial objectives. However, the adversarially-trained models only produce more types from the head of the vocabulary and not the tail. Besides vocabulary-based methods, we also look at the compositional capacity of the systems, specifically their ability to create compound nouns and prepositional phrases of different lengths. We conclude that there is still much room for improvement, and offer a toolkit to measure progress towards the goal of generating more diverse image descriptions.

Scoring and Classifying Implicit Positive Interpretations: A Challenge of Class Imbalance
Chantal van Son | Roser Morante | Lora Aroyo | Piek Vossen
Proceedings of the 27th International Conference on Computational Linguistics

This paper reports on a reimplementation of a system on detecting implicit positive meaning from negated statements. In the original regression experiment, different positive interpretations per negation are scored according to their likelihood. We convert the scores to classes and report our results on both the regression and classification tasks. We show that a baseline taking the mean score or most frequent class is hard to beat because of class imbalance in the dataset. Our error analysis indicates that an approach that takes the information structure into account (i.e. which information is new or contrastive) may be promising, which requires looking beyond the syntactic and semantic characteristics of negated statements.

Proceedings of the 9th Global Wordnet Conference
Francis Bond | Piek Vossen | Christiane Fellbaum
Proceedings of the 9th Global Wordnet Conference

NewsReader at SemEval-2018 Task 5: Counting events by reasoning over event-centric-knowledge-graphs
Piek Vossen
Proceedings of The 12th International Workshop on Semantic Evaluation

In this paper, we describe the participation of the NewsReader system in the SemEval-2018 Task 5 on Counting Events and Participants in the Long Tail. NewsReader is a generic unsupervised text processing system that detects events with participants, time and place to generate Event Centric Knowledge Graphs (ECKGs). We minimally adapted these ECKGs to establish a baseline performance for the task. We first use the ECKGs to establish which documents report on the same incident and what event mentions are coreferential. Next, we aggregate ECKGs across coreferential mentions and use the aggregated knowledge to answer the questions of the task. Our participation tests the quality of NewsReader to create ECKGs, as well as the potential of ECKGs to establish event identity and reason over the result to answer the task queries.

Meaning_space at SemEval-2018 Task 10: Combining explicitly encoded knowledge with information extracted from word embeddings
Pia Sommerauer | Antske Fokkens | Piek Vossen
Proceedings of The 12th International Workshop on Semantic Evaluation

This paper presents the two systems submitted by the meaning space team in Task 10 of the SemEval competition 2018 entitled Capturing discriminative attributes. The systems consist of combinations of approaches exploiting explicitly encoded knowledge about concepts in WordNet and information encoded in distributional semantic vectors. Rather than aiming for high performance, we explore which kind of semantic knowledge is best captured by different methods. The results indicate that WordNet glosses on different levels of the hierarchy capture many attributes relevant for this task. In combination with exploiting word embedding similarities, this source of information yielded our best results. Our best performing system ranked 5th out of 13 final ranks. Our analysis yields insights into the different kinds of attributes represented by different sources of knowledge.

Proceedings of the Workshop Events and Stories in the News 2018
Tommaso Caselli | Ben Miller | Marieke van Erp | Piek Vossen | Martha Palmer | Eduard Hovy | Teruko Mitamura | David Caswell | Susan W. Brown | Claire Bonial
Proceedings of the Workshop Events and Stories in the News 2018

Talking about other people: an endless range of possibilities
Emiel van Miltenburg | Desmond Elliott | Piek Vossen
Proceedings of the 11th International Conference on Natural Language Generation

Image description datasets, such as Flickr30K and MS COCO, show a high degree of variation in the ways that crowd-workers talk about the world. Although this gives us a rich and diverse collection of data to work with, it also introduces uncertainty about how the world should be described. This paper shows the extent of this uncertainty in the PEOPLE-domain. We present a taxonomy of different ways to talk about other people. This taxonomy serves as a reference point to think about how other people should be described, and can be used to classify and compute statistics about labels applied to people.

2017

Proceedings of the Events and Stories in the News Workshop
Tommaso Caselli | Ben Miller | Marieke van Erp | Piek Vossen | Martha Palmer | Eduard Hovy | Teruko Mitamura | David Caswell
Proceedings of the Events and Stories in the News Workshop

The Event StoryLine Corpus: A New Benchmark for Causal and Temporal Relation Extraction
Tommaso Caselli | Piek Vossen
Proceedings of the Events and Stories in the News Workshop

This paper reports on the Event StoryLine Corpus (ESC) v1.0, a new benchmark dataset for temporal and causal relation detection. By developing this dataset, we also introduce a new task, StoryLine Extraction from news data, which aims at extracting and classifying events relevant for stories, from across news documents spread in time and clustered around a single seminal event or topic. In addition to describing the dataset, we also report on three baseline systems whose results show the complexity of the task and suggest directions for the development of more robust systems.

Storyteller: Visual Analytics of Perspectives on Rich Text Interpretations
Maarten van Meersbergen | Piek Vossen | Janneke van der Zwaan | Antske Fokkens | Willem van Hage | Inger Leemans | Isa Maks
Proceedings of the 2017 EMNLP Workshop: Natural Language Processing meets Journalism

Complexity of event data in texts makes it difficult to assess its content, especially when considering larger collections in which different sources report on the same or similar situations. We present a system that makes it possible to visually analyze complex event and emotion data extracted from texts. We show that we can abstract from different data models for events and emotions to a single data model that can show the complex relations in four dimensions. The visualization has been applied to analyze 1) dynamic developments in how people both conceive and express emotions in theater plays and 2) how stories are told from the perspective of their sources based on rich event data extracted from news or biographies.