Daniel Hershcovich


Challenges and Strategies in Cross-Cultural NLP
Daniel Hershcovich | Stella Frank | Heather Lent | Miryam de Lhoneux | Mostafa Abdou | Stephanie Brandl | Emanuele Bugliarello | Laura Cabello Piqueras | Ilias Chalkidis | Ruixiang Cui | Constanza Fierro | Katerina Margatina | Phillip Rust | Anders Søgaard
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Various efforts in the Natural Language Processing (NLP) community have been made to accommodate linguistic diversity and serve speakers of many different languages. However, it is important to acknowledge that speakers, and the content they produce and require, vary not just by language but also by culture. Although language and culture are tightly linked, there are important differences. Analogous to cross-lingual and multilingual NLP, cross-cultural and multicultural NLP considers these differences in order to better serve users of NLP systems. We propose a principled framework to frame these efforts, and survey existing and potential strategies.


Lexical Semantic Recognition
Nelson F. Liu | Daniel Hershcovich | Michael Kranzlein | Nathan Schneider
Proceedings of the 17th Workshop on Multiword Expressions (MWE 2021)

In lexical semantics, full-sentence segmentation and segment labeling of various phenomena are generally treated separately, despite their interdependence. We hypothesize that a unified lexical semantic recognition task is an effective way to encapsulate previously disparate styles of annotation, including multiword expression identification / classification and supersense tagging. Using the STREUSLE corpus, we train a neural CRF sequence tagger and evaluate its performance along various axes of annotation. As the label set generalizes that of previous tasks (PARSEME, DiMSUM), we additionally evaluate how well the model generalizes to those test sets, finding that it approaches or surpasses existing models despite training only on STREUSLE. Our work also establishes baseline models and evaluation metrics for integrated and accurate modeling of lexical semantics, facilitating future work in this area.

A Multilingual Benchmark for Probing Negation-Awareness with Minimal Pairs
Mareike Hartmann | Miryam de Lhoneux | Daniel Hershcovich | Yova Kementchedjhieva | Lukas Nielsen | Chen Qiu | Anders Søgaard
Proceedings of the 25th Conference on Computational Natural Language Learning

Negation is one of the most fundamental concepts in human cognition and language, and several natural language inference (NLI) probes have been designed to investigate pretrained language models’ ability to detect and reason with negation. However, the existing probing datasets are limited to English only, and do not enable controlled probing of performance in the absence or presence of negation. In response, we present a multilingual (English, Bulgarian, German, French and Chinese) benchmark collection of NLI examples that are grammatical and correctly labeled, as a result of manual inspection and reformulation. We use the benchmark to probe the negation-awareness of multilingual language models and find that models that correctly predict examples with negation cues often fail to correctly predict their counter-examples without negation cues, even when the cues are irrelevant for semantic inference.

How far can we get with one GPU in 100 hours? CoAStaL at MultiIndicMT Shared Task
Rahul Aralikatte | Héctor Ricardo Murrieta Bello | Miryam de Lhoneux | Daniel Hershcovich | Marcel Bollmann | Anders Søgaard
Proceedings of the 8th Workshop on Asian Translation (WAT2021)

This work shows that competitive translation results can be obtained in a constrained setting by incorporating the latest advances in memory and compute optimization. We train and evaluate large multilingual translation models using a single GPU for a maximum of 100 hours and get within 4-5 BLEU points of the top submission on the leaderboard. We also benchmark standard baselines on the PMI corpus and re-discover well-known shortcomings of translation systems and metrics.


Proceedings of the CoNLL 2020 Shared Task: Cross-Framework Meaning Representation Parsing
Stephan Oepen | Omri Abend | Lasha Abzianidze | Johan Bos | Jan Hajič | Daniel Hershcovich | Bin Li | Tim O'Gorman | Nianwen Xue | Daniel Zeman
Proceedings of the CoNLL 2020 Shared Task: Cross-Framework Meaning Representation Parsing

HUJI-KU at MRP 2020: Two Transition-based Neural Parsers
Ofir Arviv | Ruixiang Cui | Daniel Hershcovich
Proceedings of the CoNLL 2020 Shared Task: Cross-Framework Meaning Representation Parsing

This paper describes the HUJI-KU system submission to the shared task on Cross-Framework Meaning Representation Parsing (MRP) at the 2020 Conference for Computational Language Learning (CoNLL), employing TUPA and the HIT-SCIR parser, which were, respectively, the baseline system and winning system in the 2019 MRP shared task. Both are transition-based parsers using BERT contextualized embeddings. We generalized TUPA to support the newly-added MRP frameworks and languages, and experimented with multitask learning with the HIT-SCIR parser. We reached 4th place in both the cross-framework and cross-lingual tracks.

Comparison by Conversion: Reverse-Engineering UCCA from Syntax and Lexical Semantics
Daniel Hershcovich | Nathan Schneider | Dotan Dvir | Jakob Prange | Miryam de Lhoneux | Omri Abend
Proceedings of the 28th International Conference on Computational Linguistics

Building robust natural language understanding systems will require a clear characterization of whether and how various linguistic meaning representations complement each other. To perform a systematic comparative analysis, we evaluate the mapping between meaning representations from different frameworks using two complementary methods: (i) a rule-based converter, and (ii) a supervised delexicalized parser that parses to one framework using only information from the other as features. We apply these methods to convert the STREUSLE corpus (with syntactic and lexical semantic annotations) to UCCA (a graph-structured full-sentence meaning representation). Both methods yield surprisingly accurate target representations, close to fully supervised UCCA parser quality, indicating that UCCA annotations are partially redundant with STREUSLE annotations. Despite this substantial convergence between frameworks, we find several important areas of divergence.


SemEval-2019 Task 1: Cross-lingual Semantic Parsing with UCCA
Daniel Hershcovich | Zohar Aizenbud | Leshem Choshen | Elior Sulem | Ari Rappoport | Omri Abend
Proceedings of the 13th International Workshop on Semantic Evaluation

We present the SemEval 2019 shared task on Universal Conceptual Cognitive Annotation (UCCA) parsing in English, German and French, and discuss the participating systems and results. UCCA is a cross-linguistically applicable framework for semantic representation, which builds on extensive typological work and supports rapid annotation. UCCA poses a challenge for existing parsing techniques, as it exhibits reentrancy (resulting in DAG structures), discontinuous structures and non-terminal nodes corresponding to complex semantic units. The shared task has yielded improvements over the state-of-the-art baseline in all languages and settings. Full results can be found on the task’s website: https://competitions.codalab.org/competitions/19160

Syntactic Interchangeability in Word Embedding Models
Daniel Hershcovich | Assaf Toledo | Alon Halfon | Noam Slonim
Proceedings of the 3rd Workshop on Evaluating Vector Space Representations for NLP

Nearest neighbors in word embedding models are commonly observed to be semantically similar, but the relations between them can vary greatly. We investigate the extent to which word embedding models preserve syntactic interchangeability, as reflected by distances between word vectors, and the effect of hyper-parameters, context window size in particular. We use part of speech (POS) as a proxy for syntactic interchangeability, as generally speaking, words with the same POS are syntactically valid in the same contexts. We also investigate the relationship between interchangeability and similarity as judged by commonly-used word similarity benchmarks, and correlate the result with the performance of word embedding models on these benchmarks. Our results will inform future research and applications in the selection of word embedding models, suggesting a principle for an appropriate selection of the context window size parameter depending on the use-case.

Content Differences in Syntactic and Semantic Representation
Daniel Hershcovich | Omri Abend | Ari Rappoport
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Syntactic analysis plays an important role in semantic parsing, but the nature of this role remains a topic of ongoing debate. The debate has been constrained by the scarcity of empirical comparative studies between syntactic and semantic schemes, which hinders the development of parsing methods informed by the details of target schemes and constructions. We target this gap, and take Universal Dependencies (UD) and UCCA as a test case. After abstracting away from differences of convention or formalism, we find that most content divergences can be ascribed to: (1) UCCA’s distinction between a Scene and a non-Scene; (2) UCCA’s distinction between primary relations, secondary ones and participants; (3) different treatment of multi-word expressions; and (4) different treatment of inter-clause linkage. We further discuss the long tail of cases where the two schemes take markedly different approaches. Finally, we show that the proposed comparison methodology can be used for fine-grained evaluation of UCCA parsing, highlighting both challenges and potential sources for improvement. The substantial differences between the schemes suggest that semantic parsers are likely to benefit downstream text understanding applications beyond their syntactic counterparts.

The Language of Legal and Illegal Activity on the Darknet
Leshem Choshen | Dan Eldad | Daniel Hershcovich | Elior Sulem | Omri Abend
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

The non-indexed parts of the Internet (the Darknet) have become a haven for both legal and illegal anonymous activity. Given the magnitude of these networks, scalably monitoring their activity necessarily relies on automated tools, and notably on NLP tools. However, little is known about the characteristics of texts communicated through the Darknet, or about how well off-the-shelf NLP tools perform on this domain. This paper tackles this gap and performs an in-depth investigation of the characteristics of legal and illegal text in the Darknet, comparing it to a clear net website with similar content as a control condition. Taking drugs-related websites as a test case, we find that texts for selling legal and illegal drugs have several linguistic characteristics that distinguish them from one another, as well as from the control condition, among them the distribution of POS tags, and the coverage of their named entities in Wikipedia.

Proceedings of the Shared Task on Cross-Framework Meaning Representation Parsing at the 2019 Conference on Natural Language Learning
Stephan Oepen | Omri Abend | Jan Hajic | Daniel Hershcovich | Marco Kuhlmann | Tim O’Gorman | Nianwen Xue
Proceedings of the Shared Task on Cross-Framework Meaning Representation Parsing at the 2019 Conference on Natural Language Learning

MRP 2019: Cross-Framework Meaning Representation Parsing
Stephan Oepen | Omri Abend | Jan Hajic | Daniel Hershcovich | Marco Kuhlmann | Tim O’Gorman | Nianwen Xue | Jayeol Chun | Milan Straka | Zdenka Uresova
Proceedings of the Shared Task on Cross-Framework Meaning Representation Parsing at the 2019 Conference on Natural Language Learning

The 2019 Shared Task at the Conference for Computational Language Learning (CoNLL) was devoted to Meaning Representation Parsing (MRP) across frameworks. Five distinct approaches to the representation of sentence meaning in the form of directed graphs were represented in the training and evaluation data for the task, packaged in a uniform abstract graph representation and serialization. The task received submissions from eighteen teams, of which five do not participate in the official ranking because they arrived after the closing deadline, made use of additional training data, or involved one of the task co-organizers. All technical information regarding the task, including system submissions, official results, and links to supporting resources and software, is available from the task web site at: http://mrp.nlpl.eu


Multitask Parsing Across Semantic Representations
Daniel Hershcovich | Omri Abend | Ari Rappoport
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The ability to consolidate information of different types is at the core of intelligence, and has tremendous practical value in allowing learning for one task to benefit from generalizations learned for others. In this paper we tackle the challenging task of improving semantic parsing performance, taking UCCA parsing as a test case, and AMR, SDP and Universal Dependencies (UD) parsing as auxiliary tasks. We experiment on three languages, using a uniform transition-based system and learning architecture for all parsing tasks. Despite notable conceptual, formal and domain differences, we show that multitask learning significantly improves UCCA parsing in both in-domain and out-of-domain settings.