Natalia Grabar


2020

pdf bib
Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). Atelier DÉfi Fouille de Textes
Rémi Cardon | Natalia Grabar | Cyril Grouin | Thierry Hamon
Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). Atelier DÉfi Fouille de Textes

pdf bib
Reducing the Search Space for Parallel Sentences in Comparable Corpora
Rémi Cardon | Natalia Grabar
Proceedings of the 13th Workshop on Building and Using Comparable Corpora

This paper describes and evaluates simple techniques for reducing the research space for parallel sentences in monolingual comparable corpora. Initially, when searching for parallel sentences between two comparable documents, all the possible sentence pairs between the documents have to be considered, which introduces a great degree of imbalance between parallel pairs and non-parallel pairs. This is a problem because even with a high performing algorithm, a lot of noise will be present in the extracted results, thus introducing a need for an extensive and costly manual check phase. We work on a manually annotated subset obtained from a French comparable corpus and show how we can drastically reduce the number of sentence pairs that have to be fed to a classifier so that the results can be manually handled.

pdf bib
A French Corpus for Semantic SimilarityFrench Corpus for Semantic Similarity
Rémi Cardon | Natalia Grabar
Proceedings of the 12th Language Resources and Evaluation Conference

Semantic similarity is an area of Natural Language Processing that is useful for several downstream applications, such as machine translation, natural language generation, information retrieval, or question answering. The task consists in assessing the extent to which two sentences express or do not express the same meaning. To do so, corpora with graded pairs of sentences are required. The grade is positioned on a given scale, usually going from 0 (completely unrelated) to 5 (equivalent semantics). In this work, we introduce such a corpus for French, the first that we know of. It is comprised of 1,010 sentence pairs with grades from five annotators. We describe the annotation process, analyse these data, and perform a few experiments for the automatic grading of semantic similarity.

2019

pdf bib
RNN Embeddings for Identifying Difficult to Understand Medical WordsRNN Embeddings for Identifying Difficult to Understand Medical Words
Hanna Pylieva | Artem Chernodub | Natalia Grabar | Thierry Hamon
Proceedings of the 18th BioNLP Workshop and Shared Task

Patients and their families often require a better understanding of medical information provided by doctors. We currently address this issue by improving the identification of difficult to understand medical words. We introduce novel embeddings received from RNN-FrnnMUTE (French RNN Medical Understandability Text Embeddings) which allow to reach up to 87.0 F1 score in identification of difficult words. We also note that adding pre-trained FastText word embeddings to the feature set substantially improves the performance of the model which classifies words according to their difficulty. We study the generalizability of different models through three cross-validation scenarios which allow testing classifiers in real-world conditions : understanding of medical words by new users, and classification of new unseen words by the automatic models. The RNN-FrnnMUTE embeddings and the categorization code are being made available for the research.