Marcos Zampieri


2022

pdf bib
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts
Luciana Benotti | Naoaki Okazaki | Yves Scherrer | Marcos Zampieri
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts

2021

pdf bib
Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects
Marcos Zampieri | Preslav Nakov | Nikola Ljubešić | Jörg Tiedemann | Yves Scherrer | Tommi Jauhiainen
Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects

pdf bib
Comparing Approaches to Dravidian Language IdentificationDravidian Language Identification
Tommi Jauhiainen | Tharindu Ranasinghe | Marcos Zampieri
Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects

This paper describes the submissions by team HWR to the Dravidian Language Identification (DLI) shared task organized at VarDial 2021 workshop. The DLI training set includes 16,674 YouTube comments written in Roman script containing code-mixed text with English and one of the three South Dravidian languages : Kannada, Malayalam, and Tamil. We submitted results generated using two models, a Naive Bayes classifier with adaptive language models, which has shown to obtain competitive performance in many language and dialect identification tasks, and a transformer-based model which is widely regarded as the state-of-the-art in a number of NLP tasks. Our first submission was sent in the closed submission track using only the training set provided by the shared task organisers, whereas the second submission is considered to be open as it used a pretrained model trained with external data. Our team attained shared second position in the shared task with the submission based on Naive Bayes. Our results reinforce the idea that deep learning methods are not as competitive in language identification related tasks as they are in many other text classification tasks.

pdf bib
Cross-lingual Offensive Language Identification for Low Resource Languages : The Case of MarathiMarathi
Saurabh Sampatrao Gaikwad | Tharindu Ranasinghe | Marcos Zampieri | Christopher Homan
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

The widespread presence of offensive language on social media motivated the development of systems capable of recognizing such content automatically. Apart from a few notable exceptions, most research on automatic offensive language identification has dealt with English. To address this shortcoming, we introduce MOLD, the Marathi Offensive Language Dataset. MOLD is the first dataset of its kind compiled for Marathi, thus opening a new domain for research in low-resource Indo-Aryan languages. We present results from several machine learning experiments on this dataset, including zero-short and other transfer learning experiments on state-of-the-art cross-lingual transformers from existing data in Bengali, English, and Hindi.

pdf bib
WLV-RIT at GermEval 2021 : Multitask Learning with Transformers to Detect Toxic, Engaging, and Fact-Claiming CommentsWLV-RIT at GermEval 2021: Multitask Learning with Transformers to Detect Toxic, Engaging, and Fact-Claiming Comments
Skye Morgan | Tharindu Ranasinghe | Marcos Zampieri
Proceedings of the GermEval 2021 Shared Task on the Identification of Toxic, Engaging, and Fact-Claiming Comments

This paper addresses the identification of toxic, engaging, and fact-claiming comments on social media. We used the dataset made available by the organizers of the GermEval2021 shared task containing over 3,000 manually annotated Facebook comments in German. Considering the relatedness of the three tasks, we approached the problem using large pre-trained transformer models and multitask learning. Our results indicate that multitask learning achieves performance superior to the more common single task learning approach in all three tasks. We submit our best systems to GermEval-2021 under the team name WLV-RIT.

pdf bib
MUDES : Multilingual Detection of Offensive SpansMUDES: Multilingual Detection of Offensive Spans
Tharindu Ranasinghe | Marcos Zampieri
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Demonstrations

The interest in offensive content identification in social media has grown substantially in recent years. Previous work has dealt mostly with post level annotations. However, identifying offensive spans is useful in many ways. To help coping with this important challenge, we present MUDES, a multilingual system to detect offensive spans in texts. MUDES features pre-trained models, a Python API for developers, and a user-friendly web-based interface. A detailed description of MUDES’ components is presented in this paper.

pdf bib
SemEval-2021 Task 1 : Lexical Complexity PredictionSemEval-2021 Task 1: Lexical Complexity Prediction
Matthew Shardlow | Richard Evans | Gustavo Henrique Paetzold | Marcos Zampieri
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

This paper presents the results and main findings of SemEval-2021 Task 1-Lexical Complexity Prediction. We provided participants with an augmented version of the CompLex Corpus (Shardlow et al. CompLex is an English multi-domain corpus in which words and multi-word expressions (MWEs) were annotated with respect to their complexity using a five point Likert scale. SemEval-2021 Task 1 featured two Sub-tasks : Sub-task 1 focused on single words and Sub-task 2 focused on MWEs. The competition attracted 198 teams in total, of which 54 teams submitted official runs on the test data to Sub-task 1 and 37 to Sub-task 2.

pdf bib
LCP-RIT at SemEval-2021 Task 1 : Exploring Linguistic Features for Lexical Complexity PredictionLCP-RIT at SemEval-2021 Task 1: Exploring Linguistic Features for Lexical Complexity Prediction
Abhinandan Tejalkumar Desai | Kai North | Marcos Zampieri | Christopher Homan
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

This paper describes team LCP-RIT’s submission to the SemEval-2021 Task 1 : Lexical Complexity Prediction (LCP). The task organizers provided participants with an augmented version of CompLex (Shardlow et al., 2020), an English multi-domain dataset in which words in context were annotated with respect to their complexity using a five point Likert scale. Our system uses logistic regression and a wide range of linguistic features (e.g. psycholinguistic features, n-grams, word frequency, POS tags) to predict the complexity of single words in this dataset. We analyze the impact of different linguistic features on the classification performance and we evaluate the results in terms of mean absolute error, mean squared error, Pearson correlation, and Spearman correlation.

2020

pdf bib
Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying
Ritesh Kumar | Atul Kr. Ojha | Bornini Lahiri | Marcos Zampieri | Shervin Malmasi | Vanessa Murdock | Daniel Kadar
Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying

pdf bib
SemEval-2020 Task 12 : Multilingual Offensive Language Identification in Social Media (OffensEval 2020)SemEval-2020 Task 12: Multilingual Offensive Language Identification in Social Media (OffensEval 2020)
Marcos Zampieri | Preslav Nakov | Sara Rosenthal | Pepa Atanasova | Georgi Karadzhov | Hamdy Mubarak | Leon Derczynski | Zeses Pitenis | Çağrı Çöltekin
Proceedings of the Fourteenth Workshop on Semantic Evaluation

We present the results and the main findings of SemEval-2020 Task 12 on Multilingual Offensive Language Identification in Social Media (OffensEval-2020). The task included three subtasks corresponding to the hierarchical taxonomy of the OLID schema from OffensEval-2019, and it was offered in five languages : Arabic, Danish, English, Greek, and Turkish. OffensEval-2020 was one of the most popular tasks at SemEval-2020, attracting a large number of participants across all subtasks and languages : a total of 528 teams signed up to participate in the task, 145 teams submitted official runs on the test data, and 70 teams submitted system description papers.

pdf bib
Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects
Marcos Zampieri | Preslav Nakov | Nikola Ljubešić | Jörg Tiedemann | Yves Scherrer
Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects

pdf bib
Offensive Language Identification in GreekGreek
Zesis Pitenis | Marcos Zampieri | Tharindu Ranasinghe
Proceedings of the 12th Language Resources and Evaluation Conference

As offensive language has become a rising issue for online communities and social media platforms, researchers have been investigating ways of coping with abusive content and developing systems to detect its different types : cyberbullying, hate speech, aggression, etc. With a few notable exceptions, most research on this topic so far has dealt with English. This is mostly due to the availability of language resources for English. To address this shortcoming, this paper presents the first Greek annotated dataset for offensive language identification : the Offensive Greek Tweet Dataset (OGTD). OGTD is a manually annotated dataset containing 4,779 posts from Twitter annotated as offensive and not offensive. Along with a detailed description of the dataset, we evaluate several computational models trained and tested on this data.

2019

pdf bib
SemEval-2019 Task 6 : Identifying and Categorizing Offensive Language in Social Media (OffensEval)SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media (OffensEval)
Marcos Zampieri | Shervin Malmasi | Preslav Nakov | Sara Rosenthal | Noura Farra | Ritesh Kumar
Proceedings of the 13th International Workshop on Semantic Evaluation

We present the results and the main findings of SemEval-2019 Task 6 on Identifying and Categorizing Offensive Language in Social Media (OffensEval). The task was based on a new dataset, the Offensive Language Identification Dataset (OLID), which contains over 14,000 English tweets, and it featured three sub-tasks. In sub-task A, systems were asked to discriminate between offensive and non-offensive posts. In sub-task B, systems had to identify the type of offensive content in the post. Finally, in sub-task C, systems had to detect the target of the offensive posts. OffensEval attracted a large number of participants and it was one of the most popular tasks in SemEval-2019. In total, nearly 800 teams signed up to participate in the task and 115 of them submitted results, which are presented and analyzed in this report.

pdf bib
UTFPR at SemEval-2019 Task 5 : Hate Speech Identification with Recurrent Neural NetworksUTFPR at SemEval-2019 Task 5: Hate Speech Identification with Recurrent Neural Networks
Gustavo Henrique Paetzold | Marcos Zampieri | Shervin Malmasi
Proceedings of the 13th International Workshop on Semantic Evaluation

In this paper we revisit the problem of automatically identifying hate speech in posts from social media. We approach the task using a system based on minimalistic compositional Recurrent Neural Networks (RNN). We tested our approach on the SemEval-2019 Task 5 : Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter (HatEval) shared task dataset. The dataset made available by the HatEval organizers contained English and Spanish posts retrieved from Twitter annotated with respect to the presence of hateful content and its target. In this paper we present the results obtained by our system in comparison to the other entries in the shared task. Our system achieved competitive performance ranking 7th in sub-task A out of 62 systems in the English track.

pdf bib
Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects
Marcos Zampieri | Preslav Nakov | Shervin Malmasi | Nikola Ljubešić | Jörg Tiedemann | Ahmed Ali
Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects

pdf bib
UDSDFKI Submission to the WMT2019 CzechPolish Similar Language Translation Shared TaskUDSDFKI Submission to the WMT2019 Czech–Polish Similar Language Translation Shared Task
Santanu Pal | Marcos Zampieri | Josef van Genabith
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)

In this paper we present the UDS-DFKI system submitted to the Similar Language Translation shared task at WMT 2019. The first edition of this shared task featured data from three pairs of similar languages : Czech and Polish, Hindi and Nepali, and Portuguese and Spanish. Participants could choose to participate in any of these three tracks and submit system outputs in any translation direction. We report the results obtained by our system in translating from Czech to Polish and comment on the impact of out-of-domain test data in the performance of our system. UDS-DFKI achieved competitive performance ranking second among ten teams in Czech to Polish translation.

2018

pdf bib
A Report on the Complex Word Identification Shared Task 2018
Seid Muhie Yimam | Chris Biemann | Shervin Malmasi | Gustavo Paetzold | Lucia Specia | Sanja Štajner | Anaïs Tack | Marcos Zampieri
Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications

We report the findings of the second Complex Word Identification (CWI) shared task organized as part of the BEA workshop co-located with NAACL-HLT’2018. The second CWI shared task featured multilingual and multi-genre datasets divided into four tracks : English monolingual, German monolingual, Spanish monolingual, and a multilingual track with a French test set, and two tasks : binary classification and probabilistic classification. A total of 12 teams submitted their results in different task / track combinations and 11 of them wrote system description papers that are referred to in this report and appear in the BEA workshop proceedings.

pdf bib
A Portuguese Native Language Identification DatasetPortuguese Native Language Identification Dataset
Iria del Río Gayo | Marcos Zampieri | Shervin Malmasi
Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications

In this paper we present NLI-PT, the first Portuguese dataset compiled for Native Language Identification (NLI), the task of identifying an author’s first language based on their second language writing. The dataset includes 1,868 student essays written by learners of European Portuguese, native speakers of the following L1s : Chinese, English, Spanish, German, Russian, French, Japanese, Italian, Dutch, Tetum, Arabic, Polish, Korean, Romanian, and Swedish. NLI-PT includes the original student text and four different types of annotation : POS, fine-grained POS, constituency parses, and dependency parses. NLI-PT can be used not only in NLI but also in research on several topics in the field of Second Language Acquisition and educational NLP. We discuss possible applications of this dataset and present the results obtained for the first lexical baseline system for Portuguese NLI.

pdf bib
Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)
Marcos Zampieri | Preslav Nakov | Nikola Ljubešić | Jörg Tiedemann | Shervin Malmasi | Ahmed Ali
Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)

pdf bib
A Neural Approach to Language Variety Translation
Marta R. Costa-jussà | Marcos Zampieri | Santanu Pal
Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)

In this paper we present the first neural-based machine translation system trained to translate between standard national varieties of the same language. We take the pair Brazilian-European Portuguese as an example and compare the performance of this method to a phrase-based statistical machine translation system. We report a performance improvement of 0.9 BLEU points in translating from European to Brazilian Portuguese and 0.2 BLEU points when translating in the opposite direction. We also carried out a human evaluation experiment with native speakers of Brazilian Portuguese which indicates that humans prefer the output produced by the neural-based system in comparison to the statistical system.

pdf bib
Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018)
Ritesh Kumar | Atul Kr. Ojha | Marcos Zampieri | Shervin Malmasi
Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018)

pdf bib
Benchmarking Aggression Identification in Social Media
Ritesh Kumar | Atul Kr. Ojha | Shervin Malmasi | Marcos Zampieri
Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018)

In this paper, we present the report and findings of the Shared Task on Aggression Identification organised as part of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-1) at COLING 2018. The task was to develop a classifier that could discriminate between Overtly Aggressive, Covertly Aggressive, and Non-aggressive texts. For this task, the participants were provided with a dataset of 15,000 aggression-annotated Facebook Posts and Comments each in Hindi (in both Roman and Devanagari script) and English for training and validation. For testing, two different sets-one from Facebook and another from a different social media-were provided. A total of 130 teams registered to participate in the task, 30 teams submitted their test runs, and finally 20 teams also sent their system description paper which are included in the TRAC workshop proceedings. The best system obtained a weighted F-score of 0.64 for both Hindi and English on the Facebook test sets, while the best scores on the surprise set were 0.60 and 0.50 for English and Hindi respectively. The results presented in this report depict how challenging the task is. The positive response from the community and the great levels of participation in the first edition of this shared task also highlights the interest in this topic.

pdf bib
Classifying Patent Applications with Ensemble Methods
Fernando Benites | Shervin Malmasi | Marcos Zampieri
Proceedings of the Australasian Language Technology Association Workshop 2018

We present methods for the automatic classification of patent applications using an annotated dataset provided by the organizers of the ALTA 2018 shared task-Classifying Patent Applications. The goal of the task is to use computational methods to categorize patent applications according to a coarse-grained taxonomy of eight classes based on the International Patent Classification (IPC). We tested a variety of approaches for this task and the best results, 0.778 micro-averaged F1-Score, were achieved by SVM ensembles using a combination of words and characters as features. Our team, BMZ, was ranked first among 14 teams in the competition.

2017

pdf bib
Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)
Preslav Nakov | Marcos Zampieri | Nikola Ljubešić | Jörg Tiedemann | Shevin Malmasi | Ahmed Ali
Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)

pdf bib
Findings of the VarDial Evaluation Campaign 2017VarDial Evaluation Campaign 2017
Marcos Zampieri | Shervin Malmasi | Nikola Ljubešić | Preslav Nakov | Ahmed Ali | Jörg Tiedemann | Yves Scherrer | Noëmi Aepli
Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)

We present the results of the VarDial Evaluation Campaign on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects, which we organized as part of the fourth edition of the VarDial workshop at EACL’2017. This year, we included four shared tasks : Discriminating between Similar Languages (DSL), Arabic Dialect Identification (ADI), German Dialect Identification (GDI), and Cross-lingual Dependency Parsing (CLP). A total of 19 teams submitted runs across the four tasks, and 15 of them wrote system description papers.

pdf bib
German Dialect Identification in Interview TranscriptionsGerman Dialect Identification in Interview Transcriptions
Shervin Malmasi | Marcos Zampieri
Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)

This paper presents three systems submitted to the German Dialect Identification (GDI) task at the VarDial Evaluation Campaign 2017. The task consists of training models to identify the dialect of Swiss-German speech transcripts. The dialects included in the GDI dataset are Basel, Bern, Lucerne, and Zurich. The three systems we submitted are based on : a plurality ensemble, a mean probability ensemble, and a meta-classifier trained on character and word n-grams. The best results were obtained by the meta-classifier achieving 68.1 % accuracy and 66.2 % F1-score, ranking first among the 10 teams which participated in the GDI shared task.

pdf bib
Arabic Dialect Identification Using iVectors and ASR TranscriptsArabic Dialect Identification Using iVectors and ASR Transcripts
Shervin Malmasi | Marcos Zampieri
Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)

This paper presents the systems submitted by the MAZA team to the Arabic Dialect Identification (ADI) shared task at the VarDial Evaluation Campaign 2017. The goal of the task is to evaluate computational models to identify the dialect of Arabic utterances using both audio and text transcriptions. The ADI shared task dataset included Modern Standard Arabic (MSA) and four Arabic dialects : Egyptian, Gulf, Levantine, and North-African. The three systems submitted by MAZA are based on combinations of multiple machine learning classifiers arranged as (1) voting ensemble ; (2) mean probability ensemble ; (3) meta-classifier. The best results were obtained by the meta-classifier achieving 71.7 % accuracy, ranking second among the six teams which participated in the ADI shared task.

pdf bib
Native Language Identification on Text and Speech
Marcos Zampieri | Alina Maria Ciobanu | Liviu P. Dinu
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications

This paper presents an ensemble system combining the output of multiple SVM classifiers to native language identification (NLI). The system was submitted to the NLI Shared Task 2017 fusion track which featured students essays and spoken responses in form of audio transcriptions and iVectors by non-native English speakers of eleven native languages. Our system competed in the challenge under the team name ZCD and was based on an ensemble of SVM classifiers trained on character n-grams achieving 83.58 % accuracy and ranking 3rd in the shared task.

pdf bib
Complex Word Identification : Challenges in Data Annotation and System Performance
Marcos Zampieri | Shervin Malmasi | Gustavo Paetzold | Lucia Specia
Proceedings of the 4th Workshop on Natural Language Processing Techniques for Educational Applications (NLPTEA 2017)

This paper revisits the problem of complex word identification (CWI) following up the SemEval CWI shared task. We use ensemble classifiers to investigate how well computational methods can discriminate between complex and non-complex words. Furthermore, we analyze the classification performance to understand what makes lexical complexity challenging. Our findings show that most systems performed poorly on the SemEval CWI dataset, and one of the reasons for that is the way in which human annotation was performed.