Animesh Mukherjee


2020

pdf bib
Code-Switching Patterns Can Be an Effective Route to Improve Performance of Downstream NLP Applications : A Case Study of Humour, Sarcasm and Hate Speech DetectionNLP Applications: A Case Study of Humour, Sarcasm and Hate Speech Detection
Srijan Bansal | Vishal Garimella | Ayush Suhane | Jasabanta Patro | Animesh Mukherjee
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

In this paper, we demonstrate how code-switching patterns can be utilised to improve various downstream NLP applications. In particular, we encode various switching features to improve humour, sarcasm and hate speech detection tasks. We believe that this simple linguistic observation can also be potentially helpful in improving other similar NLP applications.

2019

pdf bib
On the Compositionality Prediction of Noun Phrases using Poincar Embeddings
Abhik Jana | Dima Puzyrev | Alexander Panchenko | Pawan Goyal | Chris Biemann | Animesh Mukherjee
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

The compositionality degree of multiword expressions indicates to what extent the meaning of a phrase can be derived from the meaning of its constituents and their grammatical relations. Prediction of (non)-compositionality is a task that has been frequently addressed with distributional semantic models. We introduce a novel technique to blend hierarchical information with distributional information for predicting compositionality. In particular, we use hypernymy information of the multiword and its constituents encoded in the form of the recently introduced Poincar embeddings in addition to the distributional information to detect compositionality for noun phrases. Using a weighted average of the distributional similarity and a Poincar similarity function, we obtain consistent and substantial, statistically significant improvement across three gold standard datasets over state-of-the-art models based on distributional information only. Unlike traditional approaches that solely use an unsupervised setting, we have also framed the problem as a supervised task, obtaining comparable improvements. Further, we publicly release our Poincar embeddings, which are trained on the output of handcrafted lexical-syntactic patterns on a large corpus.

2018

pdf bib
WikiRef : Wikilinks as a route to recommending appropriate references for scientific Wikipedia pagesWikiRef: Wikilinks as a route to recommending appropriate references for scientific Wikipedia pages
Abhik Jana | Pranjal Kanojiya | Pawan Goyal | Animesh Mukherjee
Proceedings of the 27th International Conference on Computational Linguistics

The exponential increase in the usage of Wikipedia as a key source of scientific knowledge among the researchers is making it absolutely necessary to metamorphose this knowledge repository into an integral and self-contained source of information for direct utilization. Unfortunately, the references which support the content of each Wikipedia entity page, are far from complete. Why are the reference section ill-formed for most Wikipedia pages? Is this section edited as frequently as the other sections of a page? Can there be appropriate surrogates that can automatically enhance the reference section? In this paper, we propose a novel two step approach WikiRef that (i) leverages the wikilinks present in a scientific Wikipedia target page and, thereby, (ii) recommends highly relevant references to be included in that target page appropriately and automatically borrowed from the reference section of the wikilinks. In the first step, we build a classifier to ascertain whether a wikilink is a potential source of reference or not. In the following step, we recommend references to the target page from the reference section of the wikilinks that are classified as potential sources of references in the first step. We perform an extensive evaluation of our approach on datasets from two different domains Computer Science and Physics. For Computer Science we achieve a notably good performance with a precision@1 of 0.44 for reference recommendation as opposed to 0.38 obtained from the most competitive baseline. For the Physics dataset, we obtain a similar performance boost of 10 % with respect to the most competitive baseline.

pdf bib
Deep Learning for Social Media Health Text Classification
Santosh Tokala | Vaibhav Gambhir | Animesh Mukherjee
Proceedings of the 2018 EMNLP Workshop SMM4H: The 3rd Social Media Mining for Health Applications Workshop & Shared Task

This paper describes the systems developed for 1st and 2nd tasks of the 3rd Social Media Mining for Health Applications Shared Task at EMNLP 2018. The first task focuses on automatic detection of posts mentioning a drug name or dietary supplement, a binary classification. The second task is about distinguishing the tweets that present personal medication intake, possible medication intake and non-intake. We performed extensive experiments with various classifiers like Logistic Regression, Random Forest, SVMs, Gradient Boosted Decision Trees (GBDT) and deep learning architectures such as Long Short-Term Memory Networks (LSTM), jointed Convolutional Neural Networks (CNN) and LSTM architecture, and attention based LSTM architecture both at word and character level. We have also explored using various pre-trained embeddings like Global Vectors for Word Representation (GloVe), Word2Vec and task-specific embeddings learned using CNN-LSTM and LSTMs.

pdf bib
CL Scholar : The ACL Anthology Knowledge Graph MinerCL Scholar: The ACL Anthology Knowledge Graph Miner
Mayank Singh | Pradeep Dogga | Sohan Patro | Dhiraj Barnwal | Ritam Dutt | Rajarshi Haldar | Pawan Goyal | Animesh Mukherjee
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations

We present CL Scholar, the ACL Anthology knowledge graph miner to facilitate high-quality search and exploration of current research progress in the computational linguistics community. In contrast to previous works, periodically crawling, indexing and processing of new incoming articles is completely automated in the current system. CL Scholar utilizes both textual and network information for knowledge graph construction. As an additional novel initiative, CL Scholar supports more than 1200 scholarly natural language queries along with standard keyword-based search on constructed knowledge graph. It answers binary, statistical and list based natural language queries. The current system is deployed at. We also provide REST API support along with bulk download facility. Our code and data are available at.http://cnerg.iitkgp.ac.in/aclakg. We also provide REST API\n support along with bulk download facility. Our code and data are available\n at https://github.com/CLScholar.\n

2017

pdf bib
Adapting predominant and novel sense discovery algorithms for identifying corpus-specific sense differences
Binny Mathew | Suman Kalyan Maity | Pratip Sarkar | Animesh Mukherjee | Pawan Goyal
Proceedings of TextGraphs-11: the Workshop on Graph-based Methods for Natural Language Processing

Word senses are not static and may have temporal, spatial or corpus-specific scopes. Identifying such scopes might benefit the existing WSD systems largely. In this paper, while studying corpus specific word senses, we adapt three existing predominant and novel-sense discovery algorithms to identify these corpus-specific senses. We make use of text data available in the form of millions of digitized books and newspaper archives as two different sources of corpora and propose automated methods to identify corpus-specific word senses at various time points. We conduct an extensive and thorough human judgement experiment to rigorously evaluate and compare the performance of these approaches. Post adaptation, the output of the three algorithms are in the same format and the accuracy results are also comparable, with roughly 45-60 % of the reported corpus-specific senses being judged as genuine.

pdf bib
All that is English may be Hindi : Enhancing language identification through automatic ranking of the likeliness of word borrowing in social mediaEnglish may be Hindi: Enhancing language identification through automatic ranking of the likeliness of word borrowing in social media
Jasabanta Patro | Bidisha Samanta | Saurabh Singh | Abhipsa Basu | Prithwish Mukherjee | Monojit Choudhury | Animesh Mukherjee
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

n this paper, we present a set of computational methods to identify the likeliness of a word being borrowed, based on the signals from social media. In terms of Spearman’s correlation values, our methods perform more than two times better (0.62) in predicting the borrowing likeliness compared to the best performing baseline (0.26) reported in literature. Based on this likeliness estimate we asked annotators to re-annotate the language tags of foreign words in predominantly native contexts. In 88 % of cases the annotators felt that the foreign language tag should be replaced by native language tag, thus indicating a huge scope for improvement of automatic language identification systems.