Proceedings of SustaiNLP: Workshop on Simple and Efficient Natural Language Processing

Nafise Sadat Moosavi, Angela Fan, Vered Shwartz, Goran Glavaš, Shafiq Joty, Alex Wang, Thomas Wolf (Editors)

Anthology ID:
EMNLP | sustainlp
Association for Computational Linguistics
Bib Export formats:

pdf bib
Proceedings of SustaiNLP: Workshop on Simple and Efficient Natural Language Processing
Nafise Sadat Moosavi | Angela Fan | Vered Shwartz | Goran Glavaš | Shafiq Joty | Alex Wang | Thomas Wolf

pdf bib
Knowing Right from Wrong : Should We Use More Complex Models for Automatic Short-Answer Scoring in Bahasa Indonesia?Bahasa Indonesia?
Ali Akbar Septiandri | Yosef Ardhito Winatmoko | Ilham Firdausi Putra

We compare three solutions to UKARA 1.0 challenge on automated short-answer scoring : single classical, ensemble classical, and deep learning. The task is to classify given answers to two questions, whether they are right or wrong. While recent development shows increasing model complexity to push the benchmark performances, they tend to be resource-demanding with mundane improvement. For the UKARA task, we found that bag-of-words and classical machine learning approaches can compete with ensemble models and Bi-LSTM model with pre-trained word2vec embedding from 200 million words. In this case, the single classical machine learning achieved less than 2 % difference in F1 compared to the deep learning approach with 1/18 time for model training.

pdf bib
Learning Informative Representations of Biomedical Relations with Latent Variable Models
Harshil Shah | Julien Fauqueur

Extracting biomedical relations from large corpora of scientific documents is a challenging natural language processing task. Existing approaches usually focus on identifying a relation either in a single sentence (mention-level) or across an entire corpus (pair-level). In both cases, recent methods have achieved strong results by learning a point estimate to represent the relation ; this is then used as the input to a relation classifier. However, the relation expressed in text between a pair of biomedical entities is often more complex than can be captured by a point estimate. To address this issue, we propose a latent variable model with an arbitrarily flexible distribution to represent the relation between an entity pair. Additionally, our model provides a unified architecture for both mention-level and pair-level relation extraction. We demonstrate that our model achieves results competitive with strong baselines for both tasks while having fewer parameters and being significantly faster to train. We make our code publicly available.

pdf bib
End to End Binarized Neural Networks for Text Classification
Kumar Shridhar | Harshil Jain | Akshat Agarwal | Denis Kleyko

Deep neural networks have demonstrated their superior performance in almost every Natural Language Processing task, however, their increasing complexity raises concerns. A particular concern is that these networks pose high requirements for computing hardware and training budgets. The state-of-the-art transformer models are a vivid example. Simplifying the computations performed by a network is one way of addressing the issue of the increasing complexity. In this paper, we propose an end to end binarized neural network for the task of intent and text classification. In order to fully utilize the potential of end to end binarization, both the input representations (vector embeddings of tokens statistics) and the classifier are binarized. We demonstrate the efficiency of such a network on the intent classification of short texts over three datasets and text classification with a larger dataset. On the considered datasets, the proposed network achieves comparable to the state-of-the-art results while utilizing 20-40 % lesser memory and training time compared to the benchmarks.

pdf bib
Exploring the Boundaries of Low-Resource BERT DistillationBERT Distillation
Moshe Wasserblat | Oren Pereg | Peter Izsak

In recent years, large pre-trained models have demonstrated state-of-the-art performance in many of NLP tasks. However, the deployment of these models on devices with limited resources is challenging due to the models’ large computational consumption and memory requirements. Moreover, the need for a considerable amount of labeled training data also hinders real-world deployment scenarios. Model distillation has shown promising results for reducing model size, computational load and data efficiency. In this paper we test the boundaries of BERT model distillation in terms of model compression, inference efficiency and data scarcity. We show that classification tasks that require the capturing of general lexical semantics can be successfully distilled by very simple and efficient models and require relatively small amount of labeled training data. We also show that the distillation of large pre-trained models is more effective in real-life scenarios where limited amounts of labeled training are available.

pdf bib
Efficient Estimation of Influence of a Training Instance
Sosuke Kobayashi | Sho Yokoi | Jun Suzuki | Kentaro Inui

Understanding the influence of a training instance on a neural network model leads to improving interpretability. However, it is difficult and inefficient to evaluate the influence, which shows how a model’s prediction would be changed if a training instance were not used. In this paper, we propose an efficient method for estimating the influence. Our method is inspired by dropout, which zero-masks a sub-network and prevents the sub-network from learning each training instance. By switching between dropout masks, we can use sub-networks that learned or did not learn each training instance and estimate its influence. Through experiments with BERT and VGGNet on classification datasets, we demonstrate that the proposed method can capture training influences, enhance the interpretability of error predictions, and cleanse the training dataset for improving generalization.

pdf bib
Efficient Inference For Neural Machine Translation
Yi-Te Hsu | Sarthak Garg | Yi-Hsiu Liao | Ilya Chatsviorkin

Large Transformer models have achieved state-of-the-art results in neural machine translation and have become standard in the field. In this work, we look for the optimal combination of known techniques to optimize inference speed without sacrificing translation quality. We conduct an empirical study that stacks various approaches and demonstrates that combination of replacing decoder self-attention with simplified recurrent units, adopting a deep encoder and a shallow decoder architecture and multi-head attention pruning can achieve up to 109 % and 84 % speedup on CPU and GPU respectively and reduce the number of parameters by 25 % while maintaining the same translation quality in terms of BLEU.

pdf bib
Sparse Optimization for Unsupervised Extractive Summarization of Long Documents with the Frank-Wolfe Algorithm
Alicia Tsai | Laurent El Ghaoui

We address the problem of unsupervised extractive document summarization, especially for long documents. We model the unsupervised problem as a sparse auto-regression one and approximate the resulting combinatorial problem via a convex, norm-constrained problem. We solve it using a dedicated Frank-Wolfe algorithm. To generate a summary with k sentences, the algorithm only needs to execute approximately k iterations, making it very efficient for a long document. We evaluate our approach against two other unsupervised methods using both lexical (standard) ROUGE scores, as well as semantic (embedding-based) ones. Our method achieves better results with both datasets and works especially well when combined with embeddings for highly paraphrased summaries.

pdf bib
A Two-stage Model for Slot Filling in Low-resource Settings : Domain-agnostic Non-slot Reduction and Pretrained Contextual Embeddings
Cennet Oguz | Ngoc Thang Vu

Learning-based slot filling-a key component of spoken language understanding systems-typically requires a large amount of in-domain hand-labeled data for training. In this paper, we propose a novel two-stage model architecture that can be trained with only a few in-domain hand-labeled examples. The first step is designed to remove non-slot tokens (i.e., O labeled tokens), as they introduce noise in the input of slot filling models. This step is domain-agnostic and therefore, can be trained by exploiting out-of-domain data. The second step identifies slot names only for slot tokens by using state-of-the-art pretrained contextual embeddings such as ELMO and BERT. We show that our approach outperforms other state-of-art systems on the SNIPS benchmark dataset.

pdf bib
Early Exiting BERT for Efficient Document RankingBERT for Efficient Document Ranking
Ji Xin | Rodrigo Nogueira | Yaoliang Yu | Jimmy Lin

Pre-trained language models such as BERT have shown their effectiveness in various tasks. Despite their power, they are known to be computationally intensive, which hinders real-world applications. In this paper, we introduce early exiting BERT for document ranking. With a slight modification, BERT becomes a model with multiple output paths, and each inference sample can exit early from these paths. In this way, computation can be effectively allocated among samples, and overall system latency is significantly reduced while the original quality is maintained. Our experiments on two document ranking datasets demonstrate up to 2.5x inference speedup with minimal quality degradation. The source code of our implementation can be found at

pdf bib
A Little Bit Is Worse Than None : Ranking with Limited Training Data
Xinyu Zhang | Andrew Yates | Jimmy Lin

Researchers have proposed simple yet effective techniques for the retrieval problem based on using BERT as a relevance classifier to rerank initial candidates from keyword search. In this work, we tackle the challenge of fine-tuning these models for specific domains in a data and computationally efficient manner. Typically, researchers fine-tune models using corpus-specific labeled data from sources such as TREC. We first answer the question : How much data of this type do we need? Recognizing that the most computationally efficient training is no training, we explore zero-shot ranking using BERT models that have already been fine-tuned with the large MS MARCO passage retrieval dataset. We arrive at the surprising and novel finding that some labeled in-domain data can be worse than none at all.

pdf bib
Load What You Need : Smaller Versions of Mutililingual BERTBERT
Amine Abdaoui | Camille Pradel | Grégoire Sigel

Pre-trained Transformer-based models are achieving state-of-the-art results on a variety of Natural Language Processing data sets. However, the size of these models is often a drawback for their deployment in real production applications. In the case of multilingual models, most of the parameters are located in the embeddings layer. Therefore, reducing the vocabulary size should have an important impact on the total number of parameters. In this paper, we propose to extract smaller models that handle fewer number of languages according to the targeted corpora. We present an evaluation of smaller versions of multilingual BERT on the XNLI data set, but we believe that this method may be applied to other multilingual transformers. The obtained results confirm that we can generate smaller models that keep comparable results, while reducing up to 45 % of the total number of parameters. We compared our models with DistilmBERT (a distilled version of multilingual BERT) and showed that unlike language reduction, distillation induced a 1.7 % to 6 % drop in the overall accuracy on the XNLI data set. The presented models and code are publicly available.

pdf bib
Towards Accurate and Reliable Energy Measurement of NLP ModelsNLP Models
Qingqing Cao | Aruna Balasubramanian | Niranjan Balasubramanian

Accurate and reliable measurement of energy consumption is critical for making well-informed design choices when choosing and training large scale NLP models. In this work, we show that existing software-based energy estimations are not accurate because they do not take into account hardware differences and how resource utilization affects energy consumption. We conduct energy measurement experiments with four different models for a question answering task. We quantify the error of existing software-based energy estimations by using a hardware power meter that provides highly accurate energy measurements. Our key takeaway is the need for a more accurate energy estimation model that takes into account hardware variabilities and the non-linear relationship between resource utilization and energy consumption. We release the code and data at

pdf bib
Overview of the SustaiNLP 2020 Shared TaskSustaiNLP 2020 Shared Task
Alex Wang | Thomas Wolf

We describe the SustaiNLP 2020 shared task : efficient inference on the SuperGLUE benchmark (Wang et al., 2019). Participants are evaluated based on performance on the benchmark as well as energy consumed in making predictions on the test sets. We describe the task, its organization, and the submitted systems. Across the six submissions to the shared task, participants achieved efficiency gains of 20 over a standard BERT (Devlin et al., 2019) baseline, while losing less than an absolute point in performance.