Proceedings of the Second Workshop on Simple and Efficient Natural Language Processing

Nafise Sadat Moosavi, Iryna Gurevych, Angela Fan, Thomas Wolf, Yufang Hou, Ana Marasović, Sujith Ravi (Editors)


Anthology ID: 2021.sustainlp-1
Month: November
Year: 2021
Address: Virtual
Venues: EMNLP | sustainlp
Publisher: Association for Computational Linguistics
URL: https://aclanthology.org/2021.sustainlp-1

Low Resource Quadratic Forms for Knowledge Graph Embeddings
Zachary Zhou | Jeffery Kline | Devin Conathan | Glenn Fung

We address the problem of link prediction between entities and relations of knowledge graphs. State-of-the-art techniques that address this problem, while increasingly accurate, are computationally intensive. In this paper we cast link prediction as a sparse convex program whose solution defines a quadratic form that is used as a ranking function. The structure of our convex program is such that standard support vector machine software packages, which are numerically robust and efficient, can solve it. We show that on benchmark data sets, our model’s performance is competitive with state-of-the-art models, but training times can be reduced by a factor of 40 using only CPU-based (and not GPU-accelerated) computing resources. This approach may be suitable for applications where balancing the demands of graph completion performance against computational efficiency is a desirable trade-off.
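
As a rough illustration of what "a quadratic form that is used as a ranking function" can look like, the sketch below scores a head/tail pair as x^T M y and sorts candidate tails by that score. The feature vectors, the matrix M, and the function names are hypothetical; in the paper M comes from solving the sparse convex program, which is not reproduced here.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def quadratic_form(x: Vector, m: Matrix, y: Vector) -> float:
    """Return x^T M y, with M given as a list of rows."""
    return sum(x[i] * m[i][j] * y[j] for i in range(len(x)) for j in range(len(y)))


def rank_tails(head: Vector, m: Matrix, candidates: Dict[str, Vector]) -> List[str]:
    """Order candidate tail entities by decreasing score head^T M tail."""
    return sorted(
        candidates,
        key=lambda name: quadratic_form(head, m, candidates[name]),
        reverse=True,
    )


# Hypothetical 2-dimensional entity features and a hypothetical matrix M
# (assumed values for illustration only, not learned from any data).
M = [[1.0, 0.0], [0.0, -1.0]]
head = [1.0, 2.0]
tails = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [2.0, 1.0]}

ranking = rank_tails(head, M, tails)  # scores: a = 1.0, c = 0.0, b = -2.0
```

Once M is fixed, scoring a candidate costs a single matrix-vector product, which is consistent with the CPU-only emphasis of the abstract.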

Limitations of Knowledge Distillation for Zero-shot Transfer Learning
Saleh Soltan | Haidar Khan | Wael Hamza

Pretrained transformer-based encoders such as BERT have been demonstrated to achieve state-of-the-art performance on numerous NLP tasks. Despite their success, BERT-style encoders are large and have high latency during inference (especially on CPU machines), which makes them unappealing for many online applications. Recently introduced compression and distillation methods have provided effective ways to alleviate this shortcoming. However, the focus of these works has been mainly on monolingual encoders. Motivated by recent successes in zero-shot cross-lingual transfer learning using multilingual pretrained encoders such as mBERT, we evaluate the effectiveness of Knowledge Distillation (KD) both during the pretraining stage and during the fine-tuning stage on multilingual BERT models. We demonstrate that, contrary to previous observations for monolingual distillation, in multilingual settings distillation during pretraining is more effective than distillation during fine-tuning for zero-shot transfer learning. Moreover, we observe that distillation during fine-tuning may hurt zero-shot cross-lingual performance. Finally, we demonstrate that distilling a larger model (BERT Large) results in the strongest distilled model, which performs best on both the source language and the target languages in zero-shot settings.
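
For readers unfamiliar with KD, the sketch below shows the widely used soft-target distillation loss: the KL divergence between temperature-softened teacher and student distributions, scaled by the squared temperature. This is the generic formulation and is only assumed to resemble the objectives in the paper; the function names and the example logits are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    teacher_logits: Sequence[float],
    student_logits: Sequence[float],
    temperature: float = 2.0,
) -> float:
    """KL(teacher || student) on softened distributions, times temperature^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return temperature ** 2 * kl


# Hypothetical logits over three classes.
teacher = [4.0, 1.0, 0.5]
matched_student = [4.0, 1.0, 0.5]
mismatched_student = [0.5, 1.0, 4.0]

loss_matched = distillation_loss(teacher, matched_student)
loss_mismatched = distillation_loss(teacher, mismatched_student)
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge. In practice it is usually mixed with the ordinary supervised loss on the gold labels.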

Combining Lexical and Dense Retrieval for Computationally Efficient Multi-hop Question Answering
Georgios Sidiropoulos | Nikos Voskarides | Svitlana Vakulenko | Evangelos Kanoulas

In simple open-domain question answering (QA), dense retrieval has become one of the standard approaches for retrieving the relevant passages to infer an answer. Recently, dense retrieval has also achieved state-of-the-art results in multi-hop QA, where aggregating and reasoning over multiple pieces of information is required. Despite their success, dense retrieval methods are computationally intensive, requiring multiple GPUs to train. In this work, we introduce a hybrid (lexical and dense) retrieval approach that is highly competitive with the state-of-the-art dense retrieval models, while requiring substantially fewer computational resources. Additionally, we provide an in-depth evaluation of dense retrieval methods in settings with limited computational resources, something that is missing from the current literature.
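
One generic way to combine lexical and dense signals is to normalize each retriever's scores and interpolate them. The sketch below illustrates that idea only; it is not the authors' pipeline, and the passage ids, scores, and the `weight` parameter are hypothetical. Both score dictionaries are assumed to be non-empty.

```python
from typing import Dict, List


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; a constant score list maps to all zeros."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_rank(
    lexical: Dict[str, float], dense: Dict[str, float], weight: float = 0.5
) -> List[str]:
    """Rank passages by weight * lexical + (1 - weight) * dense (normalized)."""
    lex_n, den_n = min_max(lexical), min_max(dense)
    docs = set(lex_n) | set(den_n)
    combined = {
        d: weight * lex_n.get(d, 0.0) + (1.0 - weight) * den_n.get(d, 0.0)
        for d in docs
    }
    # Sort by decreasing combined score; break ties by passage id.
    return sorted(docs, key=lambda d: (-combined[d], d))


# Hypothetical scores from a lexical scorer (e.g. BM25) and a dense encoder.
lexical_scores = {"p1": 12.0, "p2": 3.0, "p3": 7.5}
dense_scores = {"p1": 0.2, "p2": 0.9, "p3": 0.8}

ranking = hybrid_rank(lexical_scores, dense_scores)
```

Here `p3` ranks first because it scores reasonably well under both retrievers, while `p1` and `p2` are each strong under only one.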

Learning to Rank in the Age of Muppets: Effectiveness-Efficiency Tradeoffs in Multi-Stage Ranking
Yue Zhang | ChengCheng Hu | Yuqi Liu | Hui Fang | Jimmy Lin

It is well known that rerankers built on pretrained transformer models such as BERT have dramatically improved retrieval effectiveness in many tasks. However, these gains have come at substantial costs in terms of efficiency, as noted by many researchers. In this work, we show that it is possible to retain the benefits of transformer-based rerankers in a multi-stage reranking pipeline by first using feature-based learning-to-rank techniques to reduce the number of candidate documents under consideration without adversely affecting their quality in terms of recall. Applied to the MS MARCO passage and document ranking tasks, we are able to achieve the same level of effectiveness, but with up to an 18× increase in efficiency. Furthermore, our techniques are orthogonal to other methods focused on accelerating transformer inference, and thus can be combined for even greater efficiency gains. A higher-level message from our work is that, even though pretrained transformers dominate the modern IR landscape, there are still important roles for traditional LTR techniques, and that we should not forget history.
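
The pipeline idea described above can be sketched as follows: a cheap scorer prunes the candidate list, and the expensive reranker only sees the survivors. The scorers here are hypothetical lookup tables standing in for a feature-based LTR model and a transformer reranker; the cutoff `keep` is likewise an assumed parameter.

```python
from typing import Callable, List


def multi_stage_rank(
    candidates: List[str],
    cheap_score: Callable[[str], float],
    expensive_score: Callable[[str], float],
    keep: int,
) -> List[str]:
    """Keep the top `keep` candidates by the cheap scorer, then rerank them."""
    shortlist = sorted(candidates, key=cheap_score, reverse=True)[:keep]
    return sorted(shortlist, key=expensive_score, reverse=True)


# Hypothetical scores; in a real system these would come from trained models.
CHEAP = {"d1": 0.9, "d2": 0.1, "d3": 0.7, "d4": 0.3, "d5": 0.8}
EXPENSIVE = {"d1": 0.2, "d2": 0.5, "d3": 0.95, "d4": 0.4, "d5": 0.6}

expensive_calls = 0


def expensive(doc: str) -> float:
    """Stand-in for a costly reranker; counts how often it is invoked."""
    global expensive_calls
    expensive_calls += 1
    return EXPENSIVE[doc]


result = multi_stage_rank(list(CHEAP), CHEAP.__getitem__, expensive, keep=3)
```

With `keep=3`, the expensive scorer runs on three of the five documents. The saving grows with the size of the candidate pool, provided the cheap stage keeps the relevant documents in the shortlist.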

Distiller: A Systematic Study of Model Distillation Methods in Natural Language Processing
Haoyu He | Xingjian Shi | Jonas Mueller | Sheng Zha | Mu Li | George Karypis

Knowledge Distillation (KD) offers a natural way to reduce the latency and memory/energy usage of massive pretrained models that have come to dominate Natural Language Processing (NLP) in recent years. While numerous sophisticated variants of KD algorithms have been proposed for NLP applications, the key factors underpinning the optimal distillation performance are often confounded and remain unclear. We aim to identify how different components in the KD pipeline, such as the data augmentation policy, the loss function, and the intermediate representation for transferring the knowledge between teacher and student, affect the resulting performance, and how much the optimal KD pipeline varies across different datasets/tasks. To tease apart their effects, we propose Distiller, a meta KD framework that systematically combines a broad range of techniques across different stages of the KD pipeline, which enables us to quantify each component’s contribution. Within Distiller, we unify commonly used objectives for distillation of intermediate representations under a universal mutual information (MI) objective and propose a class of MI-α objective functions with better bias/variance trade-off for estimating the MI between the teacher and the student. On a diverse set of NLP datasets, the best Distiller configurations are identified via large-scale hyper-parameter optimization. Our experiments reveal the following: 1) the approach used to distill the intermediate representations is the most important factor in KD performance, 2) among different objectives for intermediate distillation, MI-α performs the best, and 3) data augmentation provides a large boost for small training datasets or small student networks.

Shrinking Bigfoot: Reducing wav2vec 2.0 footprint
Zilun Peng | Akshay Budhkar | Ilana Tuil | Jason Levy | Parinaz Sobhani | Raphael Cohen | Jumana Nassour

Wav2vec 2.0 is a state-of-the-art speech recognition model which maps speech audio waveforms into latent representations. The largest version of wav2vec 2.0 contains 317 million parameters. Hence, the inference latency of wav2vec 2.0 will be a bottleneck in production, leading to high costs and a significant environmental footprint. To improve wav2vec’s applicability to a production setting, we explore multiple model compression methods borrowed from the domain of large language models. Using a teacher-student approach, we distilled the knowledge from the original wav2vec 2.0 model into a student model, which is 2 times faster and 4.8 times smaller than the original model. More importantly, the student model is 2 times more energy efficient than the original model in terms of CO2 emissions. This increase in performance is accomplished with only a 7% degradation in word error rate (WER). Our quantized model is 3.6 times smaller than the original model, with only a 0.1% degradation in WER. To the best of our knowledge, this is the first work that compresses wav2vec 2.0.

Unsupervised Contextualized Document Representation
Ankur Gupta | Vivek Gupta

Several NLP tasks need the effective representation of text documents. Arora et al. (2017) demonstrate that simple weighted averaging of word vectors frequently outperforms neural models. SCDV (Mekala et al., 2017) further extends this from sentences to documents by employing soft and sparse clustering over pre-computed word vectors. However, both techniques ignore the polysemy and contextual character of words. In this paper, we address this issue by proposing SCDV+BERT(ctxd), a simple and effective unsupervised representation that combines contextualized BERT (Devlin et al., 2019) based word embeddings for word sense disambiguation with the SCDV soft clustering approach. We show that our embeddings outperform the original SCDV, pre-trained BERT, and several other baselines on many classification datasets. We also demonstrate our embeddings' effectiveness on other tasks, such as concept matching and sentence similarity. In addition, we show that SCDV+BERT(ctxd) outperforms fine-tuned BERT and different embedding approaches in scenarios with limited data and only few-shot examples.