Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)

Eunjeong L. Park, Masato Hagiwara, Dmitrijs Milajevs, Nelson F. Liu, Geeticka Chauhan, Liling Tan (Editors)

Association for Computational Linguistics

End-to-end NLP Pipelines in Rust
Guillaume Becquin

Recent progress in natural language processing research has been supported by the development of a rich open-source ecosystem in Python. Libraries that allow both NLP practitioners and non-specialists to leverage state-of-the-art models have been instrumental in democratizing this technology. The maturity of the open-source NLP ecosystem, however, varies between programming languages. This work proposes a new open-source library aimed at bringing state-of-the-art NLP to Rust. Rust is a systems programming language for which the foundations required to build machine learning applications are available, but which still lacks ready-to-use, end-to-end NLP libraries. The proposed library, rust-bert, implements modern language models and ready-to-use pipelines (for example, translation or summarization). This allows further development by the Rust community, from both NLP experts and non-specialists. It is hoped that this library will accelerate the development of the NLP ecosystem in Rust. The library is under active development and available at

Open Korean Corpora: A Practical Report
Won Ik Cho | Sangwhan Moon | Youngsook Song

Korean is often referred to as a low-resource language in the research community. While this claim is partially true, the perception also arises because the available resources are inadequately advertised and curated. This work curates and reviews a list of Korean corpora, first describing institution-level resource development, then iterating through a list of current open datasets for different types of tasks. We then propose a direction for how open-source dataset construction and release should be done for less-resourced languages to promote research.

PySBD: Pragmatic Sentence Boundary Disambiguation
Nipun Sadvilkar | Mark Neumann

We present a rule-based sentence boundary disambiguation Python package that works out of the box for 22 languages. We aim to provide a realistic segmenter that can produce logical sentences even when the format and domain of the input text are unknown. In our work, we adapt the Golden Rule Set (a language-specific set of sentence boundary exemplars), originally implemented as a Ruby gem, pragmatic segmenter, which we ported to Python with additional improvements and functionality. PySBD passes 97.92% of the Golden Rule Set exemplars for English, an improvement of 25% over the next best open-source Python tool.
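
To illustrate what rule-based sentence boundary disambiguation means in practice, the following is a minimal sketch of the general idea. It is not PySBD's implementation or API: the function name `split_sentences` and the tiny abbreviation list are hypothetical, chosen only for illustration, and a real segmenter needs many more rules per language.

```python
import re

# Hypothetical, deliberately tiny abbreviation list (illustration only).
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "prof", "e.g", "i.e", "etc", "vs"}


def split_sentences(text):
    """Split text at ., ! or ? unless a single period follows a known abbreviation."""
    sentences = []
    start = 0
    # Candidate boundaries: terminal punctuation followed by whitespace or end of text.
    for match in re.finditer(r"[.!?]+(?=\s+|$)", text):
        words = text[start:match.start()].split()
        last_word = words[-1].lower() if words else ""
        # Rule: a lone period right after a known abbreviation is not a boundary.
        if match.group() == "." and last_word in ABBREVIATIONS:
            continue
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    # Keep any trailing text that has no terminal punctuation.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    sample = "Dr. Smith went to Washington. He arrived on time! Did anyone notice?"
    for sentence in split_sentences(sample):
        print(sentence)
```

A naive split on every period would break the sample after "Dr."; the abbreviation rule is the kind of exemplar-driven exception that a rule set such as the Golden Rule Set encodes.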

SacreROUGE: An Open-Source Library for Using and Developing Summarization Evaluation Metrics
Daniel Deutsch | Dan Roth

We present SacreROUGE, an open-source library for using and developing summarization evaluation metrics. SacreROUGE removes many obstacles that researchers face when using or developing metrics: (1) the library provides Python wrappers around the official implementations of existing evaluation metrics so they share a common, easy-to-use interface; (2) it provides functionality to evaluate how well any metric implemented in the library correlates with human-annotated judgments, so no additional code needs to be written for a new evaluation metric; and (3) it includes scripts for loading datasets that contain human judgments so they can easily be used for evaluation. This work describes the design of the library, including the core Metric interface, the command-line API for evaluating summarization models and metrics, and the scripts to load and reformat publicly available datasets. The development of SacreROUGE is ongoing and open to contributions from the community.
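
As a rough illustration of what "correlating a metric with human judgments" involves, the sketch below computes Pearson's r between automatic metric scores and human ratings for the same summaries. It does not use SacreROUGE or reflect its interface: the function name `pearson` and all numbers are made up for the example, and Pearson's r is only one common choice of correlation coefficient.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Made-up scores for four summaries (illustration only, not real data).
    metric_scores = [0.21, 0.35, 0.30, 0.48]
    human_ratings = [2.0, 3.5, 3.0, 4.5]
    print(f"Pearson r = {pearson(metric_scores, human_ratings):.3f}")
```

A value close to 1 would suggest that the metric ranks summaries much as the human annotators do; a value near 0 would suggest that it does not.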