Wei Liu


2021

pdf bib
LexiClean : An annotation tool for rapid multi-task lexical normalisationLexiClean: An annotation tool for rapid multi-task lexical normalisation
Tyler Bikaun | Tim French | Melinda Hodkiewicz | Michael Stewart | Wei Liu
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

NLP systems are often challenged by difficulties arising from noisy, non-standard, and domain specific corpora. The task of lexical normalisation aims to standardise such corpora, but currently lacks suitable tools to acquire high-quality annotated data to support deep learning based approaches. In this paper, we present LexiClean, the first open-source web-based annotation tool for multi-task lexical normalisation. LexiClean’s main contribution is support for simultaneous in situ token-level modification and annotation that can be rapidly applied corpus wide. We demonstrate the usefulness of our tool through a case study on two sets of noisy corpora derived from the specialised-domain of industrial mining. We show that LexiClean allows for the rapid and efficient development of high-quality parallel corpora. A demo of our system is available at : https://youtu.be/P7_ooKrQPDU.

2019

pdf bib
Multi-lingual Wikipedia Summarization and Title Generation On Low Resource CorpusWikipedia Summarization and Title Generation On Low Resource Corpus
Wei Liu | Lei Li | Zuying Huang | Yinan Liu
Proceedings of the Workshop MultiLing 2019: Summarization Across Languages, Genres and Sources

MultiLing 2019 Headline Generation Task on Wikipedia Corpus raised a critical and practical problem : multilingual task on low resource corpus. In this paper we proposed QDAS extractive summarization model enhanced by sentence2vec and try to apply transfer learning based on large multilingual pre-trained language model for Wikipedia Headline Generation task. We treat it as sequence labeling task and develop two schemes to handle with it. Experimental results have shown that large pre-trained model can effectively utilize learned knowledge to extract certain phrase using low resource supervised data.

pdf bib
In Conclusion Not Repetition : Comprehensive Abstractive Summarization with Diversified Attention Based on Determinantal Point Processes
Lei Li | Wei Liu | Marina Litvak | Natalia Vanetik | Zuying Huang
Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)

Various Seq2Seq learning models designed for machine translation were applied for abstractive summarization task recently. Despite these models provide high ROUGE scores, they are limited to generate comprehensive summaries with a high level of abstraction due to its degenerated attention distribution. We introduce Diverse Convolutional Seq2Seq Model(DivCNN Seq2Seq) using Determinantal Point Processes methods(Micro DPPs and Macro DPPs) to produce attention distribution considering both quality and diversity. Without breaking the end to end architecture, DivCNN Seq2Seq achieves a higher level of comprehensiveness compared to vanilla models and strong baselines. All the reproducible codes and datasets are available online.

2018

pdf bib
NovelPerspective : Identifying Point of View CharactersNovelPerspective: Identifying Point of View Characters
Lyndon White | Roberto Togneri | Wei Liu | Mohammed Bennamoun
Proceedings of ACL 2018, System Demonstrations

We present NovelPerspective : a tool to allow consumers to subset their digital literature, based on point of view (POV) character. Many novels have multiple main characters each with their own storyline running in parallel. A well-known example is George R. R. Martin’s novel : A Game of Thrones, and others from that series. Our tool detects the main character that each section is from the POV of, and allows the user to generate a new ebook with only those sections. This gives consumers new options in how they consume their media ; allowing them to pursue the storylines sequentially, or skip chapters about characters they find boring. We present two heuristic-based baselines, and two machine learning based methods for the detection of the main character.