Giuseppe G. A. Celano
2020
A Gradient Boosting-Seq2Seq System for Latin POS Tagging and Lemmatization
Giuseppe G. A. Celano
Proceedings of LT4HALA 2020 - 1st Workshop on Language Technologies for Historical and Ancient Languages
The paper presents the system used in the EvaLatin shared task to POS tag and lemmatize Latin. It consists of two components. A gradient boosting machine (LightGBM) is used for POS tagging, mainly fed with pre-computed word embeddings of a window of seven contiguous tokens (the token at hand plus the three preceding and following ones) per target feature value. Word embeddings are trained on the texts of the Perseus Digital Library, Patrologia Latina, and Biblioteca Digitale di Testi Tardo Antichi, which together comprise a high number of texts of different genres from the Classical Age to Late Antiquity. Word forms plus the outputted POS labels are used to feed a seq2seq algorithm implemented in Keras to predict lemmas. The final shared-task accuracies measured for Classical Latin texts are in line with state-of-the-art POS taggers (0.96) and lemmatizers (0.95).
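The seven-token window described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `window_features`, the `<pad>` token, the zero vector for padding and unknown words, and the two-dimensional toy embeddings are all assumptions made here for illustration. Each resulting row is the kind of fixed-length vector that could be passed to a gradient boosting classifier such as LightGBM.

```python
from typing import Dict, List

PAD = "<pad>"  # hypothetical padding token for sentence edges (assumption)


def window_features(
    tokens: List[str],
    embeddings: Dict[str, List[float]],
    dim: int,
    size: int = 3,
) -> List[List[float]]:
    """Concatenate the embeddings of each token and its `size` neighbours
    on either side, giving one row of (2 * size + 1) * dim values per token."""
    zero = [0.0] * dim  # used for padding and out-of-vocabulary tokens (assumption)
    padded = [PAD] * size + list(tokens) + [PAD] * size
    rows = []
    for i in range(len(tokens)):
        row: List[float] = []
        for tok in padded[i : i + 2 * size + 1]:
            row.extend(embeddings.get(tok, zero))
        rows.append(row)
    return rows


# Toy two-dimensional embeddings, invented for this example.
toy_embeddings = {
    "arma": [1.0, 0.0],
    "virumque": [0.0, 1.0],
    "cano": [1.0, 1.0],
}
features = window_features(["arma", "virumque", "cano"], toy_embeddings, dim=2)
```

With `size=3` and `dim=2`, every row has 14 values, and the target token's own embedding sits in the middle of the row (positions 6 and 7).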
SIGTYP 2020 Shared Task: Prediction of Typological Features
Johannes Bjerva | Elizabeth Salesky | Sabrina J. Mielke | Aditi Chaudhary | Giuseppe G. A. Celano | Edoardo Maria Ponti | Ekaterina Vylomova | Ryan Cotterell | Isabelle Augenstein
Proceedings of the Second Workshop on Computational Research in Linguistic Typology
Typological knowledge bases (KBs) such as WALS (Dryer and Haspelmath, 2013) contain information about linguistic properties of the world’s languages. They have been shown to be useful for downstream applications, including cross-lingual transfer learning and linguistic probing. A major drawback hampering broader adoption of typological KBs is that they are sparsely populated, in the sense that most languages only have annotations for some features, and skewed, in that few features have wide coverage. As typological features often correlate with one another, it is possible to predict them and thus automatically populate typological KBs, which is also the focus of this shared task. Overall, the task attracted 8 submissions from 5 teams, out of which the most successful methods make use of such feature correlations. However, our error analysis reveals that even the strongest submitted systems struggle with predicting feature values for languages where few features are known.
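The idea of exploiting correlations between typological features can be sketched with a simple conditional-majority baseline: predict a missing feature as the value that most often co-occurs with a known feature's value. This is only an illustration, not one of the submitted systems; the function name, the feature names `word_order` and `adposition`, and the toy rows are invented here and are not taken from WALS.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def fit_conditional_majority(
    rows: List[Dict[str, str]], known: str, target: str
) -> Dict[str, str]:
    """For each value of `known`, store the most frequent co-occurring
    value of `target`, using only rows where both are annotated."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for row in rows:
        if known in row and target in row:
            counts[row[known]][row[target]] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}


# Invented toy annotations (assumption); the last language lacks `adposition`.
toy_languages = [
    {"word_order": "SOV", "adposition": "postpositions"},
    {"word_order": "SOV", "adposition": "postpositions"},
    {"word_order": "SOV", "adposition": "prepositions"},
    {"word_order": "SVO", "adposition": "prepositions"},
    {"word_order": "SVO", "adposition": "prepositions"},
    {"word_order": "SVO"},
]

model = fit_conditional_majority(toy_languages, "word_order", "adposition")
prediction = model[toy_languages[-1]["word_order"]]
```

A baseline like this also shows the weakness noted in the abstract: when a language has few known features, there is little to condition on.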