Aitor Soroa
2022
Principled Paraphrase Generation with Parallel Corpora
Aitor Ormazabal | Mikel Artetxe | Aitor Soroa | Gorka Labaka | Eneko Agirre
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Round-trip Machine Translation (MT) is a popular choice for paraphrase generation, which leverages readily available parallel corpora for supervision. In this paper, we formalize the implicit similarity function induced by this approach, and show that it is susceptible to non-paraphrase pairs sharing a single ambiguous translation. Based on these insights, we design an alternative similarity metric that mitigates this issue by requiring the entire translation distribution to match, and implement a relaxation of it through the Information Bottleneck method. Our approach incorporates an adversarial term into MT training in order to learn representations that encode as much information about the reference translation as possible, while keeping as little information about the input as possible. Paraphrases can be generated by decoding back to the source from this representation, without having to generate pivot translations. In addition to being more principled and efficient than round-trip MT, our approach offers an adjustable parameter to control the fidelity-diversity trade-off, and obtains better results in our experiments.
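To make the training objective concrete, here is a minimal sketch of an Information-Bottleneck-style loss in the spirit the abstract describes: a translation term that encourages the representation to predict the reference, minus a weighted adversarial term that penalizes information about the source input. This is not the authors' implementation; the module name, tensor shapes, and the beta weight (the adjustable fidelity/diversity knob) are illustrative assumptions.

# Minimal sketch, assuming logits of shape (batch, length, vocab) and id
# tensors of shape (batch, length); not the paper's exact formulation.
import torch
import torch.nn as nn

class IBParaphraseLoss(nn.Module):
    def __init__(self, beta: float = 0.1):
        super().__init__()
        self.beta = beta  # trades fidelity to the input against diversity
        self.ce = nn.CrossEntropyLoss()

    def forward(self, translation_logits, reference_ids, adversary_logits, source_ids):
        # Term 1: keep as much information about the reference translation as possible.
        translation_loss = self.ce(
            translation_logits.reshape(-1, translation_logits.size(-1)),
            reference_ids.reshape(-1),
        )
        # Term 2: an adversary tries to recover the source from the representation;
        # the encoder is rewarded when it fails, so the term enters with a minus
        # sign (gradient reversal in spirit).
        adversary_loss = self.ce(
            adversary_logits.reshape(-1, adversary_logits.size(-1)),
            source_ids.reshape(-1),
        )
        return translation_loss - self.beta * adversary_loss

In practice the adversary itself would be trained to minimize its own recovery loss while the encoder maximizes it; the single combined loss above is only a compact way to show how beta controls the trade-off.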
2019
Analyzing the Limitations of Cross-lingual Word Embedding Mappings
Aitor Ormazabal | Mikel Artetxe | Gorka Labaka | Aitor Soroa | Eneko Agirre
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
Recent research in cross-lingual word embeddings has almost exclusively focused on offline methods, which independently train word embeddings in different languages and map them to a shared space through linear transformations. While several authors have questioned the underlying isomorphism assumption, which states that word embeddings in different languages have approximately the same structure, it is not clear whether this is an inherent limitation of mapping approaches or a more general issue when learning cross-lingual embeddings. To answer this question, we experiment with parallel corpora, which allow us to compare offline mapping to an extension of skip-gram that jointly learns both embedding spaces. We observe that, under these ideal conditions, joint learning yields more isomorphic embeddings, is less sensitive to hubness, and obtains stronger results in bilingual lexicon induction. We thus conclude that current mapping methods do have strong limitations, calling for further research to jointly learn cross-lingual embeddings with a weaker cross-lingual signal.
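For readers unfamiliar with the offline-mapping baseline the abstract contrasts against, the following is a minimal sketch of a standard pipeline: learn an orthogonal (Procrustes) map from a seed dictionary and induce a bilingual lexicon by nearest-neighbour retrieval. Function names and shapes are assumptions for illustration; the paper's experimental setup is not reproduced here.

# Minimal sketch of offline mapping plus lexicon induction, assuming X_seed and
# Y_seed are row-aligned seed-dictionary embeddings of shape (n, d).
import numpy as np

def procrustes_map(X_seed: np.ndarray, Y_seed: np.ndarray) -> np.ndarray:
    """Orthogonal W minimizing ||X_seed @ W - Y_seed||_F (offline mapping)."""
    U, _, Vt = np.linalg.svd(X_seed.T @ Y_seed)
    return U @ Vt

def induce_lexicon(X_src: np.ndarray, Y_tgt: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Return, for each mapped source vector, the index of its nearest target vector."""
    mapped = X_src @ W
    mapped /= np.linalg.norm(mapped, axis=1, keepdims=True)  # cosine similarity
    tgt = Y_tgt / np.linalg.norm(Y_tgt, axis=1, keepdims=True)
    return (mapped @ tgt.T).argmax(axis=1)

The joint alternative studied in the paper instead trains both embedding spaces together on parallel data, so no post-hoc mapping step of this kind is needed.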