Xutai Ma


2022

pdf bib
Direct Speech-to-Speech Translation With Discrete Units
Ann Lee | Peng-Jen Chen | Changhan Wang | Jiatao Gu | Sravya Popuri | Xutai Ma | Adam Polyak | Yossi Adi | Qing He | Yun Tang | Juan Pino | Wei-Ning Hsu
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation. We tackle the problem by first applying a self-supervised discrete speech encoder on the target speech and then training a sequence-to-sequence speech-to-unit translation (S2UT) model to predict the discrete representations of the target speech. When target text transcripts are available, we design a joint speech and text training framework that enables the model to generate dual modality output (speech and text) simultaneously in the same inference pass. Experiments on the Fisher Spanish-English dataset show that the proposed framework yields improvement of 6.7 BLEU compared with a baseline direct S2ST model that predicts spectrogram features. When trained without any text transcripts, our model performance is comparable to models that predict spectrograms and are trained with text supervision, showing the potential of our system for translation between unwritten languages.

2021

pdf bib
FINDINGS OF THE IWSLT 2021 EVALUATION CAMPAIGNFINDINGS OF THE IWSLT 2021 EVALUATION CAMPAIGN
Antonios Anastasopoulos | Ondřej Bojar | Jacob Bremerman | Roldano Cattoni | Maha Elbayad | Marcello Federico | Xutai Ma | Satoshi Nakamura | Matteo Negri | Jan Niehues | Juan Pino | Elizabeth Salesky | Sebastian Stüker | Katsuhito Sudoh | Marco Turchi | Alexander Waibel | Changhan Wang | Matthew Wiesner
Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021)

The evaluation campaign of the International Conference on Spoken Language Translation (IWSLT 2021) featured this year four shared tasks : (i) Simultaneous speech translation, (ii) Offline speech translation, (iii) Multilingual speech translation, (iv) Low-resource speech translation. A total of 22 teams participated in at least one of the tasks. This paper describes each shared task, data and evaluation metrics, and reports results of the received submissions.

2019

pdf bib
Broad-Coverage Semantic Parsing as Transduction
Sheng Zhang | Xutai Ma | Kevin Duh | Benjamin Van Durme
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

We unify different broad-coverage semantic parsing tasks into a transduction parsing paradigm, and propose an attention-based neural transducer that incrementally builds meaning representation via a sequence of semantic relations. By leveraging multiple attention mechanisms, the neural transducer can be effectively trained without relying on a pre-trained aligner. Experiments separately conducted on three broad-coverage semantic parsing tasks AMR, SDP and UCCA demonstrate that our attention-based neural transducer improves the state of the art on both AMR and UCCA, and is competitive with the state of the art on SDP.

pdf bib
Harnessing Indirect Training Data for End-to-End Automatic Speech Translation : Tricks of the Trade
Juan Pino | Liezl Puzon | Jiatao Gu | Xutai Ma | Arya D. McCarthy | Deepak Gopinath
Proceedings of the 16th International Conference on Spoken Language Translation

For automatic speech translation (AST), end-to-end approaches are outperformed by cascaded models that transcribe with automatic speech recognition (ASR), then trans- late with machine translation (MT). A major cause of the performance gap is that, while existing AST corpora are small, massive datasets exist for both the ASR and MT subsystems. In this work, we evaluate several data augmentation and pretraining approaches for AST, by comparing all on the same datasets. Simple data augmentation by translating ASR transcripts proves most effective on the EnglishFrench augmented LibriSpeech dataset, closing the performance gap from 8.2 to 1.4 BLEU, compared to a very strong cascade that could directly utilize copious ASR and MT data. The same end-to-end approach plus fine-tuning closes the gap on the EnglishRomanian MuST-C dataset from 6.7 to 3.7 BLEU. In addition to these results, we present practical rec- ommendations for augmentation and pretraining approaches. Finally, we decrease the performance gap to 0.01 BLEU us- ing a Transformer-based architecture.

pdf bib
AMR Parsing as Sequence-to-Graph TransductionAMR Parsing as Sequence-to-Graph Transduction
Sheng Zhang | Xutai Ma | Kevin Duh | Benjamin Van Durme
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

We propose an attention-based model that treats AMR parsing as sequence-to-graph transduction. Unlike most AMR parsers that rely on pre-trained aligners, external semantic resources, or data augmentation, our proposed parser is aligner-free, and it can be effectively trained with limited amounts of labeled AMR data. Our experimental results outperform all previously reported SMATCH scores, on both AMR 2.0 (76.3 % on LDC2017T10) and AMR 1.0 (70.2 % on LDC2014T12).