Predicting Long-Range Reordering Occurrences from Linguistically Motivated Features in Source Sentences for Machine Translation

Project: National Science and Technology Council / National Science and Technology Council Academic Grants

Project Details

Abstract

Differences in word order between the languages of a translation pair cause difficulties for machine translation. In particular, for a pair such as Mandarin and English, where differences in word order require not only local reordering but also long-range reordering, the translation task becomes challenging. With a state-of-the-art statistical machine translation (SMT) system such as Google Translate, local reordering can be handled given sufficient language data, and a natural-sounding translation is achievable. However, the reordering limit applied in SMT to control overall translation quality and reduce computational complexity actually discourages long-range reordering. Consequently, long-range reordering issues manifest in SMT for some difficult translation pairs. Although measures have been proposed to tackle reordering issues, the results have not been overwhelmingly positive. One of the key reasons is that methods targeting long-range reordering issues are less efficient, or even less accurate, when translating sentences that require either no reordering or merely local reordering. If sentences requiring different levels of reordering can be separated, measures targeting long-range reordering issues can be applied to the right sentences without damaging translations that are already acceptable. We plan to use machine learning algorithms to predict whether a sentence to be translated requires long-range reordering treatment, based on the recent performance of the MT system under investigation. Sentences in the source language will be categorised as either “without long-range reordering issues” or “with long-range reordering issues”. The features used for the categorisation task are mostly linguistically motivated, including the types of oblique phrases that modify the verb, types of subordinate clauses, and types of adpositional phrases.
The machine learning method will be compared with methods that calculate translatability based solely on features of the source text, without knowledge of current MT capability. In addition, the annotated data used to build the model will be analysed in order to gain a better understanding of the reordering issues in SMT and to provide directions for developing comprehensive solutions to long-range reordering problems. Although the proposed research will be based on the outputs of English-to-Chinese translation generated by Google Translate, the same method can be applied to other MT systems and other language pairs with long-range reordering issues by changing the training data and adapting lexicon-specific features for the source language.
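The categorisation task described in the abstract can be pictured as a binary classifier over per-sentence feature counts. The following is a minimal sketch only, not the project's actual method: the abstract does not name a learning algorithm, so plain logistic regression is used here as a stand-in, and the feature names, the toy data, and the functions `train` and `predict` are all hypothetical and invented for illustration.

```python
import math

# Hypothetical features, loosely following the feature families named in the
# abstract: counts of oblique phrases, subordinate clauses and adpositional
# phrases in the source sentence.
FEATURES = ["n_oblique_phrases", "n_subordinate_clauses", "n_adpositional_phrases"]

# Invented toy data (NOT from the project). Label 1 means "with long-range
# reordering issues", label 0 means "without long-range reordering issues".
TRAIN = [
    ((0, 0, 0), 0),
    ((1, 0, 0), 0),
    ((0, 0, 1), 0),
    ((1, 0, 1), 0),
    ((2, 1, 1), 1),
    ((1, 1, 2), 1),
    ((2, 2, 1), 1),
    ((3, 1, 2), 1),
]


def sigmoid(z):
    """Logistic function mapping a score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def score(weights, bias, x):
    """Linear score of feature vector x under the current model."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def train(data, epochs=2000, lr=0.1):
    """Fit logistic regression with plain stochastic gradient descent."""
    weights = [0.0] * len(FEATURES)
    bias = 0.0
    for _ in range(epochs):
        for x, y in data:
            err = sigmoid(score(weights, bias, x)) - y
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def predict(weights, bias, x):
    """Return 1 if the sentence is predicted to need long-range reordering."""
    return 1 if sigmoid(score(weights, bias, x)) >= 0.5 else 0


weights, bias = train(TRAIN)
```

In a setting like the one the abstract proposes, the labels would come from annotated output of the MT system under investigation rather than from hand-made counts, and a sentence predicted as 1 would be routed to a long-range reordering treatment while the rest would be translated as usual.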

Project IDs

Project ID: PB10406-1540
External Project ID: MOST104-2410-H182-008
Status: Finished
Effective start/end date: 01/08/15 to 31/07/16
