Lexical and Syntactical Distinctions between Machine Translation and Human Writing: Chinese as the Target Language

Project: National Science and Technology Council / National Science and Technology Council Academic Grants

Project Details

Abstract

Online machine translation systems such as Google Translate are now widely used, although their output is not always satisfactory: a native speaker can easily distinguish text produced by a machine translation system from text written by another native speaker. However, it is not straightforward to describe the distinctions between the two types of text systematically. Automatic machine translation evaluation often relies on comparing the output with several sets of reference translations produced by human translators. Such evaluation focuses on the score rather than on the types of distinctions. The proposed research aims to adapt concepts and techniques from studies of authorship attribution (i.e., identifying the author of a text) in order to describe the differences between the two types of text at the lexical and syntactic levels. The proposed research involves 1) collecting a set of texts composed by human writers and another set of texts translated by a machine translation system; 2) extracting features that require deep linguistic analysis; 3) using machine learning algorithms to identify the most important features; and 4) analyzing the results. Mandarin will be the target language, and Wikipedia will be used as the source of the text collection because of its multi-author and multilingual nature. To extract features that require deep linguistic analysis, a Chinese parser developed by Academia Sinica will be used to parse the text. The results of the proposed research will benefit applications in which identifying machine translation output is an important task, such as plagiarism detection. Machine translation developers can also use the results to improve their systems.
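The feature extraction and feature ranking steps (2 and 3) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's method: the function names `extract_features` and `rank_features` are hypothetical, the three features are simple character-level measures standing in for the parser-based features the abstract calls for, and the ranking uses a plain standardized mean difference in place of the unspecified machine learning algorithms.

```python
import re
from math import sqrt
from statistics import mean, pvariance

# Sentence boundaries: Chinese and ASCII end-of-sentence punctuation.
SENTENCE_END = re.compile(r"[。！？!?]+")


def extract_features(text):
    """Return simple character-level features for one document.

    Assumption: these surface features are placeholders for the deeper
    syntactic features that a parser would provide.
    """
    chars = [c for c in text if not c.isspace()]
    sentences = [s.strip() for s in SENTENCE_END.split(text) if s.strip()]
    n = len(chars) or 1  # avoid division by zero on empty input
    return {
        # Mean sentence length in characters (end punctuation excluded).
        "avg_sentence_len": mean(len(s) for s in sentences) if sentences else 0.0,
        # Distinct characters divided by total characters.
        "type_token_ratio": len(set(chars)) / n,
        # Relative frequency of the particle 的.
        "de_rate": chars.count("的") / n,
    }


def rank_features(human_rows, mt_rows):
    """Rank features by how well they separate the two groups.

    Each argument is a list of feature dicts, one dict per document.
    The score is |mean difference| / pooled standard deviation, a simple
    filter-style criterion (an assumption, not the project's algorithm).
    """
    scores = []
    for name in human_rows[0]:
        h = [row[name] for row in human_rows]
        m = [row[name] for row in mt_rows]
        diff = abs(mean(h) - mean(m))
        pooled = sqrt((pvariance(h) + pvariance(m)) / 2)
        if pooled == 0:
            # No within-group spread: perfectly separating or uninformative.
            score = float("inf") if diff > 0 else 0.0
        else:
            score = diff / pooled
        scores.append((name, score))
    # Highest-scoring (most discriminative) features first.
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

In use, each collected document would be passed through `extract_features`, and the two resulting lists (human-written and machine-translated) would be handed to `rank_features`; the features at the top of the returned list are the ones whose values differ most between the two groups relative to their spread.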

Project IDs

Project ID: PB10112-0087
External Project ID: NSC101-2410-H182-031
Status: Finished
Effective start/end date: 01/11/12 to 31/10/13

Keywords

  • natural language processing
  • text mining
  • machine translation
  • Wikipedia
  • authorship attribution
