Project Details
Abstract
Lexical and Syntactical Distinctions between Machine Translation and Human Writing: Chinese as the
Target Language
Online machine translation systems such as Google Translate are now widely used,
although their output is not always satisfactory: a native speaker can easily
distinguish text produced by a machine translation system from text written by another
native speaker. However, describing the distinctions between the two types of text
systematically is not straightforward. Automatic machine translation evaluation often relies
on comparing the output with several sets of reference translations produced by human
translators. Such evaluation focuses on the score, not on the types of distinctions. The
proposed research aims to adapt concepts and techniques from studies of authorship
attribution (i.e., identifying the author of a text) to describe the differences between the two
types of text at the lexical and syntactic levels.
The proposed research involves 1) collecting a set of texts composed by human writers and another
set of texts translated by a machine translation system; 2) extracting features that require deep
linguistic analysis; 3) using machine learning algorithms to identify the most important features;
and 4) analyzing the results. Mandarin will be the target language, and Wikipedia will be used as the
source of the text collection because of its multi-author and multilingual nature. To extract
features that require deep linguistic analysis, a Chinese parser developed by Academia Sinica will
be used to parse the text.
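As a rough illustration of steps 2 and 3, the sketch below ranks features by how well their presence separates human-written from machine-translated texts. It is not the project's method: character bigrams stand in for the parser-derived features (the Academia Sinica parser is not called here), information gain stands in for whichever machine learning algorithm the project used, and the names `char_bigrams`, `entropy`, and `rank_features` are hypothetical.

```python
import math
from collections import Counter


def char_bigrams(text):
    """Return the set of character bigrams in a text, ignoring whitespace.

    Assumption: bigrams are a simple stand-in for the lexical and
    syntactic features a real parser would provide.
    """
    chars = [c for c in text if not c.isspace()]
    return {a + b for a, b in zip(chars, chars[1:])}


def entropy(pos, neg):
    """Entropy in bits of a two-class split with the given counts."""
    total = pos + neg
    if total == 0:
        return 0.0
    h = 0.0
    for n in (pos, neg):
        if n:
            p = n / total
            h -= p * math.log2(p)
    return h


def rank_features(human_texts, mt_texts, top_k=10):
    """Rank bigram-presence features by information gain about the class."""
    n_h, n_m = len(human_texts), len(mt_texts)
    total = n_h + n_m
    base = entropy(n_h, n_m)

    # Count, per class, how many documents contain each feature.
    in_human, in_mt = Counter(), Counter()
    for text in human_texts:
        in_human.update(char_bigrams(text))
    for text in mt_texts:
        in_mt.update(char_bigrams(text))

    scores = {}
    for feat in set(in_human) | set(in_mt):
        h1, m1 = in_human[feat], in_mt[feat]  # documents with the feature
        h0, m0 = n_h - h1, n_m - m1           # documents without it
        conditional = ((h1 + m1) / total) * entropy(h1, m1) \
            + ((h0 + m0) / total) * entropy(h0, m0)
        scores[feat] = base - conditional

    # Highest information gain first; ties broken alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
```

Calling `rank_features(human_texts, mt_texts)` with two lists of strings returns `(feature, score)` pairs, highest score first. A feature that appears in every text of one class and in none of the other receives the maximum score, which is the kind of distinction the analysis step would then examine.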
The results of the proposed research will benefit applications in which identifying machine
translation output is an important task, such as plagiarism detection. Machine translation
developers can also use the results to improve their systems.
Project IDs
Project ID: PB10112-0087
External Project ID: NSC101-2410-H182-031
Status | Finished |
---|---|
Effective start/end date | 01/11/12 → 31/10/13 |
Keywords
- natural language processing
- text mining
- machine translation
- Wikipedia
- authorship attribution