On mining incomplete medical datasets: Ordering imputation and classification

Chih Wen Chen, Wei Chao Lin, Shih Wen Ke, Chih Fong Tsai*, Ya Han Hu

*Corresponding author for this work

Research output: Journal contribution, article, peer-reviewed

Abstract

When medical datasets are collected, some data samples usually contain missing values, and mining such incomplete datasets is difficult. A common approach is missing value imputation, which estimates the missing values by reasoning from the observed data. Consequently, the effectiveness of missing value imputation depends heavily on the observed (complete) data in the incomplete dataset. OBJECTIVE: This paper examines whether performing instance selection to filter noisy data (or outliers) out of the given complete data improves the final imputation result. Specifically, four different processes for combining instance selection and missing value imputation are proposed and compared in terms of data classification. METHODS: Experiments are conducted on 11 medical datasets containing categorical, numerical, and mixed attribute types. Missing values are introduced into all attributes of each dataset at rates of 10%, 20%, 30%, 40%, and 50%. DROP3 is used for instance selection and k-nearest neighbor imputation for missing value imputation. A support vector machine (SVM) classifier is used to assess the final classification accuracy of the four processes. RESULTS: The second process, which performs instance selection first and imputation second, allows the SVM classifiers to outperform those of the other processes. CONCLUSIONS: Incomplete medical datasets require missing value imputation. This paper demonstrates that instance selection can be used to filter out noisy data or outliers before the imputation process. In other words, the observed data used for missing value imputation may contain noisy information, which can degrade the quality of the imputation result as well as the classification performance.
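The ordering described in the abstract (instance selection first, imputation second, then SVM classification) can be illustrated with a minimal sketch. It is not the authors' implementation, and several choices are assumptions made only for illustration: scikit-learn has no DROP3, so a simple Wilson-style edited nearest neighbor filter (the hypothetical helper `enn_keep_mask`) stands in for it; the bundled breast cancer dataset stands in for the 11 medical datasets; and missing values are injected into roughly half of the training rows so that a complete subset remains. The function name `select_then_impute_accuracy` and all parameter values are likewise hypothetical.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def enn_keep_mask(X, y, k=3):
    """Wilson-style editing (stand-in for DROP3, which is more elaborate).

    A sample is kept only if the majority label of its k nearest
    neighbours (the sample itself excluded) matches its own label.
    Labels are assumed to be non-negative integers.
    """
    # kneighbors() without a query returns neighbours of the fitted
    # points and does not count a point as its own neighbour.
    idx = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(return_distance=False)
    majority = np.array([np.bincount(labels).argmax() for labels in y[idx]])
    return majority == y


def select_then_impute_accuracy(missing_rate=0.2, seed=0):
    """Instance selection -> kNN imputation -> SVM; returns test accuracy."""
    rng = np.random.default_rng(seed)
    X, y = load_breast_cancer(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y
    )
    # Standardise so that distances are comparable across attributes.
    scaler = StandardScaler().fit(X_tr)
    X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

    # Assumption: only about half of the training rows receive missing cells.
    cell_mask = rng.random(X_tr.shape) < missing_rate
    cell_mask &= (rng.random(len(X_tr)) < 0.5)[:, None]
    X_miss = X_tr.copy()
    X_miss[cell_mask] = np.nan

    complete = ~np.isnan(X_miss).any(axis=1)
    X_c, y_c = X_miss[complete], y_tr[complete]

    # Step 1: filter noisy samples out of the complete subset.
    keep = enn_keep_mask(X_c, y_c, k=3)
    X_sel, y_sel = X_c[keep], y_c[keep]

    # Step 2: impute the incomplete rows using only the retained samples.
    imputer = KNNImputer(n_neighbors=5).fit(X_sel)
    X_imp = imputer.transform(X_miss[~complete])

    # Step 3: train the SVM on retained plus imputed samples.
    X_train = np.vstack([X_sel, X_imp])
    y_train = np.concatenate([y_sel, y_tr[~complete]])
    clf = SVC(kernel="rbf").fit(X_train, y_train)
    return clf.score(X_te, y_te), int(complete.sum()), int(keep.sum())


if __name__ == "__main__":
    acc, n_complete, n_kept = select_then_impute_accuracy()
    print(f"complete rows: {n_complete}, kept after filtering: {n_kept}, accuracy: {acc:.3f}")
```

Fitting the imputer on the filtered complete subset is what makes the order matter: the neighbors used to estimate each missing value are drawn only from samples that survived the noise filter. Swapping steps 1 and 2 would give one of the alternative orderings under the same assumptions.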

Original language: English
Pages (from - to): 619-625
Number of pages: 7
Journal: Technology and Health Care
Volume: 23
Issue number: 5
Publication status: Published - 22 Sep 2015
Externally published

Bibliographical note

Publisher Copyright:
© 2015 IOS Press and the authors.
