TY - JOUR
T1 - Instance selection in medical datasets
T2 - A divide-and-conquer framework
AU - Huang, Min Wei
AU - Tsai, Chih Fong
AU - Lin, Wei Chao
N1 - Publisher Copyright: © 2020
PY - 2021/3
Y1 - 2021/3
N2 - Instance selection is an important problem in medical data mining. It aims to select representative data samples from a given training set while filtering out unrepresentative (or noisy) samples. This reduces the size of the training set, which then requires less storage space. In addition, when the instance selection algorithm is carefully chosen, the reduced training set contains less noisy data, and classifiers trained on it usually perform better than those trained without instance selection. Many instance selection algorithms have been proposed in the literature. However, different algorithms tend to use different criteria to identify noisy data, making it difficult to find the best algorithm for datasets from different domains. In other words, an algorithm may outperform others on datasets from some domains but underperform them on datasets from other domains. Instead of developing a novel algorithm that performs better than most other algorithms, this paper introduces a divide-and-conquer based instance selection (DCIS) framework that aims to improve the performance of each specific instance selection algorithm per se. Two well-known algorithms, DROP3 and IB3, are used as the baselines, and various small- and large-scale medical datasets are used in the experiments. The results show that when DROP3 and IB3 perform instance selection within the DCIS framework, the k-NN and SVM classifiers perform better than they do with the plain DROP3 and IB3 baselines, respectively.
AB - Instance selection is an important problem in medical data mining. It aims to select representative data samples from a given training set while filtering out unrepresentative (or noisy) samples. This reduces the size of the training set, which then requires less storage space. In addition, when the instance selection algorithm is carefully chosen, the reduced training set contains less noisy data, and classifiers trained on it usually perform better than those trained without instance selection. Many instance selection algorithms have been proposed in the literature. However, different algorithms tend to use different criteria to identify noisy data, making it difficult to find the best algorithm for datasets from different domains. In other words, an algorithm may outperform others on datasets from some domains but underperform them on datasets from other domains. Instead of developing a novel algorithm that performs better than most other algorithms, this paper introduces a divide-and-conquer based instance selection (DCIS) framework that aims to improve the performance of each specific instance selection algorithm per se. Two well-known algorithms, DROP3 and IB3, are used as the baselines, and various small- and large-scale medical datasets are used in the experiments. The results show that when DROP3 and IB3 perform instance selection within the DCIS framework, the k-NN and SVM classifiers perform better than they do with the plain DROP3 and IB3 baselines, respectively.
KW - Data reduction
KW - Divide-and-conquer
KW - Instance selection
KW - Machine learning
KW - Medical data mining
UR - http://www.scopus.com/inward/record.url?scp=85098667252&partnerID=8YFLogxK
U2 - 10.1016/j.compeleceng.2020.106957
DO - 10.1016/j.compeleceng.2020.106957
M3 - Article
AN - SCOPUS:85098667252
SN - 0045-7906
VL - 90
JO - Computers and Electrical Engineering
JF - Computers and Electrical Engineering
M1 - 106957
ER -