TY - JOUR
T1 - Feature selection and its combination with data over-sampling for multi-class imbalanced datasets
AU - Tsai, Chih Fong
AU - Chen, Kuan Chen
AU - Lin, Wei Chao
N1 - Publisher Copyright:
© 2024 Elsevier B.V.
PY - 2024/3
Y1 - 2024/3
N2 - Feature selection aims to filter out unrepresentative features from a given dataset in order to construct more effective learning models. Furthermore, ensemble feature selection, which combines multiple feature selection methods, has been shown to outperform single feature selection. However, the performance of different (ensemble) feature selection methods has not been fully examined on multi-class imbalanced datasets. On the other hand, for class-imbalanced datasets, one widely considered solution is to re-balance the data by over-sampling, which generates synthetic examples for the minority classes. However, the effect of performing (ensemble) feature selection on over-sampled multi-class imbalanced datasets has not been investigated. Therefore, the first research objective is to examine the performance of single and ensemble feature selection methods, built from fifteen well-known filter, wrapper, and embedded algorithms, in terms of classification accuracy. For the second research objective, the two possible orders of combining the feature selection and over-sampling steps are compared in order to identify the best combination procedure as well as the best combined algorithms. The experimental results, based on ten datasets from different domains with low to very high feature dimensionality, show that ensemble feature selection methods perform slightly better than single ones, although the differences are small. When combined with the Synthetic Minority Oversampling Technique (SMOTE), performing feature selection first and over-sampling second outperforms the reverse order. Although the best combined algorithms are based on ensemble feature selection, eXtreme Gradient Boosting (XGBoost), the best single feature selection algorithm, combined with SMOTE provides classification performance very similar to that of the best combined algorithms. Considering both classification performance and computational cost, the optimal solution is the combination of XGBoost and SMOTE.
AB - Feature selection aims to filter out unrepresentative features from a given dataset in order to construct more effective learning models. Furthermore, ensemble feature selection, which combines multiple feature selection methods, has been shown to outperform single feature selection. However, the performance of different (ensemble) feature selection methods has not been fully examined on multi-class imbalanced datasets. On the other hand, for class-imbalanced datasets, one widely considered solution is to re-balance the data by over-sampling, which generates synthetic examples for the minority classes. However, the effect of performing (ensemble) feature selection on over-sampled multi-class imbalanced datasets has not been investigated. Therefore, the first research objective is to examine the performance of single and ensemble feature selection methods, built from fifteen well-known filter, wrapper, and embedded algorithms, in terms of classification accuracy. For the second research objective, the two possible orders of combining the feature selection and over-sampling steps are compared in order to identify the best combination procedure as well as the best combined algorithms. The experimental results, based on ten datasets from different domains with low to very high feature dimensionality, show that ensemble feature selection methods perform slightly better than single ones, although the differences are small. When combined with the Synthetic Minority Oversampling Technique (SMOTE), performing feature selection first and over-sampling second outperforms the reverse order. Although the best combined algorithms are based on ensemble feature selection, eXtreme Gradient Boosting (XGBoost), the best single feature selection algorithm, combined with SMOTE provides classification performance very similar to that of the best combined algorithms. Considering both classification performance and computational cost, the optimal solution is the combination of XGBoost and SMOTE.
KW - Class imbalance learning
KW - Ensemble feature selection
KW - Feature selection
KW - Machine learning
KW - Over-sampling
UR - http://www.scopus.com/inward/record.url?scp=85183451096&partnerID=8YFLogxK
U2 - 10.1016/j.asoc.2024.111267
DO - 10.1016/j.asoc.2024.111267
M3 - Article
AN - SCOPUS:85183451096
SN - 1568-4946
VL - 153
JO - Applied Soft Computing Journal
JF - Applied Soft Computing Journal
M1 - 111267
ER -