Abstract
The skewed class distributions of many imbalanced domain datasets often make it difficult for machine learning techniques to construct effective models. In such cases, data re-sampling techniques, such as under-sampling the majority class and over-sampling the minority class, are usually employed. In the related literature, some studies have shown that hybrid combinations of under- and over-sampling methods applied in different orders can produce better results. However, each of these studies compares against either under- or over-sampling methods alone to reach its final conclusion. Therefore, the research objective of this paper is to find out which order of combining under- and over-sampling methods performs better. Experiments are conducted on 44 datasets from different domains using three over-sampling algorithms, namely SMOTE, CTGAN, and TAN, and three under-sampling (i.e., instance selection) algorithms, namely IB3, DROP3, and GA. The results show that if the under-sampling algorithm is chosen carefully, i.e., IB3, no significant performance improvement is obtained by adding a further over-sampling step. Furthermore, with the IB3 algorithm, it is better to perform instance selection first and over-sampling second rather than the reverse order, which allows the random forest classifier to achieve the highest AUC.
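The pipeline the abstract describes (instance selection on the training data first, over-sampling the minority class second, then training a random forest and scoring AUC) can be sketched briefly. IB3 has no standard scikit-learn implementation, so in this sketch `EditedNearestNeighbours` from imbalanced-learn stands in for the instance-selection step, and a synthetic dataset replaces the paper's 44 domain datasets; this is an illustrative assumption, not the authors' exact setup.

```python
# Minimal sketch of the "under-sample first, over-sample second" pipeline.
# Assumptions: EditedNearestNeighbours stands in for IB3, and a synthetic
# imbalanced dataset replaces the 44 domain datasets used in the paper.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import EditedNearestNeighbours

# Imbalanced binary dataset (roughly 9:1 majority:minority).
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

# Step 1: instance selection (under-sampling) on the training set.
X_us, y_us = EditedNearestNeighbours().fit_resample(X_train, y_train)

# Step 2: over-sample the minority class of the reduced set with SMOTE.
X_bal, y_bal = SMOTE(random_state=42).fit_resample(X_us, y_us)

# Step 3: train a random forest and report the test-set AUC.
clf = RandomForestClassifier(random_state=42).fit(X_bal, y_bal)
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print(f"AUC: {auc:.3f}")
```

Running the reverse order amounts to swapping steps 1 and 2, i.e., applying SMOTE to the raw training set and then filtering the enlarged set with the instance-selection method.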
| Original language | English |
|---|---|
| Pages (from-to) | 845-863 |
| Number of pages | 19 |
| Journal | Artificial Intelligence Review |
| Volume | 56 |
| Issue number | 2 |
| DOIs | |
| State | Published - Feb. 2023 |
| Externally published | Yes |
Bibliographical note
Publisher Copyright: © 2022, The Author(s), under exclusive licence to Springer Nature B.V.
Keywords
- Class imbalance
- Data science
- Machine learning
- Over-sampling
- Under-sampling