Towards hybrid over- and under-sampling combination methods for class imbalanced datasets: an experimental study

Cian Lin, Chih Fong Tsai, Wei Chao Lin*

*Corresponding author for this work

Research output: Contribution to journal › Journal Article › peer-review

20 Scopus citations

Abstract

The skewed class distributions of many class-imbalanced domain datasets often make it difficult for machine learning techniques to construct effective models. In such cases, data re-sampling techniques, such as under-sampling the majority class and over-sampling the minority class, are usually employed. In the related literature, some studies have shown that hybrid combinations of under- and over-sampling methods in different orders can produce better results. However, each study compares only with either under- or over-sampling methods to reach its final conclusion. Therefore, the research objective of this paper is to find out which order of combining under- and over-sampling methods performs better. Experiments are conducted on 44 datasets from different domains using three over-sampling algorithms (SMOTE, CTGAN, and TAN) and three under-sampling (i.e. instance selection) algorithms (IB3, DROP3, and GA). The results show that if the under-sampling algorithm is chosen carefully, i.e. IB3, no significant performance improvement is obtained by adding an over-sampling step. Furthermore, with the IB3 algorithm, it is better to perform instance selection first and over-sampling second than the reverse order; this combination allows the random forest classifier to achieve the highest AUC.
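The two combination orders compared in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: random under-sampling stands in for the instance selection algorithms (IB3, DROP3, GA), a simplified SMOTE-style interpolation without nearest-neighbour search stands in for the over-sampling algorithms (SMOTE, CTGAN, TAN), and all function names, the toy data, and the target class sizes are assumptions made for illustration only.

```python
import random


def make_toy_data(rng, n_majority=90, n_minority=10):
    """Hypothetical 2-D toy dataset: class 0 is the majority, class 1 the minority."""
    X = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(n_majority)]
    X += [[rng.gauss(2.0, 1.0), rng.gauss(2.0, 1.0)] for _ in range(n_minority)]
    y = [0] * n_majority + [1] * n_minority
    return X, y


def random_under_sample(X, y, majority, keep, rng):
    """Stand-in under-sampler: keep at most `keep` randomly chosen majority rows."""
    majority_idx = [i for i, label in enumerate(y) if label == majority]
    kept = set(rng.sample(majority_idx, min(keep, len(majority_idx))))
    idx = [i for i, label in enumerate(y) if label != majority or i in kept]
    return [X[i] for i in idx], [y[i] for i in idx]


def interpolate_over_sample(X, y, minority, target, rng):
    """Stand-in over-sampler: add points interpolated between two random
    minority rows until the minority class has `target` rows."""
    minority_rows = [x for x, label in zip(X, y) if label == minority]
    X_new, y_new = list(X), list(y)
    for _ in range(max(0, target - len(minority_rows))):
        a, b = rng.choice(minority_rows), rng.choice(minority_rows)
        t = rng.random()
        X_new.append([ai + t * (bi - ai) for ai, bi in zip(a, b)])
        y_new.append(minority)
    return X_new, y_new


def under_then_over(X, y, rng, size=40):
    """Order 1: under-sample the majority class first, then over-sample the minority."""
    X1, y1 = random_under_sample(X, y, majority=0, keep=size, rng=rng)
    return interpolate_over_sample(X1, y1, minority=1, target=size, rng=rng)


def over_then_under(X, y, rng, size=40):
    """Order 2: over-sample the minority class first, then under-sample the majority."""
    X1, y1 = interpolate_over_sample(X, y, minority=1, target=size, rng=rng)
    return random_under_sample(X1, y1, majority=0, keep=size, rng=rng)
```

Calling `under_then_over(*make_toy_data(rng), rng)` with `rng = random.Random(0)` returns a re-balanced training set that could then be passed to a classifier. With these class-specific stand-ins both orders end with the same class counts; with real instance selection algorithms the second step sees different input depending on the order, so the resulting training sets would generally differ.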

Original language: English
Pages (from-to): 845-863
Number of pages: 19
Journal: Artificial Intelligence Review
Volume: 56
Issue number: 2
State: Published - Feb 2023
Externally published: Yes

Bibliographical note

Publisher Copyright:
© 2022, The Author(s), under exclusive licence to Springer Nature B.V.

Keywords

  • Class imbalance
  • Data science
  • Machine learning
  • Over-sampling
  • Under-sampling
