Interaction effect between data discretization and data resampling for class-imbalanced medical datasets

Min Wei Huang, Chih Fong Tsai, Wei Chao Lin, Jia Yang Lin

Research output: Contribution to journal › Journal Article › peer-review

Abstract

Background: Data discretization is an important preprocessing step in data mining that transforms continuous feature values into discrete ones, allowing some data mining algorithms to construct more effective models and facilitating the data mining process. Because many medical domain datasets are class imbalanced, data resampling methods, including oversampling, undersampling, and hybrid sampling, have been widely applied to rebalance the training set and enable effective differentiation between the majority and minority classes.

Objective: We examine the effect of incorporating both data discretization and data resampling into the analytical process on classifier performance for class-imbalanced medical datasets. The order in which these two steps are carried out is compared in the experiments.

Methods: Two experimental studies were conducted, one based on 11 two-class imbalanced medical datasets and the other on 3 multiclass imbalanced medical datasets. The two discretization algorithms employed are ChiMerge and the minimum description length principle (MDLP). The data resampling algorithms chosen for performance comparison are Tomek links undersampling, synthetic minority oversampling technique (SMOTE) oversampling, and the SMOTE-Tomek hybrid sampling algorithm. The support vector machine (SVM), C4.5 decision tree, and random forest (RF) techniques were used to examine the classification performance of the different approaches.

Results: On average, the combination approaches allow the classifiers to achieve area under the ROC curve (AUC) rates approximately 0.8%-3.5% and 0.9%-2.5% higher than the best baseline approach for two-class and multiclass imbalanced medical datasets, respectively. In particular, the best results for two-class imbalanced datasets are obtained by performing MDLP first for data discretization and SMOTE second for oversampling, which provides the highest AUC rate while requiring the least computational cost. For multiclass imbalanced datasets, performing SMOTE or SMOTE-Tomek first for data resampling and ChiMerge second for data discretization offers the best performance.

Conclusions: Classifiers with oversampling can provide better performance than the baseline method without oversampling. In contrast, performing data discretization does not necessarily make the classifiers outperform the baselines. On average, the combination approaches have the potential to allow the classifiers to provide higher AUC rates than the best baseline approach.
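
The ordering studied in the abstract can be prototyped in a few lines. The sketch below is illustrative only and assumes scikit-learn and imbalanced-learn: KBinsDiscretizer stands in for the supervised MDLP/ChiMerge discretizers (which are not part of scikit-learn), SMOTE performs the oversampling, and a random forest reports the AUC. The synthetic dataset, bin count, and hyperparameters are arbitrary assumptions, not the paper's experimental setup.

```python
# Sketch of the "discretize first, then oversample" order reported as best for
# two-class data. KBinsDiscretizer is a stand-in for MDLP/ChiMerge; the
# synthetic dataset and all hyperparameters are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import KBinsDiscretizer
from imblearn.over_sampling import SMOTE

# Synthetic class-imbalanced dataset (~90% majority, ~10% minority).
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.9, 0.1],
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# Step 1: discretize continuous features, fitting on the training set only.
disc = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
X_tr_d = disc.fit_transform(X_tr)
X_te_d = disc.transform(X_te)

# Step 2: rebalance the discretized training set with SMOTE oversampling.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_tr_d, y_tr)

# Step 3: train the classifier and evaluate AUC on the untouched test set.
clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_res, y_res)
auc = roc_auc_score(y_te, clf.predict_proba(X_te_d)[:, 1])
print(f"AUC (discretize -> SMOTE): {auc:.3f}")
```

Swapping steps 1 and 2 gives the resample-first order compared in the study; note that applying SMOTE after discretization interpolates between bin codes and yields non-integer values, which is one practical way the two steps interact.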

Original language: English
Pages (from-to): 1000-1013
Number of pages: 14
Journal: Technology and health care : official journal of the European Society for Engineering and Medicine
Volume: 33
Issue number: 2
DOIs
State: Published - 01 03 2025

Bibliographical note

Publisher Copyright:
© The Author(s) 2024.

Keywords

  • class imbalance
  • data mining
  • data resampling
  • discretization
  • machine learning
