Abstract
In financial distress prediction (FDP), data quality is critical to developing effective prediction models. Related studies often apply feature selection to filter out unrepresentative features from a set of financial ratios, or data re-sampling to re-balance class-imbalanced FDP training sets. Although both types of data pre-processing methods have demonstrated their effectiveness, they have rarely been applied together to develop FDP models. Moreover, the performance of various feature selection algorithms (filter, wrapper, and embedded methods) and data re-sampling algorithms (under-sampling, over-sampling, and hybrid sampling methods) has not been fully investigated in FDP. Therefore, this study compares several feature selection and data re-sampling methods, employed alone and in combination in different orders. The experimental results based on nine FDP datasets show that data re-sampling alone always outperforms feature selection alone for developing FDP models, with hybrid sampling being the better choice. In most cases, better prediction performance is obtained by performing feature selection first and data re-sampling second. The best combination uses the decision tree method for feature selection and Synthetic Minority Over-sampling Technique-Edited Nearest Neighbors (SMOTE-ENN) for hybrid sampling; it allows the random forest classifier to produce the highest prediction accuracy. For the Type I error, however, where crisis cases are misclassified into the non-crisis class, the lowest error rate is produced by under-sampling alone using the ClusterCentroids algorithm combined with the random forest classifier.
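The abstract reports two separate measures, prediction accuracy and the Type I error, and the best method differs between them. The following minimal Python sketch illustrates how the two can diverge on a class-imbalanced sample. The function names, the label encoding (1 = crisis, 0 = non-crisis), and the toy data are assumptions made for illustration; they are not taken from the study.

```python
def accuracy(y_true, y_pred):
    """Share of all cases whose predicted class equals the true class."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def type_i_error(y_true, y_pred, crisis_label=1):
    """Share of true crisis cases that were predicted as non-crisis.

    This follows the definition used in the abstract; the label
    encoding (crisis_label=1) is an assumption of this sketch.
    """
    crisis_preds = [p for t, p in zip(y_true, y_pred) if t == crisis_label]
    if not crisis_preds:
        raise ValueError("y_true contains no crisis cases")
    missed = sum(1 for p in crisis_preds if p != crisis_label)
    return missed / len(crisis_preds)


# Hypothetical imbalanced sample: 10 crisis firms followed by 90 non-crisis firms.
y_true = [1] * 10 + [0] * 90

# A degenerate model that labels every firm as non-crisis.
y_all_non_crisis = [0] * 100

# A model that catches 8 of 10 crisis firms but raises 5 false alarms.
y_balanced = [1] * 8 + [0] * 2 + [1] * 5 + [0] * 85

for name, y_pred in [("all non-crisis", y_all_non_crisis),
                     ("balanced", y_balanced)]:
    print(name, accuracy(y_true, y_pred), type_i_error(y_true, y_pred))
```

In this toy sample, the degenerate model reaches 0.9 accuracy while missing every crisis firm (Type I error of 1.0), whereas the second model reaches 0.93 accuracy with a Type I error of 0.2. This is one reason a re-balanced training set can matter more for the Type I error than for accuracy.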
| Original language | English |
|---|---|
| Pages (from-to) | 2205-2229 |
| Number of pages | 25 |
| Journal | Journal of Forecasting |
| Volume | 44 |
| Issue number | 7 |
| DOIs | |
| State | Published - Nov 2025 |
Bibliographical note
Publisher Copyright: © 2025 John Wiley & Sons Ltd.
Keywords
- data quality
- data re-sampling
- data science
- feature selection
- financial distress prediction
- machine learning