Abstract
Instance selection is an important data pre-processing step in the knowledge discovery process. However, the datasets of many domain problems are very large, and some are non-stationary, composed of both old data and a large amount of newly arriving data samples. Current algorithms that address this scalability problem are limited in that they still incur a very high computational cost when performing instance selection over very large-scale datasets. To this end, we introduce the ReDD (Representative Data Detection) approach, which is based on outlier pattern analysis and prediction. First, a machine learning model, or detector, is trained on a small portion of the training data to learn the patterns of the (un)representative data selected by a specific instance selection method. Then, the detector is applied to the remaining, much larger portion of the training data, or to newly added data, to detect (un)representative instances. We empirically evaluate ReDD over 50 domain datasets to examine the effectiveness of the learned detector, using four very large-scale datasets for validation. The experimental results show that ReDD not only reduces the computational cost by a factor of nearly two to three compared with three baselines, but also maintains the final classification accuracy.
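The two-step idea described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: Edited Nearest Neighbour (ENN) stands in for the "specific instance selection method", a 1-nearest-neighbour classifier stands in for the detector, the class label is appended to the features as the detector's input, and the function names and toy data are hypothetical.

```python
import math
from collections import Counter


def knn_indices(point, points, k, skip=None):
    """Indices of the k points nearest to `point` (Euclidean distance)."""
    candidates = (i for i in range(len(points)) if i != skip)
    return sorted(candidates, key=lambda i: math.dist(point, points[i]))[:k]


def enn_flags(X, y, k=3):
    """Stand-in instance selection (ENN): True means 'representative'.

    An instance is kept when its label matches the majority label of its
    k nearest neighbours. This is the costly step run on the small subset.
    """
    flags = []
    for i, x in enumerate(X):
        votes = Counter(y[j] for j in knn_indices(x, X, k, skip=i))
        flags.append(votes.most_common(1)[0][0] == y[i])
    return flags


def train_detector(X, y, flags):
    """Assumed detector: memorise (features + class label) -> keep flag."""
    augmented = [list(x) + [float(c)] for x, c in zip(X, y)]
    return augmented, list(flags)


def detect(detector, X, y, k=1):
    """Predict keep/discard flags for unseen data without rerunning ENN."""
    augmented, flags = detector
    predictions = []
    for x, c in zip(X, y):
        query = list(x) + [float(c)]
        votes = Counter(flags[j] for j in knn_indices(query, augmented, k))
        predictions.append(votes.most_common(1)[0][0])
    return predictions


# Step 1: run the instance selection method on a small labelled subset.
X_small = [[0.0], [0.5], [1.0], [1.5], [0.7],
           [10.0], [10.5], [11.0], [11.5], [10.7]]
y_small = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]  # two deliberately mislabelled points
small_flags = enn_flags(X_small, y_small)
detector = train_detector(X_small, y_small, small_flags)

# Step 2: let the detector flag the remaining (or newly added) data.
X_rest = [[0.2], [0.8], [10.2], [10.8]]
y_rest = [0, 1, 1, 0]
rest_flags = detect(detector, X_rest, y_rest)
print(rest_flags)  # [True, False, True, False]
```

In this toy setting the selection method runs only on the ten-point subset, and the detector labels the remaining points with a single nearest-neighbour lookup each. Whether such a detector generalises depends on the small subset containing examples of the unrepresentative patterns that appear in the rest of the data.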
Original language | English |
---|---|
Article number | 9498 |
Pages (from-to) | 1-8 |
Number of pages | 8 |
Journal | Journal of Systems and Software |
Volume | 106 |
DOIs | |
State | Published - 01 08 2015 |
Externally published | Yes |
Bibliographical note
Publisher Copyright: © 2015 Elsevier Inc.
Keywords
- Data mining
- Data reduction
- Instance selection