Learning to detect representative data for large scale instance selection

Wei Chao Lin, Chih Fong Tsai*, Shih Wen Ke, Chia Wen Hung, William Eberle

*Corresponding author for this work

Research output: Contribution to journal, Article, peer-reviewed

30 Citations (Scopus)

Abstract

Instance selection is an important data pre-processing step in the knowledge discovery process. However, the datasets of many domain problems are very large, and some are even non-stationary, composed of both old data and a large amount of new data samples. Current algorithms for this type of scalability problem have limitations: they incur a very high computational cost when performing instance selection over very large scale datasets. To this end, we introduce the ReDD (Representative Data Detection) approach, which is based on outlier pattern analysis and prediction. First, a machine learning model, or detector, is used to learn the patterns of (un)representative data selected by a specific instance selection method from a small amount of training data. Then, the detector is applied to the remaining large amount of training data, or to newly added data, to detect which samples are representative. We empirically evaluate ReDD over 50 domain datasets to examine the effectiveness of the learned detector, using four very large scale datasets for validation. The experimental results show that ReDD not only reduces the computational cost by a factor of nearly two to three compared with three baselines, but also maintains the final classification accuracy.
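
The two-step procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the edited nearest neighbour rule stands in for the "specific instance selection method", a k-NN vote among same-class instances stands in for the detector, and the function names (`enn_flags`, `train_detector`, `redd_select`) and their parameters are hypothetical.

```python
import math
import random
from collections import Counter


def enn_flags(X, y, k=3):
    """Stand-in instance selection (edited nearest neighbour rule).

    Returns 1 for an instance whose label matches the majority label of
    its k nearest neighbours (representative), otherwise 0.
    """
    flags = []
    for i, point in enumerate(X):
        others = [j for j in range(len(X)) if j != i]
        nearest = sorted(others, key=lambda j: math.dist(point, X[j]))[:k]
        majority = Counter(y[j] for j in nearest).most_common(1)[0][0]
        flags.append(1 if majority == y[i] else 0)
    return flags


def train_detector(X_small, y_small, flags, k=3):
    """Assumed detector: a k-NN vote over the flags of the small subset,
    restricted to instances that share the query's class label."""

    def detect(x, label):
        same = [i for i, c in enumerate(y_small) if c == label]
        if not same:
            return 1  # nothing to compare against: keep the instance
        nearest = sorted(same, key=lambda i: math.dist(x, X_small[i]))[:k]
        return Counter(flags[i] for i in nearest).most_common(1)[0][0]

    return detect


def redd_select(X, y, small_size, k=3, seed=0):
    """Run instance selection on a small random subset, learn a detector
    from its output, and let the detector decide on all other instances.
    Returns the sorted indices of the instances that are kept."""
    if not k < small_size <= len(X):
        raise ValueError("need k < small_size <= len(X)")
    small = random.Random(seed).sample(range(len(X)), small_size)
    X_small = [X[i] for i in small]
    y_small = [y[i] for i in small]

    # Step 1: the (expensive) selection method only sees the small subset.
    flags = enn_flags(X_small, y_small, k)
    detect = train_detector(X_small, y_small, flags, k)

    # Step 2: the detector labels the remaining (or newly added) data.
    kept = [i for i, flag in zip(small, flags) if flag == 1]
    in_small = set(small)
    for i in range(len(X)):
        if i not in in_small and detect(X[i], y[i]) == 1:
            kept.append(i)
    return sorted(kept)
```

In this sketch the selection rule runs only on `small_size` instances, and each remaining instance costs a single detector query, which is where a saving in computation would come from.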

Original language: English
Article number: 9498
Pages (from-to): 1-8
Number of pages: 8
Journal: Journal of Systems and Software
Volume: 106
Publication status: Published - 1 Aug 2015
Externally published

Bibliographical note

Publisher Copyright:
© 2015 Elsevier Inc.
