Load and storage balanced posting file partitioning for parallel information retrieval

Yung Cheng Ma*, Chung Ping Chung, Tien Fu Chen

*Corresponding author for this work

Research output: Contribution to journal › Journal Article › peer-review

5 Scopus citations

Abstract

Many major search engines on the Internet use a large-scale cluster to store a large database and cope with a high query arrival rate. In designing a large-scale parallel information retrieval system, performance and storage cost have to be considered together. Moreover, a quantitative method for designing the cluster in a systematic way is required. This paper proposes a posting file partitioning algorithm that meets these requirements. The partitioning follows the partition-by-document-ID principle to eliminate communication overhead. The kernel of the partitioning is a data allocation algorithm that allocates variable-sized data items for both load and storage balancing. The data allocation algorithm is proven to satisfy a load balancing constraint with asymptotically 1-optimal storage cost. A probability model is established so that query processing throughput can be calculated from keyword popularities and the data allocation result. With these results, we show a quantitative method for designing a cluster systematically. This research provides a systematic approach to large-scale information retrieval system design. The approach has the following features: (1) The deviations from ideal load balancing and ideal storage balancing are negligible in real-world applications. (2) Load balancing and storage balancing can be considered together without conflict. (3) The data allocation algorithm can handle data items of variable sizes and variable loads. An algorithm combining all of these features has not been achieved before, and it is the key factor in achieving a load- and storage-balanced workstation cluster in a real-world environment.
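The partition-by-document-ID principle mentioned in the abstract can be illustrated with a minimal sketch. The function names and the modulo assignment rule below are assumptions made for illustration only; they are not the paper's data allocation algorithm, which additionally balances load and storage. The sketch only shows why this style of partitioning lets each node answer a query from its local postings, with no inter-node communication until the results are merged.

```python
from collections import defaultdict


def partition_by_document_id(postings, num_nodes):
    """Split each term's posting list across nodes by document ID.

    `postings` maps a term to a list of document IDs.
    The modulo rule is a placeholder assignment (an assumption),
    not the load- and storage-balanced allocation from the paper.
    """
    nodes = [defaultdict(list) for _ in range(num_nodes)]
    for term, doc_ids in postings.items():
        for doc_id in doc_ids:
            nodes[doc_id % num_nodes][term].append(doc_id)
    return nodes


def local_and_query(node_index, terms):
    """Answer a conjunctive query on one node using only its local postings."""
    if not terms:
        return set()
    result = set(node_index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(node_index.get(term, []))
    return result


def parallel_and_query(nodes, terms):
    """Merge per-node answers; nodes never exchange posting data."""
    results = set()
    for node_index in nodes:
        results |= local_and_query(node_index, terms)
    return results


postings = {"load": [1, 2, 3, 4, 5, 6], "storage": [2, 4, 5, 7]}
nodes = partition_by_document_id(postings, 3)
matches = parallel_and_query(nodes, ["load", "storage"])
```

Because every posting for a given document lives on the same node, the intersection of posting lists can be computed locally, and the final answer is simply the union of the per-node results.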

Original language: English
Pages (from-to): 864-884
Number of pages: 21
Journal: Journal of Systems and Software
Volume: 84
Issue number: 5
State: Published - May 2011

Keywords

  • Inverted file
  • Load balancing
  • Parallel information retrieval
  • Storage balancing
