Posting file partitioning and parallel information retrieval

Yung Cheng Ma*, Chung Ping Chung

*Corresponding author for this work

Research output: Contribution to journal, Journal Article, peer-review

5 Scopus citations

Abstract

The rapid growth in Internet usage brings new challenges in designing a scalable information retrieval system. To reduce the response time of a query to a large database, we parallelize both the CPU computation and the disk access of Boolean query processing on a cluster of workstations. The key issue is to partition the inverted file such that, during parallel query processing, each workstation consults only its own locally resident data to complete its task. To achieve this goal, we treat the set of all postings referring to a document ID as an object to be allocated in a data placement problem. Following the principle of partitioning by document ID, we develop posting file partitioning algorithms that transform a sequential information retrieval system into a parallel one. The advantage is that a better speed-up can be achieved by building on a fast sequential approach, the compressed posting file. The partitioning schemes are designed to balance the workload of the workstations in parallel query processing without increasing the average disk access time per posting. Experiments show that almost linear speed-up can be achieved and that the performance bottleneck in previous work, which parallelizes only disk access, can be removed. This work shows that, by using parallel processing techniques, it is feasible to build a scalable information retrieval system.
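
As an illustration of the partitioning-by-document-ID principle described in the abstract, the following is a minimal Python sketch, not the authors' algorithm. It assumes a toy in-memory inverted file and a simple modulo placement rule; the function names (`partition_by_doc_id`, `local_and`, `parallel_and`) are hypothetical, and the workload-balancing and compression aspects of the paper are not reproduced. The loop over partitions stands in for independent workstations.

```python
from collections import defaultdict

def partition_by_doc_id(inverted, num_nodes):
    """Split an inverted file so all postings of a document land on one node.

    Assumption: documents are placed by doc_id % num_nodes. The paper's
    schemes balance workload; this rule is only a stand-in.
    """
    parts = [defaultdict(list) for _ in range(num_nodes)]
    for term, doc_ids in inverted.items():
        for doc_id in doc_ids:
            parts[doc_id % num_nodes][term].append(doc_id)
    return parts

def local_and(part, terms):
    """Evaluate a conjunctive Boolean query using one node's postings only."""
    lists = [set(part.get(term, ())) for term in terms]
    return set.intersection(*lists) if lists else set()

def parallel_and(parts, terms):
    """Union the per-node answers; no node needs another node's data."""
    result = set()
    for part in parts:  # each iteration could run on a separate workstation
        result |= local_and(part, terms)
    return sorted(result)

# Toy inverted file: term -> list of document IDs containing it.
inverted = {
    "parallel": [1, 2, 4, 7, 8],
    "retrieval": [2, 3, 4, 8, 9],
}
parts = partition_by_doc_id(inverted, 3)
print(parallel_and(parts, ["parallel", "retrieval"]))  # prints [2, 4, 8]
```

Because every posting for a given document sits on the same node, the intersection can be computed locally and the partial answers only need to be merged at the end.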

Original language: English
Pages (from-to): 113-127
Number of pages: 15
Journal: Journal of Systems and Software
Volume: 63
Issue number: 2
State: Published - 15 Aug 2002
Externally published: Yes
