Inverted file compression through document identifier reassignment

Wann Yun Shieh*, Tien Fu Chen, Jean Jyh Jiun Shann, Chung Ping Chung

*Corresponding author for this work

Research output: Contribution to journalJournal Article peer-review

62 Scopus citations

Abstract

The inverted file is the most popular indexing mechanism for document search in an information retrieval system. Compressing an inverted file can greatly improve document search rate. Traditionally, the d-gap technique is used in the inverted file compression by replacing document identifiers with usually much smaller gap values. However, fluctuating gap values cannot be efficiently compressed by some well-known prefix-free codes. To smoothen and reduce the gap values, we propose a document-identifier reassignment algorithm. This reassignment is based on a similarity factor between documents. We generate a reassignment order for all documents according to the similarity to reassign closer identifiers to the documents having closer relationships. Simulation results show that the average gap values of sample inverted files can be reduced by 30%, and the compression rate of d-gapped inverted file with prefix-free codes can be improved by 15%.

Original languageEnglish
Pages (from-to)117-131
Number of pages15
JournalInformation Processing and Management
Volume39
Issue number1
DOIs
StatePublished - 01 2003
Externally publishedYes

Keywords

  • D-gap
  • Document identifier reassignment
  • Information retrieval
  • Inverted file
  • Traveling salesman problem

Fingerprint

Dive into the research topics of 'Inverted file compression through document identifier reassignment'. Together they form a unique fingerprint.

Cite this