On-the-fly detection of content-poor webpaths

  • National Chung Cheng University

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Web page crawling is an essential part of a web search engine. Because the number of pages on the Web is so large, it is practically impossible for a search engine to cover them all. An important question for the search engine is then: "Which web pages should be crawled and indexed?" We observed that most of the index-worthless web pages in a web site reside in the same directory or are generated by the same CGI program. We use the term webpath to denote the set of web pages residing in the same directory or generated by the same CGI program, and we call a webpath content-poor if it contains mostly index-worthless web pages. In this paper, we present an approach to detect content-poor webpaths on the fly, so that the crawler can improve the quality of the crawled data. We use a statistical approach that analyzes URL patterns and page content structures in the crawled pages to decide whether a webpath is content-poor. Our experimental results show that, given a fixed time interval, a crawler with content-poor webpath filtering produces a search index with approximately 10% better search results than the original crawler without the filter. The detection precision exceeds 90%.
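
The webpath definition in the abstract can be illustrated with a small sketch. It is not the authors' method: the abstract does not give the statistical test, so the version below assumes a simple ratio threshold over pages already judged index-worthless by some external check. The names `webpath_key`, `WebpathStats`, `min_pages`, and `poor_ratio` are hypothetical, and treating any URL with a query string as CGI-generated is a simplifying assumption.

```python
from collections import defaultdict
from urllib.parse import urlsplit
import posixpath


def webpath_key(url: str) -> str:
    """Map a URL to a webpath key (hypothetical helper, not from the paper)."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        # Assumption: a query string means the page is program-generated,
        # so all pages from the same program share the path before '?'.
        return parts.netloc + path
    # Otherwise group the page with the other pages in its directory.
    directory = posixpath.dirname(path)
    if not directory.endswith("/"):
        directory += "/"
    return parts.netloc + directory


class WebpathStats:
    """Running per-webpath counts, updated as pages are crawled."""

    def __init__(self, min_pages: int = 5, poor_ratio: float = 0.8):
        # Both thresholds are illustrative values, not taken from the paper.
        self.min_pages = min_pages
        self.poor_ratio = poor_ratio
        self.seen = defaultdict(int)
        self.poor = defaultdict(int)

    def record(self, url: str, is_worthless: bool) -> None:
        """Record one crawled page and whether it was judged index-worthless."""
        key = webpath_key(url)
        self.seen[key] += 1
        if is_worthless:
            self.poor[key] += 1

    def is_content_poor(self, url: str) -> bool:
        """Return True if the URL's webpath has enough samples and is mostly worthless."""
        key = webpath_key(url)
        total = self.seen.get(key, 0)
        if total < self.min_pages:
            return False  # too few samples to decide
        return self.poor.get(key, 0) / total >= self.poor_ratio


if __name__ == "__main__":
    stats = WebpathStats(min_pages=3, poor_ratio=0.8)
    for day in range(4):
        stats.record(f"http://example.com/cal.cgi?day={day}", True)
    stats.record("http://example.com/docs/a.html", False)
    print(stats.is_content_poor("http://example.com/cal.cgi?day=99"))
    print(stats.is_content_poor("http://example.com/docs/b.html"))
```

A crawler could consult `is_content_poor` before fetching a queued URL and skip it when the webpath has already been flagged, which is one way to read "on the fly" in the abstract.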

Original language: English
Title of host publication: Proceedings of the Second IASTED International Conference on Web Technologies, Applications, and Services, WTAS 2006
Pages: 197-203
Number of pages: 7
State: Published - 2006
Externally published: Yes
Event: 2nd IASTED International Conference on Web Technologies, Applications, and Services, WTAS 2006 - Calgary, AB, Canada
Duration: 17 07 2006 to 19 07 2006

Publication series

Name: Proceedings of the Second IASTED International Conference on Web Technologies, Applications, and Services, WTAS 2006

Conference

Conference: 2nd IASTED International Conference on Web Technologies, Applications, and Services, WTAS 2006
Country/Territory: Canada
City: Calgary, AB
Period: 17/07/06 to 19/07/06

Keywords

  • Content filter
  • Web crawler
