Project Details
Abstract
The World Wide Web (WWW) has become an indispensable information platform since Tim Berners-Lee first demonstrated the concept of the Web. It is a vast platform of rich, dynamic information, and search engines are the key to accessing it. The first step in building a Web search engine is to crawl the pages of the WWW; the search engine then uses its copy of the Web to build an inverted index and provide search services. The number of web pages is estimated to be in the tens of billions, and possibly hundreds or thousands of billions, so building a large-scale crawler is a difficult but important research problem. This research project proposes a new crawler architecture based on the Service-Oriented Architecture (SOA) concept: the functions of the crawler are modularized and each module is exposed as a service. The new architecture is intended to let us crawl all the pages on the WWW. We received the first-year research grant and will continue the research based on the first-year results. The project is planned over two years, building on the previous year's work: 1) previous year: redesign the architecture of the crawler system based on the SOA concept, and study and implement a solution to the URL overlap problem; 2) first year: research and implement an algorithm for the data selection problem, and reduce the storage space required by designing a compression method; 3) second year: crawl and analyze the web pages of Taiwan and of the whole world, and research and design classification and data refresh algorithms for the crawled data. Through this project we expect to implement a large-scale SOA-based crawler, crawl the web pages on the WWW, and make the data freely available to researchers in Taiwan, which should substantially strengthen information retrieval research in Taiwan.
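The abstract does not define the URL overlap problem. The sketch below assumes it refers to recognizing URLs that the crawler has already seen, so that the same page is not fetched twice. The names `normalize_url` and `SeenUrls` are hypothetical, and the normalization rules are a simplified illustration, not the project's actual method.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Canonicalize a URL so trivially different spellings compare equal.

    Simplified assumption: lowercases scheme and host, drops default ports,
    user info, and fragments. Real crawlers apply many more rules.
    """
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Keep the port only when it is not the default for the scheme.
    is_default = (scheme == "http" and port == 80) or (scheme == "https" and port == 443)
    if port and not is_default:
        host = f"{host}:{port}"
    path = parts.path or "/"
    # The fragment is never sent to the server, so it is discarded.
    return urlunsplit((scheme, host, path, parts.query, ""))


class SeenUrls:
    """In-memory URL-seen test storing fixed-size digests instead of full URLs."""

    def __init__(self) -> None:
        self._digests: set[bytes] = set()

    def add_if_new(self, url: str) -> bool:
        """Return True if the URL was not seen before, and record it."""
        digest = hashlib.sha1(normalize_url(url).encode("utf-8")).digest()
        if digest in self._digests:
            return False
        self._digests.add(digest)
        return True


seen = SeenUrls()
first = seen.add_if_new("http://example.com/a")
duplicate = seen.add_if_new("HTTP://EXAMPLE.com:80/a#top")
other = seen.add_if_new("http://example.com/b")
```

Storing digests keeps the per-URL memory cost constant, but a single in-memory set would not scale to billions of URLs; a crawler at that scale would need to partition or persist this structure, which this sketch does not attempt.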
Project IDs
Project ID: PB10207-1906
External Project ID: NSC102-2221-E182-060
Status | Finished |
---|---|
Effective start/end date | 01/08/13 → 31/07/14 |
Keywords
- Web Crawler
- SOA