Template-based information mining from HTML documents

Jane Yung-jen Hsu*, Wen-tau Yih

*Corresponding author for this work

Research output: Contribution to conference, Conference Paper, peer-review

22 Scopus citations

Abstract

Tools for mining information from data can create added value for the Internet. As the majority of electronic documents available over the network are in unstructured textual form, extracting useful information from a document usually involves information retrieval techniques or manual processing. This paper presents a novel approach to mining information from HTML documents using tree-structured templates. In addition to syntactic and semantic descriptions, each template is designed to capture the logical structure of a class of documents. Experiments have been conducted to extract FAQ information automatically from over one hundred HTML documents collected from the Web. Using two basic templates, the prototype FAQ Miner has accurately analyzed 65% of the collection of FAQ documents. With additional processing to handle 'near-pass' cases, the success rate is approximately 75%. The preliminary results have demonstrated the utility of structural templates for mining information from semi-structured text-based documents.

Original language: English
Pages: 256-262
Number of pages: 7
State: Published - 1997
Externally published: Yes
Event: Proceedings of the 1997 14th National Conference on Artificial Intelligence, AAAI 97 - Providence, RI, USA
Duration: 27 Jul 1997 to 31 Jul 1997

Conference

Conference: Proceedings of the 1997 14th National Conference on Artificial Intelligence, AAAI 97
City: Providence, RI, USA
Period: 27/07/97 to 31/07/97
