Chinese Unknown Words Extraction for Incomplete Sentences

Yi-Hui Chen, Eric Jui-Lin Lu, Jeng-Jie Huang

Research output: Contribution to journal › Journal Article › peer-review

Abstract

Queried keywords are often used to represent the topics of articles. Word segmentation and unknown word extraction are generally employed to obtain accurate queried keywords. However, existing Chinese unknown word extraction methods are mainly designed to process complete sentences, whereas queried keywords are mostly incomplete sentences. In this paper, we propose a Chinese unknown word extraction model for incomplete sentences and use Blog Connect as the experimental platform to collect the queried keywords. The proposed approach has two phases: unknown word detection and unknown word extraction. In the detection phase, we design rules based on the frequency and the probability of queried keywords to detect unknown word candidates. In the extraction phase, we propose a variant of a bottom-up merging algorithm that uses pattern and statistical conditions to extract unknown words. The experimental results show that our method can identify about 70% of unknown words and outperforms CKIP in resolving unknown Chinese words for incomplete sentences.

© Institute of Mathematical Statistics, 2022
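
The abstract does not give the detection rules or the merging conditions, so the following is only a minimal sketch of the general idea of bottom-up merging over segmented queries, not the authors' algorithm. It assumes each query is already split into tokens, and it merges adjacent token pairs when they pass two purely statistical checks: a pair frequency threshold and a conditional probability threshold. The function names (`merge_pass`, `bottom_up_merge`), the thresholds (`min_count`, `min_prob`), and the choice of statistic are all hypothetical; the pattern conditions mentioned in the abstract are not modeled.

```python
from collections import Counter


def merge_pass(queries, min_count, min_prob):
    """Run one merging round; return (new_queries, whether a qualifying pair existed)."""
    # Count single tokens and adjacent token pairs over all queries.
    unigrams = Counter(tok for q in queries for tok in q)
    bigrams = Counter(pair for q in queries for pair in zip(q, q[1:]))

    # Assumed statistical conditions: the pair is frequent enough, and the
    # right token follows the left token with high enough probability.
    good = {
        pair
        for pair, count in bigrams.items()
        if count >= min_count and count / unigrams[pair[0]] >= min_prob
    }
    if not good:
        return queries, False

    merged = []
    for q in queries:
        out, i = [], 0
        while i < len(q):
            # Greedy left-to-right merge of a qualifying adjacent pair.
            if i + 1 < len(q) and (q[i], q[i + 1]) in good:
                out.append(q[i] + q[i + 1])
                i += 2
            else:
                out.append(q[i])
                i += 1
        merged.append(out)
    return merged, True


def bottom_up_merge(queries, min_count=2, min_prob=0.5, max_rounds=10):
    """Repeat merging rounds until no pair qualifies or max_rounds is reached."""
    for _ in range(max_rounds):
        queries, changed = merge_pass(queries, min_count, min_prob)
        if not changed:
            break
    return queries


# Toy queries, over-segmented into single characters (illustrative data only).
queries = [
    ["部", "落", "格"],
    ["部", "落", "格", "推", "薦"],
    ["好", "部", "落", "格"],
]
result = bottom_up_merge(queries)
```

On this toy input, the characters of 部落格 co-occur in every query and are merged into one token over two rounds, while the pairs seen only once are left unmerged. Real thresholds would need to be tuned on actual query logs.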
Original language: American English
Pages (from-to): 755-764
Journal: International Journal of Network Security
Volume: 24
Issue number: 4
State: Published - 2022

Keywords

  • Blog connect
  • Blogs
  • Chinese unknown words
  • Computational linguistics
  • Detection phase
  • Experimental platform
  • Extraction method
  • Extraction modeling
  • Queried keyword
  • Statistics
  • Two phase
  • Unknown word extraction
  • Word segmentation
