Abstract
Queried keywords are often used to represent the topics of articles. Word segmentation and unknown word extraction are generally employed to obtain accurate queried keywords. However, existing Chinese unknown word extraction methods are mainly designed to process complete sentences, while queried keywords are mostly incomplete. In this paper, we propose a Chinese unknown word extraction model for incomplete sentences and use Blog Connect as the experimental platform to collect the queried keywords. A two-phase approach is proposed to solve the unknown word extraction problem: unknown word detection followed by unknown word extraction. In the detection phase, we design rules based on the frequency and the probability of queried keywords to detect unknown word candidates. In the extraction phase, we propose a variant of a bottom-up merging algorithm that uses pattern and statistical conditions to extract unknown words. The experimental results show that our method can identify about 70% of unknown words and outperforms the CKIP in resolving unknown Chinese words for incomplete sentences.

© Institute of Mathematical Statistics, 2022
Original language | American English |
---|---|
Pages (from-to) | 755-764 |
Journal | International Journal of Network Security |
Volume | 24 |
Issue number | 4 |
DOIs | |
State | Published - 2022 |
Keywords
- Blog Connect
- Blogs
- Chinese unknown words
- Computational linguistics
- Detection phase
- Experimental platform
- Extraction method
- Extraction modeling
- Queried keyword
- Statistics
- Two phase
- Unknown word extraction
- Word segmentation