Abstract
Recent Chinese word segmentation (CWS) models have shown competitive performance by leveraging knowledge from pre-trained language models. However, these models tend to learn segmentation from in-vocabulary words rather than by understanding the meaning of the entire context. To address this issue, we introduce a context-aware approach that incorporates unsupervised sentence representation learning over different dropout masks into the multi-criteria training framework. We demonstrate that our approach reaches state-of-the-art (SoTA) F1 scores on six of the nine CWS benchmark datasets and SoTA out-of-vocabulary (OOV) recall on eight of the nine. Further experiments show that substantial improvements can be achieved with various sentence representation objectives.
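The "sentence representation learning over different dropout masks" mentioned in the abstract refers to a contrastive objective in the style of SimCSE: the same sentence is encoded twice with independently sampled dropout masks, and the two views are pulled together with an InfoNCE loss against in-batch negatives. The sketch below illustrates that objective only; the toy linear encoder, matrix names, and temperature value are illustrative assumptions, not the paper's actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W, rng, p=0.1):
    """One forward pass: a toy linear 'encoder' followed by inverted dropout.
    Calling this twice gives two views of x under different dropout masks."""
    h = x @ W
    mask = rng.random(h.shape) > p
    return h * mask / (1 - p)

def dropout_contrastive_loss(x, W, rng, temperature=0.05):
    """InfoNCE over two dropout-masked encodings of the same batch.
    Row i's positive is its own second view (the diagonal of sim);
    all other rows in the batch act as negatives."""
    z1 = encode(x, W, rng)
    z2 = encode(x, W, rng)
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    sim = (z1 @ z2.T) / temperature            # (B, B) cosine similarities
    logits = sim - sim.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))        # cross-entropy on the diagonal

x = rng.standard_normal((8, 16))               # a toy batch of 8 "sentences"
W = rng.standard_normal((16, 16)) / 4
loss = dropout_contrastive_loss(x, W, rng)
```

In the paper this objective is trained jointly with the multi-criteria segmentation loss; here only the representation term is shown.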
| Original language | English |
|---|---|
| Title of host publication | Findings of the Association for Computational Linguistics |
| Subtitle of host publication | EMNLP 2023 |
| Publisher | Association for Computational Linguistics (ACL) |
| Pages | 12756-12763 |
| Number of pages | 8 |
| ISBN (Electronic) | 9798891760615 |
| DOIs | |
| State | Published - 2023 |
| Externally published | Yes |
| Event | 2023 Findings of the Association for Computational Linguistics: EMNLP 2023 - Hybrid, Singapore |
| Duration | 6 December 2023 → 10 December 2023 |
Publication series
| Name | Findings of the Association for Computational Linguistics: EMNLP 2023 |
|---|---|
Conference
| Conference | 2023 Findings of the Association for Computational Linguistics: EMNLP 2023 |
|---|---|
| Country/Territory | Singapore |
| City | Hybrid |
| Period | 06/12/23 → 10/12/23 |
Bibliographical note
Publisher Copyright: © 2023 Association for Computational Linguistics.