Abstract
In this article, we present our work to create the first public pre-training dataset based on Hong Kong content, along with the experiments that resulted in ELECTRA-based models for the languages commonly used in Hong Kong. Creating this pre-training dataset is a prerequisite for studying the effect of diglossia on Hong Kong language models, and this is the first study of that effect to start from the dataset-creation phase. Our experiments show that removing diglossia from the pre-training data hurts model performance. We will release our data and models to encourage future studies of Hong Kong languages.
| Original language | English |
|---|---|
| Article number | 71 |
| Journal | ACM Transactions on Asian and Low-Resource Language Information Processing |
| Volume | 24 |
| Issue number | 7 |
| DOIs | |
| State | Published - 24 Jul 2025 |
Bibliographical note
Publisher Copyright: © 2025 Association for Computing Machinery. All rights reserved.