Exploring the Effectiveness of Pre-training Language Models with Incorporation of Diglossia for Hong Kong Content

Research output: Contribution to journal › Journal Article › peer-review

Abstract

In this article, we present our work on creating the first public pre-training dataset built from Hong Kong content, together with the experiments that produced ELECTRA-based models for the languages commonly used in Hong Kong. Building this pre-training dataset was necessary for studying the effect of diglossia on Hong Kong language models, and this is the first study of that effect to begin at the dataset-creation phase. Our experiments show that removing diglossia from the pre-training data hurts model performance. We will release our data and models to encourage future studies of Hong Kong languages.
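
This page does not link the released checkpoints, so the sketch below is only an illustration of the replaced-token-detection interface that ELECTRA-based discriminators such as those described here expose. It assumes the Hugging Face transformers library and uses the publicly available google/electra-small-discriminator checkpoint as a stand-in; it is not the authors' code or model.

```python
# Minimal sketch of querying an ELECTRA discriminator for replaced-token
# detection. The checkpoint name is a placeholder; the paper's Hong Kong
# models are not listed on this page.
import torch
from transformers import ElectraForPreTraining, ElectraTokenizerFast

model_name = "google/electra-small-discriminator"  # stand-in checkpoint
tokenizer = ElectraTokenizerFast.from_pretrained(model_name)
model = ElectraForPreTraining.from_pretrained(model_name)

sentence = "The quick brown fox jumps over the lazy dog"
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    # One logit per token: positive means the discriminator judges the
    # token to have been replaced by the generator during pre-training.
    logits = model(**inputs).logits

predictions = (logits > 0).long().squeeze().tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
for token, pred in zip(tokens, predictions):
    print(f"{token}\t{'replaced' if pred else 'original'}")
```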

Original language: English
Article number: 71
Journal: ACM Transactions on Asian and Low-Resource Language Information Processing
Volume: 24
Issue number: 7
State: Published - 24 Jul 2025

Bibliographical note

Publisher Copyright:
© 2025 Association for Computing Machinery. All rights reserved.
