
LLM-Based Enhanced Clustering for Low-Resource Language: An Empirical Study

  • Talha Farooq Khan
  • Majid Hussain
  • Muhammad Arslan
  • Muhammad Saeed
  • Lal Khan*
  • Hsien Tsung Chang*
  • *Corresponding author for this work
  • The University of Faisalabad
  • The University of Southern Punjab
  • Gachon University

Research output: Journal contribution, article, peer-reviewed

1 Citation (Scopus)

Abstract

Text clustering is an important task because of its central role in many NLP applications. However, existing clustering research focuses mainly on English, with limited work on low-resource languages such as Urdu. Clustering text in low-resource languages is hampered by limited annotated collections and strong linguistic diversity. The contribution of this paper is twofold: (1) we introduce UNC-2025, a clustering dataset comprising 100k Urdu news documents, and (2) we present a detailed empirical benchmark of Large Language Model (LLM)-enhanced clustering methods for Urdu text. We evaluate 11 multilingual and Urdu-specific embeddings with 3 clustering algorithms, measuring performance with a set of internal and external validity measures. The best configuration, mBERT embeddings with the HDBSCAN algorithm, attains a new state-of-the-art external validity score of 0.95, establishing a strong baseline for Urdu text clustering. The results indicate that LLM-generated embeddings are robust and scalable and can capture the fine-grained semantics needed to discover topics in low-resource settings, opening the door to new NLP applications in underrepresented languages.
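As a rough illustration of what an external validity score measures, the sketch below compares predicted cluster assignments against gold topic labels. The abstract does not say which external measures the paper uses, so purity and normalized mutual information (NMI) are assumptions here. The function names `purity` and `nmi` and the toy labels are hypothetical and not taken from the paper.

```python
import math
from collections import Counter


def purity(gold, pred):
    """Fraction of documents that share their cluster's majority gold label."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    by_cluster = {}
    for g, p in zip(gold, pred):
        by_cluster.setdefault(p, Counter())[g] += 1
    return sum(max(c.values()) for c in by_cluster.values()) / len(gold)


def nmi(gold, pred):
    """Mutual information normalized by the mean of the two label entropies."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    n = len(gold)
    gold_counts = Counter(gold)
    pred_counts = Counter(pred)
    joint = Counter(zip(gold, pred))

    # I(G; P) = sum over observed pairs of p(g, p) * log(p(g, p) / (p(g) * p(p)))
    mi = sum(
        (c / n) * math.log(c * n / (gold_counts[g] * pred_counts[p]))
        for (g, p), c in joint.items()
    )

    def entropy(counts):
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    denom = (entropy(gold_counts) + entropy(pred_counts)) / 2
    # Both labelings are constant: treat them as trivially identical.
    return 1.0 if denom == 0 else mi / denom


# Toy example (hypothetical labels): six documents, two gold topics.
gold_topics = ["sports", "sports", "sports", "politics", "politics", "politics"]
cluster_ids = [0, 0, 1, 1, 1, 1]  # e.g. output of any clustering algorithm
print(f"purity = {purity(gold_topics, cluster_ids):.3f}")
print(f"NMI    = {nmi(gold_topics, cluster_ids):.3f}")
```

Both measures lie in [0, 1], with 1 meaning the clusters reproduce the gold topics exactly. Cluster identifiers are arbitrary, so relabeling the clusters does not change either score.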

Original language: English
Pages (from-to): 3883-3911
Number of pages: 29
Journal: CMES - Computer Modeling in Engineering and Sciences
Volume: 145
Issue number: 3
DOIs
Publication status: Published - 2025

Bibliographical note

Publisher Copyright:
Copyright © 2025 The Authors. Published by Tech Science Press.
