Taiwanese corpus collection via continuous speech recognition tool

Yuang Chin Chiang, Zhi Siang Yang, Ren Yuan Lyu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Scopus citation

Abstract

Corpora, in their different forms and for different purposes, are the basis of modern natural language processing technology. Taiwanese (MinNan), like other members of the Sino-Tibetan family, has been marginalized for many reasons. One consequence of this marginalization is that no standard written script exists, so collecting corpora for these languages has been extremely difficult. Even after (almost) arbitrarily selecting the hanlor written script (a mixture of hanzi and roman characters), we still face the problem that only a few people are capable of phonetically transcribing a given Taiwanese text. On the other hand, reading a Taiwanese text is easier because many commonly used hanzi exist. By recording a person reading Taiwanese text, we use a continuous speech recognizer for Taiwanese to automatically transcribe the text, and end up with two kinds of corpora: one in text, one in speech. The accuracy of the automatic phonetic transcription is about 96.05%, measured by syllable count. For marginalized languages, this automatic transcription can be very useful for corpus collection if a proper error-spotting scheme is implemented.
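The abstract reports accuracy by syllable count but does not say how the figure was computed. As an illustration only, the following minimal sketch assumes a common convention: align the recognized syllable sequence against a reference transcription with edit distance, then report one minus the error rate. The function name `syllable_accuracy` and the romanized syllables in the usage note are hypothetical and are not taken from the paper.

```python
def syllable_accuracy(reference: list[str], hypothesis: list[str]) -> float:
    """Return 1 - (edit distance / number of reference syllables).

    Assumption: accuracy is scored by aligning syllable sequences with
    Levenshtein distance (substitutions, insertions, deletions all cost 1).
    The paper may have used a different scoring rule.
    """
    n, m = len(reference), len(hypothesis)
    if n == 0:
        raise ValueError("reference must contain at least one syllable")

    # prev[j] is the edit distance between the reference prefix processed
    # so far and hypothesis[:j]; start with the empty reference prefix.
    prev = list(range(m + 1))
    for i in range(1, n + 1):
        curr = [i] + [0] * m
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference syllable
                curr[j - 1] + 1,     # insertion of a spurious syllable
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return 1.0 - prev[m] / n
```

For example, calling `syllable_accuracy("goa2 si7 tai5 oan5 lang5".split(), "goa2 si7 tai5 oan7 lang5".split())` scores one substituted syllable out of five reference syllables. Positions where the alignment disagrees are also natural candidates for the kind of error spotting the abstract mentions.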

Original language: English
Title of host publication: 6th International Conference on Spoken Language Processing, ICSLP 2000
Publisher: International Speech Communication Association
ISBN (Electronic): 7801501144, 9787801501141
State: Published - 2000
Event: 6th International Conference on Spoken Language Processing, ICSLP 2000 - Beijing, China
Duration: 16 Oct 2000 to 20 Oct 2000

Publication series

Name: 6th International Conference on Spoken Language Processing, ICSLP 2000

Conference

Conference: 6th International Conference on Spoken Language Processing, ICSLP 2000
Country/Territory: China
City: Beijing
Period: 16/10/00 to 20/10/00
