An efficient algorithm to select phonetically balanced scripts for constructing a speech corpus

Min Siong Liang, Ren Yuan Lyu, Yuang Chin Chiang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

8 Scopus citations

Abstract

In this paper, we describe an efficient algorithm for selecting phonetically balanced scripts for collecting a large-scale multilingual speech corpus. The corpus is intended to cover the three most frequently used languages in Taiwan: Taiwanese (Min-nan), Hakka, and Mandarin Chinese. To achieve this objective, the first step is to construct a multilingual phonetic alphabet, namely the Formosa Phonetic Alphabet (ForPA). In addition, the multilingual lexicons (Formosa Lexicons) are an important part of building the corpus. To date, the portion of the corpus containing speech from 600 speakers of Taiwanese (Min-nan) and Mandarin Chinese has been completed and is ready for release. This release contains about 40 hours of speech in 247 thousand utterances.

Original language: English
Title of host publication: NLP-KE 2003 - 2003 International Conference on Natural Language Processing and Knowledge Engineering, Proceedings
Editors: Chengqing Zong
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 433-437
Number of pages: 5
ISBN (Electronic): 0780379020, 9780780379022
State: Published - 2003
Event: International Conference on Natural Language Processing and Knowledge Engineering, NLP-KE 2003 - Beijing, China
Duration: 26 Oct 2003 to 29 Oct 2003

Publication series

Name: NLP-KE 2003 - 2003 International Conference on Natural Language Processing and Knowledge Engineering, Proceedings

Conference

Conference: International Conference on Natural Language Processing and Knowledge Engineering, NLP-KE 2003
Country/Territory: China
City: Beijing
Period: 26/10/03 to 29/10/03

Bibliographical note

Publisher Copyright:
© 2003 IEEE.

Keywords

  • Phonetic alphabet
  • Phonetically-balanced word
  • Speech corpus
  • Pronunciation lexicon
