Abstract
In this paper, we describe an efficient algorithm for selecting phonetically balanced scripts for collecting a large-scale multilingual speech corpus. The corpus is intended to cover the three most frequently used languages in Taiwan: Taiwanese (Min-nan), Hakka, and Mandarin Chinese. To achieve this objective, the first step is to construct a multilingual phonetic alphabet, namely the Formosa Phonetic Alphabet (ForPA). The multilingual lexicons (Formosa Lexicons) are also important components for building the corpus. To date, the part of the corpus containing speech from 600 speakers of Taiwanese (Min-nan) and Mandarin Chinese has been completed and is ready for release. This release contains about 40 hours of speech in 247 thousand utterances.
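The abstract does not spell out the selection algorithm, so the following is only a minimal sketch of one common way to pick phonetically balanced scripts: a greedy pass that repeatedly takes the candidate sentence adding the most under-represented phonetic units. The function name `select_balanced`, the `1 / (1 + count)` gain weighting, and the toy unit labels are assumptions made for illustration and are not taken from the paper or from ForPA.

```python
from collections import Counter


def select_balanced(candidates, budget):
    """Greedily pick up to `budget` sentences that favour rare phonetic units.

    `candidates` maps a sentence id to the list of phonetic units it contains.
    This is an illustrative heuristic, not the algorithm from the paper.
    """
    covered = Counter()          # how often each unit appears in chosen sentences
    remaining = dict(candidates)
    chosen = []

    def gain(units):
        # A unit seen less often so far contributes more to the score.
        return sum(1.0 / (1 + covered[u]) for u in set(units))

    while remaining and len(chosen) < budget:
        # Iterate in sorted order so that ties are broken deterministically.
        best = max(sorted(remaining), key=lambda s: gain(remaining[s]))
        chosen.append(best)
        covered.update(set(remaining.pop(best)))
    return chosen, covered


if __name__ == "__main__":
    # Hypothetical candidate sentences with made-up unit labels.
    toy = {
        "s1": ["a", "b"],
        "s2": ["a", "b", "c"],
        "s3": ["d", "e"],
        "s4": ["a"],
    }
    picked, coverage = select_balanced(toy, budget=2)
    print(picked, dict(coverage))
```

In a real collection effort the units would come from pronunciation lexicons such as those the abstract mentions, and the gain function could be swapped for one that targets a desired unit distribution rather than simple coverage.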
Original language | English |
---|---|
Title of host publication | NLP-KE 2003 - 2003 International Conference on Natural Language Processing and Knowledge Engineering, Proceedings |
Editors | Chengqing Zong |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
Pages | 433-437 |
Number of pages | 5 |
ISBN (Electronic) | 0780379020, 9780780379022 |
State | Published - 2003 |
Event | International Conference on Natural Language Processing and Knowledge Engineering, NLP-KE 2003 - Beijing, China. Duration: 26 Oct 2003 → 29 Oct 2003 |
Publication series
Name | NLP-KE 2003 - 2003 International Conference on Natural Language Processing and Knowledge Engineering, Proceedings |
---|---|
Conference
Conference | International Conference on Natural Language Processing and Knowledge Engineering, NLP-KE 2003 |
---|---|
Country/Territory | China |
City | Beijing |
Period | 26/10/03 → 29/10/03 |
Bibliographical note
Publisher Copyright: © 2003 IEEE.
Keywords
- Phonetic alphabet
- Phonetically-balanced word
- Speech corpus
- Pronunciation lexicon