A new speech corpus of super-elderly Japanese for acoustic modeling

Fukuda, Meiko; Nishimura, Ryota; Nishizaki, Hiromitsu; Horii, Koharu; Iribe, Yurie; Yamamoto, Kazumasa; Kitaoka, Norihide

doi:10.1016/j.csl.2022.101424

Please use the following URL to cite this item: https://repo.lib.tokushima-u.ac.jp/117813

ID	117813
Authors	Fukuda, Meiko (Tokushima University); Nishimura, Ryota (Tokushima University); Nishizaki, Hiromitsu (University of Yamanashi); Horii, Koharu (Toyohashi University of Technology); Iribe, Yurie (Aichi Prefectural University); Yamamoto, Kazumasa (Chubu University); Kitaoka, Norihide (Toyohashi University of Technology)
Keywords	Speech corpus; Elderly; Speech recognition; Acoustic feature
Resource type	Journal article
Abstract	The development of accessible speech recognition technology will allow the elderly to more easily access electronically stored information. However, the necessary level of recognition accuracy for elderly speech has not yet been achieved using conventional speech recognition systems, due to the unique features of the speech of elderly people. To address this problem, we have created a new speech corpus named EARS (Elderly Adults Read Speech), consisting of the recorded read speech of 123 super-elderly Japanese people (average age: 83.1), as a resource for training automated speech recognition models for the elderly. In this study, we investigated the acoustic features of super-elderly Japanese speech using our new speech corpus. In comparison to the speech of less elderly Japanese speakers, we observed a slower speech rate and extended vowel duration for both genders, a slight increase in fundamental frequency for males, and a slight decrease in fundamental frequency for females. To demonstrate the efficacy of our corpus, we also conducted speech recognition experiments using two different acoustic models (DNN-HMM and Transformer-based), trained with a combination of data from our corpus and speech data from three conventional Japanese speech corpora. When using the DNN-HMM trained with EARS and speech data from existing corpora, the character error rate (CER) was reduced by 7.8% (to just over 9%), compared to a CER of 16.9% when using only the baseline training corpora. We also investigated the effect of training the models with various amounts of EARS data, using a simple data expansion method. The acoustic models were also trained for various numbers of epochs without any modifications. When using the Transformer-based end-to-end speech recognizer, the CER was reduced by 3.0% (to 11.4%) by using a doubled EARS corpus with the baseline data for training, compared to a CER of 13.4% when only data from the baseline training corpora were used.
Journal	Computer Speech & Language
ISSN	0885-2308; 1095-8363
CAT bibliographic ID	AA10677208; AA11545097
Publisher	Elsevier
Volume	77
Start page	101424
Publication date	2022-06-24
Rights	This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
EDB ID	386212
Publisher version DOI	10.1016/j.csl.2022.101424
Publisher version URL	https://doi.org/10.1016/j.csl.2022.101424
Full-text file	csl_77_101424.pdf 804 KB
Language	eng
Author version flag	Publisher version
Department	Science and Technology
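
Note on the metric: the abstract reports results as character error rate (CER). As a minimal sketch of how CER is conventionally defined (character-level edit distance divided by reference length), the following Python may help readers interpret the figures. It is not the authors' evaluation code; all function names here are hypothetical, and reading "reduced by 7.8%" as an absolute difference is an assumption based on the stated baseline of 16.9% and result of "just over 9%".

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference characters to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def absolute_reduction(baseline: float, improved: float) -> float:
    """Difference between two error rates, in percentage points."""
    return baseline - improved


def relative_reduction(baseline: float, improved: float) -> float:
    """Difference between two error rates as a fraction of the baseline."""
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # Two substituted characters out of five reference characters.
    print(cer("こんにちは", "こんばんは"))
    # Assumed reading of the DNN-HMM figures: 16.9% baseline, 7.8 points lower.
    baseline, improved = 16.9, 16.9 - 7.8
    print(absolute_reduction(baseline, improved))
    print(relative_reduction(baseline, improved))
```

Under that assumed reading, the DNN-HMM improvement is 7.8 percentage points in absolute terms, which corresponds to a relative reduction of roughly 46% of the baseline CER.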