ID | 118515 |
Author |
Liu, Zheng
Tokushima University
Ren, Fuji
University of Electronic Science and Technology of China
|
Keywords | Speech emotion recognition
affective computing
speech representation learning
feature fusion transformer
|
Content Type |
Journal Article
|
Description | Speech emotion recognition has long attracted considerable attention from researchers. In traditional feature fusion methods, the speech features come only from the dataset, and their weak robustness can easily lead to model overfitting. In addition, these methods often fuse features by simple concatenation, which can lose speech information. In this article, to address these problems and improve recognition accuracy, we utilize self-supervised learning to enhance the robustness of speech features and propose a feature fusion model (Dual-TBNet) that consists of two 1D convolutional layers, two Transformer modules, and two bidirectional long short-term memory (BiLSTM) modules. Our model uses 1D convolution to take features of different segment lengths and dimension sizes as input, an attention mechanism to capture the correspondence between the two features, and the bidirectional time-series modules to enhance the contextual information of the fused features. We designed a total of four fusion models to fuse five pre-trained features with acoustic features. In the comparison experiments, the Dual-TBNet model achieved a recognition accuracy and F1 score of 95.7% and 95.8% on the CASIA dataset, 66.7% and 65.6% on the eNTERFACE05 dataset, 64.8% and 64.9% on the IEMOCAP dataset, 84.1% and 84.3% on the EMO-DB dataset, and 83.3% and 82.1% on the SAVEE dataset. The Dual-TBNet model effectively fuses acoustic features of different lengths and dimensions with pre-trained features, enhancing their robustness, and achieved the best performance.
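The fusion idea in the abstract — using attention to align acoustic features and pre-trained features that differ in sequence length and dimensionality — can be sketched as a single cross-attention step. This is a minimal NumPy illustration, not the authors' Dual-TBNet implementation: the projection matrices, feature shapes, and dimension `d_model` are all hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_feats, kv_feats, d_model):
    """Fuse two feature sequences of different lengths/dims.

    q_feats:  (T1, D1) query sequence (e.g. acoustic features)
    kv_feats: (T2, D2) key/value sequence (e.g. pre-trained features)
    Returns a (T1, d_model) fused sequence aligned to the query frames.
    Projection weights here are random stand-ins for learned parameters.
    """
    Wq = rng.standard_normal((q_feats.shape[-1], d_model)) / np.sqrt(q_feats.shape[-1])
    Wk = rng.standard_normal((kv_feats.shape[-1], d_model)) / np.sqrt(kv_feats.shape[-1])
    Wv = rng.standard_normal((kv_feats.shape[-1], d_model)) / np.sqrt(kv_feats.shape[-1])
    Q, K, V = q_feats @ Wq, kv_feats @ Wk, kv_feats @ Wv
    # scaled dot-product attention: each acoustic frame attends over
    # all pre-trained frames, so mismatched lengths need no alignment
    scores = softmax(Q @ K.T / np.sqrt(d_model), axis=-1)
    return scores @ V

# hypothetical shapes: 120 frames of 39-dim acoustic features,
# 60 frames of 768-dim self-supervised features
acoustic = rng.standard_normal((120, 39))
pretrained = rng.standard_normal((60, 768))

fused = cross_attention(acoustic, pretrained, d_model=128)
print(fused.shape)  # (120, 128)
```

Because attention re-weights the second sequence per query frame, this fusion avoids the information loss of naive concatenation, which would require truncating or resampling one sequence to match the other.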
|
Journal Title |
IEEE/ACM Transactions on Audio, Speech, and Language Processing
|
ISSN | 2329-9290
2329-9304
|
NCID | AA12669539
|
Publisher | IEEE
|
Volume | 31
|
Start Page | 2193
|
End Page | 2203
|
Published Date | 2023-06-01
|
Remark | The full text of the article is scheduled to be made publicly available on or after 2025-06-01.
|
Rights | © 2023 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
|
EDB ID | |
DOI (Published Version) | |
URL ( Publisher's Version ) | |
language |
eng
|
TextVersion |
Other
|
departments |
Science and Technology
|