ID | 118515 |
Author |
Liu, Zheng
Tokushima University
Ren, Fuji
University of Electronic Science and Technology of China
|
Keywords | Speech emotion recognition
affective computing
speech representation learning
feature fusion transformer
|
Content Type |
Journal Article
|
Description | Speech emotion recognition has long attracted considerable attention from researchers. In traditional feature fusion methods, the speech features come only from the dataset, and their weak robustness can easily lead to model overfitting. In addition, these methods often fuse features by simple concatenation, which can lose speech information. In this article, to address these problems and improve recognition accuracy, we utilize self-supervised learning to enhance the robustness of speech features and propose a feature fusion model (Dual-TBNet) that consists of two 1D convolutional layers, two Transformer modules, and two bidirectional long short-term memory (BiLSTM) modules. Our model uses 1D convolution to take features of different segment lengths and dimension sizes as input, an attention mechanism to capture the correspondence between the two features, and the bidirectional time-series modules to enhance the contextual information of the fused features. We designed a total of four fusion models to fuse five pre-trained features with acoustic features. In the comparison experiments, the Dual-TBNet model achieved a recognition accuracy and F1 score of 95.7% and 95.8% on the CASIA dataset, 66.7% and 65.6% on the eNTERFACE05 dataset, 64.8% and 64.9% on the IEMOCAP dataset, 84.1% and 84.3% on the EMO-DB dataset, and 83.3% and 82.1% on the SAVEE dataset. The Dual-TBNet model effectively fuses acoustic features of different lengths and dimensions with pre-trained features, enhancing their robustness, and achieved the best performance.
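The fusion idea in the abstract — using attention to align acoustic features and pre-trained features that differ in sequence length and dimensionality — can be sketched as a single cross-attention step. This is a minimal NumPy illustration, not the authors' Dual-TBNet implementation: the projection matrices, feature shapes, and dimension `d_model` are all hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_feats, kv_feats, d_model):
    """Fuse two feature sequences of different lengths/dims.

    q_feats:  (T1, D1) query sequence (e.g. acoustic features)
    kv_feats: (T2, D2) key/value sequence (e.g. pre-trained features)
    Returns a (T1, d_model) fused sequence aligned to the query frames.
    Projection weights here are random stand-ins for learned parameters.
    """
    Wq = rng.standard_normal((q_feats.shape[-1], d_model)) / np.sqrt(q_feats.shape[-1])
    Wk = rng.standard_normal((kv_feats.shape[-1], d_model)) / np.sqrt(kv_feats.shape[-1])
    Wv = rng.standard_normal((kv_feats.shape[-1], d_model)) / np.sqrt(kv_feats.shape[-1])
    Q, K, V = q_feats @ Wq, kv_feats @ Wk, kv_feats @ Wv
    # scaled dot-product attention: each acoustic frame attends over
    # all pre-trained frames, so mismatched lengths need no alignment
    scores = softmax(Q @ K.T / np.sqrt(d_model), axis=-1)
    return scores @ V

# hypothetical shapes: 120 frames of 39-dim acoustic features,
# 60 frames of 768-dim self-supervised features
acoustic = rng.standard_normal((120, 39))
pretrained = rng.standard_normal((60, 768))

fused = cross_attention(acoustic, pretrained, d_model=128)
print(fused.shape)  # (120, 128)
```

Because attention re-weights the second sequence per query frame, this fusion avoids the information loss of naive concatenation, which would require truncating or resampling one sequence to match the other.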
|
Journal Title |
IEEE/ACM Transactions on Audio, Speech, and Language Processing
|
ISSN | 2329-9290
2329-9304
|
NCID | AA12669539
|
Publisher | IEEE
|
Volume | 31
|
Start Page | 2193
|
End Page | 2203
|
Published Date | 2023-06-01
|
Remark | The full text of the article is scheduled to be made publicly available on or after 2025-06-01.
|
Rights | © 2023 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
|
EDB ID | |
DOI (Published Version) | |
URL ( Publisher's Version ) | |
language |
eng
|
TextVersion |
Other
|
departments |
Science and Technology
|