ID | 119348 |
Authors | Wang, Linhuang (Tokushima University)
Ding, Fei (Tokushima University)
Nakagawa, Satoshi (The University of Tokyo)
|
Keywords | Dynamic facial expression recognition
Affective computing
Transformer
Convolutional neural network
|
Material type | Journal article |
Abstract | Unlike conventional video action recognition, Dynamic Facial Expression Recognition (DFER) tasks exhibit minimal spatial movement of objects. To address this distinctive attribute, we propose an innovative CNN-Transformer model, named LSGTNet, specifically tailored for DFER tasks. Our LSGTNet comprises three stages, each composed of a spatial CNN (Spa-CNN) and a temporal transformer (T-Former) in sequential order. The Spa-CNN extracts spatial features from images, yielding smaller feature maps that reduce the computational complexity of the subsequent T-Former. The T-Former integrates global temporal information from the same spatial positions across different time frames while retaining the feature map dimensions. The alternating interplay between Spa-CNN and T-Former ensures a continuous fusion of spatial and temporal information, enabling our model to excel across various real-world datasets. To the best of our knowledge, this is the first method to address the DFER challenge by focusing on capturing the temporal changes in muscles within local spatial regions. Our method has achieved state-of-the-art results on multiple in-the-wild datasets and on datasets collected under laboratory conditions.
|
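Since the full text is under embargo, only the abstract describes the architecture. The following is a minimal PyTorch sketch of one stage as the abstract describes it: a per-frame spatial convolution (Spa-CNN) that shrinks the feature maps, followed by a temporal transformer (T-Former) that attends across frames at each spatial position. All layer sizes, layer counts, and class names (SpaCNN, TFormer, LSGTStage) are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of an LSGTNet-style stage, inferred from the abstract only.
import torch
import torch.nn as nn


class SpaCNN(nn.Module):
    """Per-frame spatial convolution; stride 2 halves the feature map size."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):                      # x: (B, T, C, H, W)
        b, t, c, h, w = x.shape
        x = self.conv(x.reshape(b * t, c, h, w))
        return x.reshape(b, t, *x.shape[1:])   # (B, T, C', H/2, W/2)


class TFormer(nn.Module):
    """Self-attention over the T frames at each spatial position.

    Temporal positional encoding is omitted for brevity.
    """
    def __init__(self, dim, heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, x):                      # x: (B, T, C, H, W)
        b, t, c, h, w = x.shape
        # Fold spatial positions into the batch so attention runs over time only.
        x = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, c)
        x = self.encoder(x)
        return x.reshape(b, h, w, t, c).permute(0, 3, 4, 1, 2)


class LSGTStage(nn.Module):
    """One Spa-CNN -> T-Former stage; LSGTNet stacks three such stages."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.spa = SpaCNN(in_ch, out_ch)
        self.tf = TFormer(out_ch)

    def forward(self, x):
        return self.tf(self.spa(x))


if __name__ == "__main__":
    clip = torch.randn(2, 16, 3, 112, 112)     # (batch, frames, C, H, W)
    out = LSGTStage(3, 64)(clip)
    print(out.shape)                           # torch.Size([2, 16, 64, 56, 56])
```

Folding the spatial positions into the batch dimension keeps the attention purely temporal, which matches the abstract's claim that the T-Former integrates information from the same spatial positions across different time frames while preserving the feature map dimensions.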
Journal | Applied Soft Computing |
ISSN | 1568-4946 (print)
1872-9681 (online)
|
NCID (NACSIS-CAT bibliographic ID) | AA11644645
AA11926126
|
Publisher | Elsevier |
Volume | 161 |
Start page | 111680 |
Publication date | 2024-05-09 |
Note | The full text of the article is scheduled to be available on or after 2026-05-09. |
Rights | © 2024. This manuscript version is made available under the CC-BY-NC-ND 4.0 license: https://creativecommons.org/licenses/by-nc-nd/4.0/ |
EDB ID | |
Publisher's version DOI | |
Publisher's version URL | |
Language | eng |
Author version flag | Other |
Division | Science and Technology |