ID 119348
Author
Wang, Linhuang Tokushima University
Ding, Fei Tokushima University
Nakagawa, Satoshi The University of Tokyo
Keywords
Dynamic facial expression recognition
Affective computing
Transformer
Convolutional neural network
Content Type
Journal Article
Description
Unlike conventional video action recognition, Dynamic Facial Expression Recognition (DFER) tasks exhibit minimal spatial movement of objects. To address this distinctive attribute, we propose an innovative CNN-Transformer model, named LSGTNet, tailored specifically for DFER tasks. LSGTNet comprises three stages, each consisting of a spatial CNN (Spa-CNN) followed by a temporal transformer (T-Former). The Spa-CNN extracts spatial features from images, yielding smaller feature maps that reduce the computational cost of the subsequent T-Former. The T-Former integrates global temporal information from the same spatial positions across different time frames while preserving the feature-map dimensions. The alternating interplay between Spa-CNN and T-Former ensures a continuous fusion of spatial and temporal information, enabling our model to excel across various real-world datasets. To the best of our knowledge, this is the first method to address the DFER challenge by focusing on capturing the temporal changes of muscles within local spatial regions. Our method achieves state-of-the-art results on multiple in-the-wild datasets as well as datasets collected under laboratory conditions.
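The abstract describes an alternating spatial-convolution / temporal-attention design: each stage shrinks the feature maps with a spatial CNN, then applies self-attention across frames at each spatial position. The PyTorch sketch below illustrates that structure only; the class names follow the abstract's terminology, but every layer width, kernel size, head count, stage depth, and the pooling/classifier head are illustrative assumptions, not the published LSGTNet configuration.

```python
# Minimal sketch of the alternating Spa-CNN / T-Former structure described
# in the abstract. All hyperparameters here are assumptions for illustration.
import torch
import torch.nn as nn


class SpaCNN(nn.Module):
    """Spatial CNN: extracts per-frame features and halves the map size."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):  # x: (B*T, C, H, W)
        return self.conv(x)


class TFormer(nn.Module):
    """Temporal transformer: self-attention across frames at each spatial
    position; the feature-map dimensions are preserved."""
    def __init__(self, dim, heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, x, T):  # x: (B*T, C, H, W)
        BT, C, H, W = x.shape
        B = BT // T
        # Each token sequence is one spatial position traced over time:
        # (B, T, C, H, W) -> (B*H*W, T, C)
        seq = (x.view(B, T, C, H, W)
                .permute(0, 3, 4, 1, 2)
                .reshape(B * H * W, T, C))
        seq = self.encoder(seq)
        return (seq.view(B, H, W, T, C)
                   .permute(0, 3, 4, 1, 2)
                   .reshape(BT, C, H, W))


class LSGTNetSketch(nn.Module):
    """Three stages of Spa-CNN followed by T-Former, then pooling.
    Channel widths and the classifier head are hypothetical."""
    def __init__(self, num_classes=7, chans=(3, 64, 128, 256)):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.ModuleDict({
                "spa": SpaCNN(chans[i], chans[i + 1]),
                "tfo": TFormer(chans[i + 1]),
            })
            for i in range(3)
        ])
        self.head = nn.Linear(chans[-1], num_classes)

    def forward(self, clip):  # clip: (B, T, 3, H, W)
        B, T, C, H, W = clip.shape
        x = clip.reshape(B * T, C, H, W)
        for stage in self.stages:
            x = stage["spa"](x)      # spatial features, smaller maps
            x = stage["tfo"](x, T)   # fuse temporal context per position
        x = x.mean(dim=(2, 3)).view(B, T, -1).mean(dim=1)  # pool space+time
        return self.head(x)


if __name__ == "__main__":
    model = LSGTNetSketch()
    logits = model(torch.randn(2, 8, 3, 112, 112))  # 2 clips of 8 frames
    print(logits.shape)  # torch.Size([2, 7])
```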
Journal Title
Applied Soft Computing
ISSN
1568-4946
1872-9681
NCID
AA11644645
AA11926126
Publisher
Elsevier
Volume
161
Start Page
111680
Published Date
2024-05-09
Remark
The full text of the article is scheduled to be made publicly available on or after 2026-05-09.
Rights
© 2024. This manuscript version is made available under the CC-BY-NC-ND 4.0 license.
https://creativecommons.org/licenses/by-nc-nd/4.0/
EDB ID
DOI (Published Version)
URL (Publisher's Version)
Language
eng
TextVersion
Other
Departments
Science and Technology