Dynamic Facial Models for Video-Based Dimensional Affect Estimation

Dimensional affect estimation from a face video is a challenging task, mainly due to the large number of possible facial displays made up of a set of behaviour primitives including facial muscle actions. The displays vary not only in composition but also in temporal evolution, with each display composed of behaviour primitives that vary in their short- and long-term characteristics. Most existing work on modelling affect relies on complex hierarchical recurrent models that are unable to capture short-term dynamics well. In this paper, we propose to encode these short-term facial shape and appearance dynamics in an image, where only the semantically meaningful information is encoded into the dynamic face images. We also propose binary dynamic facial masks to remove 'stable pixels' from the dynamic images. This process filters out non-dynamic information, i.e. only pixels that have changed in the sequence are retained. The final proposed Dynamic Facial Model (DFM) then encodes both the filtered facial appearance and shape dynamics of the image sequence preceding a given frame into a three-channel raster image. A CNN-RNN architecture is tasked with modelling primarily the long-term changes. Experiments show that our dynamic face images achieve superior performance over standard RGB face images on the dimensional affect prediction task.


Introduction
The face is an important asset for automatic human behaviour understanding, as it displays a wide range of cues about our cognitive state, including our affective state. Analysing human emotions from the face would find application in many cross-disciplinary fields, such as medicine [44], security, or entertainment [26]. Automatic emotion recognition by and large follows one of two main emotion theories: Ekman's six basic emotion model [6] or the dimensional affect model (a.k.a. Russell's Circumplex model [32]). The Circumplex model describes emotional attributes such as arousal and valence on a continuous scale, where arousal is a physiological state of being alert, awake and attentive, and valence represents how negative or positive someone feels [8]. Although other dimensions are sometimes added, the combination of these two values represents a wide span of emotional states.
People express their affect through auditory, visual, and physiological signals. The face is a highly valuable visual signal: it can be sensed unobtrusively, and many individuals can be analysed at the same time in a shared scenario without the need for source separation on a 1-dimensional signal. Motivated by this, existing affect analysis approaches build on analyzing the visual information provided by facial expressions. While some studies [17,7,33] analyze affect on a frame-by-frame basis, without exploiting the relationships between frames, the progression of affect has distinct temporal patterns that span multiple frames, and the values of arousal and valence are therefore highly correlated over time. Thus, temporal models should be used.
The standard approach for modelling dynamics is through sequential latent models, such as Recurrent Neural Networks (RNN). These models exploit the temporal information by applying a set of latent variables that are supposed to model the intrinsic correlation that exists between the input and the output at a given frame, conditioned on the latent states at previous frames. However, they are generally used to learn dynamics from extracted features, without considering the context of the face. Other works proposed to encode dynamics at the input level, by extracting features from an image sequence [25], constructing spectrum maps [40,14] or encoding an image sequence [15]. However, these methods also have practical drawbacks, as learning with the latter approach can become quite complex for long-term sequences. For instance, [15] proposed a temporal CNN approach that needs as many input channels as the number of frames being considered. This results in growing models and limited capacity: in [15] the number of input frames (and channels) is set to 5, which limits the temporal modelling of longer-term expressions at the input side.
To learn dynamics in the context of the face and avoid the limited capacity for encoding long sequences, this paper applies the dynamic image algorithm [1] to encode the short-term facial dynamics at the image level. The resulting images are then forwarded to a CNN-RNN-based model that re-encodes both long-term and short-term variations at the feature level. Importantly, this keeps the framework simple. The dynamic image consists of a 3-channel raster image (similar to an RGB image) displaying a "summary" of an image sequence. The idea is very similar to temporal templates as introduced by Bobick & Davis in 2001 [2], but has proved more effective in action recognition [1]. The use of a summarising image allows CNN-based architectures designed to take still images as input to process a video of variable length. While the dynamic image algorithm has been successfully applied to human action recognition, its extension to model the dynamics of facial actions is not straightforward. Bilen et al. made use of whole images to generate dynamic appearance [1], without segmentation of specific, semantically meaningful regions of objects (the human body, or the face), and without shape information, both of which are highly valuable for face analysis. In order to consider such information, this paper extends the dynamic image algorithm to account for the shape domain by using facial landmarks to produce a dynamic facial appearance (DFA) image and a dynamic facial shape (DFS) image. After that, a Laplacian pyramid-based multi-scale transform is applied to fuse facial appearance and shape in order to retain maximum correlation between them.
The Dynamic Facial Images (DFIs) (examples are shown in Fig. 1) are generated per video frame, summarising the content of a few frames prior to the current one, and are computationally efficient (please see [1] for details on efficiency). Importantly, they are still images, and thus can be processed by standard CNN architectures whilst retaining the short-term temporal information. To learn long-term dynamics, the CNN is followed by an RNN model, which is trained separately and tasked with returning valence and arousal values at each time step. In summary, the main contributions of this paper are as follows:
1. We extend the dynamic image algorithm to the face domain, dismissing non-face related attributes and encoding face dynamics in the context of the face.
2. We propose a Dynamic Facial Model (DFM) encoding algorithm that integrates facial appearance and shape into a standard RGB image, summarising the variation of both over time.
3. We compared the dynamic face images to standard RGB face images on two datasets, where the proposed approach achieved superior results for both arousal and valence estimation tasks.

Related Works
Dimensional affect estimation is often regarded as a regression problem where both valence and arousal are continuous values lying in the range [−1, 1]. Growing interest in the problem is reflected in a series of AVEC Challenges [46,45,30,28,29], which aim to gather all efforts in a common benchmark of increasing difficulty. As in many other Computer Vision disciplines, existing approaches are generally divided into those that use hand-crafted features with general-purpose machine learning techniques, and those that build on the recent advances in Deep Learning. Since the data are time series, temporal modeling is crucial for dimensional affect analysis. Traditional hand-crafted approaches [20,16,11,22,13,23] have frequently used kernel-based regression, such as SVR, which by nature cannot model contextual information. To overcome such limitations, some hand-crafted features have been extended to the temporal domain. In [25], global and local features are extended to the temporal domain through the magnitude of the Fourier transform of each of them. In order to capture both long and short-term dynamics, the authors applied the Fourier transform at different scales, i.e. to sequences one to four seconds long. In addition, features that extend the spatial domain to the temporal dimension, referred to as Three Orthogonal Planes (TOP) features, were widely used by AVEC baselines. In [16] the LBP features are extended to the temporal domain as LBP-TOP, and further combined with a novel sparse regression method, achieving excellent performance on the SEMAINE database [21]. In [17], histogram-based features, such as LPQ, LBP and LGBP, are extended to the temporal dimension, and further combined with deep features.
However, the TOP extension of features grows drastically in complexity as the number of frames increases, and thus learning temporal models is a better choice. While graphical models such as HMMs or CRFs are powerful temporal representations, they are prone to failure when modeling long-term dynamics. These drawbacks can be tackled with Recurrent Neural Networks, which are networks of latent states that can be learned through backpropagation. RNNs can be used with either hand-crafted features or in combination with CNNs. Some extensions handling back-propagation problems in RNNs have been proposed too. In [24] a Bidirectional Long Short Term Memory Network (BLSTM) is used with hand-crafted features, showing better results than SVR. Hasani et al. [12] extract features using the Inception module [42]; combining it with an LSTM yields better results than using the Inception module alone on a per-frame basis. A similar approach is adopted in [18], where a relatively shallow CNN is used in combination with an RNN. Kollias et al. [19] showed how pre-trained networks can be adapted to affect estimation tasks with great success, as training some networks end-to-end might not be affordable due to the lack of data or resources. In particular, when combined with an RNN, the VGG-Face network with only its fully-connected layers fine-tuned yielded the best results, showing the great potential of using existing CNNs to predict the intensity of continuous dimensional affects on data gathered "in-the-wild".
RNNs can also be combined with other non-temporal regression techniques. In [10], the output of an RNN is combined with an SVR, which prevents the former from overfitting and compensates for the latter's inability to consider the temporal domain. That approach, coined the Strength Modeling algorithm, applies the two models in a hierarchical manner.

The proposed approach
The main novelty of our method resides in the encoding of the short-term facial shape and appearance dynamics of image sequences into a single raster image. Our work differs from that of Nicolle et al. [25] in that we do not rely on the frequency domain, as it contains nuisance factors that are hard to capture with a CNN, and in that we incorporate the temporal modelling of an RNN. Similarly, it differs from [15] in that the dynamics are encoded into single images, allowing the use of a flexible number of frames, rather than as a concatenation of frames, which in practice limits the time extent of the short-term encoding. Finally, our work differs from that of [39] in that we encode precise dynamics from an image sequence rather than estimating them from a single image. We further compute the 'dynamic pixels' and remove the 'stable pixels' of the encoded DFA and DFS, allowing both shape dynamics and appearance dynamics to be summarized in a single image without redundant information ('stable pixels'), whereas the dynamic image of [1] only contains appearance dynamics and does not consider the effects of redundant information. Our approach starts by detecting a set of 66 facial landmarks for each video frame. These landmarks, depicted in Fig. 2, are extracted using the publicly available code of iCCR [34]. The landmarks, which correspond to specific parts of the face, are then used to generate a static facial appearance (SFA) image and a static facial shape (SFS) image per frame. Then, for each subsequence of T frames, the corresponding DFA and DFS are generated. These images are also generated per video frame, and are subsequently fused into a sequence of DFMs.

Static Facial Image
Static Facial Shape Image: Based on the detected facial landmarks, the face is segmented into 15 semantic regions. These regions correspond to the left and right eyebrows, left and right eyes, nose, left and right cheekbones, left and right cheeks, mouth, lips, left and right philtrums, and left and right jaws. In static shape images, each region is represented by a unique colour. All pixels lying outside the convex hull of the face are set to 0 in each colour channel (black). An example of SFS is shown in Fig. 2.
Static Facial Appearance Image: Using the aforementioned landmarks, a binary mask is constructed in which only the pixels lying within the convex hull defined by the landmarks are set to one. This mask is applied to the input image to generate the static appearance image, which accounts for the facial appearance. In this way, the background noise is removed before the feature extraction process.
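As a rough illustration of the static facial appearance step, the sketch below builds a convex-hull mask from landmark coordinates and applies it to an image. It is not the authors' code: the helper names `convex_hull` and `hull_mask`, the (x, y) landmark convention and the toy landmarks are assumptions made for illustration, and a real pipeline would use the 66 detected landmarks instead.

```python
import numpy as np

def convex_hull(points):
    """Andrew's monotone chain; returns the hull vertices in order."""
    pts = sorted(set(map(tuple, points)))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def hull_mask(landmarks, height, width):
    """Boolean (height, width) mask, True inside or on the landmark hull."""
    hull = convex_hull(landmarks)
    ys, xs = np.mgrid[0:height, 0:width]
    inside = np.ones((height, width), dtype=bool)
    # A pixel is inside a convex polygon if it lies on the same side of every edge.
    for (x0, y0), (x1, y1) in zip(hull, hull[1:] + hull[:1]):
        inside &= ((x1 - x0) * (ys - y0) - (y1 - y0) * (xs - x0)) >= 0
    return inside

# Toy example: five hypothetical (x, y) landmarks on a 6 x 6 image.
landmarks = [(1, 1), (4, 1), (4, 4), (1, 4), (2, 2)]
image = np.full((6, 6, 3), 7.0)
mask = hull_mask(landmarks, 6, 6)
sfa = image * mask[..., None]   # pixels outside the hull become 0 (black)
```

The same mask could be reused to blank the background of the static shape image, since both images set pixels outside the hull to zero.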

Dynamic Facial Image
The dynamic image is a parameter matrix whose parameters are learnt to rank the temporal position of the given frames from their features, by taking the dot product between the per-frame features and the dynamic image. That is to say, it is an operator that contains the evolution information of the frames and consequently can be treated as a representation of them. By extending this algorithm to include shape and adapting it to the face domain by leveraging facial landmarks, we obtain two novel dynamic facial images (DFI): dynamic facial appearance (DFA) and dynamic facial shape (DFS).
Let $I_t \in \mathbb{R}^{m \times n}$ be the t-th image of a sequence composed of T consecutive face-aligned images, all of size m × n, and let

$$V_t = \frac{1}{t} \sum_{\tau=1}^{t} I_\tau$$

be the average image up to frame t. In general, $V_t$ is defined as the average of a given feature mapping of the images, $\phi(I_\tau)$. The mapping chosen in this paper is the one that attained the highest performance in the original paper by Bilen et al. [1], which defined $\phi$ to be the identity function. Let $d$ be the raw DFI of the image sequence, which has the same dimensionality as $V_t$. The ranking score for frame t is defined as the dot product between $d$ and $V_t$:

$$S(d, V_t) = \langle d, V_t \rangle = \sum_{l,i,j} d_{lij}\, v^{t}_{lij}, \qquad (1)$$

where $d_{lij}$ and $v^{t}_{lij}$ are the values of pixel $lij$ in the dynamic face image $d$ and the static face image $V_t$, respectively. Thus, the goal is to learn the DFI so that if q > t, then $S(d, V_q) > S(d, V_t)$, because frames closer to the current one normally contribute more information about the current face status. In other words, $d$ is learned as a kernel of the input image size that returns a score which sorts frames by time. This kernel ranks the input SFIs, and hence contains the temporal evolution of the face image sequence ending at the last image, making it a good facial dynamic descriptor for the last image. In order to learn $d$, we follow [1] and minimise the hinge loss between pairs of scores:

$$d^{*} = \arg\min_{d} E(d), \qquad (2)$$

$$E(d) = \frac{\lambda}{2} \lVert d \rVert^{2} + \frac{2}{T(T-1)} \sum_{q > t} \max\{0,\; 1 - S(d, V_q) + S(d, V_t)\}, \qquad (3)$$

where $E(d)$ is the L2-norm regularised error. The second term in Eq. 3 is a hinge-loss soft count of the pairs that are incorrectly ranked by the score function. A pair q > t is said to be correctly ranked if $S(d, V_q) \geq S(d, V_t) + 1$. The minimization of Eq. 3 is accomplished with RankSVM [38]. The parameters of the final learned kernel $d$ are real-valued. It is worth highlighting that the RankSVM algorithm is also applied to learn the DFIs $d$ at test time, i.e. it is learned on the go for each subsequence of images.
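To make the ranking idea concrete, the following is a minimal sketch that learns a dynamic image by plain subgradient descent on an L2-regularised pairwise hinge loss with margin 1. This is a deliberate simplification: the paper uses RankSVM [38], not this solver, and the function names, the hyper-parameters (`lam`, `lr`, `n_iter`) and the synthetic frames are all assumptions for illustration.

```python
import numpy as np

def running_means(frames):
    """V_t = (1/t) * sum of the first t frames (phi is the identity)."""
    frames = np.asarray(frames, dtype=np.float64)
    counts = np.arange(1, len(frames) + 1).reshape((-1,) + (1,) * (frames.ndim - 1))
    return np.cumsum(frames, axis=0) / counts

def learn_dynamic_image(frames, lam=1e-3, lr=0.1, n_iter=200):
    """Subgradient descent on the regularised pairwise hinge loss (not RankSVM)."""
    V = running_means(frames)
    T = V.shape[0]
    V_flat = V.reshape(T, -1)
    d = np.zeros(V_flat.shape[1])
    t_idx, q_idx = np.triu_indices(T, k=1)      # all pairs with q > t
    norm = 2.0 / (T * (T - 1))
    for _ in range(n_iter):
        scores = V_flat @ d
        violated = (1.0 - scores[q_idx] + scores[t_idx]) > 0
        grad_hinge = (V_flat[t_idx[violated]] - V_flat[q_idx[violated]]).sum(axis=0)
        d -= lr * (lam * d + norm * grad_hinge)
    return d.reshape(V.shape[1:])

def ranking_scores(d, frames):
    """S(d, V_t) for every frame t, as a dot product."""
    V = running_means(frames)
    return V.reshape(V.shape[0], -1) @ d.ravel()

# Synthetic sequence of six 4 x 4 x 3 frames drifting along a fixed direction.
t = np.arange(6).reshape(-1, 1, 1, 1)
base = np.linspace(0.0, 1.0, 48).reshape(1, 4, 4, 3)
direction = np.sin(np.arange(48.0)).reshape(1, 4, 4, 3)
frames = base + 0.1 * t * direction
d = learn_dynamic_image(frames)
scores = ranking_scores(d, frames)   # later frames should receive higher scores
```

On this toy sequence the learned scores increase with the frame index, which is the property the ranking objective asks for. A dedicated RankSVM solver would be expected to reach a similar ordering more efficiently on real images.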
In order to generate a set of DFA and DFS for a video, we take the T − 1 consecutive frames prior to each frame, for which we first obtain the SFA and SFS, respectively. The DFA and DFS for each frame are then learned with a sliding window of T frames. Therefore, for a video of N frames we have N − T + 1 DFA and DFS images (from frame T to frame N).
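The window bookkeeping can be sketched in a few lines. The function name `window_bounds` and the optional `stride` argument are illustrative assumptions; with a stride of 1 the number of windows equals N − T + 1 as stated above.

```python
def window_bounds(num_frames, window, stride=1):
    """Return (start, end) index pairs, end exclusive, one per dynamic image.

    Each window covers `window` frames and ends at the frame it describes
    (zero-based index end - 1).
    """
    return [(end - window, end) for end in range(window, num_frames + 1, stride)]

# A video of N = 10 frames with T = 4 gives N - T + 1 = 7 windows.
bounds = window_bounds(10, 4)
```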

Fusion of dynamic appearance and dynamic shape
Both DFA and DFS are separately generated for each frame. A common approach would be to combine the two descriptors before they are input to a machine learning hypothesis (e.g. SVR, CNN). Instead, we propose to fuse them into a single dynamic image, unifying shape and appearance as a single input stream to the ML hypothesis while retaining the context of the face. To the best of our knowledge, although many approaches combine facial appearance and shape information at the feature or decision level for affect analysis, no previous work has proposed to fuse DFA and DFS into a three-channel image and then learn both features and their correlations at the input level, which is interesting to explore. From Equation 1, we can see that variation in the pixel values of a static face image results in differences in the final score across the image sequence, as the kernel matrix (dynamic image) is a constant matrix in each case. In particular, Equation 1 shows that pixels whose values remain fixed over the image sequence have no influence on the frame ranking, because the dot product between these pixels in each frame and the corresponding pixels in the dynamic image is always the same. Thus, they are not discriminative. In this paper, we call these pixels the "stable pixels", denoted as $(i_{sta}, j_{sta})$, while the remainder are called "dynamic pixels", denoted as $(i_{dy}, j_{dy})$, as they can contribute different scores to different frames. In this sense, Equation 1 can be re-written as:

$$S(d, V_t) = \langle d_{sta}, V_t^{sta} \rangle + \langle d_{dy}, V_t^{dy} \rangle, \qquad (4)$$

where $\langle d_{sta}, V_t^{sta} \rangle = \sum_{j \in sta} d_j \times v_j$ is constant over frames, and thus only $\langle d_{dy}, V_t^{dy} \rangle$ produces differences between scores. Therefore, even if we set all 'stable pixels' to 0, making the dynamic image a sparse matrix, it can still rank the frames correctly.
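The claim that stable pixels do not affect the ranking can be checked numerically. The sketch below uses random data and a random stand-in for a learned dynamic image (both assumptions, not data from the paper): zeroing the dynamic image at stable pixels shifts every frame score by the same constant, so score differences between frames are unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
T, H, W = 5, 4, 4

# Start with a sequence in which every pixel is stable, then let the top two rows vary.
frames = np.tile(rng.random((1, H, W, 3)), (T, 1, 1, 1))
frames[:, :2] = rng.random((T, 2, W, 3))

V = np.cumsum(frames, axis=0) / np.arange(1, T + 1).reshape(-1, 1, 1, 1)
d = rng.standard_normal((H, W, 3))            # stand-in for a learned dynamic image

# A pixel is stable if all of its channels are identical in every frame.
stable = np.all(frames == frames[0], axis=(0, 3))
d_sparse = np.where(stable[..., None], 0.0, d)

scores_full = (V * d).sum(axis=(1, 2, 3))
scores_sparse = (V * d_sparse).sum(axis=(1, 2, 3))
offset = scores_full - scores_sparse          # the constant stable-pixel term
```

Since `offset` is the same for every frame up to rounding, any pair of frames is ordered identically by the full and the sparse dynamic image.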
Since the DFS mainly contains the edge dynamics of each semantic region while the DFA contributes more detail about the facial texture dynamics in each region, their 'dynamic pixels' are expected to be largely independent in the spatial domain, so that fusing them neither loses significant information nor highly distorts the dynamics of the original DFA and DFS. Motivated by this, assuming that the DFA and DFS are generated from a sequence Seq of T face images, the framework applies the following steps to fuse the DFA and DFS images. This process is also illustrated in Fig. 3.
1. For T consecutive SFS and SFA images, we first find the 'stable pixels' whose R, G, B values remain unchanged over the T frames. Specifically, for each of the two sequences, we calculate the absolute difference between the given frame and each of the other frames, resulting in T − 1 maps. Then, a map representing the sum of these maps is obtained. Pixels whose (R, G, B) values in this map all equal 0 are defined as 'stable pixels', while the remainder are denoted as 'dynamic pixels'.
2. Construct a binary dynamic shape mask and a binary dynamic appearance mask, in which 'dynamic pixels' are set to 1 and 'stable pixels' are set to 0. Since some dynamic pixels in the two masks may overlap, we further set the overlapping pixels in the dynamic appearance mask to 0 to avoid distortion.
3. Generate a new DFS and a new DFA by taking the element-wise product of the binary dynamic shape mask and the binary dynamic appearance mask with the previously obtained DFS and DFA, respectively. Consequently, all redundant information and background noise are removed from the new DFS and DFA, as the corresponding pixel values equal 0. The new DFS contains all temporal shape information and the new DFA contains most of the appearance dynamics.
4. Yield the final fused dynamic facial image by simply adding the new DFS to the new DFA. The result contains all temporal shape information and most of the appearance dynamics without any distortion. In this paper, we call this fused dynamic facial image the Dynamic Facial Model (DFM).
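The four fusion steps can be sketched as follows. This is an illustrative reading of the text rather than the authors' implementation: the function names are hypothetical, the last frame of each window is assumed to be "the given frame", and masking is done as an element-wise product.

```python
import numpy as np

def dynamic_mask(static_frames):
    """Step 1-2: True where any channel of a pixel changes within the window."""
    frames = np.asarray(static_frames, dtype=np.float64)   # (T, H, W, 3)
    reference = frames[-1]                                  # assumed "given frame"
    diff_sum = np.abs(frames - reference).sum(axis=0)       # summed difference maps
    return diff_sum.sum(axis=-1) > 0                        # (H, W) boolean mask

def fuse_dynamic_images(dfa, dfs, sfa_seq, sfs_seq):
    """Steps 2-4: mask DFS and DFA, give overlapping pixels to shape, then add."""
    shape_mask = dynamic_mask(sfs_seq)
    appearance_mask = dynamic_mask(sfa_seq) & ~shape_mask
    new_dfs = dfs * shape_mask[..., None]
    new_dfa = dfa * appearance_mask[..., None]
    return new_dfs + new_dfa

# Toy 2 x 2 example with T = 3 frames.
sfs_seq = np.zeros((3, 2, 2, 3))
sfa_seq = np.zeros((3, 2, 2, 3))
sfs_seq[0, 0, 0] = 1.0        # pixel (0, 0) changes in the shape images
sfa_seq[1, 0, 0] = 1.0        # and also in the appearance images (overlap)
sfa_seq[0, 0, 1, 2] = 0.5     # pixel (0, 1) changes in appearance only
dfs = np.full((2, 2, 3), 5.0)  # stand-ins for learned dynamic images
dfa = np.full((2, 2, 3), 2.0)
dfm = fuse_dynamic_images(dfa, dfs, sfa_seq, sfs_seq)
```

In the toy example the overlapping pixel takes its value from the DFS, the appearance-only pixel from the DFA, and pixels that are stable in both sequences are zero.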

Deep Learning Dynamic Facial Features
As shown above, the DFM and DFIs are 3-channel raster images whose dimensions are the same as those of the input SFIs. Therefore, the information of a video can be learnt by existing CNN models for still images with fine-tuning. The features extracted from the CNN representation are subsequently forwarded to a Recurrent Neural Network (RNN), which deals with dynamics at the feature level.
In this paper, we chose the VGG-16 network [37] pre-trained on the VGG face dataset. We applied two simple structures to illustrate the benefit of each proposed DFI, which are shown in Fig. 4. The proposed approach, described above, is depicted at the top of the figure. In particular, we compare this approach against the use of two branches for the CNN-RNN structure, in which the shape/static face image and appearance/dynamic face image are not fused at the lowermost level, but are instead forwarded to two CNN networks, the outputs of which are fused by the RNN network.
The output of the CNN is taken from the first fully-connected layer of the corresponding VGG, which is a 4096-D vector. These features encode the short-term appearance and shape dynamics, constrained to the length of the time-window. In order to learn the long-term dynamics, we use an RNN on top of the CNN features. For this purpose, we adopt the Bidirectional Gated Recurrent Unit (BGRU) [4] as our RNN model. BGRU is a simplified version of Bidirectional Long Short-Term Memory networks (BLSTMs), with a less complex structure. It has two multiplicative gates, i.e. a reset gate and an update gate, to capture both long- and short-term dependencies in sequences: units that capture short-term dynamics will frequently have their reset gates active, while units that capture long-term dependencies will mostly have their update gates active. As a result, the use of BGRU allows our framework to learn both long- and short-term temporal dynamics at the feature level. Thus, it compensates for the drawback that the DFIs in this paper only encode short-term dynamics.
Figure 4: CNN-RNN framework. (a) The framework for inputting a single modality. (b) The framework for inputting two modalities. Top corresponds to the pipeline described throughout the paper, whereas bottom corresponds to the approach where shape/static and appearance/dynamic are fused after the CNN processing.
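For readers unfamiliar with gated recurrent units, the sketch below runs a bidirectional GRU over a sequence of per-frame feature vectors in plain NumPy. It only illustrates the gating mechanism: the weights are random and untrained, the dimensions are toy values (the paper uses 4096-D CNN features and 200 hidden units), the linear readout to two outputs is an assumption, and the update rule follows one common GRU convention.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_gru(input_dim, hidden_dim, rng):
    """Random (W, U, b) triples for the update (z), reset (r) and candidate (h) parts."""
    def mat(rows, cols):
        return rng.standard_normal((rows, cols)) * 0.1
    return {name: (mat(hidden_dim, input_dim), mat(hidden_dim, hidden_dim),
                   np.zeros(hidden_dim)) for name in ("z", "r", "h")}

def gru_step(params, x, h_prev):
    Wz, Uz, bz = params["z"]
    Wr, Ur, br = params["r"]
    Wh, Uh, bh = params["h"]
    z = sigmoid(Wz @ x + Uz @ h_prev + bz)               # update gate
    r = sigmoid(Wr @ x + Ur @ h_prev + br)               # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h_prev) + bh)   # candidate state
    return (1.0 - z) * h_prev + z * h_tilde

def run_gru(params, xs):
    h = np.zeros(params["z"][1].shape[0])
    states = []
    for x in xs:
        h = gru_step(params, x, h)
        states.append(h)
    return np.stack(states)

def run_bgru(fwd, bwd, xs):
    """Concatenate a forward pass and a time-reversed backward pass."""
    forward = run_gru(fwd, xs)
    backward = run_gru(bwd, xs[::-1])[::-1]
    return np.concatenate([forward, backward], axis=1)

rng = np.random.default_rng(0)
features = rng.standard_normal((7, 16))          # 7 time steps of toy CNN features
fwd, bwd = init_gru(16, 8, rng), init_gru(16, 8, rng)
states = run_bgru(fwd, bwd, features)            # shape (7, 16)
readout = rng.standard_normal((2, states.shape[1])) * 0.1
predictions = states @ readout.T                 # shape (7, 2): one pair of values per step
```

In practice a deep learning framework would provide the recurrent layer and train it with backpropagation; the explicit loop here is only meant to show where the reset and update gates enter the computation.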

Database
To validate the proposed approach, we carried out arousal and valence intensity estimation experiments on the SEMAINE [21] and RECOLA [31] datasets. The SEMAINE dataset contains recordings of uncontrolled facial expressions of participants having a conversation with an operator, and it is annotated with valence and arousal dimensions in a continuous space between −1 and 1. In this paper, we used the subset used in AVEC 2012 [36], which contains 31 videos for training, 32 videos for development and 32 videos for test. The RECOLA dataset was recorded from 27 French-speaking participants to study socio-affective behaviours from video, audio, electro-cardiogram (ECG) and electro-dermal activity (EDA) in the context of computer supported collaborative work. Each video is around 300 seconds long and labels are given at a rate of 25 Hz.

Evaluation measures
Three standard measures were used to assess the performance of the affect estimation: firstly, the Mean Squared Error (MSE); secondly, the Pearson Correlation Coefficient (PCC); and thirdly, the Concordance Correlation Coefficient (CCC, Eq. 5):

$$\mathrm{CCC} = \frac{2 \rho_{x,y} \sigma_x \sigma_y}{\sigma_x^{2} + \sigma_y^{2} + (\mu_x - \mu_y)^{2}}, \qquad (5)$$

where $\rho_{x,y}$ is the PCC, $\mu_x$ and $\mu_y$ are the mean values of the predictions and labels, and $\sigma_x$ and $\sigma_y$ are their standard deviations.
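A compact sketch of the three measures is given below, using only the Python standard library. The function names are illustrative, and population (not sample) statistics are assumed for the means and variances.

```python
from math import sqrt
from statistics import fmean

def mse(pred, label):
    """Mean squared error between predictions and labels."""
    return fmean([(p - l) ** 2 for p, l in zip(pred, label)])

def pcc(pred, label):
    """Pearson correlation coefficient (population statistics)."""
    mx, my = fmean(pred), fmean(label)
    cov = fmean([(p - mx) * (l - my) for p, l in zip(pred, label)])
    var_x = fmean([(p - mx) ** 2 for p in pred])
    var_y = fmean([(l - my) ** 2 for l in label])
    return cov / sqrt(var_x * var_y)

def ccc(pred, label):
    """Concordance correlation coefficient: 2*rho*sx*sy / (sx^2 + sy^2 + (mx - my)^2)."""
    mx, my = fmean(pred), fmean(label)
    var_x = fmean([(p - mx) ** 2 for p in pred])
    var_y = fmean([(l - my) ** 2 for l in label])
    rho = pcc(pred, label)
    return 2 * rho * sqrt(var_x) * sqrt(var_y) / (var_x + var_y + (mx - my) ** 2)

# Predictions that are perfectly correlated with the labels but offset by 1.
pred, label = [1.0, 2.0, 3.0], [2.0, 3.0, 4.0]
```

For this example the PCC is 1 while the CCC is 4/7, which shows that the CCC penalises a constant bias that the PCC ignores.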

Implementation details
DFIs generation: To generate frame-wise dynamic facial images for SEMAINE and RECOLA datasets, the lengths of time-windows are 20, 15 and 6, respectively, with a stride of 2. Model training: In this paper, VGG-16 networks pre-trained on the VGG face database and a BGRU with one hidden layer of 200 neurons were utilized. MSE was chosen as the loss function and the standard SGD algorithm was applied as the training method, with a learning rate of $5 \times 10^{-3}$, a learning rate decay of $1 \times 10^{-4}$, and a momentum of 0.85. For SEMAINE, the development partition was used to adjust the model's hyper-parameters while the test partition was used for reporting the final results. For the RECOLA dataset, five-fold cross validation was conducted on the training partition and the reported results were obtained on the development partition.

Ablation studies
This section first conducts ablation studies in terms of two experimental variables: 1. Temporal status of the input: static face images (SFA, SFS, SFA+SFS) and dynamic face images (DFS, DFA, DFA+DFS); 2. Type of the input: appearance (SFA, DFA, SFA+DFA) and shape (SFS, DFS, SFS+DFS). All the experiments that have two inputs, e.g. SFA+SFS, DFA+SFA, DFA+DFS and SFS+DFS, were processed by the two-branch architecture (Fig. 4(b)).

Facial appearance VS Facial shape
We first compared the average performance of facial appearance images with that of facial shape images in Fig. 5. For both datasets, the predictions yielded by shape inputs are more correlated with the arousal and valence intensity labels than those of the appearance inputs: the mean CCC values of shape inputs for arousal and valence are 0.354 and 0.304 for SEMAINE and 0.419 and 0.435 for RECOLA, which outperform the corresponding arousal and valence results obtained with appearance inputs (0.302 and 0.283 for SEMAINE, 0.366 and 0.396 for RECOLA). Similarly, the predictions from facial shape features also achieved better MSE results than those from facial appearance features. When facial appearance and shape are combined, each clearly benefits from the other, as the results achieved by 'Shape + Appearance' outperformed those of shape or appearance alone for both tasks on both datasets.

Dynamic face VS Static face
As illustrated in Fig. 6, dynamic face images achieved higher average CCC and lower average MSE results than static face images. In particular, the mean CCC values obtained by dynamic inputs on the two datasets are 0.362 (arousal of SEMAINE), 0.302 (valence of SEMAINE), 0.426 (arousal of RECOLA) and 0.443 (valence of RECOLA). These results indicate that the temporal dynamics encoded in the proposed DFIs can provide powerful clues for affect intensity estimation. We also report the average results yielded by 'Static + Dynamic', which are similar to those of dynamic face images, with a slight improvement.
To further investigate the properties of DFIs, Fig. 7 compares some predictions of SFA, SFS, DFA and DFS on the SEMAINE dataset. Dynamic predictions fluctuate much more strongly than static predictions, and predictions from facial shape fluctuate more strongly than those from facial appearance, because the difference between adjacent DFIs is much larger than that between adjacent SFIs. Another observation is that when the ground truth suddenly drops or increases, e.g. around the 400th and 600th frames (namely, high-frequency dynamics), the amplitude of the drop or increase in the dynamic predictions is larger than in the static predictions. This suggests that DFIs are more sensitive to affect changes.

Dynamic Facial Model VS Dynamic Facial Images
We also compared the proposed DFM with the best single-input system, i.e. DFS, in Table 1 and Table 2. While DFS already yielded good performance, DFM achieved a clear improvement on both datasets. As both systems used the same structure, the only difference is that DFM combines shape and appearance dynamics while DFS only contains shape dynamics. Thus, it can be concluded that our fusion strategy can effectively encode facial shape dynamics and appearance dynamics for affect estimation. Another comparison is made between the results produced by DFA + DFS + BGRU and DFM + BGRU, as both of them take dynamic facial shape and appearance information as input. As reported, the results obtained by DFM + BGRU outperformed those obtained by DFA + DFS + BGRU, except for the arousal predictions on SEMAINE. Although DFA and DFS contain the original dynamic information rather than the reduced dynamic information in DFM, the number of trainable weights in the CNN-RNN architecture used for DFA + DFS is at least twice that used for DFM, resulting in a higher computational cost. On the other hand, DFM has several advantages: 1. it removes the redundant information ('stable pixels') from both DFS and DFA; 2. it fuses shape and appearance information in the context of the face rather than at the feature level; 3. it can be learned by a simple network, where fewer weights need to be optimized.

Our methods VS state-of-the-art
This section compares our methods with state-of-the-art visual methods on the SEMAINE and RECOLA datasets. As shown in Table 1, our best system (DFM+BGRU) outperforms all state-of-the-art methods for both arousal and valence estimation tasks on the SEMAINE dataset, especially for arousal estimation, where it achieves a 27.7% relative improvement over the second best system [16]. As shown in Table 2, the baselines already generated very promising predictions on the RECOLA dataset. However, features extracted from the proposed DFM still achieved excellent performance for both tasks. Of the seven recent works on the RECOLA dataset that we compared against, our DFM+BGRU system yielded better arousal and valence results than four. Compared with the remainder, the DFM+BGRU system obtained either better arousal predictions or better valence predictions. In addition, the model of [27] was pre-trained on the AFLW dataset, which may also be an important factor in its excellent performance.

Cross dataset evaluation of the DFA parameters
To assess the generalizability of the DFA parameters tuned for the SEMAINE and RECOLA datasets, we trained dimensional affect recognition models on the Aff-wild dataset [19]. Here our aim is only to demonstrate that the parameters generalize. For this reason, we include neither other state-of-the-art methods on the Aff-wild dataset nor models that were trained using either DFA or DFS alone. Firstly, the DFAs of the Aff-wild data were generated using the parameters that were tuned for the SEMAINE and RECOLA datasets. Then we trained two different models with randomly initialized weights, one with SFAs as inputs and the other with stacked SFAs and DFAs as inputs. As shown in Table 3, on the Aff-wild test set, the CNN model trained with the stacked SFA plus DFA inputs outperformed the model trained with only the SFA. This performance improvement indicates that the DFA parameters can generalize well and extract meaningful face representations that complement the static face inputs.

Conclusion
This paper proposed a dynamic facial encoding method that allows a single raster image to encode the facial appearance and shape dynamics of an image sequence. The features learned from these dynamic inputs using CNN-RNN models outperformed the static inputs on the dimensional affect estimation task. The experimental results suggest the following conclusions: 1. facial shape features generate better affect predictions than facial appearance features; 2. combined static and dynamic face inputs perform better than the static face inputs alone on dimensional affect estimation; 3. the proposed DFM can effectively encode facial shape and appearance dynamics, as in most cases it achieved better results than using a DFI or SFI as the input, or jointly using DFI and SFI as the input. Meanwhile, we believe the DFM has not yet fully shown its ability on the RECOLA database: the initial weights of the VGG face network were pre-trained on RGB images rather than dynamic images, and the VGG structure is not specifically designed for dynamic face images, so the trained models may lack the ability to capture the encoded dynamic information.