From Pixels to Response Maps: Discriminative Image Filtering for Face Alignment in the Wild

We propose a face alignment framework that relies on the texture model generated by the responses of discriminatively trained part-based filters. Unlike standard texture models built from pixel intensities or responses generated by generic filters (e.g. Gabor), our framework has two important advantages. First, by virtue of discriminative training, invariance to external variations (like identity, pose, illumination and expression) is achieved. Second, we show that the responses generated by discriminatively trained filters (or patch-experts) are sparse and can be modeled using a very small number of parameters. As a result, the optimization methods based on the proposed texture model can better cope with unseen variations. We illustrate this point by formulating both part-based and holistic approaches for generic face alignment and show that our framework outperforms the state-of-the-art on multiple "wild" databases. The code and dataset annotations are available for research purposes from http://ibug.doc.ic.ac.uk/resources.


INTRODUCTION
The problem of non-rigid face alignment under controlled laboratory settings has been studied for decades and has produced a number of solutions with varying degrees of success. Essentially, the problem is one of obtaining a facial landmark localization that describes the face in sufficient detail. These include methods such as the active shape model [7], the active appearance model (AAM) [11] and the constrained local model (CLM) [9], [24]. Alternatively, some methods [15] perform global face alignment using Markov Random Fields without explicitly relying on facial landmark localization. However, the performance of these methods [15] under uncontrolled natural settings has not been explored. In contrast, facial landmark localization based methods for uncontrolled natural settings (referred to as "in the wild") have started to receive some attention [4], [5], [8], [27].
Broadly speaking, there are two major lines of work on non-rigid face alignment, namely, the active appearance model [11] and the constrained local model [24]. AAMs are generative models of shape and texture learned by applying principal component analysis (PCA) to a training set of annotated face images. Baker et al. [2] proposed several generative optimization methods for fitting an AAM, some capable of real-time face tracking [19]. Recently, several discriminative optimization methods for AAMs have been proposed [17], [20], [21], [22], [23] that directly learn a fixed update model. However, the overall performance of these methods has been shown to deteriorate significantly in cross-database experiments [20], [23].
Compared to the AAM framework, the CLM framework is relatively more capable of handling unseen variations of pose, illumination and expression. In essence, the standard CLM framework follows a part-based approach in that the face is represented by a set of cropped image patches. A local detector (referred to as the 'patch-expert') is trained for each landmark, using an off-the-shelf linear SVM and a large number of positive and negative patches [25]. Now, given a new face image, these patch-experts are used to perform an exhaustive local search around the initial shape estimate. As a result, a response map for each landmark point is generated, which provides a likelihood of that landmark point being at a particular position in the given image. These response maps are then efficiently used to drive a simple Gauss-Newton method based optimization [24].
Although the use of response maps has undoubtedly given the CLM framework the ability to perform generic face alignment, we believe that their full representative power has not been exploited so far. In particular, the main motivation behind this work is the realization that these response maps tend to be sparse, by virtue of the discriminative training procedure of the patch-experts, and can themselves be represented by a small set of parameters. Hence, in this work, we propose to construct a novel texture model for robust face alignment based on these response maps. In prior work in computer vision, texture models are typically constructed by filtering the image using a set of pre-defined generic filters (e.g. difference of Gaussians, generative filters [10] or Gabor filters [18]). Instead, we propose to construct a texture model by filtering the image with a set of filters, each of which has been discriminatively trained to localize a particular landmark point. The output of this filtering process is a sparse response map which can then be used to construct a robust texture model (to the best of our knowledge, this is the first time that response maps generated via discriminatively learned filters are used to construct a texture model). In Fig. 1, we give an overview of the proposed texture model. Within this proposed discriminative image filtering framework: (i) we formulate a part-based approach and propose a discriminative face alignment technique which uses the response map based texture model; (ii) we formulate a holistic approach by combining the proposed discriminative image filtering with generative deformable face models (i.e. AAM), resulting in a hybrid discriminative/generative face alignment framework in which the actual texture model is discriminative and the alignment method is generative in nature; (iii) we show that the proposed framework convincingly outperforms state-of-the-art methods (the CLM based RLMS [24] and the tree-based method [27]); and (iv) we release our code (see the supplementary material, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPAMI.2014.2362142) and the pre-trained models for research purposes.

MOTIVATION FOR DISCRIMINATIVE IMAGE FILTERING
As stated earlier in Section 1, the main advantage of the standard CLM framework [24] over the AAMs [2] is the use of response maps to drive its optimization procedure, thereby decoupling the optimization procedure from the variations in facial texture induced by changes in identity, expression, pose and illumination. Therefore, unlike the AAM optimization methods, which suffer from a lack of generalizability mainly due to the texture model they use, the CLM optimization methods easily bypass these problems by working with the response maps instead of the actual facial appearance. However, one of the shortcomings of the standard CLM framework is that it does not fully exploit the true representative power of the response maps. In particular, in the CLM fitting objective function of [24], the optimization is performed over only the shape model parameters, and the response maps are used only indirectly in computing the weights for the non-parametric Gaussian mixture model that governs the possible landmark locations.
Therefore, in this paper, we make a case for a more direct use of the information provided by the response maps in the fitting procedure. This is motivated mainly by two realizations. First, each of the discriminatively trained filters (i.e. the patch experts) is tailored for a particular landmark point and provides a sparse filter response (or confidence) map. Second, since invariance to external factors (like identity, pose, illumination and expression, which make generic face fitting a very challenging task) is intrinsic to the response maps, a dictionary of response maps (controlled by a small set of parameters) can easily be created and used to reconstruct unseen response maps very accurately. As a result, a dictionary of response maps can very efficiently replace the raw pixel value based texture model. This results in a non-rigid face alignment framework capable of handling the challenging in-the-wild scenario.
In this section, we empirically test the generalization capability of the proposed response map based texture model and compare it to the standard facial appearance based (i.e. pixel value based) texture model. For this purpose, we train two separate texture models based on the pixel values and the response map values, respectively, using images from the Multi-PIE database [13] only. We then reconstruct instances from unseen test images belonging to the Multi-PIE [13] and LFPW [4] databases. This highlights some highly desirable properties of the response maps and their texture model, including: distinct signatures of some landmark points, sparsity, compactness and generalization capability.
See Appendix A in the supplementary material, available online, for details on the experimental setup. For training the pixel value based texture model, all the training images were similarity normalized [24], [25] and 31 × 31 patches were extracted around each landmark point. Let us assume we have a training set of image patches {A_i^j}_{j=1}^T for each landmark point i. A simple way to model the appearance of the patches for the i-th landmark is to vectorize the training set of patches, stack them in a matrix X_i = [vec(A_i^1), ..., vec(A_i^T)], and apply PCA to obtain a basis Z_i. Here, M_i is a matrix that contains the mean vector m_i in each column, and H_i = Z_i^T (X_i − M_i) are the parameters for the training set of patches (i.e., the projection onto the basis). Now, given a testing sample (i.e. a new unseen patch), it can be reconstructed from a small set of parameters h_test that are computed by a simple projection onto the PCA basis Z_i.
For training the response map based texture model, all the 31 × 31 training patches (extracted above around each landmark point) were convolved with the respective patch-experts (learned using the aforementioned Multi-PIE training set) to generate a training set of 31 × 31 response maps. Following a similar modeling procedure as above, each of the response maps was vectorized, the results were stacked in a matrix, and PCA was applied to compute the PCA basis. Now, given a testing sample (i.e. a new unseen response map), it can be reconstructed from a small set of parameters by a simple projection onto the response map PCA basis. An illustrative example of how effectively a response map can be reconstructed, as compared to the pixel value based image patch, by a very small number of PCA components (for example, the top five PCA components in this case) is shown in Fig. 2.
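The projection-and-reconstruction step described above can be sketched on synthetic data as follows. This is a minimal illustration, not the paper's code: the sizes and the low-rank toy data standing in for vectorized 31 × 31 response maps are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the paper's data: T vectorized 31x31 maps drawn from
# a low-rank (rank-3) subspace, plus one unseen test sample from the
# same subspace. All names and sizes here are illustrative.
T, d, k = 200, 31 * 31, 5
basis_true = rng.standard_normal((d, 3))
X = basis_true @ rng.standard_normal((3, T))  # d x T training matrix X_i

m = X.mean(axis=1, keepdims=True)             # mean vector m_i
U, s, Vt = np.linalg.svd(X - m, full_matrices=False)
Z = U[:, :k]                                  # top-k PCA basis Z_i

x_test = basis_true @ rng.standard_normal(3)  # unseen sample
h = Z.T @ (x_test - m.ravel())                # k parameters (simple projection)
x_rec = m.ravel() + Z @ h                     # reconstruction from k numbers

err = np.mean((x_test - x_rec) ** 2)          # mean-squared reconstruction error
```

Because the toy data truly lie in a low-dimensional subspace, a handful of components reconstructs the test sample almost exactly; the paper's point is that real response maps behave far more like this than raw pixel patches do.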
Fig. 3a shows the average reconstruction error for the patch around the left eye corner (i.e. landmark number 5 in Fig. 2) for the Multi-PIE and in-the-wild LFPW test sets, using up to the top 20 PCA components of both the pixel value based and the response map based texture models. The average reconstruction error is computed as the mean-squared error between the ground-truth and the reconstructed patch. Further, in Fig. 3b, we show the average reconstruction error for the patches around all 66 landmark points for both testing sets, using the top five PCA components of both texture models.
Fig. 1. Background and overview of the proposed response map based texture model; as shown in Section 2, these normalized response maps can be modeled and reconstructed accurately using a very small number of parameters.

Overall, these results clearly show the superiority of the response map based texture model over the traditional pixel value based texture model. The empirical evidence suggests that the response maps extracted for certain landmark points have a distinct signature (for example, boundary points have a distinct elongated response, while the eye points have a very compact circular response). Moreover, the sparsity of the response maps is a highly desirable quality, as it drastically reduces the candidate locations for each landmark point. Having said that, the two most important qualities of the response map based texture model are its generalization capability and compactness. Notice the quality of the reconstructed response maps for the Multi-PIE test set, but more importantly for the LFPW test set. The response map based texture model is able to generalize easily to the unseen response maps obtained from the LFPW test set, across all the landmark points. On the other hand, we see a sharp rise in the reconstruction error for the LFPW test set obtained by the pixel value based texture model. Also, the excellent level of generalization obtained by the response map based texture model comes hand-in-hand with its compactness. As shown in Fig. 3a, a stable reconstruction accuracy is obtained by using as few as the top five PCA components, making the response map based texture model highly suitable for fast and accurate face alignment optimization strategies. Therefore, in the following sections, we propose the part-based and the holistic approaches that use the novel response map based texture model efficiently for generic face alignment under uncontrolled natural settings.

Background
In the part-based model representation, the model setup is M = {S, D}, where S is the shape model and D is the set of patch-experts. The 3D shape model of CLMs generates the landmark locations as s(p) = s R (s_0 + F_s q) + t, parameterized by p = [s; R; t; q], where s, R (computed via pitch r_x, yaw r_y and roll r_z) and t = [t_x; t_y; 0] control the rigid scale, 3D rotation and translation respectively, while q controls the non-rigid variations of the shape, s_0 is the mean 3D shape and F_s is the shape basis. D is a set of patch-experts for the detection of n parts and is represented as D = {w_i, b_i}_{i=1}^n, where (w_i, b_i) is the linear filter for the i-th landmark point of the face (e.g., an eye-corner detector).
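Evaluating a shape model of this form can be sketched as follows. This is an illustrative toy, not the paper's implementation: the Euler-angle convention, the dimensions, and the random stand-ins for s_0 and F_s are all assumptions.

```python
import numpy as np

def rotation(rx, ry, rz):
    """3D rotation from pitch/yaw/roll (one illustrative convention)."""
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def shape(scale, R, t, q, s0, Fs):
    """s(p) = scale * R * (s0 + Fs q) + t, for n landmarks stored as 3 x n."""
    nonrigid = s0 + (Fs @ q).reshape(s0.shape)  # add non-rigid deformation
    return scale * (R @ nonrigid) + t[:, None]  # rigid scale/rotate/translate

n, m = 66, 10
rng = np.random.default_rng(1)
s0 = rng.standard_normal((3, n))      # stand-in mean 3D shape
Fs = rng.standard_normal((3 * n, m))  # stand-in non-rigid basis
q = np.zeros(m)                       # no non-rigid deformation
pts = shape(1.0, rotation(0, 0, 0), np.array([5.0, 2.0, 0.0]), q, s0, Fs)
```

With zero rotation, unit scale and zero q, the output is simply the mean shape translated by t, which makes the role of each parameter group easy to check.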
The probability of alignment of a particular landmark point at a specific location x_i in the given image I can be modeled using a simple logistic function [3], [24], [25]:

p(l_i = 1 | x_i, I) = 1 / (1 + exp(d C_i(x_i; I) + c)),

where c is the logistic function intercept and d is the regression coefficient. The classifier C_i(x; I) distinguishes between alignment/misalignment for a landmark location x_i. We use a linear SVM for training the patch experts:

C_i(x_i; I) = w_i^T P(F(x_i; I)) + b_i,

where w_i stands for the gain and b_i indicates the bias, F(x_i; I) is the vectorized feature vector extracted from the image patch centered at x_i, and the function P performs normalization so that the result has zero mean and unit variance.
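The sliding-window response map computation described above can be sketched as follows. This is a toy illustration: a random filter stands in for a trained SVM, the normalization P is simplified to zero-mean unit-norm, and the logistic sign conventions and all names are assumptions.

```python
import numpy as np

def normalize(v):
    """P(.): zero-mean, unit-norm normalization (a simplified stand-in
    for the paper's zero-mean, unit-variance normalization)."""
    v = v - v.mean()
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def response_map(image, w, b, center, half, psize, c=0.0, d=-1.0):
    """Exhaustive local search around `center`: at each offset, score the
    patch with the linear expert C(x) = w^T P(patch at x) + b, then squash
    with a logistic to get p(aligned at x) in (0, 1)."""
    cy, cx = center
    R = np.zeros((2 * half + 1, 2 * half + 1))
    for i, y in enumerate(range(cy - half, cy + half + 1)):
        for j, x in enumerate(range(cx - half, cx + half + 1)):
            patch = image[y - psize // 2: y + psize // 2 + 1,
                          x - psize // 2: x + psize // 2 + 1]
            score = w @ normalize(patch.ravel()) + b
            R[i, j] = 1.0 / (1.0 + np.exp(d * score + c))
    return R

rng = np.random.default_rng(2)
img = rng.standard_normal((64, 64))
w = rng.standard_normal(11 * 11)  # stand-in for a trained SVM filter w_i
R = response_map(img, w, b=0.0, center=(32, 32), half=7, psize=11)
```

Each cell of R is a likelihood in (0, 1); in the paper these maps, not the raw pixels, are what the texture model is built on.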
In CLMs, the objective is to find shape model parameters p such that the positions of the resulting model on the image correspond to well-aligned parts. In probabilistic terms, we want to find the shape s(p) by solving the following:

p* = arg max_p prod_{i=1}^n p(l_i = 1 | x_i(p), I). (4)

In [24], in order to solve the optimization problem of (4), a non-parametric estimate of the response map is made in the form of a homoscedastic isotropic Gaussian kernel density estimate. The resulting optimization problem was solved in [24] using an expectation-maximization (EM) algorithm. This method is known as regularized landmark mean-shift (RLMS) [24] and has been shown to produce state-of-the-art results.

Discriminative Fitting of Response Maps (DFRM)
Instead of maximizing the probability of a reconstructed shape [24], the alignment objective of the proposed part-based approach is to directly find shape model parameters that maximize the probability of all the landmark points being aligned. For this purpose, we propose to follow a discriminative regression based approach for estimating the required shape model parameters p. That is, we propose to find a mapping from the response maps, computed at a perturbed shape estimate, to shape parameter updates.
In particular, let us assume that in the training set we introduce a perturbation Δp and compute the response map in a w × w window centered around each of the perturbed landmark points, represented by A_i(Δp) = [p(l_i = 1 | x + x_i(Δp), I)]. Then, from the responses obtained from the perturbed shapes {A_i(Δp)}_{i=1}^n, we want to learn a function f such that f({A_i(Δp)}_{i=1}^n) = Δp. We call this method Discriminative Fitting of Response Maps. Overall, the training procedure for the DFRM method has two main steps. In the first step, the goal is to train a dictionary for response map approximations. The second step involves iteratively learning the parameter update model, which is achieved by a modified boosting procedure.

Training Part-Based Response Map Model
In this section, the goal is to build a part-based response map texture model, i.e. a dictionary of response maps, that can be used for representing any instance of an unseen response map. We aim to train a separate dictionary for the response maps obtained from each of the discriminatively trained patch-experts in D (3). In other words, each part-based response map texture model represents A_i(Δp) using a small number of parameters. Let us assume we have a training set of response maps {A_i(Δp_j)}_{j=1}^T for each landmark point i, with various perturbations (including no perturbation as well). The simplest way to learn the dictionary for the i-th landmark point is to vectorize the training set of response maps and arrange them in a matrix X_i = [vec(A_i(Δp_1)), ..., vec(A_i(Δp_T))]. As we motivated in Section 2, we decompose X_i via PCA into a mean vector m_i and a basis Z_i. Then, instead of finding a regression function from the perturbed responses {A_i(Δp)}_{i=1}^n, we aim at finding a function from the low-dimensional weight vectors {h_i(Δp)}_{i=1}^n to the update of the 3D shape model parameters Δp.
As we motivated in Section 2, extraction of the corresponding weight vector h_i can be performed efficiently by a simple projection onto the PCA basis. An illustrative example of how effectively a response map can be reconstructed by a small number of PCA components (for example, the top five PCA components) is shown in Fig. 2. We refer to this dictionary as the Part-Based Response Map Model, represented by:

R_P = {M, V} : M = {m_i}_{i=1}^n and V = {Z_i}_{i=1}^n,

where m_i and Z_i are the mean vector and PCA basis, respectively, obtained for each of the n landmark points.

Training DFRM Update Model
Given a set of N training images I and the set of corresponding shapes S, the goal is to iteratively model the relationship between the joint low-dimensional projection of the response maps, obtained from the part-based response map model R_P, and the shape model parameter update (Δp). For this, we propose to use a modified boosting procedure in which we uniformly sample the 3D shape model parameter space, which controls all of the landmark positions simultaneously, within a pre-defined range around the ground truth parameters p_g (1), and iteratively model the relationship between the joint low-dimensional projection of the response maps at the current sampled shape (represented by the t-th sampled shape parameter p_t) and the shape model parameter update Δp = p_g − p_t. For the experiments in this paper, the pre-defined range is set to ±15 pixels for translation, ±10 degrees for rotation, ±0.1 for scaling and 1.5 standard deviations (based on the available training set) for the non-rigid parameters (q). The step-by-step training procedure is as follows. Let T be the number of shape parameters sampled from the shapes in S, such that the initial sampled shape parameter set is represented by P^(1); '1' in the superscript represents the initial set (first iteration). Next, extract the response maps for the shape represented by each of the sampled shape parameters in P^(1) and compute the low-dimensional projection using R_P. Then, concatenate the projections to generate a joint low-dimensional projection vector c(Δp_j), one per sampled shape, where x^(1) represents the initial set of joint low-dimensional projections obtained from the training set. Now, with the training set T^(1) = {x^(1), c^(1)}, we learn the parameter update function for the first iteration, i.e. a weak learner F^(1). For this, any regression method can be employed in our framework.
In this paper, we have chosen a simple linear support vector regression (SVR) [14] for each of the shape parameters. In total, we used 16 shape parameters: six global shape parameters (representing the six degrees of freedom corresponding to the 3D rigid transformation), and the top 10 local shape parameters (represented by q (1), corresponding to the 3D non-rigid shape variations). Structured regression based approaches could also be employed, but we opted to show the power of our method with a simple regression framework. Next, after learning F^(1), we propagate all the samples from T^(1) through F^(1) to generate T^(1)_new and eliminate the converged samples in T^(1)_new to generate T^(2) for the second iteration. Here, convergence means that the shape root mean square error between the predicted shape and the ground truth shape is less than a threshold (for example, set to 2 pixels in this paper). Now, in order to replace these eliminated converged samples, we generate a new set of samples from the same images in I whose samples converged in the first iteration. We propagate this new sample set through F^(1) and eliminate the converged samples to generate an additional replacement training set for the second iteration, T^(2)_rep. The training set for the second iteration is updated as T^(2) ← T^(2) ∪ T^(2)_rep, and the parameter update function for the second iteration is learned, i.e. a weak learner F^(2). The sample elimination and replacement procedure at every iteration has two-fold benefits. First, it plays an important role in ensuring that the progressive parameter update functions are trained on the tougher samples that have not converged in the previous iterations. Second, it helps in regularizing the learning procedure by correcting the samples that diverged in the previous iterations due to overfitting.
The above training procedure is repeated iteratively until all the training samples have converged or the maximum number of desired training iterations (h) has been reached. The resulting DFRM update model U is a set of weak learners:

U = {F^(1), ..., F^(h)}. (10)

The training procedure is outlined in Algorithm 1.
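The cascade training loop above can be sketched on synthetic data. This is a toy stand-in, not the paper's code: least-squares regressors replace the per-parameter linear SVRs, noisy linear features replace the joint response-map projections, and all names, dimensions and thresholds are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# Features are noisy linear functions of the true parameter update, as a
# stand-in for the joint PCA projections of the response maps.
T, fdim, pdim, n_iters, thresh = 500, 40, 16, 4, 0.05
A_true = rng.standard_normal((fdim, pdim))

dp = rng.uniform(-1, 1, size=(T, pdim))   # current sample perturbations
updates = []                              # weak learners F^(1..h)
for it in range(n_iters):
    feats = dp @ A_true.T + 0.01 * rng.standard_normal((T, fdim))
    # Least-squares regressor (toy replacement for the linear SVRs).
    F, *_ = np.linalg.lstsq(feats, dp, rcond=None)
    updates.append(F)
    dp = dp - feats @ F                   # propagate samples through F
    converged = np.linalg.norm(dp, axis=1) < thresh
    # Replace converged samples with fresh perturbations (re-sampling),
    # so later learners train on the samples that are still hard.
    dp[converged] = rng.uniform(-1, 1, size=(int(converged.sum()), pdim))

# Apply the learned cascade to a held-out perturbation.
p_err = rng.uniform(-1, 1, size=pdim)
for F in updates:
    feats = p_err @ A_true.T + 0.01 * rng.standard_normal(fdim)
    p_err = p_err - feats @ F
final_err = np.linalg.norm(p_err)
```

The held-out error shrinks toward the feature-noise floor, which mirrors the role of the cascade: each weak learner corrects what the previous ones left over.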

THE HOLISTIC APPROACH

Background
The most well known generative holistic non-rigid face alignment method is the active appearance model [2], [6]. An AAM is fully defined by the triplet A = {S, T, W(x; p_s)}, where S = {s_0, F_s} and T = {t_0, F_t} are the shape and texture models, while W is a function that defines the motion model (e.g. piece-wise affine warps or thin-plate splines [1], [2]). The problem of fitting the model A to a vectorized test image t (originating from an image I) is formulated as:

arg min_{p_s, c} || t(W(x; p_s)) − t_0 − F_t c ||_2^2. (11)

Gauss-Newton gradient descent is the standard choice for solving (11). Please see [2] for details.

Generative Fitting of Response Maps (GFRM)
The proposed holistic approach that relies on the response map based texture model is akin to the AAM framework [2], [6] in that it uses the 2D shape model and the motion model defined via the warping function W. However, unlike the AAM framework, which uses the facial appearance to drive the alignment procedure, the proposed holistic approach uses the response maps obtained from the discriminatively trained patch-experts. Here, the motion model defines how, given the shape, the corresponding response maps should be warped into the canonical reference frame (i.e. the mean shape). In this paper, we use the piecewise affine warping [2], [6] method to generate these shape-free response maps. The model setup for the holistic approach is {D, S, W}, where D is a set of patch-experts, S is the 2D shape model and W is the motion model. The 2D shape model S is parameterized by p = [s; r; t_x; t_y; q], where s, r, t_x and t_y are the global scaling, rotation and translations respectively, and generates the shape as

s(p) = s R(r) (s_0 + F_s q) + t, (12)

where s_0 is the mean shape, F_s is the shape basis learned from a set of training shapes by applying PCA, and q is the non-rigid shape parameter vector.
Let us assume we have a training image I and the corresponding 2D shape s; we compute the response maps {A_1, ..., A_n}, where n is the number of landmark points. Next, we generate the shape-free response maps {A_i(W(x; p))}_{i=1}^n, i.e. we warp the response maps to the mean shape. The response map based texture vector t_I is generated by vectorizing the shape-free response maps and stacking them together, i.e.

t_I = [vec(A_1(W(x; p))); ...; vec(A_n(W(x; p)))]. (13)
The whole procedure is summarized in Fig. 4. Let us assume we have M training images; then the holistic response map texture model is obtained by simply applying PCA to the set of shape-free response map based texture vectors {t_i}_{i=1}^M, so that a texture vector is approximated as t ≈ t_0 + F_t c (14), where R_H = {t_0, F_t} is the holistically trained response map texture model, t_0 is the mean shape-free response map texture vector and F_t = [F_1, ..., F_K] is the texture basis matrix represented by a set of K known response map texture variations F. As a result, the complete model setup for the proposed holistic approach is {S, D, W, R_H}. The goal of the Generative Fitting of Response Maps is to infer the shape model parameters p (12) and the response map based texture model parameters c (14). Given a test image I, the alignment objective is to minimize the l_2-norm of the error between the shape-free response maps generated by applying the patch-experts D (3) to I, represented by t_I (13), and the response map approximations synthesized via the holistic response map texture model R_H, with respect to the model parameters:

arg min_{p, c} || t_I − t_0 − F_t c ||_2^2.

This optimization can be solved very efficiently using the inverse compositional algorithm [2], a variant of the Gauss-Newton optimization procedure. Within this framework, we focus mainly on the project-out algorithm and its alternating extension for the sake of computational efficiency.
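Building the holistic texture model can be sketched as follows. Synthetic random maps stand in for the warped (shape-free) response maps, since the piecewise affine warp is outside the scope of the sketch; all dimensions and names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n_landmarks, map_size, M, K = 66, 15, 100, 8

def texture_vector(response_maps):
    """t_I: vectorize the n shape-free response maps and stack them.
    (The piecewise-affine warp to the mean shape is assumed to have been
    applied already; here the maps are synthetic placeholders.)"""
    return np.concatenate([A.ravel() for A in response_maps])

# Synthetic training set of shape-free response maps for M images.
train = [texture_vector(rng.random((n_landmarks, map_size, map_size)))
         for _ in range(M)]
X = np.stack(train, axis=1)                  # d x M data matrix

t0 = X.mean(axis=1)                          # mean texture vector t_0
U, s, Vt = np.linalg.svd(X - t0[:, None], full_matrices=False)
Ft = U[:, :K]                                # texture basis F_t (d x K)

# Approximating a new texture vector with only K parameters c:
t_new = texture_vector(rng.random((n_landmarks, map_size, map_size)))
c = Ft.T @ (t_new - t0)
t_approx = t0 + Ft @ c
```

The K-parameter approximation t_0 + F_t c is the orthogonal projection of the new texture vector onto the learned subspace, so it can never be worse than the mean alone.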

Project-Out Method
In the project-out method, the optimization is formulated such that the shape model parameters p are found by a non-linear optimization in the subspace orthogonal to the texture basis F_t, thereby ignoring the texture variation. In particular, the following optimization problem is solved:

arg min_p || t_I − t_0 ||^2_{(I − F_t F_t^T)}.

See Appendix B in the supplementary material, available online, for details.
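The weighted norm above amounts to projecting the fitting residual onto the complement of the texture subspace, which can be sketched in a few lines (illustrative dimensions and names; the residual is a random stand-in for t_I − t_0):

```python
import numpy as np

rng = np.random.default_rng(5)
d, K = 300, 8
A = rng.standard_normal((d, K))
Ft, _ = np.linalg.qr(A)          # orthonormal texture basis F_t (d x K)

r = rng.standard_normal(d)       # fitting residual t_I - t_0 (illustrative)
r_perp = r - Ft @ (Ft.T @ r)     # (I - F_t F_t^T) r: project out texture
```

After projection the residual carries no component along any texture basis vector, which is exactly why the project-out objective can ignore the texture parameters c during the shape search.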

Alternating Method
The project-out optimization procedure described above is extremely fast, but has been shown to lack robustness, especially in the case of considerable texture variation [12]. Unfortunately, texture variation coincides with the in-the-wild setting assumed in this work. An alternative would be to simultaneously optimize shape and texture, but this is extremely slow [12]. Fortunately, another option exists via an alternating optimization strategy. Suppose that the shape parameters are fixed. Then an update for the response map based texture model parameters can be readily obtained from Δc = F_t^T (t_I − t_0) and c ← c + Δc. Once c has been updated, one can compute the reconstructed response maps from t_rec = t_0 + F_t c. The shape parameters can then be updated by solving the following Lucas-Kanade problem:

arg min_p || t_I − t_rec ||_2^2.

See Appendix B in the supplementary material, available online, for details.
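The texture half of this alternation can be sketched as follows. The Gauss-Newton shape step is omitted; the c update is written in residual form (subtracting F_t c inside the projection) so that repeated updates are stable; all names and data are illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(6)
d, K = 200, 6
Ft, _ = np.linalg.qr(rng.standard_normal((d, K)))  # orthonormal basis F_t
t0 = rng.standard_normal(d)                        # mean texture t_0

def texture_step(t_I, c):
    """Closed-form texture update for fixed shape:
    c <- c + F_t^T (t_I - t_0 - F_t c), the residual form of
    Dc = F_t^T (t_I - t_0); a fixed point once c is optimal."""
    return c + Ft.T @ (t_I - t0 - Ft @ c)

def reconstruct(c):
    """t_rec = t_0 + F_t c."""
    return t0 + Ft @ c

# One alternation on a synthetic observed texture vector.
t_I = t0 + Ft @ rng.standard_normal(K) + 0.1 * rng.standard_normal(d)
c = np.zeros(K)
c = texture_step(t_I, c)       # texture update (shape held fixed)
t_rec = reconstruct(c)         # reconstructed response map texture
err = np.linalg.norm(t_I - t_rec)
```

In the full method, the reconstructed t_rec then serves as the target of the Lucas-Kanade shape step, and the two updates are interleaved until convergence.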

EXPERIMENTS AND DISCUSSION
We conducted generic non-rigid face alignment experiments on controlled and uncontrolled (a.k.a. wild) databases. For controlled settings, we use the Multi-PIE [13] database. For uncontrolled settings, we use the LFPW [4], Helen [16] and AFW [27] databases. For all experiments, we consider the independent model (p1050) of the tree-based method [27], released by the authors, as the baseline for comparison. For the multi-view variant of the proposed approach, the pose range of ±30 degrees in the yaw direction is divided into three view-based models, covering −30 to −15 degrees, −15 to 15 degrees, and 15 to 30 degrees in yaw, respectively. Other non-frontal poses have been excluded for lack of ground-truth annotations. See Appendix C in the supplementary material, available online, for a step-by-step description of how to train the robust patch-experts used in the following experiments.

Overview of Results
We test the performance of the proposed DFRM (Section 3.2), GFRM-PO (Section 4.2) and GFRM-Alternating (Section 4.2) methods against the existing state-of-the-art RLMS [24] method and the tree-based method [27]. Since the main focus of this paper is on in-the-wild generic face alignment, we also compare the performance of the proposed framework with the very recently proposed supervised descent method (SDM) [26] on three very challenging in-the-wild databases. Note that [24], [26] have not released their training code. Therefore, in order to perform a fair comparison with RLMS and SDM, we developed our own implementations and trained our own models using exactly the same data as for the other methods proposed in this paper. Furthermore, thanks to the authors of [27], who made both the training and testing code for their algorithm available, we used their code for training the tree-based models. We have to highlight once more that all the algorithms have been trained and tested on the same data and using the same features. Finally, even though we experimented with methodologies such as [18] that use generic filters, these methodologies did not work well in generic alignment scenarios, which is in line with the findings of [18].
The Multi-PIE experiment focuses on assessing the performance under combined identity, pose, expression and illumination variation. Overall, the GFRM-Alternating method and the DFRM method show equally promising results over the state-of-the-art RLMS [24] and the tree-based method [27]. The LFPW, Helen and AFW experiments further verify the generalization capability of the proposed response map texture model based framework in handling challenging uncontrolled natural variations, in that it convincingly outperforms the state-of-the-art RLMS [24] and tree-based method [27]. On these wild databases, the results show that GFRM-Alternating is again the best performing method, followed by the DFRM and GFRM-PO methods. The performance of DFRM is comparable to SDM [26].
The results on the LFPW, Helen and AFW databases also validate one of the main motivations behind the proposed face alignment framework, i.e. that the response maps extracted from an unseen image can be very accurately represented by a small set of parameters and are well suited to the task of generic face alignment under uncontrolled natural settings.

Multi-PIE Database Experiments
For this experiment, images of all 346 subjects, with all six expressions, at frontal and non-frontal poses and under various illumination conditions, are used. The training set consisted of roughly 8,300 images, which included subjects 001-170 at poses 051, 050, 140, 041 and 130, with all six expressions, at frontal illumination and one other randomly selected illumination condition. The multi-view RLMS-MPIE refers to the method trained using the HOG feature based patch experts and the RLMS alignment method (Section 3.1). The multi-view DFRM-MPIE refers to the method trained using the HOG feature based patch experts and the proposed DFRM alignment method (Section 3.2.2). The multi-view GFRM-PO-MPIE refers to the method trained using the HOG feature based patch experts and the proposed GFRM-PO alignment method (Section 4.2.1). The multi-view GFRM-Alt-MPIE refers to the method trained using the HOG feature based patch experts and the proposed GFRM-Alternating alignment method (Section 4.2.2). For the tree-based method [27], we trained the tree-based model p204-MPIE, which shares the patch templates across neighboring viewpoints and is equivalent to the multi-view approach adopted for the other alignment methods, using exactly the same training data for a fair comparison. The Multi-PIE test set consisted of roughly 7,100 images, which included subjects 171-346 at poses 051, 050, 140, 041 and 130, with all six expressions, at frontal illumination and one other randomly selected illumination condition. From the results in Fig. 5, we can clearly see that the proposed DFRM and GFRM-Alternating methods outperform the existing RLMS and the equivalent tree-based method (p204-MPIE). The GFRM-PO method also outperforms the RLMS and the equivalent tree-based method for the majority of the Multi-PIE test set.
Overall, GFRM-Alternating and DFRM are the two best performing methods, with both showing equally impressive landmark localization accuracy under controlled settings. A qualitative analysis of the results suggests that the tree-based methods [27], although suited to the tasks of face detection and rough pose estimation, are not well suited to the task of non-rigid face alignment and landmark localization. We believe this is due to the use of a tree-based shape model that allows non-face-like structures to occur frequently, especially in the case of facial expressions. See the sample alignment results in Fig. 7. As for the overall improvement, considering a normalized error (i.e. shape RMSE as a fraction of the inter-ocular distance) of 0.05 as the benchmark for very accurate landmark localization, GFRM-Alternating shows a significant improvement of 20 percent over RLMS and 30 percent over the tree-based method, whereas the next best DFRM method shows an improvement of 16 percent over RLMS and 26 percent over the tree-based method.

Wild Database Experiments
To further test the ability of the proposed response-map texture model based framework to handle unseen and uncontrolled variations, we conducted experiments using three databases that present the challenge of wild, natural settings. The Labeled Face Parts in the Wild (LFPW) database [4] consists of URLs to 1,100 training and 300 test images that can be downloaded from the internet. All of these images were captured in the wild and contain large variations in pose, illumination, expression and occlusion. We were able to download only 813 training images and 224 test images because some of the URLs are no longer valid. These images were manually annotated with 66 landmark locations to generate the LFPW ground-truth annotations. The recently released Helen database [16] consists of 2,000 training images and 330 test images. All of these images were collected from Flickr and present the challenge of being captured under completely natural real-world settings. From this, we manually annotated 890 training images and the entire test set of 330 images with 66 landmark locations to generate the Helen ground-truth annotations. In addition, we used the extremely challenging Annotated Faces in-the-Wild (AFW) database [27] to test the performance of the proposed methods on a completely unseen in-the-wild test set. The AFW test set consists of 205 images with a total of 468 faces that were manually annotated with 66 landmark locations to generate the AFW ground-truth annotations.
To generate the wild training set, we augmented the Multi-PIE training set (used in Section 5.2) with the LFPW and Helen training sets. The models trained using this wild training set are referred to as DFRM-Wild, GFRM-PO-Wild, GFRM-Alt-Wild, RLMS-Wild and p204-Wild. In addition, we also compare the proposed response-map texture model based framework to the supervised descent method (SDM) [26]. For this, we trained both the single-view SDM (as originally proposed in [26]) and a multi-view SDM: SDM-Singleview-Wild refers to the single-view SDM trained using HOG features and the wild training set, and SDM-Wild refers to the multi-view SDM trained using HOG features and the wild training set.
These models were then used to perform non-rigid face alignment on the LFPW, Helen and AFW test sets, and the results are reported in Fig. 6. From these results, we can clearly see the dominance of the proposed DFRM, GFRM-PO and GFRM-Alternating methods over RLMS and the equivalent tree-based method p204. The performance of the multi-view SDM approach is comparable to that of DFRM. Taking the normalized error of 0.05 as the benchmark for very accurate landmark localization, GFRM-Alternating shows a significant overall improvement of 20 percent over RLMS and 39 percent over the tree-based method, whereas the DFRM method shows an overall improvement of 14.5 percent over RLMS and 33 percent over the tree-based method.
The GFRM-Alternating method is consistently the best performing method. Our results show that the proposed generative and discriminative methods outperform other state-of-the-art approaches in the generic face alignment scenario. Moreover, they also demonstrate the ability to handle the challenging variations present in the wild databases (pose, illumination, facial hair, glasses and ethnicity). This result validates the main motivation behind the proposed framework, i.e., that the response maps extracted from an unseen image can be represented very accurately by a small set of parameters and are well suited for the task of generic face alignment. See the sample alignment results in Fig. 7.

CONCLUSION
In this paper, we proposed a new generic face alignment framework based on a response-map texture model that achieves state-of-the-art results in the wild. For this, we first empirically validated the superiority of the response-map based texture model over the pixel-value based texture model. Second, within this framework, we proposed a part-based alignment method (DFRM) and two holistic model based alignment methods (GFRM-PO and GFRM-Alternating) that can handle challenging in-the-wild conditions. Overall, the proposed methods are highly efficient and real-time capable.
The current MATLAB implementations of the multi-view GFRM-Alternating, GFRM-PO and DFRM methods take 4, 1 and 1 sec/image, respectively, on an Intel Xeon 3.60 GHz processor. Moreover, the current C/CUDA implementation of the DFRM method runs at 30-45 FPS on an Intel Xeon 3.60 GHz processor with an NVIDIA GeForce GTS 450 graphics card (192 cores). In this implementation, the response map for each landmark is computed in parallel using CUDA, allowing the DFRM fitting to run in real time. The GFRM-Alternating method, on the other hand, requires the Hessian and its inverse to be computed at each iteration; a real-time GFRM implementation is therefore not straightforward and is left as future work. See the supplementary material, available online, for additional experiments on benchmarking fitting accuracy (Appendix D), in-the-wild occluded images (Appendix E) and images under varying resolution (Appendix F).
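The per-landmark parallelism described above can be sketched as follows, with threads standing in for the CUDA kernels. This is an illustrative sketch, not the paper's implementation: it assumes linear patch experts with a logistic squashing (the usual CLM-style form), and all names here are hypothetical.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def response_map(patch, expert_w, expert_b):
    """Dense response of one linear patch expert over a search region.

    patch: (H, W) grayscale search region around one landmark.
    expert_w: (h, w) linear filter; expert_b: scalar bias.
    Responses are squashed to (0, 1) with a logistic function so they
    can be treated as probabilities.
    """
    H, W = patch.shape
    h, w = expert_w.shape
    out = np.empty((H - h + 1, W - w + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(patch[y:y + h, x:x + w] * expert_w) + expert_b
    return 1.0 / (1.0 + np.exp(-out))

def all_response_maps(patches, experts):
    """Compute every landmark's response map in parallel.

    patches: one search region per landmark; experts: one (w, b) pair
    per landmark. Each landmark is independent, which is what makes
    this step a natural fit for per-landmark GPU parallelism.
    """
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda a: response_map(a[0], *a[1]),
                             zip(patches, experts)))
```

Because the landmarks share no state, the map over landmarks is embarrassingly parallel; the CUDA version additionally parallelizes the inner sliding-window loop.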

Fig. 2. Pixel-value based versus response-map based texture model: (a) Sample test image from LFPW with the relevant landmarks labelled 1-8. (b) For landmarks 1-8: the first row shows the extracted image patches; the second row shows the reconstructed image patches generated by the top five PCA components of each landmark's pixel-value based texture model. (c) For landmarks 1-8: the first row shows the extracted response maps; the second row shows the reconstructed response maps generated by the top five PCA components of each landmark's response-map based texture model.
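The per-landmark reconstructions shown in the second rows of Fig. 2b and 2c can be sketched as follows. This is a generic PCA truncation, not the authors' code; the only detail taken from the caption is the five-component truncation, and the function name is hypothetical.

```python
import numpy as np

def pca_reconstruct(X, x, k=5):
    """Project a sample onto the top-k PCA components of a training
    set and reconstruct it.

    X: (n_samples, d) training vectors (vectorized image patches or
       vectorized response maps for one landmark).
    x: (d,) test vector to reconstruct.
    """
    mean = X.mean(axis=0)
    # Principal components = right singular vectors of the centered data.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    basis = Vt[:k]                      # (k, d) top-k components
    coeffs = basis @ (x - mean)         # k reconstruction parameters
    return mean + basis.T @ coeffs
```

The point of the figure is that k = 5 coefficients reconstruct response maps far more faithfully than they reconstruct raw pixel patches, which is what makes the response-map texture model so compact.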
Algorithm 1. DFRM Training Procedure
3 Generate training set for the first iteration, T(1).
4 for i = 1 → h do
5   Compute the weak learner F(i) using T(i).
6   Propagate T(i) through F(i) to generate T(i)_new.
7   Eliminate converged samples in T(i)_new to generate T(i+1).
8   if T(i+1) is empty then
9     All training samples converged. Stop training.
10  else
11    Get new shape-parameter sample set (6) from the images whose samples were eliminated in Step 7.
12    Get new joint low-dimensional projection set (7) for the samples generated in Step 11.
13    Generate new replacement training set T(i)_rep.
14    for j = 1 → (i − 1) do
15      Propagate T(i)_rep through F(j).
16      Eliminate converged samples in T(i)_rep.
17    Update T(i+1) ← {T(i+1), T(i)_rep}
Output: DFRM Update Model U (10)

3.2.3 Alignment Procedure
Given the test image I_test, the parameter update model U is used to compute the additive parameter update Δp iteratively. The efficacy of alignment is measured by an alignment score, computed at each iteration by simply adding the responses (i.e., the probability values) at the landmark locations estimated by the current shape estimate of that iteration. The final aligned shape is the shape with the highest alignment score. The alignment procedure is outlined in Algorithm 2.

Algorithm 2. DFRM Alignment Procedure
Require: I_test and s_initial
1 Compute p_test (1) from s_initial
2 Best = 0
3 for i = 1 → h do
4   Extract response maps for p_test and compute the joint low-dimensional projection (c_test)
5   Δp = F(i)(c_test)
⋮
11 Compute s_test from p_final (Eqn. 1)
Output: Final Shape (s_test)
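The alignment score used above can be sketched as follows, directly from its stated definition: sum the response-map probabilities at the landmark locations given by the current shape estimate. This is an illustrative sketch with hypothetical names, and it assumes the shape has already been mapped into each response map's local coordinates.

```python
import numpy as np

def alignment_score(response_maps, shape):
    """Sum of patch-expert responses (probabilities) at the landmark
    locations given by the current shape estimate.

    response_maps: list of (H, W) arrays, one per landmark, each
        covering that landmark's local search region.
    shape: (n_landmarks, 2) array of (x, y) locations, expressed in
        each map's local coordinates.
    """
    score = 0.0
    for rmap, (x, y) in zip(response_maps, shape):
        # Round to the nearest response-map cell, clamped to the grid.
        xi = int(round(np.clip(x, 0, rmap.shape[1] - 1)))
        yi = int(round(np.clip(y, 0, rmap.shape[0] - 1)))
        score += rmap[yi, xi]
    return score
```

Tracking the best score across iterations, as in Algorithm 2, then amounts to keeping the shape parameters p whose score was highest and computing the final shape from them.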