Fast and Robust Appearance-based Tracking

Stephan Liwicki, Stefanos Zafeiriou, Georgios Tzimiropoulos and Maja Pantic

Abstract—We introduce a fast and robust subspace-based approach to appearance-based object tracking. The core of our approach is based on Fast Robust Correlation (FRC), a recently proposed technique for the robust estimation of large translational displacements. We show how the basic principles of FRC can be naturally extended to formulate a robust version of Principal Component Analysis (PCA) which can be efficiently implemented incrementally and therefore is particularly suitable for robust real-time appearance-based object tracking. Our experimental results demonstrate that the proposed approach outperforms other state-of-the-art holistic appearance-based trackers on several popular video sequences.

I. INTRODUCTION
Visual tracking in unconstrained environments is an unsolved problem. For example, in real-world face analysis applications, tracking algorithms have to deal with significant appearance changes induced by sudden head motions, nonrigid facial deformations as well as illumination changes, cast shadows and occlusions. Such phenomena typically make most existing tracking algorithms fail.
The appearance-based approach to tracking has been one of the de facto choices for tracking faces in image sequences. Prominent examples of such an approach include subspace-based techniques [1], mixture models [2], [3], discriminative models for regression/classification [4], gradient descent [5] and, very often, combinations of the above [1], [6]-[10]. In this paper, we propose a subspace-based tracking algorithm which, to some extent, remedies typical problems encountered in face analysis applications. Our algorithm is closely related to the incremental visual tracker (IVT) of Ross et al. [9] and its incremental kernel PCA (IKPCA) extension proposed by Chin and Suter [10]; as such, it can deal with drastic appearance changes, does not require offline training, continually updates a compact object representation and uses the Condensation algorithm [11] to robustly estimate the object's location.
Similarly to IVT and IKPCA, our method is essentially an eigentracker [1] in which the eigenspace is adaptively learned and updated online. The key element which makes our approach equally fast but significantly more robust is how the eigenspace is generated. Ross et al. use standard ℓ2-norm PCA. Unfortunately, the ℓ2 norm enjoys optimality properties only when image noise is independent and identically distributed (i.i.d.) Gaussian; for data corrupted by outliers, the estimated subspace can be arbitrarily skewed [12]. A somewhat more robust approach is the method of Chin and Suter, which incrementally learns a non-linear subspace via KPCA [10]. Their tracking process requires the computation of pre-images, which imposes a trade-off between efficiency and robustness, while experimental results show that the gain in robustness appears to be rather limited.
In contrast, the proposed tracker is based on a robust reformulation of PCA which requires only straightforward optimizations and is as computationally efficient as ℓ2-norm PCA. More specifically, our approach is based on a dissimilarity measure originally introduced by Fitch et al. in the context of robust correlation-based estimation of large translational displacements [13]. The basic idea is to suppress gross errors by encoding pixel intensities as angles and measuring dissimilarity through the cosine of angle differences. We show how the framework for robust correlation can be naturally extended to form a robust version of PCA which replaces the ℓ2 norm with the dissimilarity measure of Fitch et al. Finally, we use our direct robust PCA within the framework of IVT for efficient and robust appearance-based tracking.

A. Principal Component Analysis with ℓ 2 Norm
Let x_i be the d-dimensional vector obtained by writing image I_i in lexicographic ordering. We assume that we are given a population of n samples X = [x_1 · · · x_n] ∈ R^{d×n}. Let us also denote by x̄ = (1/n) ∑_{i=1}^n x_i and X̄ the sample mean and the centralized sample matrix of X, respectively. ℓ2-norm PCA finds a set of p < d (usually, p ≪ d) orthonormal basis functions B = [b_1 · · · b_p] ∈ R^{d×p} by minimizing the error function

$\varepsilon(B) = \|\bar{X} - B B^T \bar{X}\|_F^2,$   (1)

where ||.||_F denotes the Frobenius norm. The above optimization problem is equivalent to:

$\max_B \ \mathrm{tr}\left[B^T \bar{X} \bar{X}^T B\right] \quad \text{subject to} \quad B^T B = I,$   (2)

where tr[.] is the trace of a matrix. The solution is given by the eigenvectors corresponding to the p largest eigenvalues obtained from the eigendecomposition of the covariance matrix S = X̄ X̄^T (or the Singular Value Decomposition (SVD) of X̄). Finally, the reconstruction of X from the subspace spanned by the columns of B is given by X̂ = BC + M, where C = B^T X̄ is the matrix of projection coefficients and M is a matrix with n columns, each of which is the mean vector x̄.
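As a concrete illustration of the SVD route described above, the following is a minimal NumPy sketch (the function name `pca_l2` is ours, not from the paper):

```python
import numpy as np

def pca_l2(X, p):
    """Standard l2-norm PCA of the columns of X (d x n) via the SVD
    of the centralized sample matrix."""
    mean = X.mean(axis=1, keepdims=True)          # sample mean
    Xc = X - mean                                 # centralized sample matrix
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    B = U[:, :p]                                  # top-p orthonormal basis
    C = B.T @ Xc                                  # projection coefficients
    return B, C, mean

# Reconstruction from the subspace: X_hat = B C + M.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))                # 20 samples of dimension 100
B, C, mean = pca_l2(X, p=5)
X_hat = B @ C + mean
```

Increasing p can only decrease the Frobenius reconstruction error, which is the sense in which (1) is minimized by the leading singular vectors.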

B. Cosine-based Error Function
The error function in (1) is based on the ℓ2 norm and is therefore extremely sensitive to gross errors caused by outliers [12]. Motivated by the recent work of Fitch et al. on robust correlation-based translation estimation [13], we replace the ℓ2 norm with the following dissimilarity measure

$d(\mathbf{x}_i, \mathbf{x}_j) = \sum_{k=1}^{d} \left(1 - \cos\left(\alpha\pi(x_i(k) - x_j(k))\right)\right),$   (3)

where the pixel values of the corresponding images I_i, I_j are represented in the range [0, 1] and α ∈ R^+. As noted by Fitch et al., for pixel intensities in the range [0, 1], (3) is equivalent to Andrews' M-estimate [13]. In particular, Andrews' influence function, i.e. the derivative of the kernel, is given by

$\psi(u) = \alpha\pi \sin(\alpha\pi u).$   (4)

The Fast Robust Correlation (FRC) scheme proposed by Fitch et al. [13] utilizes (3) and, unlike ℓ2-based correlation, is able to estimate large translational displacements in real images while achieving the same computational complexity. In the following, we show how to exploit the cosine kernel to formulate a direct robust version of PCA.
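The robustness of (3) is easy to see numerically: each pixel's contribution is bounded by 2, so a single gross outlier cannot dominate the sum the way it can under the squared ℓ2 distance. A small sketch (function name ours):

```python
import numpy as np

def cosine_dissimilarity(xi, xj, alpha=0.7):
    """The dissimilarity of eq. (3): pixel intensities in [0, 1] are
    encoded as angles and compared via the cosine of their differences.
    (alpha = 0.7 is the value the paper later selects by validation.)"""
    return float(np.sum(1.0 - np.cos(alpha * np.pi * (xi - xj))))

# One corrupted pixel contributes at most 2, so gross errors saturate.
x = np.full(100, 0.5)
y = x.copy()
y[0] = 1.0                        # one gross outlier
d_cos = cosine_dissimilarity(x, y)
d_l2 = float(np.sum((x - y) ** 2))
```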

C. Fast, Direct and Robust PCA
To show how (3) can be used as a basis for direct and robust PCA, let us, for notational convenience, first write

$d(\mathbf{x}_i, \mathbf{x}_j) = \sum_{k=1}^{d} \left(1 - \cos\left(\alpha\pi(x_i(k) - x_j(k))\right)\right) = \frac{1}{2} \sum_{k=1}^{d} \left\| \begin{bmatrix} \cos(\alpha\pi x_i(k)) \\ \sin(\alpha\pi x_i(k)) \end{bmatrix} - \begin{bmatrix} \cos(\alpha\pi x_j(k)) \\ \sin(\alpha\pi x_j(k)) \end{bmatrix} \right\|^2 = \frac{1}{2} \|\mathbf{z}_i - \mathbf{z}_j\|^2,$   (5)

where

$\mathbf{z}_i = \left[\cos(\alpha\pi x_i(1)), \sin(\alpha\pi x_i(1)), \ldots, \cos(\alpha\pi x_i(d)), \sin(\alpha\pi x_i(d))\right]^T \in \mathbb{R}^{2d}.$   (6)

The last equality in (5) makes the basic computational module of the proposed scheme apparent. That is, we define the mapping (6) from [0, 1]^d to the (2d)-dimensional sphere with radius √d and apply linear PCA to the transformed data. Notice that when α < 2, this mapping is one-to-one and, therefore, reconstruction of the original input space is feasible by applying simple trigonometry.

Algorithm 1 ESTIMATING THE PRINCIPAL SUBSPACE
Input: A set of n images I_i, i = 1, . . . , n, of d pixels, the number p of principal components and parameter α.
Output: The principal subspace B, eigenvalues Σ and mean vector z̄ of the transformed data.
Step 1. Represent I_i in [0, 1] and obtain x_i by writing I_i in lexicographic ordering.
Step 2. Compute z_i using (6), form the matrix of the transformed data Z = [z_1 · · · z_n] ∈ R^{2d×n} and compute z̄ and the centralized sample matrix Z̄.
Step 3. Compute the matrix W = Z̄^T Z̄ ∈ R^{n×n} and find the eigendecomposition W = UΛU^T.
Step 4. Find the p-reduced set U_p ∈ R^{n×p} and Λ_p ∈ R^{p×p}.
Step 5. Compute the principal subspace B = Z̄ U_p Λ_p^{−1/2} and set Σ = Λ_p^{1/2}.
Step 6. Reconstruct using Ẑ = BB^T Z̄ + M, where M contains the mean vector z̄ as columns.
Step 7. Go back to the pixel domain using trigonometry.

Algorithm 2 EMBEDDING OF NEW SAMPLES
Input: An image J of d pixels and the principal subspace B of Algorithm 1.
Step 1. Represent J in [0, 1] and obtain y by writing J in lexicographic ordering.
Step 2. Find z using (6) and obtain the embedding as B^T z.
For high-dimensional data such as images, the proposed framework enables a fast implementation by making use of the following theorem [14].

Theorem I: Define matrices A and B such that A = ΦΦ^T and B = Φ^T Φ. Let U_A and U_B be the eigenvectors corresponding to the non-zero eigenvalues Λ_A and Λ_B of A and B, respectively. Then, Λ_A = Λ_B and U_A = Φ U_B Λ_B^{−1/2}.

Algorithm 1 summarizes the steps of our direct robust PCA. Our framework also enables the direct embedding of new samples; Algorithm 2 summarizes this procedure.
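The mapping (6) followed by linear PCA on the small n × n Gram matrix (Theorem I) can be sketched as follows. This is our reading of Algorithm 1 (names `to_sphere` and `robust_pca` are ours):

```python
import numpy as np

def to_sphere(x, alpha=0.7):
    """Eq. (6): map x in [0,1]^d onto the 2d-dimensional sphere of
    radius sqrt(d) by encoding each pixel as (cos, sin) of an angle."""
    a = alpha * np.pi * np.asarray(x)
    return np.concatenate([np.cos(a), np.sin(a)])

def robust_pca(images, p, alpha=0.7):
    """Sketch of Algorithm 1: linear PCA on the transformed data,
    using the n x n matrix of Theorem I (cheap when n << d)."""
    Z = np.stack([to_sphere(x, alpha) for x in images], axis=1)  # 2d x n
    z_mean = Z.mean(axis=1, keepdims=True)
    Zc = Z - z_mean                            # centralized transformed data
    W = Zc.T @ Zc                              # Step 3: n x n matrix
    lam, U = np.linalg.eigh(W)                 # ascending eigenvalues
    lam, U = lam[::-1][:p], U[:, ::-1][:, :p]  # Step 4: p largest
    B = Zc @ U / np.sqrt(lam)                  # Step 5: lift via Theorem I
    return B, np.sqrt(lam), z_mean
```

The lifted columns Zc U Λ^{−1/2} are orthonormal because (Zc u)ᵀ(Zc u) = uᵀWu = λ for each eigenpair (λ, u).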

D. A Kernel PCA Perspective
The proposed PCA with the cosine-based dissimilarity measure can be interpreted as kernel PCA (KPCA). Let k : R^d × R^d → R be a positive definite function satisfying Mercer's conditions. Then, k defines an arbitrary-dimensional Hilbert space H (the so-called feature space in the rest of the paper) through an implicit mapping φ : R^d → H. KPCA [15] is defined exactly as PCA in the feature space and aims at finding a set of projection bases by minimizing the least-squares reconstruction error in the feature space.
Let us define the kernel

$k(\mathbf{x}_i, \mathbf{x}_j) = \sum_{k=1}^{d} \cos\left(\alpha\pi(x_i(k) - x_j(k))\right).$   (7)

Theorem II: The kernel defined in (7) is positive semidefinite.
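The mechanism behind Theorem II can be checked numerically: the kernel equals a dot product of the explicit feature maps of (6), hence every kernel matrix is a Gram matrix. A short sanity check (variable names ours):

```python
import numpy as np

alpha = 0.7
rng = np.random.default_rng(1)
xi, xj = rng.random(64), rng.random(64)       # pixel vectors in [0, 1]

def phi(x):
    """Explicit feature map, i.e. the mapping z of eq. (6)."""
    a = alpha * np.pi * x
    return np.concatenate([np.cos(a), np.sin(a)])

# cos(a - b) = cos a cos b + sin a sin b, summed over pixels:
k = np.sum(np.cos(alpha * np.pi * (xi - xj)))
assert np.isclose(k, phi(xi) @ phi(xj))
# The dissimilarity (3) is then d - k(x_i, x_j).
assert np.isclose(np.sum(1 - np.cos(alpha * np.pi * (xi - xj))), len(xi) - k)
```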

Algorithm 3 INCREMENTAL PRINCIPAL SUBSPACE ESTIMATION
Input: A mean vector z̄_n, the principal subspace B_n ∈ R^{2d×p}, the root of the corresponding eigenvalues Σ_n ∈ R^{p×p}, a set of new images {I_{n+1}, . . . , I_{n+m}}, the number p of principal components and parameter α. Output: The new subspace B_{n+m}, eigenvalues Σ_{n+m} and new mean z̄_{n+m}.
Step 1. From set {I n+1 , . . . , I n+m } compute the matrix of the transformed data Z m = [z n+1 · · · z n+m ] and the mean vector z m .
Step 2. Compute the new mean vector z̄_{n+m} = (n/(n+m)) z̄_n + (m/(n+m)) z̄_m and form the matrix Ẑ = [Z_m − z̄_m 1^T, √(nm/(n+m)) (z̄_m − z̄_n)].
Step 3. Compute F̃ = orth(Ẑ − B_n B_n^T Ẑ) and R = [[Σ_n, B_n^T Ẑ], [0, F̃^T (Ẑ − B_n B_n^T Ẑ)]].
Step 4. Compute the SVD R = B̃ Σ̃ Ṽ^T and obtain the p-reduced set B̃_p and Σ̃_p.
Step 5. Compute B_{n+m} = [B_n F̃] B̃_p and set Σ_{n+m} = Σ̃_p.
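Because the feature map is explicit, the incremental step can reuse the incremental SVD machinery of IVT [9] directly on the transformed data. The following is a sketch under that assumption; the variable conventions and function name are ours, not the paper's:

```python
import numpy as np

def incremental_update(B, sigma, z_mean, n, Z_new, p):
    """Sketch of Algorithm 3: update mean and principal subspace with m
    new (already transformed) samples Z_new (2d x m), in the style of
    the incremental SVD used by IVT [9]."""
    m = Z_new.shape[1]
    mean_m = Z_new.mean(axis=1, keepdims=True)
    mean_upd = (n * z_mean + m * mean_m) / (n + m)          # Step 2
    # centred new data plus a mean-correction column
    Zh = np.hstack([Z_new - mean_m,
                    np.sqrt(n * m / (n + m)) * (mean_m - z_mean)])
    proj = B.T @ Zh                           # coefficients in old subspace
    resid = Zh - B @ proj                     # energy outside old subspace
    F, _ = np.linalg.qr(resid)                # Step 3: orthonormal complement
    R = np.block([[np.diag(sigma), proj],
                  [np.zeros((F.shape[1], len(sigma))), F.T @ resid]])
    Bt, st, _ = np.linalg.svd(R)              # Step 4: small SVD
    B_upd = np.hstack([B, F]) @ Bt[:, :p]     # Step 5: new basis
    return B_upd, st[:p], mean_upd, n + m
```

Since all matrices involved in the small SVD have O(p + m) columns, the cost per update is independent of how many frames have been seen so far.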
Proof: Using the analysis in (5), we can write the kernel k(x_i, x_j) as a dot product:

$k(\mathbf{x}_i, \mathbf{x}_j) = \sum_{k=1}^{d} \cos\left(\alpha\pi(x_i(k) - x_j(k))\right) = \mathbf{z}_i^T \mathbf{z}_j,$   (8)

which proves Theorem II. Using (7), we can write the proposed dissimilarity measure (3) as

$d(\mathbf{x}_i, \mathbf{x}_j) = d - k(\mathbf{x}_i, \mathbf{x}_j).$   (9)

Moreover, from (8) we can easily verify that φ(x_i) has a closed form, i.e. φ(x_i) = z_i. This is in contrast to other popular kernels in machine learning, such as Gaussian RBFs [10], [15], for which φ is defined only implicitly. Such kernels allow only for inexact fast incremental versions of KPCA [10]. On the other hand, since in our case the mapping is explicit, our incremental robust PCA is both fast and exact. Algorithm 3 summarizes the main steps.

III. FAST AND ROBUST TRACKING

Similarly to Ross et al. [9], we model the tracking process using a Markov model with the affine transform A_t as hidden state. That is, the location of the object at time t is defined by the affine transform parameters A_t. Given a set of observations Z_t = {z_1, . . . , z_t}, A_t can be computed by maximizing p(A_t | Z_t).

Algorithm 4 TRACKING ALGORITHM FOR TIME t
Input: Mean vector z̄_{t−1}, subspace B_{t−1}, location A_{t−1} of time t − 1 and current image frame I_t.
Step 1. Draw a number of particles A p (in our case 600) from p(A t |A t−1 ).
Step 2. Take all image patches from I_t which correspond to the particles A^p, write them in lexicographic ordering to form vectors y^p and compute z^p using (6).
Step 3. Choose {A t , z t } = arg max A p ,z p p(z p |A p ).
Step 4. Using z t update mean and subspace by applying Algorithm 3.
A. Modeling p(A_t | A_{t−1})

To obtain an approximation for the above, we use a variant of the well-known Condensation algorithm [9], [11], using

$p(A_t | Z_t) \propto p(\mathbf{z}_t | A_t) \int p(A_t | A_{t-1})\, p(A_{t-1} | Z_{t-1})\, dA_{t-1}.$   (10)

We use a typical Brownian motion model for the dynamics between A_t and A_{t−1}. That is, the elements of A_t are modeled independently by a Gaussian distribution around the previous state A_{t−1}:

$p(A_t | A_{t-1}) = \mathcal{N}(A_t; A_{t-1}, \Xi),$   (11)

where Ξ is a diagonal covariance matrix whose elements are the corresponding variances of the affine parameters. In a particle filtering fashion, we sample p(A_t | A_{t−1}) by drawing a number of particles from (11). It is well known that there is a trade-off between the number of particles and how well the sampling approximates the distribution (11). In our experiments, we used 600 particles, as in Ross et al. [9].
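Sampling from (11) amounts to adding independent Gaussian noise to each affine parameter. A minimal sketch (the parameterization and variance values below are illustrative assumptions, not the paper's):

```python
import numpy as np

def draw_particles(A_prev, xi_diag, n_particles=600, rng=None):
    """Draw particles from the Brownian motion model
    p(A_t | A_{t-1}) = N(A_t; A_{t-1}, Xi) with diagonal covariance Xi."""
    rng = np.random.default_rng(rng)
    noise = rng.standard_normal((n_particles, len(A_prev)))
    return A_prev + noise * np.sqrt(xi_diag)

# Six affine parameters, e.g. translation x/y, scale, rotation,
# aspect ratio, skew (hypothetical state and variances).
A_prev = np.array([120.0, 80.0, 1.0, 0.0, 1.0, 0.0])
xi = np.array([25.0, 25.0, 0.001, 0.01, 0.001, 0.01])
particles = draw_particles(A_prev, xi, n_particles=600, rng=0)
```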

B. Modeling p(z t |A t )
Similarly to probabilistic PCA [16], we model the probability p(z_t | A_t) as

$p(\mathbf{z}_t | A_t) = p_w(\mathbf{z}_t | A_t)\, p_d(\mathbf{z}_t | A_t),$   (12)

where:
• p_w(z_t | A_t) is the likelihood of the sample projected onto the principal subspace spanned by the columns of B, modeled by the exponential of the Mahalanobis distance from the mean:

$p_w(\mathbf{z}_t | A_t) \propto \exp\left\{-\frac{1}{2}\left\|\Sigma^{-1} B^T (\mathbf{z}_t - \bar{\mathbf{z}})\right\|^2\right\},$   (13)

where z̄ is the mean vector and Σ contains the eigenvalues that correspond to the principal subspace B.
• p_d(z_t | A_t) is the probability of a sample being generated from the principal subspace spanned by the columns of B.
If we assume that the observation process is governed by an additive Gaussian model with variance term εI, then

$p_d(\mathbf{z}_t | A_t) \propto \exp\left\{-\frac{1}{2\epsilon}\left\|(\mathbf{z}_t - \bar{\mathbf{z}}) - B B^T (\mathbf{z}_t - \bar{\mathbf{z}})\right\|^2\right\}.$   (14)

Having defined models for p(A_t | A_{t−1}) and p(z_t | A_t), the sequential inference model is summarized in Algorithm 4.
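Since only the arg-max over particles matters in Step 3 of Algorithm 4, it suffices to evaluate the unnormalized log of (12). A sketch of that evaluation (function name and the default ε are our assumptions):

```python
import numpy as np

def log_likelihood(z, B, sigma, z_mean, eps=0.1):
    """Unnormalized log of p_w * p_d: Mahalanobis distance within the
    subspace plus a distance-to-subspace term with variance eps.
    Normalization constants are dropped; they do not affect the
    arg-max over particles."""
    zc = z - z_mean
    c = B.T @ zc                         # projection coefficients
    within = np.sum((c / sigma) ** 2)    # Mahalanobis term of p_w
    resid = zc - B @ c                   # component outside the subspace
    return -0.5 * (within + resid @ resid / eps)
```

In tracking, one would evaluate this for every particle's transformed patch z^p and keep the particle with the largest value.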

IV. RESULTS
The proposed tracker (which we coin FDR-PCA for the rest of the paper) is tested on several publicly available, challenging video sequences which contain intrinsic and extrinsic changes to the tracked faces. The state-of-the-art IVT of Ross et al. [9] and its IKPCA extension by Chin and Suter [10] serve as baselines, as both are appearance-based holistic trackers which localize the foreground without additional background models. The initial position of the objects, the number of particles and the size of the eigenspaces are identical for all methods on each video sequence. Additionally, the results of another holistic tracker proposed by Zhou et al. [3] are included in the experiments.
For the proposed FDR-PCA, the parameter α used by the kernel function (3) must be set a priori. Different values were tested on a validation set of video sequences (different from the set used for the experiments presented in this section); α = 0.7 performed best on this validation set, and the parameter was therefore fixed to this value. The variance of the Gaussian RBF kernel used with the IKPCA algorithm was selected in a similar manner.

A. Quantitative Evaluation
The Dudek video sequence forms the data for the quantitative evaluation (fig. 3). In this sequence, each frame contains seven annotated points which describe the true location and deformation of the face. The points' positions in the first frame are given and used to define the initial transformation of the unit square for the holistic trackers. The trackers then estimate the transformation for subsequent frames, from which the new positions of the points are calculated. The tracking accuracy in subsequent frames is then defined as the root mean square (RMS) error between the ground truth and the estimated points. Fig. 4 plots the RMS error over the whole Dudek video sequence for both the proposed and IVT methods.
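For concreteness, the error metric above can be computed as follows; this is one plausible reading of the protocol (RMS of the per-point Euclidean distances), not a definitive reconstruction:

```python
import numpy as np

def rms_error(estimated, ground_truth):
    """RMS error between tracked and annotated points, taken here as
    the root mean square of per-point Euclidean distances.
    Both arrays have shape (7, 2) for the seven Dudek points."""
    d = np.asarray(estimated) - np.asarray(ground_truth)
    return float(np.sqrt(np.mean(np.sum(d ** 2, axis=1))))

# A uniform (3, 4)-pixel displacement of every point gives RMS = 5.
gt = np.array([[10.0, 10.0], [20.0, 10.0], [15.0, 20.0],
               [10.0, 30.0], [20.0, 30.0], [15.0, 40.0], [15.0, 5.0]])
est = gt + np.array([3.0, 4.0])
print(rms_error(est, gt))   # → 5.0
```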
The method of Zhou et al. [3] loses track after the occlusion between frames 100 and 120. IKPCA fails to estimate the motion in frame 288, after the filmed person rises from the chair in a quick movement. Only two methods, IVT and FDR-PCA, manage to follow the object for the whole length of the video. The mean RMS errors of these two methods are compared in table I; the proposed method performs most accurately. The RMS errors during the occlusion between frames 100 and 120 are compared in fig. 1. IKPCA performs competitively until the occlusion; thereafter, however, its RMS errors are higher. IVT generally performs less accurately than IKPCA and FDR-PCA before the occlusion. The occlusion itself has little impact on this method, and the algorithm continues with similar accuracy afterwards. The tracker proposed in this paper recovers most quickly from the occlusion: the effects of the occlusion are counteracted by the robustness of the scheme, and the overall displacement of the unit square is kept to a minimum. The accuracy of FDR-PCA during the motion blur around frames 288 and 486 is slightly lower than that of IVT, but the pose variation in frame 470 is handled better (fig. 2).
Finally, fig. 6 plots the RMS error versus α. As can be seen, the algorithm performs well for a wide range of α values.

B. Qualitative Evaluation
Three challenging video sequences with difficult illumination, occlusions and pose variations were used for the qualitative evaluation. Fig. 5 shows the results of the different trackers when the target object undergoes several pose changes and illumination alterations. IKPCA is the first method to lose the object in this sequence, due to variations in the lighting conditions. While the scheme of Zhou et al. copes with the change in frame 77, it fails after the extreme illumination changes in frame 172, just after the object moves from a bright into a dark area. IVT and the proposed tracker prove robust to this type of change, as both methods successfully track the object until frame 329. The frames around frame 329 contain difficult, prolonged pose changes, which cause IVT to lose track in frame 329. The proposed FDR-PCA tracker successfully follows the face through the video sequence until it eventually misclassifies the object's position in frame 330. Thus, for this video sequence, the proposed tracker outperforms the other state-of-the-art trackers as it is more robust to illumination changes and pose variation. Fig. 7.a shows the proposed tracker under variations in both illumination and pose, as well as occlusion. In this sequence, the tracker successfully tracks the face throughout the complete sequence of frames. Even after the side-view of the face in frame 162, FDR-PCA recovers considerably better than IVT, and the target is therefore localized correctly between frames 179 and 198. The occlusion between frames 331 and 387 is handled by both approaches. The effect of occlusions on the proposed tracking scheme is presented in the video shown in fig. 7.b. The tracker quickly and successfully recovers from prolonged occlusions, as in frames 497 and 722. In comparison to IVT, its performance is more robust for this sequence.

V. CONCLUSIONS AND FUTURE WORK
We introduced a fast, direct and robust approach to incremental PCA for appearance-based visual tracking. Our results show that the proposed tracker is robust to illumination changes, some pose variations, intrinsic alterations and most prolonged occlusions. Our tracker outperforms existing holistic visual trackers in quantitative and qualitative evaluations. In contrast to IKPCA [10], the proposed scheme avoids the pre-image optimization required for finding the mean in the feature space of an implicit kernel function, yet still performs robust kernel PCA. Our tracker directly utilizes the incremental learning framework of IVT [9], and is therefore not only more robust but also equally fast. In future work, tracking may be improved by employing multiple adaptive expert appearance models for different views of the object. Within this framework, extreme changes in the object's appearance would trigger the generation of a new appearance model for the new pose. Additionally, a more sophisticated particle generator than simple Condensation may be added to the particle filter. This may improve both the efficiency and the accuracy of the proposed algorithm, as fewer particle likelihoods would need to be calculated for the same performance.