Face Recognition Using Parzenfaces
Zhirong Yang and Jorma Laaksonen
Laboratory of Computer and Information Science⋆ Helsinki University of Technology
P.O. Box 5400, FI-02015 TKK, Espoo, Finland
{zhirong.yang, jorma.laaksonen}@tkk.fi
Abstract. A novel discriminant analysis method is presented for the face recognition problem. It has recently been shown that predictive objectives based on Parzen estimation are advantageous for learning discriminative projections when the class distributions are complicated in the projected space. However, existing algorithms based on Parzen estimators require expensive computation to obtain the gradient for optimization. We propose an acceleration technique that reformulates the gradient and implements its computation by matrix products. Furthermore, we point out that regularization is necessary for high-dimensional face recognition problems. The discriminative objective is therefore extended by a smoothness constraint on facial images. Our Parzen Discriminant Analysis method can be trained much faster and achieves higher recognition accuracies than the compared algorithms in experiments on two widely used face databases.
1 Introduction
Face Recognition (FR) is an active research topic and is likely to become even more so in the coming years. The challenge of FR arises first from the high dimensionality of facial images. The problem is more challenging in the presence of structured variations such as poses and expressions, which are difficult to model and cause the data to be distributed on complicated manifolds. Therefore, research in this field is not only useful for classifying faces but also conducive to other high-dimensional pattern recognition problems.
A substantial amount of effort has been devoted to the FR problem, among which Fisher’s Linear Discriminant Analysis (LDA) is widely used. Modeling each class by a single Gaussian distribution with a common covariance, LDA maximizes the Fisher criterion of between-class scatter over within-class scatter and can be solved by Singular Value Decomposition (SVD). The facial feature extraction by LDA is called Fisherfaces [1]. The Fisherface method is attractive for its simplicity, but the assumption of Gaussians with a common covariance heavily restricts its performance. Moreover, Fisherface requires preprocessing by Principal Component Analysis (PCA), and discriminative information may be lost during this unsupervised dimensionality reduction. Many variants of Fisherface, such as [2], have since been proposed. However, the Fisherface method and its variants make use of only the first- and second-order statistics of the class distributions while discarding the higher-order statistics.
⋆ Supported by the Academy of Finland in the projects Neural methods in information
Recently Goldberger et al. [3] proposed Neighborhood Component Analysis (NCA), which learns a linear transformation matrix by maximizing the summed likelihood of the labeled data. The probability density at each data point is estimated by using the neighbors in the transformed space, which turns out to be the Parzen estimation of the posterior of the class label. Peltonen and Kaski later proposed a very similar method called Informative Discriminant Analysis (IDA) [4], in which they instead employ the log-likelihood, i.e. the information of the predictive probability density. The likelihood formulation allows NCA and IDA to model very complicated class distributions. It was reported that these two methods outperform traditional discriminant analysis approaches in a number of low-dimensional supervised learning problems. However, the optimization of NCA or IDA requires the gradient of the Parzen-based objective, the computation of which is too expensive for most applications. To obtain an orthonormal transformation matrix, IDA employs a reparameterization based on Givens rotations, which further increases the computation and prevents its application to high-dimensional data. Peltonen et al. later proposed a modified version [5] that speeds up the computation by using a small number of Gaussian mixture components instead of the Parzen method. This nevertheless loses the advantage of nonparametric estimation: one has to insert additional EM iterations before computing the gradient, and it is unclear how to select an appropriate number of Gaussians.
In this paper we point out that the computational burden of calculating the gradient in NCA and IDA can be significantly reduced by using matrix multiplication. Next, the Givens reparameterization in IDA can be replaced by geodesic updates on the Stiefel manifold, which further simplifies the optimization. Furthermore, we propose to regularize the projection matrix by a smoothness constraint, implemented as an additional penalization term on local pixel variance. We name the new method Parzenface when our discriminant analysis is applied to the face recognition problem. Experiments on two public facial image databases, FERET [6] and ORL [7], demonstrate that our learning algorithm achieves higher accuracy and runs much faster than NCA and IDA.
2 Parzen Discriminant Analysis
2.1 Unregularized objective
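Given n labeled samples $(x_i, c_i)$ with $x_i \in \mathbb{R}^m$ and class labels $c_i$, the goal is to find an m × r orthonormal matrix W whose projections $y_i = W^T x_i$ maximize the discriminative information, i.e. the sum of predictive log-likelihood

$$J(W) = \sum_{i=1}^{n} \log p(c_i \mid y_i) = \sum_{i=1}^{n} J_i, \qquad (1)$$

where $J_i$ is the shorthand notation for $\log\left[\sum_{j: c_i = c_j} e_{ij} \big/ \sum_{j=1}^{n} e_{ij}\right]$. Here $e_{ij}$ is a Gaussian window on the distance between $y_i$ and $y_j$, with $\sigma$ a positive parameter which controls the Gaussian window width.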
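The Parzen-based objective can be evaluated in a few lines. The following is a minimal sketch rather than the authors' implementation. It assumes that samples are stored as the columns of `X`, that projections are y_i = W^T x_i, and that the window is e_ij = exp(-||y_i - y_j||^2 / (2σ^2)) with e_ii = 0; the function name `parzen_objective` is hypothetical.

```python
import numpy as np

def parzen_objective(X, W, labels, sigma):
    """Sum over i of log( sum_{j: c_i = c_j} e_ij / sum_j e_ij ).

    X is m x n (one sample per column), W is m x r, labels has length n.
    Every class is assumed to contain at least two samples.
    """
    Y = W.T @ X                                        # r x n projected samples
    sq = np.sum(Y ** 2, axis=0)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (Y.T @ Y)   # pairwise squared distances
    E = np.exp(-d2 / (2.0 * sigma ** 2))               # assumed Gaussian window
    np.fill_diagonal(E, 0.0)                           # assumed: sample i is left out of its own estimate
    same = labels[:, None] == labels[None, :]          # True where c_i == c_j
    within = (E * same).sum(axis=1)
    total = E.sum(axis=1)
    return float(np.sum(np.log(within / total)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    labels = np.array([0, 0, 0, 1, 1, 1])
    X = rng.normal(size=(5, 6))
    W, _ = np.linalg.qr(rng.normal(size=(5, 2)))       # random orthonormal 5 x 2 matrix
    print(parzen_objective(X, W, labels, sigma=1.0))
```

Each summand is the logarithm of a ratio that cannot exceed one, so the value is never positive and approaches zero as the classes separate in the projected space.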
2.2 Computing the gradient
Our optimization algorithm is based on the gradient of J(W) with respect to W [...]

Notice that the chain rule in the inner summation applies to the subscript j, i.e. treating $y_j$ as an intermediate variable and $y_i$ as a constant. Denote

$$G_{ij} \equiv \frac{\partial J_i}{\partial \|y_j - y_i\|^2} \qquad (5)$$

for notational simplicity. The gradient then becomes

∇ = [...]

[...] reveals that the computation can be significantly reduced. That is, the gradient can be computed by matrix operations as

$$\nabla = 2 X (D - G) X^T W. \qquad (12)$$
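The saving can be illustrated by comparing a pairwise double loop with the matrix-product form of (12). The sketch below is not the authors' code and rests on several assumptions: samples are the columns of `X`, y_i = W^T x_i, the window is e_ij = exp(-||y_i - y_j||^2 / (2σ^2)) with e_ii = 0, and G in the matrix form is read as the symmetrized weights G + G^T with D the diagonal matrix of their row sums, which is one reading under which the two computations agree. All function names are hypothetical.

```python
import numpy as np

def pairwise_weights(Y, labels, sigma):
    """G[i, j] = dJ_i / d||y_j - y_i||^2 under the assumed Gaussian window."""
    sq = np.sum(Y ** 2, axis=0)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (Y.T @ Y)
    E = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(E, 0.0)                           # assumed: e_ii = 0
    same = labels[:, None] == labels[None, :]
    within = (E * same).sum(axis=1, keepdims=True)     # same-class window mass of sample i
    total = E.sum(axis=1, keepdims=True)               # total window mass of sample i
    return -(E * same / within - E / total) / (2.0 * sigma ** 2)

def gradient_loops(X, W, labels, sigma):
    """Reference gradient that visits every pair of samples explicitly."""
    G = pairwise_weights(W.T @ X, labels, sigma)
    m, n = X.shape
    S = np.zeros((m, m))
    for i in range(n):
        for j in range(n):
            diff = X[:, j] - X[:, i]
            S += G[i, j] * np.outer(diff, diff)
    return 2.0 * S @ W

def gradient_matrix(X, W, labels, sigma):
    """Same gradient via matrix products, in the form 2 X (D - G) X^T W."""
    G = pairwise_weights(W.T @ X, labels, sigma)
    Gs = G + G.T                                       # assumed symmetrization
    D = np.diag(Gs.sum(axis=1))
    return 2.0 * X @ (D - Gs) @ X.T @ W

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
    X = rng.normal(size=(6, 8))
    W, _ = np.linalg.qr(rng.normal(size=(6, 2)))
    print(np.allclose(gradient_loops(X, W, labels, 1.0),
                      gradient_matrix(X, W, labels, 1.0)))
```

The loop version forms n^2 outer products of size m × m, whereas the matrix version needs only a handful of dense products, which is where optimized linear algebra routines pay off.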
It is known that there exist fast algorithms that implement matrix multiplication in $O(\tau^q)$ time, where $\tau = \min(m, n)$ and q is a positive scalar less than 3 and approaching 2 [8]. Many researchers believe that an optimal algorithm will run in essentially $O(\tau^2)$ time [9]. In practice, if the matrix multiplication is accelerated via the Fast Fourier Transform (FFT), the computation of the gradient (12) can be accomplished in $O(\tau^2 \log \tau)$ time [8], which is already acceptable for most applications.
2.3 Geodesic flows on the Stiefel manifold
Orthonormality of the transformation matrix is preferred in feature extraction because it forces the matrix to encode the intrinsic subspace in the most economical way. The orthonormality constraint also prevents the learning algorithm from falling into some trivial local optima. In addition, an orthonormal matrix as the learning result makes it convenient to compare the new method with many existing projective methods used in face recognition.
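The set of m × r real orthonormal matrices forms a Stiefel manifold St(m, r). Given the gradient ∇ at W, it has been shown [10] that the natural gradient on such a manifold is given by

$$\mathrm{grad}^{\mathrm{St}(m,r)}_{W} J = \nabla - W \nabla^T W, \qquad (13)$$

and an approximated geodesic learning flow with the starting point W by

$$W_{\mathrm{new}} = \mathrm{expm}\left( t \left( \nabla W^T - W \nabla^T \right) \right) W, \qquad (14)$$

where expm denotes the matrix exponential and t is the step size.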
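The update (14) keeps W on the Stiefel manifold because ∇W^T − W∇^T is skew-symmetric, so its matrix exponential is an orthogonal matrix. The sketch below illustrates this with NumPy only. It computes the exponential of a skew-symmetric matrix A through the eigendecomposition of the Hermitian matrix iA, which is a choice made here for self-containment and not something prescribed by the text; the function names and the step size are hypothetical.

```python
import numpy as np

def expm_skew(A):
    """Matrix exponential of a real skew-symmetric matrix A.

    iA is Hermitian, so iA = V diag(lam) V^H with real lam,
    hence A = -i V diag(lam) V^H and exp(A) = V diag(exp(-i lam)) V^H.
    """
    lam, V = np.linalg.eigh(1j * A)
    return ((V * np.exp(-1j * lam)) @ V.conj().T).real

def geodesic_step(W, grad, t):
    """One update W_new = expm(t (grad W^T - W grad^T)) W."""
    A = t * (grad @ W.T - W @ grad.T)      # m x m and skew-symmetric
    return expm_skew(A) @ W

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    W, _ = np.linalg.qr(rng.normal(size=(7, 3)))   # orthonormal starting point
    grad = rng.normal(size=(7, 3))                 # stand-in for the gradient at W
    W_new = geodesic_step(W, grad, t=0.1)
    print(np.allclose(W_new.T @ W_new, np.eye(3)))
```

For small t the step moves W along the natural gradient ∇ − W∇^T W of (13). Note that the exponential acts on an m × m matrix, so this direct form is practical only for moderate m.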
2.4 Regularization
An orthonormal matrix has (m−r)r + r(r−1)/2 free parameters [11]. If this number is comparable to or larger than the number of samples n, the discriminant analysis problem is likely to become ill-posed. Unfortunately this is the case in face recognition, especially when the facial images are sampled at high resolution. The learning objective must therefore be regularized.
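Plugging in the problem sizes listed in Table 1 makes the point concrete. The snippet below is only a worked calculation of the parameter count quoted above; the function name is hypothetical.

```python
def stiefel_free_parameters(m, r):
    """Number of free parameters of an m x r orthonormal matrix."""
    return (m - r) * r + r * (r - 1) // 2

# (m, r, n) as listed in Table 1
settings = {"FERET": (1024, 10, 1208), "ORL": (644, 10, 200)}

for name, (m, r, n) in settings.items():
    k = stiefel_free_parameters(m, r)
    print(f"{name}: {k} free parameters for {n} training samples")
# FERET: 10185 free parameters for 1208 training samples
# ORL: 6385 free parameters for 200 training samples
```

In both cases the number of free parameters exceeds the number of training samples several times over, even for r = 10.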
However, the simple L2-norm penalty used in e.g. Support Vector Machines is not suitable here, because summing the squared entries of an m × r orthonormal matrix always yields the constant r. One thus has to use some other regularization technique.
Notice that each column of W acts as a linear filter and can be displayed as a filter image. A crucial observation we have made is that many overfitting projection matrices have highly rough filter images. That is, local contrastive pixel groups dominate the filters, but they are too small to represent any relevant patterns for face recognition. This motivates us to adopt a penalization term $\mathrm{Tr}(W^T \Omega W)$ [12] to emphasize the smoothness prior of images, where the constant matrix Ω is constructed by

$$\Omega_{st} = N(d(s, t); \rho). \qquad (15)$$

Here d(s, t) is the 2-D Euclidean distance between the pixel locations s and t, and N is the zero-mean normal distribution. The variance parameter ρ controls the neighborhood size and its value depends on the resolution of the facial images used. We find that ρ ∈ (0.3, 0.8) works well in our experiments with 32×32- and 23×28-sized images. It is not difficult to see that $\mathrm{Tr}(W^T \Omega W)$ is an approximated version of the Laplacian used in [12].
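The construction in (15) can be sketched as follows. This is an illustration under assumptions, not the authors' code: N(d; ρ) is read as the one-dimensional zero-mean normal density with variance ρ evaluated at the distance d, and pixels are assumed to be vectorized in row-major order. The function names are hypothetical.

```python
import numpy as np

def smoothness_matrix(height, width, rho):
    """Omega[s, t] = N(d(s, t); rho) for all pairs of pixel locations."""
    idx = np.arange(height * width)
    rows, cols = np.divmod(idx, width)                 # assumed row-major pixel layout
    coords = np.stack([rows, cols], axis=1).astype(float)
    diff = coords[:, None, :] - coords[None, :, :]
    d2 = (diff ** 2).sum(axis=2)                       # squared 2-D Euclidean distances
    return np.exp(-d2 / (2.0 * rho)) / np.sqrt(2.0 * np.pi * rho)

def smoothness_penalty(W, omega):
    """The penalization term Tr(W^T Omega W)."""
    return float(np.trace(W.T @ omega @ W))

if __name__ == "__main__":
    omega = smoothness_matrix(23, 28, rho=0.5)         # sized for 23 x 28 pixel images
    rng = np.random.default_rng(3)
    W, _ = np.linalg.qr(rng.normal(size=(23 * 28, 10)))
    print(omega.shape, smoothness_penalty(W, omega))
```

With ρ in the quoted range, the entries of Ω decay quickly with distance, so only close neighbors of each pixel contribute noticeably to the penalty.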
By attaching the regularization term, we define Parzen Discriminant Analysis (PDA) as the maximization of

$$J_{\mathrm{PDA}}(W) = \frac{1}{2} \sum_{i=1}^{n} \log \frac{\sum_{j: c_i = c_j} e_{ij}}{\sum_{j=1}^{n} e_{ij}} - \frac{1}{2} \lambda \, \mathrm{Tr}\left(W^T \Omega W\right), \qquad (16)$$

where λ is a positive parameter that controls the balance between discrimination and smoothness. The optimization of PDA is based on the gradient
$$\tilde{\nabla} = X (D - G) X^T W - \lambda \Omega W. \qquad (17)$$
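As a consistency check, (17) follows from (16) and (12) in one step. The short derivation below uses the factor 1/2 in front of both terms of (16) and assumes only that Ω is symmetric, which holds because d(s, t) = d(t, s) in (15).

```latex
\begin{aligned}
\frac{\partial J_{\mathrm{PDA}}}{\partial W}
  &= \frac{1}{2}\,\nabla
     - \frac{\lambda}{2}\,\frac{\partial\,\mathrm{Tr}\left(W^T \Omega W\right)}{\partial W} \\
  &= \frac{1}{2}\cdot 2\,X (D - G) X^T W
     - \frac{\lambda}{2}\left(\Omega + \Omega^T\right) W \\
  &= X (D - G) X^T W - \lambda\,\Omega W
   = \tilde{\nabla}.
\end{aligned}
```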
In the following experiments, we use the approximated geodesic update
$$W_{\mathrm{new}} = \mathrm{expm}\left( t \left( \tilde{\nabla} W^T - W \tilde{\nabla}^T \right) \right) W. \qquad (18)$$
3 Connections to Previous Work
Fisherface is a combined method which applies Fisher’s Linear Discriminant Analysis (LDA) to the results of Principal Component Analysis (PCA). Fisherface and its variants are attractive because they have closed-form solutions which can be obtained by (generalized) singular value decomposition. However, these methods model each subject class by a single Gaussian, which heavily restricts their generalization in the presence of different facial expressions, face poses and illumination conditions. In fact, these structural variabilities cause a subject class to stretch along a curved, non-Gaussian manifold.
Recently some unsupervised methods such as Laplacianfaces [13] have been proposed to unfold the structures of the face manifolds. Although it was reported that they have better recognition accuracy in some cases, the discriminative performance of these methods is naturally limited because they omit the supervised information.
There exist two gradient-based approaches that are closely related to our PDA method. Neighborhood Component Analysis (NCA) [3] learns a transformation matrix (not necessarily orthonormal) to maximize the leave-one-out (loo) performance of nearest neighbor classification. NCA measures the performance based on “soft” neighbor assignments in the transformed space, which is similar to the PDA objective except that the logarithm function is dropped. However, without the logarithm function, NCA lacks the connection to information theory. By contrast, PDA conforms to the general assumption that the samples are independently and identically distributed (i.i.d.). Here the “independence” refers to the predictive version

$$p\left(\{c_i\}_{i=1}^{n} \mid \{y_i\}_{i=1}^{n}\right) = \prod_{i=1}^{n} p(c_i \mid y_i), \qquad (19)$$

of which the maximization is equivalent to that of the PDA unregularized objective (1). In addition, the loss of orthogonality may cause NCA to fall into some trivial local optima, for example, all columns of the transformation matrix converging to the same vector.
Fig. 1. Sample images from the FERET (top) and ORL (bottom) databases.
4 Experiments
4.1 Data
We have compared PDA and five other methods on two databases of facial images. The first data set contains facial images collected under the FERET program [6]. 2409 frontal facial images (poses “fa” and “fb”) of 867 subjects were stored in the database after face segmentation. In this work we obtained the coordinates of the eyes from the ground truth data of the collection, with which we calibrated the head rotation so that all faces are upright. Afterwards, all face boxes were normalized to the size of 32×32, with fixed locations for the left eye (26,9) and the right eye (7,9). The second data set comes from the ORL database [7] which includes 400 facial images of 40 subjects. There are 10 images taken in various poses and expressions from each subject. We resize the ORL images to the size of 23×28 without further normalization. Example images from FERET and ORL are displayed in Figure 1. We divide the images of each subject into two parts of equal size, the first half for training and the rest for testing. The whole training set is the union of the training part of all subjects, and so is the whole testing set.
4.2 Training time
A major advantage of PDA (18) over its close cousins NCA [3] and IDA [4] is that PDA requires much less training time. We demonstrate this by running the compared algorithms on a Linux machine with 12GB RAM and two 64-bit 2.2GHz AMD Opteron processors.
We set the number of iterations for PDA and NCA to 10, and to 10×n for IDA since IDA employs online learning based on stochastic gradients. In this way all algorithms go through the same number of training samples. We repeated such training ten times and recorded the total time used in Table 1. PDA is clearly far more efficient than NCA and IDA: on ORL it requires about 1/22 of the training time of NCA and 1/25 of that of IDA. The advantage is more pronounced for the larger FERET database, where PDA is almost 84 times and 100 times faster than NCA and IDA, respectively.
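The speed-up factors can be recomputed directly from the totals reported in Table 1. The snippet below is plain arithmetic on those published numbers; the dictionary layout and the function name are hypothetical.

```python
# Total training times in seconds, copied from Table 1.
TRAINING_SECONDS = {
    "FERET": {"PDA": 3733, "IDA": 362412, "NCA": 313811},
    "ORL": {"PDA": 1502, "IDA": 40136, "NCA": 32914},
}

def speedup_over(baseline, times):
    """Return how many times faster `baseline` is than each other method."""
    return {
        db: {name: secs / row[baseline]
             for name, secs in row.items() if name != baseline}
        for db, row in times.items()
    }

if __name__ == "__main__":
    for db, ratios in speedup_over("PDA", TRAINING_SECONDS).items():
        for name, ratio in ratios.items():
            print(f"{db}: PDA is {ratio:.1f}x faster than {name}")
```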
Table 1. Training time of PDA, IDA and NCA on the facial image databases (in seconds).

database                                PDA       IDA       NCA
FERET (n = 1208, m = 1024, r = 10)    3,733   362,412   313,811
ORL (n = 200, m = 644, r = 10)        1,502    40,136    32,914

4.3 Visualizing the filter images
FR problem, it is expected to find some semantic connections between the filter images and our common prior knowledge about facial images.
Figure 2 shows the first ten filter images of the six compared methods, where the top two are unsupervised and the bottom four supervised. We only plot the ORL results due to space limits. The filter images of IDA contain almost random pixels and there seems to be no pattern related to faces. This is probably because IDA starts from the LDA projection matrix, and the latter suffers from data scarcity in face recognition. Fisherface and Laplacianface are better than IDA since one can faintly perceive some contrastive parts around or within the head-like boundary. These parts, however, are too small and scattered all over every filter image, which might cause overfitting of the projection matrix, e.g. sensitivity to small shifts and variations. The contrastive parts of the NCA basis mainly lie around the head, but these filters differ only in some tiny regions. This is probably caused by the removal of the orthogonality constraint in NCA. By contrast, Parzenface yields filter images that contain clearer facial semantics and are hence easier to interpret. For example, the fourth filter image is likely related to the beard feature and the fifth may control the head shape. The filter images of Eigenface also comprise some facial parts like eyes and chins, which are however more blurred and may lead to underfitting for face recognition.
4.4 Face recognition accuracies
Classification of the testing faces is performed in the projected space by using the nearest neighbor classifier. The face recognition accuracies with r ranging from 10 to 70 are shown in Figure 3. Since the maximum output dimensionality of Fisherface is the number of classes minus one, i.e. 39 for the ORL database, we set a tick at 39 on the x-axis of the right plot for better comparison.
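The evaluation step can be sketched as follows. This is a minimal illustration, assuming samples are stored as the columns of the data matrices and that plain Euclidean distance is used in the projected space; the function names are hypothetical.

```python
import numpy as np

def nearest_neighbor_predict(W, X_train, labels_train, X_test):
    """Label each test sample by its nearest training sample after projection by W."""
    Y_train = W.T @ X_train                            # r x n_train
    Y_test = W.T @ X_test                              # r x n_test
    d2 = (np.sum(Y_test ** 2, axis=0)[:, None]
          + np.sum(Y_train ** 2, axis=0)[None, :]
          - 2.0 * (Y_test.T @ Y_train))                # n_test x n_train squared distances
    return labels_train[np.argmin(d2, axis=1)]

def accuracy(predicted, truth):
    """Fraction of test samples that received the correct label."""
    return float(np.mean(predicted == truth))

if __name__ == "__main__":
    W = np.eye(3)[:, :2]                               # toy projection onto the first two axes
    X_train = np.array([[0.0, 0.0, 5.0, 5.0],
                        [0.0, 1.0, 5.0, 6.0],
                        [9.0, -9.0, 9.0, -9.0]])
    labels_train = np.array([0, 0, 1, 1])
    X_test = np.array([[0.2, 5.1],
                       [0.4, 5.3],
                       [100.0, -100.0]])
    pred = nearest_neighbor_predict(W, X_train, labels_train, X_test)
    print(pred, accuracy(pred, np.array([0, 1])))
```

In the toy data the third coordinate varies wildly but is discarded by the projection, so the decision depends only on the two retained dimensions.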
We found that NCA heavily suffers from the overfitting problem. Although it can achieve excellent classification accuracies for the training set, the NCA transformation matrix generalizes poorly to the testing data. Laplacianface performs the second worst. This is probably because it requires a large amount of data to build a reliable graph of locality, which is infeasible in our experiments. This drawback is more severe for the ORL database which contains facial images of different poses. The performance order of Eigenface, Fisherface and IDA depends on the database used. Fisherface is the best among these three for FERET while Eigenface is the best one for ORL.
Fig. 2. The first ten filter images of the ORL database using (from top to bottom) Eigenface, Laplacianface, Fisherface, IDA and Parzenface.
were obtained by cross-validation using the training set. The face recognition accuracies were then calculated by applying the trained Parzenface model to the testing set. From Figure 3 we can see that the face recognition accuracies using Parzenfaces are superior to all the other compared methods.
5 Conclusions
We have presented a new discriminant analysis method and applied it to the face recognition problem. The proposed Parzenface method overcomes two major drawbacks of existing gradient-based discriminant analysis methods by using information theory. Firstly, the computation of the gradient can be greatly accelerated by using matrix multiplication instead of going through all the pairwise differences. Secondly, we have proposed to employ the smoothness constraint of images to regularize the face recognition problem. The empirical study on two popular facial image databases shows that Parzenface requires much less training time than the IDA and NCA methods while achieving higher face recognition accuracies than the other compared methods.
Fig. 3. Face recognition accuracies on FERET (left) and ORL (right).
References
1. Belhumeur, P., Hespanha, J., Kriegman, D.: Eigenfaces vs. Fisherfaces: recognition using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine Intelligence 19(7) (1997) 711–720
2. Howland, P., Wang, J., Park, H.: Solving the small sample size problem in face recognition using generalized discriminant analysis. Pattern Recognition 39(2) (2006) 277–287
3. Goldberger, J., Roweis, S., Hinton, G., Salakhutdinov, R.: Neighbourhood components analysis. Advances in Neural Information Processing 17 (2005) 513–520
4. Peltonen, J., Kaski, S.: Discriminative components of data. IEEE Transactions on Neural Networks 16(1) (2005) 68–83
5. Peltonen, J., Goldberger, J., Kaski, S.: Fast discriminative component analysis for comparing examples. In: NIPS. (2006)
6. Phillips, P.J., Moon, H., Rizvi, S.A., Rauss, P.J.: The FERET evaluation methodology for face recognition algorithms. IEEE Trans. Pattern Analysis and Machine Intelligence 22 (2000) 1090–1104
7. Samaria, F., Harter, A.: Parameterisation of a stochastic model for human face identification. In: Proceedings of the Second IEEE Workshop on Applications of Computer Vision. (1994) 138–142
8. Horn, R., Johnson, C.: Topics in Matrix Analysis. Cambridge (1994)
9. Robinson, S.: Toward an optimal algorithm for matrix multiplication. SIAM News 38(9) (2005)
10. Nishimori, Y., Akaho, S.: Learning algorithms utilizing quasi-geodesic flows on the Stiefel manifold. Neurocomputing 67 (2005) 106–135
11. Edelman, A.: The geometry of algorithms with orthogonality constraints. SIAM J. Matrix Anal. Appl. 20(2) (1998) 303–353
12. Hastie, T., Buja, A., Tibshirani, R.: Penalized discriminant analysis. The Annals of Statistics 23(1) (1995) 73–102
13. He, X., Yan, S., Hu, Y., Niyogi, P., Zhang, H.J.: Face recognition using Laplacianfaces. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(3) (2005) 328–340