Although many methods have been developed for holistic face recognition, partial face detection has not yet been successful. Partial faces often appear in real environments such as in surveillance videos and are quite difficult to detect. We propose an approach to detect partial faces together with holistic faces (frontal and profile images) present in natural scenarios.
The drawback of CNNs being computationally expensive is handled with segmentation-based Selective Search, which greatly reduces the search space. We evaluated our method on the FDDB benchmark, and the results were close to the best results of recent methods in the face detection domain. Face detection has advanced significantly in the past few decades, but partial face detection has not received much attention.
Most face detectors work well with images that contain only frontal or profile views of faces and struggle to detect faces that are only partially visible. Although hand-crafted features such as HOG and SIFT have conventionally been widely used for detection, they appear to be a bottleneck to further performance improvement as the field moves toward deep learning architectures. Features learned by neural networks are much more promising, as seen in the PASCAL VOC 2010 challenge.
Face Detection
Partial Faces
Deep learning and CNN
Deep Learning
Deep learning is a branch of machine learning that aims to learn features and representations of data. Data can be represented in many ways, for example an image can be represented as a vector of pixel values or a set of edges and intensity gradients. However, manually creating such a representation is a difficult challenge due to the large volume and diverse nature of the data.
Deep learning aims to replace hand-crafted feature representations with efficient feature learning and hierarchical feature extraction algorithms. It aims to create better representations of the underlying structure in data and to build models that can learn such representations from large amounts of unlabeled data. Some deep learning approaches are modeled after the human nervous system and are loosely inspired by how the brain processes information through the stimuli and responses of neurons.
There are many deep learning architectures such as deep neural networks, convolutional neural networks (CNNs), deep belief networks and recurrent neural networks.
Convolutional Neural Networks
Local receptive fields, shared weights, and spatial sub-sampling allow CNNs to achieve some degree of invariance to shift and distortion. Images are size normalized and centered before being sent as input to the input layer of a CNN. Each unit in a layer receives input from a collection of units that form a small neighborhood in the previous layer.
With such receptive fields operating locally, a CNN can extract elementary visual features such as edges and their orientations, endpoints, or corners. These features are then combined by subsequent layers to obtain features at a more abstract level. A group of neurons in a layer shares an identical weight vector, with receptive fields located at different positions in the image.
Implementation-wise, a single neuron with shared weights can scan the input image and store its states at the corresponding locations in a feature map. To do this in parallel, a feature map can be implemented as a layer of neurons sharing a weight vector, whose units perform the same operation on different regions of the image. A convolutional layer is a collection of such feature maps, each extracting a different learned feature at every location.
Once a feature is extracted, the absolute location of the feature is not important as long as the relative position with other features is preserved. Therefore, a sub-sampling layer reduces the resolution of the output feature map for the next convolution layer. CNNs usually have alternating convolution and sub-sampling layers, where after each layer the number of feature maps is increased and resolution is reduced.
Each unit in the second convolutional layer in Figure 1.2 can receive input connections from multiple feature maps in the previous layer. Sharing these kernel weights reduces the number of parameters, making CNNs cheaper in terms of both time and storage.
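As an illustration of shared weights, feature maps, and sub-sampling, the following NumPy sketch convolves one shared kernel over an image and then reduces the resolution of the resulting feature map (the kernel values and the use of max-pooling are arbitrary assumptions for illustration):

```python
import numpy as np

def conv2d(image, kernel):
    """Slide one shared-weight kernel over the image to build a feature map."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def subsample(fmap, size=2):
    """Reduce resolution by pooling non-overlapping size x size windows."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

image = np.random.rand(8, 8)
edge_kernel = np.array([[1., -1.], [1., -1.]])  # toy vertical-edge detector
fmap = conv2d(image, edge_kernel)   # 7x7 feature map from one shared weight vector
pooled = subsample(fmap)            # 3x3 map: absolute location is discarded
```

Because the same kernel is applied everywhere, a feature detected anywhere in the image produces a response somewhere in the map, which is the source of the shift invariance described above.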
Selective Search and Segmentation
Selective Search
Segmentation
Segmentation is particularly useful for partial face detection because it can segment out parts of faces, as seen in Figure 1.3. Regions of similar intensity, texture, and gradients in multiple color spaces are combined in a hierarchical fashion to obtain segments at different scales. Therefore, we get parts of the face such as an eye or a nose, side of a face or edge of the jaw.
Such segments are more informative than random windows at different scales and are therefore more promising for our problem.

Some of the earliest face detection methods operated in very limited settings, for example finding faces against a single-color background or detecting them using color or motion cues. In 2001, Froba et al. [4] used edge orientations to create a model depicting a face, while Jesorsky et al. [5] used the Hausdorff distance as a metric and created a model based on facial feature points.
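The hierarchical grouping at the heart of Selective Search can be sketched as a greedy merge of initial segments, recording a candidate window at every scale. The region format and intensity-only similarity below are simplifying assumptions; the actual method combines color, texture, size, and fill similarities across multiple color spaces:

```python
def merge_proposals(regions):
    """Greedy hierarchical grouping: repeatedly merge the two most similar
    regions and record each merged bounding box as a candidate window.
    Each region is (x0, y0, x1, y1, mean_intensity)."""
    regions = list(regions)
    proposals = []
    while len(regions) > 1:
        # find the pair with the most similar mean intensity (toy similarity)
        i, j = min(
            ((a, b) for a in range(len(regions)) for b in range(a + 1, len(regions))),
            key=lambda p: abs(regions[p[0]][4] - regions[p[1]][4]),
        )
        b = regions.pop(j)
        a = regions.pop(i)
        merged = (min(a[0], b[0]), min(a[1], b[1]),
                  max(a[2], b[2]), max(a[3], b[3]),
                  (a[4] + b[4]) / 2)
        proposals.append(merged[:4])   # candidate window at this scale
        regions.append(merged)
    return proposals

# three initial segments: two similar-intensity face parts and a dark background
segments = [(10, 10, 20, 20, 0.8), (30, 10, 40, 20, 0.78), (0, 0, 50, 50, 0.1)]
print(merge_proposals(segments))  # [(10, 10, 40, 20), (0, 0, 50, 50)]
```

Note how the first proposal groups the two face parts before the background is absorbed, which is how part-level windows (an eye, a jaw edge) arise at intermediate scales.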
In 2004, Viola and Jones [6] presented a seminal paper in which they used a cascade of weak classifiers and simple Haar features for face detection, along with integral images, which enabled real-time detection. Neural networks were introduced in the field of face recognition as early as the late 1990s. Around this time, Rowley, Baluja, and Kanade introduced simple neural networks that looked at small regions of an image and decided whether or not each region contained a face. They also proposed a rotation-invariant method that estimated the rotation of the window in addition to performing face detection.
Norris used principal component analysis and neural networks (PCA and ANN) for face detection and recognition in office environments. In the proposed methodology, after preprocessing steps such as normalization, rotation correction, and enhancement of lighting conditions, small windows from the input image were given to an MLP, which estimated the rotation of each window and decided whether or not it contained a face. A year later, Zoran and Samcovic [11] proposed a method for face detection in surveillance videos using neural networks.

Our method mainly tries to improve face detection by focusing on partial faces along with full frontal and profile views of faces.
Partial faces may appear in images due to reasons such as occlusion with other objects, non-frontal posture, facial accessories, extreme lighting, shadows, etc. (Figure 3.1). The detection of such partial faces is important in situations where images are captured without user intervention, such as in surveillance scenarios or via hand-held devices, as they may not capture the full frontal view of the face.
Detecting faces and partial faces
- Generating Region Proposals
- Extracting Features
- Scoring Regions
- RCNN
- Caffe
- FDDB
- Finetuning CNN
Each of the potential location windows (regions) is then sent to a CNN, which generates features for them. A pre-trained CNN model is refined using a dataset suitable for the purpose, in our case, face dataset. Each candidate window is warped to the size of 227x227 pixels and forward propagated through 5 convolutional layers and two fully connected layers to generate a 4096 dimensional feature vector representing the window.
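The warping step can be sketched as a simple resize to the fixed CNN input size. Nearest-neighbour sampling and the helper name `warp` are illustrative assumptions; the approach of [2] warps anisotropically, with additional context padding around the box:

```python
import numpy as np

def warp(region, size=227):
    """Anisotropically resize a candidate window to size x size using
    nearest-neighbour sampling (a simplified stand-in for the warp in [2])."""
    h, w = region.shape[:2]
    rows = np.arange(size) * h // size   # source row for each output row
    cols = np.arange(size) * w // size   # source column for each output column
    return region[rows][:, cols]

window = np.random.rand(64, 48)          # a region proposal of arbitrary shape
print(warp(window).shape)                # (227, 227)
```

Every proposal, whatever its aspect ratio, is forced to the same 227x227 shape so that a single fixed-architecture CNN can process it.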
Each region is then scored by a binary SVM trained on feature vectors generated by the same CNN from the training data. The 4096-dimensional feature vector obtained from the CNN is used as input to the pre-trained SVM, which produces a score for the corresponding region. In non-maximum suppression (NMS), we greedily select high-scoring detections and discard detections that are significantly covered by a previously selected detection. We set the overlap threshold to 30%.
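The greedy NMS step can be sketched as follows; using intersection-over-union as the overlap measure is an assumption, since the exact coverage criterion may be defined differently:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / float(area(a) + area(b) - inter)

def nms(boxes, scores, threshold=0.3):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it
    by more than `threshold`, and repeat on the remainder."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: the second box overlaps the first and is suppressed
```

With the 30% threshold above, near-duplicate detections of the same face collapse to a single window while distant faces survive.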
For feature extraction, we extract a 4096-dimensional feature vector from each region proposal using the Caffe (Section 3.2) implementation of the CNN described by Krizhevsky et al. These features are then scored by pre-trained class-specific SVMs to classify the regions. FDDB [14] is a dataset of face regions designed to study the problem of unconstrained face detection.
This dataset contains annotations for 5171 faces in a set of 2845 images taken from the Faces in the Wild dataset. FDDB also provides its own evaluation code that can be used to benchmark the performance of different face detection methods. We used a CNN pre-trained on a large auxiliary dataset (ILSVRC 2012) with image-level annotations (i.e., no bounding-box labels), as done by [2].
To adapt the CNN to the new task of detection and the completely new domain of faces, we fine-tuned the pre-trained CNN on the FDDB dataset. Following FDDB's 10-fold cross-validation protocol, we fine-tuned the CNN once per fold, each time omitting the test fold from training.
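The fold-omission protocol can be sketched as follows; the fold contents are dummy placeholders, and the generator stands in for the actual fine-tuning loop:

```python
def cross_validation_splits(folds):
    """For each held-out test fold, yield (train, test): the training set is
    the union of the other nine folds, so the test fold never leaks into
    fine-tuning."""
    for k in range(len(folds)):
        train = [img for i, fold in enumerate(folds) if i != k for img in fold]
        test = list(folds[k])
        yield train, test

# dummy placeholder folds: 10 folds of 2 image names each
folds = [[f"img{i}_{j}" for j in range(2)] for i in range(10)]
splits = list(cross_validation_splits(folds))
```

Each of the ten fine-tuned models is then evaluated only on its own held-out fold, and FDDB aggregates the per-fold results.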
Experiments and Results
Experiment 1: Regions proposed by Segmentation
Experiment 2: Partial Face Detections
We then manually checked the number of partial faces detected by the aforementioned Viola-Jones implementation and compared it with our method. The Viola-Jones method detected 22 partial faces, whereas our method detected 97.
Experiment 3: Face detection evaluation with FDDB
The Discrete Score (DS) metric maps each detection's IOU ratio through a δ function against a threshold, whereas the Continuous Score (CS) metric uses the IOU values themselves for evaluation.
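The two metrics can be sketched directly from their definitions; the 0.5 threshold for the δ function is an assumption based on common practice:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / float(area(a) + area(b) - inter)

def discrete_score(det, gt, threshold=0.5):
    """δ-function scoring: 1 if IOU exceeds the threshold, else 0."""
    return 1 if iou(det, gt) > threshold else 0

def continuous_score(det, gt):
    """Continuous scoring: the IOU value itself."""
    return iou(det, gt)

det, gt = (0, 0, 10, 10), (2, 0, 12, 10)
print(discrete_score(det, gt), round(continuous_score(det, gt), 2))  # 1 0.67
```

A detection that just clears the threshold thus contributes fully to the discrete score but only partially to the continuous score, which is why the two curves can diverge.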
Results
FDDB Benchmarking
Scores for partial face regions
With about 95% of the region intersecting the ground-truth region and an IOU of about 90%, the region is rated above 5.
Discussion
The proposed method can be used in real-world scenarios, such as surveillance environments, where images are not captured under controlled conditions and may contain many occlusions that make face detection difficult. It achieves high efficiency thanks to the rich hierarchical features of convolutional neural networks, which are very good at feature extraction and representation. The candidate windows suggested by Selective Search contain parts of faces, which is very useful for partial face detection.
Direction for Future Work