Yaniv Taigman, Ming Yang, Marc'Aurelio Ranzato, Lior Wolf.
This paper focuses on building a deep learning model that approaches human-level accuracy in face verification. The research came out of Facebook's AI lab, where recognizing faces in uploaded photos and tagging them on Facebook is a large-scale problem.
The conventional face recognition pipeline consists of four stages:
Face Detection -> Alignment -> Representation -> Classification.
Face Detection focuses on locating the frontal region of human faces in an image. A large number of facial images is used to train the model to detect regions with features similar to those of the training images.
Face Alignment is required when the captured face is viewed from a different angle. It is difficult to detect facial elements such as eye corners and mouth corners when the facial image is not aligned appropriately.
Face Feature Representation captures the distinctive components that differentiate one face from another. The model learns these complex feature representations and uses this knowledge to verify faces in complicated face verification tasks.
Classification consists of assigning unseen facial data to one of many classes by using the knowledge acquired in the feature representation stage. The more training data there is, the more feature representations the model learns, which helps it generalize to a wider range of classes.
The paper revisits both the Alignment step and the Representation step.
Classification accuracy can be improved by employing 3D face modeling in order to apply a piecewise affine transformation during alignment, before the representation is computed. A piecewise affine transformation models geometric distortion in an image by applying a separate affine map to each small region (typically a triangle). This geometric transformation is what properly aligns the images of interest.
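As a rough illustration of one piece of a piecewise affine warp, the sketch below computes the affine map that sends one source triangle onto one target triangle. The function names and the example coordinates are hypothetical and are not taken from the paper; a full warp would repeat this for every triangle of a mesh.

```python
import numpy as np

def affine_from_triangle(src, dst):
    """Return the 2x3 matrix A with A @ [x, y, 1] mapping each src vertex to dst."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    src_h = np.hstack([src, np.ones((3, 1))])   # homogeneous coordinates, 3x3
    # Solve src_h @ A.T = dst for the six affine parameters.
    return np.linalg.solve(src_h, dst).T

def apply_affine(A, pts):
    """Apply a 2x3 affine matrix to an (n, 2) array of points."""
    pts = np.asarray(pts, dtype=float)
    return pts @ A[:, :2].T + A[:, 2]

# Hypothetical triangle correspondence (not from the paper).
src_tri = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
dst_tri = [(2.0, 3.0), (4.0, 3.0), (2.0, 6.0)]
A = affine_from_triangle(src_tri, dst_tri)
warped = apply_affine(A, src_tri)   # reproduces dst_tri
```

Each triangle gets its own matrix, which is what makes the overall transformation "piecewise".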
Following this, we derive a face representation from a nine-layer deep neural network.
Our method reaches an accuracy of 97.35% on the Labeled Faces in the Wild (LFW) dataset, reducing the error of the current state of the art by more than 27% and closely approaching human-level performance.
The network architecture is based on the assumption that once the alignment is completed, the location of each facial region is fixed at the pixel level. Because alignment places each facial region at a consistent position, its pixel values can be compared directly across images.
Hence it is possible to learn from the raw pixel RGB values.
In summary, we make the following contributions:
1. The development of an effective deep neural network architecture and learning method that leverage a very large labeled dataset of faces in order to obtain a face representation that generalizes well to other datasets.
2. An effective facial alignment system based on explicit 3D modeling of faces.
3. Advancing the state of the art significantly on the Labeled Faces in the Wild dataset, reaching near-human performance, and on the YouTube Faces dataset, decreasing the error rate there by more than 50%.
Aligning faces in the unconstrained scenario is still considered a difficult problem that has to account for many factors, such as pose (due to the non-planarity of the face) and non-rigid expressions, which are hard to decouple from the identity-bearing facial morphology. The expressions that human faces hold are volatile, so capturing all possible expressions is cumbersome. Hence a preprocessing step like face alignment is carried out to overcome this complexity.
Face alignment can straighten up the face when it is turned away from the camera.
Approaches to face alignment generally fall into three groups:
1. Employing an analytical 3D model (three-dimensional geometry) of the face.
2. Searching for similar fiducial point configurations in an external dataset to infer from.
3. Unsupervised methods that find a similarity transformation for the pixels.
In general imaging, a fiducial marker or fiducial point is an object placed in the field of view of an imaging system which appears in the image produced, for use as a point of reference or a measure. In face alignment, fiducial points are facial landmarks such as the eye centers, the tip of the nose and the mouth corners.
In this paper, we describe a system that includes analytical 3D modeling of the face based on fiducial points, which is used to warp a detected facial crop to a 3D frontal mode. Our alignment is based on using fiducial point detectors to direct the alignment process. We use a relatively simple fiducial point detector, but apply it in several iterations to refine its output. At each iteration, fiducial points are extracted by a Support Vector Regressor (SVR) trained to predict point configurations from an image descriptor. Our image descriptor is based on LBP histograms.
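LBP Histograms: Local binary patterns (LBP) are a type of visual descriptor used for classification in computer vision. Each pixel is encoded by comparing it with its neighbors, and the resulting codes are collected into a histogram.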
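To make the descriptor concrete, here is a minimal sketch of the basic 8-neighbor LBP code and its histogram. It assumes the simplest 3 x 3 variant applied to a grayscale array; the function name is hypothetical, and the paper does not specify which LBP variant or cell layout it uses.

```python
import numpy as np

def lbp_histogram(gray):
    """Normalized 256-bin histogram of basic 3x3 LBP codes over interior pixels."""
    g = np.asarray(gray, dtype=float)
    h, w = g.shape
    center = g[1:-1, 1:-1]
    # Offsets (dy, dx) of the 8 neighbors, clockwise from the top-left.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros(center.shape, dtype=np.int64)
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = g[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        # Set this bit wherever the neighbor is at least as bright as the center.
        codes |= (neighbor >= center).astype(np.int64) << bit
    hist = np.bincount(codes.ravel(), minlength=256).astype(float)
    return hist / hist.sum()

# A flat patch: every neighbor equals its center, so every code is 255.
flat_hist = lbp_histogram(np.full((5, 5), 7.0))
```

In practice such histograms are usually computed per image cell and concatenated, which is left out of this sketch.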
An image descriptor summarizes the visual feature content present in an image. Given this descriptor, Support Vector Regression, which fits its targets within a margin of tolerance, predicts a fixed set of point locations. These predicted locations are taken as the fiducial points.
We start our alignment process by detecting 6 fiducial points inside the detection crop, centered at the centers of the eyes, the tip of the nose and the mouth locations. They are used to approximately scale, rotate and translate the image into six anchor locations.
We then iterate on the warped image until there is no substantial change, eventually composing the final 2D similarity transformation.
This aggregated transformation generates a 2D aligned crop.
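The 2D step can be sketched as a least-squares fit of a similarity transform (scale, rotation, translation) from detected points to anchor points, with successive fits composed into one matrix. This is an illustrative sketch under that assumption; the function names and sample points are hypothetical, and the paper's exact solver is not described in this summary.

```python
import numpy as np

def fit_similarity(src, dst):
    """Least-squares 2x3 similarity [[a, -b, tx], [b, a, ty]] mapping src to dst."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    n = len(src)
    M = np.zeros((2 * n, 4))
    # Rows for x' = a*x - b*y + tx and y' = b*x + a*y + ty, interleaved.
    M[0::2] = np.column_stack([src[:, 0], -src[:, 1], np.ones(n), np.zeros(n)])
    M[1::2] = np.column_stack([src[:, 1], src[:, 0], np.zeros(n), np.ones(n)])
    a, b, tx, ty = np.linalg.lstsq(M, dst.reshape(-1), rcond=None)[0]
    return np.array([[a, -b, tx], [b, a, ty]])

def apply_similarity(T, pts):
    """Apply a 2x3 transform to an (n, 2) array of points."""
    pts = np.asarray(pts, dtype=float)
    return pts @ T[:, :2].T + T[:, 2]

def compose(T_second, T_first):
    """Return the single 2x3 transform equal to applying T_first, then T_second."""
    to3 = lambda T: np.vstack([T, [0.0, 0.0, 1.0]])
    return (to3(T_second) @ to3(T_first))[:2]

# Hypothetical detections and anchors for six fiducial points.
detected = np.array([[0, 0], [1, 0], [0, 1], [2, 3], [-1, 4], [3, -2]], dtype=float)
anchors = 2.0 * detected + np.array([5.0, -3.0])
T = fit_similarity(detected, anchors)
```

Calling `compose` after each iteration accumulates the per-iteration fits into the final aggregated transformation.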
However, a similarity transformation fails to compensate for out-of-plane rotation, which is particularly important in unconstrained conditions.
In order to align faces undergoing out-of-plane rotations, we use a generic 3D shape model and register a 3D affine camera, which are used to warp the 2D aligned crop to the image plane of the 3D shape. This generates the 3D aligned version of the crop. This is achieved by localizing an additional 67 fiducial points in the 2D aligned crop using a second SVR.
We manually place 67 anchor points on the 3D shape, and in this way achieve full correspondence between the 67 detected fiducial points and their 3D references.
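Registering an affine camera from these correspondences can be sketched as a linear least-squares problem: find the 2 x 4 matrix P that best projects the 3D reference points onto the detected 2D points. The sketch below uses ordinary least squares as a simplifying assumption, and the function names and synthetic data are hypothetical.

```python
import numpy as np

def fit_affine_camera(points_3d, points_2d):
    """Least-squares 2x4 affine camera P such that [X, 1] @ P.T approximates x."""
    X = np.asarray(points_3d, dtype=float)
    x = np.asarray(points_2d, dtype=float)
    X_h = np.hstack([X, np.ones((len(X), 1))])      # (n, 4) homogeneous points
    P_T, *_ = np.linalg.lstsq(X_h, x, rcond=None)    # (4, 2) solution
    return P_T.T

def project(P, points_3d):
    """Project (n, 3) points to 2D with a 2x4 affine camera."""
    X = np.asarray(points_3d, dtype=float)
    return X @ P[:, :3].T + P[:, 3]

# Synthetic stand-in for 67 reference points on a 3D shape (not real face data).
rng = np.random.default_rng(0)
shape_3d = rng.normal(size=(67, 3))
P_example = np.array([[1.0, 0.2, 0.1, 5.0],
                      [0.0, 0.9, -0.3, 2.0]])
fiducials_2d = project(P_example, shape_3d)
P_fit = fit_affine_camera(shape_3d, fiducials_2d)
```

With noisy detections the fit would no longer be exact, and a weighted solver could be substituted for `np.linalg.lstsq`.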
Recently, as more data has become available, learning-based methods have started to outperform engineered features, because they can discover and optimize features for the specific task at hand. Here we learn a generic representation of facial images through a large deep network.
DEEP NEURAL NETWORK ARCHITECTURE AND TRAINING:
We train our DNN on a multi-class face recognition task, namely to classify the identity of a face image.
A 3D-aligned, 3-channel face image of size 152 × 152 pixels is given to a convolutional layer C1 with 32 filters of size 11 × 11 × 3.
The resulting 32 feature maps are then fed to a max-pooling layer M2, which takes the max over 3 × 3 spatial neighborhoods with a stride (step size) of 2, separately for each channel.
This is followed by another convolutional layer C3 that has 16 filters of size 9 × 9 × 16. The purpose of these three layers is to extract low-level features, like simple edges and texture.
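The spatial size after each of these three layers can be worked out with simple arithmetic. The sketch below assumes unpadded convolutions with stride 1 and a pooling layer that keeps a partial last window (ceil mode); both are assumptions made for illustration, and the helper names are hypothetical.

```python
import math

def conv_out(size, kernel, stride=1):
    """Output width of an unpadded ('valid') convolution."""
    return (size - kernel) // stride + 1

def pool_out(size, kernel, stride, ceil_mode=True):
    """Output width of a pooling layer; ceil_mode keeps a partial last window."""
    span = (size - kernel) / stride
    return (math.ceil(span) if ceil_mode else math.floor(span)) + 1

c1 = conv_out(152, 11)     # 152 - 11 + 1 = 142
m2 = pool_out(c1, 3, 2)    # 71 in ceil mode (70 in floor mode)
c3 = conv_out(m2, 9)       # 71 - 9 + 1 = 63
print(c1, m2, c3)          # 142 71 63
```

Halving the resolution once, right after C1, is what keeps the later layers affordable while still retaining fine spatial detail.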
Max-pooling layers make the output of convolutional networks more robust to local translations.
A translation is a geometric transformation that moves every point of a figure or a space by the same distance in a given direction.
When applied to aligned facial images, max-pooling layers make the network more robust to small registration errors, that is, small misalignments left over after the images have been transformed into a common coordinate system.
However, several levels of pooling would cause the network to lose information about the precise position of detailed facial structure and micro-textures.
Hence we apply max-pooling only to the first convolutional layer. We interpret these first layers as a front-end adaptive pre-processing stage. While they are responsible for most of the computation, they hold very few parameters. These layers merely expand the input into a set of simple local features.
The subsequent layers (L4, L5 and L6) are locally connected: like a convolutional layer they apply a filter bank, but every location in the feature map learns a different set of filters. The use of local layers does not affect the computational burden of feature extraction, but does affect the number of parameters subject to training. Only because we have a large labeled dataset can we afford three large locally connected layers.
The use of locally connected layers can also be justified by the fact that each output unit of a locally connected layer is affected by a large patch of the input.
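A quick count shows why locally connected layers are parameter-heavy even though they cost the same to evaluate: a convolution shares one filter bank across all positions, while a locally connected layer stores a separate bank per output position. The layer sizes below are illustrative assumptions, and the helper names are hypothetical.

```python
def conv_params(kernel, in_ch, out_ch):
    """Weights plus biases of a convolution: one shared filter bank."""
    return kernel * kernel * in_ch * out_ch + out_ch

def local_params(out_h, out_w, kernel, in_ch, out_ch):
    """Weights plus biases of a locally connected layer: one bank per position."""
    return out_h * out_w * conv_params(kernel, in_ch, out_ch)

# Illustrative sizes (assumed): 9x9 filters, 16 -> 16 channels, 55x55 output map.
shared = conv_params(9, 16, 16)
unshared = local_params(55, 55, 9, 16, 16)
ratio = unshared // shared   # equals the number of output positions, 55 * 55
```

The multiply-add count per output unit is identical in both cases, which is why only the number of trainable parameters changes.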
Finally, the top two layers are fully connected: each output unit is connected to all inputs. These layers are able to capture correlations between features captured in distant parts of the face images.
Example: the position and shape of the eyes and the position and shape of the mouth.
The output of the first fully connected layer, F7, will be used as our raw face representation feature vector throughout this paper.
The output of the last fully connected layer is fed to a K-way softmax (K is the number of classes), which produces a distribution over the class labels.
The goal of training is to maximize the probability of the correct class. We achieve this by minimizing the cross-entropy loss L for each training sample.
The loss is minimized over the parameters by computing the gradient of L with respect to the parameters and by updating the parameters using stochastic gradient descent. The gradients are computed by standard backpropagation.
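The softmax, the cross-entropy loss and one gradient step can be sketched in a few lines. For simplicity this sketch treats the logits themselves as the parameters being updated, which is an assumption made for illustration; in the real network the gradient is propagated further back through all layers. The function names and the learning rate are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of class scores."""
    z = np.asarray(logits, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(logits, label):
    """Loss L = -log p_label for a single training sample."""
    return -np.log(softmax(logits)[label])

def grad_logits(logits, label):
    """Gradient of L with respect to the logits: softmax(logits) - one_hot(label)."""
    g = softmax(logits)
    g[label] -= 1.0
    return g

def sgd_step(params, grads, lr=0.1):
    """One stochastic gradient descent update."""
    return params - lr * grads

logits = np.array([2.0, 1.0, 0.1])
loss_before = cross_entropy(logits, 0)
loss_after = cross_entropy(sgd_step(logits, grad_logits(logits, 0)), 0)
```

After the step, the loss for the correct class is lower than before, which is the behavior training repeats over many samples.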
One interesting property of the features produced by this network is that they are very sparse. On average, 75% of the feature components in the topmost layers are exactly zero. This is mainly due to the use of the ReLU activation function, max(0, x). This soft-thresholding non-linearity is applied after every convolutional, locally connected and fully connected layer (except the last one), making the whole cascade produce highly non-linear and sparse features. Sparsity is also encouraged by the use of a regularization method called dropout, which sets random feature components to 0 during training.
We have applied dropout only to the first fully connected layer. Due to the large training set, we did not observe significant overfitting during training.
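Both sources of sparsity are easy to sketch. The code below assumes the common "inverted" dropout variant, which rescales the surviving components during training; the paper summary does not say which variant was used, and the function names and dropout rate are hypothetical.

```python
import numpy as np

def relu(x):
    """Soft-thresholding non-linearity max(0, x): negative inputs become exactly 0."""
    return np.maximum(0.0, x)

def dropout(x, rate, rng, training=True):
    """Zero each component with probability `rate` during training (inverted dropout)."""
    if not training:
        return x
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)

rng = np.random.default_rng(0)
activations = relu(rng.normal(size=1000))       # roughly half become exactly zero
dropped = dropout(activations, 0.5, rng)        # more zeros while training
zero_fraction = float(np.mean(dropped == 0.0))
```

At test time `training=False` leaves the features untouched, so the sparsity then comes from the ReLU alone.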
As a final stage, we normalize the features to be between zero and one in order to reduce the sensitivity to illumination changes.
Verifying whether two input instances belong to the same class has been extensively researched in the domain of unconstrained face recognition, with supervised methods showing a clear performance advantage over unsupervised ones. By training on the target domain's training set, one is able to fine-tune a feature vector to perform better within the particular distribution of the dataset.
However, fitting a model to a relatively small dataset reduces its generalization to other datasets.
The paper therefore aims at learning an unsupervised metric that generalizes well to several datasets.
The unsupervised similarity is simply the inner product between the two normalized feature vectors.
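A minimal sketch of the normalization and the inner-product similarity follows. It assumes, following the DeepFace paper, that each component is divided by its largest value over the training set and the result is then L2-normalized; the function names, the epsilon guard and the toy vectors are hypothetical.

```python
import numpy as np

def normalize_features(feats, train_max, eps=1e-12):
    """Scale each component into [0, 1] by its training-set maximum, then L2-normalize."""
    scaled = np.asarray(feats, dtype=float) / (np.asarray(train_max, dtype=float) + eps)
    norm = np.linalg.norm(scaled, axis=-1, keepdims=True)
    return scaled / (norm + eps)

def similarity(f1, f2):
    """Unsupervised similarity: inner product of two normalized feature vectors."""
    return float(np.dot(f1, f2))

# Toy 3-dimensional features (real F7 features are much longer and sparse).
train_max = np.array([2.0, 4.0, 1.0])
face_a = normalize_features([1.0, 2.0, 0.0], train_max)
face_b = normalize_features([0.0, 0.0, 1.0], train_max)
same_score = similarity(face_a, face_a)    # close to 1
diff_score = similarity(face_a, face_b)    # 0, no shared active components
```

Because the vectors have unit length, the inner product is the cosine of the angle between them, and a threshold on it decides "same" or "not same".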
The paper also experiments with supervised metrics: the χ² similarity and the Siamese network.
WEIGHTED χ² DISTANCE:
The normalized DeepFace feature vector shares several properties with histogram-based features such as LBP:
1. It contains non-negative values.
2. It is very sparse.
3. Its values lie in the range [0, 1].
[Figure: LBP image, from PyImageSearch]
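These properties are what make a χ² comparison natural. The sketch below follows the weighted form used in the DeepFace paper, χ²(f1, f2) = Σ_i w_i (f1[i] − f2[i])² / (f1[i] + f2[i]), where the weights w are learned by a linear SVM. Here the weights are simply passed in, the zero-denominator guard is an added assumption for sparse inputs, and the function names are hypothetical.

```python
import numpy as np

def chi2_terms(f1, f2, eps=1e-12):
    """Per-component terms (f1 - f2)^2 / (f1 + f2); 0 where both components are 0."""
    f1 = np.asarray(f1, dtype=float)
    f2 = np.asarray(f2, dtype=float)
    num = (f1 - f2) ** 2
    den = f1 + f2
    return np.where(den > eps, num / np.maximum(den, eps), 0.0)

def weighted_chi2(f1, f2, weights):
    """Weighted chi-squared distance: sum_i w_i * term_i."""
    return float(np.dot(weights, chi2_terms(f1, f2)))

f1 = np.array([0.5, 0.0, 0.2])
f2 = np.array([0.5, 0.0, 0.6])
distance = weighted_chi2(f1, f2, np.ones(3))   # only the third component differs
```

The vector returned by `chi2_terms` is the kind of per-component input on which a classifier such as an SVM can be trained to learn the weights.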
An end-to-end metric learning approach known as a Siamese network is also tested for face recognition.
Once learned, the face recognition network is replicated twice, once for each input image, and the features are used to directly predict whether the two input images belong to the same person.
This is accomplished by:
1. Taking the absolute difference between the features.
2. Adding a top fully connected layer that maps into a single logistic unit (same or not same).
The network has roughly the same number of parameters as the original one, since much of it is shared between the two replicas, but requires twice the computation.
The parameters of the Siamese network are trained with the standard cross-entropy loss and backpropagation of the error.
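The two steps of the Siamese head can be sketched as follows: the absolute feature difference is passed through one linear layer and a logistic unit, and the binary cross-entropy is the training loss. The weights `alpha`, the bias and the toy features are hypothetical values chosen for illustration; in practice they are learned.

```python
import numpy as np

def siamese_probability(f1, f2, alpha, bias):
    """P(same person) from a logistic unit on the absolute feature difference."""
    d = np.dot(alpha, np.abs(np.asarray(f1) - np.asarray(f2))) + bias
    return float(1.0 / (1.0 + np.exp(-d)))

def binary_cross_entropy(p, same):
    """Loss for one pair; `same` is 1 for the same person, 0 otherwise."""
    return float(-(same * np.log(p) + (1 - same) * np.log(1.0 - p)))

f1 = np.array([0.1, 0.0, 0.7])
f2 = np.array([0.3, 0.2, 0.0])
alpha = np.array([-1.0, -2.0, -0.5])   # negative weights: larger differences lower the score
p_pair = siamese_probability(f1, f2, alpha, bias=0.0)
loss_if_different = binary_cross_entropy(p_pair, same=0)
```

Because the head only sees |f1 − f2|, swapping the two images gives the same prediction.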
RESULT ON LFW DATASET:
The mean recognition accuracy on LFW marches steadily towards the human performance of over 97.5%. Given some very hard cases due to aging effects and large lighting and face pose variations in LFW, any improvement over the state of the art is very remarkable, and the system has to be composed of highly optimized modules.
DeepFace couples large feedforward-based models with fine 3D alignment. Regarding the importance of each component:
1. Without frontalization: when using only the 2D alignment, the obtained accuracy is only 94.3%. Without alignment at all, i.e. using the center crop of the face detection, the accuracy is 87.9%, as parts of the facial region may fall out of the crop.
2. Without learning: when using frontalization only, and a naive LBP/SVM combination, the accuracy is 91.4%, which is already notable given the simplicity of such a classifier.
To evaluate the discriminative capability of the face representation in isolation, we follow the unsupervised setting and directly compare the inner product of a pair of normalized features.
Quite remarkably, this achieves a mean accuracy of 95.92%, which is almost on par with the best performance to date, achieved by supervised transfer learning. Transfer learning is a machine learning technique where a model trained on one task is re-purposed for a second, related task.
Next, we learn a kernel SVM (C = 1) on top of the χ² distance vector following the restricted protocol, i.e. where only the 5,400 pair labels per split are available for the SVM training. This achieves an accuracy of 97.00%, significantly reducing the error of the state of the art.
Using a single-core Intel 2.2GHz CPU, the operator takes 0.18 seconds to extract features from the raw input pixels.
Overall, DeepFace runs at 0.33 seconds per image, accounting for image decoding, face detection and alignment, the feedforward network, and the final classification output.
An ideal face classifier would recognize faces with an accuracy that is only matched by humans. The underlying face descriptor would need to be invariant to pose, illumination, expression and image quality. It should also be general, in the sense that it could be applied to various populations with little modification, if any at all. In addition, short descriptors are preferable and, if possible, sparse features. This paper shows that coupling a 3D model-based alignment with large-capacity feedforward models can effectively learn from many examples to overcome the drawbacks and limitations of previous methods. The ability to present a marked improvement in face recognition attests to the potential of such coupling to become significant in other vision domains as well.