Face Recognition: Real-Time Face Recognition System using Deep Learning Algorithm and Raspberry Pi 3B

Kunal Bhashkar
Apr 21, 2020


Face Recognition in Real-Time

Tensorflow, Deep CNN, Working of Face Recognition, Multi-task learning networks, Joint alignment-representation networks, Variational Autoencoder, Generative Adversarial Network, Support Vector Machine classifier, Face Alignment, FaceNet, Multi-task based convolution neural network (MTCNN), Triplet Loss, Face Embeddings, VGGFace, DeepFace, OpenFace, Detectron2, ImageNet, VGGNet, DenseNet, ResNet, AlexNet, Inception Model, MobileNet, GoogleNet, Raspberry Pi 3B, Training face using FaceNet

A. Tensorflow

Tensors

The word tensor is a generalization of vectors and matrices to higher dimensions; in other words, a tensor is a multidimensional array. It can be defined inline as a list of lists passed to an array() constructor. In the general case, a tensor is an array of numbers arranged on a regular grid with a variable number of axes.

The following is a representation of tensors,

source

Tensors are basically n-dimensional arrays that can be represented in matrix form. We can perform matrix operations with tensors, as shown below,

source
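To make this concrete, here is a minimal sketch (assuming TensorFlow 2.x is installed) that builds tensors of different ranks from nested lists and performs matrix operations on them; the shapes and values are illustrative only.

```python
import tensorflow as tf

# Rank-0 (scalar), rank-1 (vector) and rank-2 (matrix) tensors defined from nested lists
scalar = tf.constant(3.0)
vector = tf.constant([1.0, 2.0, 3.0])
matrix = tf.constant([[1.0, 2.0], [3.0, 4.0]])

# A rank-3 tensor: a list of lists of lists
cube = tf.constant([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
print(cube.shape)                     # (2, 2, 2) -- three axes

# Matrix operations on rank-2 tensors
product = tf.matmul(matrix, matrix)   # matrix multiplication
elementwise = matrix * matrix         # element-wise multiplication
print(product.numpy())                # [[ 7. 10.]
                                      #  [15. 22.]]
```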

Tensorflow

TensorFlow started in 2011 as a research project at Google Brain and has become very popular across Alphabet over the years. The framework is popular in the machine learning community for its highly flexible architecture, which can leverage different kinds of processing units such as CPUs, GPUs, or TPUs to execute computations without big modifications to the running code.

Dataflow

Dataflow is a common programming model in parallel computing. In a TensorFlow dataflow graph, the nodes represent units of computation, and the edges represent the data consumed or produced by a computation.

The picture below shows how dataflow works in TensorFlow,

source

In the above example, the nodes (operations) are the mathematical operations in the drawing.

Tensorflow 2.0

TensorFlow 2.0 includes lots of new features, including subclassing as a new way of creating models, as well as many quality-of-life updates.

Eager execution

In TensorFlow 2.0, eager execution is enabled by default. It provides a NumPy-like environment for numerical computation with support for GPU acceleration and automatic differentiation, and a flexible platform for machine learning research and experimentation. Eager execution has several benefits: (i) it is compatible with native Python debugging tools, (ii) error logging is immediate, (iii) it simplifies code, (iv) back-propagation is built in, and (v) we do not need to start a graph session to perform tensor operations.
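As a minimal illustration of eager execution (assuming TensorFlow 2.x), the sketch below evaluates a tensor operation immediately, without building a graph session, and uses tf.GradientTape for the built-in automatic differentiation mentioned above; the values are arbitrary.

```python
import tensorflow as tf

# Operations run immediately and return concrete values -- no graph session needed
x = tf.constant([[2.0, 3.0]])
w = tf.Variable([[1.0], [4.0]])
y = tf.matmul(x, w)
print(y.numpy())               # [[14.]]

# Automatic differentiation with GradientTape (back-propagation support)
with tf.GradientTape() as tape:
    loss = tf.reduce_sum(tf.matmul(x, w))
grad = tape.gradient(loss, w)  # d(loss)/dw equals x transposed
print(grad.numpy())            # [[2.]
                               #  [3.]]
```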

Ecosystem of Tensorflow

TensorFlow is cross-platform: it runs on Linux, macOS, and Windows, and on CPUs, GPUs, or TPUs, including mobile and embedded platforms, with optional modular CUDA and SYCL extensions for general-purpose computing on graphics processing units. TensorFlow computations are expressed as stateful dataflow graphs.

The ecosystem of TensorFlow is shown below,

source

TensorBoard

TensorBoard, a suite of visualization tools from the creators of TensorFlow, lets you visualize graphs, plot quantitative metrics about a graph, and inspect additional data, such as images, that passes through it.

TensorBoard is basically used to visualize the architecture and the data, as shown below,

source
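A common way to feed TensorBoard is the Keras TensorBoard callback; the toy model, random data and log directory below are placeholders for illustration, not the face-recognition network used later in this post.

```python
import tensorflow as tf

# Toy model standing in for the real network
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Write the graph, scalar metrics and weight histograms to ./logs
tb = tf.keras.callbacks.TensorBoard(log_dir="./logs", histogram_freq=1)

x = tf.random.normal((64, 8))
y = tf.random.uniform((64,), maxval=2, dtype=tf.int32)
model.fit(x, y, epochs=3, callbacks=[tb])

# Then inspect the run with:  tensorboard --logdir ./logs
```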

B. Deep Convolutional neural networks(DCNN)

Convolutional Neural Networks are very similar to ordinary neural networks: they are made up of neurons with learnable weights and biases, and each neuron receives some inputs, performs a dot product, and optionally follows it with a non-linearity. A deep convolutional neural network expresses a single mapping from the raw image pixels on one end to class scores at the other, and the last, fully connected layer has a loss function (e.g. softmax).

source

In the figure below, each neuron in the CNN is connected only to a local spatial region of the input volume, but to its full depth. There are multiple neurons along the depth, all looking at the same region of the input. On the right side, the neurons from an ordinary neural network remain unchanged: they still compute a dot product of their weights with the input followed by a non-linearity, but their connectivity is now restricted to be spatially local.

source

One stage of a CNN is in general composed of three volumes, consisting respectively of input maps, feature maps, and pooled feature maps (or pooled maps, for short). Pooled maps are not used in every stage and are omitted in some applications.

The figure below shows a CNN trained to extract features that are then used by a fully connected network (FCN) to classify handwritten numerals. The input image shown is from the National Institute of Standards and Technology database.

source

The visualization below iterates over the output activations (green), and shows that each element is computed by elementwise multiplying the highlighted input (blue) with the filter (red), summing it up, and then offsetting the result by the bias.

source
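The sketch below reproduces that computation with plain NumPy: each output activation is the element-wise product of an input patch with the filter, summed and offset by the bias. The input and kernel values are arbitrary toy data.

```python
import numpy as np

def conv2d_single_channel(image, kernel, bias=0.0):
    """'Valid' 2-D convolution of one channel: slide the kernel over the input,
    multiply element-wise, sum, then add the bias."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel) + bias
    return out

image = np.arange(16, dtype=float).reshape(4, 4)        # toy 4x4 input map
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])            # toy 2x2 filter
print(conv2d_single_channel(image, kernel, bias=0.5))   # 3x3 output activations
```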

C. How Does Face Recognition (FR) Work?

source

Face recognition (FR) is a technique for verifying or identifying a person's identity by analyzing and relating patterns based on the person's facial features, whereas face detection is a technology used to detect faces in images or videos and is just a special case of object detection. We can say that face detection is the first step of the face recognition process: a deep FR system starts with a face detector and alignment. The pipeline of face recognition is shown below,

source

In the FR pipeline, a face detector is first used to localize faces. Second, the faces are aligned to normalized canonical coordinates. Third, the FR module is applied. Inside the FR module there is a step called face anti-spoofing, which recognizes whether the face is live or spoofed. Another module, face processing, is used to handle recognition difficulty before training and testing. During training, discriminative deep features are extracted, using different architectures and loss functions. Finally, face matching methods perform feature classification once the deep features of the test data have been extracted.

The figure below shows how a deep FR system with face detection and alignment works,

source

Deep Features for Face Matching

After the training process, which is done on a massive dataset (datasets for facial recognition: link) with an appropriate loss function (e.g. triplet loss or cross-entropy-based softmax loss), we can pass test images through the network to obtain deep feature representations. Once the deep features are extracted, most methods directly calculate the similarity between two features using cosine distance or L2 distance; after that, nearest-neighbor (NN) search and threshold comparison are used for the identification and verification tasks. To make face matching more efficient and accurate, we can apply post-processing such as metric learning, the sparse-representation-based classifier (SRC), and so forth.
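A minimal sketch of this matching step (assuming the deep features are already available as NumPy vectors) is shown below; the threshold value is purely illustrative and must be tuned per dataset.

```python
import numpy as np

def l2_distance(a, b):
    return np.linalg.norm(a - b)

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def verify(emb_a, emb_b, threshold=1.1, metric=l2_distance):
    """Verification: same identity if the distance falls below a tuned threshold."""
    return metric(emb_a, emb_b) < threshold

def identify(probe_emb, gallery):
    """Identification: nearest neighbour over a gallery dict of {name: embedding}."""
    return min(gallery, key=lambda name: l2_distance(probe_emb, gallery[name]))
```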

In deep face recognition research, the concepts DeepFace and DeepID were introduced in 2014. After that, Euclidean-distance-based losses always played an important role in the loss function, such as contrastive loss, triplet loss, and center loss. In 2016 and 2017, L-Softmax and A-Softmax further promoted the development of large-margin feature learning. In 2017, feature and weight normalization also began to show excellent performance, which led to the study of variations of softmax. Red, green, blue and yellow rectangles represent deep methods with softmax, Euclidean-distance-based loss, angular/cosine-margin-based loss and variations of softmax, respectively.

The figure below shows the development of loss functions,

source

The development of different methods of face processing is shown in the figure below; red, green, orange and blue rectangles represent the CNN model, SAE model, 3D model, and GAN model, respectively,

source

Evolution of Network Architecture For FR

The deep face recognition community follows the mainstream architectures below.

(i). DeepFace: In 2014, DeepFace was the first to use a nine-layer CNN with several locally connected layers, with 3D alignment for face processing. Its accuracy on Labeled Faces in the Wild (a database for studying face recognition in unconstrained environments) reaches 97.35%.

(ii). FaceNet: In 2015, FaceNet used a large private dataset to train a GoogLeNet. It adopted a triplet loss function based on triplets of roughly aligned matching/non-matching face patches generated by a novel online triplet mining method. The accuracy of this architecture is 99.63%.

(iii). VGGface: In 2015, VGGface designed a procedure to collect a large-scale dataset from the Internet. It trained the VGGNet on this dataset and then fine-tuned the networks via a triplet loss function similar to FaceNet. VGGface obtains an accuracy of 98.95%.

(iv). SphereFace: In 2017, SphereFace used a 64-layer ResNet architecture and proposed the angular softmax (A-Softmax) loss to learn discriminative face features with angular margin. SphereFace obtains an accuracy of 99.42%.

(v). VGGface2: At the end of 2017, a new large-scale face dataset, namely VGGface2, was introduced, which contains large variations in pose, age, illumination, ethnicity, and profession. A SENet was first trained on the MS-Celeb-1M dataset and then fine-tuned with VGGface2.

The commonly used network architectures of deep FR have always followed those of deep object classification and evolved from AlexNet to SENet rapidly.

source

In the above figure, the top row presents the typical network architectures in object classification, and the bottom row describes the well-known algorithms of deep FR that use the typical architectures and achieve good performance. The same color rectangles mean the same architecture.

Multi-task learning networks

Multi-task learning has been used successfully across many areas of machine learning, from natural language processing and speech recognition to computer vision. In multi-task learning for face recognition, identity classification is the main task, and the side tasks are pose, illumination, and expression estimation, among others. In the architecture shown below, the lower layers are shared among all the tasks, and the higher layers are disentangled into assembled networks that generate the task-specific outputs.

source

The performance of the multi-task learning framework is highly dependent on the relative weights of the tasks. Instead of tuning the weights by hand, we can dynamically adapt the weights of the tasks according to how difficult each task is to train. Based on a deep multi-task learning convolutional neural network, we can use a single input image for facial expression recognition.

The multi-task framework with dynamic task weights, which simultaneously performs face recognition and facial expression recognition, is shown in the figure below,

source

Joint alignment-representation networks

Joint alignment-representation networks jointly train FR with several modules such as face detection, alignment, and so forth. Compared to existing methods, in which each module is generally optimized separately according to different objectives, this end-to-end system optimizes each module according to the recognition objective, leading to more adequate and robust inputs for the recognition model.

source

The above figure shows joint multi-view face alignment: face regions are generated by a multi-scale proposal, then classified and regressed by another network. At least five facial landmarks are predicted to remove the similarity transformation of each face region, and a Multi-view Hourglass Model is trained to predict the response map for each landmark. The second and third rows show the normalized face regions and the corresponding response maps, respectively.

Variational Autoencoder

In a Variational Autoencoder (VAE), both the encoder and decoder networks are based on deep CNNs like AlexNet and VGGNet. By default, a pixel-by-pixel measurement such as L2 loss or logistic regression loss is used to measure the difference between the reconstructed and the original images. Such measurements are easily implemented and efficient for deep neural network training; however, the generated images tend to be very blurry compared to natural images. This is because the pixel-by-pixel loss does not capture the perceptual difference and spatial correlation between the two images. For example, the same image offset by a few pixels has little visual perceptual difference for humans, but it can have a very high pixel-by-pixel loss. This is a well-known problem in the image quality measurement community. We can try to improve the VAE by replacing the pixel-by-pixel loss with a feature perceptual loss, defined as the difference between the hidden representations of two images extracted from a deep CNN such as AlexNet or VGGNet pre-trained on ImageNet.

The hierarchical representation learned by auto-encoder networks is shown below,

source

The main idea is to improve the quality of the images generated by a VAE by ensuring the consistency of the hidden representations of the input and output images, which in turn imposes spatial correlation consistency on the two images. The architecture of the autoencoder network is shown below: on the left is a deep CNN-based variational autoencoder, and on the right is a pretrained deep CNN used to compute the feature perceptual loss.

source

We can also use deep convolutional variational autoencoder for image generation by replacing pixel-by-pixel reconstruction loss with feature perceptual loss based on a pre-trained deep CNN.

source

In the above picture, the encoder network encodes an image to a latent vector, and the decoder network then decodes the latent vector back to an image that is as similar as possible to the original image.
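As a rough sketch of the feature perceptual loss described above (not the exact loss used in the cited work), a pre-trained VGG19 can be frozen and used to compare hidden representations of the original and reconstructed images; the chosen layers are an assumption for illustration.

```python
import tensorflow as tf

# Frozen VGG19 (ImageNet weights) used only as a fixed feature extractor
vgg = tf.keras.applications.VGG19(include_top=False, weights="imagenet")
layer_names = ["block1_conv1", "block2_conv1", "block3_conv1"]  # assumed choice
extractor = tf.keras.Model(
    inputs=vgg.input,
    outputs=[vgg.get_layer(n).output for n in layer_names])
extractor.trainable = False

def feature_perceptual_loss(original, reconstructed):
    """Compare hidden VGG representations of the original and the VAE
    reconstruction (both assumed to be VGG-preprocessed image batches)."""
    feats_orig = extractor(original)
    feats_rec = extractor(reconstructed)
    return tf.add_n([tf.reduce_mean(tf.square(a - b))
                     for a, b in zip(feats_orig, feats_rec)])
```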

Generative Adversarial Network

The two-pathway generative adversarial network (TP-GAN) contains four landmark-located patch networks and a global encoder-decoder network. By combining an adversarial loss, a symmetry loss and an identity-preserving loss, TP-GAN generates a frontal view while preserving global structures and local details. TP-GAN also aims to generate identity-preserving images for accurate face analysis with off-the-shelf deep features. In the disentangled representation learning generative adversarial network (DR-GAN), an encoder produces an identity representation, and a decoder synthesizes a face at a specified pose using this representation and a pose code. The general framework of the TP-GAN architecture is shown below,

source

In the above architecture, the generator contains two pathways, each processing global or local transformations, and the discriminator distinguishes between synthesized frontal views and ground-truth frontal views.

D. FaceNet

In 2015, researchers at Google achieved the best results on a range of face recognition benchmark datasets with a system called FaceNet. There are third-party open-source implementations of the model, and pre-trained models are also available. The FaceNet system can be used to extract high-quality features from faces, called face embeddings, which can then be used to train a face identification system. It directly learns a mapping from face images to a compact Euclidean space where distances directly correspond to a measure of face similarity, i.e. faces of the same person have small distances and faces of distinct people have large distances. Using FaceNet embeddings as feature vectors, we can implement tasks such as face recognition (who is this person?), verification (is this the same person?) or clustering (find common people among these faces): face verification simply involves thresholding the distance between two embeddings, recognition becomes a k-NN classification problem, and clustering can be achieved using off-the-shelf techniques such as k-means or agglomerative clustering. FaceNet uses a deep convolutional network trained to directly optimize the embedding itself, rather than an intermediate bottleneck layer as in previous deep learning approaches. To train, it uses triplets of roughly aligned matching/non-matching face patches generated by a novel online triplet mining method.

Support Vector Machine classifier

Support vector machines (SVMs) are a binary classification method, while face recognition is a K-class problem, where K is the number of known individuals. The purpose of an SVM is to find the optimal separating hyperplane that minimizes the risk of misclassification. After a decent amount of training, an SVM can predict whether an input falls into one of two categories. This is done by mapping inputs to points in a high-dimensional space and constructing a hyperplane in that space. For face recognition, the model looks at dissimilarities between two facial images in a difference space: we formulate face recognition as a two-class problem whose classes are dissimilarities between faces of the same person and dissimilarities between faces of different people. By modifying the interpretation of the decision surface generated by the SVM, we obtain a similarity metric between faces that is learned from examples of differences between faces. The figure below shows the architecture of an SVM-NN network for face recognition,

source

The SVM-based algorithm has been compared with a principal component analysis (PCA) based algorithm on a difficult set of images. The identification performance for SVM is 77-78% versus 54% for PCA. For verification, the equal error rate is 7% for SVM and 13% for PCA.
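In practice, an SVM is often trained directly on face embeddings rather than on a difference space; the sketch below (with randomly generated stand-in embeddings and labels) shows that simpler setup using scikit-learn.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, Normalizer
from sklearn.svm import SVC

# Stand-in data: 128-D embeddings and identity labels (replace with real ones)
train_emb = np.random.rand(200, 128)
train_labels = np.array(["person_%d" % (i % 10) for i in range(200)])

# L2-normalize embeddings and encode string labels as integers
normalizer = Normalizer(norm="l2")
X = normalizer.transform(train_emb)
encoder = LabelEncoder()
y = encoder.fit_transform(train_labels)

# Linear SVM; probability=True allows rejecting low-confidence predictions
clf = SVC(kernel="linear", probability=True)
clf.fit(X, y)

probe = normalizer.transform(np.random.rand(1, 128))
print(encoder.inverse_transform(clf.predict(probe)))
```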

Multi-task based convolution neural network (MTCNN)

In a multi-task based convolutional neural network, we combine multi-task learning (MTL) with the CNN framework by sharing some layers between different tasks. There are basically three deep face recognition models, Lightened CNN, CASIA-Net and SphereFace, all of which are relatively light-weight. This enables us not only to fine-tune the models but also to train them from scratch using relatively small datasets of facial depth images. In MTCNN we use CASIA-Net with three modifications. First, batch normalization (BN) is applied to accelerate the training process. Second, the contrastive loss is excluded to simplify the loss function. Third, the dimension of the fully connected layer is changed according to the different tasks.

The pipeline of the cascaded framework, which includes three stages of multi-task deep convolutional networks, is shown below,

source

In the above figure, there are three stages: in the first stage, candidate windows are produced by a fast Proposal Network (P-Net); the second stage refines these candidates with a Refinement Network (R-Net); and the third stage, also called the Output Network (O-Net), produces the final bounding box and facial landmark positions.

The architectures of P-Net, R-Net, and O-Net, where “MP” means max pooling and “Conv” means convolution, are shown below,

source

The network architecture above consists of five blocks, each including two convolutional layers and a pooling layer. BN and ReLU are used after each convolutional layer; they are omitted from the figure for clarity. However, no ReLU is used after the conv52 layer, in order to learn a compact feature representation, and a dropout layer with a ratio of 0.4 is applied after the pool5 layer.
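For reference, the cascade described above is available off the shelf in the `mtcnn` Python package; a minimal detection sketch (with a hypothetical image path) might look like this.

```python
import cv2
from mtcnn import MTCNN  # pip install mtcnn

detector = MTCNN()

# MTCNN expects an RGB image; OpenCV loads BGR, so convert first
image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)
results = detector.detect_faces(image)  # runs the P-Net / R-Net / O-Net cascade

for face in results:
    x, y, w, h = face["box"]            # final bounding box from O-Net
    landmarks = face["keypoints"]       # five facial landmarks
    print(face["confidence"], (x, y, w, h), landmarks["left_eye"])
```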

Triplet Loss

Triplet-loss training aims at learning score vectors for identity verification by comparing face descriptors in Euclidean space. This is similar in spirit to “metric learning” and, like many metric learning approaches, it learns a projection that is at the same time distinctive and compact, achieving dimensionality reduction at the same time.

The function is defined as,

Triplet Function

Triplet loss is a loss function for artificial neural networks where a baseline (anchor) input is compared to a positive (truthy) input and a negative (falsy) input. The distance from the baseline (anchor) input to the positive (truthy) input is minimized, and the distance from the baseline (anchor) input to the negative (falsy) input is maximized.
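A minimal TensorFlow sketch of this loss, assuming batches of L2-normalized anchor, positive and negative embeddings, is shown below; the margin of 0.2 follows the FaceNet paper but is a tunable hyperparameter.

```python
import tensorflow as tf

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Pull the anchor towards the positive and push it away from the negative
    until the two squared distances differ by at least `margin`."""
    pos_dist = tf.reduce_sum(tf.square(anchor - positive), axis=-1)
    neg_dist = tf.reduce_sum(tf.square(anchor - negative), axis=-1)
    return tf.reduce_mean(tf.maximum(pos_dist - neg_dist + margin, 0.0))
```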

source

In the model structure, the network consists of a batch input layer and a deep CNN followed by L2 normalization, which results in the face embedding. This is followed by the triplet loss during training.

source

Once the FaceNet model is trained, we can create the embedding for a face by feeding it into the model. To compare two images, we create the embedding for both images by feeding them through the model separately. Then we can use the formula above to find the distance, which will be a lower value for similar faces and a higher value for different faces. The formula for finding the Euclidean distance between two points is as below,

source: Wikipedia

The comparison of two images using a Siamese network is illustrated below,

source

Face Embeddings

A face embedding is a vector representation of features extracted from the face. It is very helpful for finding the similarity between two feature vectors. For example, a vector that is close (by some measure) to a given feature vector may be the same person, whereas a vector that is far away may be a different person. The classifier model we want to develop will take a face embedding as input and predict the identity of the face.

The FaceNet model will generate this embedding for a given image of a face. It can be used as part of the classifier itself, or we can use the FaceNet model to pre-process a face to create a face embedding that can be stored and used as input to our classifier model. The latter approach is preferred, as the FaceNet model is both large and slow at creating face embeddings.
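A sketch of that pre-processing step is shown below; the Keras model file, the cached face crops and the output file names are assumptions for illustration, not fixed artifacts of FaceNet.

```python
import numpy as np
import tensorflow as tf

# Hypothetical pre-trained FaceNet exported as a Keras model file
facenet = tf.keras.models.load_model("facenet_keras.h5")

def get_embedding(face_pixels):
    """face_pixels: an aligned face crop already resized to the model's input size."""
    face = face_pixels.astype("float32")
    face = (face - face.mean()) / face.std()          # per-image standardization
    emb = facenet.predict(face[np.newaxis, ...])[0]
    return emb / np.linalg.norm(emb)                  # L2-normalize the embedding

# Pre-compute embeddings once, store them, and train the classifier on them later
faces = np.load("aligned_faces.npz")["faces"]         # hypothetical cached crops
embeddings = np.stack([get_embedding(f) for f in faces])
np.savez_compressed("face_embeddings.npz", embeddings=embeddings)
```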

The deterministic and probabilistic face embedding is shown in the figure below,

source

In the above figure, the deterministic embeddings represent every face as a point in the latent space without regard to its feature ambiguity, whereas a probabilistic face embedding (PFE) gives a distributional estimate of the features in the latent space.

E. VGG Model

The VGG models perform image classification, i.e. they take images as input and classify the major object in the image into a set of pre-defined classes. These models are trained on the ImageNet dataset, which contains images from 1000 classes. VGG models provide very high accuracy, but at the cost of increased model size. They are ideal when high classification accuracy is essential and there are few constraints on model size. Keras provides both the 16-layer and 19-layer versions via the VGG16 and VGG19 classes. VGG16_bn and VGG19_bn have the same architectures as their original counterparts but with batch normalization applied after each convolutional layer, which leads to better convergence and slightly better accuracy. The network architecture of the VGG-19 model is shown below,

source

The above architecture shows the effect of convolutional network depth on accuracy in the large-scale image recognition setting. VGG networks increase depth using very small (3 × 3) convolution filters, which showed a significant improvement over prior-art configurations, achieved by pushing the depth to 16-19 weight layers.
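For completeness, here is how a pre-trained VGG16 classifier can be loaded and queried through Keras; the image file name is a placeholder.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input, decode_predictions

model = VGG16(weights="imagenet")   # 16-layer VGG trained on ImageNet (1000 classes)

img = tf.keras.preprocessing.image.load_img("example.jpg", target_size=(224, 224))
x = tf.keras.preprocessing.image.img_to_array(img)
x = preprocess_input(x[np.newaxis, ...])

preds = model.predict(x)
print(decode_predictions(preds, top=3)[0])   # top-3 ImageNet classes with scores
```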

VGGFace

VGGFace refers to a series of models developed for face recognition and demonstrated on benchmark computer vision datasets by members of the Visual Geometry Group (VGG) at the University of Oxford. The model consists of 11 layers: eight convolutional layers and three fully connected layers. The VGGFace2 dataset proposed by Cao et al. is annotated with 9,131 unique people and 3.31 million images; the dataset can be obtained from the download link. The variation includes age, ethnicity, pose, profession, and illumination, and it is the largest dataset available for face verification. A VGGFace model can also be used for face verification by calculating a face embedding for a new face and comparing it to the embedding of the single example of that face known to the system. A Euclidean distance or cosine distance is calculated between the two embeddings, and the faces are said to match or verify if the distance is below a predefined threshold that is tuned for a specific dataset or application.

The VGG-Face architecture throughout the network is shown below,

source

The Configuration of the VGG-face model used in the above method is shown below,

source

The layer structure of the VGGFace model, visualizing the low- to high-level features captured as a facial expression propagates through the network, is shown below,

source

F. DeepFace

DeepFace uses a deep CNN trained to classify faces using a dataset of 4 million examples spanning 4000 unique identities. It was created by a research group at Facebook. It also uses a Siamese network architecture, where the same CNN is applied to pairs of faces to obtain descriptors that are then compared using the Euclidean distance. The goal of training is to minimize the distance between congruous pairs of faces (i.e. portraying the same identity) and maximize the distance between incongruous pairs, a form of metric learning. DeepFace uses an ensemble of CNNs, as well as a pre-processing phase in which face images are aligned to a canonical pose using a 3D model. DeepFace achieved the best performance on the Labeled Faces in the Wild (LFW) dataset as well as the YouTube Faces DB.

DeepFace closed the majority of the remaining gap to human performance on the most popular benchmark in unconstrained face recognition, and is now at the brink of human-level accuracy. Specifically for faces, the success of the learned network in capturing facial appearance in a robust manner is highly dependent on a very rapid 3D alignment step. The network architecture is based on the assumption that once alignment is completed, the location of each facial region is fixed at the pixel level. It is therefore possible to learn from raw pixel RGB values, without any need to apply several layers of convolutions as is done in many other networks.

The Outline of the DeepFace architecture is shown below,

source

The above architecture has a front end that applies a single convolution-pooling-convolution filtering stage to the rectified input, followed by three locally connected layers and two fully connected layers. Colors illustrate the feature maps produced at each layer. The network includes more than 120 million parameters, of which more than 95% come from the local and fully connected layers.

Face Alignment

Face alignment, also called face normalization, helps improve face recognition accuracy. The output of this normalization is a face centered in the image, rotated so that the line joining the centers of the two eyes is parallel to the horizontal axis, and resized to an identical scale. The following methods can be used: (i) employing an analytical 3D model of the face, (ii) searching an external dataset for similar fiducial-point configurations to infer from, and (iii) unsupervised methods that find a similarity transformation for the pixels.

While alignment is widely employed, no complete, physically correct solution currently exists in the context of unconstrained face verification. 3D models have fallen out of favor in recent years, especially in unconstrained environments. However, since faces are 3D objects, we believe that, done correctly, this is the right way. The steps for face alignment are shown in the figure below,

source

The description of the above figure is: (a) the detected face, with 6 initial fiducial points; (b) the induced 2D-aligned crop; (c) 67 fiducial points on the 2D-aligned crop with their corresponding Delaunay triangulation, with triangles added on the contour to avoid discontinuities; (d) the reference 3D shape transformed into the 2D-aligned crop image plane; (e) triangle visibility w.r.t. the fitted 3D-2D camera, where darker triangles are less visible; (f) the 67 fiducial points induced by the 3D model that are used to direct the piece-wise affine warping; (g) the final frontalized crop; (h) a new view generated by the 3D model.

The pipeline of face alignment is shown below,

The pipeline of the system

3D Alignment

In order to align faces undergoing out-of-plane rotations, we use a generic 3D shape model and register a 3D affine camera, which is used to warp the 2D-aligned crop to the image plane of the 3D shape. This generates the 3D-aligned version of the crop, as illustrated in the figure below,

source

G. OpenFace

OpenFace is a deep learning facial recognition model developed by Brandon Amos, Bartosz Ludwiczuk, and Mahadev Satyanarayanan. It provides a free and open-source library for face recognition with deep neural networks. OpenFace version 0.2.0 improves the accuracy from 76.1% to 92.9%, almost halves the execution time, decreases the deep neural network training time from a week to a day, and also improves the alignment process by removing a redundant face detection. OpenFace uses dlib's face detector. To process an image, face detection is run first to find bounding boxes around faces; each face is then passed separately into the neural network, which expects a fixed-size (currently 96x96 pixels) input. One way of getting a fixed-size input image is to reshape the face in the bounding box to 96x96 pixels. A potential issue with this is that faces could be looking in different directions.

Google's FaceNet is able to handle this potential issue, but a heuristic for our smaller dataset is to reduce the size of the input space by preprocessing the faces with alignment. For face alignment, we first find the locations of the eyes and nose with dlib's landmark detector and then perform an affine transformation to make the eyes and nose appear at about the same place. OpenFace 0.2.0 reformulates the affine transformation without resizing or cropping, and then runs detection a second time to output an image reshaped and ready to be passed into the neural network. The figure below shows the logic flow for a single image that is originally rotated and that the alignment corrects.

source

In the figure shown above, the affine transformation is based on the large blue landmarks, and the final image is cropped to the boundaries and resized to 96 × 96 pixels. In another example input face, the affine transformation moves the eye corners and nose close to their mean locations.

source

In the figure illustrated above, the face detection portion returns a list of bounding boxes around the faces in an image, which can be under different pose and illumination conditions. A potential issue with using the bounding boxes directly as input to the neural network is that faces could be looking in different directions or be under different illumination conditions.

Google's FaceNet is able to handle this with a large training dataset, but a heuristic for our smaller dataset is to reduce the size of the input space by normalizing the faces so that the eyes, nose, and mouth appear at similar locations in each image. OpenFace's end-to-end network training flow is shown below,

source
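A rough sketch of the dlib-based alignment described above is given below. It is not OpenFace's exact implementation: the landmark indices assume dlib's 68-point model, and the template coordinates for the eye corners and nose tip are illustrative.

```python
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
# The 68-point landmark model must be downloaded separately; the path is an assumption
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

# Where the outer eye corners and nose tip should land in a 96x96 aligned crop
TEMPLATE = np.float32([[18, 28], [78, 28], [48, 64]])  # illustrative constants

def align_face(image, size=96):
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    rects = detector(gray, 1)
    if not rects:
        return None
    landmarks = predictor(gray, rects[0])
    # 36 = left outer eye corner, 45 = right outer eye corner, 33 = nose tip
    src = np.float32([[landmarks.part(i).x, landmarks.part(i).y] for i in (36, 45, 33)])
    M = cv2.getAffineTransform(src, TEMPLATE)
    return cv2.warpAffine(image, M, (size, size))
```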

Example

In mobile computing, the OpenFace project lets the face recognition community study and improve off-the-shelf face recognition techniques in mobile scenarios. Due to lack of availability, such studies often use techniques with an order of magnitude less accuracy than the state of the art.

The following figure illustrates how a face recognition model works on a mobile device,

source

H. Detectron2 for Face Detection

Face detection is an AI-based computer technology that can identify and locate the presence of human faces in digital photos and videos. It can be regarded as a special case of object-class detection, where the task is to find the locations and sizes of all the objects that belong to a given class; face detection with Detectron2 means finding the faces within a specific image or images. Detectron2 is a framework for building state-of-the-art object detection and image segmentation models, developed by the Facebook AI Research team. The Detectron2 platform is implemented in PyTorch and started from maskrcnn-benchmark, and it provides fast training on single or multiple GPU servers.

source

Detectron2 includes high-quality implementations of state-of-the-art object detection algorithms, including DensePose, panoptic feature pyramid networks, and numerous variants of the pioneering Mask R-CNN model family also developed by FAIR.

This platform is now used to rapidly design and train the next-generation pose detection models that power Smart Camera. Detectron2 includes all the models that were available in the original Detectron, such as Faster R-CNN, Mask R-CNN, RetinaNet, and DensePose. It also features several new models, including Cascade R-CNN, Panoptic FPN, and TensorMask, and more algorithms continue to be added.
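As an illustration of the Detectron2 API, the sketch below runs a generic COCO-trained Faster R-CNN from the model zoo through DefaultPredictor; for face detection proper, the same code would use a model fine-tuned on a face dataset (e.g. WIDER FACE) with a single "face" class. The image path is a placeholder.

```python
import cv2
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml")
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5   # keep detections above this score

predictor = DefaultPredictor(cfg)
outputs = predictor(cv2.imread("photo.jpg"))  # Detectron2 expects a BGR image
print(outputs["instances"].pred_boxes)        # detected bounding boxes
print(outputs["instances"].scores)            # confidence scores
```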

Family of Detection Codebase,

source

The following video is very useful to understand Face Detection on Custom Dataset with Detectron2 and PyTorch,

source

I. Fundamental Concepts

ImageNet

ImageNet is a dataset (link) of over 15 million labeled high-resolution images belonging to roughly 22,000 categories. The images were collected from the web and labeled by human labelers using Amazon’s Mechanical Turk crowd-sourcing tool. As it is publicly available for research and educational use, it has been widely used in the research of object recognition algorithms and has played an important role in the deep learning revolution. It is organized according to the WordNet hierarchy. Each meaningful concept in WordNet, possibly described by multiple words or word phrases, is called a “synonym set” or “synset”. There are more than 100,000 synsets in WordNet, the majority of them are nouns (80,000+).

Starting in 2010, as part of the Pascal Visual Object Challenge, an annual competition called the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) has been held. ILSVRC uses a subset of ImageNet with roughly 1000 images in each of 1000 categories. In all, there are roughly 1.2 million training images, 50,000 validation images, and 150,000 testing images.

source

The ImageNet dataset is a very large collection of human-annotated photographs designed by academics for developing computer vision algorithms.

The general challenge tasks for ImageNet in most years are as follows:

  • Image classification: Predict the classes of objects present in an image.
  • Single-object localization: Image classification + draw a bounding box around one example of each object present.
  • Object detection: Image classification + draw a bounding box around each object present.

For more detail about ImageNet please check following link: Large Scale Visual Recognition Challenge (ILSVRC): link

AlexNet

AlexNet is a convolutional neural network that has had a large impact on the field of machine learning, specifically in the application of deep learning to machine vision. The architecture of this network is summarized in the figure below,

source

The AlexNet architecture above consists of eight layers: five convolutional layers and three fully connected layers. The network had a very similar architecture to LeNet by Yann LeCun et al., but was deeper, with more filters per layer and with stacked convolutional layers. It used 11×11, 5×5 and 3×3 convolutions, max pooling, dropout, data augmentation, ReLU activations, and SGD with momentum, attaching ReLU activations after every convolutional and fully connected layer. AlexNet was trained for 6 days simultaneously on two Nvidia GeForce GTX 580 GPUs, which is why the network is split into two pipelines. The architecture introduced several features that were new approaches to convolutional neural networks, described below.

(i). Data augmentation is carried out to reduce over-fitting. This Data augmentation includes mirroring and cropping the images to increase the variation in the training data-set.

(ii). Before AlexNet, the most commonly used activation functions were sigmoid and tanh. Due to their saturating nature, these functions suffer from the vanishing gradient (VG) problem, which makes the network difficult to train. AlexNet uses the ReLU activation function, which does not suffer from the VG problem.

(iii). Although ReLU helps with the vanishing gradient problem, because of its unbounded nature the learned variables can become unnecessarily large. To prevent this, AlexNet introduced Local Response Normalization (LRN). The idea behind LRN is to carry out normalization in a neighborhood of pixels, amplifying the excited neuron while dampening the surrounding neurons at the same time.

(iv). AlexNet also addresses over-fitting by using dropout layers, where a connection is dropped during training with probability p = 0.5. Although this keeps the network from over-fitting by helping it escape bad local minima, the number of iterations required for convergence doubles as well.

AlexNet network architecture is illustrated below,

AlexNet Architecture

VGGNet

VGGNet is a convolutional neural network model proposed by K. Simonyan and A. Zisserman. It is currently one of the most preferred choices in the community for extracting features from images. The weight configuration of VGGNet is publicly available and has been used in many other applications and challenges as a baseline feature extractor. The architecture of a VGGNet CNN is shown in the figure below,

source

The above model achieves 92.7% top-5 test accuracy on ImageNet, a dataset of over 14 million images belonging to 1000 classes. It was one of the famous models submitted to ILSVRC-2014. It improves over AlexNet by replacing large kernel-sized filters (11 and 5 in the first and second convolutional layers, respectively) with multiple 3×3 kernel-sized filters one after another. VGGNet has two main architectures, VGG-16 and VGG-19, whose differences are as follows:

(i). The “VGG-19 Neural Network” consists of 19 weight layers, whereas the “VGG-16 Neural Network” consists of 16 weight layers.

(ii). The smaller network is used for “ImageNet”, and the bigger network is used for “CIFAR-10”.

(iii). The “VGG-16” network has fewer weights and corresponds to column “D” of the configuration table, whereas the “VGG-19” network has more weights and corresponds to column “E”.

(iv). The size of the “VGG-16” network in terms of fully connected nodes is 533 MB, and the size of the “VGG-19” network is 574 MB.

(v). Smaller networks such as “SqueezeNet” and “GoogLeNet” are often more desirable than VGG-16, whereas the larger VGG networks are still employed for certain deep learning techniques and image classification problems.

We can observe that VGG-16 and VGG-19 start converging and that the accuracy improvement slows down. When people talk about VGGNet, they usually mean VGG-16 and VGG-19.

source

The ConvNet configurations architecture is shown below,

source

In the above table the depth of the configurations increases from the left (A) to the right (E), as more layers are added.

DenseNet

In the DenseNet architecture, each layer is connected to all the others within a dense block, so all layers can access the feature maps of their preceding layers, which encourages heavy feature reuse. As a result, the model is more compact and less prone to overfitting. All these good properties make DenseNet a natural fit for per-pixel prediction problems. DenseNet improves the information flow between layers by directly connecting each layer to all subsequent layers and concatenating the feature maps.

source

An important difference between DenseNet and existing network architectures is that DenseNet can have very narrow layers, e.g. k = 12. The hyperparameter k is referred to as the growth rate of the network: each layer in a dense block produces only k feature maps, and these k feature maps are concatenated with the feature maps of the previous layers and given as input to the next layer.
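The sketch below illustrates this concatenation pattern in Keras with an assumed input of 16 feature maps and growth rate k = 12; it is a simplified dense block, without the bottleneck layers used in the full DenseNet.

```python
import tensorflow as tf
from tensorflow.keras import layers

def dense_block(x, num_layers=4, growth_rate=12):
    """Each layer produces `growth_rate` (k) feature maps, which are concatenated
    with all preceding feature maps before being passed to the next layer."""
    for _ in range(num_layers):
        y = layers.BatchNormalization()(x)
        y = layers.Activation("relu")(y)
        y = layers.Conv2D(growth_rate, 3, padding="same")(y)
        x = layers.Concatenate()([x, y])   # feature reuse via concatenation
    return x

inputs = tf.keras.Input(shape=(32, 32, 16))
outputs = dense_block(inputs)
print(tf.keras.Model(inputs, outputs).output_shape)  # (None, 32, 32, 64) = 16 + 4*12
```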

The architecture of Multipath-DenseNet is faster than a densely connected convolutional neural network. A block in Multipath-DenseNet contains n number of dense blocks with a depth of m and growth rate of k. The scale and depth of every parallel layer in each dense block in Multipath-Dense block are the same. Every layer in a dense block of Multipath Dense block processes information on the same scale and depth level. At the end of every dense block, there is a supervised feature transformation block that learns intermediate features from the layers of these independent dense blocks. The architecture of deep DenseNet with three dense blocks is shown below,

source

In the above figure, the layers between two adjacent blocks are referred to as transition layers; they change the feature-map sizes via convolution and pooling. The DenseNet architectures for ImageNet are illustrated in the figure below,

source

In the above table the growth rate for all the networks is k = 32. Note that each “Conv” layer shown in the table corresponds to the sequence BN-ReLU-Conv.

ResNet

The Residual Network (ResNet) is a convolutional neural network (CNN) architecture designed to enable hundreds or thousands of convolutional layers. While previous CNN architectures saw a drop-off in the effectiveness of additional layers, ResNet can add a large number of layers and still perform strongly. It was an innovative solution to the “vanishing gradient” problem: neural networks train via backpropagation, which relies on gradient descent, moving down the loss function to find the weights that minimize it. If there are too many layers, repeated multiplication makes the gradient smaller and smaller until it “disappears”, causing performance to saturate or even degrade with each additional layer. The architecture of ResNet-12 is shown below,

source

In the above architecture, when the input is loaded into the network, a convolution and a pooling operation are performed before the first ResNet block, since no learning has been done yet. In both the convolution and pooling layers, the row stride and column stride are set to 2, so the height and width of the feature map are reduced by half.
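A basic residual block, sketched below in Keras, shows the idea: the input skips over two convolutions and is added back to their output, giving the gradient a direct path to earlier layers. The filter counts and input shape are illustrative.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters, stride=1):
    shortcut = x
    y = layers.Conv2D(filters, 3, strides=stride, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    if stride != 1 or x.shape[-1] != filters:
        # 1x1 projection so the shortcut matches the new shape
        shortcut = layers.Conv2D(filters, 1, strides=stride)(x)
    y = layers.Add()([y, shortcut])        # the skip connection
    return layers.Activation("relu")(y)

inputs = tf.keras.Input(shape=(56, 56, 64))
outputs = residual_block(inputs, filters=128, stride=2)
print(tf.keras.Model(inputs, outputs).output_shape)  # (None, 28, 28, 128)
```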

The architecture of ResNet in tabular form is shown below,

ResNet Architecture

Inception Model

The Inception-ResNet model is inspired by the performance of ResNet. There are two sub-versions of Inception-ResNet, namely v1 and v2. The minor differences between these two sub-versions are:

  • Inception-ResNet v1 has a computational cost that is similar to that of Inception v3.
  • Inception-ResNet v2 has a computational cost that is similar to that of Inception v4.
  • Both sub-versions have the same structure for the modules A, B, C, and the reduction blocks. The only difference is the hyper-parameter settings.

The Inception deep convolutional architecture was introduced as GoogLeNet (Szegedy et al. 2015a), here named Inception-v1. Later the Inception architecture was refined in various ways, first by the introduction of batch normalization (Inception-v2), and later by additional factorization ideas in the third iteration, referred to as Inception-v3. The Inception-v3 architecture is shown below,

source

In the above architecture, the output size of each module is the input size of the next one. The architecture uses variations of the reduction technique to reduce the grid size between the Inception blocks wherever applicable. Convolutions marked with 0-padding are used to maintain the grid size, and 0-padding is also used inside those Inception modules that do not reduce the grid size; all other layers do not use padding. The outline of the Inception network architecture is shown below,

source

In the above table, we can see that the output size of each module is the input size of the next one. All other layers do not use padding.

MobileNet

MobileNet is an efficient convolutional neural network for mobile vision applications. MobileNets are small, low-latency, low-power models parameterized to meet the resource constraints of a variety of use cases. They can be built upon for classification, detection, embeddings, and segmentation, similar to how other popular large-scale models, such as Inception, are used. MobileNets can be run efficiently on mobile devices with TensorFlow Mobile. The figure below shows how MobileNet models can be applied to various recognition tasks for efficient on-device intelligence,

source

The structure of MobileNet is based on depthwise separable filters, as shown below,

source

In the MobileNet architecture, all layers are followed by batchnorm and a ReLU nonlinearity, with the exception of the final fully connected layer, which has no nonlinearity and feeds into a softmax layer for classification. Each factorized layer applies a depthwise convolution followed by a 1 × 1 pointwise convolution, with batchnorm and ReLU after each of the two convolutions. Downsampling is handled with strided convolution in the depthwise convolutions as well as in the first layer. A final average pooling reduces the spatial resolution to 1 before the fully connected layer.
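A sketch of that building block in Keras is shown below: a 3x3 depthwise convolution filters each channel separately, a 1x1 pointwise convolution then combines the channels, and batchnorm plus ReLU follow each convolution. The shapes are illustrative.

```python
import tensorflow as tf
from tensorflow.keras import layers

def depthwise_separable_block(x, pointwise_filters, stride=1):
    x = layers.DepthwiseConv2D(3, strides=stride, padding="same")(x)  # per-channel 3x3
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.Conv2D(pointwise_filters, 1, padding="same")(x)        # 1x1 pointwise
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

inputs = tf.keras.Input(shape=(112, 112, 32))
outputs = depthwise_separable_block(inputs, pointwise_filters=64)
print(tf.keras.Model(inputs, outputs).output_shape)  # (None, 112, 112, 64)
```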

The MobileNet Body Architecture is illustrated in the table below,

MobileNet Architecture

GoogleNet

GoogLeNet is a convolutional neural network that is 22 layers deep. The network has learned feature representations for a wide range of images: it has been trained on over a million images and can classify images into 1000 object categories (such as keyboard, coffee mug, pencil, and many animals). It is the winner of ILSVRC (ImageNet Large Scale Visual Recognition Competition) 2014, an image classification competition. From the name “GoogLeNet”, we already know that it is from Google, and “GoogLeNet” also contains the word “LeNet”, paying tribute to Prof. Yann LeCun's LeNet.

The architecture of GoogleNet is shown below,

source

The GoogLeNet incarnation of the Inception architecture in tabular form is shown below,

GoogLeNet Parameters

Raspberry Pi 3 Model B

The Raspberry Pi is a series of small single-board computers developed in the United Kingdom by the Raspberry Pi Foundation. The Raspberry Pi 3B+ uses an updated version of the 64-bit Broadcom application processor, which incorporates power integrity optimizations and a heat spreader. Dual-band wireless LAN and Bluetooth are provided by the Cypress CYW43455 “combo” chip, connected to a Proant PCB antenna similar to the one used on the Raspberry Pi Zero W. Compared to its predecessor, the Raspberry Pi 3B+ delivers somewhat better performance in the 2.4GHz band and far better performance in the 5GHz band.

The figure below shows the architecture of the Raspberry Pi 3B,

source

MQTT on the Raspberry Pi

MQTT, which originally was an acronym for Message Queue Telemetry Transport, is a lightweight message queue protocol designed for small data packets sent across high-latency, low-bandwidth links. MQTT is a fairly simple protocol, and it is perfect for Internet of Things projects, including this security system project. It is available via apt, so installing it is quite easy. There are a number of steps in configuring the Raspberry Pi component of the security system; as mentioned, I am using a Raspberry Pi 3.

To connect the Raspberry Pi 3 with your GPU machine, follow these steps (a minimal messaging sketch follows the list):

(i). Install mosquitto (MQTT) components.

(ii). Configure mosquitto and restart the service.

(iii). Copy in the security.py program and edit it for your installation.

(iv). Configure security.py to run at boot.

(v). Start security.py.
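As a minimal sketch of the messaging side (assuming the mosquitto broker runs on the Pi itself and using the paho-mqtt client library; the topic name and payload are hypothetical), a detection event could be published like this:

```python
# pip install paho-mqtt
import json
import paho.mqtt.client as mqtt

BROKER_HOST = "localhost"    # assumption: mosquitto broker running on the Pi
TOPIC = "security/camera1"   # hypothetical topic name

client = mqtt.Client()
client.connect(BROKER_HOST, 1883, keepalive=60)

# Publish a small JSON message when the camera detects a face
payload = json.dumps({"event": "face_detected", "camera": "camera1"})
client.publish(TOPIC, payload, qos=1)
client.disconnect()
```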

The video shown below is very useful for explaining the Raspberry Pi 3,

source

J. Training face using FaceNet

Dataset

You can download the datasets from CASIA-WebFace [download] and VGGFace2 [download]. A face detector is run on each image, and a tight bounding box around each face is generated. These face thumbnails are resized to the input size of the respective network; input sizes range from 96x96 pixels to 224x224 pixels in our experiments. The CASIA-WebFace dataset has been used for training. The best-performing model has been trained on the VGGFace2 dataset, which consists of ~3.3M faces and ~9000 classes.

References to train your dataset using FaceNet

i. FaceNet Face Recognition using Tensorflow: link

ii. Classifier training of inception ResNet v1: link

iii. Alignment using MTCNN face detection: link

iv. Training using the VGGFace2 dataset: link

v. Train a classifier on own images: link

vi. Triplet loss training: link

you can download trained model from here: link

K. Face recognition through webcam

The following video is the result of Deep Learning-based Face recognition using a webcam,

youtube link

Another Link with Raspberry Pi,

youtube link

Summary

In this blog, we have created a system that performs real-time face recognition with a GPU machine (GeForce GTX 1660 Ti, 16GB DDR4, 256GB PCIe SSD + 1TB HDD) and a Raspberry Pi 3B. For better accuracy and speed, there is still much we can do to improve the performance of this system: potentially, we can apply knowledge distillation to compress the current model and further reduce the model size using low-bit quantization, and we could also improve accuracy by using other machine learning classification methods on the embeddings.

Thank you for reading this blog.

