As explained in the system overview, the simple FER implementation will receive all the input frames from the camera and send them to the FER service for recognition.
However, this simple approach is impractical because today's mobile phones typically capture video at 30 to 60 FPS. Is it really worthwhile to send 30 to 60 frames per second to a pay-as-you-go service just to recognize an expression? The answer is no: the camera may capture frames that contain no face, and many frames only record the transition between two expressions. Recognizing every input frame with a paid service is therefore costly.
Therefore, several potential optimizations are introduced and explained in the following subsections. They are described as “potential” because each will be examined and evaluated to determine whether it is worth implementing.
Bachelor of Computer Science (Honours)
Faculty of Information and Communication Technology (Kampar Campus), UTAR
3.3.1 Down Sampling
[Figure: flowchart with stages Input, Face detection, FER Service, Output; data passed between stages: Original Images, Images with Faces, FER result, Processed results]
Figure 3.3.1.1 FER Implementation with Down Sampling
The down sampling optimization aims to reduce the number of requests, the amount of data transmitted, and battery consumption by applying face detection and by limiting requests to a regular interval. As shown in Figure 3.3.1.1, this optimization is applied right after an input frame is obtained. Face detection is performed on the input frame, and the frame is sent to the FER service for recognition only if it contains a face. An open-source face detection algorithm is used to identify the faces and report the coordinates of the square boxes that enclose them.
Besides face detection, the number of service calls is restricted to one per second. Input frames are received continuously, but only one frame is processed per second. The interval is set because human expressions do not usually change very frequently during a normal conversation, and because every recognition incurs a cost, so reducing the number of recognitions is more cost-efficient. As a result, the amount of data transmitted is greatly reduced, which also saves battery compared with sending every input frame for recognition.
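The two gates described above (one processed frame per interval, forwarded only if a face is found) can be sketched as follows. This is a minimal illustration rather than the application's actual code: the class name `DownSampler`, the `(x, y, width, height)` box format, and the stand-in `fake_detector` are all assumptions made for the example.

```python
from typing import Callable, Optional, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height); hypothetical box format


class DownSampler:
    """Processes at most one frame per interval and forwards it only if it has a face."""

    def __init__(self, detect_faces: Callable[[object], Sequence[Box]],
                 interval_s: float = 1.0) -> None:
        self.detect_faces = detect_faces  # stand-in for the real face detector
        self.interval_s = interval_s
        self._last_processed: Optional[float] = None

    def should_send(self, frame: object, now_s: float) -> bool:
        # Time gate: drop frames that arrive inside the current interval.
        if (self._last_processed is not None
                and now_s - self._last_processed < self.interval_s):
            return False
        self._last_processed = now_s
        # Face gate: only frames with at least one detected face are forwarded.
        return len(self.detect_faces(frame)) > 0


def fake_detector(frame: dict) -> Sequence[Box]:
    """Hypothetical detector: reports one box when the frame is flagged as having a face."""
    return [(0, 0, 10, 10)] if frame["has_face"] else []


# Simulate 3 seconds of a 30 FPS stream in which a face appears from frame 45 on.
sampler = DownSampler(fake_detector, interval_s=1.0)
frames_sent = [i for i in range(90)
               if sampler.should_send({"has_face": i >= 45}, now_s=i / 30)]
print(frames_sent)
```

In this simulation frames 0, 30 and 60 are processed, and only frame 60 contains a face, so a single request is made instead of 90.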
3.3.2 Spatial Trimming
[Figure: flowchart with stages Input, Face detection, Grayscaling, Face Cropping, FER Service, Output; data passed between stages: Original Images, Grayscale images, Face coordinates, Cropped faces, FER result, Processed result]
Figure 3.3.2.1 FER Implementation with Image Processing
The second optimization trims the spatial dimensions of the input frame in order to reduce the data transmitted and the recognition time of each service call. The first step is to convert the images to grayscale. This reduces each pixel from 24 bits (8 bits each for red, green, and blue) to 8 bits of gray. Grayscale conversion reduces the image size and therefore the bandwidth used by the service call.
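The 24-bit to 8-bit reduction can be illustrated with a small pure-Python sketch. It assumes the common ITU-R BT.601 luma weights (0.299, 0.587, 0.114), which is also the formula OpenCV documents for RGB-to-gray conversion; the text does not state which weights were used, and the helper names are hypothetical.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (red, green, blue), each in 0..255


def to_grayscale(image: List[List[Pixel]]) -> List[List[int]]:
    """Convert rows of RGB pixels to 8-bit gray values (BT.601 weights, assumed)."""
    return [
        [int(round(0.299 * r + 0.587 * g + 0.114 * b)) for (r, g, b) in row]
        for row in image
    ]


def raw_size_bytes(width: int, height: int, channels: int) -> int:
    """Uncompressed image size at 8 bits (1 byte) per channel."""
    return width * height * channels


rgb_bytes = raw_size_bytes(640, 480, channels=3)   # 24 bits per pixel
gray_bytes = raw_size_bytes(640, 480, channels=1)  # 8 bits per pixel
print(rgb_bytes, gray_bytes, rgb_bytes // gray_bytes)
```

For uncompressed data the grayscale frame is one third of the size of the color frame. The saving for compressed formats such as JPEG is typically smaller, so the actual bandwidth reduction should be measured rather than assumed.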
Face cropping, the other method in this stage, reduces the region of interest (ROI) for FER by using the face coordinates obtained from face detection. To reduce the ROI effectively, small faces can be filtered out by setting a minimum face size in the face detector, so that small faces that are probably not intended to be recognized are not detected. Cropping significantly reduces the noise and extraneous data in the input frame, as well as the bandwidth required to send the cropped faces to the third-party service.
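A minimal sketch of face cropping with a minimum face size is given below. It assumes boxes in `(x, y, width, height)` pixel form and applies the size filter after detection; the text instead configures the minimum size in the detector itself (for example, OpenCV's `CascadeClassifier.detectMultiScale` accepts a `minSize` argument). The function name and the default threshold are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height) in pixels; assumed format


def crop_faces(gray: List[List[int]], boxes: List[Box],
               min_size: int = 40) -> List[List[List[int]]]:
    """Return one cropped sub-image per box that is at least min_size on both sides."""
    crops = []
    for x, y, w, h in boxes:
        if w < min_size or h < min_size:
            continue  # skip faces that are too small to be worth recognizing
        crops.append([row[x:x + w] for row in gray[y:y + h]])
    return crops


# Tiny 6x8 test image whose pixel value encodes its position (row * 10 + column).
image = [[r * 10 + c for c in range(8)] for r in range(6)]
faces = crop_faces(image, [(1, 2, 3, 2), (0, 0, 1, 1)], min_size=2)
print(faces)
```

Only the first box survives the size filter, so a single 3x2 crop is sent onward instead of the full 8x6 frame.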
3.3.3 Caching Methodologies
Finally, the cost of facial expression recognition remains the primary concern when it runs nonstop. Hence, two caching methodologies are proposed to reduce reliance on third-party FER services: cache comparison and a teacher-student machine learning model. Both methodologies will be implemented in a Jupyter Notebook running Python 3 to evaluate their feasibility.
[Figure: flowchart with stages Input, Face detection, Grayscaling, Face Cropping, Compare with cached images, FER Service, Output; data passed between stages: Original Images, Grayscale images, Face coordinates, Cropped faces, FER result, Processed result; branches: Low similarity, High similarity]
Figure 3.3.3.1 FER Implementation with Cache Comparison
The first approach is cache comparison. Every face-cropped image and its recognized expression are cached on the device, and each new face-cropped input frame is compared against these cached, labeled images. If the input frame is highly similar to one of the cached images, it is classified with the same expression as that image. Otherwise, the face-cropped input frame is submitted to the third-party FER service for recognition; once the result is obtained, the image is labeled with its expression and cached on the device. This approach is intended for situations in which the recognizer is used repeatedly for the same person, on the assumption that the same expression from the same person looks very similar each time.
Implementing the cache comparison theoretically eliminates the need to use a third-party FER service when a similar expression has previously been recognized.
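The cache-comparison flow can be sketched as below. The text does not specify how similarity is measured, so this example assumes a simple score based on the mean absolute pixel difference between equally sized grayscale crops, with a hypothetical threshold of 0.9; `ExpressionCache` and `fake_fer_service` are illustrative names, and the fake service merely stands in for the paid FER call.

```python
from typing import Callable, List, Tuple

Gray = List[List[int]]  # 8-bit grayscale face crop; all crops assumed to be the same size


def similarity(a: Gray, b: Gray) -> float:
    """1.0 for identical crops, 0.0 for maximally different ones (assumed metric)."""
    total = sum(abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    count = len(a) * len(a[0])
    return 1.0 - total / (255.0 * count)


class ExpressionCache:
    """Reuses a cached label when a new crop is similar enough to a cached crop."""

    def __init__(self, fer_service: Callable[[Gray], str],
                 threshold: float = 0.9) -> None:
        self.fer_service = fer_service  # stand-in for the third-party FER call
        self.threshold = threshold      # hypothetical similarity cut-off
        self.entries: List[Tuple[Gray, str]] = []
        self.service_calls = 0

    def recognize(self, face: Gray) -> str:
        best = max(self.entries, key=lambda e: similarity(face, e[0]), default=None)
        if best is not None and similarity(face, best[0]) >= self.threshold:
            return best[1]  # high similarity: reuse the cached expression
        label = self.fer_service(face)  # low similarity: ask the service
        self.service_calls += 1
        self.entries.append((face, label))
        return label


def fake_fer_service(face: Gray) -> str:
    """Hypothetical service: labels bright crops HAPPY and dark crops SAD."""
    pixels = [p for row in face for p in row]
    return "HAPPY" if sum(pixels) / len(pixels) > 127 else "SAD"


cache = ExpressionCache(fake_fer_service)
labels = [cache.recognize(face) for face in (
    [[200, 200], [200, 200]],  # first bright crop: goes to the service
    [[205, 198], [200, 202]],  # nearly identical crop: served from the cache
    [[20, 20], [20, 20]],      # dark crop: goes to the service
)]
print(labels, cache.service_calls)
```

Three crops are recognized with two service calls, because the second crop is close enough to the first to reuse its label. The choice of metric and threshold determines how often wrong labels are reused, which is what the feasibility evaluation needs to measure.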
[Figure: flowchart with stages Input, Face detection, Grayscaling, Face Cropping, Trained model, FER Service, Train model again, Output; data passed between stages: Original Images, Grayscale images, Face coordinates, Cropped faces, Cropped face and FER result, FER result, Processed result; branches: High confidence, Low confidence]
Figure 3.3.3.2 FER Implementation with Teacher-student Model Training
Another approach is to train a simple FER model on a set of labeled, face-cropped images stored on the device and then use it to predict the expression. At the very beginning, the basic model is unreliable and has a low probability of predicting the expression correctly. For low-confidence predictions, the face-cropped image is sent to the FER service for recognition, and the processed result is used to inform the user. At the same time, the cropped image and the expression identified by the FER service are added to the dataset in local storage, and the model is trained again. Over time, the local FER model improves in accuracy and becomes capable of predicting the expression reliably. In this design, the FER service acts as a teacher that trains the student, the local FER model, by providing a reliable prediction whenever the local model produces a low-confidence one. The need to access the online FER service therefore diminishes as the local model becomes more accurate, which reduces both the amount of data transmitted and the cost of recognition.
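The teacher-student loop can be sketched as follows. The text only calls for a "simple FER model", so a nearest-centroid classifier is used here purely as a stand-in; the confidence definition, the 0.9 threshold, and the names `StudentModel`, `recognize` and `fake_teacher` are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Optional, Tuple

Features = List[float]  # flattened grayscale face crop, values in 0..255


class StudentModel:
    """Nearest-centroid classifier used as a stand-in for the local FER model."""

    def __init__(self) -> None:
        self.dataset: Dict[str, List[Features]] = {}
        self.centroids: Dict[str, Features] = {}

    def add_example(self, features: Features, label: str) -> None:
        self.dataset.setdefault(label, []).append(features)

    def train(self) -> None:
        # "Training" here simply recomputes one mean vector per label.
        self.centroids = {
            label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in self.dataset.items()
        }

    def predict(self, features: Features) -> Tuple[Optional[str], float]:
        best_label, best_conf = None, 0.0
        for label, centroid in self.centroids.items():
            mean_abs = sum(abs(f - c) for f, c in zip(features, centroid)) / len(features)
            conf = 1.0 - mean_abs / 255.0  # assumed confidence: closeness to the centroid
            if conf > best_conf:
                best_label, best_conf = label, conf
        return best_label, best_conf


def recognize(features: Features, student: StudentModel,
              teacher: Callable[[Features], str],
              threshold: float = 0.9) -> Tuple[str, str]:
    """Return (label, source), where source is 'student' or 'teacher'."""
    label, conf = student.predict(features)
    if label is not None and conf >= threshold:
        return label, "student"        # high confidence: no service call
    label = teacher(features)          # low confidence: ask the FER service
    student.add_example(features, label)
    student.train()                    # retrain with the new labeled example
    return label, "teacher"


def fake_teacher(features: Features) -> str:
    """Hypothetical FER service: bright crops are HAPPY, dark crops are SAD."""
    return "HAPPY" if sum(features) / len(features) > 127 else "SAD"


student = StudentModel()
history = [recognize(f, student, fake_teacher) for f in (
    [200, 200, 200, 200], [204, 198, 200, 202],
    [20, 20, 20, 20], [25, 18, 20, 22],
)]
print(history)
```

In this run the teacher is consulted only the first time each expression appears; the second sample of each expression is answered by the student. A practical model would retrain less often than after every new example, which is a design choice left open here.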
Chapter 4
FER Implementation
4.0 Overview
In this chapter, the suggested approach, comprising the basic implementation, down sampling, and spatial trimming, is incorporated into an Android application using Android Studio, the OpenCV library, and the AWS Rekognition library. The resulting smartphone application has an intuitive user interface and acquires a real-time video stream as input for the recognition process. Its outputs are the label of the identified facial expression, a square box marking the face coordinates in the input frame, and voice output. The caching methodologies, on the other hand, are implemented and evaluated in Jupyter Notebook using the boto3 and OpenCV libraries. Jupyter Notebook is used for the caching methodologies and for the evaluation of all approaches because of its usability and convenience. All approaches will be evaluated, and the final proposed approach will be chosen based on the evaluation results, which are described and explained in Chapter 5.
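As a rough illustration of how the expression label and the face box could be obtained in the notebook, the sketch below calls the Amazon Rekognition `DetectFaces` operation through boto3 and extracts the most confident emotion and a pixel bounding box for each face. The helper names are hypothetical, the call needs boto3 and valid AWS credentials, and the sample response is hand-written to show the response shape; it is not real service output.

```python
from typing import Any, Dict, List, Tuple


def detect_expressions(image_bytes: bytes) -> Dict[str, Any]:
    """Call Rekognition DetectFaces (requires boto3 and configured AWS credentials)."""
    import boto3  # imported lazily so the parsing helper below works without it

    client = boto3.client("rekognition")
    return client.detect_faces(Image={"Bytes": image_bytes}, Attributes=["ALL"])


def summarize_faces(response: Dict[str, Any], width: int,
                    height: int) -> List[Tuple[str, Tuple[int, int, int, int]]]:
    """Return (top emotion, (x, y, w, h) in pixels) for every detected face."""
    results = []
    for face in response.get("FaceDetails", []):
        box = face["BoundingBox"]  # values are ratios of the image width and height
        pixel_box = (
            int(box["Left"] * width), int(box["Top"] * height),
            int(box["Width"] * width), int(box["Height"] * height),
        )
        top = max(face["Emotions"], key=lambda e: e["Confidence"])
        results.append((top["Type"], pixel_box))
    return results


# Hand-written sample with the same shape as a DetectFaces response (illustrative only).
sample_response = {
    "FaceDetails": [{
        "BoundingBox": {"Left": 0.25, "Top": 0.5, "Width": 0.5, "Height": 0.25},
        "Emotions": [
            {"Type": "CALM", "Confidence": 10.0},
            {"Type": "HAPPY", "Confidence": 85.0},
        ],
    }]
}
print(summarize_faces(sample_response, width=640, height=480))
```

The label and box returned by `summarize_faces` correspond to the first two outputs listed above; the voice output is outside the scope of this sketch.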