Multimodal Authentication Systems

The reported weaknesses of previous works have been used to improve the performance of our multimodal biometric authentication system. In the eighth chapter, all the achievements of the multimodal biometric authentication system are presented.

Motivation Behind Using Biometric Data

Historical Overview

However, the rapid shift from on-premises storage to cloud-based storage has underscored the security and privacy concerns surrounding data stored remotely, beyond the direct physical control of the user. To improve system security and the privacy of stored data, it has become imperative to consider multi-factor user authentication that incorporates biometric data.

Biometric Authentication Systems

The collection, storage and analysis of data often overwhelm the capacity of a local machine or server, which in turn has prompted the emergence of cloud storage and cloud-based computing. In our work, we review recent work on biometric authentication systems, examine several approaches individually and in combination, and then analyze the robustness and overall integrity of the systems.

Thesis Motivation

Such approaches should also be used to analyze the possible weaknesses of the new authentication models and thus provide improved control of system security, availability and integrity.

Goals and Aims of the Current Work

Facial Features Recognition (AI)

Advanced approaches such as transfer learning of the AlexNet CNN with a Support Vector Machine (SVM) classifier reported 100% accuracy on the Georgia Tech Face Database [5], although transfer learning can introduce substantial data bias. Other studies continue to improve statistical approaches, such as the Optimized Probability Density Function (OPDF), which uses the Histogram of Dynamic Thermal Patterns (HDTP) as a descriptor [22].

Table 2.1: Achieved TAR on the whole Georgia Tech Face Database

Voice-Based Authentication (AK)

The authors of [52] replace MFCC with Linear Prediction Cepstral Coefficients and use Naive Bayes classification for the voice verification model, comparing results on the TIMIT data set (Table 2.2). The results of the study were measured as Equal Error Rate (EER), which the researchers reduced from 1.72% to 1.48%.
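The Equal Error Rate mentioned above is the operating point at which the false acceptance rate equals the false rejection rate. The following is a minimal sketch of how it can be estimated from two lists of match scores. It assumes that a higher score means a stronger match; the function name and the sample scores are illustrative and are not taken from the cited study.

```python
def equal_error_rate(genuine, impostor):
    """Estimate the EER by using every observed score as a candidate threshold.

    Assumption: a higher score means a stronger match, so a trial is
    accepted when score >= threshold.
    """
    best_gap, eer = None, None
    for t in sorted(set(genuine) | set(impostor)):
        far = sum(s >= t for s in impostor) / len(impostor)  # impostors accepted
        frr = sum(s < t for s in genuine) / len(genuine)     # genuine users rejected
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            # Keep the threshold where FAR and FRR are closest to each other.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Illustrative scores only.
genuine_scores = [0.9, 0.8, 0.7, 0.4]
impostor_scores = [0.6, 0.3, 0.2, 0.1]
print(equal_error_rate(genuine_scores, impostor_scores))
```

With real data the two error curves rarely cross exactly at an observed score, which is why the sketch reports the midpoint of FAR and FRR at the closest threshold.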

Other Methods

Recent studies on fingerprint biometrics have introduced new models using 3D features of the finger [35], tested the performance of hybrid models with traditional password and PIN input [45], and explored input systems using probability-based methods and behavioral biometrics [19]. Mobile sensors such as the gyroscope, magnetometer and accelerometer produce and collect data about the user's movement.

Multimodal Systems

Replicated Paper

The sensors capture the user's face and voice and compare them with the registration data in the database. The study reported excellent results in terms of the TAR, FRR and FAR indicators, and its code execution time was significantly shorter than that of previous works.

Spoofing and Optimization

Spoofing
Optimization
Data Collection
Initial Data Configuration
Dataset Limitation
Dataset Preprocessing
Subset Selection

However, such systems remain susceptible to problems with the physical environment and variations in the user's state. A subset of 111 users from the SpeakingFaces dataset was selected to construct and validate the performance of the multimodal biometric system.

Figure 2-2: Anti-spoofing methods for different levels [17]

TIMIT Dataset

Data Collection

Among them, two sentences are specially constructed using phrases saturated with contexts in which the maximum manifestation of the speaker's dialect affiliation can be expected. The first group consists of 450 separate phonetically balanced sentences that ensure full coverage of the phonemic inventory and the occurrence of phonemes in specific contexts. For the second group, 1890 sentences were selected from the available text corpora with the selection criterion to increase the variety of sentence types and phonetic contexts of phoneme use.

Initial Data Configuration

The corpus uses a highly detailed set of 61 phonemes, which for historical reasons is mapped to 48 phonemes for training and to 39 phonemes for testing and comparison across the scientific community.

Data Preprocessing

Georgia Tech Face Database

Data Collection

Data Preprocessing and Limitations

As noted in Chapter 2, the original architecture of the multimodal biometric authentication system was created by Zhang et al. This chapter provides a detailed overview of the original working model, the parameters and settings used for our replication of that model, the differences between the two, and a brief description of the model's limitations. Mel-Frequency Cepstral Coefficients (MFCC) are extracted as features from the cleaned data and passed to the Gaussian mixture model for classification, where the matching score is calculated in the same way as in face verification.
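To illustrate how a Gaussian mixture model can turn feature frames into a matching score, the sketch below computes the average per-frame log-likelihood under a diagonal-covariance mixture. This is a generic formulation, not the exact scoring used by Zhang et al.; the diagonal covariance, the function name and the toy parameters are assumptions, and the frames stand in for MFCC vectors.

```python
import math


def gmm_log_likelihood(frames, weights, means, variances):
    """Average per-frame log-likelihood under a diagonal-covariance GMM.

    frames:    list of feature vectors (stand-ins for MFCC frames)
    weights:   mixture weights, one per component
    means:     one mean vector per component
    variances: one variance vector per component (diagonal covariance, assumed)
    """
    total = 0.0
    for x in frames:
        comp_logs = []
        for w, mu, var in zip(weights, means, variances):
            # Log-density of a Gaussian with independent dimensions.
            log_pdf = -0.5 * sum(
                math.log(2 * math.pi * v) + (xi - m) ** 2 / v
                for xi, m, v in zip(x, mu, var)
            )
            comp_logs.append(math.log(w) + log_pdf)
        peak = max(comp_logs)  # log-sum-exp for numerical stability
        total += peak + math.log(sum(math.exp(c - peak) for c in comp_logs))
    return total / len(frames)


# Toy two-component model over one-dimensional "features".
weights, means, variances = [0.5, 0.5], [[0.0], [3.0]], [[1.0], [1.0]]
print(gmm_log_likelihood([[0.1], [2.9]], weights, means, variances))
```

Frames that lie close to the speaker's mixture components yield a higher average log-likelihood, so the value can serve as a similarity score for the claimed speaker.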

Figure 4-1: Flow chart of the multimodal authentication system implemented by Zhang et al

Model Components

Face Module (AI)

After combining the idea of integral images with Haar cascades, improved object detection was considered. Zhang et al. [59] used the improved version of the LBP coding algorithm in Eq. 4.1, which reduces the number of LBP categories from $2^p$ to $p(p-1)+2$. The improved algorithm was shown to be invariant to both gray scale and rotation of the feature image.
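The count p(p − 1) + 2 quoted above matches the number of so-called uniform LBP patterns, that is, circular bit patterns with at most two transitions between 0 and 1. The sketch below checks this by enumeration. It assumes that the improved coding refers to uniform patterns, which the text does not state explicitly; the function names are illustrative.

```python
def transitions(code, p):
    """Count 0/1 transitions in the circular p-bit pattern `code`."""
    bits = [(code >> i) & 1 for i in range(p)]
    return sum(bits[i] != bits[(i + 1) % p] for i in range(p))


def uniform_patterns(p):
    """Return all p-bit codes with at most two circular transitions."""
    return [c for c in range(2 ** p) if transitions(c, p) <= 2]


# Compare the enumerated count with p(p - 1) + 2 for the common 8-neighbour case.
p = 8
print(len(uniform_patterns(p)), p * (p - 1) + 2, 2 ** p)
```

In histogram-based descriptors all remaining non-uniform codes are usually collected in one extra bin, so the feature vector has one more entry than the number of uniform patterns.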

Voice Module (AK)

The wavelet and the level of decomposition are selected, and the wavelet decomposition of the original signal is calculated. The obtained values are then used to approximate the start and end points of the audio signal. For this, the posterior maximum likelihood method is used to estimate the mixture parameters.
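The wavelet decomposition and endpoint estimation described above can be sketched as follows. The Haar wavelet, the fixed magnitude threshold and all function names are assumptions chosen for brevity; the text does not name the wavelet or the decision rule that was actually used.

```python
import math


def haar_dwt(signal):
    """One level of the Haar wavelet transform (input length must be even)."""
    s = 1 / math.sqrt(2)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_wavedec(signal, level):
    """Multi-level decomposition, returned as [cA_n, cD_n, ..., cD_1]."""
    details, approx = [], list(signal)
    for _ in range(level):
        approx, detail = haar_dwt(approx)
        details.insert(0, detail)  # coarsest detail band ends up first
    return [approx] + details


def endpoints(values, threshold):
    """First and last index whose magnitude exceeds the threshold, or None."""
    active = [i for i, v in enumerate(values) if abs(v) > threshold]
    return (active[0], active[-1]) if active else None


# Toy signal: silence, a short burst, silence again.
signal = [0, 0, 0, 0, 4, 4, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0]
coeffs = haar_wavedec(signal, level=2)
print(endpoints(coeffs[0], threshold=1.0))
```

Each index in the level-n approximation band covers 2^n original samples, so the detected endpoints are only approximate positions in the original signal.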

Figure 4-8: Principle of the improved VAD method by Zhang et al. [59]

Fusion Score

Limitations of Prior Work

Visual Part (AI)

The face regions of the participants in the SpeakingFaces training subset differed in size after the Haar cascade face detection phase. Zhang et al. [59] did not mention whether the feature images were reshaped or the non-face background of the images was whitened to maintain the same image size. Nevertheless, in the recreated model, the feature images in the database were reshaped during the face matching process when the database was tested.

Audio Part (AK)

Fusion Score

Parameters and Initial Setup

Facial Module (AI)
Speech Module (AK)
Fusion Score
Facial Recognition (AI)
Voice Identification (AK)

To evaluate the effect of the number of neighbors used in face recognition on the accuracy of the model, several values were tested. The number of neighbors is used to improve face identification within the image. Overall, decreasing the scaling factor led to a nonsignificant decrease in the top 5 accuracy scores on participants' frontal faces.
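The top 5 accuracy used in this evaluation can be illustrated with a short sketch. It ranks gallery entries by Euclidean distance to each probe and counts a hit when the true identity appears among the k closest entries. The data layout, the function name and the toy vectors are hypothetical; the feature vectors stand in for the face descriptors.

```python
import math


def top_k_accuracy(probes, gallery, k=5):
    """Fraction of probes whose true label is among the k nearest gallery entries.

    probes:  list of (true_label, feature_vector)
    gallery: list of (label, feature_vector), assumed here to hold one entry per user
    """
    hits = 0
    for true_label, vec in probes:
        # Rank all enrolled entries by Euclidean distance to the probe.
        ranked = sorted(gallery, key=lambda item: math.dist(vec, item[1]))
        top_labels = {label for label, _ in ranked[:k]}
        hits += true_label in top_labels
    return hits / len(probes)


# Toy gallery with two-dimensional "descriptors".
gallery = [("a", [0, 0]), ("b", [1, 0]), ("c", [5, 5]), ("d", [9, 9])]
probes = [("a", [0.1, 0.0]), ("d", [0.0, 0.2])]
print(top_k_accuracy(probes, gallery, k=1), top_k_accuracy(probes, gallery, k=4))
```

Raising k can only keep or increase the reported accuracy, which is why results for different ranking depths are not directly comparable.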

Table 5.1: Achieved TOP5 accuracy for SpeakingFaces for various numbers of neighbors with scaling 1.01 and Euclidean distance

Multimodal Biometric System

As can be observed from the TOP 5, TOP 7 and TOP 10 ranking fusion results, checking fewer similar users in the database led to a lower authentication accuracy, but also to a modest false discovery rate. All results from multimodal biometric systems were obtained by testing 3 frontal faces of 98 subjects on the database with 3 random user audio files. However, due to the randomness of the audio files, the TAR value may fluctuate within 2-3% of the stated values.

Validation on External Datasets

Georgia Tech Face Database (AI)

In addition, certain preprocessing procedures were necessary to reduce the effect of the noisy background (see Appendix A). The performance of the current unimodal component is inferior to the top 5 accuracy of the state-of-the-art approach. The replicated model is less accurate on the Georgia Tech Face Database validation because of its different approach to the user authentication part.

TIMIT (AK)

The difference of 3.85% may be due to the absence of parameter values in the previous work; without these values it was difficult to fully reproduce its results. To test the authentication system against attempts to spoof the biometrics of the user's face, it was necessary to obtain images with combined facial features of both the intruder and the existing user. In this case, based on the audio recording of the system user, a new artificial speech signal was created using the method of Yuxuan Wang et al.

Table 5.11: Comparison of approaches’ results with validation dataset TIMIT

As a result of testing the model on the TIMIT dataset, the recognition accuracy reached 96.5%, which is close to the value obtained using the same architecture in the work of Zhang et al.

No Spoofing

As expected, higher thresholds and ranking number lead to greater vulnerability of the system to external attacks. In general, the performance of the threshold-based system is better for the no-fraud scenario. However, for the complete analysis of the security of the system, the dynamics of the change of the FAR value according to other fraud scenarios for both coupling systems should be reviewed.
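The relationship between the decision threshold and the error rates can be made concrete with a small sketch. Because the text states that higher thresholds increase vulnerability, the example assumes distance-like scores, where an attempt is accepted when its score does not exceed the threshold. The function name and the sample scores are illustrative only.

```python
def rates(genuine, impostor, threshold):
    """Return TAR, FRR, FAR and TRR for distance-like scores.

    Assumption: a lower score means a closer match, so an attempt is
    accepted when score <= threshold.
    """
    tar = sum(s <= threshold for s in genuine) / len(genuine)    # genuine accepted
    far = sum(s <= threshold for s in impostor) / len(impostor)  # impostors accepted
    return {"TAR": tar, "FRR": 1 - tar, "FAR": far, "TRR": 1 - far}


# Illustrative distances only.
genuine_scores = [0.1, 0.2, 0.3, 0.6]
impostor_scores = [0.4, 0.7, 0.8, 0.9]
for threshold in (0.35, 0.65):
    print(threshold, rates(genuine_scores, impostor_scores, threshold))
```

Raising the threshold in this toy example accepts more genuine attempts but also lets the first impostor through, which is the trade-off between TAR and FAR.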

Face-only Spoofing

As can be observed from the tables, targeted attacks, in which an intruder tries to enter the system by providing an authorized user's login, are more successful than attempts in which the attacker knows all authorized usernames in the database and tries to enter the system with each of them. For rank-based fusion scoring, the attacker provides only his biometrics, and the system decides whether to grant access and under which authorized user's credentials. The two multimodal systems maintained almost the same level of TRR as in the no-spoofing scenario.

Voice-only Spoofing

It can be seen that the model is highly vulnerable to voice attacks: the FAR value for each setup is higher than 83%. As a result, in more than 4 out of 5 cases in which an intruder has voice recordings of the authorized user, the attack on the system is successful. On the other hand, the performance of the ranking-based fusion system, shown in Table 6.7, is slightly improved compared to the no-spoofing and face-only spoofing scenarios.

Both Biometrics Spoofing

From Figure 7-1 we can conclude that the top 5 system provides a uniform level of security for a system with a TAR value above 72% for all presented attacks. For further research, it is recommended to compare the system responses under other fusion systems and to increase the variety of attacks on the system. In addition, the security level of the system can be improved by adding model components, such as liveness detection, and can be verified on video data from the SpeakingFaces dataset.

Table 6.9: Performance of the ranking based fusion score under morphed data attacks

It can be clearly observed that both systems provide better security against the morphed data than against the original intruder biometric data.

Integrity

Availability

Reliability
Time-Cost and Accuracy Analysis
Hardware and Computational Resources
Dataset
Model
Time

The reliability of the implemented system was checked on the SpeakingFaces subgroup during the implementation phase. Two major variables, the number of neighbors and the scaling factor, caused several crashes of the system. In addition, existing studies focused on the confidentiality and optimization aspects of the system were reviewed.

Figure 7-2: Effect of the number of neighbors on the single user authentication

Future Works

Appendix A describes the architecture of the constructed biometric authentication model, developed and tested using SpeakingFaces and the noted validation datasets (Ch. 3). The first two sections present the face module pseudocode, together with the required setup, all included libraries and internal functionality. The last section describes the pseudocode for the voice module architecture in full detail.

LBP Pixel Calculation

A-1 represents the encoding scheme of the LBP recalculation: each value in the power_val array is multiplied by the thresholded value of the corresponding adjacent pixel, which is 0 for pixels that are less intense than the center pixel and 1 for pixels of equal or greater intensity.
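The encoding described above can be sketched for a single pixel as follows. Only the name power_val comes from the text; the function name, the neighbour ordering and the plain nested-list image are assumptions made for illustration.

```python
def lbp_pixel(img, r, c):
    """LBP code of the interior pixel (r, c) from its 8 neighbours.

    img is a 2D list of intensities; border pixels are not handled here.
    """
    center = img[r][c]
    # Neighbour offsets, clockwise from the top-left corner (ordering assumed).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    power_val = [1, 2, 4, 8, 16, 32, 64, 128]
    code = 0
    for (dr, dc), weight in zip(offsets, power_val):
        # 1 if the neighbour is at least as intense as the center, else 0.
        bit = 1 if img[r + dr][c + dc] >= center else 0
        code += bit * weight
    return code


patch = [
    [5, 9, 1],
    [4, 6, 7],
    [2, 6, 3],
]
print(lbp_pixel(patch, 1, 1))
```

A different neighbour ordering changes the numeric code of each pattern but not the information it carries, as long as the same ordering is used for enrollment and testing.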

Distance Calculation

Creation of Database

Testing System

Additional Parameters for Georgia Tech Face Database

Voice Module (AK)

Collection of Data
DWT Application
VAD Application
MFCC Feature Extraction
Training and Testing Parts

All samples from the first test set of the dataset are written to the model training file, and 20% of the test files, amounting to 12 voice samples, are written to the second file. The required properties are read from the audio file and then adjusted to the desired interval. Based on the user number, the configured model is written to a file with the appropriate name in the folder that holds all models on Google Drive.

Fusion Score

Threshold Based

To calculate the prediction accuracy across all user files, two percentages are first calculated separately: how often the correct target appears among the top 5 similarity values, and how often the required target is found.

Ranking Based

Multi-modal biometric scheme for human authentication technique based on fusion of voice and face recognition.
An expert multi-modal person authentication system based on feature-level fusion of iris and retina recognition.
Voice verification based on Russian language dataset, MFCC method and anomaly detection algorithm.