The proposed approach uses late fusion to integrate information from different sources and generate a unified summary of the interviews. Its effectiveness is demonstrated on the MIT Interview Dataset, which contains 138 mock job interview sessions conducted with 69 MIT students applying for internships.
Multimodal Video Analysis
The late fusion approach involves developing a separate model for each modality and then merging their outputs using techniques such as averaging, a weighted sum, majority voting, concatenation of the predictions from the different modalities into a single vector, or a deep neural network, as illustrated in Figure 1. A good example of late fusion is shown in [14], which also introduces a new deep neural network (DNN) that combines the audio, video, and text modalities to recognize emotions across multiple media.
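As a minimal sketch, the fusion strategies listed above could look as follows; the per-modality probability arrays are hypothetical stand-ins for the outputs of the visual, audio, and lexical models.

```python
import numpy as np

# Hypothetical per-candidate probability outputs of the three modality
# models; in practice these come from the trained visual, audio, and
# lexical models.
p_visual = np.array([0.72, 0.31, 0.88])
p_audio = np.array([0.65, 0.40, 0.79])
p_lexical = np.array([0.70, 0.25, 0.91])

# Simple average of the modality predictions.
p_avg = (p_visual + p_audio + p_lexical) / 3

# Weighted sum, with illustrative weights favoring the visual modality.
w = np.array([0.5, 0.2, 0.3])
p_weighted = w[0] * p_visual + w[1] * p_audio + w[2] * p_lexical

# Majority voting over thresholded per-modality decisions.
votes = np.stack([p_visual, p_audio, p_lexical]) > 0.5
p_majority = votes.sum(axis=0) >= 2

# Concatenation into a single feature vector per candidate, which a
# downstream classifier (e.g., a DNN) could consume.
fused_features = np.stack([p_visual, p_audio, p_lexical], axis=1)
```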
Multimodal Interview Analysis
They examine the MIT interview dataset to determine the relationship between certain prosodic variables, such as pitch and intensity, and the likelihood of receiving a positive evaluation during the interview. As summarized in Table 2.2, despite the potential value of the information provided in the MIT interview dataset, deep learning algorithms have not yet been applied to it.
Proposed Model
- Visual Modality
- Audio Modality
- Lexical Modality
- Fusion and Final Classification
The CNN component of the model is responsible for analyzing facial expressions and identifying emotions displayed by the candidate during the interview. The final fusion and classification component of the model uses a Feedforward Neural Network (FNN) to classify the candidate as recommended or not recommended for the position. Grayscale conversion reduces the dimensionality of the data and simplifies the input to the model.
Flipping and rotation provide additional variations of the input data, which can improve the model's ability to recognize faces in different orientations. These augmentation techniques artificially increase the size of our dataset, which helps prevent overfitting and improves the generalization performance of the model. For transfer learning, we replace the last fully connected layer, freeze the weights of the convolutional layers, and train only the new fully connected layer on our target dataset. Larger hidden units can, however, increase the computational complexity of the model and require more training data.
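A minimal sketch of this preprocessing and fine-tuning setup is shown below, assuming a PyTorch pipeline with a pretrained ResNet-18 backbone; the backbone choice, augmentation parameters, and number of emotion classes are illustrative assumptions rather than details taken from the text.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

# Preprocessing and augmentation described above: grayscale conversion,
# random flips, and rotations (parameter values are illustrative).
train_transforms = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),  # keep 3 channels for the pretrained backbone
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(degrees=15),
    transforms.ToTensor(),
])

# Transfer learning: freeze the convolutional layers of a pretrained
# backbone (ResNet-18, as an assumption) and retrain only the final
# fully connected layer.
model = models.resnet18(weights="IMAGENET1K_V1")
for param in model.parameters():
    param.requires_grad = False

num_emotions = 7  # assumed number of facial-emotion classes
model.fc = nn.Linear(model.fc.in_features, num_emotions)

# Only the new fully connected layer receives gradient updates.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```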
Ensembling
Evaluation Metrics
Ensembling with Random Forest
Whereas for the final FNN prediction we prepared annotated data by thresholding the target column "recommend hiring" in the preprocessing step, for the Random Forest regressor we take the original intensity of the recommendation column as a continuous variable. A random forest regressor is instantiated with 100 estimators and a random state of 42 using the RandomForestRegressor class from the sklearn.ensemble module. The data is loaded from a CSV file, and five-fold cross-validation is performed to evaluate the model's performance.
The data is divided into five folds of training and validation data; a Random Forest regressor is trained on each training fold and used to make predictions on the corresponding validation fold. This approach provides a more robust evaluation of model performance than a simple train-test split and helps ensure that the model does not overfit the training data. The R² score measures the proportion of variance in the target variable that is predictable from the independent variables.
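Concretely, the R² score is the standard coefficient of determination:

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2},$$

where $y_i$ are the true recommendation intensities, $\hat{y}_i$ the regressor's predictions, and $\bar{y}$ the mean of the targets.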
The output of the regressor is a floating-point number that is interpreted as the candidate's success rate. After the Random Forest regressor has been trained on each fold of the training data, the model is evaluated on the test set to assess its ability to generalize to new, unseen data.
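The described setup can be reproduced with a short script like the following; the CSV file name and the exact target column name are assumptions, while the estimator count, random state, and five-fold cross-validation on the R² score follow the description above.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Load the fused per-candidate features; the file name and the target
# column name are assumed for illustration.
df = pd.read_csv("interview_features.csv")
X = df.drop(columns=["recommend_hiring"])
y = df["recommend_hiring"]  # original continuous recommendation intensity

# Random forest regressor with 100 estimators and a fixed random state,
# evaluated with five-fold cross-validation on the R^2 score.
model = RandomForestRegressor(n_estimators=100, random_state=42)
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(f"R^2 per fold: {scores}, mean: {scores.mean():.3f}")
```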
Correlation of the Behavioral Traits
Using FNN
Using Random Forest
The plot makes clear that the most important quality in an interview is not coming across as awkward. Looking at the output in Figure 4-3, we can see that the predicted outcome has strong positive correlations with columns such as focused, not_awkward, and excited. This suggests that candidates who were more focused, less awkward, and more enthusiastic during the interview tended to receive higher predicted scores.
On the other hand, the predicted result has only a weak positive correlation with the engagement and structured_answers columns. This suggests that although engagement and structured responses are important factors in an interview, they may not be as strongly correlated with overall performance and the predicted outcome.
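A sketch of how such correlations could be computed with pandas is shown below; the scores and the predicted_score column are hypothetical, and only the trait column names follow the discussion above.

```python
import pandas as pd

# Hypothetical per-candidate trait scores and model predictions; in
# practice these come from the annotations and the model's output.
df = pd.DataFrame({
    "focused":            [6.2, 4.1, 5.8, 3.9],
    "not_awkward":        [6.5, 3.8, 6.0, 4.2],
    "excited":            [5.9, 4.0, 5.5, 3.7],
    "engagement":         [5.1, 4.8, 5.3, 4.9],
    "structured_answers": [5.0, 4.6, 5.2, 4.7],
    "predicted_score":    [0.88, 0.42, 0.81, 0.39],
})

# Correlation of each behavioral trait with the predicted outcome.
traits = ["focused", "not_awkward", "excited", "engagement", "structured_answers"]
print(df[traits].corrwith(df["predicted_score"]).sort_values(ascending=False))
```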
Case Study
As the results show, all other behavioral traits score highly on average, so in the contexts mentioned previously this candidate would likely have been recommended for hiring. However, if a candidate scores 0 in the engagement or structured_answers column, the correlations discussed earlier should be taken into account. While a low score in either of these columns can be a red flag, it is important to consider other factors related to the job and the needs of the organization, and to assess candidates holistically.
Compare Results
The table clearly shows that our model achieved a better AUC for almost all of the features. For the labels not stressed, recommend hiring, focused, smiled, and structured answers, accuracies of over 90% were observed. For some of the labels there is a large improvement of about 20%, which demonstrates the effectiveness of the proposed model.
Custom Datasets
Custom Actors Dataset
However, even though the data was not labeled, we reviewed the videos ourselves and found that the model sometimes produces inaccurate results. This could be because the model was trained on a dataset consisting specifically of interview content, whereas the videos in the adapted dataset show random actors speaking about unrelated topics.
Custom Interview Dataset
Based on our self-evaluation of the custom interview dataset, the predictions are largely accurate. The relatively high accuracy of the proposed model in analyzing additional videos with the same content can be attributed to several factors. One possible explanation is that the model has learned to identify patterns in the interview content that are consistently associated with certain outcomes.
For example, the model may have learned to associate certain verbal signals, such as hesitations or repetitions, with specific emotions or behaviors. If the questions are highly predictive of the outcomes being analyzed, the model may be able to predict these outcomes accurately regardless of the specific context or individual being interviewed. At the same time, this dependence on the question content indicates that the model may not be completely accurate in predicting outcomes in every situation.
The custom data may contain biases in terms of race, gender, accent, or other characteristics, and the model may learn to associate certain interview content with specific results based on those biases. Further research and testing are needed to address these limitations and to ensure that the model is robust and effective across different contexts and populations.
Demonstration
Running Example
Next, we use the FNN component to analyze the audio extracted from the interview. The FNN processes extracted features such as the number of pauses, average pause length, average power, and average voice rate to predict whether the candidate expressed an engaged tone during the interview. Finally, we use the LSTM component to analyze the candidate's text-based responses to the interview questions.
The interview transcript is analyzed to predict whether the candidate's responses were well structured and organized, indicating strong communication skills and the ability to convey information effectively. The outputs of the three modalities are then combined into a single data frame containing the analysis of the candidate's facial expressions, tone of voice, and text-based responses during the interview. This data frame is fed into the final fusion and classification component of the model, which uses a Feedforward Neural Network (FNN) to classify the candidate as either recommended or not recommended for the position.
The output of the FNN represents the model's recommendation for hiring the candidate based on the combined analysis of their facial expressions, tone of voice, and text-based responses. In addition to the FNN, an ensemble technique using a Random Forest Regressor was also included in the final classification component of the proposed model for multimodal interview analysis.
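As a simplified sketch of this final step, the per-modality outputs can be collected into a data frame and passed to a small feedforward network; the column names, layer sizes, and the untrained fusion_fnn placeholder are illustrative assumptions, not the trained model itself.

```python
import pandas as pd
import torch
import torch.nn as nn

# Hypothetical outputs of the three modality models for one candidate.
features = pd.DataFrame([{
    "visual_emotion_score": 0.81,  # CNN: facial-expression analysis
    "audio_engaged_tone":   0.74,  # FNN: pauses, power, voice rate
    "text_structured":      0.68,  # LSTM: structure of transcript answers
}])

# Final fusion FNN mapping the combined modality outputs to a
# hire / no-hire recommendation (stand-in for the trained model).
fusion_fnn = nn.Sequential(
    nn.Linear(3, 16),
    nn.ReLU(),
    nn.Linear(16, 1),
    nn.Sigmoid(),
)

inputs = torch.tensor(features.values, dtype=torch.float32)
recommended = fusion_fnn(inputs).item() > 0.5
print("Recommended for the position:", recommended)
```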
Frontend
The proposed multimodal interview analysis system has the potential to provide valuable insights for recruitment recommendations by utilizing deep learning techniques. A fusion approach merges information from multiple modalities into a single data frame, and tabular models are then used to predict overall candidate performance, resulting in hiring recommendations with an accuracy of 92.3%, higher than that of any individual modality alone. These results surpass previous research, which achieved AUCs of approximately 80%, and show that all models proposed in this study achieve state-of-the-art performance.
Overall, the proposed system's ability to capture and analyze information across different modalities can lead to more informed hiring decisions, and it has the potential to be used in other contexts where the analysis of human emotional behavior is relevant. Moreover, the proposed model for multimodal interview analysis showed promising results when evaluated on a custom interview dataset, accurately predicting outcomes in interviews with the same content with an average accuracy of around 80%.
This high accuracy can be attributed to several factors, such as the model's ability to identify consistent patterns in the interview content and the predictive power of the interview questions. These findings provide a good basis for future research and development of the proposed model.
Papers using First Impression v2 dataset
Research conducted on the MIT Interview Dataset
List of Interview questions
Performance of Emotion Recognition from Video Frames
Audio classification result
Text classification result
Comparing Results with [1] based on ROC AUC
Custom Dataset Results
Predicted outcomes on Custom Interview Dataset