According to the Ministry of Community Development database in the United Arab Emirates (UAE), about 3,065 people with disabilities are hearing impaired (Emirates News Agency - Ministry of Community Development). They usually need sign language (SL) interpreters, but as the number of hearing impaired individuals grows, the number of available interpreters cannot keep pace and remains close to non-existent.
In addition, specialized schools lack a unified sign language (SL) dictionary, which can be linked to the diglossic character of the Arabic language and the resulting coexistence of many dialects. Research on Arabic SL is also generally insufficient, which may be associated with this lack of unification in Arabic Sign Language. Therefore, we present an electronic dictionary (e-dictionary) for Emirate Sign Language (ESL) consisting of four functions, namely Dictation, Alpha Webcam, Vocabulary, and Spell, together with two datasets (letters and vocabulary/sentences), to help the community explore and unify ESL.
Ahmed is also one of the eight winning nominees of the first Abu Dhabi Artificial Intelligence Summit 2022, where he showcased his work on Emirate Sign Language and presented the first ESL e-Dictionary system.
Introduction
Therefore, a hearing impaired person can easily learn and explore new words simply by typing them, and the system will show the corresponding sign; likewise, a user can say a word and the screen will show the corresponding sign. The system is also designed to help newcomers and children with hearing loss, as it can teach them the alphabet, spelling, and the first three levels of ESL vocabulary. Finally, the ESL e-Dictionary is aimed at educational institutions teaching ESL, serving as an easily accessible electronic reference work and helping to unify Emirate Sign Language in the region.
Methods
Arabic Sign Language Recognition
Arabic Sign Language Datasets
In their paper, the authors promised to release a dataset called the Jumla Qatari Sign Language Corpus, as a step toward solving the lack of alignment across Arab deaf communities. In 2018, the United Arab Emirates launched the Emirate Sign Language (ESL) dictionary on its official portal under the Zayed Higher Organization (ZHO), built by people of determination to unify ESL signs in the UAE [17]; however, as far as our search went, no ESL dataset was found.
Object Detection
- FCOS
- YOLO
Precision is plotted on the y-axis and recall on the x-axis, as shown in Figure 2. In the Microsoft COCO (MS COCO) challenge, the guidelines [21] set a threshold for IoU, i.e., the minimum IoU value required to count a detection as positive. One of the most popular anchor-free detectors is YOLO version 1 [23]: instead of using anchor boxes, it predicts bounding boxes at locations near the centers of objects and uses only those locations, as they are believed to produce accurate detections [22].
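As a concrete illustration, IoU between two axis-aligned boxes can be computed as follows. This is a minimal sketch, not the evaluation code used in the experiments; the [x1, y1, x2, y2] box format is an assumption.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes in [x1, y1, x2, y2] format."""
    # Coordinates of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    # Intersection area is zero when the boxes do not overlap.
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A detection counts as positive when iou(prediction, ground_truth) >= threshold (e.g. 0.5).
```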
In the heads shared between feature levels, the center-ness branch runs in parallel with the classification layer [22]. The authors of [22] introduced the 'center-ness' branch after observing that the many low-quality bounding box predictions generated by locations far from the center of an object were the major problem of FCOS. The center-ness branch estimates a score associated with these low-quality bounding boxes through the formula shown in (5).
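For reference, equation (5) as defined in the FCOS paper [22] is, with $l^{*}$, $t^{*}$, $r^{*}$, $b^{*}$ denoting the distances from a location to the left, top, right, and bottom sides of its ground-truth box:

$$\text{centerness}^{*} = \sqrt{\frac{\min(l^{*}, r^{*})}{\max(l^{*}, r^{*})} \times \frac{\min(t^{*}, b^{*})}{\max(t^{*}, b^{*})}} \qquad (5)$$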
The center-ness decreases from 1 to 0 as the location deviates from the center of the object. Continuous improvements were made throughout the experiments, such as moving the center-ness branch to the regression branch and sampling only the central parts of the ground-truth boxes as positives. Finally, at the time of publication and the experiment, FCOS with these improvements outperformed all other object detection models of the time, including YOLO version 2, at 44.7 AP.
YOLOv4 [27] addressed a very important problem at the time of its publication: the most accurate neural networks did not run in real time and required a large number of Graphics Processing Units (GPUs) for training. One of its augmentation stages first alters the image to trick the neural network into believing there is no object of interest in it; in the second stage, the network is trained to find the object in those altered images in the normal way.
Emirates Sign Language E-Dictionary
Dataset
Key Gestures
- Keyframes
Thus, as a video preprocessing technique, a low frame rate can yield higher accuracy in few-shot algorithms [30][31]. Therefore, each recording in our dataset was converted to Graphics Interchange Format (GIF) images [32] at 10 keyframes per second. The low number of frames per second keeps the system efficient and supports exploration toward a few-shot learning model.
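A minimal sketch of this conversion step, assuming OpenCV and imageio are available; the file names and the uniform-sampling strategy are illustrative, not the exact pipeline used.

```python
import cv2        # pip install opencv-python
import imageio    # pip install imageio

def video_to_gif(video_path, gif_path, target_fps=10):
    """Down-sample a sign recording to ~10 keyframes per second and save it as a GIF."""
    cap = cv2.VideoCapture(video_path)
    source_fps = cap.get(cv2.CAP_PROP_FPS) or target_fps
    step = max(1, round(source_fps / target_fps))  # keep every `step`-th frame
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            # OpenCV decodes as BGR; GIF writers expect RGB.
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        index += 1
    cap.release()
    imageio.mimsave(gif_path, frames, duration=1.0 / target_fps)

video_to_gif("sign_recording.mp4", "sign_recording.gif")
```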
System Design
- Dictation
- Vocabulary
- Alpha Webcam
- Spell
Clicking the Dictation button activates the feature: the microphone is turned on and waits for voice input, and the microphone state changes to MIC ON. If the recognized word is found in the dictionary, the system searches the sign library to retrieve the matching sign. Once retrieved, the system plays the corresponding sign in the sign frame, replacing the logo, as shown in Figure 21.
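The flow described above can be sketched roughly as follows, assuming the SpeechRecognition package for microphone input; the SIGN_LIBRARY mapping and play_sign helper are hypothetical placeholders for the system's sign library and GUI.

```python
import speech_recognition as sr  # pip install SpeechRecognition (microphone use needs PyAudio)

# Hypothetical mapping from recognized words to sign GIFs in the sign library.
SIGN_LIBRARY = {"مرحبا": "signs/hello.gif"}

def play_sign(gif_path):
    """Placeholder: render the sign GIF inside the sign frame of the GUI."""
    print(f"Playing {gif_path}")

recognizer = sr.Recognizer()
with sr.Microphone() as mic:          # MIC ON: wait for voice input
    audio = recognizer.listen(mic)
text = recognizer.recognize_google(audio, language="ar")  # speech to Arabic text
sign = SIGN_LIBRARY.get(text)
if sign is not None:                  # word found in the dictionary
    play_sign(sign)
```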
The Vocabulary feature allows the user to search for signs textually: the user enters a word in the input box shown in Figure 22, and the system retrieves and plays the corresponding sign if found. Figure 22 shows the Vocabulary input box, a text input field where the user can look up vocabulary and sentences in the sign language. The YOLOv8 model is a state-of-the-art machine learning and object detection model for detecting and recognizing objects [7].
The Alpha Webcam feature detects signs demonstrated via webcam and shows how accurately the sign is being performed. For the Spell feature, the user enters the word or name to be spelled in the ESL alphabet into the input box, and the input is first checked to see whether it is valid or empty. If valid, the input is processed and the corresponding letters are displayed to the user sequentially.
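A minimal sketch of the Spell logic under the same placeholder assumptions; the letter-image mapping and display function are illustrative, not the system's actual implementation.

```python
import time

# Hypothetical mapping from Arabic letters to their ESL alphabet sign images.
LETTER_SIGNS = {"م": "letters/meem.png", "ر": "letters/raa.png"}

def show_letter(image_path):
    """Placeholder: display one letter sign in the GUI."""
    print(f"Showing {image_path}")

def spell(word, delay=1.0):
    """Validate the input, then display the letter signs one by one."""
    word = word.strip()
    if not word:                      # reject empty/invalid input
        return
    for letter in word:
        sign = LETTER_SIGNS.get(letter)
        if sign is not None:
            show_letter(sign)
            time.sleep(delay)         # sequential display for the user

spell("مر")
```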
ArSL21L Benchmark on YOLO Latest Models
YOLOv7
YOLOv8
- Training
- Performance evaluation
Likewise, the box, distribution focal (DFL), and classification losses from the validation stage showed a rapid drop until about 100 epochs, where they also started to plateau. After training YOLOv8x, we introduced the new, unseen split of the dataset to make predictions; examples are shown in Figure 29.
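For reference, fine-tuning and prediction with YOLOv8x via the ultralytics package follow the pattern below; the dataset YAML name and hyperparameters are placeholders, not the exact training configuration used in our experiments.

```python
from ultralytics import YOLO  # pip install ultralytics

# Start from COCO-pretrained YOLOv8x weights and fine-tune on the benchmark data.
model = YOLO("yolov8x.pt")
model.train(data="arsl21l.yaml", epochs=100, imgsz=640)

# Run inference on the held-out, unseen split.
results = model.predict("test_images/", conf=0.25)
for r in results:
    print(r.boxes.xyxy, r.boxes.cls, r.boxes.conf)
```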
Performance Evaluation
Dictation Performance
Text-to-Sign Performance
Response Performance
- Dictation response time
- Text-to-Sign response time
Table 10 shows the given commands and the average response time of the system in both stages. To measure the response time of the Vocabulary function, we calculate the time from receiving the translation request to the system displaying the first frame of the corresponding sign. Ten random words were used, as shown in Table 11, and the average response time of the system is 0.36415 seconds.
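Such response times can be measured with a simple wall-clock timer around the lookup-to-first-frame path; a minimal sketch, where translate_and_show is a hypothetical stand-in for the Vocabulary pipeline.

```python
import time
import statistics

def translate_and_show(word):
    """Hypothetical stand-in: look up `word` and render the first frame of its sign."""
    time.sleep(0.3)  # simulated lookup + rendering latency

def measure_response(words):
    """Average time from receiving a request to showing the first sign frame."""
    samples = []
    for word in words:
        start = time.perf_counter()
        translate_and_show(word)
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples)

print(f"average response: {measure_response(['hello', 'thanks']):.5f} s")
```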
Discussions
Moreover, it was observed that signers always emphasize the key gestures of a sign in order to distinguish between two signs with almost identical demonstrations. As can be seen, there are many similarities between the two signs: the main gesture of bringing the hand to the nose, denoted by K1 and K2, is shared, so the subsequent gesture determines which sign the signer is performing. The signer in Figure 34 repeats the key movements K2 and K3 once more to distinguish between the colors brown and yellow while signing.
However, we chose to ignore this emphasis on key gestures, as we followed the Zayed Higher Organization (ZHO) sign dictionary for people of determination.
Conclusions and Future Works
Further work can be done by expanding the ESL e-Dictionary's dataset, both increasing its size and exploring few-shot learning algorithms, in order to realize a real-time continuous sign language interpreter. In addition, context-aware search and understanding of the Arabic language would improve our system by reducing the WER and enabling a real-time continuous sign language interpreter for Arabic.

References

Edalatishams, "Accuracy of ASR Dictation Programs: Ensure Current Programs Have Been Improved," Proceedings of the 10th Pronunciation in Second Language Learning and Teaching Conference.
Wahed, "Translation of Arabic Speech to Arabic Sign Language Based on Cloud Computing," Egyptian Informatics Journal.
El Atawy, "Intelligent Arabic Text to Arabic Sign Language Translation for Easy Deaf Communication," International Journal of Computer Applications.
Jemni, "For a translation system from Arabic text to sign language," Universal Learning Design: Proceedings of the Conference Universal Learning Design.
Zaiton, "An Avatar-Based Translation System from Arabic Speech to Arabic Sign Language for Deaf People," International Journal of Information Science Education.
Mahmoud, "Automatic translation of Arabic text into Arabic Sign Language," Universal Access in the Information Society.
He, "FCOS: Fully Convolutional One-Stage Object Detection," Proceedings of the IEEE/CVF International Conference on Computer Vision.
Han, "Few-Shot Learning of Video Action Recognition Only Based on Video Contents," Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision.
Liangliang, "Video2GIF: Automatic Generation of Animated GIFs from Video," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Dollár, "Focal Loss for Dense Object Detection," Proceedings of the IEEE International Conference on Computer Vision.