Handwriting Recognition in Indian Regional Scripts: A Survey

A method based on recursive contour following is proposed in the same article for effective segmentation of component characters. The performance details of some of the OHR techniques for Oriya script are given in Table V. The features used for recognition were based on direction information obtained from the arc tangent of the gradient.
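As a rough illustration of direction features obtained from the arc tangent of the gradient, the following minimal sketch builds a magnitude-weighted histogram of gradient directions. The function name, the central-difference gradient, the list-of-rows image format and the choice of 8 bins are assumptions made for this example; the surveyed work may differ in all of these.

```python
import math

def gradient_direction_histogram(image, bins=8):
    """Histogram of gradient directions, weighted by gradient magnitude.

    `image` is a list of rows of grey levels (an assumed input format).
    """
    h, w = len(image), len(image[0])
    hist = [0.0] * bins
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Central differences approximate the image gradient.
            gx = image[y][x + 1] - image[y][x - 1]
            gy = image[y + 1][x] - image[y - 1][x]
            mag = math.hypot(gx, gy)
            if mag == 0:
                continue
            # atan2 gives the direction in (-pi, pi]; shift it to [0, 2*pi).
            angle = math.atan2(gy, gx) % (2 * math.pi)
            b = int(angle / (2 * math.pi) * bins) % bins
            hist[b] += mag
    total = sum(hist)
    # Normalise so that the feature does not depend on overall contrast.
    return [v / total for v in hist] if total else hist
```

In practice such histograms are usually computed per image block and concatenated, so that the feature vector retains some spatial layout.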

MLP-based pattern classifiers are used in the work for most classification tasks. The features used in the MQDF were the directional features of the contour points of the components.

Advancements in Kannada and Telugu OHR

The highest reported accuracy for handwritten Gujarati numerals is 81.66% [Desai 2010] and for handwritten characters is 71.66% [Prasad et al.]. To the best of our knowledge, there are no reports of any work towards recognition of handwritten Gujarati compound characters and words. Similarly, to the best of our knowledge, there are no reports of any work towards the recognition of handwritten compound characters and words in Oriya script; these remain open research problems for the near future.

Four different types of structural features were used for the recognition of handwritten Kannada numerals in the scheme proposed by Dhandra et al. A support vector machine is used for the classification of handwritten Kannada numerals using zone and distance metric features in the scheme proposed by Rajashekararadhya and Ranjan [2009a]. Another scheme [2006] feeds chain code features to a quadratic classifier for recognizing handwritten Kannada numerals.

Rajashekararadhya and Ranjan [2008b] proposed a system in which features based on area and distance metrics are extracted for the recognition of handwritten numerals in Telugu and Kannada. Another work [2008] used a modified quadratic classifier (MQC) for recognizing handwritten characters belonging to the Telugu, Kannada and Tamil scripts. In a scheme [2010] for handwritten Telugu character recognition, the probability of identification of a given input character was obtained using five distance measurement methods.

To the best of our knowledge, no work has been reported so far in the direction of recognizing compound handwritten characters and handwritten words in Kannada and Telugu scripts.

Advancements in Tamil and Malayalam OHR

One of the challenging tasks in Kannada and Telugu scripts is unconstrained handwritten word recognition, because the position of compound characters and modifiers makes segmentation difficult. In the work of Paulpandian and Ganapathy [1993], topological features are extracted from handwritten Tamil characters. In the first phase of the proposed scheme, an input character is assigned to one of a few smaller groups of characters using the K-means clustering (KMC) technique.
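The first-phase grouping can be sketched as follows. This is a plain K-means implementation written for illustration only: the function names, the random initialisation, the fixed iteration count and the use of squared Euclidean distance are assumptions, and the feature vectors actually used in the surveyed scheme are not reproduced here.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(centroids, vector):
    """Index of the centroid closest to `vector`."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(centroids[i], vector))

def kmeans(vectors, k, iterations=20, seed=0):
    """Return k centroids fitted to `vectors` (a list of equal-length tuples)."""
    rng = random.Random(seed)
    # Start from k distinct training vectors chosen at random.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in vectors:
            groups[nearest(centroids, v)].append(v)
        for i, members in enumerate(groups):
            if members:  # keep the old centroid if a group is empty
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids
```

At recognition time, `nearest(centroids, features)` would select the group, and a classifier trained only on the characters of that group would make the final decision.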

In the second stage, they used the histogram features of the chain code computed from the input character outline together with a separate MLP classifier for each set to recognize the characters. Character recognition was performed using a Class Modular Neural Network (CMNN) with the mentioned features. An unknown input character is classified into one of two classes and then recognized as one of the characters in that class.
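A chain code histogram of a character outline can be sketched as below. The example assumes that the outline has already been traced into an ordered, closed, 8-connected list of (x, y) points with y growing downwards, and it uses the usual Freeman 8-direction numbering; contour tracing itself and any block-wise splitting of the histogram are left out.

```python
# Freeman 8-direction codes for a step (dx, dy), with y growing downwards.
DIRECTIONS = {
    (1, 0): 0, (1, -1): 1, (0, -1): 2, (-1, -1): 3,
    (-1, 0): 4, (-1, 1): 5, (0, 1): 6, (1, 1): 7,
}

def chain_code(contour):
    """Chain code of a closed, 8-connected contour given as a list of points."""
    codes = []
    # Pair each point with its successor, wrapping around to close the contour.
    for (x0, y0), (x1, y1) in zip(contour, contour[1:] + contour[:1]):
        codes.append(DIRECTIONS[(x1 - x0, y1 - y0)])
    return codes

def chain_code_histogram(contour):
    """Relative frequency of each of the 8 directions along the contour."""
    codes = chain_code(contour)
    hist = [0] * 8
    for code in codes:
        hist[code] += 1
    return [count / len(codes) for count in hist]
```

The resulting 8-element vector is the kind of feature that could be passed to a per-group MLP; how the surveyed scheme normalises or partitions it is not specified here.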

In the same article, an attempt is made to use the fuzzy concept as a tool for the classification of handwritten Tamil characters. The feature vector consists of the distances of the sample from the frame in 16 different directions. In the second stage, pixel density features along with a statistical classifier based on interval estimation were used for the recognition process.
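Pixel density features of the kind mentioned above are commonly computed by zoning. The sketch below divides a binary character image into a grid and returns the fraction of foreground pixels in each zone; the 4 x 4 grid, the function name and the list-of-rows image format are assumptions, and the interval-estimation classifier is not shown.

```python
def zone_densities(image, rows=4, cols=4):
    """Fraction of foreground pixels in each zone of a binary image.

    `image` is a list of rows containing 1 (ink) or 0 (background).
    """
    h, w = len(image), len(image[0])
    features = []
    for r in range(rows):
        y0, y1 = r * h // rows, (r + 1) * h // rows
        for c in range(cols):
            x0, x1 = c * w // cols, (c + 1) * w // cols
            area = (y1 - y0) * (x1 - x0)
            ink = sum(image[y][x] for y in range(y0, y1) for x in range(x0, x1))
            # Zones with no pixels (image smaller than the grid) get density 0.
            features.append(ink / area if area else 0.0)
    return features
```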

The implementation details of some of the OHR techniques in Tamil and Malayalam scripts are given in Tables X and XI.

Advancements in Urdu OHR

They also noted that an MLP network requires significant time for training but only a fraction of a second for testing. The training time required for a statistical classifier such as MQDF was much shorter than for an MLP, whereas its testing time was longer. Recognition of handwritten compound characters and words has yet to be investigated in Tamil and Malayalam scripts.

For each of these points, the shape context is a histogram of the relative positions of the remaining points. The similarity between two instances is the weighted sum of the cost of matching shape contexts and the bending energy (BE), which measures the work needed to transform one instance into the other. For faster processing, Haider and Yusuf [2007] also presented a gradual pruning approach based on the differences between the test object and the objects in the prototype set.
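A shape context and a matching cost between two shape contexts can be sketched as follows. The log-polar binning (5 radial by 12 angular bins), the normalisation by the mean distance from the reference point, the clamping of out-of-range distances into the end bins and the chi-square cost are common choices assumed for this example rather than details taken from the surveyed work; the bending energy term is not implemented.

```python
import math

def shape_context(points, index, radial_bins=5, angular_bins=12):
    """Log-polar histogram of the other points relative to points[index]."""
    px, py = points[index]
    others = [(x - px, y - py) for i, (x, y) in enumerate(points) if i != index]
    dists = [math.hypot(dx, dy) for dx, dy in others]
    mean_dist = sum(dists) / len(dists)  # normalising gives scale invariance
    hist = [[0] * angular_bins for _ in range(radial_bins)]
    # Log-spaced radial range between 1/8 and 2 times the mean distance.
    lo, hi = math.log(0.125), math.log(2.0)
    for (dx, dy), d in zip(others, dists):
        if d == 0:
            continue  # ignore points that coincide with the reference point
        r = math.log(d / mean_dist)
        r_bin = int((r - lo) / (hi - lo) * radial_bins)
        r_bin = min(max(r_bin, 0), radial_bins - 1)
        theta = math.atan2(dy, dx) % (2 * math.pi)
        t_bin = int(theta / (2 * math.pi) * angular_bins) % angular_bins
        hist[r_bin][t_bin] += 1
    return hist

def chi_square_cost(h1, h2):
    """Chi-square distance between two shape contexts (0 means identical)."""
    a = [v for row in h1 for v in row]
    b = [v for row in h2 for v in row]
    na, nb = sum(a) or 1, sum(b) or 1
    cost = 0.0
    for x, y in zip(a, b):
        x, y = x / na, y / nb
        if x + y > 0:
            cost += (x - y) ** 2 / (x + y)
    return 0.5 * cost
```

A full matcher would compute this cost for every pair of points from the two shapes, solve the resulting assignment problem, and then add the weighted bending energy of the aligning transformation.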

In a scheme [2009] for the classification and recognition of handwritten Urdu words, gradient, structural and concavity (GSC) features were used along with a support vector machine (SVM). The performance details of some of the OHR techniques for Urdu script are given in Table XII. From the table, it is clear that the technique proposed by Liu and Suen [2009] for handwritten Urdu numerals outperforms the others in terms of accuracy (99.73%).

The experiments conducted by Liu and Suen with the gradient direction histogram feature showed that four classifiers (DLQDF, PNC, CFPC, and SVM) achieved very high accuracy compared to MLP and MQDF.

HANDWRITTEN SCRIPT IDENTIFICATION

KNN Based Techniques
Neural Network Based Techniques
Support Vector Based Techniques
Other Techniques

To recognize such documents, we need to identify the different script portions so that the respective script OCR can be applied to each. Identifying handwritten scripts is very complex and challenging because writing styles differ between individuals. The presence of similarly shaped characters in two or more scripts poses further challenges in the identification of Indian scripts.

To use this feature, morphological reconstruction and regional descriptors were used in the same work as identification features. For example, in Devanagari script documents we can get many vertical lines, while in Malayalam script documents we get many convex shaped features in a repetitive manner. Bangla, Latin, Devanagari, Kannada, Malayalam, Tamil, Telugu and Urdu scripts are considered for this work.

Roy and Pal [2006] proposed a scheme for word-wise identification of handwritten Roman and Oriya scripts for Indian postal automation. Most Oriya characters have a larger cavity region in their lower part, whereas no such characteristic cavity is found in Roman. The approach of Roy and Pal [2006] exploits this cavity pattern for script identification.

Based on these classification decisions, a rule-based script inference engine is designed to infer the script of a numeric postal code.

DATABASES FOR INDIAN REGIONAL SCRIPT OHR

For script identification of lines of handwritten text, Chaudhuri and Bera [2009] proposed a dual method based on the interdependence between text line gap and interline gap. Institutions that have developed handwritten datasets include the Indian Institute of Science, Bangalore and the University of Mysore (Kannada), and CDAC, Thiruvananthapuram and Kannur University (Malayalam). In developing some of the datasets at the Indian Statistical Institute, Calcutta, such as those for Bangla numerals, characters and city names and for Oriya characters, various factors were considered.

Some of these datasets are very large and were collected from different categories of people, including schoolchildren, university students, businessmen, employees and unemployed people, to obtain a variety of handwriting styles. Some samples in these datasets were also collected against a noisy background to make the datasets more challenging. In addition, some samples were scanned with low-resolution scanners to obtain inferior quality data.

The size of some datasets (Kannada, Tamil and Telugu) is not very large and there is a need to develop large datasets for these scripts. A new large handwritten Urdu database was collected at the Center for Pattern Recognition and Machine Intelligence (CENPARMI), Canada [Sagheer et al.]. It contains 60,329 isolated digits, 12,914 digit sequences with/without decimals, 1,705 special symbols, 14,890 isolated characters, 19,432 words (mainly financial related) and 318 Urdu dates in various patterns. The set of 156 characters represented in this collection includes, in addition to independent vowels and consonants, compound characters and vowel diacritics that appear as separate characters to the left or right of the base consonant.

An offline version of the data is available in the form of bi-level TIFF images, generated from the online data using simple piecewise linear interpolation with a constant thickening factor applied.
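The conversion from online strokes to an offline bi-level image can be sketched as below. Consecutive pen samples are joined by straight segments (piecewise linear interpolation) and each interpolated point is thickened by a fixed amount. The function name, the square thickening neighbourhood and the `thickness` parameter are assumptions for this example, and writing the result to a TIFF file is not shown.

```python
import math

def render_strokes(strokes, width, height, thickness=1):
    """Rasterise pen strokes into a bi-level image (1 = ink, 0 = background).

    `strokes` is a list of strokes, each a list of (x, y) pen samples.
    """
    image = [[0] * width for _ in range(height)]

    def stamp(cx, cy):
        # Thicken by marking a square neighbourhood around the point.
        for y in range(cy - thickness, cy + thickness + 1):
            for x in range(cx - thickness, cx + thickness + 1):
                if 0 <= x < width and 0 <= y < height:
                    image[y][x] = 1

    for stroke in strokes:
        if len(stroke) == 1:  # a single sample is drawn as a dot
            stamp(round(stroke[0][0]), round(stroke[0][1]))
        for (x0, y0), (x1, y1) in zip(stroke, stroke[1:]):
            # Enough steps that successive points are at most one pixel apart.
            steps = int(math.ceil(max(abs(x1 - x0), abs(y1 - y0), 1)))
            for i in range(steps + 1):
                t = i / steps
                stamp(round(x0 + t * (x1 - x0)), round(y0 + t * (y1 - y0)))
    return image
```

A constant `thickness` gives every stroke the same width regardless of pen speed or pressure, which keeps the rendering simple at the cost of some realism.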

SOME GENERAL OBSERVATIONS

Better Pre-Processing for Higher Accuracy
Selection of Features
Resolution of Most Confusing Characters
Research Related to Word Recognition
Coordination Among Researchers
Research Related to Handwritten Word Spotting
Classifier Combinations
Script Specific Observations
Access to Research Materials
Reliability Measure of the System

Thus, preprocessing of the character image is an important step before feature extraction and classification. A review shows that the histogram of directional elements, of both the local contour and the gradient, is widely used for character recognition. For example, using script-specific features will certainly increase the accuracy of the OHR system.

Also, combining multiple classifiers that use complementary features can significantly improve the reliability of the recognition system, and this should be studied for Indian regional scripts. Because writing styles differ between individuals, stroke-based features may not produce good results. Generally, these character pairs or triplets have a very small distinguishing feature, as we can see in the printed version of the characters.
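One simple way to combine classifiers, shown here only as an illustration of the idea raised above, is the weighted sum rule: each classifier outputs a score per class, the scores are averaged, and the class with the highest combined score is chosen. The function names, the dictionary-based score format and the class labels in the usage example are hypothetical.

```python
def combine_scores(score_lists, weights=None):
    """Weighted sum rule over per-class scores from several classifiers.

    `score_lists` holds one dict per classifier, mapping class label to score.
    """
    n = len(score_lists)
    weights = weights or [1.0 / n] * n  # equal weights unless specified
    labels = score_lists[0].keys()
    return {
        label: sum(w * scores[label] for w, scores in zip(weights, score_lists))
        for label in labels
    }

def predict(score_lists, weights=None):
    """Label with the highest combined score."""
    combined = combine_scores(score_lists, weights)
    return max(combined, key=combined.get)

# Usage: two classifiers disagree; the combined score decides.
contour_based = {"ka": 0.6, "kha": 0.4}
gradient_based = {"ka": 0.2, "kha": 0.8}
print(predict([contour_based, gradient_based]))
```

The combination helps most when the individual classifiers make different errors, which is why complementary feature sets are recommended.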

Many researchers conduct short-term research as part of their academic program. Many ancient and historical documents are handwritten, and such techniques will certainly be useful to historians and language students. Many of the works on regional script OHR are not available online for evaluation, as many researchers prefer to publish their work at local conferences and workshops.

Most authors working on Indic script OHR have not provided any indication of the reliability of their systems.

CONCLUSIONS

This article presents a review of research related to offline recognition of handwritten characters of Indian regional scripts.