Aksara Jawa Text Detection in Scene Images
using Convolutional Neural Network
Muhammad Labiyb Afakh∗, Anhar Risnumawan†, Martianda Erste Anggraeni‡, Mohamad Nasyir Tamara§, and Endah Suryawati Ningrum¶
∗Computer Engineering Division, †§¶Mechatronics Engineering Division, ‡Department of Creative Multimedia
Politeknik Elektronika Negeri Surabaya (PENS)
Kampus PENS, Jalan Raya ITS Sukolilo, Surabaya, 60111 Indonesia
Email: ∗[email protected], {†anhar, ‡martianda, §nasir_meka, ¶endah}@pens.ac.id
Abstract—Aksara Jawa is an ancient Javanese script that has been used since the 17th century. The characters are mostly written on stones to describe history or to name places, weddings, tombstones, etc. This script, however, is gradually being ignored. It is therefore extremely important to preserve this nearly lost cultural heritage. In this paper, as a step toward preserving it and converting visual information into text, we develop an Aksara Jawa text detection system for scene images, employing a deep convolutional neural network to localize occurrences of Aksara Jawa text. This method differs from existing Aksara Jawa works, which employ manually hand-crafted features and explicitly learn a classifier. Here, the features and classifier are jointly learned, with back-propagation employed to obtain all parameters simultaneously. A text confidence map is produced, followed by bounding box formation to indicate the occurrence of text lines. The experimental results of our method are promising on both Aksara Jawa and English text.
Keywords—Aksara Jawa, text detection, scene image, deep learning, convolutional neural network, joint learning
I. INTRODUCTION
Aksara Jawa is an ancient Javanese script that has been used since the 17th century, during the Mataram kingdom. Much of Indonesian history is written in this script, typically on stones. Some people still use these characters, for example for place names, tourist spots, weddings, tombstones, etc. Today, this cultural heritage is slowly being ignored as people become increasingly educated. To preserve it, an effort toward digitizing Aksara Jawa text is extremely important.
Converting visual information into text has gained wide attention due to its usefulness in many real-world applications, such as tourist navigation, spotting place names, robot navigation, assisting visually impaired people, enhancing safe vehicle driving, etc. [1], [2]. Aksara Jawa detection is highly useful for a tourist who visits Java to navigate to a location or quickly grasp the point of an ancient script, as shown in Fig. 1. It can likewise be used by Javanese people to understand the script, for both research and non-research applications. Thus, there is a genuine need to develop Aksara Jawa text detection to preserve this nearly lost Javanese cultural heritage and to help people obtain the meaning of the ancient script.
Toward the goal of digitizing Aksara Jawa text, in this paper we develop a system to detect Aksara Jawa text in scene images.
Fig. 1. Aksara Jawa text in scene images. The background is usually complex, e.g., text written on stone, cluttered non-text objects, or the regular pattern of a roof, which are not easily distinguishable from the text.
Differing from the setting of existing Aksara Jawa text works [3]–[7], where most characters typically lie on uniform backgrounds, e.g., the white background of a document, detecting Aksara Jawa text in scene images is more challenging due to the complexity of the background and the high variation of fonts, sizes, and colors. Despite this complexity, Aksara Jawa text has to be robustly detected, indicated by forming bounding boxes on the detected text.
Existing Aksara Jawa text works [3]–[7] generally utilize many features which are cleverly integrated, followed by a classifier to distinguish text from non-text, and then post-processing such as bounding box formation. This has been done to obtain a more discriminative system. Each feature is individually designed and manually tuned before being connected to a classifier followed by post-processing. The features are thus prone to error due to imperfections of the manual design and require tedious hand-tuning before the desired results are obtained. Moreover, the features and the classifier have only a one-way connection: the features influence the classifier but not the other way round.
In this work, by contrast, we employ a deep convolutional neural network (CNN) that learns the feature representation automatically and has a two-way connection between features and classifier. The features and classifier are jointly learned, with back-propagation employed to obtain all parameters simultaneously. A text confidence map is built from the input image using the trained CNN text detector. Then, bounding boxes are estimated and formed to indicate the occurrence of text lines. The experimental results of our proposed method are promising on both Aksara Jawa and English text.
The rest of this paper is organized as follows. Section II describes related work. Section III explains the methodology: III-A introduces the Aksara Jawa characters, III-B describes the CNN structure, and III-C explains the bounding box formation. Sections IV and V describe the experiments and the conclusion, respectively.
II. RELATED WORK
Feature extraction based on image segmentation and skeletonization was performed by [3] for recognizing Aksara Jawa text in documents. After skeletonization, post-processing is applied, including detection of simple closed curves, straight lines, and curves. Several experiments were run with various parameters and Javanese characters in order to obtain the optimal parameters. In practice, this method is difficult to implement as it contains many parameters.
The Directional Element Feature (DEF) was employed by [4] together with an SVM classifier to recognize Aksara Jawa text in documents. DEF is built by counting edge neighborhood elements in each character image, and this feature becomes the input to the SVM classifier. It is, however, unclear how the method deals with noise, for example when the characters vary.
Image processing techniques such as character segmentation, normalization, grayscale conversion, and binarization were employed by [6], together with a neural network classifier trained using common back-propagation. The method, however, contains many heuristic rules which are difficult to satisfy in practice. The effectiveness of the method for Aksara Jawa detection is also unclear.
A template matching technique was used by [7]. It is unclear how that work tested the performance of the system, as template matching has a well-known generalization problem. The method might not work on even slight variations of Aksara characters.
Deep learning [9] has recently shown excellent results in various visual perception tasks, such as image classification [10], [11], image segmentation [12]–[14], and object detection [15]–[19]. Deep learning models such as convolutional neural networks have the ability to automatically learn effective feature representations from the visual input, which makes them well suited to most visual detection tasks. The features are learned automatically, without manual design, so that the parameters can be obtained purely from the data.
The existing Aksara Jawa text works [3]–[7] above have focused on document analysis, where the text is usually monotone and written on a uniform background such as white paper. Moreover, the features tend to be manually designed with many heuristic rules, followed by a classifier to distinguish text from non-text. The features are thus prone to error due to imperfections of the manual design and require tedious hand-tuning before the desired results are obtained. Moreover, the features and the classifier have only a one-way connection: the features influence the classifier but not the other way round. In this work, by leveraging a deep network and a sliding window technique, we show that it is possible to solve these problems and build a robust Aksara Jawa text detection system for scene images.
III. AKSARA JAWA TEXT DETECTION USING CNN
An overall overview of the proposed system is shown in Fig. 2. Given an input scene image, a response map is computed from the trained CNN, and bounding boxes are formed using non-maximal suppression to indicate the detected text lines. One can see that Aksara Jawa characters have relatively similar shapes, which can cause false detections, for example due to the regular pattern of a roof. This differs from English text detection methods such as [20], [21], which are based on connected components analysis.
A. Aksara Jawa character
Aksara Jawa has 20 base characters (nglegena script), 20 pair characters as closings of vowels, 8 main characters (Murda script, some unpaired), 8 pairs of main characters, 5 rekan characters and their pairs, some sandhangan for controlling vowels, some special characters, punctuation marks, and writing marks. Some of these characters are shown in Fig. 3. Differently from English text, Aksara Jawa has marks above and below the characters, which makes detection relatively difficult. We found that these marks can deteriorate the detection accuracy, as they produce many bounding boxes for a single text line.
B. Text Detector CNN
In this work, a CNN is employed to learn the feature representation, optimizing the features and the classifier simultaneously. This is in contrast to the existing manual feature designs, which are optimized through a tedious trial-and-error process that tunes the features and the classifiers individually.
Multiple classifiers connected in series are a common way to increase the performance of a detection system; examples are [22], [23], which utilize cascaded classifiers to boost accuracy.
Conventional detection algorithms classify whether a rectangular region u contains text or not. These methods can be summarized as follows: a set of hand-crafted features Q(u) = (Q_1(u), Q_2(u), ..., Q_N(u)) is extracted, and a binary classifier k_l is then learned for each label. The classifier is a separate stage from the features. The primary objective is to detect the label l contained in region u, such that l* = argmax_{l ∈ L} P(l|u), where the labels are L = {text, non-text} and P(l|u) = k_l(Q(u)) is the posterior probability distribution over labels given the input.
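To make this two-stage structure concrete, the following Python sketch separates the hand-crafted features Q(u) from the per-label classifiers k_l. The feature functions and classifier callables here are hypothetical placeholders for illustration, not the features used by any of the cited works.

```python
import numpy as np

def extract_features(u):
    """Hand-crafted features Q(u) = (Q1(u), ..., QN(u)) for a region u."""
    q1 = u.mean()                           # e.g., mean intensity
    q2 = u.std()                            # e.g., contrast
    q3 = np.abs(np.diff(u, axis=0)).mean()  # e.g., vertical gradient energy
    return np.array([q1, q2, q3])

def detect_label(u, classifiers):
    """l* = argmax_{l in L} P(l|u), with P(l|u) = k_l(Q(u)).
    `classifiers` maps each label l to its separately learned scorer k_l."""
    q = extract_features(u)
    return max(classifiers, key=lambda l: classifiers[l](q))
```

Note that the feature stage and the classifier stage are independent objects: improving one does not change the other, which is exactly the one-way connection criticized above.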
Fig. 2. Overview of the Aksara Jawa text detection system: (a) block diagram of the overall system, (b) input image, (c) response map, (d) forming bounding boxes. The response map is shown as the probability output [0,1] from the CNN, ranging from blue (lowest) through yellow to red (highest). Note that most of the probability values spread around the detected text line. Best viewed in color.
We use f^c_m to indicate the c-th channel feature map on layer m. A new feature map f^n_{m+1} is produced after each convolutional layer such that

f^n_{m+1} = h_m(g^n_m), where g^n_m = Σ_c f^c_m ∗ k^{n,c}_m + b^n_m, (1)

and g^n_m, k^{n,c}_m, and b^n_m indicate the n-th net input, filter kernel, and bias on layer m, respectively. Normalization, subsampling, and pooling, which are usually intertwined with convolutional layers, are used to build translation invariance in local neighborhoods. This work uses an activation layer h_m such as the Rectified Linear Unit (ReLU), h_m(f) = max{0, f}. A pooling layer is obtained by taking the maximum or the average over a local neighborhood contained in the c-th channel of the feature maps. The process starts with f_1 equal to the resized box region u, performs convolution layer by layer, and ends by connecting the last feature map to a logistic regressor for classification to obtain the output probability of the correct label. All model parameters are learned simultaneously, solely from the training data, commonly using Stochastic Gradient Descent (SGD) and back-propagation to minimize a loss over the training labels.
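As an illustration of Eq. 1, a minimal NumPy/SciPy sketch of one convolutional layer with ReLU activation is given below; the shapes and variable names are ours, chosen only to mirror the notation above.

```python
import numpy as np
from scipy.signal import correlate2d

def conv_layer(f, k, b):
    """One layer of Eq. 1. f: (C, H, W) input maps f_m; k: (N, C, kh, kw)
    kernels k_m; b: (N,) biases b_m. Returns f_{m+1}: (N, H-kh+1, W-kw+1)."""
    out = []
    for n in range(k.shape[0]):
        # Net input g^n_m = sum_c f^c_m * k^{n,c}_m + b^n_m
        g = sum(correlate2d(f[c], k[n, c], mode="valid")
                for c in range(f.shape[0])) + b[n]
        out.append(np.maximum(0.0, g))  # h_m(f) = max{0, f} (ReLU)
    return np.stack(out)
```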
Fig. 3. Aksara Jawa base characters and vowels.
To localize text in scene images, a sliding window is applied on a multiscale input image. Each window is classified by our trained CNN, yielding a text confidence map. Based on this map, bounding boxes are estimated to acquire candidate text lines. The multiscale input is constructed from 10% to 150% of the original input image size in 10% steps.
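A minimal sketch of this multiscale sliding-window scan might look as follows, assuming a trained `net` callable that maps a 32×32 patch to a text probability (a hypothetical name); the stride value is an illustrative assumption, as the paper does not state it.

```python
import cv2
import numpy as np

def text_confidence_maps(image, net, win=32, stride=8):
    """Scan a multiscale pyramid (10%..150% in 10% steps) with a sliding
    window; `net` returns P(text | patch) for each window."""
    maps = []
    for pct in range(10, 151, 10):
        s = pct / 100.0
        scaled = cv2.resize(image, None, fx=s, fy=s)
        h, w = scaled.shape[:2]
        if h < win or w < win:
            continue  # scale too small for a single window
        rows = (h - win) // stride + 1
        cols = (w - win) // stride + 1
        conf = np.zeros((rows, cols))
        for i in range(rows):
            for j in range(cols):
                patch = scaled[i*stride:i*stride + win, j*stride:j*stride + win]
                conf[i, j] = net(patch)
        maps.append((s, conf))
    return maps
```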
We employ the CNN architecture shown in Fig. 4. Each sliding window yields an image patch u ∈ R^{32×32} which is used as input. First, the input u is normalized by its mean and variance and then becomes the input to two convolutional and pooling layers. Interestingly, this normalization is important to reduce the loss, as the input is centered and bounded to [0,1]. The numbers of filters in the first and second layers are N1 = 96 and N2 = 128, respectively. A ReLU activation layer is intertwined with each convolutional layer and with the first fully-connected (FC) layer. ReLU simply passes all feature map values that are not less than zero, h_m(f) = max{0, f}. In practice, ReLU helps to increase the convergence rate of learning compared to the sigmoid function, which can easily deteriorate the gradient.
In our experiments, we found empirically that a sigmoid function after the last fully-connected layer yields better performance. This is because the output is bounded to be within the labels, which are {0, 1} for non-text and text, respectively. Therefore, we apply a sigmoid activation function followed by a softmax loss layer after the last fully-connected layer. At test time, the network provides a probability output value for each label.
Fig. 4. Structure of the convolutional neural network employed in this work, from the input patch to the output labels.
The fully-connected layers project the feature maps onto a lower-dimensional space or map them onto a higher dimension to further apply linear operations.
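A PyTorch sketch of the network in Fig. 4 is given below under stated assumptions: the paper fixes the input size (32×32) and the filter counts (N1 = 96, N2 = 128), but the kernel sizes (5×5 here) and the first FC width (256 here) are not given and are chosen for illustration only.

```python
import torch
import torch.nn as nn

class TextDetectorCNN(nn.Module):
    """32x32 patch -> {non-text, text}; N1 = 96, N2 = 128 as in the paper."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 96, kernel_size=5),    # 32 -> 28
            nn.ReLU(),
            nn.MaxPool2d(2),                    # 28 -> 14
            nn.Conv2d(96, 128, kernel_size=5),  # 14 -> 10
            nn.ReLU(),
            nn.MaxPool2d(2),                    # 10 -> 5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 5 * 5, 256),        # first FC (width assumed)
            nn.ReLU(),
            nn.Linear(256, 2),                  # last FC: non-text / text
            nn.Sigmoid(),                       # sigmoid before the loss layer
        )

    def forward(self, u):
        # Normalize each patch by its mean and variance, as described above.
        mu = u.mean(dim=(2, 3), keepdim=True)
        sd = u.std(dim=(2, 3), keepdim=True)
        return self.classifier(self.features((u - mu) / (sd + 1e-8)))
```

During training, the sigmoid outputs would be fed to the softmax loss layer described above.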
The existing frameworks above can be viewed as a kind of CNN [24], where each stage corresponds to a layer. The extracted features can be represented by feature maps, including their channels, and the non-linear transformation as a convolutional operation as in Eq. 1, containing a non-linear activation function of the convolution between the filter kernel and the feature maps. The main difference in a CNN, however, is that feature integration, pooling, and non-linear transformation are all served in a single layer. The computation is more efficient and fully feed-forward, so the method can optimize the input-to-output mapping consisting of the features and the classifier. A new layer can easily be inserted for a more discriminative system.
Fig. 5. Precision score comparison for each image.
Fig. 6. Recall score comparison for each image.
C. Generating bounding boxes
From the response maps, we apply non-maximal suppression (NMS) as in [25] to form bounding boxes. More specifically, for each row in the response maps, the possibly overlapping line-level bounding boxes are scanned iteratively. Each bounding box is then scored by the mean probability value it contains in the response maps. NMS is then applied again to remove overlapping bounding boxes.
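A minimal sketch of the NMS step is given below, following the common greedy, score-sorted formulation; the overlap threshold (0.5) is an illustrative assumption, and [25] should be consulted for the exact procedure.

```python
def iou(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, thr=0.5):
    """Keep the highest-scoring boxes, dropping those that overlap a kept one."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) < thr for k in keep):
            keep.append(i)
    return [boxes[i] for i in keep]
```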
IV. EXPERIMENTAL RESULTS
The CNN is trained using the well-known ICDAR 2003 training set and synthetic data from [26], mixed with Aksara Jawa font text and scene images collected from the internet. We collected 90 scene images containing Aksara Jawa: 60 images are used for training and 30 for testing. Some of the training data is used as a validation set. We set the parameters as follows: the maximum number of iterations is 450,000, the momentum is 0.9, and the SGD learning rate is 0.05, multiplied by 0.1 every 100,000 iterations. The validation set is evaluated every 1,000 iterations. On a desktop PC with a Core i5 CPU, 4 GB of RAM, and a GPU with 1 GB of memory, the learning process takes about one week. The performance is measured using precision and recall as in [19]. Precision is the ratio between the number of correctly predicted bounding boxes and the number of predicted bounding boxes. Recall is the ratio between the number of correctly predicted bounding boxes and the number of ground truth bounding boxes. Therefore, predicting many bounding boxes would likely produce low precision and high recall, while predicting few bounding boxes would likely produce high precision and low recall. A good balance between high precision and high recall is the best performance.
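The precision and recall defined above can be computed as in the following sketch, reusing the iou helper from the NMS sketch; the IoU matching threshold is an assumption, as the exact matching criterion follows [19].

```python
def precision_recall(pred_boxes, gt_boxes, thr=0.5):
    """Greedy one-to-one matching of predictions to ground truth by IoU."""
    matched, correct = set(), 0
    for p in pred_boxes:
        for gi, g in enumerate(gt_boxes):
            if gi not in matched and iou(p, g) >= thr:
                matched.add(gi)   # each ground truth box is matched once
                correct += 1
                break
    precision = correct / len(pred_boxes) if pred_boxes else 0.0
    recall = correct / len(gt_boxes) if gt_boxes else 0.0
    return precision, recall
```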
Since the existing Aksara Jawa text works [3]–[7] provide no ready software for comparison, we implemented a representative version of their algorithms ourselves: many features are cleverly integrated, followed by a classifier to discriminate text from non-text, and then post-processing such as bounding box formation. More specifically, connected components are extracted from the gray image using the well-known MSER detector. Trivial non-text connected components are first removed using geometrical structure, namely stroke width, aspect ratio, eccentricity, Euler number, extent, and solidity. The remaining non-trivial non-text connected components are then removed using an SVM classifier. Finally, post-processing merges the text components to form text-line bounding boxes.
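A sketch of this re-implemented baseline, using OpenCV's MSER detector and a few of the geometric filters named above, might look as follows; the thresholds are illustrative assumptions, and the SVM stage is omitted.

```python
import cv2

def baseline_candidates(image):
    """MSER components filtered by simple geometric rules (illustrative)."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)  # assumes a BGR input
    mser = cv2.MSER_create()
    regions, bboxes = mser.detectRegions(gray)
    keep = []
    for pts, (x, y, w, h) in zip(regions, bboxes):
        aspect = w / float(h)
        extent = len(pts) / float(w * h)            # region pixels / bbox area
        hull = cv2.convexHull(pts.reshape(-1, 1, 2))
        hull_area = cv2.contourArea(hull)
        solidity = len(pts) / hull_area if hull_area > 0 else 0.0
        # Remove trivially non-text components by geometric structure:
        if 0.1 < aspect < 10.0 and extent > 0.2 and solidity > 0.3:
            keep.append((x, y, w, h))
    return keep  # survivors would next be classified by the SVM
```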
TABLE I. OVERALL PRECISION AND RECALL
In scene images, characters can touch their neighboring components, and the character stroke widths are often very small. This leads to misclassification of text, since connected components analysis tends to rely on the components' geometrical structure. Table I shows the overall precision and recall scores. It is interesting to note that our method shows higher precision without degrading recall much, while the previous works show high recall compared to their precision. This is because the previous works estimate quite a number of bounding boxes, and the method may not be precise enough to predict the occurrence of bounding boxes.
Qualitative results from our method and the previous method are shown in Fig. 7. The background is very complex, containing for instance the regular pattern of a roof, grass, trees, and blur, and some characters are far too small with much noise in their surroundings. Despite this complexity, our method performs relatively well in detecting Aksara Jawa and English text lines, though it misclassifies a few low-resolution characters. Those characters are too small even for a human to recognize correctly. It is interesting to note that the existing method could not detect the text lines correctly, for example in the middle images shown in Fig. 7. This is because of the imperfect geometrical structure of the extracted components due to noise, touching characters, or perhaps a framework that is not deep enough to create a sufficiently discriminative system.
V. CONCLUSION
In this paper, we have presented a method to detect Aksara Jawa text in scene images that solves the problems of the existing works. Existing works tend to utilize many features, which are manually designed and tuned, together with a classifier for inference. In practice, those works are hard to implement and can deteriorate the detection system. Interestingly, in our method we have shown that feature integration, pooling, and non-linear transformation are all packaged within a layer. The computation is more efficient and fully feed-forward. The experiments show encouraging results, and we believe this will bring much benefit to future work on Aksara Jawa text detection and text analysis.
REFERENCES
[1] K. Jung, K. I. Kim, and A. K. Jain, “Text information extraction in images and video: a survey,” Pattern Recognition, vol. 37, no. 5, pp. 977–997, 2004.
[2] J. Liang, D. Doermann, and H. Li, “Camera-based analysis of text and documents: a survey,” International Journal of Document Analysis and Recognition (IJDAR), vol. 7, no. 2-3, pp. 84–104, 2005.
[3] R. Adipranata, M. Indrawijaya, G. S. Budhi et al., “Feature extraction for Java character recognition,” in International Conference on Soft Computing, Intelligence Systems, and Information Technology. Springer, 2015, pp. 278–288.
[4] M. D. Sulistiyo et al., “Pengenalan aksara jawa tulisan tangan menggunakan directional element feature dan multi class support vector machine [Handwritten Aksara Jawa recognition using directional element feature and multi-class support vector machine],” KNTIA, vol. 3, 2016.
[5] Y. I. Nurhasanah, I. A. Dewi et al., “Sistem pengenalan aksara sunda menggunakan metode modified direction feature dan learning vector quantization [Aksara Sunda recognition system using modified direction feature and learning vector quantization methods],” Jurnal Teknik Informatika dan Sistem Informasi, vol. 3, no. 1, 2017.
[6] I. Al Farqi et al., “Aplikasi pengenalan aksara carakan madura dengan menggunakan metode back propagation [Application of Aksara Carakan Madura recognition using the back propagation method],” Jurnal Ilmiah Teknologi Informasi Asia, vol. 9, no. 1, pp. 18–34, 2015.
[7] R. Y. Astuty and E. D. Kusuma, “Pengenalan aksara jawa menggunakan digital image processing [Aksara Jawa recognition using digital image processing],” in Proceedings of the Informatics Conference, vol. 2, no. 2, 2016.
[8] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, “Backpropagation applied to handwritten zip code recognition,”Neural computation, vol. 1, no. 4, pp. 541–551, 1989.
[9] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,”Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[10] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
[11] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[12] C. Farabet, C. Couprie, L. Najman, and Y. LeCun, “Learning hierarchical features for scene labeling,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1915–1929, 2013.
[13] A. Graves, M. Liwicki, S. Fernández, R. Bertolami, H. Bunke, and J. Schmidhuber, “A novel connectionist system for unconstrained handwriting recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, pp. 855–868, 2009.
[14] A.-r. Mohamed, G. E. Dahl, and G. Hinton, “Acoustic modeling using deep belief networks,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 14–22, 2012.
[15] R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1440–1448.
[16] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587.
[17] ——, “Region-based convolutional networks for accurate object detection and segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 1, pp. 142–158, 2016.
[18] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part-based models,”
IEEE transactions on pattern analysis and machine intelligence, vol. 32, no. 9, pp. 1627–1645, 2010.
[19] I. A. Sulistijono and A. Risnumawan, “From concrete to abstract: Multilayer neural networks for disaster victims detection,” in Electronics Symposium (IES), 2016 International. IEEE, 2016, pp. 93–98.
[20] A. Risnumawan and C. S. Chan, “Text detection via edgeless stroke width transform,” in ISPACS. IEEE, 2014, pp. 336–340.
[21] A. Risnumawan, P. Shivakumara, C. S. Chan, and C. L. Tan, “A robust arbitrary text detection system for natural scene images,”Expert Systems with Applications, vol. 41, no. 18, pp. 8027–8048, 2014.
[22] L. Neumann and J. Matas, “Real-time scene text localization and recognition,” in CVPR. IEEE, 2012, pp. 3538–3545.
[23] ——, “On combining multiple segmentations in scene text recognition,” in ICDAR, 2013.
[24] A. Risnumawan, I. A. Sulistijono, and J. Abawajy, “Text detection in low resolution scene images using convolutional neural network,” in International Conference on Soft Computing and Data Mining. Springer, 2016, pp. 366–375.
[25] K. Wang, B. Babenko, and S. Belongie, “End-to-end scene text recognition,” in ICCV. IEEE, 2011, pp. 1457–1464.