Research on Fragments Reassembly Based on Feature of Chinese Character and Template Matching

(1)

Published by Canadian Center of Science and Education

Research on Fragments Reassembly Based on Feature of Chinese Character and Template Matching

Gui-Sen Xu^1,3, Yuan-Biao Zhang^2,3, Yi Lin⁴, Yu-Jian Lin⁴ & Xin-Guang Lv^2,3

1 Electrical and Information School, Jinan University, Zhuhai, China

2 Packaging Engineering Institute, Jinan University, Zhuhai, China

3 Key Laboratory of Product Packaging and Logistics of Guangdong Higher Education Institutes, Jinan University, Zhuhai 519070, China

4 International Business School, Jinan University, Zhuhai, China

Correspondence: Yuan-Biao Zhang, Packaging Engineering Institute, Jinan University, Zhuhai 519070, China.

E-mail: [email protected]

Received: June 26, 2014 Accepted: July 7, 2014 Online Published: July 29, 2014 doi:10.5539/cis.v7n3p92 URL: http://dx.doi.org/10.5539/cis.v7n3p92

Abstract

The technology of fragments reassembly is widely employed in many scientific fields, such as judicial evidence recovery, restoration of historic documents, accessing to military intelligence and so on, which is based on computer vision and pattern recognition. In this paper, an efficient method for Chinese fragments reassembly is presented. The proposed reassembly method is based on the feature of line spacing and Chinese characters’

feature that the same font has the same height. Considered the feature of English characters that English letters are connected components, this paper proposes a model for English fragments reassembly, which is based on template matching. Using the proposed methods on digitally scanned images of actual fragments of paper image prints has verified the robustness and reliability of two models.

Keywords: fragments reassembly, the feature of Chinese characters’ line spacing, template matching, connectivity of English letters

1. Introduction

Fragments reassembly arises in many scientific fields, such as judicial evidence recovery, restoration of historic documents, accessing to military intelligence and so on. The manual execution of fragments reassembly is very difficult, as it requires great amount of time, skill and effort. Specifically, if the number of fragments is very large, reassemble manually will be impossible to be accomplished in a short time. Therefore, the automation of such a work is very important and can lead to faster, more efficient, painting reassembly and to a significant reduction in the human effort involved.

Fragments reassembly is an application technology based on computer vision and pattern recognition, which can be divided into two parts. The first part is to access to information of fragments by image preprocessing. The other part is to reassemble fragments based on accessed information. Conventional methods utilized corner feature of edge, outline feature, area feature and so on to reassemble fragments (Jia et al., 2005; Wolfson, 1990;

Hori, 1999; Kong, 2001; Li, 1995; Gao, 2007). These methods are based on geometric characteristics of fragments. Therefore, they can’t recover fragments with similar shape. In this field, some scholars have made related researches. Otsu N. presented Otsu algorithm based on the principle of least square method, which solved the problem of searching optimal threshold by calculating the maximum variance between grey classes (Ostu, 1979). Luo Z. solved the problem of feature extraction of Chinese document, by exacting line spacing and the height of characters (Luo, 2012). Gao et al., solved the problem of labeling 8-connected region in the model of English fragments reassembly, by combining the advantages of the mark line based method and region growing method (Gao, 2007).

Considering those methods, which are based on geometric characteristics, can’t recover fragments with similar shape, this paper combined the advantages of above methods, and proposes subregional scoring method to optimize the extraction of Chinese line spacing. Because line spacing of English is not obvious, template matching method is proposed to reassemble English fragments, which is based on 8-connected region of English

(2)

process of based on th In course algorithm.

1, means a Finally, bi transforma spacing.

If a paper spacing. T line, can b paragraph some fragm cases caus method is more area appropriat into anoth more accu

f model is div he similarity o of locating lin The next step all the pixels a inary image m ation not only

r is divided int Therefore, acco be found. How

and the end o ments’ line spa sed large erro

presented to o as. And then th te accuracy. Fi her problem th urate.

vided into two of edges of bin ne spacing in f p is to compare are white, this matrix is trans greatly reduce

Fig to two parts, ording to this f wever, becaus of paragraph, c acing feature a ors to the resu optimize the ex

his paper com inally, the pro at compares s

Figure

Figure 3. Fra

sub-progress arized pixel m fragments, firs e the value of t pixel row will formed into a e the amount o

gure 1. Transfo the break edg feature of leftm

e the size of contain large b are very simila ult of classific xtraction of lin mpares the sim

oblem, which c similarity betw

e 2. Fragments

agments that v 93 including the matrixes.

stly, the proce the same pixel l be map into 1 a column vect of data process

ormation of bi ges must have most fragments

fragments is blank area, as s

ar, but not real cation. In orde ne spacing. Th milarity betwee compares sim ween those are

s containing la

very similar bu

classification ss of image bi row in binary 1; otherwise, t tor with 0 and sing，but fully

nary image the text line s, all other frag

so small, som shown in Figu lly matching, a er to solve th he first step is en the same ar ilarity between as of two vect

arge blank area

ut not really ma

process and r inarization is c y image matrix

this pixel row d 1, as shown y embodies the

with same hei gments, which me fragments,

ure 2. Another as shown in F his problem, su

to divide the reas of two fr n two fragmen tors. This met

a

atching

reassembly pro completed by x. If all number will be map in n in Figure 1.

e feature of the

ight and same locate on the from beginnin special case is igure 3. Those ubregional sco column vector ragments unde nts, is transfo thod led to a r

ocess Ostu rs are nto 0.

This e line

e line same ng of s that e two oring r into er the rmed result

(3)

After usin score was the simila fragments, which are have made Finally, ba similarity row. When specially. U the similar the feature 6, the line edge is 21p similar situ

2.2 Steps o Step 1: C

^A Step 2: Div

g subregional also low. For t arity of other , which contain

very similar b e classification ased on result of edges of bi n reassembling Under this situ rity of edges o e that printed d

spacing is 28 px, we just ne uation, fragme

of Algorithm Comparing bin

^{( ) 1}

A ki  ；or viding bi ^Tinto

Figure 4. Sc scoring metho the case in Fig

areas is high ning a large nu but not really m n result more a

t of classificat inarized pixel g fragments, t uation, there is of binarized pix

documents hav 8px and the bre

ed to search fo ents can be rea

Figure 5. Fra

Figure 6. Spe

nary image m

 A ki ^{( ) 0} . Fin o

a

areas, an

chematic diagra od, the similar gure 2, only the h. Therefore,

umber of blank matching, can’

accurate.

tion, the fragm matrix. The n the case that b no black pixe xel matrix. Th ve fixed line sp

eak edge just or a fragments ssembling by t

agments that th

ecial case that t

matrixes  Ai o nally,  Ai is tr nd denoting as

ams of subregi rity between tw

e area containi subregional sc k, can be reass

’t be reassemb ments in the s next operation break edge loc el in the edge, s erefore, an im pacing and cha

located in blan row with 7px the same meth

he break edge

the break edge

of the first i

ransformed int s b1i b2i  b

ional scoring m wo fragments

ing a large num coring method sembled togeth bled by mistak same row wer is to find the cated in blank

so fragments c mproved method

aracters distan nk area. Bacau

line spacing u hod.

located in blan

e located in bla

fragments, if to column vec

^T

bai . Every are method

in Figure 3 w mber of blank d can ensure her. What’s mo ke. It can be sa

re reassembled correct order area is neede can’t be reassem

d is presented, nce. For instanc

use the line sp under break ed

nk area

ank area

f the first k r tor  bi^T. ea contains H

as low, so the has a low scor that two mat ore, two fragm aid that this me d, according to of each fragm ed to be consid mbled accordi , which is base ce shown in F pacing above b dge. If encount

row all is 1,

/ a pixels(^Hi total re, so tched ments, ethod o the ments dered ng to ed on igure break ering

then s the

(4)

Step 4: Ca

W th Step 5: Af

th to Step 6：A

th an to co 3. Model f Considerin Furthermo templates.

3.1 Buildin Found out pictures, th was implem

3.2 Extrac According within fra boundary sample fra Figure 9.

8-connecte Step 1：M Step 2：A

m

represent tolerable max alculating total

Within the faul he same row, o fter classifying he similarity of o be reassembl After reassembl his paper takes nd bottom edg o the feature t

orrectly.

for English Fr ng English let ore, the numbe

So fragments ng of English C t the font type

he library of c mented, such a

ction of Letters g to this featu agments is ma of connected agments of L a Figure 8 and ed region.

Marked connec According to th

s the first

m

x numbers of di score of two f

lt tolerance



or they are in d g all fragments f edges of bina led according t ling each row, s each row as a ges. If the brea that printed do ragments Rea tters are comp er of English pu can be reassem Characters Tem of this Englis character templ

as Figure 7.

s

ure that every arked in the o component. F and H. The foll d figure 9 are ted componen he boundary lo

m

subregion i ifferent value b fragments



^{, if}Zscore¹⁰ different row.

, reassembling arized pixel m to the feature t the next step i a fragment. We ak edge located ocuments have assembly

posed of 52 u unctuations is mbled accordin mplates Librar sh document, a lates can be se

Figure 7. Bina

English letter order. The nex Finally, all lett lowing is the e schematic di nt of L and H in

cation of B an 95

in fragments.

between two a

1 a

score sco

i

Z Z





00(a2), the fi g fragments be matrix. If the br that printed do is to extract the

e recover all ro d in blank area e fixed line sp

uppercase and small. Theref ng to those tem ry

and then found et up. Finally,

ary template of

r is an 8-conn xt operation is ters are extrac extraction proc iagrams, which n the order, as nd L’s 8-connec



represent areas.

ore( )i . irst

i

fragmen elong to the sam reak edge locat

cuments have e edge of top a ow fragments a, fragments n pacing. Finally

d lowercase le fore, it’s conve mplates.

d all pictures o the process of

f letter O

nected region, s to extract th cted in the sam cess of letter L

h is aimed to shown in Figu cted region, ex

ts fault tolera

nt and the firs me row in the ted in blank ar fixed characte and bottom of e

according to th need to be reas

y, all fragmen

etters and Eng enient to make

of characters. A f character tem

each of conn he whole lette

me method. F L and H. The re o better explai ure 8.

xtracted letter L

ance, which is

st

j

fragment a

order, accordi rea, fragments ers distance.

each row. And he similarity o ssembled accor nts are reassem

glish punctuat English chara

According to mplates binariz

nected compon er according to Figure 8 show

esults are show in the princip

L and H.

s the

（2）

are in ng to need d then of top rding mbled

tions.

acters

these ation

nents o the

s the wn in le of

(5)

3.3 Detect After extra vector{ }M

characters whether th transforma method is 3.3.1 Initia 3.3.2 Tran 3.3.3 For

End f

tion of Comple acting the lett

T, according to embodied in here is a same ation. If there i shown as follo alize the Param sform those bi

count=1,2,3, Transform tem If rows of

{M_XOR}^T

Add each Else

Add each End if If Sum Sum XO

Sum Sum

End if If Sum0

This cha End if for

ete Letters ters from fragm

o the model fo

{ }M ^T, which i template as th is the matched ow：

meters S inary images o

…,TP

mplates into co

{ }M ^T is equal

{ }M ^TXOR M{ mod



h value of {M

OR

mXOR

aracter is comp

Figure 8. Bin

Figure 9. Mar

ments, the firs or Chinese frag is convenient t he fragment, th d template, it s

100000 Sum 

of letters into{M

olumn vectors;

l to rows of {M

d_el}^T;

}^T

MXOR and assi

}^T

M and assign

plete.

nary image of L

rked image of

st step is to tr gments reassem

to compare th he next step is shows that the

000 , number

}^T M .

;

mod_el}^T M

ign toSumXOR. n toSumXOR.

L and H

ransform binar mbly. This pro he characters a

to compare {

character is co

r of torn piece

ry images of l cess makes all and templates.

{ }M ^Tand templ omplete. Pseud

( ) es TP .

letters into co l the informatio

In order to d lates with the do algorithm o

lumn on of detect same of the

(6)

around bre Figure 10.

Figure 11,

In order to flowchart

eak edge. If th . Otherwise, it the area inside

o reassemble a of the model i

ere aren’t inco t means these e wireframe is

Figure 10. C

Figure 11. Inc all fragments i

s shown in Fig

omplete charac two fragments s the area aroun

Complete chata

complete chata in the same ro gure 12.

97

cters, it means s aren’t match nd break edge.

acters inside ar

acters inside ar ow, the searchi

these two frag hed, as shown

.

rea around brea

rea around bre ing procedure

gments are ma in Figure 11.

ak edge

eak edge is iteratively

atched, as show For Figure 10

applied. Algor wn in 0 and

rithm

(7)

4. Experim The perfor paper ima appendix

72 180px ( model for 13.

Figur ments and Re rmance of the age prints (CU

3 and append (CUMCM, 201

English fragm

re 12. Algorith esults

proposed met UMCM, 2013 dix 4 included 13). Model for ments reassemb

hm flowchart o

thods was eval ). There were d 209 fragmen r Chinese fragm bly was applie

f model for En

luated using d e Chinese fra nt images resp ments reassem ed to reassemb

nglish fragmen

digitally scanne agments and E pectively, and mbly was appli le appendix 4.

nts reassembly

ed images of a English fragm d each fragme ied to reassemb

. The results a y

actual fragmen ments. Furtherm

ent image had ble appendix 3 are shown in F

nts of more, size 3 and

igure

(8)

Table 1. A

4.1 Analys Table 1 sh reassembly the result o appendix 3 the reassem with the sa error in the

Application resu Mo Mo sis on Result of hows that the y time efficien of recovery. O 3 was too wid mbly of all row ame line spaci e process of cl

(a) Figure

ults of the mod odel for Chines odel for English

f Appendix 3 e recovery rate ncy is high. In One of the impo

de, as shown in ws. It can be se ing. Another r lassifying.

e 13. Results o

dels

se Fragments h Fragments

e of model fo the process of ortant reasons n Figure 14. T een that the m reason is that

Figure 14. Co 99 of appendix 3 a

Recover 88.57 96.17

or Chinese fra f reassembling

is the line spa This reason cau model is more s

the line spacin

ompare of line

and appendix 4

ry rate Co 7%

7%

agments reass g fragments in acing of the sec

used serious er suitable for rea ng of few frag

spacing

(b) 4

omput. time 3.03s 47.38s

sembly reache appendix 3, tw cond row in th rror in the pro assembling the gments was ve

ed 88.57% and wo reasons affe he last paragra ocess of compl document pri ery similar, cau

d the ected ph in eting nting using

(9)

4.2 Analys Table 1 sh higher. Th recognition The font i fragments, we only n based on te In the reas the feature Figure 15 extracted t and lower connectivi impact on

In this ex fragments region, ou Furthermo have the s again. Fina in daily lif 5. Conclus This pape reassemble that the sam English c characteris Therefore, had high re In daily li handwritte fragments recognizin method of English fra 2013).

As a futur calculation Acknowle The author Logistics Universitie the proje (No.2011B

sis on Result of hows that the he reason is t n. However, w in fragments o , Zapf Calligra need to set up

emplate match ssembling proc e of English ch .These Englis these characte rcase letters m ity of letters i accuracy, whi

xample, all fra can’t solve th ur model for ore, most fragm

ame shape. W ally, we get ma fe.

sion

r considered t e Chinese frag me font has th characters’ co stics, the prop , the models ar

ecovery rate an ife, the handw en documents, reassembly is ng Chinese ch f double elasti agments reasse re work, in ord n will be inves edgments

rs acknowledg of Guangdon es, the project ects of Zhuh B050102013 &

f Appendix 4 recovery rate that this mode we utilized met

of appendix 4 aphic 801’s tem

a new templat hing presented cess, occasion haracters’ 8-co h punctuation rs through con meet the requir

in our model.

ch caused by t

F agments were he reassembly r English do ments in daily When people to any fragments

the different f gments, the m he same height onnectivity is

posed models re very practic nd the robustn written docum we need to re s similar to th haracters and E

c mesh is use embly, method der to speed up stigated.

ge the financia ng Higher Ed of the Natura hai Science,

& 2012D05019

of model for el was based thod of exhaus 4 in this exam mplate library te library. It ca

in our paper c ally there wer nnected region ns didn’t meet

nnectivity, they rement of 8-co Therefore, in those special p

Figure 15. Not e 180 72px im

of these fragm cument fragm life have the s rn papers, they with same sha

features betwe model based on is proposed. I proposed. C s can accurate cal in daily life ness and reliabi ments are very ecover those f he method pre English letters d to recognize d of principal c p calculating,

al support of th ducation Instit al Science Fou Technology, 990033).

English fragm on template m stion in the pro mple was Zapf y was set up. I an be seen tha can be widely e re little mismat

n. However, no t the requirem y were divided onnected regio n the process punctuations, c

connected pun mages. The m ments. By extr ments reassem

similar shape.

y usually torn ape. Therefore

een Chinese c n the feature o In order to reas Compared w ely recover d e. After a pract ility of models y common. In fragments. In g esented in this

s. In model o e Chinese char curves algorith using intellige

his research by tutes , the Fu undation of Gu

Industry, Tr

ments is very h matching, whi ocess, caused th f Calligraphic f we reassemb at the English employed in d tched cases. D ot all English c ment of 8-conn d into two con on. And this p of reassemblin could be ignore

nctuations method based

acting the cha mbly can rec For example, papers in half e, it can be seen

characters and of line spacing

ssemble Englis ith traditiona document frag tical experimen s were verified n order to sav general, the m paper. The m of handwritten

racters (Chen, hm is used to re ent algorithm t

y the Key Labo undamental R uangdong Prov rade and In

high but the c ich has very he higher com 801. In order ble document w

document frag daily life.

Due to English characters are nected region nnected areas.

paper focused ng fragments ed.

on geometric aracters accord over the doc fragments pro f, then fold the n that our mod

d English char g and Chinese sh fragments, t al model bas gments with th

nt, it can be se d.

ve some impo method of hand main difference n Chinese frag

2009). In mo ecognize Engl to optimize the

oratory of Prod Research Fund vince (No.S201 nformation Te

computation co high rate of mputation cost.

r to reassembl with different gments reasse model is base

connected, su . When this p

But, all uppe d on the use o by the model

c characteristic ding to 8-conne cument accura oduced by shre em in half and del is very prac

racters. In ord characters’ fe the model base sed on geom he similar sh een that our mo ortant fragmen

dwritten docum e is the metho gments reassem odel of handwr

ish characters e process of m

duct Packaging ds for the Ce 12010008773) echnology Bu

ost is letter le all font, mbly ed on ch as paper rcase of the l, the

cs of ected ately.

edder d torn ctical

der to ature ed on metric apes.

odels nts of ments

od of mbly, ritten (Ma, model

g and entral , and ureau

(10)

101

Gao, H. B., & Wang, W. X. (2007). New connected component labeling algorithm for binary image. Journal of Computer Applications, 27(11), 2776-2777.

Gao, X. T., Sattar, F., Quddus, A., & Venkateswarlu, R. (2007). Local natural scale based contour corner detection using wavelet transform. Proceedings of Information, Communications and Signal Processing, 25(6), 329-333. http://dx.doi.org/10.1109/ICICS.2005.1689061

Hori, K., Imai, M., & Ogasawara, T. (1999). Joint detection for potsherds of broken earthenware. Computer Vision and Pattern Recognition, (2), 440-445. http://dx.doi.org/10.1109/CVPR.1999.784718

Jia, H. Y., Zhu, L. J., Zhou, Z. T., & Hu, D. W. (2005). A Shape Matching Method for Automatic Reassembly of Paper Fragments. Computer Simulation, 23(11), 180-183. http://dx.doi.org/10.3969/j.issn.100 6-9348.2006.11.046

Kong, W. X., & Kimia, B. B. (2001). On solving 2D and 3D puzzles using curve matching. Computer Vision and Pattern Recognition, (2), 583-590. http://dx.doi.org/10.1109/CVPR.2001.991015

Li, H., Manjunath, B. S., & Mitra, S. K. (1995). A contour-based approach to mutlisensor image registration.

Image Processing, 4(3), 320-334. http://dx.doi.org/10.1109/83.366480

Luo, Z. (2012). Semi-auto stitching of scrapped paper based on character characteristic. Computer Engineering and Applications, 48(5), 207-210. http://dx.doi.org/10.3778/j.issn.1002-8331.2012.05.060

Ma, C., & Yu, M. (2013). Analysis and extraction of structural features of off-line handwritten characters based on principal curves algorithm. Computer Engineering and Applications, 49(3), 202-206.

Otsu, N. (1979). An automatic threshold selection method based on discriminate and least squares criteria.

Denshi Tsushin Gakkai Ronbunshi, (63), 349-356.

Wolfson, H. J. (1990). On curve matching. Pattern Analysis and Machine Intelligence, 12(5), 483-489.

http://dx.doi.org/10.1109/34.55108 Copyrights

Copyright for this article is retained by the author(s), with first publication rights granted to the journal.

This is an open-access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).