Clustering Metagenome Fragments Using Growing Self Organizing Map

(1)

ICACSIS 2013

ISBN: 978-979-1421-19-5

zyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONMLKJIHGFEDCBA

Clustering M etagenom e Fragm ents Using G rowing

Self O rganizing M ap

~

Marlinda Vasty Overbeek. Wisnu Ananta Kusuma, Agus

BUQ[lQ

Departement 0.( Computer Science

-Faculty of Mathernatics and Natural Science. Bogor Agricultural University

Email: [email protected]@[email protected]

Abstract=- The microorganism sarnples taken

directly from environment are not easy to

assemble because they contains mixtures of

microorganism. If sampie cornplexity is Hr)' high and comes from highly diverse environment, the

difficulty of assembling

DNA

sequences is

increasing since the interspecies chimeras can

happen, To avoid this problem, in this research, we

proposed binning based on cornposition using

unsupcrvised learning. We ernployed trinucleotide

and tetranucleotide frequency as Ieatures and

GSOM algorithm as clustering method. GSOM

was implemented to map featurcs into high

dimension feature space. We tested our method using small microbial community dataset. The quality of cluster was evaluated based on the

folIowing parameters topographic error.

quantization error, and error percentage. The evaluation results show that the best cluster can be obtained using GSOM and tetranucleotide.

I. INTROOllCTION

M

ETMiENOMICS i~ a study of analyzing high

complexity of rnicrobial community which

allo ws culture - independent [IJ, 12J. As we know.

only I% of microorganism can be cultured by

standard cultivation techniques. The rest should be taken directly from the environment, narned as rnetagenome sarnple. This kind of sampie contains rnixtures of microorganisms. This characteristic makes assembling process becornes more difficult because it will yield more interspccics chirneras [5

J.

To solve the problem, we used binning proccss before or after assembling metagcnome fragmcnts. Binning is a techniques to classify or cluster organism based on taxonomy [5]. [6].

There is two binning approach. the first approach is binning based on hornology such as [3LAST [7). [81

and MEGAN [9). The second one is cornposition based approach. The composition approach applicd unsupcrvised learning and supcrviscd learning as a method and oligonucleotide as an input in the

features spaces. There are ma ny application developed based on th is approach. Sorne applications that

ernployed

unsupervised learning are TETRA [10), Self Organizing Clustering [II], Self Organizing Map [12J. and Growing Self Organizing Map [I], [13]. The ones that used supervised learning are PhyloPythia [14). Naive Bayessian Classification [15]. and Phymm

[161-One of researches used GSOM combined with

oligonuclcoude to explore the genorne signatures. Clear species-specific separation of sequence was obtained in the > 8 kbp fragrnents test. The fragrnents

were derived from 30 species, which is separated into 3 dataset, 10 spec ies per set [ I].

ln this research. we employed binning based on composition with unsupervised learning. We proposed

I kbp DNA sequence derived from 18 species. We reads the fragments uniformly. The previous research [ I]used long fragments (8 kbp). Using short length (I kbp) gene a poor performance [5], [17]. In this research, we will overcorne the Iimitation of using short fragrnent. The purpose of this research is to know the performance of GSOM in c1ustering the mctagcnorne fragrnents with short fragmenl (I kbp fragrnent lenght).

II. MATERIAL AND METItODS

zyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONMLKJIHGFEDCBA

(jf'(JlI'il1 R , Self Organizin Map (GSOM)

GSOM consists of 3 main phase (Figure I). which wcrc initialization phase. growing phase and. smoothing phase [18). [19].

Initializotion phase

ln this phase. the algorithm initialize four starting nodcs. Four starting nodes which were randomly selectcd from the input dataset. The initialization nodcs wcre shown in Figure 2.

(2)

ICACSIS

2013

G T

=

-D

x

ln$F )

Where D is dataset dimension, SF is Spread Factor.

SF

value was determined by user and SF took the values between zero and one.

GSO~I PH.-\s[

SC~lI~G

W~I InlwliQ'tOOft

roros,"",,"""'oo .•l'r

aa

~m(l''''''~ln~

r~u...(" _~

.

-' ~ -;e,:

~:

--:? .~ .

.I" -

.-.~-.-•••••.,..Iu-n!wwp

hg I (i<"(Jt\! phase Contains thrcc phase which are miualisation

phase (irnualisation ~ starting nodcs). growing phase

(nodcs IS growing) as a irnportant phase, and smoothing

phase (final number or nodcs)

09

OB

U'

0)

c :

~ll

,

--~--r

. -.-.-~--__~_~-J-_J

f, : r • n" 0g I

lI(\I!:-'O"'~l "

II!! 2 lmuahvatron 'I~rllng nodc lbc values " bciwccn one and

ICIII. random"

(,'r(JII/II}: "h (/\('

Grow ing phase is the most important phase in GSO~l method. bcacausc in this phase thc map would

be set as dynamic to overcornc the lirnitation of static

map structure of SOM. Belo» is the pseudocode of gw\\ing phase ln performing mctagcnornc fragmcnts clustcring shown 111Figurc J

Error values is the distance bciwccn thc input and \\ inncr nodc The growth proccss depcnded on the growth threshold. Whcn a nodc is nol in the boundarv uf the network. it cannot gro\\ nc\\ ncighbors duc its position

ISBN: 978-979-1421-19-5

Smoothing phase

ln the final phase. the learning rate and

~~L!~ ~i~~:~~:i~~~:o~w~lght ~cd~s)

:.'.! 1':.1:1. ':::'.'C11

G~t~rr.I'~ :~ar'i~q R~:ea'~

:;';'1;hr.:·:-l:,:,·:.j ~.i:,;,

:: (\I:~_~-?r i s l1-?t~rT_i~el.."i tr oe. =t:,rr.~,=,::it.l.r:'!1 rratz1. X I

o~;,~;\r.;-.. ·..•-~iq~ -:~'-:Cl ,~"1dits

:l~i;h!"~.=~

!:-.l:r a s e -= r=:T ir·:·:1'- ",-i:1:-_-=r

': : ({'-.!~~

=

r r ...·r ;.(" ....~ '0":

r;:-(;1"·:' •.; -t_-?- ~.,::_~~

El

~-=

r'l·o·tr _L1Jo_,_"d ',J'_ ivil.L

l:.~,dif

:~iti~li:~t~o~ ~ew Lear~~~g Rate

~~.:1 :l-=lqhi:',:rr.·:,,=,,j S:z-=

~ 'l(~ i t

=,.... .. -.::.~ r1t! 1 t.-.l·j "'·_-,rrr,,-,~it :0:1

: 0:; ••••. ~ !"t:--:."':'- •..:0,; ~1.1·~r':-_"'''

F , .':

ncighborhood parameter would be dccreased. This

parameter changcd in every iteration. When the minimum level is rcached. the value would be close to zero.

Comparcd with the prcvious work hy Chan ef al [II, wc IIed I kbp fragmcnts Icnght in-read of 8 khp Using short Iragrncnts incrcascd the cornplexity

or

clustcrcd rrucroorganisrn and made the project more difficult. since it nlways caused fragments 10 overlap

and made thcm being mi-taken tn fragrnents

asscmbl ing.

Because of that. in this research \\ e transformcd the

Ieaturcs cxtracuon rcsult to 0 - I values. The data

transtormation was uscd to reducc the data variation

and hclpcd to increase the level of truth.

1II RI \1 I I

The proposcd binning method \\as tested Pil

simulatcd mcragcnomc Iragrncnt gcnerated h~

Meta Sun 120) The <irnulatcd datasct of rnicrobc-D:"J!\ scqucncc \\ as randomly <arnpled from NC B 1 database 1211 ln this research. wc randornly took IR

microbcs. c) mrcrobcs for data trairunu and Q

microbc-for data Ic<;lmg \\ith I kbp fra!!lllcnt-IcIl!!lh and thcn clustcrcd into ~ drflcrcnt ph;lulll. f',.,:fl'ol>at'fl· r/.r.

IJU Ltcruulct cs, and thl.nnv.Iia«: rcspcctivc ly. Fach <rt of the gCll(lllle 'cqllcnce \\;\S scparaicd into two order' (Tf oligollllcleot ide frequcnc ics (trinuc lcotidc and

(3)

ICACSIS 2013

---

-

--_._---0.6

for

rrinucleoride frequency and 0.8

for

tetranuc

Icotidc frequency.

The sirnulated daraset was extracted

by

koma

frequencies method to get the specific oligonucleotide frequcncy and to put it in the composition

matrix.

After e xtracting. each of the dataset was scaled to obtain

a dataset

between zero and one. After

scaling.

we separated the data set iruo data training and data testing (Table I and Table II). We used data training to obtain the trained model and used data testing to evaluare the GSOM algorithrn. The flowchart of

" AULI.I

zyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONMLKJIHGFEDCBA

!J." 1..\ r""',ISI~li

~lJcrobc,

.-\udJth'llbacJllu,> fcnvoranv SSJ chrornosorne

Bunchncra aphuhcola (('onara tujufihna)

chromosornc

.3 Burkholdcr ragtumac B tiR Ichromosorne I

Hlanabactcnum sr (Hlabcruv grgantcus}

chrornosornc

Havobactenum branchrophilurn 1'1.·15

Pro otclla denucota ~U~8<1chrumosornc /)

7 ( hl,,,",,lta numdurum NI~~ lhl~Ill\J"phJlJ Icl" lc l'~'I

" 'lm k a O lJ n(~C\Cll\l\

zyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONMLKJIHGFEDCBA

Ilhf\lflH l,nm t: chrornovornc S"11 1.11

I AHl.I: II

[),\Tr \ TESTINlj

II,,,, IIJlJ"llI'I"" ,"In It>rt,,(k, ·\Tee 1'~N

c hrornovomc

Hrucclla (anIS .-\ Iel'ChhH1111snnJc 1

Rh""t>'IHII ,'11, C1·N ~~

II~cl"'I>ldc' f,ag 1ii'11.1XK

Pr~'\t)I\.·II~nll:I~I1Il1()g\.·I1ICtl ,\1el' :!5X-l5

t:hrornp~IHnl' I

(, I'r\':\l'lrll.1 fU I1 IIf)I\..\.IIJ ~ .~ \.h h1Ilh · ...,· llh .·

l'hlilnl\ll"phtl" 1",,,,,,,11>1>1.10: :\K.I'l

R l'~raehl~In\J,,, ac antha.mmocbac 1'\'-7

chrornosornc

<) waddha chondrophrla \\ -;1' X('·I()·U

2

GSOM algorithm for clustcring rnetagenorne fragments are shown in I· igure 4.

To evaluare the clustering pcrformance. we used topology prcservation (topographic error). mapping precision (quantization error) and error perccntage

122J. 1231·

Wc also used time parameter to calculate

the efficiency

Quantization error is a common error mcasurernent that rneasure the averagc distance between each data vector and its Best Matching Unit (BMU) Definition of BMU is a a randomly sarnplcd vcctors tha count the nearcst distance bctw cen vcctors

1:::2

J Short ly. quantization error rncasurc the mapping precission bctwccn input vccior ii and ncarcst weight vccror

mz;.

qe

=

2. ~

Iin -

mlillzyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONMLKJIHGFEDCBA

! Y L

.\'

ISBN:

978-979-1.:121-19-5

The expected map is obrained when the value of qe

(

s---=-T_~

/ C.~wt

~

II~ -t

;' 0""""",, /

LI __

"""_"'r(-"r_'_

,.--1-.

{ E n d )

'---'

:\IlJI~ ~ISproccdurc clu ....tfrmg rnctagcnornc fr3gllll"nl\

u'Ing (iSOr-l algouihm

1,\111.1. 111

ME !,\(jDJUMI' 1·I{Mi1\lLN 1S M',\I.YSIS V,\LlIE

l nnuclcoudc tctr anuclcoudc

rl!rxJ~raptll(

() llfl7 (Jn,,(,

error

()uanlllalll>n _1.1\).1

fJ 7~2 error

Irror pcrcc ntag c I ~ 7.l". I ~ .lRo.

Func (see)

zyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONMLKJIHGFEDCBA

(,00 ₂₈₈₀

rcachcd the minimum value.

To rneasure the topology preservation. we use topographic error. The topographic error considcrcd the map structure and cxplaincd the correspondcnce-betwccn input data. lhis error measures the proportion of all data vcctors for which first and second B\~U are not adjaccnt vcctors

122J.

.'1

1~

-Ce

=

"ii

L

u(xi)

Where ll(xi) = 1 if ii data vcctor first and second B\lL:s are adjaccnt and O. othcrwise.

The error pcrccntagc used in this research was calculatcd based on the rcsult of rnisclassification of the mctagcnornc fragrncnts data.

(4)

ISB~: 9i8-9i9-1-t21-1!l-5

--- --- -- -

-ICACSIS 2013

lnn ' frcquenc, usrng (iSOM a

We use 06 SF value to control the neuron growth

stop growing in 28 x 13 mapmc

m

Map

with the values of 18, 74°'0 and 1848

~o for using trinuclcotidc and tetranucleotidc. rcspcctive ly. This

tcndency was also shown by topology preservation,

Both of oligonucleotide frequcncies gavc the

topographic error of 0_067 for trinucleotidc frequcncy

and 0_066 for tctranucleotide frcqucncy

l lowcvcr both topograhic error rcsult wcre cnough

10 prove that every BMUs in the map gr id was not adjacent vectors. It show ed that both maps gavc a quite good map prcservc to clustcring a metagcnomc

fragments.

Morcovcr b~ analyzing their quantization error. wc can concludc that tctranuclcoridc gavc bener clustcr than trinuclcotidc. Tctranuclcotidc frcqucncy gave a ma pp mg prccrsion result of 0,7~~. berter than that of trinucleotide frequcncy which is I J04, II mcans that clusters constructed using tctranuc lcotide frcqucncy

feature are more dense, Wc also showcd the rnapping

rcsulrs using trinucleotidc Ircqucncy (Figure ~) and

uS.ing letranuclcotide frcqucncy (Figure 6)

IV, COI"CI.lISIO~

Our method. cornbrrung (jSO~1 and

oligo-nuclcotide can show 3good pcrformance in clustcr ing mctagcnomc fragmcrns ln phylurn level with 5h0r1 fragment (1 kbp) 'l hc rcsults showcd that the pcrforrnance of clustcring using tctranuc lcoride was bener than using trinuclcoridc. The error pcrccntagc rcsult of using tctra-nuclcoudc is 18 .tRO 0 and the

quantization error \\3S 0 ~72 less than lIsIng tr

r-nuclcotidc \1orc()\ er, the ropographic error ISsl11<1l1.

lrg 6 \brrlOg tctranuclcoudc rr~411~n(\ l""l~ (i'i< )\1 ;tI~pfllhm

\\'~ li"

zyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONMLKJIHGFEDCBA

I) ~<"1' value Ip ,"1111,,1 11\,' neuron ~r""lh Map

"l\'r ~rtl\\ In~ 11\30 X 12 mar

"d/~

whic h Illcall~ the neuron in m.ip ;:.rid I' not adi~I(L'llt

UIlC ~1I1(1anothcr I~a,ed on these rc-ults \IC can

conclude th~lt ljS{)\l al~l'rithlll I' suitablc tor ll1apflll1~ the lIlet:q;cIlUI1lL'

fragmcn: kll~th ~1I1dha-,

fra;:'lIlcllh \Iuh

the llppl)rtllllitl

{ ~~ (h,lll. II,,, ... I,II\~ .md <, " 11.11-...:.111111;':1.' '\ "'1Il~ l;rll\\H1~ \.:11·( Ir~.II\I'-IIl~ \1.11''' Ip Impf!'\": the

HlIlIlIl1~ 1"1\1.."'" II) I11\HPIlI1ll'I11.li \\ hllk-lICllllllll." \th'll!lIn

\l'lPI(lIllll~ j"IIJllo d ,., "'''''Il'/II. fllC dll I /{I"/'" h'/"!I',,"

\ 1 ~IIPS : .,S

\1 \ (I \!;Jlk\ \h.-I.I!'Uh'lllh.' .~ III: ....'11: i('1111111.'1 \ ,,!!!..bk

hill' \\\\\\ maun, •...lltll1l;llIc\ ,'r~ pllhlh ••1I11'fl' 1111111 -, !1~1I.1\.lJll~1 '1 ~.1",1I Jlal lLH.1 \\,,1,+1.11

"'111111\111111(' 1111d'\'i'fll.lflllll,IIL'd '\..· .I\\.I!O ""'1 II!'" iI:1 'H

,n /1"'II/J/j"i"l'\ \t'\ \..;, pr 211' ~ I·J ~tlll.1

I f \ rrurr " !~\.· f'IIIl;:I,· nI I Ill'Id\."H"'\:r~·\ I 111Ir,,:rl1

I) RII•...•...h I \ I1'\.11 I) \\:1 I ".1111 ....(11 Jo.... I '.(1 '11 \\

"1..1"1\!l I I. 'i" •.••j (\\ jj ~Il.'r \1 \\ i,'ill.l' i\ '.r.t!"'11 ti \\hill" I I'rh:r' '11 J Ilt'lll1l.1Il R !',H"'Il" Ij

!'...:Jc!!·ldhl·:l ( !'!.::;:':1. ...1; ~ t-.:.t~l.:r.:.. .1Ih! ! 1 fI •..•iili:h

j11\ H,·tl1ll\·nl.Jl :'(I'II'lIil '\'lt'I~'llll \Cqtllll •.ll1~ ,·1 1111..'

'.H~_.I"\., (,<.1 .. -,•.. !.";I.!.. \'.! ' ~ ilt· :'10-.1

rp

l~ . -1 ~I"'~

\h"\~"!,!:,"r\.. ... Hill {,I"lllll'! \h:I<I;..'\,.'II,·mc \·1.:/· ..• ·.• : /' i" "0' I':, ,. , tO' ":, \ \,,' I :"'r . : -,

[I J

I~I

1-'1

I' I

\\ \ ~,\I' lm.~ 'l1h "l': \rrr •. Lhc .. f,' Ill1rr .. \1Il~' fhe

1\"fl"rrll,l~ld .! ~l' 11,'\.' 1)', \ 'l'4lh.1hl" \"l'Jllt-l\ .1:HJ

\hl.l~':il n n , ( 1,1,'111 •..tt;, 11 "1 'h"rI I"I:':l1lll\l •• Ir"11! 'v c vl

r'(!.~· I.lllIl1 '~~i".~I~· .:I -k .. ln ..llitJi( -f Ic ...hl:,·;··f'\

,'''1_'

'~i .tf

(5)

ICACSIS

2013

ISBN: 978-979-1421-19-5

(S]

Fragrnent Binning.'

IEEE Sympostum OlfCIBCB. \"01. 8. pp.

4&-53.2008.

H. Zheng and

1-1. Wu. "A Novel LDA and PCA-based

I-lierarchical Scheme for Metagenomic Fragrnent Binning ..'

IEEE Symposium on CIBCB. 2009.

D. II. Huson. r\. F. Auch. J. Oi. and S. C Schuster.

"MEGAN analysis of metagenomic data MEGAN analysrs

of rnetagenomic data:' Genome Research. \01. 17. pp. I-II.

2007.

1-1.Feeling, J. Waldmann. T. Lornbardot, M. Bauer. and F. O.

Glockner, "TETRA: a web-service and a stand-alone

program for the analysis and comparison of tetranucleoiide

usagc panerns in DNA sequences.," B.\/C biomformaucs.

\01. 5.p. 163, Oct 2004.

K. Amano, H. lchikawa, H. Nakamura, H. Numa. K.

Fukami-Kobayashi. Y. Nagamura. ind N. Onodera.

"Self-organizing Clustering: Non-hierarchical Clustenng for l.arge

Scale DNA Sequence Data," IPSJ Digital Courier. \'01. 3.

no. 2.pp. 193-197.2007.

T. Abe. S. Kanaya, M. Kinouchi, Y.Ichiba, T. Kozuki, and

T lkemura, "lnformatics for Unveiling Iliddcn Genornc

Signatures.' Genome Research, vol. 179, no.~. pp. 693-702.

2003.

A. L. Hsuand S. K.Halgamuge. "Enhanccment ofTopology

Preservation and lIierarchical Dynamic Self- Organising

Maps for Data Visualisation."

A C Mc+iardy. H. G. Martin. P. Hugenholtz. and I.

Rigoursos. "Accuratc Phylogenenc Classrflcauon ni"

Vnriablc-Lcnght D'I,\ Fragrncnt." Sa! .I!e!hods. \01 .1. ~,\

I.pr 63-72. 20 I 0 (9]

(lOI

(111

( 121

(131

( I~]

zyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONMLKJIHGFEDCBA

289

\15] G. Rosen. E. Garbarine, D. Caseiro. R. Polikar, and B.

Sokhansanj. "Metagenorne Fragrnent Classification using

N-mer Frequency Profiles." Advances in Bioinformatics, 2008.

A. Brady and S. L. Salzberg, "Phymrn and Phvmmlsl.:

Meragenomic Phylogenetic Classification with Interpolated

Markov Models." Sat

M ethods, vol. 6. no. 9. pp. 673-676,

2010.

S Prabhakara and R. Acharya, "Unsupervised Two-Way

Clustering of Metagenomic Sequences." Journal of

Biomedicine and Biotechnology, vol. 2012, 2012.

G. Zhu. "The Growing Self-organizing Map for Clustering

Algorithms in Programming Codes." IEEE lnternattonal

Conference on AIC!, vol. 3. pp. 178-182,2010.

D. D. Silva. DAIahakoon, and S. Dharmage. "Cluster

Analysis using the GSOM: Parterns in Epidemiology." in

IEEE Intemational Conference on ICIAF. 2007. pp. 63-69.

D. C. Richter. F.On. A. F.Auch, R. Schrnid, and D. H.

Huson. "Metasim - Sequencing Sirnulator for Genornic

Sequence." PLoS OSE, vol. 3, no. 10, pp. 1-34. 2008.

S Fcderhcn. 'The NCBI Taxonomy database." Xucleu: Actds Research, 1'01. 40. no. Decernber 201 I. pp. 136-143. 2012.

E. A. Uriane and F. D. Martin. "Topology Preservation in

SOM." lnternattonal Journal of Applied Mathematics and

Computer SCiences, pp. 19-22.2005.

J. Vesanto. M. Sulkava, and J. I-Iollm. "On the

Dccomposition of the Self-Organizing Map Distortion

Mcasure Dccomposiuon of the SOM distortion rncasurc ..

116]

(17]

(181

119]

(20]

121]

(221