ICACSIS 2013
ISBN: 978-979-1421-19-5
zyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONMLKJIHGFEDCBA
Clustering M etagenom e Fragm ents Using G rowing
Self O rganizing M ap
~
Marlinda Vasty Overbeek. Wisnu Ananta Kusuma, Agus
zyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONMLKJIHGFEDCBA
BUQ[lQzyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONMLKJIHGFEDCBA
Departement 0.( Computer Science
-Faculty of Mathernatics and Natural Science. Bogor Agricultural University
Email: marlinda_vasty@yahoo.com.ananta@ipb.ac.id.pudesha@yahoo.co.id
Abstract=- The microorganism sarnples taken
directly from environment are not easy to
assemble because they contains mixtures of
microorganism. If sampie cornplexity is Hr)' high and comes from highly diverse environment, the
difficulty of assembling
DNA
sequences isincreasing since the interspecies chimeras can
happen, To avoid this problem, in this research, we
proposed binning based on cornposition using
unsupcrvised learning. We ernployed trinucleotide
and tetranucleotide frequency as Ieatures and
GSOM algorithm as clustering method. GSOM
was implemented to map featurcs into high
dimension feature space. We tested our method using small microbial community dataset. The quality of cluster was evaluated based on the
folIowing parameters topographic error.
quantization error, and error percentage. The evaluation results show that the best cluster can be obtained using GSOM and tetranucleotide.
I. INTROOllCTION
M
ETMiENOMICS i~ a study of analyzing highcomplexity of rnicrobial community which
allo ws culture - independent [IJ, 12J. As we know.
only I% of microorganism can be cultured by
standard cultivation techniques. The rest should be taken directly from the environment, narned as rnetagenome sarnple. This kind of sampie contains rnixtures of microorganisms. This characteristic makes assembling process becornes more difficult because it will yield more interspccics chirneras [5
J.
To solve the problem, we used binning proccss before or after assembling metagcnome fragmcnts. Binning is a techniques to classify or cluster organism based on taxonomy [5]. [6].
There is two binning approach. the first approach is binning based on hornology such as [3LAST [7). [81
and MEGAN [9). The second one is cornposition based approach. The composition approach applicd unsupcrvised learning and supcrviscd learning as a method and oligonucleotide as an input in the
features spaces. There are ma ny application developed based on th is approach. Sorne applications that
ernployed
unsupervised learning are TETRA [10), Self Organizing Clustering [II], Self Organizing Map [12J. and Growing Self Organizing Map [I], [13]. The ones that used supervised learning are PhyloPythia [14). Naive Bayessian Classification [15]. and Phymm[161-One of researches used GSOM combined with
oligonuclcoude to explore the genorne signatures. Clear species-specific separation of sequence was obtained in the > 8 kbp fragrnents test. The fragrnents
were derived from 30 species, which is separated into 3 dataset, 10 spec ies per set [ I].
ln this research. we employed binning based on composition with unsupervised learning. We proposed
I kbp DNA sequence derived from 18 species. We reads the fragments uniformly. The previous research [ I]used long fragments (8 kbp). Using short length (I kbp) gene a poor performance [5], [17]. In this research, we will overcorne the Iimitation of using short fragrnent. The purpose of this research is to know the performance of GSOM in c1ustering the mctagcnorne fragrnents with short fragmenl (I kbp fragrnent lenght).
II. MATERIAL AND METItODS
zyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONMLKJIHGFEDCBA
(jf'(JlI'il1 R , Self Organizin Map (GSOM)
GSOM consists of 3 main phase (Figure I). which wcrc initialization phase. growing phase and. smoothing phase [18). [19].
Initializotion phase
ln this phase. the algorithm initialize four starting nodcs. Four starting nodes which were randomly selectcd from the input dataset. The initialization nodcs wcre shown in Figure 2.
ICACSIS
2013
zyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONMLKJIHGFEDCBA
G T
=
-Dx
ln$F )Where D is dataset dimension, SF is Spread Factor.
SF
zyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONMLKJIHGFEDCBA
value was determined by user and SF took the values between zero and one.GSO~I PH.-\s[
SC~lI~G
W~I InlwliQ'tOOft
roros,"",,"""'oo .•l'r
aa
~m(l''''''~ln~
r~u...(" ~
.
zyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONMLKJIHGFEDCBA
-' ~ -;e,:
~:
--:? .~ .
.I" -
.-.~-.-•••••.,..Iu-n!wwp
hg I (i<"(Jt\! phase Contains thrcc phase which are miualisation
phase (irnualisation ~ starting nodcs). growing phase
(nodcs IS growing) as a irnportant phase, and smoothing
phase (final number or nodcs)
zyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONMLKJIHGFEDCBA
09
OB
U'
0)
c :
~ll
,
--~--r
. -.-.-~--__~_~-J-_J
f, : r • n" 0g IlI(\I!:-'O"'~l "
II!! 2 lmuahvatron 'I~rllng nodc lbc values " bciwccn one and
ICIII. random"
(,'r(JII/II}: "h (/\('
Grow ing phase is the most important phase in GSO~l method. bcacausc in this phase thc map would
be set as dynamic to overcornc the lirnitation of static
map structure of SOM. Belo» is the pseudocode of gw\\ing phase ln performing mctagcnornc fragmcnts clustcring shown 111Figurc J
Error values is the distance bciwccn thc input and \\ inncr nodc The growth proccss depcnded on the growth threshold. Whcn a nodc is nol in the boundarv uf the network. it cannot gro\\ nc\\ ncighbors duc its position
ISBN: 978-979-1421-19-5
Smoothing phase
ln the final phase. the learning rate and
~~L!~ ~i~~:~~:i~~~:o~w~lght ~cd~s)
:.'.! 1':.1:1. ':::'.'C11
G~t~rr.I'~ :~ar'i~q R~:ea'~
:;';'1;hr.:·:-l:,:,·:.j ~.i:,;,
:: (\I:~_~-?r i s l1-?t~rT_i~el.."i tr oe. =t:,rr.~,=,::it.l.r:'!1 rratz1. X I
o~;,~;\r.;-.. ·..•-~iq~ -:~'-:Cl ,~"1dits
:l~i;h!"~.=~
!:-.l:r a s e -= r=:T ir·:·:1'- ",-i:1:-_-=r
': : ({'-.!~~
=
r r ...·r ;.(" ....~ '0":r;:-(;1"·:' •.; -t_-?- ~.,::_~~
El
~-=
r'l·o·tr _L1Jo_,_"d ',J'_ ivil.L
l:.~,dif
:~iti~li:~t~o~ ~ew Lear~~~g Rate
~~.:1 :l-=lqhi:',:rr.·:,,=,,j S:z-=
~ 'l(~ i t
=,.... .. -.::.~ r1t! 1 t.-.l·j "'·_-,rrr,,-,~it :0:1
: 0:; ••••. ~ !"t:--:."':'- •..:0,; ~1.1·~r':-_"'''
F , .':
ncighborhood parameter would be dccreased. This
parameter changcd in every iteration. When the minimum level is rcached. the value would be close to zero.
Comparcd with the prcvious work hy Chan ef al [II, wc IIed I kbp fragmcnts Icnght in-read of 8 khp Using short Iragrncnts incrcascd the cornplexity
or
clustcrcd rrucroorganisrn and made the project more difficult. since it nlways caused fragments 10 overlapand made thcm being mi-taken tn fragrnents
asscmbl ing.
Because of that. in this research \\ e transformcd the
Ieaturcs cxtracuon rcsult to 0 - I values. The data
transtormation was uscd to reducc the data variation
and hclpcd to increase the level of truth.
1II RI \1 I I
The proposcd binning method \\as tested Pil
simulatcd mcragcnomc Iragrncnt gcnerated h~
Meta Sun 120) The <irnulatcd datasct of rnicrobc-D:"J!\ scqucncc \\ as randomly <arnpled from NC B 1 database 1211 ln this research. wc randornly took IR
microbcs. c) mrcrobcs for data trairunu and Q
microbc-for data Ic<;lmg \\ith I kbp fra!!lllcnt-IcIl!!lh and thcn clustcrcd into ~ drflcrcnt ph;lulll. f',.,:fl'ol>at'fl· r/.r.
IJU Ltcruulct cs, and thl.nnv.Iia«: rcspcctivc ly. Fach <rt of the gCll(lllle 'cqllcnce \\;\S scparaicd into two order' (Tf oligollllcleot ide frequcnc ics (trinuc lcotidc and
ICACSIS 2013
---
-
--_._---0.6
for
rrinucleoride frequency and 0.8for
tetranuc
zyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONMLKJIHGFEDCBA
Icotidc frequency.The sirnulated daraset was extracted
by
zyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONMLKJIHGFEDCBA
komafrequencies method to get the specific oligonucleotide frequcncy and to put it in the composition
matrix.
After e xtracting. each of the dataset was scaled to obtaina dataset
between zero and one. Afterscaling.
we separated the data set iruo data training and data testing (Table I and Table II). We used data training to obtain the trained model and used data testing to evaluare the GSOM algorithrn. The flowchart of" AULI.I
zyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONMLKJIHGFEDCBA
!J." 1..\ r""',ISI~li
~lJcrobc,
.-\udJth'llbacJllu,> fcnvoranv SSJ chrornosorne
Bunchncra aphuhcola (('onara tujufihna)
chromosornc
zyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONMLKJIHGFEDCBA
.3 Burkholdcr ragtumac B tiR Ichromosorne I
Hlanabactcnum sr (Hlabcruv grgantcus}
chrornosornc
Havobactenum branchrophilurn 1'1.·15
Pro otclla denucota ~U~8<1chrumosornc /)
7 ( hl,,,",,lta numdurum NI~~ lhl~Ill\J"phJlJ Icl" lc l'~'I
" 'lm k a O lJ n(~C\Cll\l\
zyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONMLKJIHGFEDCBA
Ilhf\lflH l,nm t: chrornovornc S"11 1.11I AHl.I: II
[),\Tr \ TESTINlj
II,,,, IIJlJ"llI'I"" ,"In It>rt,,(k, ·\Tee 1'~N
c hrornovomc
Hrucclla (anIS .-\ Iel'ChhH1111snnJc 1
Rh""t>'IHII ,'11, C1·N ~~
II~cl"'I>ldc' f,ag 1ii'11.1XK
Pr~'\t)I\.·II~nll:I~I1Il1()g\.·I1ICtl ,\1el' :!5X-l5
t:hrornp~IHnl' I
(, I'r\':\l'lrll.1 fU I1 IIf)I\..\.IIJ ~ .~ \.h h1Ilh · ...,· llh .·
l'hlilnl\ll"phtl" 1",,,,,,,11>1>1.10: :\K.I'l
R l'~raehl~In\J,,, ac antha.mmocbac 1'\'-7
chrornosornc
<) waddha chondrophrla \\ -;1' X('·I()·U
2
GSOM algorithm for clustcring rnetagenorne fragments are shown in I· igure 4.
To evaluare the clustering pcrformance. we used topology prcservation (topographic error). mapping precision (quantization error) and error perccntage
122J. 1231·
Wc also used time parameter to calculatethe efficiency
Quantization error is a common error mcasurernent that rneasure the averagc distance between each data vector and its Best Matching Unit (BMU) Definition of BMU is a a randomly sarnplcd vcctors tha count the nearcst distance bctw cen vcctors
1:::2
J Short ly. quantization error rncasurc the mapping precission bctwccn input vccior ii and ncarcst weight vccrormz;.
qe
=
2. ~
Iin -
mlillzyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONMLKJIHGFEDCBA
! Y L
.\'
ISBN:
978-979-1.:121-19-5
The expected map is obrained when the value of qe
(
s---=-T_~
/ C.~wt
~
II~ -t
;' 0""""",, /
LI __
"""_"'r(-"r_'_
,.--1-.
{ E n d )
'---'
:\IlJI~ ~ISproccdurc clu ....tfrmg rnctagcnornc fr3gllll"nl\
u'Ing (iSOr-l algouihm
1,\111.1. 111
ME !,\(jDJUMI' 1·I{Mi1\lLN 1S M',\I.YSIS V,\LlIE
l nnuclcoudc tctr anuclcoudc
rl!rxJ~raptll(
() llfl7 (Jn,,(,
error
()uanlllalll>n 1.1\).1
fJ 7~2 error
Irror pcrcc ntag c I ~ 7.l". I ~ .lRo.
Func (see)
zyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONMLKJIHGFEDCBA
(,00 2880rcachcd the minimum value.
To rneasure the topology preservation. we use topographic error. The topographic error considcrcd the map structure and cxplaincd the correspondcnce-betwccn input data. lhis error measures the proportion of all data vcctors for which first and second B\~U are not adjaccnt vcctors
122J.
.'1
1~
-Ce
=
"ii
L
u(xi)Where ll(xi) = 1 if ii data vcctor first and second B\lL:s are adjaccnt and O. othcrwise.
The error pcrccntagc used in this research was calculatcd based on the rcsult of rnisclassification of the mctagcnornc fragrncnts data.
ISB~: 9i8-9i9-1-t21-1!l-5
--- --- -- -
-ICACSIS 2013
lnn ' frcquenc, usrng (iSOM a
We use 06 SF value to control the neuron growth
stop growing in 28 x 13 mapmc
m
Map
with the values of 18, 74°'0 and 1848
zyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONMLKJIHGFEDCBA
~o for using trinuclcotidc and tetranucleotidc. rcspcctive ly. Thistcndency was also shown by topology preservation,
Both of oligonucleotide frequcncies gavc the
topographic error of 0_067 for trinucleotidc frequcncy
and 0_066 for tctranucleotide frcqucncy
l lowcvcr both topograhic error rcsult wcre cnough
10 prove that every BMUs in the map gr id was not adjacent vectors. It show ed that both maps gavc a quite good map prcservc to clustcring a metagcnomc
fragments.
Morcovcr b~ analyzing their quantization error. wc can concludc that tctranuclcoridc gavc bener clustcr than trinuclcotidc. Tctranuclcotidc frcqucncy gave a ma pp mg prccrsion result of 0,7~~. berter than that of trinucleotide frequcncy which is I J04, II mcans that clusters constructed using tctranuc lcotide frcqucncy
feature are more dense, Wc also showcd the rnapping
rcsulrs using trinucleotidc Ircqucncy (Figure ~) and
uS.ing letranuclcotide frcqucncy (Figure 6)
IV, COI"CI.lISIO~
Our method. cornbrrung (jSO~1 and
oligo-nuclcotide can show 3good pcrformance in clustcr ing mctagcnomc fragmcrns ln phylurn level with 5h0r1 fragment (1 kbp) 'l hc rcsults showcd that the pcrforrnance of clustcring using tctranuc lcoride was bener than using trinuclcoridc. The error pcrccntagc rcsult of using tctra-nuclcoudc is 18 .tRO 0 and the
quantization error \\3S 0 ~72 less than lIsIng tr
r-nuclcotidc \1orc()\ er, the ropographic error ISsl11<1l1.
lrg 6 \brrlOg tctranuclcoudc rr~411~n(\ l""l~ (i'i< )\1 ;tI~pfllhm
\\'~ li"
zyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONMLKJIHGFEDCBA
I) ~<"1' value Ip ,"1111,,1 11\,' neuron ~r""lh Map"l\'r ~rtl\\ In~ 11\30 X 12 mar
zyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONMLKJIHGFEDCBA
"d/~whic h Illcall~ the neuron in m.ip ;:.rid I' not adi~I(L'llt
UIlC ~1I1(1anothcr I~a,ed on these rc-ults \IC can
conclude th~lt ljS{)\l al~l'rithlll I' suitablc tor ll1apflll1~ the lIlet:q;cIlUI1lL'
fragmcn: kll~th ~1I1dha-,
fra;:'lIlcllh \Iuh
the llppl)rtllllitl
{ ~~ (h,lll. II,,, ... I,II\~ .md <, " 11.11-...:.111111;':1.' '\ "'1Il~ l;rll\\H1~ \.:11·( Ir~.II\I'-IIl~ \1.11''' Ip Impf!'\": the
HlIlIlIl1~ 1"1\1.."'" II) I11\HPIlI1ll'I11.li \\ hllk-lICllllllll." \th'll!lIn
\l'lPI(lIllll~ j"IIJllo d ,., "'''''Il'/II. fllC dll I /{I"/'" h'/"!I',,"
\ 1 ~IIPS : .,S
\1 \ (I \!;Jlk\ \h.-I.I!'Uh'lllh.' .~ III: ....'11: i('1111111.'1 \ ,,!!!..bk
hill' \\\\\\ maun, •...lltll1l;llIc\ ,'r~ pllhlh ••1I11'fl' 1111111 -, !1~1I.1\.lJll~1 '1 ~.1",1I Jlal lLH.1 \\,,1,+1.11
"'111111\111111(' 1111d'\'i'fll.lflllll,IIL'd '\..· .I\\.I!O ""'1 II!'" iI:1 'H
,n /1"'II/J/j"i"l'\ \t'\ \..;, pr 211' ~ I·J ~tlll.1
I f \ rrurr " !~\.· f'IIIl;:I,· nI I Ill'Id\."H"'\:r~·\ I 111Ir,,:rl1
I) RII•...•...h I \ I1'\.11 I) \\:1 I ".1111 ....(11 Jo.... I '.(1 '11 \\
"1..1"1\!l I I. 'i" •.••j (\\ jj ~Il.'r \1 \\ i,'ill.l' i\ '.r.t!"'11 ti \\hill" I I'rh:r' '11 J Ilt'lll1l.1Il R !',H"'Il" Ij
!'...:Jc!!·ldhl·:l ( !'!.::;:':1. ...1; ~ t-.:.t~l.:r.:.. .1Ih! ! 1 fI •..•iili:h
j11\ H,·tl1ll\·nl.Jl :'(I'II'lIil '\'lt'I~'llll \Cqtllll •.ll1~ ,·1 1111..'
'.H~_.I"\., (,<.1 .. -,•.. !.";I.!.. \'.! ' ~ ilt· :'10-.1
rp
l~ . -1 ~I"'~\h"\~"!,!:,"r\.. ... Hill {,I"lllll'! \h:I<I;..'\,.'II,·mc \·1.:/· ..• ·.• : /' i" "0' I':, ,. , tO' ":, \ \,,' I :"'r . : -,
[I J
I~I
1-'1
I' I
\\ \ ~,\I' lm.~ 'l1h "l': \rrr •. Lhc .. f,' Ill1rr .. \1Il~' fhe
1\"fl"rrll,l~ld .! ~l' 11,'\.' 1)', \ 'l'4lh.1hl" \"l'Jllt-l\ .1:HJ
\hl.l~':il n n , ( 1,1,'111 •..tt;, 11 "1 'h"rI I"I:':l1lll\l •• Ir"11! 'v c vl
r'(!.~· I.lllIl1 '~~i".~I~· .:I -k .. ln ..llitJi( -f Ic ...hl:,·;··f'\
,'''1_'
'~i .tf
ICACSIS
2013
ISBN: 978-979-1421-19-5
(S]
Fragrnent Binning.'
zyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONMLKJIHGFEDCBA
IEEE Sympostum OlfCIBCB. \"01. 8. pp.4&-53.2008.
H. Zheng and
zyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONMLKJIHGFEDCBA
1-1. Wu. "A Novel LDA and PCA-basedI-lierarchical Scheme for Metagenomic Fragrnent Binning ..'
IEEE Symposium on CIBCB. 2009.
D. II. Huson. r\. F. Auch. J. Oi. and S. C Schuster.
"MEGAN analysis of metagenomic data MEGAN analysrs
of rnetagenomic data:' Genome Research. \01. 17. pp. I-II.
2007.
1-1.Feeling, J. Waldmann. T. Lornbardot, M. Bauer. and F. O.
Glockner, "TETRA: a web-service and a stand-alone
program for the analysis and comparison of tetranucleoiide
usagc panerns in DNA sequences.," B.\/C biomformaucs.
\01. 5.p. 163, Oct 2004.
K. Amano, H. lchikawa, H. Nakamura, H. Numa. K.
Fukami-Kobayashi. Y. Nagamura. ind N. Onodera.
"Self-organizing Clustering: Non-hierarchical Clustenng for l.arge
Scale DNA Sequence Data," IPSJ Digital Courier. \'01. 3.
no. 2.pp. 193-197.2007.
T. Abe. S. Kanaya, M. Kinouchi, Y.Ichiba, T. Kozuki, and
T lkemura, "lnformatics for Unveiling Iliddcn Genornc
Signatures.' Genome Research, vol. 179, no.~. pp. 693-702.
2003.
A. L. Hsuand S. K.Halgamuge. "Enhanccment ofTopology
Preservation and lIierarchical Dynamic Self- Organising
Maps for Data Visualisation."
A C Mc+iardy. H. G. Martin. P. Hugenholtz. and I.
Rigoursos. "Accuratc Phylogenenc Classrflcauon ni"
Vnriablc-Lcnght D'I,\ Fragrncnt." Sa! .I!e!hods. \01 .1. ~,\
I.pr 63-72. 20 I 0 (9]
(lOI
(111
( 121
(131
( I~]
zyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONMLKJIHGFEDCBA
289
\15] G. Rosen. E. Garbarine, D. Caseiro. R. Polikar, and B.
Sokhansanj. "Metagenorne Fragrnent Classification using
N-mer Frequency Profiles." Advances in Bioinformatics, 2008.
A. Brady and S. L. Salzberg, "Phymrn and Phvmmlsl.:
Meragenomic Phylogenetic Classification with Interpolated
Markov Models." Sat
zyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONMLKJIHGFEDCBA
M ethods, vol. 6. no. 9. pp. 673-676,2010.
S Prabhakara and R. Acharya, "Unsupervised Two-Way
Clustering of Metagenomic Sequences." Journal of
Biomedicine and Biotechnology, vol. 2012, 2012.
G. Zhu. "The Growing Self-organizing Map for Clustering
Algorithms in Programming Codes." IEEE lnternattonal
Conference on AIC!, vol. 3. pp. 178-182,2010.
D. D. Silva. DAIahakoon, and S. Dharmage. "Cluster
Analysis using the GSOM: Parterns in Epidemiology." in
IEEE Intemational Conference on ICIAF. 2007. pp. 63-69.
D. C. Richter. F.On. A. F.Auch, R. Schrnid, and D. H.
Huson. "Metasim - Sequencing Sirnulator for Genornic
Sequence." PLoS OSE, vol. 3, no. 10, pp. 1-34. 2008.
S Fcderhcn. 'The NCBI Taxonomy database." Xucleu: Actds Research, 1'01. 40. no. Decernber 201 I. pp. 136-143. 2012.
E. A. Uriane and F. D. Martin. "Topology Preservation in
SOM." lnternattonal Journal of Applied Mathematics and
Computer SCiences, pp. 19-22.2005.
J. Vesanto. M. Sulkava, and J. I-Iollm. "On the
Dccomposition of the Self-Organizing Map Distortion
Mcasure Dccomposiuon of the SOM distortion rncasurc ..
116]
(17]
(181
119]
(20]
121]
(221