

Genomic analysis of the causative agents of coccidiosis in domestic chickens

Adam J Reid^1*, Damer P Blake^2,3, Hifzur R Ansari^4, Karen Billington^3, Hilary P Browne^1, Josephine Bryant^1, Matt Dunn^1, Stacy S Hung^5, Fumiya Kawahara^6, Diego Miranda-Saavedra^7, Tareq B Malas^4, Tobias Mourier^8, Hardeep Naghra^1,9, Mridul Nair^4, Thomas D Otto^1, Neil D Rawlings^10, Pierre Rivailler^3,11, Alejandro Sanchez-Flores^12, Mandy Sanders^1, Chandra Subramaniam^3, Yea-Ling Tay^13,14, Yong Woo^4, Xikun Wu^3,15, Bart Barrell^1¥¨, Paul H Dear^16, Christian Doerig^17, Arthur Gruber^18, Alasdair C Ivens^19, John Parkinson^5, Marie-Adèle Rajandream^1†, Martin W Shirley^20, Kiew-Lian Wan^13,14, Matthew Berriman^1, Fiona M Tomley^2,3*, Arnab Pain^4*

Supplemental methods

Details of genome assembly

Genomic sequencing and assembly for E. tenella Houghton

Genomic sequencing, assembly and annotation of tier 2 genomes

Genomic sequencing, assembly and annotation of tier 3 genomes

Whole genome (Optical) map generation

Analysis of genome completeness

Genome annotation

Gene finding for E. tenella

Functional annotation

Classification and analysis of gene families

Metabolic reconstruction for E. tenella

Identification and classification of protein kinase genes

ApiAP2 transcription factor analysis

Identification and classification of novel gene families

Calculation of Ka/Ks

Sag gene identification and characterization

Retrotransposon analysis

Transposon identification

Copy number variation of retrotransposons between strains of E. tenella

Repeat analysis

Placing HAARs in structural context

Proteomic analysis of HAARs

Indel analysis of repeats

Functional enrichment in groups of genes

References


Details of genome assembly

Genomic sequencing and assembly for E. tenella Houghton

Genomic DNA was prepared for Sanger sequencing and optical mapping from sporozoites purified by passage through columns of diethylaminoethyl cellulose and prepared in agarose plugs to minimise physical shearing of the extracted DNA prior to treatment overnight with Proteinase K (Ling et al. 2007). Genomic DNA was prepared for Illumina sequencing from purified sporulated oocysts using a Mini-BeadBeater-8 (Blake et al. 2012).

Sanger capillary sequencing clone libraries with average insert sizes of 1.2, 1.7, 3, 35 and 65 kbp were generated. A total of 897,191 reads (87.7% paired), free of mitochondrial, vector and chicken sequences, were used for assembly. Estimated genomic coverage for base quality >= 20 was ~8x. Illumina GAIIx sequencing libraries with insert sizes of 300 bp and 3 kbp were used to obtain 54 bp and 76 bp paired-end reads with a combined coverage of ~160x.

Capillary reads were assembled with ARACHNE v3.2 using default parameters (Batzoglou et al. 2002). IMAGE (Tsai et al. 2010) was used to fill gaps in scaffolds and extend contigs with Illumina reads, running six iterations (three with k-mer=31 and three with k-mer=27) with BWA mapping. The consensus sequence from the ARACHNE-IMAGE assembly was then corrected with the Illumina reads using iCORN (Otto et al. 2010). All Illumina reads that did not map to this assembly were assembled using Velvet (Zerbino and Birney 2008) and the resulting contigs were added to the final assembly.

Genomic sequencing, assembly and annotation of tier 2 genomes

Paired-end Illumina libraries were prepared from 200 ng of genomic DNA using a TruSeq Illumina DNA library preparation kit on a Diagenode IP-Star machine, following the protocol "Illumina library prep" until adapter ligation and purification. After ligation, libraries were run on a 2% gel and size selected for a 500 bp insert size. These were gel-extracted, PCR was performed using the standard Illumina PCR cycle, and samples were run on a Bioanalyzer to check insert size and on a Qubit for quantification. They were normalized, pooled and submitted for sequencing on an Illumina HiSeq 2000 platform to a depth of 199x theoretical genome coverage for E. acervulina H, 288x for E. maxima W and 559x for E. necatrix H, as in Kozarewa et al. (2009).

Sequencing reads were clipped, deleting bases with low quality (SGA preprocess version 0.9.9 (Simpson and Durbin 2012); parameters: m 51, permute-ambiguous, f 3, q 3). We then assembled reads using Velvet (Zerbino and Birney 2008); parameters: -exp_cov auto; -ins_length 300 -ins_length_sd 30; -cov_cutoff 5; -min_contig_lgth 200 -min_pair_count 10. We chose high k-mers to better assemble short repeats. We used k=71 for E. necatrix and E. maxima and k=65 for E. acervulina. Next we scaffolded the contigs with SSPACE (Boetzer et al. 2011), running it iteratively, each time reducing the number of mate pairs required to join contigs as follows: 200, 100, 50, 30, 20, 20, 10 and 10. We set the n parameter to 31. To close sequencing gaps we used IMAGE (Tsai et al. 2010), running three iterations with a k-mer of 85 for each genome, then six iterations with a k-mer of 61, setting the smalt_minScore parameter to 45.

For each tier 2 genome we identified ~200 gene models with which to train Augustus by manual annotation of models generated by CEGMA (Parra et al. 2007) and models transferred from E. tenella using RATT (Otto et al. 2011), with reference to syntenic gene models and RNA-seq data. These curated models were used to train Augustus, with which a final set of gene models was predicted.

Genomic sequencing, assembly and annotation of tier 3 genomes

Paired-end Illumina libraries were prepared as described above with 500 bp fragments and sequenced using an Illumina HiSeq 2000 to a depth of 143x theoretical genome coverage for E. brunetti H, 520x for E. mitis H and 102x for E. praecox H (Kozarewa et al. 2009).

Tier 3 genomes were assembled as for Tier 2 genomes, with the exception that no gap closing was run. We used a k-mer of 71 for E. mitis and E. brunetti and 75 for E. praecox. For E. mitis we used a reduced set of 125 million randomly selected reads.

Gene models were predicted using Augustus trained with E. tenella gene models as these were considered to be the highest quality of the four species with curated gene models.

Whole genome (Optical) map generation

DNA embedded in agarose plugs (100 μl) was washed in TE (pH 8), melted, mixed with 100 μl of an agarase solution (4 μl agarase (1000 U/ml)/96 μl TE) and heated to 42°C for 12 hours, resulting in high molecular weight DNA samples. DNA samples were diluted to approximately 500 pg/μl and 2 μl of DNA was applied to a MapCard, which was run on the OpGen Argus® system following the manufacturer's protocols. MapCard reagent chambers were loaded with JOJO™ stain, OpGen enzyme, buffer and antifade. The card was then cycled on the Argus® MCP (MapCard Processing Unit) for approximately 25 minutes, with digestion at 37°C using AflII. Contig assembly was performed using the Argus® MapManager™ software with minimum molecule size 250 kb, minimum fragments per molecule 12 and minimum molecule quality 0.4. Overlapping contigs were combined, reassembled and extended until either telomere regions were identified or no further extension was possible. The QC module of MapManager™ from OpGen was used to identify mis-assemblies, which were resolved manually where possible.

We used an in-house script to convert the alignment of optical map and genome assembly into a scaffolded genome assembly. Resulting scaffolds were used to analyse synteny between genomes and look at long range repeat structure, but all other analyses were performed using whole, unscaffolded assemblies. We first generated placement files from MapSolver v3.2 using default settings. We then generated each scaffold by extracting the matching regions of supercontigs from the genome assembly and joining them with gaps (Ns) of the length specified in the placement file. Where necessary a reverse complement of the sequence was used. Where a region of the map matched multiple parts of the genome, we took only the longest match.
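The in-house script itself is not reproduced here. As a rough illustration of the joining step described above, the following minimal sketch assumes a hypothetical, already-parsed placement record (supercontig name, region coordinates, orientation and the gap length to the next placement); it does not read the MapSolver placement file format, and the `Placement` type and `build_scaffold` function are names invented for this example. The "longest match" rule is assumed to have been applied before the placements are passed in.

```python
from typing import Dict, List, NamedTuple


class Placement(NamedTuple):
    """One supercontig region placed on the optical map (hypothetical record)."""
    contig: str      # supercontig identifier in the assembly
    start: int       # 0-based start of the matching region
    end: int         # end of the matching region (exclusive)
    reverse: bool    # True if the region aligns in reverse orientation
    gap_after: int   # gap length (bp) to the next placement; 0 for the last


_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_scaffold(placements: List[Placement], contigs: Dict[str, str]) -> str:
    """Join placed regions in map order, padding each gap with Ns."""
    parts = []
    for p in placements:
        region = contigs[p.contig][p.start:p.end]
        if p.reverse:
            region = reverse_complement(region)
        parts.append(region)
        parts.append("N" * p.gap_after)
    return "".join(parts)


if __name__ == "__main__":
    contigs = {"sc1": "AACCGG", "sc2": "TTTGAA"}
    placements = [
        Placement("sc1", 0, 4, False, 3),
        Placement("sc2", 2, 6, True, 0),
    ]
    print(build_scaffold(placements, contigs))
```

In this toy input the second region is reverse-complemented before being appended after a 3 bp gap.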

Analysis of genome completeness

We initially used CEGMA (Parra et al. 2007) to identify orthologues of core eukaryotic gene families (KOGs) within Eimeria genomes and compared this to values found in Toxoplasma gondii ME49 and Plasmodium falciparum 3D7 genomes. The values for Eimeria genomes were much lower than those for Toxoplasma and we reasoned this might be due to genome fragmentation rather than genome incompleteness, as the genomes appeared to be roughly the correct sizes based on the similarity of optical map and genome assembly sizes. We then identified the genes relating to each KOG found in Toxoplasma but not in Eimeria and looked to see whether we had identified one-to-one orthologues for these using OrthoMCL. In most cases we found the missing KOG member in each Eimeria genome and adjusted the CEGMA completeness values by adding in the newly identified KOGs. In E. tenella we identified a further 45 KOGs, E. necatrix 35, E. acervulina 33 and E. maxima 28.


Genome annotation

Gene finding for E. tenella

A combined approach was used to predict a complete set of protein coding genes in E. tenella. We initially prepared 1000 gene models from a previous version of the E. tenella genome assembly that had been curated to reflect a variety of evidence, including our RNA-seq transcriptome data. These were used to train a series of methods that were combined using JIGSAW (Allen and Salzberg 2005).

The gene prediction programs Augustus (Stanke et al. 2006), GlimmerHMM (Majoros et al. 2004) and SNAP (Korf 2004) were trained using the 1000 curated gene models, using default parameters. The trained parameters were then used to run each method on the earlier genome assembly, producing predictions, which could be used to train JIGSAW. A 95%-identity non-redundant dataset of apicomplexan proteins from UniProt (UniProt 2013) was also BLASTed against the old assembly to produce homologue information. Four lanes of paired-end RNA-seq data, each from a different life stage, prepared as described below, were mapped against the old assembly using TopHat v1 (Trapnell et al. 2009) (with -r 40, -I 10000). Transcripts were predicted from this mapping using Cufflinks (Trapnell et al. 2010) (-Q 10 -I 10000) and datasets included as four distinct gene predictors in the combined gene prediction. Augustus, GlimmerHMM and SNAP provided “acc don coding start stop intron” information to JIGSAW, whereas Cufflinks predictions provided only “acc don intron”. BLAST hits to our database of apicomplexan proteins from UniProt were included as homology information.

We then ran JIGSAW in training mode using the 1000-gene curated set as the gold standard. This resulted in trained parameters which could be used to combine the same set of predictors on the new genome assembly. Each predictor (Augustus, GlimmerHMM, SNAP) was run on the new assembly, using Eimeria-specific training parameters derived from the 1000 gene models on the old assembly. Each of the four RNA-seq datasets was mapped to the new assembly and transcripts predicted using Cufflinks. The BLAST dataset was also run against the new assembly.

These eight lines of evidence were used to predict a set of 8787 gene models for the E. tenella assembly. Roughly half fell into OrthoMCL groups in a comparison with T. gondii and N. caninum, and we identified homologues for nearly 98% of core eukaryotic genes. Roughly 90% of the Eimeria proteins in Swiss-Prot had homologues in the predicted E. tenella proteome; the missing 10% were apicoplastic. Coverage of UniProt (Swiss-Prot/TrEMBL) was only ~60%; however, those missing were predictions from the published chromosome 1 sequence, which we consider dubious. Subsequent manual work on the gene models led to an overall reduction in number to 8603.

Functional annotation

Initial functional annotation (product calls) was performed using the following algorithm. First, product descriptions were assigned from Eimeria-specific proteins in UniProt. Genes remaining unannotated were assigned a product name from a one-to-one orthologue in T. gondii. Genes still unannotated were assigned a product name from a homologue in UniProt with E-value <= 1e-10.

Annotations from T. gondii orthologues were suffixed “putative”, while those from UniProt were suffixed “related”. Where an orthologue existed in T. gondii but had no annotation or annotation was “hypothetical protein” or similar we applied the term “hypothetical protein, conserved”. Where there was no significant match at all we used the annotation “hypothetical protein”.
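The product-call cascade described above can be sketched as a small decision function. This is an illustrative reading of the text, not the pipeline that was actually run: the function name, its arguments and the ", putative" / ", related" separator are assumptions, as is the choice to try a UniProt homologue before falling back to "hypothetical protein, conserved" when the T. gondii orthologue is uninformative.

```python
from typing import Optional

EVALUE_CUTOFF = 1e-10  # UniProt homologue threshold stated in the text


def _informative(product: Optional[str]) -> bool:
    """True if a product name says more than 'hypothetical protein'."""
    return bool(product) and product.strip().lower() != "hypothetical protein"


def product_call(eimeria_product: Optional[str],
                 tg_ortholog_found: bool,
                 tg_product: Optional[str],
                 uniprot_product: Optional[str],
                 uniprot_evalue: Optional[float]) -> str:
    """Assign a product description following the three-step cascade."""
    # Step 1: Eimeria-specific UniProt entry, used as-is.
    if eimeria_product:
        return eimeria_product
    # Step 2: one-to-one T. gondii orthologue with a usable annotation.
    if tg_ortholog_found and _informative(tg_product):
        return f"{tg_product}, putative"
    # Step 3: any UniProt homologue passing the E-value cutoff.
    if (uniprot_product and uniprot_evalue is not None
            and uniprot_evalue <= EVALUE_CUTOFF):
        return f"{uniprot_product}, related"
    # Fallbacks: conserved but unnamed, or no significant match at all.
    if tg_ortholog_found:
        return "hypothetical protein, conserved"
    return "hypothetical protein"


if __name__ == "__main__":
    print(product_call(None, True, "actin", None, None))
    print(product_call(None, True, "hypothetical protein", None, None))
    print(product_call(None, False, None, "protein kinase", 1e-5))
```

Ordering the checks from most to least specific evidence mirrors the "for those remaining unannotated" wording, so each gene receives exactly one product call.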

We performed an InterPro scan (Hunter et al. 2009) on the gene models for E. tenella and extracted Pfam (Punta et al. 2012) and Gene Ontology (Ashburner et al. 2000) data for further analysis.

Classification and analysis of gene families

Metabolic reconstruction for E. tenella

The 8,786 E. tenella genes were searched using the following homology-based enzyme prediction tools: (i) DETECT (Hung et al. 2010) (cutoff ILS > 0.2, at least 5 positive hits), (ii) BLAST (E-value < 1e-10) and (iii) PRIAM (Claudel-Renard et al. 2003) (E-value < 1e-10). To account for EuPathDB-specific annotations for highly conserved apicomplexan enzymes, enzymes shared by P. falciparum and T. gondii that produced a BLAST hit (E-value < 1e-10) to the E. tenella gene model were included as an additional dataset (based on PlasmoDB and ToxoDB gene-EC mappings). The BRENDA resource (Barthelmes et al. 2007) provided biochemical evidence for 15 enzymes, with evidence for an additional 68 enzymes from the supplemental resource AMENDA. The final set of 571 E. tenella enzymes was obtained by integrating datasets from BRENDA, DETECT, T. gondii orthologs, apicomplexan-conserved enzymes and enzymes identified by both BLAST and PRIAM.
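The integration rule in the last sentence amounts to a union of the trusted sources plus the intersection of the BLAST and PRIAM predictions. The sketch below illustrates that set logic only; representing each dataset as a set of EC numbers, the function name and the toy EC numbers are all assumptions made for the example.

```python
from typing import Iterable, Set


def integrate_enzymes(brenda: Iterable[str],
                      detect: Iterable[str],
                      tg_orthologs: Iterable[str],
                      api_conserved: Iterable[str],
                      blast: Iterable[str],
                      priam: Iterable[str]) -> Set[str]:
    """Combine enzyme predictions (here, EC numbers) from all sources.

    BRENDA, DETECT, T. gondii orthologue and apicomplexan-conserved
    predictions are accepted directly; BLAST and PRIAM predictions are
    accepted only where both tools agree.
    """
    accepted_directly = (set(brenda) | set(detect)
                         | set(tg_orthologs) | set(api_conserved))
    blast_and_priam = set(blast) & set(priam)
    return accepted_directly | blast_and_priam


if __name__ == "__main__":
    final = integrate_enzymes(
        brenda={"1.1.1.27"},
        detect={"2.7.1.1"},
        tg_orthologs={"2.7.1.1", "4.2.1.11"},
        api_conserved=set(),
        blast={"5.3.1.9", "3.6.1.1"},
        priam={"5.3.1.9"},
    )
    print(sorted(final))
```

Requiring agreement between BLAST and PRIAM is a simple way to limit false positives from either tool alone, which is consistent with those two being the only sources combined by intersection.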

Identification and classification of protein kinase genes

We searched the set of predicted protein sequences for all seven Eimeria species using the Kinomer HMM Library (Miranda-Saavedra et al. 2012) with HMMER 3 (Eddy 2011). The group-specific E-value cutoffs were adjusted as described in Talevich and Kannan (2013).

A ROPK-specific HMM profile was obtained from a multiple alignment of 33 ROPKs reported in Toxoplasma (Peixoto et al. 2010). The entire T. gondii proteome was searched with this ROPK HMM; the highest E-value amongst the 33 sequences used to generate the ROPK HMM was 1e-30, and this value was therefore used as the cut-off to detect Eimeria ROPKs. We identified twenty-one ROPKs across all Eimeria genomes, which were added to the HMM. The Eimeria genomes were searched again with the new HMM, but no further convincing ROPK genes were found. Subsequently, new work was published which identified a novel, Eimeria-specific ROPK subfamily (Talevich and Kannan 2013). This family was not reliably identified using HMMs and required more manual work to verify examples. We therefore used OrthoMCL families to identify orthologues of the ROPKs identified by Talevich and Kannan across Eimeria species.
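The way the ROPK E-value cut-off was chosen (the weakest score among the sequences used to build the profile) can be illustrated with a short sketch. It operates on already-parsed E-values and does not call HMMER; the function names and the example identifiers are hypothetical.

```python
from typing import Dict, Iterable, List


def cutoff_from_training(training_evalues: Iterable[float]) -> float:
    """Use the highest (weakest) E-value among the training sequences."""
    return max(training_evalues)


def select_hits(hits: Dict[str, float], cutoff: float) -> List[str]:
    """Keep hits whose E-value is at least as good as the cutoff."""
    return sorted(name for name, evalue in hits.items() if evalue <= cutoff)


if __name__ == "__main__":
    # Toy E-values for known family members scored against their own HMM.
    cutoff = cutoff_from_training([1e-80, 1e-45, 1e-30])
    # Toy search results in another proteome.
    candidates = {"gene_a": 1e-50, "gene_b": 1e-10, "gene_c": 1e-30}
    print(cutoff, select_hits(candidates, cutoff))
```

Calibrating on known members in this way means every training sequence would itself pass the threshold, while weaker and less certain hits are excluded.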

Other rhoptry and also dense granule genes were identified using OrthoMCL and BLAST analyses.

ApiAP2 transcription factor analysis

We compiled proteomes of 23 species from public databases such as EuPathDB (Aurrecoechea et al. 2009) and the NCBI Genome database (http://www.ncbi.nlm.nih.gov/genome) and searched for AP2 domains (PF00847) using HMMER 3 (Eddy 2011). We then used TBLASTN to identify ApiAP2 genes that were missed in the Eimeria genomes during automated annotation, using the previously identified genes as queries and requiring >= 70% identity and an E-value <= 1e-5. Ortholog clusters were generated using OrthoMCL v2.0 (Li et al. 2003) with a percent match of 10 and inflation parameter I=1.2. We mapped the respective orthologous groups to the available E. tenella expression profiles for different stages. Hierarchical clustering, heatmaps and Venn diagrams were generated using the 'gplots' package (http://cran.r-project.org/web/packages/gplots/) of R (http://www.r-project.org/) to produce Supplemental Figure S6.

Identification and classification of novel gene families

We identified six OrthoMCL clusters of E. tenella genes, which contained multiple genes with no orthologue in T. gondii and no significant similarity to protein sequences in UniProt by BLAST. We built amino acid multiple sequence alignments of each family using MUSCLE (Edgar 2004) and refined these manually. Hidden Markov Models (HMMs) were built from the alignments and the other Eimeria genomes were searched for homologues using HMMER 3 (Eddy 2011). ESF1 was too diverse to be used in building a HMM and therefore


Calculation of Ka/Ks

Using OrthoMCL we identified one-to-one orthologous genes across all seven Eimeria genomes. We eliminated all gene models without a proper start codon. Next, we removed all low complexity regions using dustmasker (Morgulis et al. 2006) with standard parameters, conserving the codons. Nucleotide sequences were translated into amino acids, on which a further low complexity filter was used with default parameters (Wootton and Federhen 1996). After performing multiple alignments with MUSCLE (Edgar 2004), these were back-translated and cleaned with Gblocks (Castresana 2000); parameters: -t=c -p=n -b4=4. From the cleaned alignments we calculated the pairwise Ka/Ks between E. tenella and all other species, or E. acervulina and all other species, using the Perl module Bio::Align::DNAStatistics (CPAN). We noticed that esf1 and esf2 family members appeared to have high Ka/Ks ratios. We used a one-tailed Wilcoxon rank sum test (wilcox.test in R v3.0.0) to determine whether these families (and the sag family) had significantly higher ratios than genes in general.

Sag gene identification and characterization

Known SAG amino acid sequences (SAG1-23) were compared against the E. tenella assembly using tblastn. Regions with E-value <= 0.01 were manually curated using Artemis (Carver et al. 2008) resulting in 79 gene models. All genes were roughly the same length (~260 amino acids); dotplots were used to determine whether individual genes contained multiple domain copies but no evidence was found. We clustered these sequences into three families using BLAST and the heatmap function in R and built a Hidden Markov Model for each one using HMMER 3 (Eddy 2011). We searched the predicted translations of all seven Eimeria genomes with the HMMs. We then clustered these sequences using TRIBE-MCL (Enright et al. 2002) and identified three families across all species, one of which included two of the previous families. The sagA family was too diverse to align and was split into two subfamilies using hierarchical clustering, resulting in families sagA1, sagA2, sagB and sagC.

An in-house script called Pseudofind was used to determine likely fragmented or pseudogenic sags by looking for regions of the genome with sequence similarity (by BLAST) to sags which had not been predicted as genes. Pseudofind then clustered overlapping hits into single hits. For E. tenella we manually identified loci corresponding to individual pseudogenes and further collapsed the hits where necessary. For other genomes we report the number of computationally clustered fragments. Pseudofind is available as part of the supplemental material.
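The step in which overlapping hits are clustered into single hits is a standard interval merge. The following sketch shows one way it could be done; it is not the Pseudofind code distributed with the supplemental material, and the `(contig, start, end)` tuple layout and function name are assumptions for illustration.

```python
from typing import List, Tuple

Hit = Tuple[str, int, int]  # (contig, start, end), inclusive coordinates


def merge_overlapping_hits(hits: List[Hit]) -> List[Hit]:
    """Collapse overlapping hits on the same contig into single intervals."""
    merged: List[Hit] = []
    for contig, start, end in sorted(hits):
        if merged and merged[-1][0] == contig and start <= merged[-1][2]:
            # Overlaps the previous interval: extend it if needed.
            prev_contig, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_contig, prev_start, max(prev_end, end))
        else:
            merged.append((contig, start, end))
    return merged


if __name__ == "__main__":
    blast_hits = [
        ("contig1", 150, 300),
        ("contig1", 100, 200),
        ("contig1", 400, 500),
        ("contig2", 120, 180),
    ]
    for hit in merge_overlapping_hits(blast_hits):
        print(hit)
```

Sorting first means each hit only needs to be compared with the most recently merged interval, so the merge runs in a single pass.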

GPI anchor addition sites were predicted using big-PI trained on protozoa (Eisenhaber et al. 1999). Signal peptides were predicted using SignalP with default options (Petersen et al. 2011).

The HHpred server (Soding et al. 2005) was used to determine distant relationships between our sag HMMs and those from the PDB70 and Pfam databases.


Retrotransposon analysis

Transposon identification

Contigs from all seven Eimeria genomes were translated into amino acid sequences in all six frames and searched using HMMsearch (Eddy 2011) with an HMM built from a range of reverse transcriptase proteins. For all significant hits (E ≤ 0.01), flanking sequence of 7.5 kb in each direction (where possible) was collected. Within each genome, sequences were searched against each other using BLAST (Altschul et al. 1997). Any two sequences producing a match of at least 1 kb in length were grouped, and clusters were built from these groupings. All clusters with at least five member sequences were aligned and kept for further analysis.
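The grouping rule above (pairs with a match of at least 1 kb are joined, and clusters with at least five members are kept) can be read as single-linkage clustering. The sketch below implements that reading with a union-find structure; single linkage is an interpretation of the text, and the input format (pairwise matches as `(seq_a, seq_b, match_length)` tuples) and function name are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def cluster_sequences(seq_ids: Iterable[str],
                      matches: Iterable[Tuple[str, str, int]],
                      min_match_len: int = 1000,
                      min_cluster_size: int = 5) -> List[List[str]]:
    """Single-linkage clustering of sequences joined by long pairwise matches."""
    seq_ids = list(seq_ids)
    parent: Dict[str, str] = {s: s for s in seq_ids}

    def find(x: str) -> str:
        # Follow parent links to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, length in matches:
        if a != b and length >= min_match_len:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters: Dict[str, List[str]] = defaultdict(list)
    for s in seq_ids:
        clusters[find(s)].append(s)
    return [sorted(members) for members in clusters.values()
            if len(members) >= min_cluster_size]


if __name__ == "__main__":
    ids = [f"s{i}" for i in range(1, 8)]
    pairs = [("s1", "s2", 1200), ("s2", "s3", 1500), ("s3", "s4", 1000),
             ("s4", "s5", 2000), ("s5", "s6", 900), ("s6", "s7", 5000)]
    print(cluster_sequences(ids, pairs))
```

In the toy input the 900 bp match is too short to link s5 and s6, so only the five-member cluster survives the size filter.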

Alignments of sequence clusters were overlaid with TE protein domains (HMMsearch using the Pfam models for reverse transcriptase, protease, integrase, RNaseH, gag and chromodomain) and, from this, candidate TE sequences were extracted manually. For each group of extracted candidate sequences a consensus sequence was constructed using simple majority rule. LTR sequences were predicted using LTRharvest (Ellinghaus et al. 2008). Notably, with the exception of one sequence, all consensus sequences contain a chromodomain. A RepeatMasker library (http://repeatmasker.org) consisting of the candidate sequences was used to scan all seven Eimeria genomes. The greatest proportion of a genome annotated as transposable elements was 2.12% (E. brunetti).
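The majority-rule consensus step mentioned above can be illustrated as follows. This sketch takes the most common character in each alignment column and omits columns where the gap character wins; how ties and gap-dominated columns were actually handled is not stated in the text, so both choices here are assumptions, as is the function name.

```python
from collections import Counter
from typing import List


def majority_consensus(aligned: List[str], gap: str = "-") -> str:
    """Build a consensus from equal-length aligned sequences.

    The most frequent character in each column is used; columns in which
    the gap character is most frequent are left out of the consensus.
    Ties go to the character seen first in the column (an arbitrary choice).
    """
    if not aligned:
        return ""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    consensus = []
    for column in zip(*aligned):
        base, _count = Counter(column).most_common(1)[0]
        if base != gap:
            consensus.append(base)
    return "".join(consensus)


if __name__ == "__main__":
    alignment = ["ACGT-A", "ACGTTA", "AC-T-A"]
    print(majority_consensus(alignment))
```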

Copy number variation of retrotransposons between strains of E. tenella

Illumina sequencing reads from Nippon and Wisconsin strains of E. tenella were mapped against the E. tenella Houghton reference genome using SMALT (Ponstingl, unpublished). The indexing step was run with parameters -k 13 -s 2, and the mapping step with -y 0.9 -i 1000 -r 0. We then ran CNVnator v0.2.7 (Abyzov et al. 2011) with a bin size of 300 or 1000. We searched for overlaps between regions predicted by CNVnator with both bin sizes and each retrotransposon we had identified. We then inspected each example visually in Artemis (Carver et al. 2008).
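The overlap search can be sketched as a simple interval comparison. The example below reads "both bin sizes" as requiring a retrotransposon to overlap a predicted region in each of the two CNVnator runs, which is an interpretation rather than something the text states outright; the tuple layouts and function names are likewise assumptions, and no CNVnator output is parsed.

```python
from typing import List, Tuple

Region = Tuple[str, int, int]        # (contig, start, end)
Element = Tuple[str, str, int, int]  # (name, contig, start, end)


def intervals_overlap(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """True if two closed intervals share at least one position."""
    return a_start <= b_end and b_start <= a_end


def overlaps_any(element: Element, regions: List[Region]) -> bool:
    """True if the element overlaps any region on the same contig."""
    _name, contig, start, end = element
    return any(r_contig == contig and intervals_overlap(start, end, r_start, r_end)
               for r_contig, r_start, r_end in regions)


def elements_supported_by_both(elements: List[Element],
                               calls_bin300: List[Region],
                               calls_bin1000: List[Region]) -> List[str]:
    """Names of elements overlapping a CNV call in both bin-size runs."""
    return [e[0] for e in elements
            if overlaps_any(e, calls_bin300) and overlaps_any(e, calls_bin1000)]


if __name__ == "__main__":
    elements = [("te1", "contig1", 1000, 6000), ("te2", "contig2", 500, 5500)]
    bin300 = [("contig1", 900, 2100), ("contig2", 400, 1600)]
    bin1000 = [("contig1", 1000, 7000)]
    print(elements_supported_by_both(elements, bin300, bin1000))
```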


Repeat analysis

Placing HAARs in structural context

We extracted protein sequences, secondary structure and solvent accessibility annotation from the DSSP database (Kabsch and Sander 1983) for each chain in the Protein DataBank. We then aligned PDB-XRAY disorder annotation from MobiDB (Di Domenico et al. 2012) to these using in-house scripts. We BLASTed E. tenella protein sequences against the PDB sequence database and identified those regions which matched a PDB sequence with E-value <= 1e-40 and sequence identity >= 40%. We then ran SEG (Wootton and Federhen 1996) to identify which subsequences contained HAARs (>= 5aa). Where a HAAR was present we aligned E. tenella and PDB sequences using MUSCLE (Edgar 2004) and realigned these to the structural annotation, preserving the alignment between the two sequences. The structural context was then assessed manually.
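To make the length threshold concrete, the sketch below finds runs of a single repeated amino acid of at least a given length. It is a simplified stand-in and not the SEG algorithm used in the analysis, which scores compositional complexity in windows; it also assumes that HAAR denotes a homopolymeric amino acid repeat, and the function name is hypothetical.

```python
import re
from typing import List, Tuple


def find_homopolymer_runs(protein: str, min_len: int = 5) -> List[Tuple[str, int, int]]:
    """Return (residue, start, end) for each single-residue run >= min_len.

    Coordinates are 0-based with an exclusive end, matching Python slicing.
    Only uppercase one-letter amino acid codes are considered.
    """
    if min_len < 2:
        raise ValueError("min_len must be at least 2")
    # One residue followed by at least (min_len - 1) copies of itself.
    pattern = re.compile(r"([A-Z])\1{%d,}" % (min_len - 1))
    return [(m.group(1), m.start(), m.end()) for m in pattern.finditer(protein)]


if __name__ == "__main__":
    seq = "MK" + "A" * 6 + "QL" + "Q" * 7 + "SS"
    print(find_homopolymer_runs(seq))             # runs of 5 or more
    print(find_homopolymer_runs(seq, min_len=7))  # runs of 7 or more
```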

For several E. tenella protein sequences we determined homology models of tertiary structure using the SWISS-MODEL Workspace (Arnold et al. 2006). E. tenella ribosomal protein L38 (ETH_00013290) was modeled using the Drosophila melanogaster template 3J39k (Protein DataBank), which had 44.71% sequence identity and E-value 1.60e-38, resulting in a model with QMEAN Z-score -1.23 for residues 1 to 85. The E. tenella ATP synthase beta chain (ETH_00026155) was modeled using the bovine template 1h8eE (67.54% identity, E-value 1.4e-175) resulting in a model with QMEAN Z-score -1.57 for residues 64 to 559. E. tenella GMP synthetase (ETH_00009385) was modeled on the P. falciparum template 3u0wA (39.54% identity, E-value <0.001). E. tenella ribosome-interacting GTPase 1 (ETH_00025450) was modeled using the yeast template 4A9A (60% sequence identity, E-value 4.25e-124) resulting in a model with QMEAN Z-score -0.812 for residues 2 to 370. To examine the structural context of HAARs we used the iterative magic fit function in Swiss-PdbViewer (Guex and Peitsch 1997) to generate superpositions of model and template, which we then displayed using Jmol (http://www.jmol.org/). Primary sequence alignments were generated using MUSCLE (Edgar 2004) and Jalview (Waterhouse et al. 2009).

Proteomic analysis of HAARs

We extracted peptide identifications from proteomic studies of E. tenella (Oakes et al. 2013). These studies were directed at rhoptries, which we now know have a reduced prevalence of HAARs, so we were not able to determine whether the proteome has more or fewer HAARs than expected from the genome. Furthermore, peptides containing HAARs are less likely to be unambiguously identified. However, from 5161 peptides, we identified 147 HAARs (>= 7mer) using SEG. Seventy-five were alanine, 64 glutamine and 8 serine repeats.

Indel analysis of repeats

Illumina reads from the Nippon strain were mapped using SMALT as above. The GATK v2.0.35 unified genotyper (DePristo et al. 2011) was run to call indels, with indel realignment and the following options: -pnrm POOL -ploidy 1 -glm POOLBOTH.

Functional enrichment in groups of genes

We used the GO::TermFinder Perl module to calculate the significance of Gene Ontology term overrepresentation in genes with and without repeats (Boyle et al. 2004). The hypergeometric test was used with a Bonferroni correction for multiple hypothesis testing. A P value cutoff of 0.001 was applied.
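The statistical test named above can be written out in a few lines. The sketch below computes the upper-tail hypergeometric probability for each term and applies a Bonferroni correction by multiplying by the number of terms tested; it is not the GO::TermFinder implementation, and the function names, the input layout and the choice to apply the 0.001 cutoff to the corrected P value are assumptions.

```python
from math import comb
from typing import Dict, Tuple


def hypergeom_upper_tail(k: int, n: int, K: int, N: int) -> float:
    """P(X >= k) when drawing n genes from N, of which K carry the term."""
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


def enriched_terms(term_counts: Dict[str, Tuple[int, int]],
                   n: int, N: int, alpha: float = 0.001) -> Dict[str, float]:
    """Return terms whose Bonferroni-corrected P value is <= alpha.

    term_counts maps a term to (k, K): genes with the term in the study
    set and in the background. n and N are the study and background sizes.
    """
    n_tests = len(term_counts)
    significant = {}
    for term, (k, K) in term_counts.items():
        p_corrected = min(1.0, hypergeom_upper_tail(k, n, K, N) * n_tests)
        if p_corrected <= alpha:
            significant[term] = p_corrected
    return significant


if __name__ == "__main__":
    # Toy numbers: 10 study genes out of a background of 100.
    counts = {"GO:term_a": (10, 10), "GO:term_b": (1, 50)}
    print(enriched_terms(counts, n=10, N=100))
```

Exact integer binomial coefficients keep the calculation free of rounding error until the final division, which is convenient for the small tail probabilities involved.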


References

Abyzov A, Urban AE, Snyder M, Gerstein M. 2011. CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Genome Res 21(6): 974-984.

Allen JE, Salzberg SL. 2005. JIGSAW: integration of multiple sources of evidence for gene prediction. Bioinformatics 21(18): 3596-3603.

Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25(17): 3389-3402.

Arnold K, Bordoli L, Kopp J, Schwede T. 2006. The SWISS-MODEL workspace: a web-based environment for protein structure homology modelling. Bioinformatics 22(2): 195-201.

Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT et al. 2000. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25(1): 25-29.

Aurrecoechea C, Brestelli J, Brunk BP, Dommer J, Fischer S, Gajria B, Gao X, Gingle A, Grant G, Harb OS et al. 2009. PlasmoDB: a functional genomic database for malaria parasites. Nucleic Acids Res 37(Database issue): D539-543.

Barthelmes J, Ebeling C, Chang A, Schomburg I, Schomburg D. 2007. BRENDA, AMENDA and FRENDA: the enzyme information system in 2007. Nucleic Acids Res 35(Database issue): D511-514.

Batzoglou S, Jaffe DB, Stanley K, Butler J, Gnerre S, Mauceli E, Berger B, Mesirov JP, Lander ES. 2002. ARACHNE: a whole-genome shotgun assembler. Genome Res 12(1): 177-189.

Blake DP, Alias H, Billington KJ, Clark EL, Mat-Isa MN, Mohamad AF, Mohd-Amin MR, Tay YL, Smith AL, Tomley FM et al. 2012. EmaxDB: Availability of a first draft genome sequence for the apicomplexan Eimeria maxima. Mol Biochem Parasitol 184(1): 48-51.

Boetzer M, Henkel CV, Jansen HJ, Butler D, Pirovano W. 2011. Scaffolding pre-assembled contigs using SSPACE. Bioinformatics 27(4): 578-579.

Boyle EI, Weng S, Gollub J, Jin H, Botstein D, Cherry JM, Sherlock G. 2004. GO::TermFinder--open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes. Bioinformatics 20(18): 3710-3715.

Carver T, Berriman M, Tivey A, Patel C, Bohme U, Barrell BG, Parkhill J, Rajandream MA. 2008. Artemis and ACT: viewing, annotating and comparing sequences stored in a relational database. Bioinformatics 24(23): 2672-2676.

Castresana J. 2000. Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol Biol Evol 17(4): 540-552.

Claudel-Renard C, Chevalet C, Faraut T, Kahn D. 2003. Enzyme-specific profiles for genome annotation: PRIAM. Nucleic Acids Res 31(22): 6633-6639.

DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, del Angel G, Rivas MA, Hanna M et al. 2011. A framework for variation discovery and genotyping using next-generation DNA sequencing data.

Di Domenico T, Walsh I, Martin AJ, Tosatto SC. 2012. MobiDB: a comprehensive database of intrinsic protein disorder annotations. Bioinformatics 28(15): 2080-2081.

Eddy SR. 2011. Accelerated Profile HMM Searches. PLoS Comput Biol 7(10): e1002195.

Edgar RC. 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32(5): 1792-1797.

Eisenhaber B, Bork P, Eisenhaber F. 1999. Prediction of potential GPI-modification sites in proprotein sequences. J Mol Biol 292(3): 741-758.

Ellinghaus D, Kurtz S, Willhoeft U. 2008. LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinformatics 9: 18.

Enright AJ, Van Dongen S, Ouzounis CA. 2002. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res 30(7): 1575-1584.

Guex N, Peitsch MC. 1997. SWISS-MODEL and the Swiss-PdbViewer: an environment for comparative protein modeling. Electrophoresis 18(15): 2714-2723.

Hung SS, Wasmuth J, Sanford C, Parkinson J. 2010. DETECT--a density estimation tool for enzyme classification and its application to Plasmodium falciparum. Bioinformatics 26(14): 1690-1698.

Hunter S, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D, Bork P, Das U, Daugherty L, Duquenne L et al. 2009. InterPro: the integrative protein signature database. Nucleic Acids Res 37(Database issue): D211-215.

Kabsch W, Sander C. 1983. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22(12): 2577-2637.

Korf I. 2004. Gene finding in novel genomes. BMC Bioinformatics 5: 59.

Kozarewa I, Ning Z, Quail MA, Sanders MJ, Berriman M, Turner DJ. 2009. Amplification-free Illumina sequencing-library preparation facilitates improved mapping and assembly of (G+C)-biased genomes. Nat Methods 6(4): 291-295.

Ling KH, Rajandream MA, Rivailler P, Ivens A, Yap SJ, Madeira AM, Mungall K, Billington K, Yee WY, Bankier AT et al. 2007. Sequencing and analysis of chromosome 1 of Eimeria tenella reveals a unique segmental organization. Genome Res 17(3): 311-319.

Majoros WH, Pertea M, Salzberg SL. 2004. TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders. Bioinformatics 20(16): 2878-2879.

Miranda-Saavedra D, Gabaldon T, Barton GJ, Langsley G, Doerig C. 2012. The kinomes of apicomplexan parasites. Microbes Infect 14(10): 796-810.

Morgulis A, Gertz EM, Schaffer AA, Agarwala R. 2006. A fast and symmetric DUST implementation to mask low-complexity DNA sequences. J Comput Biol 13(5): 1028-1040.

Oakes RD, Kurian D, Bromley E, Ward C, Lal K, Blake DP, Reid AJ, Pain A, Sinden RE, Wastling JM et al. 2013. The rhoptry proteome of Eimeria tenella


Otto TD, Sanders M, Berriman M, Newbold C. 2010. Iterative Correction of Reference Nucleotides (iCORN) using second generation sequencing technology. Bioinformatics 26(14): 1704-1707.

Parra G, Bradnam K, Korf I. 2007. CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics 23(9): 1061-1067.

Peixoto L, Chen F, Harb OS, Davis PH, Beiting DP, Brownback CS, Ouloguem D, Roos DS. 2010. Integrative genomic approaches highlight a family of parasite-specific kinases that regulate host responses. Cell Host Microbe 8(2): 208-218.

Petersen TN, Brunak S, von Heijne G, Nielsen H. 2011. SignalP 4.0: discriminating signal peptides from transmembrane regions. Nat Methods 8(10): 785-786.

Punta M, Coggill PC, Eberhardt RY, Mistry J, Tate J, Boursnell C, Pang N, Forslund K, Ceric G, Clements J et al. 2012. The Pfam protein families database. Nucleic Acids Res 40(D1): D290-D301.

Simpson JT, Durbin R. 2012. Efficient de novo assembly of large genomes using compressed data structures. Genome Res 22(3): 549-556.

Soding J, Biegert A, Lupas AN. 2005. The HHpred interactive server for protein homology detection and structure prediction. Nucleic Acids Res 33(Web Server issue): W244-248.

Stanke M, Schoffmann O, Morgenstern B, Waack S. 2006. Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources. BMC Bioinformatics 7: 62.

Talevich E, Kannan N. 2013. Structural and evolutionary adaptation of rhoptry kinases and pseudokinases, a family of coccidian virulence factors. BMC Evol Biol 13: 117.

Trapnell C, Pachter L, Salzberg SL. 2009. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 25(9): 1105-1111.

Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L. 2010. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol 28(5): 511-515.

Tsai IJ, Otto TD, Berriman M. 2010. Improving draft assemblies by iterative mapping and assembly of short reads to eliminate gaps. Genome Biol 11(4): R41.

UniProt C. 2013. Update on activities at the Universal Protein Resource (UniProt) in 2013. Nucleic Acids Res 41(Database issue): D43-47.

Waterhouse AM, Procter JB, Martin DM, Clamp M, Barton GJ. 2009. Jalview Version 2--a multiple sequence alignment editor and analysis workbench. Bioinformatics 25(9): 1189-1191.

Wootton JC, Federhen S. 1996. Analysis of compositionally biased regions in sequence databases. Methods Enzymol 266: 554-571.

Zerbino DR, Birney E. 2008. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res 18(5): 821-829.